statcpp CLI Design & Developer Guide
Concept
Bring statistics to the UNIX pipeline.
Just as awk handles text processing and jq handles JSON processing, statcpp handles statistical processing.
A single binary, zero dependencies, fast startup — a tool that fits naturally into data analysis workflows.
In a nutshell: "The simplicity of datamash" + "The functionality of R" + "The speed of C++"
Architecture
File Structure
statcppCLI/
├── CMakeLists.txt                      Build configuration
├── cmake/
│   ├── gflags.cmake                    gflags auto-download + build
│   └── nlohmann-json.cmake             nlohmann/json auto-download
├── src/
│   ├── main.cpp                        Entry point (gflags definitions, dispatch)
│   └── include/
│       ├── csv_reader.hpp              CSV/TSV reader (RFC 4180)
│       ├── cli_parser.hpp              Subcommand parsing & shortcuts
│       ├── output_formatter.hpp        Text/JSON/quiet output
│       └── commands/
│           ├── desc.hpp                Descriptive statistics (17 commands)
│           ├── test_cmd.hpp            Statistical tests (12 commands)
│           ├── corr.hpp                Correlation & covariance (5 commands)
│           ├── effect.hpp              Effect size (6 commands)
│           ├── ci.hpp                  Confidence intervals (5 commands)
│           ├── reg.hpp                 Regression analysis (5 commands)
│           ├── anova.hpp               Analysis of variance (5 commands)
│           ├── resample.hpp            Resampling (6 commands)
│           ├── ts.hpp                  Time series analysis (8 commands)
│           ├── robust.hpp              Robust statistics (7 commands)
│           ├── survival.hpp            Survival analysis (3 commands)
│           ├── cluster.hpp             Clustering (3 commands)
│           ├── multiple.hpp            Multiple testing correction (3 commands)
│           ├── power.hpp               Power analysis (3 commands)
│           ├── glm_cmd.hpp             Generalized linear models (2 commands)
│           └── model.hpp               Model selection (4 commands)
├── test/
│   ├── test_csv_reader.cpp             CSV reader unit tests
│   ├── test_output_formatter.cpp       Output formatter unit tests
│   ├── test_cli_parser.cpp             Argument parsing tests
│   └── e2e/
│       ├── run_e2e.sh                  E2E test runner
│       ├── data/                       Test CSV files
│       └── golden/                     Expected output files
└── download/                           Auto-generated (.gitignore target)
    ├── statcpp/
    │   ├── statcpp-main.tar.gz         statcpp archive cache
    │   └── statcpp-install/            statcpp headers (include/statcpp/)
    ├── gflags/
    │   ├── gflags/                     gflags source + build (_build/)
    │   └── gflags-install/             gflags install destination (lib/, include/)
    └── nlohmann-json/
        └── nlohmann-json-install/      json.hpp (include/nlohmann/)
Header-Only Design
All command files are implemented as .hpp (header-only). Each function is marked inline.
Rationale:
- The statcpp library itself is header-only — consistency of style
- No need to add source files to CMakeLists.txt
- main.cpp simply #includes each .hpp and everything is self-contained
Trade-offs:
- Compilation reduces to a single main.cpp file (not a problem currently)
- Build time may increase if the number of commands grows significantly
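As a minimal sketch of why each function is marked inline, the header-style function below can be included from any number of translation units without causing duplicate-symbol link errors. The file name and the function mean_of are hypothetical and are not part of statcpp or the CLI.

```cpp
// mean_of.hpp (hypothetical header, for illustration only)
#ifndef MEAN_OF_HPP
#define MEAN_OF_HPP

#include <numeric>
#include <vector>

// `inline` allows this definition to appear in every translation unit
// that includes the header without violating the one-definition rule.
inline double mean_of(const std::vector<double>& xs) {
    if (xs.empty()) return 0.0;  // assumption of this sketch: empty input yields 0
    return std::accumulate(xs.begin(), xs.end(), 0.0) /
           static_cast<double>(xs.size());
}

#endif  // MEAN_OF_HPP
```

Because the definition lives in the header, no additional source file has to be listed in CMakeLists.txt.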
Processing Flow
main.cpp
├── gflags::ParseCommandLineFlags()         Flag parsing
├── parse_command()                         Get category/command/file path
├── CsvReader::read_file/read_stdin()       Read CSV (skipped when needs_csv is false)
├── OutputFormatter(mode)                   Determine output mode
└── run_<category>()                        Dispatch by category
    ├── csv.get_clean_data()                Get column + remove missing values
    ├── statcpp::*()                        Statistical computation
    └── fmt.print() / fmt.flush()           Output results
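The final dispatch step can be pictured roughly as follows. This is a sketch only: the handler signatures, the error message, and the exit code 1 are assumptions, and the real run_<category>() functions also receive the CSV data and the output formatter.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-ins for two category handlers.
inline int run_desc(const std::string& /*command*/) { return 0; }
inline int run_test(const std::string& /*command*/) { return 0; }

// Routes a parsed category to its handler with an else-if chain.
// Returns the handler's exit code, or 1 for an unknown category.
inline int dispatch(const std::string& category, const std::string& command) {
    if (category == "desc") {
        return run_desc(command);
    } else if (category == "test") {
        return run_test(command);
    }
    std::cerr << "unknown category: " << category << '\n';
    return 1;
}
```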
Dependencies
| Library | Purpose | Installation Method |
|---|---|---|
| statcpp | Core statistical computation (header-only) | Auto-downloaded from GitHub via cmake/statcpp.cmake |
| gflags | Command-line argument parsing | Auto-downloaded + built via cmake/gflags.cmake |
| nlohmann/json | JSON output | Auto-downloaded via cmake/nlohmann-json.cmake |
| Google Test | Unit testing (optional) | Enabled with -DGTEST=true |
gflags Design
- Global flags (--col, --json, --alpha, etc.) are defined in main.cpp using DEFINE_* macros
- Each command file uses DECLARE_* to reference them
- Subcommands (category, command) are parsed manually in cli_parser.hpp, not by gflags
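Manual subcommand parsing might look like the sketch below, assuming the flags have already been stripped from the argument list so that only positional arguments remain. The ParsedCommand type and the parse_positional function are hypothetical; the actual parse_command() in cli_parser.hpp may differ, for example in how it handles shortcuts.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical result type for illustration.
struct ParsedCommand {
    std::string category;
    std::string command;
    std::optional<std::string> file;  // no value means "read from stdin"
};

// Interprets positional arguments (program name excluded) as
// <category> <command> [file]. Returns std::nullopt if fewer than
// two arguments are given.
inline std::optional<ParsedCommand> parse_positional(
        const std::vector<std::string>& args) {
    if (args.size() < 2) return std::nullopt;
    ParsedCommand cmd{args[0], args[1], std::nullopt};
    if (args.size() >= 3) cmd.file = args[2];
    return cmd;
}
```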
Commands That Don't Require CSV
The following commands operate without CSV input:
- ci sample-size: Required sample size calculation
- ci prop: Confidence interval for proportions (specified with --successes, --trials)
- effect cohens-h: Effect size for proportions (specified with --p1, --p2)
- power *: All power analysis commands
Controlled by the needs_csv flag in main.cpp.
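One way to express that decision is a small predicate over the category and command names. The sketch below covers exactly the commands listed above; writing it as a free function is an assumption, since the source only says that main.cpp holds a needs_csv flag.

```cpp
#include <string>

// Returns false for the commands that take their inputs from flags
// instead of a CSV file (hypothetical helper mirroring the list above).
inline bool needs_csv(const std::string& category, const std::string& command) {
    if (category == "power") return false;  // all power analysis commands
    if (category == "ci" && (command == "sample-size" || command == "prop")) {
        return false;
    }
    if (category == "effect" && command == "cohens-h") return false;
    return true;
}
```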
Data Ordering & Sorting Strategy
Sorting Requirements of the statcpp Library
statcpp contains a mix of functions that require pre-sorted data and functions that sort internally. The CLI layer shields users from having to be aware of this difference.
| Category | Example Functions | CLI Handling |
|---|---|---|
| Requires sorted input | median(), quartiles(), percentile(), iqr(), five_number_summary() | Auto-sorted by CLI |
| Sorts internally | mad(), shapiro_wilk_test(), kaplan_meier() | Passed as-is |
| Order is meaningful (no sorting) | acf(), moving_average(), diff(), t_test_paired(), all regression | Passed as-is |
Implementation Strategy
- Default: Auto-sort on the CLI side as needed, based on each command's requirements
- Optimization: --presorted skips the copy and sort for pre-sorted data
- summary command: Sorts once and reuses for median, quartiles, five_number_summary
- Multiple columns: Never sorted (would break the correspondence between columns)
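The sort-once-and-reuse idea, together with the presorted shortcut, can be sketched as follows. The names sorted_view and median_sorted are hypothetical and the code does not call into statcpp; it only illustrates how a single sorted buffer can serve several order statistics.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns a reference to sorted data. When `presorted` is true the input
// is returned directly (no copy, no sort); otherwise `scratch` receives a
// sorted copy and is returned.
inline const std::vector<double>& sorted_view(const std::vector<double>& data,
                                              bool presorted,
                                              std::vector<double>& scratch) {
    if (presorted) return data;
    scratch = data;
    std::sort(scratch.begin(), scratch.end());
    return scratch;
}

// Median of already-sorted data. Precondition: `s` is non-empty.
inline double median_sorted(const std::vector<double>& s) {
    const std::size_t n = s.size();
    return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
}
```

A summary-style command could call sorted_view once and then read the minimum, median, and maximum from the same buffer.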
Order of Missing Value Removal and Sorting
1. Read CSV
2. Remove missing values (--skip_na)
3. Sort (only when necessary)
4. Statistical computation
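Steps 2 and 3 can be sketched as below. Treating empty cells, the literal "NA", and anything that does not parse fully as a number as missing is an assumption of this sketch; the actual rules live in the CSV reader. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// Step 2: keep only cells that parse completely as a number.
inline std::vector<double> drop_missing(const std::vector<std::string>& cells) {
    std::vector<double> out;
    for (const auto& c : cells) {
        if (c.empty() || c == "NA") continue;  // assumed missing markers
        char* end = nullptr;
        const double v = std::strtod(c.c_str(), &end);
        if (end != c.c_str() && *end == '\0') out.push_back(v);
    }
    return out;
}

// Steps 2 and 3 in order: clean first, then sort the remaining values.
inline std::vector<double> clean_then_sort(const std::vector<std::string>& cells) {
    std::vector<double> values = drop_missing(cells);
    std::sort(values.begin(), values.end());
    return values;  // ready for step 4, the statistical computation
}
```

Removing missing values before sorting means the sort never has to define an ordering for missing cells.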
Output Design
Three Output Modes
| Mode | Flag | Use Case | Format |
|---|---|---|---|
| Text | (default) | Human-readable | Label: value |
| JSON | --json | Programmatic access | Structured JSON |
| Quiet | --quiet | Pipelines | Numeric values only |
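To make the difference between the three modes concrete, the sketch below formats a single labelled value in each mode. The enum, the function name, and the exact layouts are assumptions modelled on the table above, not the OutputFormatter implementation; in particular the JSON branch does no string escaping and does not produce the full command/input/result structure.

```cpp
#include <sstream>
#include <string>

enum class OutputMode { Text, Json, Quiet };

// Formats one labelled value according to the selected output mode.
inline std::string format_value(OutputMode mode, const std::string& label,
                                double value) {
    std::ostringstream os;
    switch (mode) {
        case OutputMode::Text:   // human-readable "Label: value"
            os << label << ": " << value;
            break;
        case OutputMode::Json:   // simplified single-key object
            os << "{\"" << label << "\": " << value << "}";
            break;
        case OutputMode::Quiet:  // numeric value only, for pipelines
            os << value;
            break;
    }
    return os.str();
}
```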
JSON Output Structure
{
  "command": "desc.summary",
  "input": {
    "column": "value",
    "n": 5
  },
  "result": {
    "Count": 5.0,
    "Mean": 30.0,
    "Std Dev": 15.811388300841896,
    "Min": 10.0,
    "Median": 30.0,
    "Max": 50.0
  }
}
Testing Strategy
Terminology
| Term | Full Name | Meaning |
|---|---|---|
| E2E test | End-to-End test | A test that verifies the entire system from input to output in one pass. While unit tests verify correctness at the function level, E2E tests execute the actual binary to confirm that "the same operations a user would perform produce the expected output" |
| Golden file | Golden file | An expected output file saved in advance as the "correct answer". During testing, the actual output is compared using diff, and any mismatch results in a FAIL. The name derives from "gold standard" |
Four-Layer Structure
Layer 1: statcpp library tests (existing, no changes needed)
├── Google Test 758 tests (function-level correctness)
└── R verification 167 checks (numerical precision guarantee)
Layer 2: CLI-specific unit tests (28 tests)
├── test_csv_reader.cpp CSV/TSV parser tests
├── test_output_formatter.cpp Output formatting tests
└── test_cli_parser.cpp Argument parsing tests
Layer 3: E2E tests (52 tests)
├── Golden file tests diff comparison against expected output
└── Error case tests Verification of error handling behavior
Layer 4: Reference verification (126 tests)
├── docs/run_reference.sh Runs all examples from test-reference.md
└── docs/output.txt Execution results (PASS: 126, SKIP: 0)
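The golden-file check in Layer 3 is a diff of actual output against a saved expectation. The E2E runner itself is a shell script that uses diff; the C++ sketch below only illustrates the comparison logic, and the function name first_mismatch is hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Compares two texts line by line. Returns 0 if they match, otherwise
// the 1-based number of the first differing line. Unlike diff, a missing
// trailing newline on the last line is not reported as a difference.
inline std::size_t first_mismatch(const std::string& actual,
                                  const std::string& golden) {
    std::istringstream a(actual), g(golden);
    std::string line_a, line_g;
    for (std::size_t line = 1;; ++line) {
        const bool has_a = static_cast<bool>(std::getline(a, line_a));
        const bool has_g = static_cast<bool>(std::getline(g, line_g));
        if (!has_a && !has_g) return 0;                      // PASS
        if (has_a != has_g || line_a != line_g) return line;  // FAIL
    }
}
```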
Build, Install & Run Tests
# Build the CLI binary
cmake -B build && cmake --build build
# Install (default: /usr/local/bin/statcpp)
sudo cmake --install build
# Install to a custom directory
cmake --install build --prefix ~/.local # → ~/.local/bin/statcpp
# Unit tests (build with GTest enabled)
cmake -B build -DGTEST=true && cmake --build build && ctest --test-dir build --verbose
# Switch back to CLI binary (disable GTest)
cmake -B build -DGTEST=false && cmake --build build
# E2E tests
cd test/e2e && bash run_e2e.sh
# Verify all reference examples
bash docs/run_reference.sh
Note: When built with -DGTEST=true, the binary becomes a test runner.
To use it as a CLI tool, rebuild with -DGTEST=false.
Updating Golden Files
When you change the output format:
- Rebuild the CLI binary
- Run the changed command and save the output to the golden file
- Review the changes with diff
- Run the full E2E test suite to verify
# Example: update the golden file for desc summary
cd test/e2e
../../build/statcpp desc summary data/basic.csv --col value > golden/desc_summary.txt
bash run_e2e.sh
Adding a New Command
1. Create/Edit the Command File
Add a cmd.command == "new-cmd" branch to src/include/commands/<category>.hpp.
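} else if (cmd.command == "new-cmd") {
    // Get the column with missing values removed
    auto data = csv.get_clean_data(cols[0], FLAGS_fail_na);
    // Run the statistical computation on the cleaned data
    double result = statcpp::new_function(data.begin(), data.end());
    // Record input metadata, then emit the result
    fmt.set_input_info({{"column", cols[0]}, {"n", data.size()}});
    fmt.print("Result", result);
}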
2. For a New Category
- Create commands/new_category.hpp
- Add #include and DECLARE_* to main.cpp
- Add an else if branch to the dispatch in main.cpp
- Add to the categories vector in cli_parser.hpp
- If the command doesn't require CSV, update the needs_csv condition
- If new gflags are needed, add DEFINE_* to main.cpp
3. Add Tests
- E2E test: Add a test case to run_e2e.sh
- Golden file: Save expected output in test/e2e/golden/
- Test data: Add CSV files to test/e2e/data/ if needed
Pipeline Usage Examples
# Summary statistics for a specific column in a CSV
statcpp desc summary data.csv --col price
# Normality test → choose the appropriate test
statcpp test shapiro data.csv --col score
# Pipe JSON output for processing
statcpp test t data.csv --col a,b --json | jq '.result["p-value"]'
# Pipe numeric-only output
statcpp desc mean data.csv --col price --quiet | xargs echo "Mean:"
# Read from stdin
cat data.csv | statcpp desc mean --col value
awk '{print $3}' access.log | statcpp desc summary --noheader --col 1
# --row: process inline data directly (comma or space-delimited)
echo "1,2,3,4,5" | statcpp desc mean --noheader --col 1 --row
echo "1 2 3 4 5" | statcpp desc mean --noheader --col 1 --row
# Batch analysis of multiple files
for f in experiment_*.csv; do
  echo "=== $f ==="
  statcpp desc summary "$f" --col result --quiet
done
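The --row examples above accept either commas or spaces as delimiters. A minimal sketch of such a splitter is shown below; parse_row is a hypothetical name, and the way the CLI actually tokenizes inline rows may differ (for example in how it treats non-numeric tokens, which this sketch simply stops at).

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits one line of inline data on commas and/or whitespace and
// converts each token to a double.
inline std::vector<double> parse_row(std::string line) {
    for (char& ch : line) {
        if (ch == ',') ch = ' ';  // unify both delimiters to whitespace
    }
    std::istringstream in(line);
    std::vector<double> values;
    for (double v; in >> v;) values.push_back(v);
    return values;
}
```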
Unimplemented Design Ideas
The following features are described in doc/CLI.md (initial design document) but are not yet implemented:
- --csv output mode
- --seed random seed (for resampling)
- --verbose detailed output (e.g., sort time display)
- test prop / test prop2 (proportion tests)
- test chisq-indep (chi-squared test of independence)
- test fisher (Fisher's exact test)
- anova twoway (two-way ANOVA)
- anova ancova (ANCOVA)
- cluster --k option (currently fixed to default k=3)
- ts --lag, --window, --alpha options (currently fixed to defaults)
- Shell completion (bash / zsh / fish)
- Man page generation
- Homebrew formula / apt package