statcpp CLI Design & Developer Guide

Concept

Bring statistics to the UNIX pipeline.

Just as awk handles text processing and jq handles JSON processing, statcpp handles statistical processing. A single binary, zero dependencies, fast startup — a tool that fits naturally into data analysis workflows.

In a nutshell: "The simplicity of datamash" + "The functionality of R" + "The speed of C++"

Architecture

File Structure

statcppCLI/
├── CMakeLists.txt                     Build configuration
├── cmake/
│   ├── statcpp.cmake                  statcpp auto-download
│   ├── gflags.cmake                   gflags auto-download + build
│   └── nlohmann-json.cmake            nlohmann/json auto-download
├── src/
│   ├── main.cpp                       Entry point (gflags definitions, dispatch)
│   └── include/
│       ├── csv_reader.hpp             CSV/TSV reader (RFC 4180)
│       ├── cli_parser.hpp             Subcommand parsing & shortcuts
│       ├── output_formatter.hpp       Text/JSON/quiet output
│       └── commands/
│           ├── desc.hpp               Descriptive statistics (17 commands)
│           ├── test_cmd.hpp           Statistical tests (12 commands)
│           ├── corr.hpp               Correlation & covariance (5 commands)
│           ├── effect.hpp             Effect size (6 commands)
│           ├── ci.hpp                 Confidence intervals (5 commands)
│           ├── reg.hpp                Regression analysis (5 commands)
│           ├── anova.hpp              Analysis of variance (5 commands)
│           ├── resample.hpp           Resampling (6 commands)
│           ├── ts.hpp                 Time series analysis (8 commands)
│           ├── robust.hpp             Robust statistics (7 commands)
│           ├── survival.hpp           Survival analysis (3 commands)
│           ├── cluster.hpp            Clustering (3 commands)
│           ├── multiple.hpp           Multiple testing correction (3 commands)
│           ├── power.hpp              Power analysis (3 commands)
│           ├── glm_cmd.hpp            Generalized linear models (2 commands)
│           └── model.hpp              Model selection (4 commands)
├── test/
│   ├── test_csv_reader.cpp            CSV reader unit tests
│   ├── test_output_formatter.cpp      Output formatter unit tests
│   ├── test_cli_parser.cpp            Argument parsing tests
│   └── e2e/
│       ├── run_e2e.sh                 E2E test runner
│       ├── data/                      Test CSV files
│       └── golden/                    Expected output files
└── download/                          Auto-generated (.gitignore target)
    ├── statcpp/
    │   ├── statcpp-main.tar.gz        statcpp archive cache
    │   └── statcpp-install/           statcpp headers (include/statcpp/)
    ├── gflags/
    │   ├── gflags/                    gflags source + build (_build/)
    │   └── gflags-install/            gflags install destination (lib/, include/)
    └── nlohmann-json/
        └── nlohmann-json-install/     json.hpp (include/nlohmann/)

Header-Only Design

All command files are implemented as .hpp (header-only). Each function is marked inline.

Rationale:

The statcpp library itself is header-only, so the CLI follows the same style
No need to add source files to CMakeLists.txt
main.cpp simply #includes each .hpp and everything is self-contained

Trade-offs:

The whole CLI compiles as a single translation unit (main.cpp), so any change to a command header recompiles everything (not a problem currently)
Build time may increase if the number of commands grows significantly
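
As a minimal sketch of what "header-only, marked inline" means in practice, the header below defines a function that can be included from any number of source files without causing duplicate-symbol link errors. The namespace demo, the include guard name, and the mean function are hypothetical and are not taken from the statcpp sources.

```cpp
// demo_stats.hpp (hypothetical header, for illustration only)
#ifndef DEMO_STATS_HPP
#define DEMO_STATS_HPP

#include <numeric>
#include <stdexcept>
#include <vector>

namespace demo {

// `inline` lets this definition appear in every translation unit that
// includes the header without violating the one-definition rule.
inline double mean(const std::vector<double>& xs) {
    if (xs.empty()) throw std::invalid_argument("mean: empty input");
    return std::accumulate(xs.begin(), xs.end(), 0.0) /
           static_cast<double>(xs.size());
}

}  // namespace demo

#endif  // DEMO_STATS_HPP
```

Without inline, including such a header from two .cpp files would produce two definitions of the same function at link time.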

Processing Flow

main.cpp
  ├── gflags::ParseCommandLineFlags()    Flag parsing
  ├── parse_command()                     Get category/command/file path
  ├── CsvReader::read_file/read_stdin()   Read CSV (skipped when needs_csv is false)
  ├── OutputFormatter(mode)               Determine output mode
  └── run_<category>()                    Dispatch by category
        ├── csv.get_clean_data()          Get column + remove missing values
        ├── statcpp::*()                  Statistical computation
        └── fmt.print() / fmt.flush()    Output results
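
The dispatch step at the end of this flow can be sketched roughly as follows. This is a simplified illustration, not the actual main.cpp: the Command struct, the "sum" command, and the run_to_string helper are hypothetical, and the real code goes through CsvReader, OutputFormatter, and statcpp functions.

```cpp
#include <numeric>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the parsed category/command pair.
struct Command {
    std::string category;
    std::string command;
};

// One handler per category; each handler branches on the command name.
inline int run_desc(const Command& cmd, const std::vector<double>& data,
                    std::ostream& out) {
    if (cmd.command == "sum") {
        out << "Sum: " << std::accumulate(data.begin(), data.end(), 0.0) << '\n';
        return 0;
    }
    return 1;  // unknown command
}

// Top-level dispatch by category, mirroring the run_<category>() step.
inline int dispatch(const Command& cmd, const std::vector<double>& data,
                    std::ostream& out) {
    if (cmd.category == "desc") return run_desc(cmd, data, out);
    return 1;  // unknown category
}

// Convenience wrapper that captures the output as a string.
inline std::string run_to_string(const Command& cmd,
                                 const std::vector<double>& data) {
    std::ostringstream os;
    dispatch(cmd, data, os);
    return os.str();
}
```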

Dependencies

Library	Purpose	Installation Method
statcpp	Core statistical computation (header-only)	Auto-downloaded from GitHub via `cmake/statcpp.cmake`
gflags	Command-line argument parsing	Auto-downloaded + built via `cmake/gflags.cmake`
nlohmann/json	JSON output	Auto-downloaded via `cmake/nlohmann-json.cmake`
Google Test	Unit testing (optional)	Enabled with `-DGTEST=true`

gflags Design

Global flags (--col, --json, --alpha, etc.) are defined in main.cpp using DEFINE_* macros
Each command file uses DECLARE_* to reference them
Subcommands (category, command) are parsed manually in cli_parser.hpp, not by gflags
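
Manual subcommand parsing of this kind can be sketched as below, assuming the positional arguments left over after flag parsing arrive as category, command, and an optional file path. The ParsedCommand struct and parse_positional function are hypothetical names. The real parse_command() in cli_parser.hpp also handles shortcuts, which this sketch omits.

```cpp
#include <string>
#include <vector>

// Hypothetical result type; an empty `file` means "read from stdin".
struct ParsedCommand {
    bool valid;
    std::string category;
    std::string command;
    std::string file;
};

// `args` holds the positional arguments that remain after flag parsing,
// excluding the program name.
inline ParsedCommand parse_positional(const std::vector<std::string>& args) {
    ParsedCommand p{false, "", "", ""};
    if (args.size() < 2 || args.size() > 3) return p;  // malformed invocation
    p.valid = true;
    p.category = args[0];
    p.command = args[1];
    if (args.size() == 3) p.file = args[2];
    return p;
}
```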

Commands That Don't Require CSV

The following commands operate without CSV input:

ci sample-size — Required sample size calculation
ci prop — Confidence interval for proportions (specified with --successes, --trials)
effect cohens-h — Effect size for proportions (specified with --p1, --p2)
power * — All power analysis commands

This behavior is controlled by the needs_csv flag in main.cpp.
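
One way to express that condition is a small predicate over the category and command names listed above. This is a sketch only: the function form and its signature are assumptions, and the actual check in main.cpp may be written differently.

```cpp
#include <string>

// Returns false for the commands listed above that take all their inputs
// from flags and therefore skip CSV reading entirely.
inline bool needs_csv(const std::string& category, const std::string& command) {
    if (category == "power") return false;  // all power analysis commands
    if (category == "ci" && (command == "sample-size" || command == "prop"))
        return false;
    if (category == "effect" && command == "cohens-h") return false;
    return true;
}
```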

Data Ordering & Sorting Strategy

Sorting Requirements of the statcpp Library

statcpp contains a mix of functions that require pre-sorted data and functions that sort internally. The CLI layer shields users from having to be aware of this difference.

Category	Example Functions	CLI Handling
Requires sorted input	`median()`, `quartiles()`, `percentile()`, `iqr()`, `five_number_summary()`	Auto-sorted by CLI
Sorts internally	`mad()`, `shapiro_wilk_test()`, `kaplan_meier()`	Passed as-is
Order is meaningful (no sorting)	`acf()`, `moving_average()`, `diff()`, `t_test_paired()`, all regression	Passed as-is

Implementation Strategy

Default: Auto-sort on the CLI side as needed, based on each command's requirements
Optimization: --presorted skips the copy and sort for pre-sorted data
summary command: Sorts once and reuses for median, quartiles, five_number_summary
Multiple columns: Never sorted (would break the correspondence between columns)
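
The default and --presorted paths can be illustrated with the following sketch. The helper names sorted_or_same and median_sorted are hypothetical, and the real CLI calls statcpp functions rather than a hand-written median.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns a reference to sorted data. With presorted == true the input is
// returned as-is (no copy, no sort); otherwise a sorted copy is placed in
// `scratch` so the caller's data keeps its original order.
inline const std::vector<double>& sorted_or_same(const std::vector<double>& data,
                                                 bool presorted,
                                                 std::vector<double>& scratch) {
    if (presorted) return data;
    scratch = data;
    std::sort(scratch.begin(), scratch.end());
    return scratch;
}

// Median of already-sorted, non-empty data.
inline double median_sorted(const std::vector<double>& s) {
    const std::size_t n = s.size();
    return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
}
```

A summary-style command would call sorted_or_same once and pass the same sorted reference to every order-dependent statistic.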

Order of Missing Value Removal and Sorting

1. Read CSV
2. Remove missing values (--skip_na)
3. Sort (only when necessary)
4. Statistical computation
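
One reason step 2 has to precede step 3 can be seen if missing values are represented as NaN, which is an assumption of this sketch and not something the guide states. NaN compares false against every value, so operator< is no longer a strict weak ordering and std::sort has undefined behavior on such input. The function name clean_then_sort is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Steps 2 and 3 of the list above, in that order.
inline std::vector<double> clean_then_sort(std::vector<double> data) {
    // Step 2: drop missing values (assumed here to be encoded as NaN).
    data.erase(std::remove_if(data.begin(), data.end(),
                              [](double x) { return std::isnan(x); }),
               data.end());
    // Step 3: sorting is now well-defined because no NaN remains.
    std::sort(data.begin(), data.end());
    return data;
}
```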

Output Design

Three Output Modes

Mode	Flag	Use Case	Format
Text	(default)	Human-readable	`Label: value`
JSON	`--json`	Programmatic access	Structured JSON
Quiet	`--quiet`	Pipelines	Numeric values only

JSON Output Structure

{
    "command": "desc.summary",
    "input": {
        "column": "value",
        "n": 5
    },
    "result": {
        "Count": 5.0,
        "Mean": 30.0,
        "Std Dev": 15.811388300841896,
        "Min": 10.0,
        "Median": 30.0,
        "Max": 50.0
    }
}

Testing Strategy

Terminology

Term	Full Name	Meaning
E2E test	End-to-End test	A test that verifies the entire system from input to output in one pass. While unit tests verify correctness at the function level, E2E tests execute the actual binary to confirm that "the same operations a user would perform produce the expected output"
Golden file	Golden file	An expected output file saved in advance as the "correct answer". During testing, the actual output is compared using `diff`, and any mismatch results in a FAIL. The name derives from "gold standard"
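
The golden-file comparison itself is done with diff in run_e2e.sh. Purely to illustrate the idea, the sketch below compares actual output against golden text line by line and reports where they first diverge. The function name first_mismatch_line is hypothetical, and unlike diff it treats a missing trailing newline as equal.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Returns 0 when both texts match line by line, otherwise the 1-based
// number of the first line that differs or exists in only one text.
inline std::size_t first_mismatch_line(const std::string& actual,
                                       const std::string& golden) {
    std::istringstream a(actual), g(golden);
    std::string line_a, line_g;
    std::size_t line = 0;
    while (true) {
        const bool has_a = static_cast<bool>(std::getline(a, line_a));
        const bool has_g = static_cast<bool>(std::getline(g, line_g));
        ++line;
        if (!has_a && !has_g) return 0;                    // both exhausted: PASS
        if (has_a != has_g || line_a != line_g) return line;  // FAIL at this line
    }
}
```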

Four-Layer Structure

Layer 1: statcpp library tests (existing, no changes needed)
  ├── Google Test 758 tests (function-level correctness)
  └── R verification 167 checks (numerical precision guarantee)

Layer 2: CLI-specific unit tests (28 tests)
  ├── test_csv_reader.cpp      CSV/TSV parser tests
  ├── test_output_formatter.cpp Output formatting tests
  └── test_cli_parser.cpp      Argument parsing tests

Layer 3: E2E tests (52 tests)
  ├── Golden file tests         diff comparison against expected output
  └── Error case tests          Verification of error handling behavior

Layer 4: Reference verification (126 tests)
  ├── docs/run_reference.sh    Runs all examples from test-reference.md
  └── docs/output.txt          Execution results (PASS: 126, SKIP: 0)

Build, Install & Run Tests

# Build the CLI binary
cmake -B build && cmake --build build

# Install (default: /usr/local/bin/statcpp)
sudo cmake --install build

# Install to a custom directory
cmake --install build --prefix ~/.local    # → ~/.local/bin/statcpp

# Unit tests (build with GTest enabled)
cmake -B build -DGTEST=true && cmake --build build && ctest --test-dir build --verbose

# Switch back to CLI binary (disable GTest)
cmake -B build -DGTEST=false && cmake --build build

# E2E tests
cd test/e2e && bash run_e2e.sh

# Verify all reference examples
bash docs/run_reference.sh

Note: When built with -DGTEST=true, the binary becomes a test runner. To use it as a CLI tool, rebuild with -DGTEST=false.

Updating Golden Files

When you change the output format:

Rebuild the CLI binary
Run the changed command and save the output to the golden file
Review the changes with diff
Run the full E2E test suite to verify

# Example: update the golden file for desc summary
cd test/e2e
../../build/statcpp desc summary data/basic.csv --col value > golden/desc_summary.txt
bash run_e2e.sh

Adding a New Command

1. Create/Edit the Command File

Add a cmd.command == "new-cmd" branch to src/include/commands/<category>.hpp.

} else if (cmd.command == "new-cmd") {
    auto data = csv.get_clean_data(cols[0], FLAGS_fail_na);
    double result = statcpp::new_function(data.begin(), data.end());
    fmt.set_input_info({{"column", cols[0]}, {"n", data.size()}});
    fmt.print("Result", result);
}

2. For a New Category

Create commands/new_category.hpp
Add #include and DECLARE_* to main.cpp
Add an else if branch to the dispatch in main.cpp
Add to the categories vector in cli_parser.hpp
If the command doesn't require CSV, update the needs_csv condition
If new gflags are needed, add DEFINE_* to main.cpp

3. Add Tests

E2E test: Add a test case to run_e2e.sh
Golden file: Save expected output in test/e2e/golden/
Test data: Add CSV files to test/e2e/data/ if needed

Pipeline Usage Examples

# Summary statistics for a specific column in a CSV
statcpp desc summary data.csv --col price

# Normality test (use the result to choose an appropriate follow-up test)
statcpp test shapiro data.csv --col score

# Pipe JSON output for processing
statcpp test t data.csv --col a,b --json | jq '.result["p-value"]'

# Pipe numeric-only output
statcpp desc mean data.csv --col price --quiet | xargs echo "Mean:"

# Read from stdin
cat data.csv | statcpp desc mean --col value
awk '{print $3}' access.log | statcpp desc summary --noheader --col 1

# --row: process inline data directly (comma or space-delimited)
echo "1,2,3,4,5" | statcpp desc mean --noheader --col 1 --row
echo "1 2 3 4 5" | statcpp desc mean --noheader --col 1 --row

# Batch analysis of multiple files
for f in experiment_*.csv; do
  echo "=== $f ==="
  statcpp desc summary "$f" --col result --quiet
done

Unimplemented Design Ideas

The following features are described in doc/CLI.md (initial design document) but are not yet implemented:

--csv output mode
--seed random seed (for resampling)
--verbose detailed output (e.g., sort time display)
test prop / test prop2 (proportion tests)
test chisq-indep (chi-squared test of independence)
test fisher (Fisher's exact test)
anova twoway (two-way ANOVA)
anova ancova (ANCOVA)
cluster --k option (currently fixed to default k=3)
ts --lag, --window, --alpha options (currently fixed to defaults)
Shell completion (bash / zsh / fish)
Man page generation
Homebrew formula / apt package