SQLite3 Loadable Extension Guide

An implementation guide for SQLite3 runtime extensions (Loadable Extensions). This project implements the extension library in C++ (ext_funcs.cpp).

1. What is a Loadable Extension

Overview

An SQLite3 Loadable Extension is a mechanism for dynamically adding custom functionality at runtime to SQLite. By loading a dynamic library (.dylib / .so / .dll) written in C++ via sqlite3_load_extension(), you can extend SQLite's functionality without recompiling the SQLite core.

Capabilities

Feature	Description	Example
Scalar functions	Functions that return a single value for given input values	`square(x)`, `upper(s)`
Aggregate functions	Functions that aggregate multiple rows and return a single value	`my_sum(x)`, `sha256_agg(s)`
Window functions	Aggregate functions used with OVER clauses (SQLite 3.25+)	Custom moving average
Collations	Define string comparison rules	Japanese sort order
Virtual Tables	Expose external data sources as tables	CSV, JSON, REST API
VFS (Virtual File System)	Customize the file I/O layer	Encryption, memory FS
Overriding existing functions	Replace built-in functions with custom implementations	Multilingual `upper()`
Batch registration of multiple functions	Register multiple functions from a single extension	This project's implementation

Limitations

Constraint	Description
Cannot modify SQL syntax	New SQL keywords or syntax cannot be added
Cannot modify storage engine	SQLite's internal B-Tree structure and page format cannot be changed
Cannot auto-create triggers or views	There is no mechanism for implicitly executing DDL (explicit `sqlite3_exec` is possible)
No multi-threading guarantees	Depends on SQLite's threading mode. Extensions should not create their own threads
No persistent auto-loading	Extension information is not stored in the DB file. Loading is required for each connection
No sandboxed execution	Extensions run with the same privileges as the host process. Arbitrary system calls can be made
Cannot write directly in other languages	The C/C++ function pointer interface is required. Other languages must go through FFI

Security Notes

Extensions execute in the same address space as the host process, so bugs or malicious code can affect the entire process
Extension loading is disabled by default unless sqlite3_enable_load_extension(db, 1) is called
The load_extension() SQL function has even stricter restrictions and requires separate enablement via sqlite3_db_config(db, SQLITE_DBCONFIG_ENABLE_LOAD_EXTENSION, 1, NULL)
Untrusted extensions must never be loaded

2. Project Structure

sqlite3StatisticalLibrary/
├── CMakeLists.txt          # Build configuration
├── README.md
├── cmake/
│   ├── sqlite3.cmake       # SQLite3 download configuration
│   └── statcpp.cmake       # statcpp download configuration
├── doc/                    # Documentation
├── download/               # Downloaded dependency libraries
│   ├── sqlite3/            # SQLite3 source
│   └── statcpp/            # statcpp header-only library
└── src/
    ├── main.cpp             # C++ program that loads and tests the extension
    ├── ext_funcs.cpp        # Extension library (C++) — statistical function implementations
    └── include/             # (currently empty)

External Libraries

statcpp

Overview: A C++ header-only statistical computation library
Location: download/statcpp/statcpp-install/include/statcpp/
Usage: Automatically downloaded and placed by CMake's statcpp.cmake

In this project, ext_funcs.cpp includes it with #include <statcpp/statcpp.hpp>, delegating the statistical computation logic to statcpp.

Provided SQL Functions

Basic Aggregate Functions (24 functions)

Function	Arguments	Return	Description
`stat_mean(col)`	numeric x1	numeric	Arithmetic mean
`stat_median(col)`	numeric x1	numeric	Median
`stat_mode(col)`	numeric x1	numeric	Mode (smallest value if multiple exist)
`stat_geometric_mean(col)`	numeric x1	numeric	Geometric mean
`stat_harmonic_mean(col)`	numeric x1	numeric	Harmonic mean
`stat_range(col)`	numeric x1	numeric	Range (max - min)
`stat_var(col)`	numeric x1	numeric	Variance (population variance, ddof=0)
`stat_population_variance(col)`	numeric x1	numeric	Population variance
`stat_sample_variance(col)`	numeric x1	numeric	Sample variance (unbiased variance)
`stat_stdev(col)`	numeric x1	numeric	Standard deviation (population, ddof=0)
`stat_population_stddev(col)`	numeric x1	numeric	Population standard deviation
`stat_sample_stddev(col)`	numeric x1	numeric	Sample standard deviation
`stat_cv(col)`	numeric x1	numeric	Coefficient of variation
`stat_iqr(col)`	numeric x1	numeric	Interquartile range
`stat_mad_mean(col)`	numeric x1	numeric	Mean absolute deviation
`stat_geometric_stddev(col)`	numeric x1	numeric	Geometric standard deviation
`stat_population_skewness(col)`	numeric x1	numeric	Population skewness
`stat_skewness(col)`	numeric x1	numeric	Sample skewness
`stat_population_kurtosis(col)`	numeric x1	numeric	Population kurtosis (excess kurtosis)
`stat_kurtosis(col)`	numeric x1	numeric	Sample kurtosis (excess kurtosis)
`stat_se(col)`	numeric x1	numeric	Standard error
`stat_mad(col)`	numeric x1	numeric	Median absolute deviation (MAD)
`stat_mad_scaled(col)`	numeric x1	numeric	Scaled MAD
`stat_hodges_lehmann(col)`	numeric x1	numeric	Hodges-Lehmann estimator

Aggregate Functions with Parameters (20 functions)

Function	Arguments	Return	Description
`stat_trimmed_mean(col, proportion)`	numeric, numeric	numeric	Trimmed mean
`stat_quartile(col)`	numeric x1	JSON	Quartiles (Q1, Q2, Q3)
`stat_percentile(col, p)`	numeric, numeric	numeric	Percentile
`stat_z_test(col, mu0, sigma)`	numeric, numeric, numeric	JSON	One-sample z-test
`stat_t_test(col, mu0)`	numeric, numeric	JSON	One-sample t-test
`stat_chisq_gof_uniform(col)`	numeric x1	JSON	Chi-squared goodness-of-fit test (uniform)
`stat_shapiro_wilk(col)`	numeric x1	JSON	Shapiro-Wilk test
`stat_ks_test(col)`	numeric x1	JSON	Lilliefors test (normality)
`stat_wilcoxon(col, mu0)`	numeric, numeric	JSON	Wilcoxon signed-rank test
`stat_ci_mean(col, confidence)`	numeric, numeric	JSON	Confidence interval for the mean (t-distribution)
`stat_ci_mean_z(col, sigma, confidence)`	numeric, numeric, numeric	JSON	Confidence interval for the mean (z-distribution)
`stat_ci_var(col, confidence)`	numeric, numeric	JSON	Confidence interval for variance
`stat_moe_mean(col, confidence)`	numeric, numeric	numeric	Margin of error for the mean
`stat_cohens_d(col, mu0)`	numeric, numeric	numeric	Cohen's d (one-sample)
`stat_hedges_g(col, mu0)`	numeric, numeric	numeric	Hedges' g (one-sample)
`stat_acf_lag(col, lag)`	numeric, numeric	numeric	Autocorrelation coefficient
`stat_biweight_midvar(col, c)`	numeric, numeric	numeric	Biweight Midvariance
`stat_bootstrap_mean(col, n)`	numeric, numeric	JSON	Bootstrap for the mean
`stat_bootstrap_median(col, n)`	numeric, numeric	JSON	Bootstrap for the median
`stat_bootstrap_stddev(col, n)`	numeric, numeric	JSON	Bootstrap for the standard deviation

Two-Column Aggregate Functions (27 functions)

Function	Arguments	Return	Description
`stat_population_covariance(x, y)`	numeric, numeric	numeric	Population covariance
`stat_covariance(x, y)`	numeric, numeric	numeric	Sample covariance
`stat_pearson_r(x, y)`	numeric, numeric	numeric	Pearson correlation coefficient
`stat_spearman_r(x, y)`	numeric, numeric	numeric	Spearman rank correlation
`stat_kendall_tau(x, y)`	numeric, numeric	numeric	Kendall rank correlation
`stat_weighted_covariance(val, wt)`	numeric, numeric	numeric	Weighted covariance
`stat_weighted_mean(val, wt)`	numeric, numeric	numeric	Weighted mean
`stat_weighted_harmonic_mean(val, wt)`	numeric, numeric	numeric	Weighted harmonic mean
`stat_weighted_variance(val, wt)`	numeric, numeric	numeric	Weighted variance
`stat_weighted_stddev(val, wt)`	numeric, numeric	numeric	Weighted standard deviation
`stat_weighted_median(val, wt)`	numeric, numeric	numeric	Weighted median
`stat_weighted_percentile(val, wt, p)`	numeric, numeric, numeric	numeric	Weighted percentile
`stat_simple_regression(x, y)`	numeric, numeric	JSON	Simple linear regression
`stat_r_squared(actual, pred)`	numeric, numeric	numeric	Coefficient of determination R-squared
`stat_adjusted_r_squared(actual, pred)`	numeric, numeric	numeric	Adjusted R-squared
`stat_t_test_paired(x, y)`	numeric, numeric	JSON	Paired t-test
`stat_chisq_gof(obs, exp)`	numeric, numeric	JSON	Chi-squared goodness-of-fit test
`stat_mae(actual, pred)`	numeric, numeric	numeric	Mean absolute error
`stat_mse(actual, pred)`	numeric, numeric	numeric	Mean squared error
`stat_rmse(actual, pred)`	numeric, numeric	numeric	RMSE
`stat_mape(actual, pred)`	numeric, numeric	numeric	Mean absolute percentage error
`stat_euclidean_dist(a, b)`	numeric, numeric	numeric	Euclidean distance
`stat_manhattan_dist(a, b)`	numeric, numeric	numeric	Manhattan distance
`stat_cosine_sim(a, b)`	numeric, numeric	numeric	Cosine similarity
`stat_cosine_dist(a, b)`	numeric, numeric	numeric	Cosine distance
`stat_minkowski_dist(a, b, p)`	numeric, numeric, numeric	numeric	Minkowski distance
`stat_chebyshev_dist(a, b)`	numeric, numeric	numeric	Chebyshev distance

Window Functions (23 functions)

Window functions are implemented using a full-scan approach. Each row returns a single value. Can also be used without an OVER() clause.

Function	Arguments	Return	Description
`stat_rolling_mean(col, window)`	numeric, integer	numeric/row	Rolling mean
`stat_rolling_std(col, window)`	numeric, integer	numeric/row	Rolling standard deviation
`stat_rolling_min(col, window)`	numeric, integer	numeric/row	Rolling minimum
`stat_rolling_max(col, window)`	numeric, integer	numeric/row	Rolling maximum
`stat_rolling_sum(col, window)`	numeric, integer	numeric/row	Rolling sum
`stat_moving_avg(col, window)`	numeric, integer	numeric/row	Simple moving average
`stat_ema(col, span)`	numeric, integer	numeric/row	Exponential moving average (alpha=2/(span+1))
`stat_rank(col)`	numeric x1	numeric/row	Rank transformation
`stat_fillna_mean(col)`	numeric x1	numeric/row	Mean imputation
`stat_fillna_median(col)`	numeric x1	numeric/row	Median imputation
`stat_fillna_ffill(col)`	numeric x1	numeric/row	Forward fill
`stat_fillna_bfill(col)`	numeric x1	numeric/row	Backward fill
`stat_fillna_interp(col)`	numeric x1	numeric/row	Linear interpolation
`stat_label_encode(col)`	numeric x1	numeric/row	Label encoding
`stat_bin_width(col, n_bins)`	numeric, integer	numeric/row	Equal-width binning
`stat_bin_freq(col, n_bins)`	numeric, integer	numeric/row	Equal-frequency binning
`stat_lag(col, k)`	numeric, integer	numeric/row	Lag (value from k rows before)
`stat_diff(col, order)`	numeric, integer	numeric/row	Differencing
`stat_seasonal_diff(col, period)`	numeric, integer	numeric/row	Seasonal differencing
`stat_outliers_iqr(col)`	numeric x1	numeric/row	Outlier detection (IQR method, 1.0/0.0)
`stat_outliers_zscore(col)`	numeric x1	numeric/row	Outlier detection (Z-score method)
`stat_outliers_mzscore(col)`	numeric x1	numeric/row	Outlier detection (modified Z-score method)
`stat_winsorize(col, pct)`	numeric, integer	numeric/row	Winsorization

Composite Aggregate Functions (32 functions)

Aggregate functions that return JSON (multiple values), or composite aggregate functions such as two-sample tests and survival analysis.

Function	Arguments	Return	Description
`stat_modes(col)`	numeric x1	JSON	Modes (all)
`stat_five_number_summary(col)`	numeric x1	JSON	Five-number summary (min, Q1, median, Q3, max)
`stat_frequency_table(col)`	numeric x1	JSON	Frequency table
`stat_frequency_count(col)`	numeric x1	JSON	Frequency count for each value
`stat_relative_frequency(col)`	numeric x1	JSON	Relative frequency
`stat_cumulative_frequency(col)`	numeric x1	JSON	Cumulative frequency
`stat_cumulative_relative_frequency(col)`	numeric x1	JSON	Cumulative relative frequency
`stat_t_test2(grp1, grp2)`	numeric, numeric	JSON	Two-sample t-test (pooled variance)
`stat_t_test_welch(grp1, grp2)`	numeric, numeric	JSON	Welch t-test
`stat_chisq_independence(col1, col2)`	numeric, numeric	JSON	Chi-squared independence test
`stat_f_test(grp1, grp2)`	numeric, numeric	JSON	F-test (variance comparison)
`stat_mann_whitney(grp1, grp2)`	numeric, numeric	JSON	Mann-Whitney U test
`stat_anova1(val, grp)`	numeric, numeric	JSON	One-way ANOVA
`stat_contingency_table(col1, col2)`	numeric, numeric	JSON	Contingency table
`stat_cohens_d2(grp1, grp2)`	numeric, numeric	numeric	Cohen's d (two-sample)
`stat_hedges_g2(grp1, grp2)`	numeric, numeric	numeric	Hedges' g (two-sample)
`stat_glass_delta(grp1, grp2)`	numeric, numeric	numeric	Glass's Delta
`stat_ci_mean_diff(grp1, grp2)`	numeric, numeric	JSON	Confidence interval for two-sample mean difference
`stat_ci_mean_diff_welch(grp1, grp2)`	numeric, numeric	JSON	Confidence interval for two-sample mean difference (Welch)
`stat_kaplan_meier(time, event)`	numeric, numeric	JSON	Kaplan-Meier survival curve
`stat_nelson_aalen(time, event)`	numeric, numeric	JSON	Nelson-Aalen cumulative hazard estimate
`stat_logrank(time, event, grp)`	numeric, numeric, numeric	JSON	Log-rank test
`stat_bootstrap(col, n)`	numeric, numeric	JSON	General bootstrap estimation
`stat_bootstrap_bca(col, n)`	numeric, numeric	JSON	BCa bootstrap
`stat_bootstrap_sample(col)`	numeric x1	JSON	Bootstrap sample generation
`stat_permutation_test2(grp1, grp2)`	numeric, numeric	JSON	Two-sample permutation test
`stat_permutation_paired(x, y)`	numeric, numeric	JSON	Paired permutation test
`stat_permutation_corr(x, y)`	numeric, numeric	JSON	Correlation permutation test
`stat_acf(col, max_lag)`	numeric, numeric	JSON	Autocorrelation function
`stat_pacf(col, max_lag)`	numeric, numeric	JSON	Partial autocorrelation function
`stat_sample_replace(col, n)`	numeric, numeric	JSON	Sampling with replacement
`stat_sample(col, n)`	numeric, numeric	JSON	Sampling without replacement

Scalar Functions -- Test Helpers (40 functions)

Scalar functions that compute results from parameters alone. Essential for calculating p-values and deriving confidence intervals in statistical tests.

Function	Arguments	Return	Description
`stat_normal_pdf(x [,mu, sigma])`	1-3	numeric	Normal distribution PDF
`stat_normal_cdf(x [,mu, sigma])`	1-3	numeric	Normal distribution CDF
`stat_normal_quantile(p [,mu, sigma])`	1-3	numeric	Normal distribution quantile
`stat_normal_rand([mu, sigma])`	0-2	numeric	Normal distribution random variate
`stat_chisq_pdf(x, df)`	2	numeric	Chi-squared distribution PDF
`stat_chisq_cdf(x, df)`	2	numeric	Chi-squared distribution CDF
`stat_chisq_quantile(p, df)`	2	numeric	Chi-squared distribution quantile
`stat_chisq_rand(df)`	1	numeric	Chi-squared distribution random variate
`stat_t_pdf(x, df)`	2	numeric	t-distribution PDF
`stat_t_cdf(x, df)`	2	numeric	t-distribution CDF
`stat_t_quantile(p, df)`	2	numeric	t-distribution quantile
`stat_t_rand(df)`	1	numeric	t-distribution random variate
`stat_f_pdf(x, df1, df2)`	3	numeric	F-distribution PDF
`stat_f_cdf(x, df1, df2)`	3	numeric	F-distribution CDF
`stat_f_quantile(p, df1, df2)`	3	numeric	F-distribution quantile
`stat_f_rand(df1, df2)`	2	numeric	F-distribution random variate
`stat_betainc(a, b, x)`	3	numeric	Regularized incomplete beta function
`stat_betaincinv(a, b, p)`	3	numeric	Inverse regularized incomplete beta function
`stat_norm_cdf(x)`	1	numeric	Standard normal distribution CDF
`stat_norm_quantile(p)`	1	numeric	Standard normal distribution inverse CDF
`stat_gammainc_lower(a, x)`	2	numeric	Lower incomplete gamma function
`stat_gammainc_upper(a, x)`	2	numeric	Upper incomplete gamma function
`stat_gammainc_lower_inv(a, p)`	2	numeric	Inverse lower incomplete gamma function
`stat_z_test_prop(x, n, p0)`	3	JSON	One-sample proportion z-test
`stat_z_test_prop2(x1, n1, x2, n2)`	4	JSON	Two-sample proportion z-test
`stat_bonferroni(p, m)`	2	numeric	Bonferroni correction
`stat_bh_correction(p, rank, total)`	3	numeric	Benjamini-Hochberg correction
`stat_holm_correction(p, rank, total)`	3	numeric	Holm correction
`stat_fisher_exact(a, b, c, d)`	4	JSON	Fisher's exact test
`stat_odds_ratio(a, b, c, d)`	4	numeric	Odds ratio
`stat_relative_risk(a, b, c, d)`	4	numeric	Relative risk
`stat_risk_difference(a, b, c, d)`	4	numeric	Risk difference
`stat_nnt(a, b, c, d)`	4	numeric	Number needed to treat (NNT)
`stat_ci_prop(x, n [,confidence])`	2-3	JSON	Confidence interval for proportion (Wald method)
`stat_ci_prop_wilson(x, n [,confidence])`	2-3	JSON	Confidence interval for proportion (Wilson method)
`stat_ci_prop_diff(x1, n1, x2, n2 [,conf])`	4-5	JSON	Confidence interval for two-sample proportion difference
`stat_aic(log_likelihood, k)`	2	numeric	AIC
`stat_aicc(log_likelihood, n, k)`	3	numeric	Corrected AIC (AICc)
`stat_bic(log_likelihood, n, k)`	3	numeric	BIC
`stat_boxcox(x, lambda)`	2	numeric	Box-Cox transformation

Scalar Functions -- Distribution & Transformation (83 functions)

Calculator-style scalar functions that compute results from parameters alone. Distribution functions, effect size conversions, power analysis, and more.

Function	Arguments	Return	Description
`stat_uniform_pdf(x [,a, b])`	1-3	numeric	Uniform distribution PDF
`stat_uniform_cdf(x [,a, b])`	1-3	numeric	Uniform distribution CDF
`stat_uniform_quantile(p [,a, b])`	1-3	numeric	Uniform distribution quantile
`stat_uniform_rand([a, b])`	0-2	numeric	Uniform distribution random variate
`stat_exponential_pdf(x [,lambda])`	1-2	numeric	Exponential distribution PDF
`stat_exponential_cdf(x [,lambda])`	1-2	numeric	Exponential distribution CDF
`stat_exponential_quantile(p [,lambda])`	1-2	numeric	Exponential distribution quantile
`stat_exponential_rand([lambda])`	0-1	numeric	Exponential distribution random variate
`stat_gamma_pdf(x, shape, scale)`	3	numeric	Gamma distribution PDF
`stat_gamma_cdf(x, shape, scale)`	3	numeric	Gamma distribution CDF
`stat_gamma_quantile(p, shape, scale)`	3	numeric	Gamma distribution quantile
`stat_gamma_rand(shape, scale)`	2	numeric	Gamma distribution random variate
`stat_beta_pdf(x, alpha, beta)`	3	numeric	Beta distribution PDF
`stat_beta_cdf(x, alpha, beta)`	3	numeric	Beta distribution CDF
`stat_beta_quantile(p, alpha, beta)`	3	numeric	Beta distribution quantile
`stat_beta_rand(alpha, beta)`	2	numeric	Beta distribution random variate
`stat_lognormal_pdf(x [,mu, sigma])`	1-3	numeric	Log-normal distribution PDF
`stat_lognormal_cdf(x [,mu, sigma])`	1-3	numeric	Log-normal distribution CDF
`stat_lognormal_quantile(p [,mu, sigma])`	1-3	numeric	Log-normal distribution quantile
`stat_lognormal_rand([mu, sigma])`	0-2	numeric	Log-normal distribution random variate
`stat_weibull_pdf(x, shape, scale)`	3	numeric	Weibull distribution PDF
`stat_weibull_cdf(x, shape, scale)`	3	numeric	Weibull distribution CDF
`stat_weibull_quantile(p, shape, scale)`	3	numeric	Weibull distribution quantile
`stat_weibull_rand(shape, scale)`	2	numeric	Weibull distribution random variate
`stat_binomial_pmf(k, n, p)`	3	numeric	Binomial distribution PMF
`stat_binomial_cdf(k, n, p)`	3	numeric	Binomial distribution CDF
`stat_binomial_quantile(q, n, p)`	3	numeric	Binomial distribution quantile
`stat_binomial_rand(n, p)`	2	numeric	Binomial distribution random variate
`stat_poisson_pmf(k, lambda)`	2	numeric	Poisson distribution PMF
`stat_poisson_cdf(k, lambda)`	2	numeric	Poisson distribution CDF
`stat_poisson_quantile(q, lambda)`	2	numeric	Poisson distribution quantile
`stat_poisson_rand(lambda)`	1	numeric	Poisson distribution random variate
`stat_geometric_pmf(k, p)`	2	numeric	Geometric distribution PMF
`stat_geometric_cdf(k, p)`	2	numeric	Geometric distribution CDF
`stat_geometric_quantile(q, p)`	2	numeric	Geometric distribution quantile
`stat_geometric_rand(p)`	1	numeric	Geometric distribution random variate
`stat_nbinom_pmf(k, r, p)`	3	numeric	Negative binomial distribution PMF
`stat_nbinom_cdf(k, r, p)`	3	numeric	Negative binomial distribution CDF
`stat_nbinom_quantile(q, r, p)`	3	numeric	Negative binomial distribution quantile
`stat_nbinom_rand(r, p)`	2	numeric	Negative binomial distribution random variate
`stat_hypergeom_pmf(k, N, K, n)`	4	numeric	Hypergeometric distribution PMF
`stat_hypergeom_cdf(k, N, K, n)`	4	numeric	Hypergeometric distribution CDF
`stat_hypergeom_quantile(q, N, K, n)`	4	numeric	Hypergeometric distribution quantile
`stat_hypergeom_rand(N, K, n)`	3	numeric	Hypergeometric distribution random variate
`stat_bernoulli_pmf(k, p)`	2	numeric	Bernoulli distribution PMF
`stat_bernoulli_cdf(k, p)`	2	numeric	Bernoulli distribution CDF
`stat_bernoulli_quantile(q, p)`	2	numeric	Bernoulli distribution quantile
`stat_bernoulli_rand(p)`	1	numeric	Bernoulli distribution random variate
`stat_duniform_pmf(k, a, b)`	3	numeric	Discrete uniform distribution PMF
`stat_duniform_cdf(k, a, b)`	3	numeric	Discrete uniform distribution CDF
`stat_duniform_quantile(q, a, b)`	3	numeric	Discrete uniform distribution quantile
`stat_duniform_rand(a, b)`	2	numeric	Discrete uniform distribution random variate
`stat_binomial_coef(n, k)`	2	integer	Binomial coefficient
`stat_log_binomial_coef(n, k)`	2	numeric	Log binomial coefficient
`stat_log_factorial(n)`	1	numeric	Log factorial
`stat_lgamma(x)`	1	numeric	Log-gamma function
`stat_tgamma(x)`	1	numeric	Gamma function
`stat_beta_func(a, b)`	2	numeric	Beta function
`stat_lbeta(a, b)`	2	numeric	Log-beta function
`stat_erf(x)`	1	numeric	Error function
`stat_erfc(x)`	1	numeric	Complementary error function
`stat_logarithmic_mean(a, b)`	2	numeric	Logarithmic mean
`stat_hedges_j(n)`	1	numeric	Hedges correction factor
`stat_t_to_r(t, df)`	2	numeric	t-value to correlation coefficient conversion
`stat_d_to_r(d)`	1	numeric	Cohen's d to correlation coefficient conversion
`stat_r_to_d(r)`	1	numeric	Correlation coefficient to Cohen's d conversion
`stat_eta_squared_ef(ss_effect, ss_total)`	2	numeric	Eta-squared (effect size)
`stat_partial_eta_sq(F, df1, df2)`	3	numeric	Partial eta-squared
`stat_omega_squared_ef(ss_effect, ss_total, ms_error, df_effect)`	4	numeric	Omega-squared (effect size)
`stat_cohens_h(p1, p2)`	2	numeric	Cohen's h (proportion difference)
`stat_interpret_d(d)`	1	text	Interpretation of Cohen's d
`stat_interpret_r(r)`	1	text	Interpretation of correlation coefficient
`stat_interpret_eta2(eta2)`	1	text	Interpretation of eta-squared
`stat_power_t1(d, n, alpha)`	3	numeric	Power of one-sample t-test
`stat_n_t1(d, power, alpha)`	3	numeric	Required sample size for one-sample t-test
`stat_power_t2(d, n1, n2, alpha)`	4	numeric	Power of two-sample t-test
`stat_n_t2(d, power, alpha)`	3	numeric	Required sample size for two-sample t-test
`stat_power_prop(p1, p2, n, alpha)`	4	numeric	Power of proportion test
`stat_n_prop(p1, p2, power, alpha)`	4	numeric	Required sample size for proportion test
`stat_moe_prop(x, n [,confidence])`	2-3	numeric	Margin of error for proportion
`stat_moe_prop_worst(n [,confidence])`	1-2	numeric	Worst-case margin of error
`stat_n_moe_prop(moe [,confidence [,p]])`	1-3	numeric	Sample size for proportion estimation
`stat_n_moe_mean(moe, sigma [,confidence])`	2-3	numeric	Sample size for mean estimation

All aggregate functions ignore NULL input rows (following SQLite3 aggregate function conventions). Window functions track NULL inputs and either return NULL or impute values in the result (function-dependent). Scalar functions return NULL if any argument is NULL. NULL is returned when all rows are NULL or the result set is empty. NULL is returned when the computed result is NaN or Inf. Random variate functions (*_rand) are registered as non-deterministic (without the SQLITE_DETERMINISTIC flag).

3. Build and Run

# Build
cmake -B build
cmake --build build

# Run (C++ test program)
./build/a.out

Using from sqlite3 CLI

The built extension library can also be used from the sqlite3 command-line tool.

Prerequisites

The sqlite3 CLI itself must be built with SQLITE_ENABLE_LOAD_EXTENSION enabled. The macOS standard sqlite3 has this enabled.

Method 1: `.load` dot-command (recommended)

cd build
sqlite3 test.db

sqlite> .load ./ext_funcs sqlite3_ext_funcs_init
sqlite> SELECT stat_mean(val), stat_median(val) FROM data;
5.5|5.5
sqlite> SELECT stat_sample_stddev(val) FROM data;
3.02765035409749
sqlite> SELECT category, stat_mean(score) FROM exam GROUP BY category;

The first argument to .load is the library path, and the second argument is the entry point name. The file extension (.dylib) can be omitted (SQLite auto-completes it).

Method 2: `load_extension()` SQL function

SELECT load_extension('./ext_funcs', 'sqlite3_ext_funcs_init');
SELECT square(7);

This method has stricter restrictions than .load and requires separate enablement via sqlite3_db_config(db, SQLITE_DBCONFIG_ENABLE_LOAD_EXTENSION, 1, NULL). Using .load is generally more reliable.

Why Entry Point Name Must Be Specified

SQLite auto-infers the entry point name from the filename, but ext_funcs becomes sqlite3_extfuncs_init with _ stripped. Since the actual entry point name is sqlite3_ext_funcs_init, it must be explicitly specified as the second argument.

4. How Loadable Extensions Work

Overall Flow

1. main.cpp: sqlite3_enable_load_extension(db, 1)
   └─ Enable extension loading

2. main.cpp: sqlite3_load_extension(db, "ext_funcs.dylib", "sqlite3_ext_funcs_init", ...)
   └─ dyld loads the dynamic library
   └─ Calls the entry point function

3. ext_funcs: sqlite3_ext_funcs_init(db, ...)
   └─ Initialize API routines with SQLITE_EXTENSION_INIT2(pApi)
   └─ Register custom functions with sqlite3_create_function()

4. main.cpp: SELECT square(5), upper('hello'), sha256_agg(content), ...
   └─ SQLite calls the registered callback functions

Entry Point Naming Convention

SQLite infers the entry point name from the library filename:

ext_funcs.dylib → sqlite3_ext_funcs_init

If the filename contains - or ., auto-inference will fail, so the entry point must be explicitly specified as the third argument of sqlite3_load_extension().

Required Macros

#include <sqlite3ext.h>

SQLITE_EXTENSION_INIT1              // Once at global scope
// ...
extern "C"
int sqlite3_ext_funcs_init(...) {
    SQLITE_EXTENSION_INIT2(pApi);   // Once at the beginning of the entry point
    // ...
}

SQLITE_EXTENSION_INIT1: Declares the global pointer for the SQLite API
SQLITE_EXTENSION_INIT2(pApi): Sets the API pointer passed from the host application
extern "C": Specifies C linkage for the entry point when compiling with C++

Without these, API calls such as sqlite3_create_function cannot be made.

Compile Flags

SQLITE_ENABLE_LOAD_EXTENSION must be enabled on the host side (SQLite core):

target_compile_definitions(sqlite3 PUBLIC SQLITE_ENABLE_LOAD_EXTENSION)

5. Extension Function Implementation Patterns

5.1. Function Registration (sqlite3_create_function)

int sqlite3_create_function(
    sqlite3 *db,              // Database connection
    const char *zFunctionName,// SQL function name
    int nArg,                 // Number of arguments (-1 for variadic)
    int eTextRep,             // Text encoding + flags
    void *pApp,               // User data passed to callbacks (optional)
    void (*xFunc)(...),       // Scalar function callback
    void (*xStep)(...),       // Per-row processing for aggregate functions (nullptr for scalar)
    void (*xFinal)(...)       // Final processing for aggregate functions (nullptr for scalar)
);

Flags for eTextRep

Flag	Meaning
`SQLITE_UTF8`	UTF-8 text (typically this one)
`SQLITE_UTF16`	UTF-16 text
`SQLITE_DETERMINISTIC`	Always returns the same result for the same input (optimization hint)
`SQLITE_INNOCUOUS`	No side effects, safe (SQLite 3.31+)
`SQLITE_DIRECTONLY`	Can only be called from SQL (SQLite 3.30+)

5.2. Callback Function Signature

void callback(sqlite3_context *ctx, int argc, sqlite3_value **argv);

ctx: Context used for returning results and reporting errors
argc: Number of arguments
argv: Array of arguments

5.3. Retrieving Arguments (sqlite3_value_*)

Function	Return Type	Description
`sqlite3_value_type(v)`	`int`	Returns the type of the value (see type constants below)
`sqlite3_value_int(v)`	`int`	Retrieve as 32-bit integer
`sqlite3_value_int64(v)`	`sqlite3_int64`	Retrieve as 64-bit integer
`sqlite3_value_double(v)`	`double`	Retrieve as floating point
`sqlite3_value_text(v)`	`const unsigned char *`	Retrieve as UTF-8 string
`sqlite3_value_text16(v)`	`const void *`	Retrieve as UTF-16 string
`sqlite3_value_blob(v)`	`const void *`	Retrieve as BLOB data
`sqlite3_value_bytes(v)`	`int`	Byte count of text/blob

Type Constants (return values of sqlite3_value_type)

Constant	Value	Description
`SQLITE_INTEGER`	1	Integer
`SQLITE_FLOAT`	2	Floating point
`SQLITE_TEXT`	3	Text
`SQLITE_BLOB`	4	Binary
`SQLITE_NULL`	5	NULL

5.4. Returning Results (sqlite3_result_*)

Returning Scalar Values

Function	Return Type	C++ Type
`sqlite3_result_int(ctx, int)`	32-bit integer	`int`
`sqlite3_result_int64(ctx, sqlite3_int64)`	64-bit integer	`sqlite3_int64`
`sqlite3_result_double(ctx, double)`	Floating point	`double`
`sqlite3_result_text(ctx, str, len, destructor)`	UTF-8 string	`const char *`
`sqlite3_result_text16(ctx, str, len, destructor)`	UTF-16 string	`const void *`
`sqlite3_result_blob(ctx, data, len, destructor)`	Binary data	`const void *`
`sqlite3_result_null(ctx)`	NULL	--

Special Returns

Function	Usage
`sqlite3_result_zeroblob(ctx, n)`	Return an n-byte zero-filled BLOB
`sqlite3_result_value(ctx, sqlite3_value*)`	Return the input value as-is (preserving its type)
`sqlite3_result_pointer(ctx, ptr, type, destructor)`	Pass a pointer (SQLite 3.20+)
`sqlite3_result_subtype(ctx, unsigned int)`	Attach subtype information such as JSON

Returning Errors

Function	Usage
`sqlite3_result_error(ctx, msg, len)`	Return an error message
`sqlite3_result_error_nomem(ctx)`	Return an out-of-memory error
`sqlite3_result_error_toobig(ctx)`	Return a data-too-large error
`sqlite3_result_error_code(ctx, int)`	Return an arbitrary error code

destructor Argument

Memory management specification passed as the last argument to _text and _blob functions:

Value	Meaning
`SQLITE_STATIC`	Data is managed by the caller. SQLite does not copy it
`SQLITE_TRANSIENT`	SQLite creates an internal copy. Safe even for local variables
Function pointer (e.g., `sqlite3_free`)	SQLite calls the destructor after use

5.5. Memory Management

Within extensions, it is recommended to use SQLite-provided functions instead of standard malloc/free:

Function	Description
`sqlite3_malloc(n)`	Allocate n bytes of memory
`sqlite3_realloc(ptr, n)`	Reallocate memory
`sqlite3_free(ptr)`	Free memory
`sqlite3_mprintf(fmt, ...)`	printf-style string creation with memory allocation

By passing sqlite3_free as the destructor to sqlite3_result_text, SQLite manages the deallocation of the buffer:

auto *result = static_cast<char *>(sqlite3_malloc(len + 1));
// ... write data to result ...
sqlite3_result_text(ctx, result, len, sqlite3_free);
// → SQLite will call sqlite3_free(result) when done with result

5.6. Implementation Template

static void my_func(sqlite3_context *ctx, int /*argc*/, sqlite3_value **argv) {
    // 1. NULL check
    if (sqlite3_value_type(argv[0]) == SQLITE_NULL) {
        sqlite3_result_null(ctx);
        return;
    }

    // 2. Retrieve arguments
    const char *input = reinterpret_cast<const char *>(sqlite3_value_text(argv[0]));
    int len = static_cast<int>(std::strlen(input));

    // 3. Processing
    auto *result = static_cast<char *>(sqlite3_malloc(len + 1));
    if (!result) {
        sqlite3_result_error_nomem(ctx);
        return;
    }
    // ... build result ...

    // 4. Return result
    sqlite3_result_text(ctx, result, len, sqlite3_free);
}

6. SQLite C API Reference

Extension Loading API

Function	Description
`sqlite3_enable_load_extension(db, onoff)`	Enable/disable extension loading
`sqlite3_load_extension(db, file, proc, errmsg)`	Load an extension library
`sqlite3_auto_extension(xEntryPoint)`	Register an extension to auto-load on new DB connections
`sqlite3_cancel_auto_extension(xEntryPoint)`	Unregister an auto-load extension
`sqlite3_reset_auto_extension()`	Unregister all auto-load extensions

Using pApp (User Data)

An arbitrary pointer can be passed as the 5th argument pApp of sqlite3_create_function. Retrieve it within the callback using sqlite3_user_data(ctx):

// At registration
int multiplier = 10;
sqlite3_create_function(db, "scale", 1, SQLITE_UTF8, &multiplier,
                         scale_func, nullptr, nullptr);

// Inside the callback
static void scale_func(sqlite3_context *ctx, int /*argc*/, sqlite3_value **argv) {
    auto *mul = static_cast<int *>(sqlite3_user_data(ctx));
    double x = sqlite3_value_double(argv[0]);
    sqlite3_result_double(ctx, x * (*mul));
}

Implementing Aggregate Functions

To implement aggregate functions (like SUM, AVG), use xStep and xFinal. In this project, a pattern is adopted where the state object is allocated on the heap and only a pointer is stored in sqlite3_aggregate_context:

struct SingleColumnState {
    std::vector<double> values;
};

// xStep: Called for each row (NULL rows are skipped)
static void xStep(sqlite3_context *ctx, int /*argc*/, sqlite3_value **argv) {
    if (sqlite3_value_type(argv[0]) == SQLITE_NULL) return;
    auto **pp = static_cast<SingleColumnState**>(
        sqlite3_aggregate_context(ctx, sizeof(SingleColumnState*)));
    if (!pp) return;
    if (!*pp) *pp = new (std::nothrow) SingleColumnState();
    if (*pp) (*pp)->values.push_back(sqlite3_value_double(argv[0]));
}

// xFinal: Called once at the end
static void xFinal(sqlite3_context *ctx) {
    auto **pp = static_cast<SingleColumnState**>(
        sqlite3_aggregate_context(ctx, 0));
    if (!pp || !*pp || (*pp)->values.empty()) {
        sqlite3_result_null(ctx);
    } else {
        double result = statcpp::mean((*pp)->values.begin(), (*pp)->values.end());
        if (std::isnan(result) || std::isinf(result)) {
            sqlite3_result_null(ctx);
        } else {
            sqlite3_result_double(ctx, result);
        }
    }
    if (pp && *pp) { delete *pp; *pp = nullptr; }
}

// Registration (xFunc = nullptr, specify xStep and xFinal)
sqlite3_create_function_v2(db, "stat_mean", 1,
    SQLITE_UTF8 | SQLITE_DETERMINISTIC,
    nullptr, nullptr, xStep, xFinal, nullptr);

7. C++ Implementation Techniques

Three Callback Coding Styles

In this project's ext_funcs.cpp, callbacks passed to sqlite3_create_function are written in three styles:

Method 1: Regular static function

// Function definition
static void reverse_func(sqlite3_context *ctx, int, sqlite3_value **argv) {
    // ...
}

// Registration
sqlite3_create_function(db, "reverse", 1, ..., reverse_func, nullptr, nullptr);

Method 2: Lambda assigned to auto variable

// Store lambda in a named variable
static auto repeat_func = [](sqlite3_context *ctx, int, sqlite3_value **argv) {
    // ...
};

// Registration — convert to function pointer with + and pass
sqlite3_create_function(db, "repeat", 2, ..., +repeat_func, nullptr, nullptr);

Method 3: Inline lambda

// Write directly at registration
sqlite3_create_function(db, "is_palindrome", 1, ...,
    +[](sqlite3_context *ctx, int, sqlite3_value **argv) {
        // ...
    },
    nullptr, nullptr);

All three styles result in identical function pointers after compilation, so there is no runtime difference.

Concise Registration with Template Helpers

This project uses six template patterns, designed so that only the statistical computation logic is swapped out.

// --- Template A: Basic aggregate functions ---
static double calc_mean(const std::vector<double>& v) {
    return statcpp::mean(v.begin(), v.end());
}
rc = SingleColumnAggregate<calc_mean>::register_func(db, "stat_mean");

// --- Template B: Aggregate functions with parameters ---
rc = SingleColumnParamAggregate<1, calc_percentile>::register_func(db, "stat_percentile");
rc = SingleColumnParamAggregateText<1, calc_t_test>::register_func(db, "stat_t_test");

// --- Template C: Two-column aggregate functions ---
rc = TwoColumnAggregate<calc_pearson_r>::register_func(db, "stat_pearson_r");
rc = TwoColumnAggregateText<calc_simple_regression>::register_func(db, "stat_simple_regression");
rc = TwoColumnParamAggregate<1, calc_weighted_percentile>::register_func(db, "stat_weighted_percentile");

// --- Template D: Window functions ---
rc = FullScanWindowFunction<wf_rolling_mean>::register_func_2(db, "stat_rolling_mean");  // 2 arguments
rc = FullScanWindowFunction<wf_rank>::register_func_1(db, "stat_rank");                   // 1 argument

// --- Template E: Composite aggregate functions ---
rc = SingleColumnAggregateText<calc_modes>::register_func(db, "stat_modes");
rc = TwoColumnAggregateText<calc_t_test2>::register_func(db, "stat_t_test2");

// --- Template F: Scalar functions ---
rc = register_scalar(db, "stat_normal_pdf", -1, sf_normal_pdf);     // Deterministic
rc = register_scalar_nd(db, "stat_normal_rand", -1, sf_normal_rand); // Non-deterministic (random)

How Lambda Expressions Work

+[](sqlite3_context *ctx, int, sqlite3_value **argv) { ... }

A captureless lambda ([]) is converted to a regular function pointer at compile time
The leading + is an idiom for explicitly converting the lambda to a function pointer
The generated code is identical to static void func(...)
std::function is not a function pointer and therefore cannot be passed directly to sqlite3_create_function

How Templates Work

template <double (*Fn)(double)>
int register_double_func(sqlite3 *db, const char *name) { ... }

Templates are expanded at compile time
The Fn(x) call is inlined and does not result in an indirect call

Overloading Considerations

When overloads of the same function name exist, the template argument becomes ambiguous:

static double cube(double x) { return x * x * x; }
static long   cube(long x)   { return x * x * x; }

// Explicitly cast to resolve overload
register_double_func<static_cast<double(*)(double)>(cube)>(db, "cube");

Since SQLite values are internally dynamically typed, unifying on double is the simplest approach.

8. Dynamic Library Considerations

Entry Point Symbol

extern "C" suppresses name mangling, exporting the symbol name that SQLite expects.

nm -gU ext_funcs.dylib | grep sqlite3_ext_funcs_init
# → _sqlite3_ext_funcs_init

Lambdas and templates have internal linkage (equivalent to static) and do not appear in the .dylib public symbol table.

C++ Runtime Dependency

otool -L ext_funcs.dylib

Use of std::string, std::reverse, etc. adds a dynamic link dependency on libc++.

libc++ Loading Cost in This Project

In this project, since main.cpp is compiled and linked as C++, dyld has already loaded libc++.1.dylib at process startup.

1. main starts
   └─ dyld loads libc++.1.dylib (required because main is C++)

2. sqlite3_load_extension() loads ext_funcs.dylib
   └─ ext_funcs.dylib also depends on libc++
   └─ But libc++ is already in memory
   └─ → Only the reference count is incremented; no reload occurs

macOS's dyld does not load the same library twice. It merely shares the already-mapped address, so the cost is essentially zero.

How to Eliminate libc++ Dependency

Avoid std::string and use sqlite3_malloc + hand-written loops instead
Use only header-only functions from <algorithm> (they are inlined)
Disable exceptions and RTTI at build time:

target_compile_options(ext_funcs PRIVATE -fno-exceptions -fno-rtti)

Runtime Overhead

Element	Overhead	Description
Lambda to function pointer	None	Compile-time conversion
Template expansion	None	Compile-time expansion, inlined
`extern "C"`	None	Linkage specification only
`std::string` + `SQLITE_TRANSIENT`	Minimal	One extra copy (e.g., reverse function)