This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
collapse is a high-performance C/C++-based R package for advanced data transformation and statistical computing. The package provides:
- Fast grouped and weighted statistical functions with OpenMP multithreading
- Class-agnostic architecture supporting base R, tibble, data.table, sf, plm, and other extensions
- ~13,775 lines of R code and ~28,625 lines of C/C++ code
- Comprehensive test suite with 48 test files
# Load package for development
devtools::load_all()
# Run all tests
devtools::test()
# Run specific test file
testthat::test_file("tests/testthat/test-GRP.R")
# Run package check
devtools::check()
# Build documentation
devtools::document()# Build package tarball
R CMD build .
# Check package (comprehensive tests)
R CMD check collapse_*.tar.gz
# Install from source
R CMD INSTALL collapse_*.tar.gz
# Run tests directly
Rscript -e "testthat::test_dir('tests/testthat')"- GitHub Actions runs R-CMD-check on macOS, Windows, and Ubuntu
- Tests against R-devel, R-release, and R-oldrel-1
- Code coverage tracked via codecov
- Documentation auto-deployed via pkgdown
R/ directory (52 files)
f*prefix: Fast statistical functions (fmean.R, fmedian.R, fvar_fsd.R, etc.)GRP.R(1,428 lines): Core grouping object using radix sort or hash methodsfsubset_ftransform_fmutate.R(1,220 lines): Data manipulation (subset, transform, mutate)collap.R(793 lines): Advanced aggregation frameworkpivot.R(823 lines): Pivot operations (wider/longer/recast)join.R(473 lines): Fast join operationsfbetween_fwithin.R,fhdbetween_fhdwithin.R: Between/within group transformationsfdiff_fgrowth.R,flag.R: Time series operations (differences, growth rates, lags/leads)qsu.R,descr.R: Statistical summarieslist_functions.R(647 lines): Recursive list operationsindexing.R(741 lines): Time series and panel data indexingzzz.R(195 lines): Package initialization, namespace masking system
src/ directory (94 files)
collapse_c.h(195 lines): Main C header with all function signatures and OpenMP macrosbase_radixsort.c(2,158 lines): Fast radix sorting (adapted from base R)fdiff_fgrowth.cpp(1,854 lines): Differencing and growth rate calculationsfvar_fsd.cpp(1,689 lines): Variance and standard deviationfnth_fmedian_fquantile.c(1,627 lines): Quantile functionsTRA.c(1,416 lines): Transformation operations frameworkprogramming.c(1,284 lines): Programming helper functionsfmode.c(1,248 lines): Mode calculationkit_dup.c(1,169 lines): Duplicate handling (adapted from kit package)match.c(1,161 lines): Fast matching operationsflag.cpp(1,083 lines): Lag/lead operationsdata.table_*.c: Integration with data.table internals (rbindlist, subset, utils)Makevars,Makevars.win: OpenMP compilation configuration
tests/testthat/ (48 files)
test-<function>.Rpattern for each major function- Comprehensive tests against base R equivalents
- Tests with NA/Inf values, different data types, grouping, and weights
1. Two-Layer Design
- R layer: Input validation, S3 method dispatch, attribute handling
- C/C++ layer: Core computation with OpenMP parallelization
- Pattern:
.Call()invokes C functions named with_C,_mC(matrix),_lC(list/data.frame) suffixes
2. Function Signature Pattern Most statistical functions follow:
f<name>(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE, use.g.names = TRUE, ...)x: data (vector, matrix, or data.frame)g: grouping variable(s) - converted to GRP object internallyw: weights vectorTRA: transformation operation (see TRA framework below)- S3 methods for: default, matrix, data.frame, grouped_df, zoo, units, pdata.frame, pseries, sf
3. GRP (Grouping) Object Central to all grouped operations:
- Created via
GRP()or automatically from factors/lists - Contains:
N.groups,group.sizes,groups(names),group.id,group.starts - Two algorithms: radix sort (default, stable) or hash-based (faster for many groups)
- Access via
GRP(),fgroup_by(), or automatic conversion from factors
4. TRA (Transformation) Framework 10 transformation operations applicable to all statistical functions:
"-": center (subtract statistic)"+": add statistic"*": multiply by statistic"/": divide by statistic (scale)"%": compute percentage of statistic"%%": modulus"-+": center and add overall mean"+-": add statistic and subtract overall mean"replace": replace values with statistic"replace_fill": replace and fill with statistic
Example: fmean(x, g, TRA = "-") centers x by group means
5. Namespace Masking System
- Functions can be exported without 'f' prefix via
options(collapse_mask = "all") - Controlled via
.fastverseconfig file orset_collapse(mask = "manip") - Allows
mean(),sum(), etc. to use collapse's fast versions - Keywords: "all", "fast-fun", "fast-stat-fun", "helper", "manip", "special"
6. Class-Agnostic Design
- Consistent S3 methods across all data structures
- Attribute preservation system maintains class-specific attributes
- Functions work identically on base R, tibble, data.table, sf, plm objects
7. OpenMP Parallelization
- Controlled via
set_collapse(nthreads = n)oroptions(collapse_nthreads = n) - Automatic fallback if OpenMP not available
- Used in: fsum, fmean, fmode, and other computationally intensive functions
- Compilation flags in Makevars handle platform differences
8. data.table Integration
- Reuses core algorithms from data.table (radixsort, rbindlist, subset) under MPL 2.0
- Works natively on data.table objects
- Compatible with
:=operator - Modified for collapse's specific needs
-
Create R file
R/f<name>.R:- Define S3 generic:
f<name> <- function(x, ...) UseMethod("f<name>") - Implement methods:
f<name>.default,f<name>.matrix,f<name>.data.frame - Add grouping support via GRP object
- Add TRA argument for transformations
- Handle attributes with
copyAttrib()orcopyMostAttrib()
- Define S3 generic:
-
Create C implementation in
src/:- Add function signature to
collapse_c.h - Implement ungrouped version:
<name>_C() - Implement grouped version with g, starts, sizes parameters
- Add OpenMP parallelization where beneficial
- Register in
ExportSymbols.cor use Rcpp
- Add function signature to
-
Add tests in
tests/testthat/test-f<name>.R:- Compare against base R equivalent
- Test with NA values, Inf, different types
- Test grouped and weighted versions
- Test with different data structures
-
Update documentation:
- Add roxygen2 comments
- Update
collapse-documentation.Rdif adding new category - Add examples
- All C functions callable from R are declared in
collapse_c.h - Use
PROTECT/UNPROTECTfor R objects in C code - Check existing patterns in similar functions before implementing
- OpenMP pragmas:
#pragma omp parallel for num_threads(nthreads) - Rcpp functions auto-generate interfaces via
RcppExports.cpp
- Use
settransform(),setv()for in-place modifications (reference semantics) - Regular
ftransform(),fmutate()copy on modify qDF(),qDT(),qM()for fast conversions without attribute copying- Avoid unnecessary copies in tight loops
- Every function needs comprehensive tests
- Test against base R or established packages
- Use random data and
set.seed()for reproducibility - Test edge cases: empty data, single row/column, all NA
- Test with grouped_df, data.table, pseries objects
fprefix: Fast statistical functionsqprefix: Quick conversions (qDF, qDT, qM)- Single letter operators: B (between), W (within), D (diff), G (growth), L (lag), HDB/HDW (high-dim between/within), STD (standardize)
- Many functions have shorter aliases in documentation
Access via get_collapse() and set_collapse():
nthreads: OpenMP thread count (default: system)na.rm: Default NA removal (default: TRUE)sort: Default sorting in GRP (default: TRUE)mask: Namespace masking levelverbose: Verbosity level for operations
copyAttrib(): Copy all attributescopyMostAttrib(): Copy all except names, dim, dimnamessetAttrib(): Set single attribute- Class attributes preserved automatically in most functions
- R layer validates inputs, provides informative errors
- C layer assumes validated inputs for performance
- Use
ckmatch()for matching with good error messages unused_arg_action()for handling unexpected arguments
R/zzz.R: Package initialization, namespace setupR/GRP.R: Core grouping systemR/global_macros.R: Global options and constantssrc/collapse_c.h: Complete C APIsrc/base_radixsort.c: Fast ordering algorithmtests/testthat/test-GRP.R: Comprehensive grouping testsvignettes/collapse_documentation.Rmd: Documentation guide
- collapse functions are highly optimized; avoid calling base R equivalents in hot paths
- Use GRP objects explicitly for repeated grouping operations
- Set
nthreadsappropriately for your system - Use in-place modification functions when appropriate
- Vector operations are faster than grouped operations on small data
- For large data, collapse typically outperforms base R by 2-100x
- Built-in documentation:
help('collapse-documentation') - Vignettes: https://fastverse.org/collapse/articles/
- arXiv article: https://arxiv.org/abs/2403.05038
- GitHub: https://github.com/fastverse/collapse