StatLang provides an open-source environment for statistical analysis by offering:
- Expressive scripting syntax for data manipulation and analysis
- Python backend for execution and performance
- Jupyter notebook support with a StatLang kernel
- VS Code extension with syntax highlighting and execution
- Cross-platform compatibility (Windows, macOS, Linux)
- Open source and free to use
- AI Integration: Built-in PROC LANGUAGE with LLM capabilities for intelligent data analysis
- Complete ML Pipeline: From data exploration to model deployment using familiar, concise syntax
- Deep Learning: PyTorch-powered DNN training, NLP, computer vision (including object detection), and reinforcement learning
- Modern SQL: PROC SQL powered by DuckDB for high-performance data querying
- Robust language features: Macro system, format system, and 38+ statistical/ML procedures
- Rich Visualizations: Professional output formatting with TITLE statements and structured results
- DATA step with MERGE, ARRAY, RETAIN, DO loops (iterative/while/until), FIRST./LAST., LAG/DIF
- INFILE/FILE I/O, INPUT parsing, and PUT output
- DATALINES/CARDS for inline data
- Subsetting IF and conditional IF/THEN/ELSE
- Row-by-row and vectorised execution paths
- Python pandas/numpy backend for performance
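
For readers new to SAS-style DATA step semantics, here is a rough pure-Python sketch of what FIRST./LAST. flags and LAG/DIF conventionally compute over rows sorted by a BY variable. It is not StatLang's implementation: the function name `by_group_flags` and the sample rows are hypothetical, and StatLang's handling of missing values may differ.

```python
from typing import Any


def by_group_flags(rows: list[dict[str, Any]], by: str, var: str) -> list[dict[str, Any]]:
    """Annotate rows (already sorted by `by`) with FIRST./LAST.-style flags, LAG and DIF.

    Illustrative only. LAG is the previous row's value (missing on the first
    row) and DIF is the current value minus LAG. Neither resets at group
    boundaries, which matches the conventional SAS behaviour.
    """
    out = []
    for i, row in enumerate(rows):
        prev_row = rows[i - 1] if i > 0 else None
        next_row = rows[i + 1] if i + 1 < len(rows) else None
        lag = prev_row[var] if prev_row is not None else None
        out.append({
            **row,
            "first": prev_row is None or prev_row[by] != row[by],
            "last": next_row is None or next_row[by] != row[by],
            "lag": lag,
            "dif": row[var] - lag if lag is not None else None,
        })
    return out


# Hypothetical sample data, sorted by department.
rows = [
    {"dept": "Eng", "salary": 75000},
    {"dept": "Eng", "salary": 80000},
    {"dept": "Sales", "salary": 45000},
]
flagged = by_group_flags(rows, by="dept", var="salary")
for r in flagged:
    print(r)
```

A row-by-row loop like this is the simplest way to state the semantics. A vectorised path would typically compute the same columns with shifted comparisons.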
- `%MACRO`/`%MEND` definitions with parameter lists
- `%LET`, `%PUT`, `&var` substitution
- `%IF`/`%THEN`/`%ELSE`, `%DO`/`%END` control flow
- `%INCLUDE` file injection (recursive with depth limit)
- `%SYSEVALF` arithmetic, `%SYSFUNC` (30+ built-in functions)
- `%GLOBAL`/`%LOCAL` scoping
- System variables: `&SYSDATE9`, `&SYSLAST`, `&SYSCC`, `&SYSJOBID`
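
To make `&var` substitution concrete, the sketch below resolves macro variable references in a line of code using a plain symbol table. It only illustrates the idea and assumes SAS-like conventions (case-insensitive names, an optional trailing dot as a delimiter). `resolve_macro_vars` is a hypothetical helper, not part of StatLang's API.

```python
import re

# Matches &name with an optional trailing dot, which SAS-style macro
# languages treat as a delimiter that ends the variable name.
_MACRO_VAR = re.compile(r"&([A-Za-z_][A-Za-z0-9_]*)\.?")


def resolve_macro_vars(text: str, symbols: dict) -> str:
    """Replace each &name reference with its value; leave unknown names untouched."""
    def substitute(match):
        name = match.group(1).lower()  # lookups are case-insensitive in this sketch
        return symbols.get(name, match.group(0))
    return _MACRO_VAR.sub(substitute, text)


# Symbol table as it might look after two %LET statements (assumed values).
symbols = {"target": "spend", "features": "age income"}
resolved = resolve_macro_vars("model &target = &features;", symbols)
print(resolved)
```

This sketch makes a single pass. A real macro processor would also handle nested references and scoping.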
- In-memory model store with optional pickle persistence
- Save, load, list, and delete trained models across procedures
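
One possible shape for an in-memory store with optional pickle persistence is sketched below. The class name `ModelStore`, its method names, and the `.pkl` file layout are all assumptions made for illustration. The real interface lives in `model_store.py` and may look quite different.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


class ModelStore:
    """Hypothetical in-memory registry with optional pickle persistence."""

    def __init__(self, persist_dir: Optional[Path] = None) -> None:
        self._models = {}
        self._persist_dir = persist_dir
        if persist_dir is not None:
            persist_dir.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, model: Any) -> None:
        self._models[name] = model
        if self._persist_dir is not None:
            (self._persist_dir / f"{name}.pkl").write_bytes(pickle.dumps(model))

    def load(self, name: str) -> Any:
        if name not in self._models and self._persist_dir is not None:
            path = self._persist_dir / f"{name}.pkl"
            if path.exists():
                # Only unpickle files you created yourself: pickle can run arbitrary code.
                self._models[name] = pickle.loads(path.read_bytes())
        return self._models[name]  # raises KeyError for unknown names

    def names(self) -> list:
        return sorted(self._models)

    def delete(self, name: str) -> None:
        self._models.pop(name, None)
        if self._persist_dir is not None:
            (self._persist_dir / f"{name}.pkl").unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    store = ModelStore(Path(tmp))
    store.save("shape_detector", {"weights": [0.1, 0.2]})
    fresh = ModelStore(Path(tmp))  # simulates a new session reading from disk
    restored = fresh.load("shape_detector")
print(restored)
```

Keeping the dictionary as the primary store means procedures in the same session share models without touching disk. Persistence only matters across sessions.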
- `run_pipeline()` for end-to-end `.statlang` file execution
- StatLang kernel for Jupyter notebooks
- Interactive statistical programming in notebook environment
- Rich output display with formatted tables
- Dataset visualisation and exploration
- Syntax highlighting for `.statlang` files
- Code snippets for common statistical analysis patterns
- File execution directly from VS Code
- Notebook support for interactive analysis
| Procedure | Description |
|---|---|
| PROC MEANS | Descriptive statistics with CLASS variables and OUTPUT |
| PROC FREQ | Frequency tables and cross-tabulations |
| PROC SORT | Data sorting with ascending/descending order |
| PROC PRINT | Data display and formatting |
| PROC REG | Linear regression with MODEL, OUTPUT, and SCORE |
| PROC UNIVARIATE | Detailed univariate analysis with distribution diagnostics |
| PROC CORR | Correlation analysis (Pearson, Spearman) |
| PROC FACTOR | Principal component and factor analysis |
| PROC CLUSTER | Clustering methods (k-means, hierarchical) |
| PROC NPAR1WAY | Nonparametric tests (Mann-Whitney, Kruskal-Wallis) |
| PROC TTEST | T-tests (independent and paired) |
| PROC LOGIT | Logistic regression |
| PROC TIMESERIES | Time series analysis and seasonal decomposition |
| PROC SURVEYSELECT | Random sampling (SRS, SAMPRATE/N, OUTALL) |
| PROC GLM | General Linear Models via statsmodels (Type III ANOVA) |
| PROC ANOVA | Balanced Analysis of Variance |
| PROC GENMOD | Generalised Linear Models (Gaussian, Binomial, Poisson, Gamma) |
| PROC MIXED | Mixed / multilevel models (random intercepts & slopes) |
| PROC ROBUSTREG | Robust regression (M-estimation via RLM) |
| PROC LIFEREG | Parametric survival (Weibull, Log-Normal, Log-Logistic AFT) |
| PROC PHREG | Cox proportional hazards regression |
| PROC DISCRIM | Discriminant analysis (LDA / QDA) |
| PROC PRINCOMP | Principal Component Analysis with StandardScaler |
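
As a rough picture of what "descriptive statistics with CLASS variables" means for PROC MEANS, the following standard-library sketch groups rows by a class variable and summarises one analysis variable per group. The function `class_means` and the sample rows are hypothetical. The statistics StatLang reports by default, and how it lays them out, may differ.

```python
from collections import defaultdict
from statistics import mean, stdev


def class_means(rows, class_var, analysis_var):
    """Group rows by `class_var` and summarise `analysis_var` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_var]].append(row[analysis_var])
    summary = {}
    for level, values in sorted(groups.items()):
        summary[level] = {
            "n": len(values),
            "mean": mean(values),
            # The sample standard deviation is undefined for a single observation.
            "std": stdev(values) if len(values) > 1 else None,
            "min": min(values),
            "max": max(values),
        }
    return summary


# Hypothetical sample rows.
employees = [
    {"department": "Engineering", "salary": 75000},
    {"department": "Marketing", "salary": 55000},
    {"department": "Engineering", "salary": 80000},
    {"department": "Sales", "salary": 45000},
]
summary = class_means(employees, "department", "salary")
for level, stats in summary.items():
    print(level, stats)
```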
| Procedure | Description |
|---|---|
| PROC TREE | Decision trees for classification and regression |
| PROC FOREST | Random forests for ensemble learning |
| PROC BOOST | Gradient boosting |
| PROC DNN | PyTorch feedforward neural networks (classification & regression) |
| PROC NLP | HuggingFace NLP (sentiment, classification, NER, summarisation) |
| PROC CVISION | Image classification (ResNet, VGG) and Faster R-CNN object detection |
| PROC RL | Tabular Q-learning for reinforcement learning |
| PROC LLM | Text generation, fill-mask, and QA via HuggingFace |
| Procedure | Description |
|---|---|
| PROC TRANSPOSE | Reshape data (wide / long) with BY group support |
| PROC APPEND | Concatenate datasets with FORCE option |
| PROC DATASETS | Delete, rename, and list datasets |
| PROC EXPORT | Export to CSV, Excel, JSON, Parquet |
| PROC IMPORT | Import from CSV, Excel, JSON, Parquet |
| PROC SQL | SQL query processing with DuckDB backend |
| PROC LANGUAGE | LLM-powered text generation, Q&A, and data analysis |
# Core statistical procedures
pip install statlang
# With deep learning (PROC DNN, PROC CVISION, PROC RL)
pip install statlang[dl]
# With NLP (PROC NLP, PROC LLM)
pip install statlang[nlp]
# With DuckDB SQL engine (PROC SQL)
pip install statlang[sql]
# With Jupyter notebook support
pip install statlang[notebook]
# Everything
pip install statlang[all]

# Install the StatLang kernel
python -m statlang.kernel install
# List available kernels
jupyter kernelspec list

- Install from VS Code Marketplace: "StatLang" by RyanBlakeStory
- Or install from source (see Development section)
from statlang import StatLangInterpreter
# Create interpreter
interpreter = StatLangInterpreter()
# Create sample data using StatLang syntax
interpreter.run_code('''
data work.employees;
input employee_id name $ department $ salary;
datalines;
1 Alice Engineering 75000
2 Bob Marketing 55000
3 Carol Engineering 80000
4 David Sales 45000
;
run;
''')
# Run statistical analysis
interpreter.run_code('''
proc means data=work.employees;
class department;
var salary;
run;
''')

%LET target = spend;
%LET features = age income;
%macro train_and_evaluate(depvar, indepvars);
proc reg data=work.train;
model &depvar = &indepvars;
output out=work.results p=predicted r=residuals;
run;
proc means data=work.results mean;
var residuals;
run;
%mend;
%train_and_evaluate(&target, &features);
/* Generate synthetic training data */
proc cvision mode=generate_samples out=annotations
n_train=30 n_test=10 img_size=128 seed=42;
run;
/* Fine-tune Faster R-CNN */
proc cvision data=train_annot mode=train_detect
model_name=shape_detector epochs=5 lr=0.005;
image image_path;
run;
/* Score new images with the trained model */
proc cvision data=test_images mode=serve out=detections
model_name=shape_detector confidence=0.5;
image image_path;
run;
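
The serve step above passes `confidence=0.5`. Assuming this option is a minimum score threshold (the usual meaning in detection pipelines, though this README does not confirm it), the filtering it implies can be sketched as follows. The `Detection` record and `filter_detections` helper are hypothetical and do not reflect the actual layout of the `detections` output dataset.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Detection:
    label: str
    score: float                              # model confidence in [0, 1]
    box: Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max) in pixels


def filter_detections(detections, confidence=0.5):
    """Keep detections whose score meets the threshold, highest score first."""
    kept = [d for d in detections if d.score >= confidence]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Made-up raw model outputs for one image.
raw = [
    Detection("circle", 0.91, (10, 12, 50, 52)),
    Detection("square", 0.32, (60, 60, 90, 90)),
    Detection("triangle", 0.55, (20, 70, 48, 110)),
]
kept = filter_detections(raw, confidence=0.5)
for d in kept:
    print(d.label, d.score)
```

Raising the threshold trades recall for precision: fewer spurious boxes, but more missed objects.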
- Install the StatLang kernel:
python -m statlang.kernel install
- Create a new Jupyter notebook (`.ipynb`)
- Select "statlang" as the kernel
- Write StatLang code in cells and execute
- Install the StatLang extension from the marketplace
- Create a new file with the `.statlang` extension
- Write your StatLang code
- Use `Ctrl+Shift+P` > "StatLang: Run File" to execute
# Run StatLang code from file
python -m statlang.cli run example.statlang
# Interactive mode
python -m statlang.cli interactive

ML Project Demo - A comprehensive machine learning workflow:
- Synthetic dataset creation with 30 observations
- PROC UNIVARIATE for distribution analysis
- PROC SURVEYSELECT for train/test splitting (70/30)
- PROC REG with MODEL, OUTPUT, and SCORE statements
- Macro-based reusable analysis functions
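
The 70/30 split in this demo is a simple random sample without replacement. The sketch below shows that idea with the standard library only. `train_test_split_srs` is a hypothetical helper and the integers stand in for the 30 observations, so it will not reproduce the exact rows PROC SURVEYSELECT picks.

```python
import random


def train_test_split_srs(rows, samprate=0.7, seed=42):
    """Split rows into (train, test) by simple random sampling without replacement."""
    rng = random.Random(seed)  # a local generator keeps the split reproducible
    n_train = round(len(rows) * samprate)
    train_idx = set(rng.sample(range(len(rows)), n_train))
    train = [row for i, row in enumerate(rows) if i in train_idx]
    test = [row for i, row in enumerate(rows) if i not in train_idx]
    return train, test


observations = list(range(1, 31))  # stand-in for a 30-row dataset
train, test = train_test_split_srs(observations, samprate=0.7)
print(len(train), len(test))
```

Fixing the seed matters because the regression is fitted on the training rows only. A different split would change the fitted coefficients and the scored test set.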
Object Detection Pipeline - End-to-end computer vision:
- Synthetic shape data generation with bounding-box annotations
- Faster R-CNN fine-tuning with PROC CVISION
- Model store persistence and serving
- Composable `%MACRO` pipeline with `%LET`-driven configuration
StatLang Walkthrough - Complete feature demonstration:
- All statistical procedures with examples
- Macro system demonstrations
- Format system usage
- Advanced data manipulation techniques
StatLang/
├── stat_lang/                      # Core Python package
│   ├── __init__.py
│   ├── interpreter.py              # Main interpreter
│   ├── cli.py                      # Command line interface
│   ├── pipeline.py                 # End-to-end pipeline runner
│   ├── kernel/                     # Jupyter kernel implementation
│   │   ├── statlang_kernel.py
│   │   └── install.py
│   ├── parser/                     # Syntax parsers
│   │   ├── data_step_parser.py     # DATA step (MERGE, ARRAY, DO, etc.)
│   │   ├── proc_parser.py          # Generic PROC option scanner
│   │   └── macro_parser.py
│   ├── procs/                      # 38+ procedure implementations
│   │   ├── proc_means.py           # Statistical procs
│   │   ├── proc_reg.py
│   │   ├── proc_glm.py
│   │   ├── proc_dnn.py             # Deep learning procs
│   │   ├── proc_cvision.py         # Computer vision / object detection
│   │   ├── proc_export.py          # Data management procs
│   │   └── ...
│   └── utils/
│       ├── expression_evaluator.py
│       ├── macro_processor.py      # Macro engine
│       ├── model_store.py          # In-memory + pickle model store
│       ├── data_utils.py
│       └── libname_manager.py
├── tests/                          # Test suite (55+ tests)
├── examples/                       # Example notebooks & scripts
├── vscode-extension/               # VS Code extension
├── media/                          # Logo and icons
├── pyproject.toml                  # Package config & dependencies
└── README.md
git clone https://github.com/Stryve-Analytics/StatLang.git
cd StatLang
pip install -e ".[dev]"

# Run the full test suite
pytest
# With verbose output
pytest -v --tb=short

# Lint
ruff check stat_lang tests --select E,F,I --ignore E501
# Type check
mypy stat_lang tests

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
- Additional statistical procedures
- Macro functionality enhancements
- Performance optimisations
- VS Code extension features
- Documentation and examples
MIT License - see LICENSE for details.
