This repository contains the source code, experiment logs, and result analysis for our paper "FedAugment: Table Augmentation Search over Decentralized Data Repositories".
The repository is structured as follows:
fedaugment/
├── analysis # Jupyter notebooks with result analysis and plotting code
├── experiments # Python and Bash scripts with experiment configurations
├── logs # results of our experimental evaluation
├── scripts # utility scripts for data processing and evaluation
├── src/fedaugment # main Python package with our implementation
└── tests # unit tests for part of the codebase

The following diagram illustrates the overall architecture of the FedAugment workflow:
+----------------------+
| Raw Table Data |
| (CSV/Parquet files) |
+----------+-----------+
|
v
+---------------------------------------------------------------------------------+
| 1. EMBEDDING GENERATION |
| +------------+ +------------+ +------------+ +------------+ |
| | View 1 | | View 2 | | View 3 | ... | View N | |
| | mpnet + | | gtr_t5 + | | gte_base + | | qwen3_8b + | |
| | dj_adpt | | dj_adpt | | dj_adpt | | dj_adpt | |
| +------+-----+ +------+-----+ +------+-----+ +------+-----+ |
| | | | | |
| v v v v |
| [384-dim] [768-dim] [768-dim] [4096-dim] |
| embeddings embeddings embeddings embeddings |
+-------------+----------------+----------------+--------------------+------------+
| | | |
+----------------+---------+------+--------------------+
|
v
+---------------------------------------------------------------------------------+
| 2. PROJECTION MODEL TRAINING |
| |
| Training Data Projection Models: |
| +---------------------+ - CL (Contrastive Learning) --- Neural network |
| | Curated subset | - LA2M (Local Isometry) ------- Clustering-based |
| | (FFT/Grid/Random) | - Vec2Vec --------------------- GAN-based |
| +---------------------+ - Procrustes ------------------ Orthogonal align |
| |
| Output: Learned transformations that map all views to a common vector space |
+----------------------------------------+----------------------------------------+
|
v
+---------------------------------------------------------------------------------+
| 3. ALIGNED EMBEDDING SPACE |
| |
| View 1 View 2 View 3 ... View N |
| | | | | |
| +---------+---------+----------------+ |
| | |
| v |
| +-------------------+ |
| | Common | |
| | Embedding Space | |
| +---------+---------+ |
| | |
| v |
| +-------------------+ |
| | HNSW Index | < Fast approximate nearest neighbor |
| +-------------------+ |
+----------------------------------------+----------------------------------------+
|
v
+---------------------------------------------------------------------------------+
| 4. TABLE AUGMENTATION TASKS |
| |
| +-----------------------------+ +-----------------------------+ |
| | JOIN DISCOVERY | | UNION DISCOVERY | |
| | | | | |
| | Query: Column A | | Query: Table X | |
| | v | | v | |
| | Find columns that can | | Find tables with | |
| | be joined with A | | compatible schemas | |
| | v | | v | |
| | Metrics: P@k, R@k, MAP | | Metrics: P@k, R@k, MAP | |
| +-----------------------------+ +-----------------------------+ |
+---------------------------------------------------------------------------------+
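Steps 2 and 3 of the diagram can be illustrated with a small NumPy sketch of the Procrustes option: learn an orthogonal map from one view onto another, then search the aligned space. This is a toy under stated assumptions, not the repository's implementation. Both synthetic views share the same dimension (the real views range from 384 to 4096 dimensions, which this sketch does not handle), exact brute-force cosine search stands in for the HNSW index, and the function names `fit_procrustes` and `top_k_cosine` are hypothetical.

```python
import numpy as np


def fit_procrustes(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix R that minimizes ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def top_k_cosine(queries: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Exact cosine top-k search (a brute-force stand-in for an ANN index)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    similarities = q @ c.T
    return np.argsort(-similarities, axis=1)[:, :k]


# Synthetic data: view_b is an exact rotation of view_a (assumption for the demo).
rng = np.random.default_rng(0)
view_a = rng.normal(size=(200, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
view_b = view_a @ rotation

# Learn the map from view_b into view_a's space, then search there.
R = fit_procrustes(view_b, view_a)
aligned_b = view_b @ R
neighbors = top_k_cosine(aligned_b[:5], view_a, k=3)
print(neighbors[:, 0])  # index of the nearest view_a column for each query
```

Because the demo data is noise-free, the learned map recovers the inverse rotation and each aligned query retrieves its own counterpart first. With real embeddings from different models the alignment is only approximate.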
We recommend using uv to manage fedaugment and its dependencies.
Installing the project is as simple as:
uv sync

We offer the following optional dependency groups (use uv sync --extra <group>):
- PyTorch variants (mutually exclusive):
  - cpu: Force a CPU installation of PyTorch
  - cu126: Force a CUDA 12.6 installation of PyTorch
  - cu128: Force a CUDA 12.8 installation of PyTorch
- experiments: Installs additional dependencies for running experiments
- flash-attn: Enables more efficient attention operators in PyTorch
Note that the default PyTorch version depends on your operating system (CPU for Windows and Mac, CUDA 12.x for Linux).
We use the following datasets in our experiments:
The project expects datasets in the following structure (symlinked or stored at data/):
data/
│
├── datasets/  # Raw tabular data
│   │
│   └── {dataset_name}/  # e.g., webtable, omnimatch_city_test, omnimatch_culture_test, santos_small, freyja
│       │
│       ├── datasets/  # Full original dataset
│       │   ├── pq/  # Parquet files
│       │   │   └── {table_id}.pq
│       │   └── csv/  # CSV files
│       │       └── {table_id}.csv
│       │
│       ├── queries/  # Query tables and ground truth (query tables are optional; if absent, queries use tables from datasets/)
│       │   ├── pq/  # Query tables (parquet)
│       │   ├── csv/  # Query tables (csv)
│       │   ├── join_queries.csv  # Join query list (only for join tasks)
│       │   ├── join_ground_truth.csv  # Join ground truth pairs (only for join tasks)
│       │   ├── union_queries.csv  # Union query list (only for union tasks)
│       │   └── union_ground_truth.csv  # Union ground truth pairs (only for union tasks)
│       │
│       ├── split/  # Train/test/val splits (webtable only)
│       │   ├── train/
│       │   ├── test/
│       │   └── val/
│       │
│       └── sample-{pct}/  # Sampled subsets from data curation (webtable only)
│           └── {curation_method}/  # e.g., fft_cos-mpnet-dj_adpt-k=63049
│               └── {table_id}.pq
│
│
└── embeddings/  # Generated embeddings
    │
    └── {dataset_name}/  # Mirrors datasets/ structure
        │
        ├── datasets/  # Dataset embeddings (full corpus)
        │   │
        │   └── {model}-{strategy}.fa/  # Feature archive per embedding pipeline
        │       ├── embeddings.npy  # (N, D) float32 array
        │       ├── column_ids.npy  # (N,) string array: "{table_id}::{column_name}"
        │       └── metadata.json  # Pipeline metadata
        │
        ├── queries/  # Query embeddings (if there are no dedicated queries, we symlink to datasets/)
        │   │
        │   └── {model}-{strategy}.fa/  # Feature archive per embedding pipeline
        │       ├── embeddings.npy  # (N, D) float32 array
        │       ├── column_ids.npy  # (N,) string array: "{table_id}::{column_name}"
        │       └── metadata.json  # Pipeline metadata
        │
        ├── sample-{pct}/  # Embeddings for sampled subsets (percentages: 001, 005, 050)
        │   │
        │   └── {curation_method}/  # Curation strategy
        │       └── {model}-{strategy}.fa/
        │           ├── embeddings.npy
        │           ├── column_ids.npy
        │           └── metadata.json
        │
        └── split/  # Embeddings for splits (webtable only)
            ├── train/
            │   └── {model}-{strategy}.fa/
            ├── test/
            │   └── {model}-{strategy}.fa/
            └── val/
                └── {model}-{strategy}.fa/

Query Files:
- join_queries.csv: List of join query columns
  query_table,query_column
  csvData1549285__2.csv,AST%
  csvData1549285__2.csv,BLK%
- join_ground_truth.csv: Ground truth join pairs
  query_table,candidate_table,query_column,candidate_column
  csvData1549285__2.csv,csvData20409520__4.csv,DRtg,DRtg
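To show how a ground truth file in this format could feed the P@k and R@k metrics, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not the repository's evaluation code: the second ground truth row, the ranked candidate list, and the function names are hypothetical, and precision is assumed to be divided by k.

```python
import csv
import io
from collections import defaultdict

# First data row is taken from the example above; the second row is hypothetical.
GROUND_TRUTH_CSV = """query_table,candidate_table,query_column,candidate_column
csvData1549285__2.csv,csvData20409520__4.csv,DRtg,DRtg
csvData1549285__2.csv,hypothetical_table.csv,DRtg,DefRtg
"""


def load_ground_truth(text: str) -> dict:
    """Map each (query_table, query_column) to its set of relevant candidates."""
    relevant = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        query = (row["query_table"], row["query_column"])
        relevant[query].add((row["candidate_table"], row["candidate_column"]))
    return relevant


def precision_recall_at_k(ranked: list, relevant: set, k: int) -> tuple:
    """Compute P@k and R@k for one query from a ranked candidate list."""
    hits = sum(1 for candidate in ranked[:k] if candidate in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = load_ground_truth(GROUND_TRUTH_CSV)
query = ("csvData1549285__2.csv", "DRtg")

# Hypothetical search result, best match first.
ranked = [
    ("csvData20409520__4.csv", "DRtg"),
    ("unrelated_table.csv", "Team"),
    ("hypothetical_table.csv", "DefRtg"),
    ("unrelated_table.csv", "Year"),
]

p_at_2, r_at_2 = precision_recall_at_k(ranked, relevant[query], k=2)
print(p_at_2, r_at_2)
```

Keying the ground truth by (table, column) pairs keeps columns with the same name in different tables distinct.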
Embedding Files (stored in .fa/ directories):
- embeddings.npy: NumPy array of shape (N_columns, embedding_dim), dtype float32
- column_ids.npy: NumPy array of shape (N_columns,), dtype StringDType(), format "{table_id}::{column_name}"
- metadata.json: Pipeline metadata including model, shape, and source information
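A feature archive with these three files could be read as sketched below. This is a hedged example rather than the package's loader: `load_feature_archive` and `split_column_id` are hypothetical helpers, the demo archive stores IDs as fixed-width Unicode strings instead of StringDType (real archives may need different loading options), and the metadata key shown is made up for the demo.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def load_feature_archive(archive_dir):
    """Load embeddings, column IDs, and metadata from one .fa/ directory."""
    archive_dir = Path(archive_dir)
    embeddings = np.load(archive_dir / "embeddings.npy")
    column_ids = np.load(archive_dir / "column_ids.npy")
    metadata = json.loads((archive_dir / "metadata.json").read_text())
    if embeddings.shape[0] != column_ids.shape[0]:
        raise ValueError("embeddings and column_ids disagree on the number of columns")
    return embeddings, column_ids, metadata


def split_column_id(column_id):
    """Split "{table_id}::{column_name}" at the first "::" separator."""
    table_id, column_name = str(column_id).split("::", 1)
    return table_id, column_name


# Build a tiny throwaway archive so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "mpnet-dj_adpt.fa"
    archive.mkdir()
    np.save(archive / "embeddings.npy", np.zeros((2, 4), dtype=np.float32))
    np.save(archive / "column_ids.npy", np.array(["t1.csv::AST%", "t1.csv::BLK%"]))
    (archive / "metadata.json").write_text(json.dumps({"model": "mpnet"}))

    embeddings, column_ids, metadata = load_feature_archive(archive)
    first = split_column_id(column_ids[0])

print(embeddings.shape, first, metadata)
```

Splitting at the first `::` assumes table IDs never contain that separator; column names that contain it are kept intact.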
- Dataset names: webtable, omnimatch_city_test, omnimatch_culture_test, santos_small, freyja
- Embedding pipelines: {model}-{strategy}
  - Models: mpnet, distilroberta, gte_base, gtr_t5, mini_l12, mini_l6, etc.
  - Strategies: dj_orig (DeepJoin original), dj_adpt (DeepJoin adapted), etc.
- Curation methods: {algorithm}[_{metric}]-{model}-{strategy}-k={n_columns}[-pca={variance}]
  - Algorithms: fft, grid, random
  - Metrics: cos (cosine), euc (Euclidean)
  - Example: fft_cos-mpnet-dj_adpt-k=63049-pca=0.9
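The curation method pattern above can be parsed with a regular expression, which may help when iterating over sample-{pct}/ directories. The sketch below is illustrative only: `parse_curation_method` is a hypothetical helper, and it assumes that model and strategy names never contain a hyphen, which holds for the names listed here.

```python
import re

# {algorithm}[_{metric}]-{model}-{strategy}-k={n_columns}[-pca={variance}]
CURATION_PATTERN = re.compile(
    r"^(?P<algorithm>fft|grid|random)"
    r"(?:_(?P<metric>cos|euc))?"
    r"-(?P<model>[^-]+)-(?P<strategy>[^-]+)"
    r"-k=(?P<n_columns>\d+)"
    r"(?:-pca=(?P<variance>\d*\.?\d+))?$"
)


def parse_curation_method(name: str) -> dict:
    """Split a curation method name into its fields; optional fields may be None."""
    match = CURATION_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Not a recognized curation method name: {name!r}")
    fields = match.groupdict()
    fields["n_columns"] = int(fields["n_columns"])
    if fields["variance"] is not None:
        fields["variance"] = float(fields["variance"])
    return fields


parsed = parse_curation_method("fft_cos-mpnet-dj_adpt-k=63049-pca=0.9")
print(parsed)
```

Names without a metric or PCA suffix, such as `random-gte_base-dj_orig-k=100`, parse with those fields set to `None`.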
To reproduce the experiments in our paper, first ensure you have downloaded the required datasets and placed them in the data/datasets/ directory as described above.
If you prefer to store the datasets elsewhere, create a symbolic link named data in the project root pointing to your dataset directory. For example:
ln -s /your/local/storage data

After setting up the datasets, you can run all experiments using:
bash experiments/run_all.sh

Hardware requirements:
- 500+ GB RAM
- 1.5+ TB disk space
- NVIDIA A100 80GB GPU or better
TBD