Clean and report on messy sales data · Total: 100 points · Passing: 60
Read the full assignment on the HYF Data Track: Assignment: MessyCorp Pandas
Work through the files in this order:
| Step | File | Tasks |
|---|---|---|
| 1 | src/ingest.py |
Task 1: download inputs from Azure |
| 2 | src/clean.py |
Task 2: explore + Task 3: clean sales |
| 3 | src/transform.py |
Task 4: join customers, add is_high_value |
| 4 | src/report.py |
Task 5: build report tables + Task 6: write outputs |
| 5 | src/ingest.py |
Task 7 (extra credit): upload results to Azure |
| 6 | main.py |
Set GITHUB_USERNAME, then run the full pipeline |
| 7 | AI_ASSIST.md |
Task 8: fill in before submitting |
Open each file and read the docstrings and TODO comments — they explain exactly what to implement.
.
├── sample_data/
│ ├── messy_sales.csv # fallback if Azure is unavailable — copy to data/ manually
│ └── messy_customers.csv
├── src/
│ ├── ingest.py # Tasks 1 + 7 — Azure download and upload
│ ├── clean.py # Tasks 2 + 3 — explore and clean sales data
│ ├── transform.py # Task 4 — join customers, add is_high_value
│ └── report.py # Tasks 5 + 6 — build tables and write outputs
├── main.py # Pipeline runner — set GITHUB_USERNAME for Task 7
├── AI_ASSIST.md # Task 8 — fill in before submitting
├── .gitignore # data/ and output/ are excluded — generated at runtime
└── .hyf/
└── test.sh # auto-grader — read this to see exactly what is checked
Files the pipeline generates at runtime (gitignored):
data/— raw CSVs downloaded from Azure in Task 1output/— report CSVs, Parquet, and chart written in Task 6
pip install pandas azure-identity azure-storage-blob matplotlib pyarrowLog in to Azure (reuses your Week 2 session):
az loginIf Azure is unavailable (login issues, no network): copy the files from
sample_data/into adata/folder at the repo root, then comment out thedownload_inputs(DATA_DIR)call inmain.py. You can complete Tasks 2–6 without Azure access and return to Tasks 1 and 7 once your session is working.
Edit GITHUB_USERNAME in main.py before running Task 7, then:
python main.pyRun the same grader the auto-grader runs on every PR push:
bash .hyf/test.sh
cat .hyf/score.jsonTasks 2–6 are the core of this assignment and are enough to pass. Tasks 7 and the code quality checks are extra credit.
| Score | What the grader checks |
|---|---|
| 14 | Stubs committed: all five function names present, Azure imports, data/ in .gitignore |
| ~24 | Task 2: .info(), .describe(), .isna().sum(), .head() all called |
| ~44 | Task 3: vectorized string cleaning, pd.to_numeric, pd.to_datetime, row filters, drop_duplicates on transaction_id |
| ~59 | Task 4: email normalisation, how="inner" merge, vectorised is_high_value (no loops) |
| ~79 | Task 5: named aggregations (total_revenue=, order_count=), isocalendar().week, ("customer_name", "first") |
| ~89 | Task 6: all three output files written with index=False, chart saved with savefig |
| ~94 | (extra credit) Task 7: upload_outputs uses assert + len() to verify the Azure round-trip |
| 100 | (extra credit) Code quality: Path(...) constructor and logging.info/warning/error calls used in src/ |
- Create a branch
week4/your-name. - Commit your work.
- Push and open a Pull Request.