feat: add image compression and upload pipeline for org logos #102

aayank13 wants to merge 4 commits into
Conversation
@aayank13 is attempting to deploy a commit to Ketan's Personal Team on Vercel. A member of the Team first needs to authorize it.
Warning: Rate limit exceeded

⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered. We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.
| Cohort / File(s) | Summary |
|---|---|
| Ignore & package scripts<br>`.gitignore`, `package.json` | Added ignore patterns for generated `images/*` (`*.webp`, `*.png`, `*.jpg`). Added `gsoc:images` script and updated `gsoc:sync` to invoke the image processing step; new devDependencies: `@aws-sdk/client-s3`, `@smithy/node-http-handler`, `sharp`. |
| Image processing utilities<br>`scripts/lib/image-processor.ts` | New utilities: `downloadImage` (retries, timeout), `compressToWebP`, `processAndSaveLocally`, `sleep`, and interfaces `CompressOptions`/`ProcessResult`. Handles local saving and compression to WebP. |
| R2 client<br>`scripts/lib/r2-client.ts` | New Cloudflare R2 S3-compatible client: lazy `S3Client` creation, `uploadToR2` (`PutObjectCommand`) and `getR2PublicUrl`; reads required R2 env vars and throws on missing config. |
| Orchestration script<br>`scripts/process-org-images.ts` | New CLI that reads raw org JSON, filters orgs, processes images (download, compress, save), optionally uploads to R2, updates per-org JSON with `img_r2_url`/`logo_r2_url`; supports `--year`, `--dry-run`, `--local-only`, per-item delays, retries, logging and aggregated results. |
| Org transform tweak<br>`scripts/transform-year-organizations.ts` | Minor formatting change and behavioral change: newly created org objects now set `img_r2_url` to an empty string (`""`) instead of defaulting to `raw.logo_url`. |
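To make the R2 client row concrete, here is a hedged sketch of what `uploadToR2` and `getR2PublicUrl` plausibly look like; the three-argument signature follows the sequence diagram below, while the helper declarations and the `R2_BUCKET_NAME` env var name are assumptions.

```ts
// Sketch only: signatures inferred from the walkthrough, not from the diff.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

declare function getClient(): S3Client;               // lazy client, assumed
declare function getEnvOrThrow(name: string): string; // assumed env helper

export async function uploadToR2(key: string, body: Buffer, contentType: string): Promise<void> {
  // PutObjectCommand is the S3-compatible upload call that R2 accepts.
  await getClient().send(
    new PutObjectCommand({
      Bucket: getEnvOrThrow("R2_BUCKET_NAME"), // env var name assumed
      Key: key,
      Body: body,
      ContentType: contentType,
    }),
  );
}

export function getR2PublicUrl(key: string): string {
  // e.g. "https://<public-host>/<year>/<slug>.webp"
  return `${getEnvOrThrow("R2_PUBLIC_URL")}/${key}`;
}
```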
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Script as process-org-images.ts
    participant FS as File System
    participant Processor as image-processor.ts
    participant R2Client as r2-client.ts
    participant CloudflareR2 as Cloudflare R2

    Script->>FS: Read raw org JSON & per-org JSON files
    loop for each org with logo_url and not already on R2
        Script->>Processor: processAndSaveLocally(logo_url, outputDir, slug, options)
        activate Processor
        Processor->>Processor: downloadImage (retries, timeout)
        Processor->>Processor: compressToWebP (sharp)
        Processor->>FS: Ensure images/<YEAR>/ and save slug.webp
        Processor-->>Script: Return local image path + sizes
        deactivate Processor
        alt not --local-only
            Script->>R2Client: uploadToR2(key, Buffer, contentType)
            activate R2Client
            R2Client->>CloudflareR2: PutObjectCommand (S3 API)
            CloudflareR2-->>R2Client: Success
            R2Client-->>Script: Return public R2 URL
            deactivate R2Client
            Script->>FS: Update per-org JSON with img_r2_url/logo_r2_url
        end
    end
    Script->>Script: Log summary (processed, skipped, failed)
```
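In code form, the loop in the diagram reduces to roughly the following; the function names come from the walkthrough, while the return shape and the JSON-update step are illustrative assumptions.

```ts
// Illustrative orchestration sketch; not the actual script.
import fs from "fs";
import { processAndSaveLocally } from "./lib/image-processor";
import { uploadToR2, getR2PublicUrl } from "./lib/r2-client";

interface RawOrg {
  slug: string;
  logo_url?: string;
}

async function run(orgs: RawOrg[], year: string, localOnly: boolean): Promise<void> {
  for (const org of orgs) {
    if (!org.logo_url) continue; // nothing to download for this org

    // Download, compress to WebP, and save under images/<year>/<slug>.webp.
    const result = await processAndSaveLocally(org.logo_url, `images/${year}`, org.slug, {});

    if (!localOnly) {
      const key = `${year}/${org.slug}.webp`; // year-prefixed remote key
      await uploadToR2(key, fs.readFileSync(result.outputPath), "image/webp");
      // Persisting the public URL into the per-org JSON is omitted here.
      console.log(`Uploaded ${org.slug} -> ${getR2PublicUrl(key)}`);
    }
  }
}
```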
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Poem
🐰 I nibble bytes and hop with glee,
I fetch each logo from the sea.
I squeeze to webp and tuck it tight,
To R2 clouds it takes its flight.
A tiny hop — pixels shine bright! 📸✨
🚥 Pre-merge checks
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: add image compression and upload pipeline for org logos' clearly and concisely summarizes the main change in the PR. |
| Description check | ✅ Passed | The PR description includes a comprehensive summary, detailed changes section, and usage examples covering all main aspects of the implementation. |
| Linked Issues check | ✅ Passed | All requirements from issue #96 are met: downloads org logos from GSoC API, compresses to WebP, renames to {slug}.webp format, uploads to Cloudflare R2, and creates images/ folder with tech-stack and 2026 subfolders. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing the image processing pipeline requirements; minor fixes in transform-year-organizations.ts align with the pipeline's needs. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.
Actionable comments posted: 6
🧹 Nitpick comments (3)
package.json (1)

57-57: `sharp ^0.33.0` will not resolve to the current `0.34.x` series.

For packages with a `0.x.y` version, the `^` range only allows patch increments within the same minor (`0.33.*`). The latest published version is `0.34.5`, which includes upstream libvips bug fixes and TypeScript improvements. Consider bumping to `^0.34.0` to pick up those fixes.

💡 Proposed change

```diff
-"sharp": "^0.33.0",
+"sharp": "^0.34.0",
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@package.json` at line 57, the package.json currently pins the sharp dependency to `"sharp": "^0.33.0"`, which will not pick up the 0.34.x series; update the sharp entry to use `"^0.34.0"` so the project can receive the 0.34.* bugfix and TypeScript improvements, then regenerate your lockfile by running your package manager install (npm/yarn/pnpm) to update package-lock.json or yarn.lock accordingly; ensure any CI/cache is refreshed so the new version is used.

scripts/lib/r2-client.ts (1)
26-33: Consider setting a request timeout on the S3 client.

The AWS SDK v3 `S3Client` has no default socket/request timeout; a stalled upload to R2 will hang the script indefinitely. Add `requestHandler` or `maxAttempts` config, or at minimum a `socketTimeout`.

💡 Suggested timeout config

```diff
+import { NodeHttpHandler } from "@smithy/node-http-handler";
+
 _client = new S3Client({
   region: "auto",
   endpoint: `https://${accountId}.r2.cloudflarestorage.com`,
   credentials: {
     accessKeyId: getEnvOrThrow("R2_ACCESS_KEY_ID"),
     secretAccessKey: getEnvOrThrow("R2_SECRET_ACCESS_KEY"),
   },
+  requestHandler: new NodeHttpHandler({
+    requestTimeout: 30_000,
+    socketTimeout: 30_000,
+  }),
 });
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/lib/r2-client.ts` around lines 26-33, the S3Client instantiation assigned to `_client` lacks a request timeout and can hang; update the S3Client config in the `S3Client(...)` call to include a requestHandler with timeouts (e.g., import and use NodeHttpHandler and pass `requestHandler: new NodeHttpHandler({ socketTimeout: <ms>, connectionTimeout: <ms> })`) and/or set `maxAttempts` to a sensible retry limit so R2 uploads won't stall indefinitely; update the `S3Client(...)` call where `_client` is created to include these options.
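For reference, a self-contained sketch of the client with these timeouts wired in; the `R2_ACCOUNT_ID` env var name and the timeout values are assumptions, not taken from the diff.

```ts
// Sketch of a lazily created R2 client with request timeouts (values illustrative).
import { S3Client } from "@aws-sdk/client-s3";
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Assumed helper from this module; see the review thread further down.
declare function getEnvOrThrow(name: string): string;

let _client: S3Client | undefined;

export function getClient(): S3Client {
  if (!_client) {
    _client = new S3Client({
      region: "auto",
      // R2_ACCOUNT_ID is an assumed env var name for the account id.
      endpoint: `https://${getEnvOrThrow("R2_ACCOUNT_ID")}.r2.cloudflarestorage.com`,
      credentials: {
        accessKeyId: getEnvOrThrow("R2_ACCESS_KEY_ID"),
        secretAccessKey: getEnvOrThrow("R2_SECRET_ACCESS_KEY"),
      },
      // Abort stalled connections instead of hanging forever.
      requestHandler: new NodeHttpHandler({
        connectionTimeout: 10_000,
        requestTimeout: 30_000,
      }),
      maxAttempts: 3, // built-in SDK retry budget
    });
  }
  return _client;
}
```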
scripts/lib/image-processor.ts (1)

58-74: `processAndSaveLocally` is exported but its functionality is duplicated inline in `process-org-images.ts`.

`process-org-images.ts` manually calls `downloadImage` → `compressToWebP` → `fs.writeFileSync` (lines 117–121) instead of calling `processAndSaveLocally`. The only difference is the inline size comparison log. Consider extending `processAndSaveLocally` to return both buffers (or sizes) so callers can retain size logging while avoiding the duplication.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/lib/image-processor.ts` around lines 58-74, processAndSaveLocally duplicates logic in process-org-images.ts; change processAndSaveLocally to return both the saved file path and size info (e.g., `{ outputPath: string, originalSize: number, compressedSize: number }` or include the original/compressed Buffers) so callers can log size differences without reimplementing downloadImage/compressToWebP/write logic. Update processAndSaveLocally (the function shown) to capture original buffer size before compression and compressed buffer size after compressToWebP, write the file as now, and return the sizes alongside outputPath; then replace the manual download/compress/write sequence in process-org-images.ts with a call to processAndSaveLocally and use the returned sizes for the existing size comparison log.
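A hedged sketch of the extended helper this nitpick describes; `downloadImage` and `compressToWebP` are assumed to have the signatures shown.

```ts
// Sketch: processAndSaveLocally extended to report sizes (signatures assumed).
import fs from "fs";
import path from "path";

declare function downloadImage(url: string): Promise<Buffer>;
declare function compressToWebP(
  input: Buffer,
  options?: { width?: number; quality?: number },
): Promise<Buffer>;

export interface ProcessResult {
  outputPath: string;
  originalSize: number;   // bytes before compression
  compressedSize: number; // bytes after compression
}

export async function processAndSaveLocally(
  url: string,
  outputDir: string,
  slug: string,
  options: { width?: number; quality?: number } = {},
): Promise<ProcessResult> {
  const original = await downloadImage(url);
  const compressed = await compressToWebP(original, options);
  fs.mkdirSync(outputDir, { recursive: true }); // no-op if the directory exists
  const outputPath = path.join(outputDir, `${slug}.webp`);
  fs.writeFileSync(outputPath, compressed);
  // Callers can now do the size-comparison log themselves.
  return { outputPath, originalSize: original.length, compressedSize: compressed.length };
}
```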
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@package.json`:
- Line 22: The npm script "gsoc:sync" currently invokes
scripts/process-org-images.ts without the --local-only flag, which forces R2
uploads and causes silent per-org failures when R2_* env vars are not set;
update the "gsoc:sync" entry to call scripts/process-org-images.ts --local-only
(so uploads are decoupled and handled by the separate "gsoc:images" script), or
alternatively add a README note that "gsoc:sync" requires R2 credentials (R2_*
env vars) if you want to keep the current behavior; reference the "gsoc:sync"
npm script and scripts/process-org-images.ts and "gsoc:images" to make the
change or documentation clear.
In `@scripts/lib/image-processor.ts`:
- Around line 15-39: The downloadImage function currently uses fetch without a
timeout, so a stalled response will never throw and retries won't trigger;
modify downloadImage to create an AbortController for each fetch attempt, pass
controller.signal into fetch(url, { signal }), start a per-attempt timer (e.g.,
via setTimeout) that calls controller.abort() after a configured per-attempt
timeout, and clear the timer when the response is received or on error; ensure
the abort error is handled like other errors so the loop retries (using existing
lastError, RETRY_DELAY_MS and MAX_RETRIES) and that the controller/timer are
properly cleaned up each attempt to avoid leaks.
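Sketched out, the per-attempt timeout described above could look like this; the constant names mirror the review text, and the timeout value is illustrative.

```ts
// Sketch: fetch with a per-attempt abort timeout and retries (values illustrative).
const MAX_RETRIES = 3;
const RETRY_DELAY_MS = 1_000;
const ATTEMPT_TIMEOUT_MS = 15_000;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function downloadImage(url: string): Promise<Buffer> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), ATTEMPT_TIMEOUT_MS);
    try {
      const res = await fetch(url, { signal: controller.signal });
      if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
      return Buffer.from(await res.arrayBuffer());
    } catch (err) {
      lastError = err; // abort errors retry like any other failure
      if (attempt < MAX_RETRIES) await sleep(RETRY_DELAY_MS);
    } finally {
      clearTimeout(timer); // avoid leaking the timer on success or error
    }
  }
  throw lastError;
}
```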
In `@scripts/process-org-images.ts`:
- Around line 76-88: The skip-and-update logic uses raw.slug directly to build
orgFile so aliased slugs (SLUG_ALIASES) never resolve and R2 URLs aren't
persisted; import or duplicate the SLUG_ALIASES mapping and resolve the
canonical file slug before any filesystem lookup (i.e., compute a resolvedSlug
from SLUG_ALIASES[raw.slug] || raw.slug) and use that when constructing orgFile
(used in the pre-skip check and in updateOrgJson), ensuring both the existence
check and the write/update target the actual JSON filename under ORGS_DIR.
- Line 45: R2_URL_PREFIX is hardcoded which breaks the skip check that uses
currentR2.startsWith(R2_URL_PREFIX); instead derive the prefix from the same
source as r2-client (use the R2_PUBLIC_URL env var or call getR2PublicUrl from
r2-client) so the skip logic matches the actual public URL; update the
declaration of R2_URL_PREFIX to compute its value from process.env.R2_PUBLIC_URL
(or import and call getR2PublicUrl) with the existing literal as a fallback, and
ensure the currentR2.startsWith(...) check uses this computed value.
- Around line 127-132: The R2 upload uses r2Key = `${raw.slug}.webp` which omits
the year and causes cross-year overwrites; update the r2Key construction in the
block that checks LOCAL_ONLY (where uploadToR2 is called) to include the same
year segment used for local saves (e.g., `${year}/${raw.slug}.webp` or whatever
variable holds the YEAR), so the remote key mirrors the local path; ensure any
logging (console.log) and references to r2Url remain unchanged after this
change.
- Around line 150-163: The script currently logs failures but never sets a
non-zero exit code; update the end of the script where failures is inspected
(the block that prints "[FAILURES]" and the LOCAL_ONLY messages) to call
process.exit(1) when failures.length > 0 so CI fails on any upload errors;
ensure you only skip the exit when LOCAL_ONLY is true and uploads were
intentionally not attempted (or always exit non-zero regardless of LOCAL_ONLY if
you prefer the simpler behavior), referencing the failures array and the
existing LOCAL_ONLY/IMAGES_DIR logic to decide when to call process.exit(1).
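Taken together, the four items above amount to changes roughly like the following sketch; all declared names stand in for identifiers that already exist in the script.

```ts
// Sketch combining the four fixes above; placeholders, not the real script.
import path from "path";

declare const SLUG_ALIASES: Record<string, string>; // shared alias map
declare const ORGS_DIR: string;
declare const LOCAL_ONLY: boolean;
declare const failures: Array<{ slug: string; error: string }>;

function plannedFixes(raw: { slug: string }, year: string) {
  // 1. Resolve aliased slugs before any filesystem lookup or write.
  const resolvedSlug = SLUG_ALIASES[raw.slug] || raw.slug;
  const orgFile = path.join(ORGS_DIR, `${resolvedSlug}.json`);

  // 2. Derive the skip-check prefix from the env var instead of a hardcoded literal.
  const R2_URL_PREFIX = process.env.R2_PUBLIC_URL ?? "";

  // 3. Mirror the local year folder in the remote key to avoid cross-year overwrites.
  const r2Key = `${year}/${raw.slug}.webp`;

  return { resolvedSlug, orgFile, R2_URL_PREFIX, r2Key };
}

// 4. At the end of the run: fail CI when uploads were attempted and any failed.
function exitWithStatus(): void {
  if (failures.length > 0 && !LOCAL_ONLY) process.exit(1);
}
```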
---
Nitpick comments:
In `@package.json`:
- Line 57: The package.json currently pins the sharp dependency to "sharp":
"^0.33.0", which will not pick up the 0.34.x series; update the sharp entry to
use "^0.34.0" so the project can receive the 0.34.* bugfix and TypeScript
improvements, then regenerate your lockfile by running your package manager
install (npm/yarn/pnpm) to update package-lock.json or yarn.lock accordingly;
ensure any CI/cache is refreshed so the new version is used.
In `@scripts/lib/image-processor.ts`:
- Around line 58-74: processAndSaveLocally duplicates logic in
process-org-images.ts; change processAndSaveLocally to return both the saved
file path and size info (e.g., { outputPath: string, originalSize: number,
compressedSize: number } or include the original/compressed Buffers) so callers
can log size differences without reimplementing
downloadImage/compressToWebP/write logic. Update processAndSaveLocally (the
function shown) to capture original buffer size before compression and
compressed buffer size after compressToWebP, write the file as now, and return
the sizes alongside outputPath; then replace the manual download/compress/write
sequence in process-org-images.ts with a call to processAndSaveLocally and use
the returned sizes for the existing size comparison log.
In `@scripts/lib/r2-client.ts`:
- Around line 26-33: The S3Client instantiation assigned to _client lacks a
request timeout and can hang; update the S3Client config in the S3Client(...)
call to include a requestHandler with timeouts (e.g., import and use
NodeHttpHandler and pass requestHandler: new NodeHttpHandler({ socketTimeout:
<ms>, connectionTimeout: <ms> })) and/or set maxAttempts to a sensible retry
limit so R2 uploads won't stall indefinitely; update the S3Client(...) call
where _client is created to include these options.
🧹 Nitpick comments (2)
scripts/process-org-images.ts (2)
116-117: `failed` counter is redundant — `failures.length` already tracks the same value.

`failed` is incremented in lockstep with every `failures.push(...)`, so `failed === failures.length` always holds. The summary log can use `failures.length` directly.

♻️ Proposed cleanup

```diff
- let processed = 0;
- let failed = 0;
+ let processed = 0;
  const failures: Array<{ slug: string; error: string }> = [];

  // ...inside catch block:
- failed++;
  failures.push({ slug: raw.slug, error: errorMsg });

  // ...summary:
- console.log(`  Failed: ${failed}`);
+ console.log(`  Failed: ${failures.length}`);
```

Also applies to: 153-153, 161-163
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/process-org-images.ts` around lines 116 - 117, The extra numeric counter failed is redundant because failures.push(...) already tracks failures; remove the failed variable declaration and every failed++ update (where failures.push(...) is called) and change any uses of failed (e.g., in the final summary log) to use failures.length instead so processed remains and failures.length provides the failure count.
113-115: Redundant `existsSync` guard before `mkdirSync`.

`fs.mkdirSync(path, { recursive: true })` is already a no-op when the directory exists. The existence check adds no safety and can be dropped.

♻️ Proposed simplification

```diff
- if (!fs.existsSync(IMAGES_DIR)) {
-   fs.mkdirSync(IMAGES_DIR, { recursive: true });
- }
+ fs.mkdirSync(IMAGES_DIR, { recursive: true });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/process-org-images.ts` around lines 113 - 115, The check using fs.existsSync(IMAGES_DIR) before creating the directory is redundant; remove the if-block and call fs.mkdirSync(IMAGES_DIR, { recursive: true }) directly so the directory is created if missing and is a no-op if it already exists—update the block containing IMAGES_DIR and the fs.mkdirSync call accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@scripts/process-org-images.ts`:
- Around line 50-54: Extract the hardcoded SLUG_ALIASES map into a shared module
and import it into both scripts so aliases stay synchronized; specifically, move
the SLUG_ALIASES constant into a new shared file (e.g., export a const ALIASES)
and replace the inline SLUG_ALIASES usage in process-org-images.ts and
transform-year-organizations.ts with imports from that shared module, ensuring
both files reference the same exported identifier (SLUG_ALIASES or ALIASES) so
updates are centralized.
---
Nitpick comments:
In `@scripts/process-org-images.ts`:
- Around line 116-117: The extra numeric counter failed is redundant because
failures.push(...) already tracks failures; remove the failed variable
declaration and every failed++ update (where failures.push(...) is called) and
change any uses of failed (e.g., in the final summary log) to use
failures.length instead so processed remains and failures.length provides the
failure count.
- Around line 113-115: The check using fs.existsSync(IMAGES_DIR) before creating
the directory is redundant; remove the if-block and call
fs.mkdirSync(IMAGES_DIR, { recursive: true }) directly so the directory is
created if missing and is a no-op if it already exists—update the block
containing IMAGES_DIR and the fs.mkdirSync call accordingly.
@ketankauntia check the above PR
🧹 Nitpick comments (1)
scripts/process-org-images.ts (1)
121-122: `failed` counter is redundant — it always equals `failures.length`.

`failed++` and `failures.push(...)` are always called together, so `failed` carries no independent information. `failures.length` can be used directly everywhere.

♻️ Proposed refactor

```diff
  let processed = 0;
- let failed = 0;
  const failures: Array<{ slug: string; error: string }> = [];

- failed++;

- console.log(`  Failed: ${failed}`);
+ console.log(`  Failed: ${failures.length}`);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/process-org-images.ts` around lines 121 - 122, Remove the redundant `failed` counter: delete the `let failed = 0;` declaration and remove every `failed++` update where you currently also call `failures.push(...)`; replace any uses of `failed` (e.g., in logging, metrics, or return values) with `failures.length` so the code relies on the single source of truth `failures` instead of duplicating state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@scripts/process-org-images.ts`:
- Around line 121-122: Remove the redundant `failed` counter: delete the `let
failed = 0;` declaration and remove every `failed++` update where you currently
also call `failures.push(...)`; replace any uses of `failed` (e.g., in logging,
metrics, or return values) with `failures.length` so the code relies on the
single source of truth `failures` instead of duplicating state.
I'm traveling. I'll review the PR on the 25th or 26th, positively.
Sure :)
The latest updates on your projects. Learn more about Vercel for GitHub.
```ts
import path from "path";
import sharp from "sharp";

const MAX_RETRIES = 3;
```

I would suggest moving constants to a proper constants file for better handling.
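A possible shape for that, assuming a new `scripts/lib/constants.ts`; the file name and the non-retry values are suggestions, not part of the PR.

```ts
// scripts/lib/constants.ts: hypothetical shared constants module.
export const MAX_RETRIES = 3;               // value from the diff above
export const RETRY_DELAY_MS = 1_000;        // assumed
export const DOWNLOAD_TIMEOUT_MS = 15_000;  // assumed

// Default compression settings (values appear in the diff below).
export const DEFAULT_WIDTH = 400;
export const DEFAULT_QUALITY = 80;
```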
```ts
  input: Buffer,
  options: CompressOptions = {},
): Promise<Buffer> {
  const { width = 400, quality = 80 } = options;
```

How did you come up with this specific size? What did you take into account? And how much size compression will we see on average?
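For context, this is roughly what `compressToWebP` with those defaults looks like in sharp; the `withoutEnlargement` flag is an assumption about the implementation, and actual savings depend heavily on the source format (large PNGs typically shrink the most).

```ts
import sharp from "sharp";

interface CompressOptions {
  width?: number;
  quality?: number;
}

async function compressToWebP(input: Buffer, options: CompressOptions = {}): Promise<Buffer> {
  const { width = 400, quality = 80 } = options;
  return sharp(input)
    .resize({ width, withoutEnlargement: true }) // never upscale small logos
    .webp({ quality }) // 80 is sharp's own default WebP quality
    .toBuffer();
}
```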
```ts
 * @param options - Optional compression settings.
 * @returns The output path and original/compressed sizes in bytes.
 */
export async function processAndSaveLocally(
```

Okay, one thing that I noticed: a better way would be to export these functions to their own separate files, since they each do a specific task, and then call them from one final file for better readability and structure.
```ts
import { NodeHttpHandler } from "@smithy/node-http-handler";

/**
 * Reads a required environment variable or throws an error if it is not set.
```

This code can be generalized, separated out into a new file, and extended to the rest of the codebase. Why keep it just for R2?
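A sketch of that generalization, assuming a new shared module (file name hypothetical).

```ts
// scripts/lib/env.ts: hypothetical shared helper, usable across the codebase.
export function getEnvOrThrow(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage from any script, not just the R2 client (env var name illustrative):
// const bucket = getEnvOrThrow("R2_BUCKET_NAME");
```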
```ts
 * in our local dataset (new-api-details/organizations/).
 * Shared by both transform-year-organizations.ts and process-org-images.ts.
 */
export const SLUG_ALIASES: Record<string, string> = {
```

What is the use of this file? Please explain. Do we need this specifically, or did you just cursor it? (There is no harm in cursoring, just curious.)
```ts
  process.exit(1);
}

const ROOT = process.cwd();
```

Same for these: export them to a constants file instead.
```ts
  category: raw.categories?.[0] || "Other",
  description: raw.description || raw.tagline || "",
  image_url: raw.logo_url || "",
  img_r2_url: raw.logo_url || "",
```
```ts
 * GSoC Org Image Processing Pipeline
 *
 * Downloads org logos from the GSoC API, compresses to WebP,
 * saves locally, and optionally uploads to Cloudflare R2.
```

Okay, just a quick question: why not delete the .png and .webp files after uploading to R2?

Also, it would be better to introduce a field named `gsoc_img_url: ""` that points to the image hosted on the GSoC website, so that we have a proper link to the image from which we can poll and check. This way we do not have to run through the entire GSoC API; instead we check internally only the list of new orgs that appeared in the latest GSoC edition and work only on those images.
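A sketch of what those two suggestions could look like in the upload path; the `gsoc_img_url` field name comes from the comment above, and everything else is illustrative.

```ts
import fs from "fs";

declare function uploadToR2(key: string, body: Buffer, contentType: string): Promise<void>;

interface OrgRecord {
  img_r2_url: string;
  gsoc_img_url: string; // suggested: the original GSoC-hosted URL, kept for polling
}

async function uploadAndCleanUp(
  org: OrgRecord,
  localPath: string,
  key: string,
  publicUrl: string,
  gsocUrl: string,
): Promise<void> {
  await uploadToR2(key, fs.readFileSync(localPath), "image/webp");
  org.img_r2_url = publicUrl;
  org.gsoc_img_url = gsocUrl; // record the upstream source for later diff checks
  fs.unlinkSync(localPath);   // drop the local copy once R2 has it
}
```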
Summary

Adds an image processing pipeline for GSoC organization logos that:

- downloads org logos from the GSoC API
- compresses them to WebP with `sharp`
- renames them to `{slug}.webp`
- uploads them to Cloudflare R2

Also creates the `images/` folder structure with `tech-stack/` and `2026/` subfolders.

Closes #96

Changes

New Files

- `scripts/lib/image-processor.ts`: download and WebP compression utilities
- `scripts/lib/r2-client.ts`: Cloudflare R2 upload client
- `scripts/process-org-images.ts`: orchestration CLI with `--dry-run` and `--local-only` modes

Modified Files

- `package.json`: new devDependencies `sharp` and `@aws-sdk/client-s3`; new `gsoc:images` script; updated `gsoc:sync`
- `.gitignore`: ignores generated `images/` output
- `scripts/transform-year-organizations.ts`: empty `img_r2_url` for new orgs (was incorrectly set to raw API URL)

Usage

Run the `gsoc:images` script to process logos; it supports `--year`, `--dry-run`, and `--local-only` flags. `gsoc:sync` now invokes the image processing step as part of the sync.