Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -49,3 +49,4 @@ build-iPhoneSimulator/
# unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
.rvmrc
.DS_Store
.vscode/
3 changes: 3 additions & 0 deletions .rspec
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
--require spec_helper
--format documentation
--color
7 changes: 7 additions & 0 deletions Gemfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
source "https://rubygems.org"

gem "nokolexbor"

group :development, :test do
gem "rspec", "~> 3.12"
end
52 changes: 52 additions & 0 deletions Gemfile.lock
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
GEM
remote: https://rubygems.org/
specs:
diff-lcs (1.6.2)
nokolexbor (0.7.0)
nokolexbor (0.7.0-aarch64-linux)
nokolexbor (0.7.0-arm64-darwin)
nokolexbor (0.7.0-x86_64-darwin)
nokolexbor (0.7.0-x86_64-linux)
rspec (3.13.2)
rspec-core (~> 3.13.0)
rspec-expectations (~> 3.13.0)
rspec-mocks (~> 3.13.0)
rspec-core (3.13.6)
rspec-support (~> 3.13.0)
rspec-expectations (3.13.5)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-mocks (3.13.8)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-support (3.13.7)

PLATFORMS
aarch64-linux-gnu
aarch64-linux-musl
arm-linux-gnu
arm-linux-musl
arm64-darwin
x86_64-darwin
x86_64-linux-gnu
x86_64-linux-musl

DEPENDENCIES
nokolexbor
rspec (~> 3.12)

CHECKSUMS
diff-lcs (1.6.2) sha256=9ae0d2cba7d4df3075fe8cd8602a8604993efc0dfa934cff568969efb1909962
nokolexbor (0.7.0) sha256=a6df669d9280bfe7f5f334a1734c96b80d8d54ff9b18cea807dbe5651be45dd7
nokolexbor (0.7.0-aarch64-linux) sha256=1729e1d5e5fb3a5f1328453f4ee884a8c53de3a94ff315cacf518acf8b4e059f
nokolexbor (0.7.0-arm64-darwin) sha256=874c1cae2c2658d0cc4018f6569540753ff03b79bacb1b0d1380a8230a0a14ea
nokolexbor (0.7.0-x86_64-darwin) sha256=5de1b440996839cf82f2f35c79b4e1eee28100a263cdb9e67fa28c016c0526fe
nokolexbor (0.7.0-x86_64-linux) sha256=6348178e41233e67e0f533f84b0b1974b187fe137616541f1453bb7c0c16baf6
rspec (3.13.2) sha256=206284a08ad798e61f86d7ca3e376718d52c0bc944626b2349266f239f820587
rspec-core (3.13.6) sha256=a8823c6411667b60a8bca135364351dda34cd55e44ff94c4be4633b37d828b2d
rspec-expectations (3.13.5) sha256=33a4d3a1d95060aea4c94e9f237030a8f9eae5615e9bd85718fe3a09e4b58836
rspec-mocks (3.13.8) sha256=086ad3d3d17533f4237643de0b5c42f04b66348c28bf6b9c2d3f4a3b01af1d47
rspec-support (3.13.7) sha256=0640e5570872aafefd79867901deeeeb40b0c9875a36b983d85f54fb7381c47c

BUNDLED WITH
4.0.10
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@

# Extract Van Gogh Paintings Code Challenge

Goal is to extract a list of Van Gogh paintings from the attached Google search results page.
Expand All @@ -23,6 +24,6 @@ Parse directly the HTML result page ([html file]) in this repository. No extra H

Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed).

Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)
Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)

The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.
68 changes: 68 additions & 0 deletions RUN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
# Run Guide

This file contains setup, run, lint, and test commands for this script.

## Prerequisites

- Ruby
- Bundler

## Setup

Install dependencies:

```bash
bundle install
```

## Run Extractor

Run extractor on the provided fixture:

```bash
bin/extract files/van-gogh-paintings.html
```

Save output:

```bash
bin/extract files/van-gogh-paintings.html > /tmp/artworks.json
```

Programmatic use:

```ruby
require_relative "lib/extractor"

result = Extractor.call("files/van-gogh-paintings.html")
puts result.first
```

## Run Tests

Run all specs:

```bash
bundle exec rspec
```

Run one spec file:

```bash
bundle exec rspec spec/extractor_spec.rb
```

## Lint

Run syntax lint checks:

```bash
bundle exec bin/lint
```

`bin/lint` runs `ruby -wc` across `lib/`, `spec/`, and `bin/`.

## Local Quality Checks

1. `bundle exec bin/lint`
2. `bundle exec rspec`
11 changes: 11 additions & 0 deletions bin/extract
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
#!/usr/bin/env ruby
$LOAD_PATH.unshift File.expand_path("../lib", __dir__)
require "extractor"
require "json"

if ARGV.empty?
warn "Usage: bin/extract <path-to-html>"
exit 64
end

puts JSON.pretty_generate("artworks" => Extractor.call(ARGV[0]))
23 changes: 23 additions & 0 deletions bin/lint
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

require "rbconfig"

files = Dir["lib/**/*.rb", "spec/**/*.rb", "bin/*"]
files.select! { |path| File.file?(path) }
files.sort!

failed = []

files.each do |path|
ok = system(RbConfig.ruby, "-wc", path)
failed << path unless ok
end

if failed.any?
warn "\nLint failed for #{failed.size} file(s):"
failed.each { |path| warn "- #{path}" }
exit 1
end

puts "\nLint passed for #{files.size} file(s)."
39 changes: 39 additions & 0 deletions lib/extractor.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
require "nokolexbor"

require_relative "extractor/thumbnail_index"
require_relative "extractor/carousel"
require_relative "extractor/item"

# scrapeMemo psuedocode: enabling the Item class to detect whether a relevant scrapeMemo record has been found by the Carousel class would probably require changing the Extractor from a module into a class
module Extractor
# Public facade. Returns an Array<Hash> of carousel items.
#
# Why this shape:
# - Callers only need one method (`Extractor.call`) and don't manage objects.
# - Internally we still use small classes for focused responsibilities.
#
# Extractor.call("files/van-gogh-paintings.html")
# # => [{ "name" => "The Starry Night", "extensions" => ["1889"], ... }, ...]
# Accepts either a file path string or raw HTML string.
def self.call(html_or_path)
# Support both CLI/file use and test/raw-html use with one entrypoint.
html = if File.file?(html_or_path.to_s)
File.read(html_or_path)
else
html_or_path
end

# Parse once into DOM, then build the thumbnail map once.
# This avoids repeatedly scanning <script> nodes for every tile.
document = Nokolexbor::HTML(html)
thumbnails = ThumbnailIndex.build(document)

# Keep parse pipeline explicit:
# 1) find best carousel tiles
# 2) parse each tile
# 3) drop malformed/nil rows
Carousel.tiles(document).map do |tile|
Item.parse(tile, thumbnails: thumbnails)
end.compact
end
end
99 changes: 99 additions & 0 deletions lib/extractor/carousel.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
require_relative "item"

module Extractor
# Locates the knowledge-graph carousel on a desktop Google SERP. Google
# rotates CSS class names regularly, so we identify the carousel by its
# structural signature: a container holding a run of sibling tiles, where
# every tile has exactly one anchor to /search?... with a `stick=` param
# (the knowledge-graph entity stick).
class Carousel
# Heuristic floor to filter tiny, usually-noisy stick-link groups.
MIN_TILES = 3

# Class-level convenience API keeps caller usage functional:
# Carousel.tiles(document)
# while still allowing private helpers and local state internally.
def self.tiles(document)
new(document).tiles
end

# Stores the document as instance state so private methods can access
# it without threading it through every method call.
def initialize(document)
@document = document
end

def tiles
# scrapeMemo psuedocode: create empty scrapeMemo hash, which will serve as an index for future parsing of the same search result structure (data-attrid, whether target grid is inside div#search or not, tile grid container class, tile root class, tile count, name_attribute, image_script_variable_names)
target_section = @document.at_css('#search') || @document
# scrapeMemo psuedocode: if '#search' can't be found, add that to scrapeMemo hash
target_section = target_section.css('div').find { |d| d['data-attrid'] } || @document.css('div').find { |d| d['data-attrid'] } || target_section
# scrapeMemo psuedocode: if div['data-attrid'] can't be found, add that to scrapeMemo hash
# scrapeMemo psuedocode: check database for any records containing the same ['data-attrid'] value
# scrapeMemo psuedocode: if one or more record(s) exist, scan for the tile grid container class, prioritizing the record most recently created
# scrapeMemo psuedocode: if the tile grid container exists & has the expected number of children with the expected tile root class, set them as the tile roots & skip the rest of this function

groups = candidate_groups(target_section)
return [] if groups.empty?

# Multiple stick-link groups can exist on one page. We prefer the group
# that most strongly looks like media tiles, then fall back to DOM order
# for deterministic behavior on ties.
# deterministic tie-breaking removes "heisenbugs" where output can vary by Ruby/hash iteration behavior.
scored = groups.map { |g| [group_score(g), g] }
best_score = scored.map(&:first).max
best_groups = scored.select { |score, _| score == best_score }.map(&:last)
best_groups.min_by { |g| document_position(g.first) } || []
# scrapeMemo psuedocode: just like with the div['data-attrid'] value before, we can now check the tile grid container class, tile root class, tile count, as well as the div['data-attrid'] value against recorded indexes to check for search result structure drift
end

private

# before = Time.now
# for i in 1..1000
# candidate_groups(target_section)
# end
# after = Time.now
# puts "new version benchmarked at #{after - before}"

# Build candidate groups by finding the three biggest sibling groups within the target section
# I tested this versus the previous implementation with the quick & dirty benchmark commented out above
# against the van-gogh-paintings.html results and found this to be about x10 faster.
def candidate_groups(target_section)
target_section.css('div', 'section', 'main').max_by(3) { |element| element.element_children.count }.map { |e| e.element_children }
end

# Score shape:
# 1) tiles with an image element
# 2) tiles with a properly formatted anchor links
# 3) tiles with a likely name signal
# 4) group size
#
# This keeps us anchored on structural evidence instead of class names.
def group_score(group)
default_weight = ENV.fetch('DEFAULT_TILE_WEIGHT', 1.1).to_f
acceptable_number_of_misformed_tiles = ENV.fetch('ACCEPTABLE_NUMBER_OF_MISFORMED_TILES', 0).to_i
tile_img_weight = ENV.fetch('TILE_IMG_WEIGHT', default_weight).to_f
tile_anchor_weight = ENV.fetch('TILE_ANCHOR_WEIGHT', default_weight).to_f
tile_name_weight = ENV.fetch('TILE_NAME_WEIGHT', default_weight).to_f
# Prefer groups that look like media cards.
almost_all_tiles_have_images = group.count { |tile| tile.at_css("img") } >= group.size - acceptable_number_of_misformed_tiles
img_score = almost_all_tiles_have_images ? tile_img_weight : 1
# properly formatted anchor links provide a second quality axis.
almost_all_tiles_have_anchors = group.count { |tile| tile.at_css('a[href*="stick="]') } >= group.size - acceptable_number_of_misformed_tiles
anchor_score = almost_all_tiles_have_anchors ? tile_anchor_weight : 1
# Name-like signals provide a third quality axis.
# I didn't think that searching for each name candidate individually to enforce uniformity
# was worth the performance cost, but it is a tradeoff worth discussing in a code review
almost_all_tiles_have_names = group.count { |tile| tile.at_css('[title], [aria-label], img[alt]') } >= group.size - acceptable_number_of_misformed_tiles
name_score = almost_all_tiles_have_names ? tile_name_weight : 1
group.size * img_score * anchor_score * name_score
end

def document_position(node)
return Float::INFINITY unless node
# DOM-order fallback is stable and explainable.
@document.css("*").index(node) || Float::INFINITY
end
end
end
Loading