Horizon

What's On

Cincinnati has a lot going on. History lectures at the Cincinnati Museum Center, archaeological tours through Fort Ancient, craft workshops at local makerspaces, film screenings, nature programs, WWII aviation days at the National Museum of the United States Air Force, and earthworks tours at Hopewell Culture. The problem isn't a shortage of events. It's that they're scattered across dozens of individual websites, each with its own calendar format, update cadence, and degree of crawlability.

No single place aggregates them. The city's general event sites tend toward concerts and festivals. The niche stuff — the history talks, the guided walks, the lecture series at small museums — lives on individual institution websites and dies there.

Horizon is my attempt to fix that for myself.

How It Works

Every Sunday morning a GitHub Actions cron job kicks off the scraper. It works through a growing list of sources: history museums, nature centers, archaeological sites, art venues, community theaters, and local institutions across Cincinnati, Northern Kentucky, and the Dayton area.

Sources

The single biggest challenge is that every organization presents events differently. One site uses a full calendar widget. Another lists them as plain bullet points. Another publishes a separate page per event with no index. There's no standard. So rather than one generic extractor, the scraper has a shared framework with a custom module for each source — each one written specifically for how that site structures its data. Almost every module is meaningfully different from the last.
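A minimal sketch of what "shared framework, one module per source" could look like. Everything here is illustrative rather than Horizon's actual code: the `RawPage` type, the `source` decorator, the `SOURCE_MODULES` registry, and the two example sources are hypothetical names, and the fetch functions return canned HTML instead of hitting a real site.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RawPage:
    source: str
    url: str
    html: str


# Hypothetical registry: source name -> function that fetches that source's pages.
SOURCE_MODULES: Dict[str, Callable[[], List[RawPage]]] = {}


def source(name: str):
    """Register a per-source fetch function under a name."""
    def decorator(fn):
        SOURCE_MODULES[name] = fn
        return fn
    return decorator


@source("example-museum")
def fetch_example_museum() -> List[RawPage]:
    # A site with a single calendar page: one page in, one page out.
    return [RawPage("example-museum", "https://example.org/calendar",
                    "<ul><li>Talk</li></ul>")]


@source("example-nature-center")
def fetch_example_nature_center() -> List[RawPage]:
    # A site with one page per event and no index returns several pages.
    return [
        RawPage("example-nature-center", "https://example.org/events/1",
                "<h1>Bird walk</h1>"),
        RawPage("example-nature-center", "https://example.org/events/2",
                "<h1>Night hike</h1>"),
    ]


def run_all() -> List[RawPage]:
    """Run every registered module; one failing source does not stop the rest."""
    pages: List[RawPage] = []
    for name, fetch in SOURCE_MODULES.items():
        try:
            pages.extend(fetch())
        except Exception as exc:
            print(f"{name} failed: {exc}")
    return pages
```

The point of the registry is that adding a source means adding one function, while scheduling, error handling, and the downstream extraction step stay shared.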

For each source, the appropriate module fetches the HTML. For sites that require JavaScript to render, which is most of them, it uses Playwright to get the fully rendered page first. That HTML goes to Gemini Flash, which extracts structured event data: title, date, time, location, categories, cost, and a short description. Gemini Flash was a deliberate choice: its large context window handles dense calendar pages cleanly, and the cost per extraction is low enough that running the full source list every week stays cheap.
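The two ends of that step can be sketched as follows. This is an assumption-laden outline, not the project's code: `fetch_rendered_html` uses Playwright's synchronous API, the `Event` fields simply mirror the list above, and `parse_events` assumes the model has been asked to reply with a JSON array of objects. The call to Gemini itself is deliberately left out.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


def fetch_rendered_html(url: str) -> str:
    """Return the page HTML after JavaScript has run (Playwright sync API)."""
    # Imported lazily so the rest of this sketch works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html


@dataclass
class Event:
    title: str
    date: str
    time: Optional[str] = None
    location: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    cost: Optional[str] = None
    description: Optional[str] = None


def parse_events(model_output: str) -> List[Event]:
    """Turn the model's JSON reply into Event records, skipping unusable rows."""
    events: List[Event] = []
    for row in json.loads(model_output):
        # Assumption: an event without a title or a date is not worth storing.
        if not row.get("title") or not row.get("date"):
            continue
        events.append(Event(
            title=row["title"].strip(),
            date=row["date"],
            time=row.get("time"),
            location=row.get("location"),
            categories=list(row.get("categories") or []),
            cost=row.get("cost"),
            description=row.get("description"),
        ))
    return events
```

Validating the reply before it reaches the database keeps a malformed extraction from one source from polluting the rest of the batch.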

Extracted events are written to the database. A separate enrichment workflow runs afterward to fill in missing fields — coordinates, venue details, category tags — where the initial extraction left gaps.
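One way to picture the enrichment pass is as a fill-only merge: values are added where a field is empty and never overwrite what the extraction already found. The sketch below assumes events are plain dictionaries and that venue details come from a lookup table; the `VENUES` table, its placeholder coordinates, and the `enrich` function are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical lookup table with placeholder values; a real workflow
# might query a geocoding service instead.
VENUES: Dict[str, Dict[str, Any]] = {
    "Example Hall": {"lat": 39.0, "lon": -84.5, "categories": ["lecture"]},
}


def enrich(event: Dict[str, Any], venues: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of the event with empty fields filled from the venue table."""
    enriched = dict(event)
    extra = venues.get(event.get("location"), {})
    for key, value in extra.items():
        # Only fill gaps: None, empty string, or empty list count as missing.
        if enriched.get(key) in (None, "", []):
            enriched[key] = value
    return enriched
```

For example, `enrich({"title": "Talk", "location": "Example Hall", "lat": None}, VENUES)` gains `lat`, `lon`, and `categories`, while an event that already has categories keeps its own.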

Database Storage

Events are stored in a database and consolidated into a static manifest at build time. Each record carries a version tag from the scrape batch that created it.
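A rough sketch of the build-time consolidation, under stated assumptions: the source does not say which database is used, so SQLite stands in here, and the table layout, the `batch` column holding the version tag, the tag format, and the `build_manifest` function are all made up for illustration.

```python
import json
import sqlite3


def build_manifest(conn: sqlite3.Connection) -> str:
    """Read all events and serialize them into one static JSON manifest."""
    rows = conn.execute(
        "SELECT title, date, location, batch FROM events ORDER BY date, title"
    ).fetchall()
    events = [
        {"title": title, "date": date, "location": location, "batch": batch}
        for title, date, location, batch in rows
    ]
    return json.dumps({"count": len(events), "events": events}, indent=2)


# In-memory database with two sample rows from the same scrape batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (title TEXT, date TEXT, location TEXT, batch TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("Night hike", "2025-06-08", "Example Nature Center", "batch-2025-06-01"),
        ("Earthworks tour", "2025-06-03", "Example Site", "batch-2025-06-01"),
    ],
)
manifest = build_manifest(conn)
```

Because the manifest is a plain file, the site can serve it statically, and the batch tag on each record makes it possible to tell which scrape run produced a given event.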