feat(job_scout): add 6 Swiss/EU companies, new adapters, and scan-stats table

Automate Palantir, QuantCo, Swissgrid, RUAG, SBB, BKW (drop BFH/Dialectic);
25 companies automated, 0 manual.

- adapters: lever (Palantir/QuantCo), generic json (Swissgrid), sbb, bkw
- fetch_playwright: optional ?page=N pagination (page_param/max_pages) for RUAG
- location_matches: treat pan-EU "Europe"/"EMEA" postings as eligible
- per-company _score_floor so pre-filtered German-language boards stay visible
- POSITIVE_KEYWORDS: add data scientist / data science (medium)
- report: scan-stats table (scraped / CH-remote / match>=2 / newest / time) + totals

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 15:15:22 +02:00
parent 49ba42138d
commit da66443aa8
+383 -49
@@ -1,7 +1,7 @@
 """Job scout for Dennis's quarterly target companies.
 Pulls latest openings from companies via public ATS APIs (Workday/Ashby/Greenhouse/
-SmartRecruiters/Eightfold/RSS) and, for JS-rendered careers sites, a headless-browser
+SmartRecruiters/Lever/Eightfold/RSS) and, for JS-rendered careers sites, a headless-browser
 (playwright) adapter. Filters by Swiss location or remote eligibility, scores fit against
 profile keywords, tracks which job IDs we've already seen, writes a markdown report.
@@ -22,6 +22,7 @@ See the adapter-coverage notes at the bottom for the current automated/manual sp
 import json
 import re
 import sys
+import time
 from functools import lru_cache
 import urllib.error
 import urllib.parse
@@ -59,6 +60,9 @@ POSITIVE_KEYWORDS = {
     "applied ai": 3, "applied ml": 3, "ai engineer": 3, "ml engineer": 3,
     "mlops": 3, "ai platform": 3, "ml platform": 3,
     "python": 2, "java": 2, "data engineer": 2, "data engineering": 2,
+    # "data scientist" scored modestly (medium, not strong) — secondary to his data-eng/
+    # platform thesis, but the targeted band at boutiques like QuantCo (see target memory).
+    "data scientist": 2, "data science": 2,
     "solutions architect": 2, "platform engineer": 2,
     "ai infrastructure": 2, "inference": 2, "rag": 2, "agentic": 2,
     "kubernetes": 1, "docker": 1, "etl": 1, "pipeline": 1,
@@ -227,11 +231,72 @@ COMPANIES = [
         "scroll_count": 5,
         "use_inner_text_as_blob": True,
     }),
+    # --- Zürich/Zug high-comp additions (2026-05-31 list review) ---
+    # Palantir (Lever). Verified: 221 postings on the public board. It's US/London-heavy, so
+    # Swiss/Schwyz roles are rare but self-surface when posted (the location filter drops the
+    # US/London bulk). No title filter: his target titles (Forward Deployed Software Engineer,
+    # Deployment Strategist) aren't in ENG_TITLE_FILTER, so filtering would hide them.
+    ("palantir", "Palantir", "lever", {"slug": "palantir"}),
+    # QuantCo (Lever — note the trailing-hyphen slug "quantco-"). ~16 roles, most tagged
+    # "Europe" (hybrid); QuantCo's continental hub is Zürich, so the EU-wide rule in
+    # location_matches surfaces them. No title filter: the target band is DS/Quant/AI/Cloud
+    # (see comp analysis), which ENG_TITLE_FILTER would drop; interns/frontend are caught by
+    # NEGATIVE_KEYWORDS instead.
+    ("quantco", "QuantCo", "lever", {"slug": "quantco-"}),
+    # --- Bern/Thun local tier — WLB & proximity exception (comp bar relaxed; 2026-06-01) ---
+    # Wired after live endpoint discovery. ⚠️ German citizen: RUAG classified work may require
+    # Swiss citizenship — verify per-role before tailoring (see project_target_companies).
+    # Swissgrid (Aarau): Magnolia CMS JSON endpoint (verified). placeOfWork is a bare city
+    # (Aarau/Prilly/...), so loc_suffix tags it Switzerland for the CH filter. No title filter
+    # (small board ~13 roles; lets Data Scientist / Applied-ML roles surface).
+    ("swissgrid", "Swissgrid (Aarau)", "json", {
+        "url": "https://www.swissgrid.ch/.rest/cloud/component-data?path=%2Fswissgrid%2Fen%2Fhome%2Fcareer%2Fjobs%2Fmain%2Fjoblist_transferred_11",
+        "jobs_key": "jobs",
+        "field_title": "title", "field_location": "placeOfWork",
+        "field_url": "descriptionUrl", "field_date": "onlineSince",
+        "loc_suffix": " Switzerland",
+        "desc_keys": ["department", "typeOfEmployment", "entryLevel"],
+    }),
+    # RUAG (Thun/Bern/Emmen). Jobs render on the portal as anchors to jobs.ruag.ch; the first
+    # line of each anchor is the title. All sites are Swiss, so default_location=Switzerland
+    # passes the CH filter. ENG_TITLE_FILTER cuts the apprenticeship/Lehrstelle bulk.
+    # Drupal portal: 20 jobs/page, server-rendered, paginated via ?page=N (0-indexed). The
+    # first page is apprenticeship-heavy; eng roles (DevOps/Data/Cloud) are on later pages,
+    # so we page through until a page adds nothing new (~5-6 pages).
+    ("ruag", "RUAG (Thun/Bern)", "playwright", {
+        "url": "https://www.ruag.ch/en/working-us/job-portal",
+        "wait_for": "a[href*='/offene-stellen/']",
+        "card": "a[href*='/offene-stellen/']",
+        "title_attr": "text",
+        "link_attr": "href",
+        "default_location": "Switzerland",
+        "scroll_count": 1,
+        "page_param": "page",
+        "max_pages": 10,
+        "_title_filter": ENG_TITLE_FILTER,
+    }),
+    # SBB (company.sbb.ch — the correct host; company-jobs.sbb.ch was wrong). AEM job filter
+    # served as a flat JSON list; the fetch_sbb adapter replicates the user's IT + Bern-region
+    # filter. German/generic titles, so _score_floor keeps the pre-filtered results visible.
+    # ⚠️ DE-citizen limits may apply to some SBB security/critical-infra roles.
+    ("sbb", "SBB", "sbb", {
+        "topic": "IT / Telekommunikation",
+        "region": "Bern Mittelland",
+        "_score_floor": 2,
+    }),
+    # BKW Group (jobs.bkw.com — the real ATS host). PMS structured-data API; ~600 roles
+    # group-wide, so fetch_bkw keeps only Berufsfeld categories Informatik/Trading/Finanzen
+    # (IT/data + energy-trading, incl. the flagged Energiehandel roles). German/generic
+    # titles, so _score_floor keeps the pre-filtered set visible.
+    ("bkw", "BKW (Bern)", "bkw", {"_score_floor": 2}),
 ]
 # Companies where adapter probing did not yield a reliable scrape. Reasons noted.
 # These surface as a clickable checklist in the report so they're not forgotten.
-# (Empty — all current target companies are automated.)
+# Companies that resist scraping stay here as a clickable report checklist. Currently empty —
+# every target company is automated. (Dropped 2026-06-01: BFH — academic FH pay below even the
+# relaxed Bern/Thun floor, research-leaning, 403s anyway; Dialectic — ~50-person crypto VC,
+# 0 open roles, crypto angle already covered by Kraken/Bitcoin Suisse/Coinbase Ventures.)
 MANUAL_CHECK = []
@@ -509,6 +574,145 @@ def fetch_onlyfy(args):
     return jobs
+def fetch_lever(args):
+    """Lever public postings API. Palantir uses this. The board is US/London-heavy;
+    Swiss/Zurich (Schwyz hub) roles are rare on it but will surface here when posted —
+    location filtering downstream drops the US/London bulk. categories.allLocations
+    captures multi-location postings; createdAt is epoch-ms."""
+    slug = args["slug"]
+    data = http_get_json(f"https://api.lever.co/v0/postings/{slug}?mode=json")
+    jobs = []
+    for j in data:
+        cats = j.get("categories") or {}
+        all_locs = cats.get("allLocations") or []
+        loc_blob = " | ".join(x for x in ([cats.get("location") or ""] + [str(a) for a in all_locs]) if x)
+        ts = j.get("createdAt")
+        posted = ""
+        if isinstance(ts, (int, float)):
+            posted = datetime.fromtimestamp(ts / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
+        jobs.append({
+            "id": j.get("id"),
+            "title": j.get("text", ""),
+            "location": loc_blob,
+            "url": j.get("hostedUrl"),
+            "posted": posted,
+            "description": (j.get("descriptionPlain") or "")[:2500],
+        })
+    return jobs
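The two conversions the fetch_lever docstring mentions (joining `categories.location` with `allLocations`, and turning the epoch-millisecond `createdAt` into a date string) can be tried in isolation. This is only a sketch on a made-up posting: `lever_posting_summary` and the sample dict are hypothetical, and only the field names are taken from the adapter.

```python
from datetime import datetime, timezone


def lever_posting_summary(posting):
    """Hypothetical helper: return (location_blob, posted) for one posting dict."""
    cats = posting.get("categories") or {}
    locs = [cats.get("location") or ""] + [str(a) for a in (cats.get("allLocations") or [])]
    # Empty strings are dropped so a missing primary location leaves no stray separator.
    loc_blob = " | ".join(x for x in locs if x)
    ts = posting.get("createdAt")
    posted = ""
    if isinstance(ts, (int, float)):
        # createdAt is in milliseconds; fromtimestamp expects seconds.
        posted = datetime.fromtimestamp(ts / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
    return loc_blob, posted


# Made-up sample data, not a real Lever response.
sample = {
    "categories": {"location": "Europe", "allLocations": ["Europe", "Zurich"]},
    "createdAt": 1767225600000,  # 2026-01-01T00:00:00Z in epoch-ms
}
print(lever_posting_summary(sample))
```

The primary location is repeated when it also appears in `allLocations`; that is harmless for substring-based location matching.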
+def fetch_json(args):
+    """Generic JSON jobs API with configurable field names, for employer sites that expose
+    a clean public endpoint. Verified use: Swissgrid (Magnolia CMS
+    /.rest/cloud/component-data — {config, jobs:[...], filters}). Field names vary by site,
+    so they're configurable: field_title/field_location/field_url/field_date. loc_suffix
+    appends e.g. ' Switzerland' so the CH location filter matches city-only values such as
+    "Aarau"/"Prilly" (not every Swiss town is in CH_LOCATION_KEYWORDS). desc_keys fold extra
+    fields (department, employment type) into the description for keyword scoring.
+    Args: url, jobs_key (default "jobs"), field_* (defaults title/location/url/date),
+    url_prefix, loc_suffix, desc_keys."""
+    data = http_get_json(args["url"])
+    arr = data.get(args.get("jobs_key", "jobs"), []) if isinstance(data, dict) else (data or [])
+    ft, fl = args.get("field_title", "title"), args.get("field_location", "location")
+    fu, fd = args.get("field_url", "url"), args.get("field_date", "date")
+    prefix, suffix = args.get("url_prefix", ""), args.get("loc_suffix", "")
+    desc_keys = args.get("desc_keys", [])
+    jobs = []
+    for j in arr:
+        url = j.get(fu, "") or ""
+        if url and not url.startswith("http") and prefix:
+            url = prefix.rstrip("/") + "/" + url.lstrip("/")
+        loc = (j.get(fl, "") or "").strip() + suffix
+        desc = " ".join(str(j.get(k)) for k in desc_keys if j.get(k))
+        jobs.append({
+            "id": str(j.get("id") or url),
+            "title": j.get(ft, ""),
+            "location": loc,
+            "url": url,
+            "posted": j.get(fd, "") or "",
+            "description": desc[:500],
+        })
+    return jobs
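A small sketch of the field-mapping idea in fetch_json, run on a made-up record instead of a live endpoint. `normalize_job`, the record, and the `https://example.org` prefix are hypothetical; the keyword names mirror the adapter's `field_*`, `url_prefix`, and `loc_suffix` options.

```python
def normalize_job(raw, field_title="title", field_location="location",
                  field_url="url", url_prefix="", loc_suffix=""):
    """Hypothetical helper: map one site-specific record onto common keys."""
    url = raw.get(field_url, "") or ""
    # Relative links are joined to the prefix with exactly one slash between them.
    if url and not url.startswith("http") and url_prefix:
        url = url_prefix.rstrip("/") + "/" + url.lstrip("/")
    # The suffix lets a bare city name satisfy a country-level location filter.
    location = (raw.get(field_location, "") or "").strip() + loc_suffix
    return {"title": raw.get(field_title, ""), "location": location, "url": url}


# Made-up record shaped like a site that uses its own field names.
record = {"title": "Data Scientist", "placeOfWork": "Aarau ", "descriptionUrl": "/jobs/123"}
job = normalize_job(record, field_location="placeOfWork", field_url="descriptionUrl",
                    url_prefix="https://example.org/", loc_suffix=" Switzerland")
print(job)
```

Keeping the field names in configuration means a new employer site with a clean JSON endpoint may need only a COMPANIES entry rather than a new adapter.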
+def fetch_sbb(args):
+    """SBB (company.sbb.ch) AEM job filter. The whole board is served as a flat JSON list
+    at .../jobfilter.results.json (~145 roles); the website filters client-side via each
+    job's numbered `attributes`: '20'=Berufsfeld/topic, '110'=region, '100'=city,
+    'links.directlink'=the jobs.sbb.ch URL. We replicate the user's IT + Bern-region filter
+    so only commutable IT roles surface. Titles are German/generic (Application Engineer,
+    Network Security Engineer, OT Architekt) and won't match ENG_TITLE_FILTER or the keyword
+    scorer, so this company is given a _score_floor in COMPANIES to keep its pre-filtered
+    results visible. topic/region are configurable substrings."""
+    url = args.get("url", ("https://company.sbb.ch/content/internet/corporate/de/"
+                           "jobs-karriere/jobs/job-suche/jcr:content/parmain/"
+                           "jobfilter.results.json"))
+    topic = args.get("topic", "IT / Telekommunikation")
+    region = args.get("region", "Bern Mittelland")
+    data = http_get_json(url)
+    arr = data if isinstance(data, list) else (data.get("results") or data.get("jobs") or [])
+    jobs = []
+    for j in arr:
+        a = j.get("attributes", {}) or {}
+        blob = " ".join(str(x) for v in a.values() for x in (v if isinstance(v, list) else [v]))
+        if topic and topic not in blob:
+            continue
+        if region and region not in blob:
+            continue
+        region_v = " ".join(a.get("110", []) or [])
+        city_v = " ".join(a.get("100", []) or [])
+        field_v = " ".join(a.get("20", []) or [])
+        jobs.append({
+            "id": str(j.get("id") or j.get("viewkey") or ""),
+            "title": j.get("title", ""),
+            "location": f"{city_v} {region_v} Schweiz".strip(),
+            "url": (j.get("links") or {}).get("directlink", ""),
+            "posted": j.get("start_date", "") or "",
+            "description": (field_v + " " + (j.get("text", "") or ""))[:400],
+        })
+    return jobs
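The topic/region filter in fetch_sbb flattens every attribute value into one text blob and then does substring checks. A minimal sketch of that step follows; `matches_filters` and the sample attribute dicts are hypothetical, with the numbered keys taken from the docstring.

```python
def matches_filters(attributes, topic="", region=""):
    """Hypothetical helper: True if the flattened attribute text contains both substrings."""
    # Attribute values may be lists or scalars; flatten both into one searchable string.
    blob = " ".join(
        str(x)
        for v in attributes.values()
        for x in (v if isinstance(v, list) else [v])
    )
    return (not topic or topic in blob) and (not region or region in blob)


# Made-up attribute dicts using the numbered keys described above.
bern_it = {"20": ["IT / Telekommunikation"], "110": ["Bern Mittelland"], "100": ["Bern"]}
zurich_it = {"20": ["IT / Telekommunikation"], "110": ["Zürich"], "100": "Zürich"}

print(matches_filters(bern_it, "IT / Telekommunikation", "Bern Mittelland"))
print(matches_filters(zurich_it, "IT / Telekommunikation", "Bern Mittelland"))
```

Because the check runs over the whole blob rather than a specific key, it would also match if the topic text appeared under a different attribute number; that trade-off keeps the filter tolerant of schema changes.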
+def fetch_bkw(args):
+    """BKW Group (jobs.bkw.com) PMS structured-data API. The whole-group board is ~600 roles
+    dominated by building-tech / electrical / civil-engineering trades; we keep only the
+    Berufsfeld categories relevant to the user (Informatik / Trading / Finanzen), which
+    surfaces IT/data plus the energy-trading roles (Quant Risk Modeller, Solution Architect
+    Energiehandel, Energy Derivatives/Market-Risk analysts). locations[].address gives
+    city/country. Pre-filtered + German/generic titles, so paired with a _score_floor in
+    COMPANIES. The category allowlist is configurable."""
+    url = args.get("url", ("https://jobs.bkw.com/_api/v1/structureddata?"
+                           "configFromContentElement=82381&language=de-ch"))
+    allow = [c.lower() for c in args.get("categories", ["Informatik", "Trading", "Finanzen"])]
+    data = http_get_json(url)
+    arr = data if isinstance(data, list) else []
+    if not arr and isinstance(data, dict):
+        for v in data.values():
+            if isinstance(v, list) and v and isinstance(v[0], dict) and "title" in v[0]:
+                arr = v
+                break
+    jobs = []
+    for j in arr:
+        if j.get("type") and j.get("type") != "jobs":
+            continue
+        cats = [c.get("title", "") for c in (j.get("relations", {}) or {}).get("Berufsfeld", []) or []]
+        if allow and not any(any(a in c.lower() for a in allow) for c in cats):
+            continue
+        locs = j.get("locations") or []
+        addr = (locs[0].get("address") if locs and isinstance(locs[0], dict) else {}) or {}
+        loc = " ".join(x for x in [addr.get("city", ""), addr.get("country", "")] if x) or "Schweiz"
+        jobs.append({
+            "id": str(j.get("id") or j.get("url") or ""),
+            "title": j.get("title", ""),
+            "location": loc,
+            "url": j.get("url", ""),
+            "posted": "",
+            "description": " ".join(cats + [j.get("subtitle", "") or ""])[:300],
+        })
+    return jobs
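The category allowlist in fetch_bkw is a case-insensitive substring match against each Berufsfeld title. A sketch on made-up records, where `keep_by_category` and the sample jobs are hypothetical and the `relations.Berufsfeld[].title` shape follows the adapter:

```python
def keep_by_category(job, allow=("informatik", "trading", "finanzen")):
    """Hypothetical helper: True if any category title contains an allowlisted term."""
    relations = job.get("relations") or {}
    cats = [c.get("title", "") for c in relations.get("Berufsfeld") or []]
    # Substring match, so a compound title such as "Informatik / Data" still qualifies.
    return any(term in cat.lower() for cat in cats for term in allow)


# Made-up records, not real API output.
jobs = [
    {"title": "Quant Risk Modeller", "relations": {"Berufsfeld": [{"title": "Trading"}]}},
    {"title": "Elektroinstallateur", "relations": {"Berufsfeld": [{"title": "Gebäudetechnik"}]}},
    {"title": "Data Engineer", "relations": {"Berufsfeld": [{"title": "Informatik / Data"}]}},
    {"title": "No category", "relations": None},
]
kept = [j["title"] for j in jobs if keep_by_category(j)]
print(kept)
```

Records without any category are dropped by this rule, which matches the adapter's behavior whenever the allowlist is non-empty.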
 # Injected before page scripts run, to mask the most common headless-detection signals.
 # Required for Google; harmless for the other sites.
 STEALTH_JS = """
@@ -577,18 +781,12 @@ def fetch_playwright(args):
     ctx.add_init_script(STEALTH_JS)
     page = ctx.new_page()
     jobs = []
-    try:
-        page.goto(args["url"], timeout=45000, wait_until="domcontentloaded")
-        # Optional cookie banner acceptance
-        for sel in args.get("cookie_accept", []) or []:
-            try:
-                btn = page.locator(sel).first
-                if btn.is_visible(timeout=2000):
-                    btn.click()
-                    page.wait_for_timeout(500)
-            except Exception:
-                pass
-        # Wait for job content to render
+    seen_ids = set()
+
+    def scrape_current():
+        """Extract cards from the currently-loaded page; append new ones to `jobs`.
+        Returns the count of newly-added (not-yet-seen) cards so a pagination loop can
+        stop once a page contributes nothing new."""
         wait_for = args.get("wait_for")
         if wait_for:
             try:
@@ -605,6 +803,7 @@ def fetch_playwright(args):
         cards = page.locator(args["card"])
         n = min(cards.count(), args.get("max_cards", 150))
+        added = 0
         for i in range(n):
             card = cards.nth(i)
             try:
@@ -638,6 +837,11 @@ def fetch_playwright(args):
                 if not title:
                     continue
+                jid = href or f"{page.url}#{i}"
+                if jid in seen_ids:
+                    continue
+                seen_ids.add(jid)
+                added += 1
                 description = ""
                 if args.get("use_inner_text_as_blob"):
                     # Use the full card text as both location source and description
@@ -646,26 +850,47 @@ def fetch_playwright(args):
                     if not location:
                         location = full[:300]
                 jobs.append({
-                    "id": href or f"{args['url']}#{i}",
+                    "id": jid,
                     "title": title,
                     "location": location,
-                    "url": href or args["url"],
+                    "url": href or page.url,
                     "posted": "",
                     "description": description,
                 })
             except Exception:
                 continue
+        return added
+
+    try:
+        page.goto(args["url"], timeout=45000, wait_until="domcontentloaded")
+        # Optional cookie banner acceptance (once, on the first page)
+        for sel in args.get("cookie_accept", []) or []:
+            try:
+                btn = page.locator(sel).first
+                if btn.is_visible(timeout=2000):
+                    btn.click()
+                    page.wait_for_timeout(500)
+            except Exception:
+                pass
+        # Optional query-param pagination (e.g. Drupal "?page=N", 0-indexed). The base URL is
+        # page 0 (already loaded); fetch successive pages until one adds no new cards.
+        page_param = args.get("page_param")
+        if page_param:
+            base = args["url"]
+            joiner = "&" if "?" in base else "?"
+            for p in range(args.get("max_pages", 8)):
+                if p > 0:
+                    page.goto(f"{base}{joiner}{page_param}={p}", timeout=45000,
+                              wait_until="domcontentloaded")
+                added = scrape_current()
+                if p > 0 and added == 0:
+                    break
+        else:
+            scrape_current()
     finally:
         ctx.close()
-    # Deduplicate within a single company by id
-    seen, deduped = set(), []
-    for j in jobs:
-        if j["id"] in seen:
-            continue
-        seen.add(j["id"])
-        deduped.append(j)
-    return deduped
+    return jobs
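The pagination change replaces the post-hoc dedupe with a "stop when a page adds nothing new" loop. The control flow can be sketched without a browser; `paginate_until_stale`, `load_page`, and the `pages` dict are hypothetical stand-ins for `scrape_current` and the `?page=N` navigation.

```python
def paginate_until_stale(load_page, max_pages=8):
    """Hypothetical helper: collect unique ids page by page, stopping early
    once a page after the first contributes no unseen id."""
    seen, items = set(), []
    for p in range(max_pages):
        added = 0
        for item_id in load_page(p):
            if item_id in seen:
                continue
            seen.add(item_id)
            items.append(item_id)
            added += 1
        # Page 0 is always processed; later pages end the loop when they add nothing.
        if p > 0 and added == 0:
            break
    return items


# Made-up paginated board: page 2 only repeats an id already seen on page 1.
pages = {0: ["a", "b"], 1: ["b", "c"], 2: ["c"], 3: ["d"]}
visited = []


def load_page(p):
    visited.append(p)
    return pages.get(p, [])


result = paginate_until_stale(load_page)
print(result, visited)
```

In this sample, page 3 is never requested because page 2 added nothing, which is the cost of the heuristic: a stale page in the middle of a board ends the scan early. `max_pages` bounds the loop in the opposite case, when every page keeps producing new ids.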
 ADAPTERS = {
@@ -678,6 +903,10 @@ ADAPTERS = {
     "rss": fetch_rss,
     "getro": fetch_getro,
     "onlyfy": fetch_onlyfy,
+    "lever": fetch_lever,
+    "json": fetch_json,
+    "sbb": fetch_sbb,
+    "bkw": fetch_bkw,
     "playwright": fetch_playwright,
 }
@@ -690,9 +919,12 @@ def location_matches(loc_text):
     has_remote = any(k in low for k in REMOTE_KEYWORDS)
     is_us_only = any(p in low for p in US_ONLY_PATTERNS) and not in_ch
     has_eu_hint = any(k in low for k in EU_HINT_KEYWORDS)
-    # Count as remote-eligible only if it isn't a US-only remote listing
-    # and it has at least one EU/global hint
-    is_remote = has_remote and not is_us_only and has_eu_hint
+    # Pan-European postings (location literally "Europe"/"EMEA", e.g. QuantCo's Lever board)
+    # are reachable for a DACH-based candidate even without an explicit "remote" keyword, so
+    # treat them as eligible too. City-specific EU roles (e.g. "Berlin or Munich") stay out.
+    is_eu_wide = any(k in low for k in ("europe", "emea")) and not is_us_only
+    # Count as remote/EU-eligible only if it isn't a US-only listing and has an EU/global hint
+    is_remote = (has_remote or is_eu_wide) and not is_us_only and has_eu_hint
     return in_ch, is_remote
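A standalone sketch of the eligibility rule after this change. The four keyword tuples are assumptions made up for the example (the real lists live elsewhere in the module and are not shown in this diff), and the function is named `location_matches_sketch` to make clear it is not the project's implementation.

```python
# Assumed keyword lists for illustration only; the project's real lists may differ.
CH_KEYWORDS = ("switzerland", "schweiz", "zürich", "zurich", "bern")
REMOTE_KEYWORDS = ("remote",)
US_ONLY_PATTERNS = ("united states", "us only")
EU_HINT_KEYWORDS = ("europe", "emea", "global", "worldwide")


def location_matches_sketch(loc_text):
    """Return (in_ch, is_remote) for a free-text location string."""
    low = (loc_text or "").lower()
    in_ch = any(k in low for k in CH_KEYWORDS)
    has_remote = any(k in low for k in REMOTE_KEYWORDS)
    is_us_only = any(p in low for p in US_ONLY_PATTERNS) and not in_ch
    has_eu_hint = any(k in low for k in EU_HINT_KEYWORDS)
    # Pan-European labels count as eligible even without the word "remote".
    is_eu_wide = any(k in low for k in ("europe", "emea")) and not is_us_only
    is_remote = (has_remote or is_eu_wide) and not is_us_only and has_eu_hint
    return in_ch, is_remote


for text in ("Europe", "Remote, EMEA", "Berlin or Munich",
             "Remote - United States", "Zürich, Switzerland"):
    print(text, location_matches_sketch(text))
```

Under these assumed lists, "Europe" alone becomes eligible while "Berlin or Munich" does not, which is the distinction the new comment describes.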
@@ -747,14 +979,65 @@ def save_seen(seen):
     STATE_FILE.write_text(json.dumps(seen, indent=2, ensure_ascii=False), encoding="utf-8")
-def write_report(path, results, errors, new_only, include_weak):
+def _parse_posted(s):
+    """Best-effort parse of an adapter's `posted` field into a date, across the mix of
+    formats the boards use (ISO 8601 incl. trailing Z, YYYY-MM-DD, DD.MM.YYYY). Returns None
+    for unparseable values (e.g. Workday's relative "Posted 5 Days Ago", or empty)."""
+    if not s or not isinstance(s, str):
+        return None
+    s = s.strip()
+    try:
+        return datetime.fromisoformat(s.replace("Z", "+00:00")).date()
+    except ValueError:
+        pass
+    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d", "%d/%m/%Y"):
+        try:
+            return datetime.strptime(s[:10], fmt).date()
+        except ValueError:
+            pass
+    m = re.search(r"\d{4}-\d{2}-\d{2}", s)
+    if m:
+        try:
+            return datetime.strptime(m.group(0), "%Y-%m-%d").date()
+        except ValueError:
+            pass
+    return None
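To see which inputs each fallback catches, the parser can be exercised on a few strings. The function body is restated here (as `parse_posted`) only so the snippet runs on its own; the sample strings are made up.

```python
import re
from datetime import date, datetime


def parse_posted(s):
    """Standalone restatement of the best-effort date parser for experimentation."""
    if not s or not isinstance(s, str):
        return None
    s = s.strip()
    try:
        # Step 1: ISO 8601, with a trailing "Z" rewritten to an explicit UTC offset.
        return datetime.fromisoformat(s.replace("Z", "+00:00")).date()
    except ValueError:
        pass
    # Step 2: fixed 10-character date formats.
    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(s[:10], fmt).date()
        except ValueError:
            pass
    # Step 3: an ISO date embedded anywhere in free text.
    m = re.search(r"\d{4}-\d{2}-\d{2}", s)
    if m:
        try:
            return datetime.strptime(m.group(0), "%Y-%m-%d").date()
        except ValueError:
            pass
    return None


samples = ["2026-05-30T08:15:00Z", "30.05.2026", "Posted on 2026-05-30 by HR",
           "Posted 5 Days Ago", ""]
for s in samples:
    print(repr(s), parse_posted(s))
```

The first three samples resolve to 2026-05-30 through steps 1, 2, and 3 respectively; the relative and empty strings return None, so they simply do not contribute to the "newest posting" column.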
+def write_stats_table(stats, total_secs):
+    """Render the per-company scan stats as a markdown table (+ a totals row)."""
+    out = ["## Scan stats\n",
+           "| Company | Scraped | CH/Remote | Match ≥2 | Newest posting | Time (s) |",
+           "|---|--:|--:|--:|:--|--:|"]
+    t_scraped = t_elig = t_match = 0
+    newest_all = None
+    for s in stats:
+        name = s["company"] + (" ⚠️" if s.get("error") else "")
+        newest = s["newest"].isoformat() if s["newest"] else ""
+        out.append(f"| {name} | {s['scraped']:,} | {s['eligible']:,} | "
+                   f"{s['match']:,} | {newest} | {s['secs']:.1f} |")
+        t_scraped += s["scraped"]; t_elig += s["eligible"]; t_match += s["match"]
+        if s["newest"] and (newest_all is None or s["newest"] > newest_all):
+            newest_all = s["newest"]
+    out.append(f"| **Total ({len(stats)})** | **{t_scraped:,}** | **{t_elig:,}** | "
+               f"**{t_match:,}** | **{newest_all.isoformat() if newest_all else ''}** | "
+               f"**{total_secs:.1f}** |")
+    out.append("")
+    return out
+def write_report(path, results, errors, new_only, include_weak, stats=None, total_secs=0.0):
     today = datetime.now().strftime("%Y-%m-%d")
     n_new = sum(1 for r in results if r["is_new"])
+    n_match = sum(1 for r in results if r["score"] >= 2)
     lines = [
         f"# Job scout report {today}{' (new only)' if new_only else ''}\n",
         f"Automated coverage: **{len(COMPANIES)}** companies. Manual checks: **{len(MANUAL_CHECK)}**.",
-        f"Total matches from automated companies: **{len(results)}** ({n_new} new since last run)\n",
+        f"Eligible (CH/remote): **{len(results)}** · interest matches (score ≥ 2): "
+        f"**{n_match}** · **{n_new}** new since last run\n",
     ]
+    if stats:
+        lines += write_stats_table(stats, total_secs)
     if errors:
         lines.append("## Errors\n")
         for company, err in errors:
@@ -814,29 +1097,43 @@ def main():
     seen = load_seen()
     today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
-    all_results, errors = [], []
+    all_results, errors, stats = [], [], []
+    run_start = time.perf_counter()
     for cid, display, adapter, args in COMPANIES:
         if only and cid != only:
             continue
         print(f"Fetching {display}...", file=sys.stderr)
+        t0 = time.perf_counter()
         try:
             jobs = ADAPTERS[adapter](args)
         except (urllib.error.URLError, urllib.error.HTTPError, ValueError) as e:
             errors.append((display, repr(e)))
+            stats.append({"company": display, "scraped": 0, "eligible": 0,
+                          "match": 0, "newest": None, "secs": time.perf_counter() - t0,
+                          "error": True})
             continue
         except Exception as e:
             errors.append((display, f"unexpected: {e!r}"))
+            stats.append({"company": display, "scraped": 0, "eligible": 0,
+                          "match": 0, "newest": None, "secs": time.perf_counter() - t0,
+                          "error": True})
             continue
+        scraped = len(jobs)
         # Optional per-company title prefilter for high-volume boards
         title_filter = args.get("_title_filter")
         if title_filter:
             jobs = [j for j in jobs
                     if any(_kw_in(k, (j.get("title") or "").lower()) for k in title_filter)]
+        # Newest posting on the board (board freshness), across parseable dates.
+        dates = [d for j in jobs if (d := _parse_posted(j.get("posted")))]
+        newest = max(dates) if dates else None
         company_seen = seen.setdefault(cid, {})
         title_seen = set()
+        eligible = match = 0
         for j in jobs:
             jid = str(j.get("id") or j.get("url"))
             in_ch, is_remote = location_matches(j.get("location", ""))
@@ -848,8 +1145,17 @@ def main():
             if norm_title in title_seen:
                 continue
             title_seen.add(norm_title)
+            eligible += 1
             is_new = jid not in company_seen
             score, pos, neg = score_job(j, title_only=bool(title_filter))
+            # Pre-filtered boards (e.g. SBB, already narrowed to IT+Bern by the adapter) carry
+            # German/generic titles the profile scorer can't read; a _score_floor keeps their
+            # already-relevant results out of the hidden weak bucket.
+            floor = args.get("_score_floor")
+            if floor is not None and score < floor:
+                score = floor
+            if score >= 2:
+                match += 1
             all_results.append({
                 "company": display, "company_id": cid,
                 "title": j["title"], "location": j["location"],
@@ -859,8 +1165,13 @@ def main():
             })
             company_seen[jid] = {"title": j["title"], "first_seen": today}
+        stats.append({"company": display, "scraped": scraped, "eligible": eligible,
+                      "match": match, "newest": newest,
+                      "secs": time.perf_counter() - t0, "error": False})
     save_seen(seen)
     _close_browser()
+    total_secs = time.perf_counter() - run_start
     if new_only:
         all_results = [r for r in all_results if r["is_new"]]
@@ -869,43 +1180,66 @@ def main():
     REPORTS_DIR.mkdir(parents=True, exist_ok=True)
     report_path = REPORTS_DIR / f"{today}.md"
-    write_report(report_path, all_results, errors, new_only, include_weak)
+    write_report(report_path, all_results, errors, new_only, include_weak,
+                 stats=stats, total_secs=total_secs)
     n_new = sum(1 for r in all_results if r["is_new"])
     print(f"\nReport written: {report_path}", file=sys.stderr)
-    print(f"Total matches: {len(all_results)} ({n_new} new)", file=sys.stderr)
+    print(f"Total matches: {len(all_results)} ({n_new} new) | "
+          f"scanned {len(stats)} companies in {total_secs:.1f}s", file=sys.stderr)
     if errors:
         print(f"Errors: {len(errors)} - see report", file=sys.stderr)
-# === Adapter coverage (refreshed 2026-05-24) ==================================
-# 22 companies automated across 10 adapter types; 0 remain in MANUAL_CHECK.
+# === Adapter coverage (refreshed 2026-06-01) ==================================
+# 25 companies automated across 13 adapter types; MANUAL_CHECK is empty.
 #
 # Automated (COMPANIES above):
 #   workday          nvidia, novartis
 #   ashby            kraken, openai, confluent
-#   greenhouse       anthropic, gitlab, clickhouse, grafana
+#   greenhouse       anthropic, gitlab, grafana
 #   pcsx             microsoft (Eightfold position-search endpoint)
-#   wp_ajax          sygnum (WordPress admin-ajax JSON)
-#   smartrecruiters  metgroup, vitol, ldc
+#   smartrecruiters  metgroup, ldc
 #   rss              bis (vacancies.rss — RSS 1.0/RDF)
 #   getro            coinbase_ventures (web3 portfolio network, collection 1625)
 #   onlyfy           bitcoin_suisse (onlyfy.jobs ajax_list HTML fragment)
-#   playwright       google, apple, meta, roche, cisco (headless browser, 3-15s each)
+#   lever            palantir, quantco (api.lever.co; QuantCo slug is "quantco-")
+#   json             swissgrid (Magnolia /.rest/cloud/component-data)
+#   sbb              sbb (company.sbb.ch AEM jobfilter.results.json)
+#   bkw              bkw (jobs.bkw.com PMS structureddata API)
+#   playwright       google, apple, meta, roche, cisco, ruag (headless browser, 3-15s each)
 #
-# Since the 2026-05-21 probe, six originally-manual sites moved to automated:
-# Google/Apple/Meta/Roche/Cisco via the playwright adapter, Microsoft via pcsx, and
-# Sygnum via its WordPress AJAX endpoint. BIS was added via the new rss adapter, the
-# Coinbase Ventures web3 portfolio network via the new getro adapter, and Bitcoin Suisse
-# via the new onlyfy adapter (its bitcoinsuisse.com page is a JS SPA, but the underlying
-# onlyfy.jobs ATS serves a plain HTML list with locations). IBM Research and Sonova were
-# dropped from the target list (no API / low fit; Sonova is MedTech, off-thesis).
+# 2026-06-01 list review (verified live):
+# - Palantir (lever): 221 postings, US/London-heavy so Swiss/Schwyz roles are rare but
+#   self-surface (FDSE/Deployment-Strategist titles map to his FDE drafts).
+# - Swissgrid (json): Magnolia CMS endpoint; placeOfWork is bare city, so loc_suffix tags
+#   it Switzerland for the CH filter. ~13 roles incl. Data Scientist / Applied-ML.
+# - RUAG (playwright + page_param): Drupal portal, 20 jobs/page, paginated ?page=N. Page 0
+#   is apprenticeship-heavy; eng roles (DevOps/Data/Software) are on later pages, so we
+#   page through (max_pages). ENG_TITLE_FILTER cuts the Lehrstelle bulk. ⚠️ DE-citizen
+#   limits on RUAG classified roles — verify per-role.
+# - SBB (sbb): correct host is company.sbb.ch (not company-jobs.sbb.ch). Flat JSON list;
+#   fetch_sbb replicates the user's IT + Bern-region filter. German/generic titles, so a
+#   _score_floor keeps the pre-filtered results visible. ⚠️ DE-citizen limits possible.
+# - BKW (bkw): real host is jobs.bkw.com (PMS structureddata API), ~600 group-wide roles;
+#   fetch_bkw keeps Berufsfeld categories Informatik/Trading/Finanzen (IT/data + energy
+#   trading: Quant Risk, Solution Architect Energiehandel, ...). _score_floor as above.
+# - QuantCo (lever, slug "quantco-"): ~16 roles, most tagged "Europe" (hybrid; Zürich is
+#   QuantCo's continental hub), surfaced via the EU-wide rule in location_matches. Strong:
+#   AI Engineer; medium: Cloud Engineer, AI Applied Scientist, Data Scientist, Quant
+#   Researcher, Software Engineer. Interns/frontend suppressed by NEGATIVE_KEYWORDS.
+# The Bern/Thun tier intentionally relaxes the comp bar (see user_comp_bar memory).
 #
-# Note: the Coinbase Ventures board (getro) covers PORTFOLIO companies, not Coinbase
-# itself. Coinbase-the-employer was dropped (mass layoffs / hiring freeze as of 2026-05;
-# re-add coinbase.com/careers if they reopen). AMINA Bank was dropped (poor Glassdoor).
+# MANUAL_CHECK is empty — every target company is automated. Dropped 2026-06-01: BFH
+# (academic FH pay below the relaxed Bern/Thun floor, research-leaning, 403s anyway) and
+# Dialectic (~50-person crypto VC, 0 open roles; crypto already covered by Kraken / Bitcoin
+# Suisse / Coinbase Ventures).
 #
-# MANUAL_CHECK is now empty — every current target company is automated.
+# Earlier history: Google/Apple/Meta/Roche/Cisco automated via playwright; Microsoft via
+# pcsx; BIS via rss; Coinbase Ventures via getro; Bitcoin Suisse via onlyfy. Dropped:
+# ClickHouse, Vitol, Sygnum (Glassdoor/comp red flags), IBM Research + Sonova (low fit),
+# Coinbase-the-employer (hiring freeze), AMINA (poor Glassdoor), Canonical (pay+culture).
+# The Coinbase Ventures board (getro) covers PORTFOLIO companies, not Coinbase itself.
 # ==============================================================================