EU & HU Tech Tender Scraping Spec

Goal

Build a pipeline that ingests active EU and Hungarian public-sector tech tenders, enriches them with historical pricing benchmarks from awarded contracts, and surfaces opportunities where a competitive bid is viable.

Scope

  • Geography: EU-wide (above-threshold) + Hungary-only (below-threshold).
  • Sectors: IT services, software, hardware, telecoms.
  • CPV roots: 72000000, 48000000, 30200000, 32000000, 51610000.
  • Out of scope: non-EU tenders, defense procurement (separate regime, Dir. 2009/81/EC).

Data sources

Tier 1: TED (EU above-threshold)

Tenders Electronic Daily is the EU Official Journal supplement for procurement. Covers every above-threshold tender from all 27 member states including HU. Free, no auth required for the Search API.

EU procurement thresholds (2026, indicative):

Buyer type                Goods/Services   Works
Central authorities       ~€143k           ~€5.5M
Sub-central authorities   ~€221k           ~€5.5M
Utilities/defense         ~€443k           ~€5.5M

Verify current thresholds against the latest Commission delegated regulations before going live. Thresholds are revised every two years, and the figures above match the 2024-2025 cycle, so recheck them for 2026.

Channels:

  • Search API v3 at https://api.ted.europa.eu/v3/notices/search. No auth needed. Same Expert Search syntax as the website UI. Swagger at https://api.ted.europa.eu/swagger.
  • Bulk XML packages at https://ted.europa.eu/packages/daily/{OJ-S-id} and https://ted.europa.eu/packages/monthly/{YYYY-M}. No login. Daily packages drop on each OJ S release day (5x/week).
  • CSV subset at data.europa.eu/euodp/en/data/dataset/ted-csv. Goes back to ~2009. Lossy vs XML but easier to ingest.
  • RSS feeds broken down by business sector. Useful for low-volume monitoring, not bulk reuse.
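
A minimal sketch of how an incremental Search API request might be built. The endpoint URL comes from the list above. The request body shape (query, fields, page, limit), the field names (classification-cpv, publication-date, publication-number, notice-title, buyer-name), the page size, and the YYYYMMDD date format are all assumptions to check against the Swagger definition before use.

```typescript
// Sketch only: body shape, field names and query syntax are ASSUMPTIONS,
// not confirmed against the TED Swagger definition.
const TED_SEARCH_URL = "https://api.ted.europa.eu/v3/notices/search";

interface TedSearchRequest {
  query: string;    // Expert Search expression
  fields: string[]; // fields to return per notice (assumed names)
  page: number;
  limit: number;
}

// Format a Date as YYYYMMDD in UTC (assumed date format for the query).
function toTedDate(d: Date): string {
  return d.toISOString().slice(0, 10).replace(/-/g, "");
}

// Build a request for notices under the given CPV roots published since `since`.
function buildTedSearchRequest(cpvRoots: string[], since: Date, page = 1): TedSearchRequest {
  const cpvClause = cpvRoots.map((c) => `classification-cpv=${c}`).join(" OR ");
  return {
    query: `(${cpvClause}) AND publication-date>=${toTedDate(since)}`,
    fields: ["publication-number", "notice-title", "buyer-name", "classification-cpv"],
    page,
    limit: 100, // assumed page size
  };
}

// POST the request as JSON; no auth header, per the note above.
async function searchTed(req: TedSearchRequest): Promise<unknown> {
  const res = await fetch(TED_SEARCH_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`TED search failed: HTTP ${res.status}`);
  return res.json();
}
```

Keeping the request builder separate from the network call makes the query logic testable offline, which matters while the anonymous rate limit is still unverified.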

Schemas:

  • eForms (post 14 Nov 2022): 8-digit publication numbers, defined by Reg. (EU) 2019/1780.
  • Legacy TED schema (pre 14 Nov 2022): 6-digit publication numbers, defined by Reg. (EU) 2015/1986.

Both coexist in current archives and via the Search API. The eForms SDK is at github.com/OP-TED/eForms-SDK.
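
Since both schemas coexist, each notice has to be routed to the right parser. One possible discriminator is the digit count of the publication number described above. The sketch below assumes publication numbers look like "<digits>-<year>" (for example "00123456-2023"), which is an assumption about the string format; the function and type names are hypothetical.

```typescript
type NoticeSchema = "eforms" | "legacy";

// Assumes the "<digits>-<year>" format; 8 digits means eForms,
// anything shorter is treated as the legacy TED schema.
function detectSchema(publicationNumber: string): NoticeSchema {
  const match = /^(\d{1,8})-(\d{4})$/.exec(publicationNumber.trim());
  if (!match) {
    throw new Error(`Unrecognized publication number: ${publicationNumber}`);
  }
  const digits = match[1] ?? "";
  return digits.length === 8 ? "eforms" : "legacy";
}
```

Throwing on an unrecognized format, rather than guessing, keeps malformed IDs out of the dedupe stage.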

Tier 2: Hungarian below-threshold (the bulk by count)

EU thresholds exclude most HU contracts by count. Two domestic systems cover the rest.

EKR: Elektronikus Közbeszerzési Rendszer (ekr.gov.hu)

  • Mandatory for all HU public procurement procedures since 15 April 2018 (Kbt. + 424/2017. Korm. rendelet).
  • Public search interface without login.
  • Free hirdetményfigyelő (notice-watcher with email alerts), no registration required, available since June 2024.
  • No documented public REST API. Plan on HTML scraping the search results.
  • Filterable by CPV, ajánlatkérő (buyer), tárgy (subject), teljesítés helye (place of performance), eljárás típusa (procedure type).

Közbeszerzési Hatóság (kozbeszerzes.hu)

  • Közbeszerzési Értesítő: the Hungarian official gazette of tenders. Path: /adatbazis/keres/hirdetmeny/. HTML-scrapeable.
  • Publikus CoRe (Nyilvános Elektronikus Szerződéstár): public contract registry. Every awarded HU contract above the publication threshold, with value and supplier. This is the gold mine for price benchmarking, not for active tenders.
  • Publikus KBA (Közbeszerzési Adatbázis): archive for pre-2018 procedures.
  • Hirdetmény nélküli tárgyalásos eljárás decisions: also published here.

Tier 3: aggregators (use with skepticism)

For an active bidding pipeline, build directly on Tier 1 + 2. Tier 3 is only worth considering for ad-hoc cross-country analytics when you are not building the pipeline yourself.

CPV reference (tech subset)

CPV Code   Description
72000000   IT services
72200000   Software programming and consultancy
72300000   Data services
72400000   Internet services
72416000   Application service providers
48000000   Software packages
30200000   Computer equipment & supplies
32000000   Radio, TV, communications, telecom equipment
51610000   Installation services for computers & IT

Full CPV: simap.ted.europa.eu/web/simap/cpv.
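
Because CPV is hierarchical, filtering by the scope roots comes down to a prefix match. The sketch below treats a root's significant prefix as its 8 digits with trailing zeros stripped, which is a heuristic rather than an official algorithm. It also assumes codes may arrive with a check-digit suffix such as "72000000-5". Function names are hypothetical.

```typescript
const CPV_ROOTS = ["72000000", "48000000", "30200000", "32000000", "51610000"];

// Keep the 8-digit code, dropping an optional check-digit suffix like "-5".
function normalizeCpv(code: string): string {
  const m = /^(\d{8})(-\d)?$/.exec(code.trim());
  if (!m) throw new Error(`Invalid CPV code: ${code}`);
  return m[1] ?? "";
}

// Significant prefix of a root: trailing zeros removed, but never shorter
// than the 2-digit division (so "50000000" yields "50", not "5").
function rootPrefix(root: string): string {
  const cpv = normalizeCpv(root);
  const stripped = cpv.replace(/0+$/, "");
  return stripped.length >= 2 ? stripped : cpv.slice(0, 2);
}

// True when the code sits at or below any of the given roots.
function isInScope(code: string, roots: string[] = CPV_ROOTS): boolean {
  const cpv = normalizeCpv(code);
  return roots.some((r) => cpv.startsWith(rootPrefix(r)));
}
```

With these roots, 72416000 is in scope through the "72" prefix, while 30100000 is not, because the 30200000 root only covers codes starting with "302".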

Strategy: why “undercut by 10%” is the wrong primitive

Three concrete reasons the published becsült érték (estimated value) is a poor anchor.

  1. Estimated value is padded. HU buyers routinely add a 20-40% buffer so the procedure doesn’t have to be re-run if bids overshoot. Undercutting an inflated number tells you nothing about the actual market-clearing price.
  2. Tech tenders use MEAT. Most HU tech tenders apply összességében legelőnyösebb ajánlat (most economically advantageous tender), not lowest price. Price weight is typically 50-70%. A 10% price advantage at 60% weight is worth roughly 6 points out of 100 on the total score, easily erased by any technical-score gap.
  3. Aránytalanul alacsony ár (abnormally low price, Kbt. §72). The buyer must investigate any bid that looks abnormally low, and the bidder has to justify its cost structure on demand. Reflexive undercutting triggers this rule and gets you excluded if the justification is thin.
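
The arithmetic behind point 2 can be sketched as follows. The price formula here is an inverse-proportional one (lowest bid gets full price points, others get lowest / own price times the weight), which is an assumption for illustration; the tender documents define the actual scoring method. The 100M bid value and the 60/40 weight split are illustrative numbers, not data.

```typescript
// Inverse-proportional price scoring (ASSUMED formula): the lowest bid
// earns the full price weight, other bids earn lowest/own of it.
function priceScore(bid: number, lowestBid: number, priceWeight: number): number {
  return (lowestBid / bid) * priceWeight;
}

// Points gained on price by bidding `discount` (e.g. 0.10) below a single
// rival, in a two-bidder comparison.
function priceEdge(rivalBid: number, discount: number, priceWeight: number): number {
  const ourBid = rivalBid * (1 - discount);
  const ours = priceScore(ourBid, ourBid, priceWeight);
  const theirs = priceScore(rivalBid, ourBid, priceWeight);
  return ours - theirs;
}

// 10% undercut at 60% price weight: about 6 points out of 100.
const edge = priceEdge(100000000, 0.1, 60);

// With the remaining 40 points on quality, the rival cancels the price
// edge by outscoring us on about 15% of the quality points.
const qualityWeight = 40;
const breakEvenQualityGap = edge / qualityWeight;
```

Under these assumptions a rival only needs to be 15% better on the technical criteria to neutralize a 10% price cut.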

What works instead: build a per-buyer, per-CPV pricing model from historical award notices (eredménytájékoztató / Contract Award Notice). The awarded value, not the estimated value, is the real market signal. Undercut that by whatever margin still leaves a viable cost structure.
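
One simple form such a model could take is the median ratio of awarded to estimated value, grouped by buyer and CPV. This is a sketch under assumptions: the Award shape, the "buyer|cpv" key, and the choice of the median as the statistic are all illustrative, and the partition key itself is still listed as an open question.

```typescript
// Hypothetical shape for one historical award record.
interface Award {
  buyer: string;
  cpv: string;
  estimatedValue: number; // becsült érték from the original notice
  awardedValue: number;   // value from the award notice
}

function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const s = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid]! : (s[mid - 1]! + s[mid]!) / 2;
}

// Map from "buyer|cpv" to the median awarded/estimated ratio.
function buildRatioModel(awards: Award[]): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const a of awards) {
    if (a.estimatedValue <= 0) continue; // skip unusable records
    const key = `${a.buyer}|${a.cpv}`;
    const ratios = groups.get(key) ?? [];
    ratios.push(a.awardedValue / a.estimatedValue);
    groups.set(key, ratios);
  }
  const model = new Map<string, number>();
  groups.forEach((ratios, key) => model.set(key, median(ratios)));
  return model;
}

// Expected award value for an active tender, or undefined without history.
function expectedAward(
  model: Map<string, number>,
  buyer: string,
  cpv: string,
  estimatedValue: number,
): number | undefined {
  const ratio = model.get(`${buyer}|${cpv}`);
  return ratio === undefined ? undefined : ratio * estimatedValue;
}
```

The median is used here because a single outlier award would skew a mean; returning undefined for unseen buyer and CPV pairs forces the scoring stage to handle missing history explicitly.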

Pipeline design (TypeScript)

 [TED Search API]    [EKR scrape]    [CoRe scrape]
        |                  |               |
        v                  v               v
    active CN          active CN      historical CAN
        \                  |               /
         \                 |              /
          v                v             v
               [Stage: parse + normalize]
                           |
                           v
               [Stage: dedupe by tender ID]
                           |
                           v
               [Stage: enrich with pricing model]
                           |
                           v
               [Stage: score + filter]
                           |
                           v
                        [output]

Stage interface:

Each pipeline step implements a Stage<TIn, TOut> interface with an async process(input: TIn): Promise<TOut> method. Stages compose via a Pipeline runner that chains them sequentially.

  • Source stages: TedSearchSource, EkrScrapeSource, CoReScrapeSource — each returns AsyncIterable<RawTender>.
  • Transform stages: Normalize, Dedupe, EnrichPricing, Score.
  • Sources use p-limit or p-throttle to respect rate limits (TED) and polite scraping cadence (EKR).
  • sax or fast-xml-parser in streaming mode for TED bulk XML decoding.
  • cheerio for EKR and CoRe HTML scraping.
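
The Stage interface and sequential chaining described above can be sketched as follows. Only Stage<TIn, TOut>, process, and RawTender come from this spec; the chain and collect helpers, the RawTender fields, and the stand-in source are hypothetical, and the real sources would add rate limiting and scraping.

```typescript
interface Stage<TIn, TOut> {
  process(input: TIn): Promise<TOut>;
}

// Compose two stages into one that runs them sequentially.
function chain<A, B, C>(first: Stage<A, B>, second: Stage<B, C>): Stage<A, C> {
  return {
    process: async (input: A) => second.process(await first.process(input)),
  };
}

// Hypothetical minimal shape; the real RawTender carries far more fields.
interface RawTender {
  id: string;
  title: string;
}

// Drain an async source into an array so batch stages can work on it.
async function collect<T>(source: AsyncIterable<T>): Promise<T[]> {
  const out: T[] = [];
  for await (const item of source) out.push(item);
  return out;
}

const normalize: Stage<RawTender[], RawTender[]> = {
  async process(tenders) {
    return tenders.map((t) => ({ ...t, title: t.title.trim() }));
  },
};

// Keeps the first occurrence of each tender ID.
const dedupe: Stage<RawTender[], RawTender[]> = {
  async process(tenders) {
    const seen = new Map<string, RawTender>();
    for (const t of tenders) {
      if (!seen.has(t.id)) seen.set(t.id, t);
    }
    return Array.from(seen.values());
  },
};

// Stand-in for TedSearchSource / EkrScrapeSource in this sketch.
async function* fakeSource(): AsyncIterable<RawTender> {
  yield { id: "1", title: " IT support " };
  yield { id: "1", title: "IT support (duplicate)" };
  yield { id: "2", title: "Licences" };
}

async function runPipeline(): Promise<RawTender[]> {
  const pipeline = chain(normalize, dedupe);
  return pipeline.process(await collect(fakeSource()));
}
```

Typing chain with three generics lets the compiler reject a pipeline whose adjacent stages disagree on the intermediate type.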

Implementation notes

  • TED bulk vs API: for backfill use bulk packages (single tarball per day or month, much faster than paginating the API). For incremental near-realtime use the Search API with a rolling publication-date filter.
  • eForms parsing: field mapping differs significantly from legacy TED schema. Build a unified internal Tender type (Zod schema + inferred TS type) and two adapters, not one parser with conditional branches.
  • EKR scraping: session-cookie based; keep a polite rate (1 req/sec) and respect robots.txt. The hirdetményfigyelő email alerts are the legitimate channel for incremental monitoring.
  • CoRe data: primary source for the historical pricing model. Match awarded contracts to their original CN by EKR ID where possible.
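
The two-adapter approach from the eForms note might look like the sketch below. It is dependency-free for brevity, so the Tender type is written by hand rather than inferred from a Zod schema. The LegacyNotice and EformsNotice shapes and all their field names are hypothetical stand-ins for pre-parsed XML, not real schema fields.

```typescript
// Unified internal type (hand-written here; the spec calls for Zod).
interface Money {
  amount: number;
  currency: string;
}

interface Tender {
  id: string;
  source: "ted-eforms" | "ted-legacy";
  title: string;
  cpv: string; // 8-digit code, check digit dropped
  estimatedValue?: Money;
}

// HYPOTHETICAL pre-parsed shapes; real field names differ per schema.
interface LegacyNotice {
  docId: string;
  title: string;
  cpvCode: string;
  value?: string; // e.g. "1 500 000"
  currency?: string;
}

interface EformsNotice {
  publicationNumber: string;
  title: string;
  mainClassification: string;
  estimatedValue?: Money;
}

// One adapter per schema, each a plain function with no shared branching.
function fromLegacy(n: LegacyNotice): Tender {
  const tender: Tender = {
    id: n.docId,
    source: "ted-legacy",
    title: n.title.trim(),
    cpv: n.cpvCode.slice(0, 8),
  };
  const amount = n.value !== undefined ? Number(n.value.replace(/\s/g, "")) : NaN;
  if (Number.isFinite(amount) && n.currency) {
    tender.estimatedValue = { amount, currency: n.currency };
  }
  return tender;
}

function fromEforms(n: EformsNotice): Tender {
  const tender: Tender = {
    id: n.publicationNumber,
    source: "ted-eforms",
    title: n.title.trim(),
    cpv: n.mainClassification.slice(0, 8),
  };
  if (n.estimatedValue) tender.estimatedValue = { ...n.estimatedValue };
  return tender;
}
```

Downstream stages then depend only on Tender, so a schema change on either side touches one adapter and nothing else.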

Open questions / TODO

  • Verify TED Search API v3 anonymous rate limit (not clearly documented).
  • EKR ToS review: legal status of scraping public search results in HU.
  • CoRe export format: HTML only, or is there a structured download?
  • eForms vs legacy TED schema field mapping for the unified Tender type.
  • Currency normalization: HU tenders publish in HUF, EUR, occasionally USD/CZK.
  • Storage decision: keep raw XML (forensic) or normalized JSON only (cheap)?
  • Pricing model: per-buyer vs per-CPV vs per-region as primary partition key?

References