From raw webpage to clean dataset: how Minexa API handles the full extraction pipeline
- Minexa.ai

- 4 days ago
Most data extraction pipelines have the same weak point: the gap between fetching a page and getting usable data out of it. Crawling is solved. Rendering is mostly solved. The part that still costs engineering time is turning raw HTML into a consistent, structured output that downstream systems can actually use.
The Minexa API is built specifically for that last step, and it handles more of the pipeline than most developers expect going in.
The scraper is the foundation
Before any API call happens, a scraper needs to exist. You train it once using the Minexa Chrome extension by browsing to a representative page, confirming what Minexa detected, and saving the configuration. That process takes a few minutes the first time.
What comes out of that process is a scraper_id — a stable numeric identifier that represents the extraction logic for that page type. Every API call you make afterward references that ID. The scraper knows which fields to extract, where they sit in the page structure, and how to handle the layout. You do not rewrite that logic per request.
This matters at scale. One scraper trained on a single product page works identically across 500,000 product pages with the same structure. The setup cost does not grow with volume.
What a basic API request looks like
The extraction endpoint is https://api.minexa.ai/data. You send a POST request with a JSON body. Here is a minimal working example:
POST https://api.minexa.ai/data
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json

{
  "scraper_id": 4731,
  "columns": "top_20",
  "urls": [
    "https://example.com/listings/page-1",
    "https://example.com/listings/page-2"
  ],
  "scraping": {
    "js_render": true
  },
  "threads": 5
}

Breaking this down: scraper_id points to the trained configuration. columns tells the API which fields to return. urls is the list of pages to process. scraping controls how pages are fetched. threads sets how many pages are processed in parallel.
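As a rough illustration, the same request can be assembled in Python using only the standard library. The endpoint, headers, and body fields are taken from the example above; the helper name build_request is an assumption made for this sketch, and the response format is not covered here, so the code stops short of sending or parsing anything.

```python
import json
import urllib.request

API_URL = "https://api.minexa.ai/data"  # endpoint given in the article


def build_request(api_key, payload):
    """Assemble the POST request without sending it (hypothetical helper)."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same body as the minimal example in the article.
payload = {
    "scraper_id": 4731,
    "columns": "top_20",
    "urls": [
        "https://example.com/listings/page-1",
        "https://example.com/listings/page-2",
    ],
    "scraping": {"js_render": True},
    "threads": 5,
}

request = build_request("YOUR_API_KEY", payload)

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request, timeout=60) as response:
#     body = json.loads(response.read())
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any credits are spent.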
Controlling which fields come back
The columns parameter gives you two modes. You can pass a top_N value like top_20 or top_60, which tells the API to return the highest-ranked fields the scraper identified — useful when you want broad coverage without specifying every field name. Or you can pass an explicit list of named fields if you only want specific columns in the output.
Named fields are useful when your downstream system expects a fixed schema. The top_N approach is useful during exploration or when the page has many fields and you want Minexa to surface the most relevant ones automatically.
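The two modes can be sketched as a small helper that produces the value for columns. This assumes the explicit mode is sent as a JSON array of field-name strings, which the article implies but does not show; the helper name and the field names title and price are hypothetical.

```python
def columns_value(top_n=None, names=None):
    """Return a value for the "columns" parameter (hypothetical helper).

    Exactly one of top_n or names must be given.
    """
    if (top_n is None) == (names is None):
        raise ValueError("pass exactly one of top_n or names")
    if top_n is not None:
        if top_n < 1:
            raise ValueError("top_n must be a positive integer")
        return f"top_{top_n}"  # e.g. "top_20" or "top_60", as in the article
    if not names:
        raise ValueError("names must contain at least one field")
    # Assumption: an explicit selection is a plain list of field names.
    return list(names)


exploring = columns_value(top_n=20)
fixed_schema = columns_value(names=["title", "price"])
```

Here exploring holds the broad top_N selection and fixed_schema holds the named-field selection for a pipeline that expects a stable set of columns.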
Threads and throughput
The threads parameter controls concurrency. If you pass 10 threads, the API processes 10 pages simultaneously. This directly affects how fast a large batch completes. Thread limits are set at the plan level, so the ceiling depends on which plan your account is on.
For a job with 2,000 URLs and 10 threads, the API processes 10 pages at a time throughout the run, which works out to roughly 200 rounds of work rather than 2,000 sequential fetches. For time-sensitive pipelines, maximising threads within your plan limit is the most direct way to reduce total job duration.
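A back-of-the-envelope estimate makes the effect of threads concrete. The sketch below assumes every page takes the same time and ignores queueing and network overhead, so it is a simplified model rather than a prediction; the 3 seconds per page is an invented figure, not a measured one.

```python
import math


def estimate_job(n_urls, threads, seconds_per_page):
    """Return (rounds, seconds) for a batch under a uniform-page-time model."""
    rounds = math.ceil(n_urls / threads)  # batches of `threads` pages each
    return rounds, rounds * seconds_per_page


# The article's example: 2,000 URLs at 10 threads.
rounds_10, seconds_10 = estimate_job(2000, 10, 3.0)  # 200 rounds, 600.0 s
# Same job at 5 threads for comparison.
rounds_5, seconds_5 = estimate_job(2000, 5, 3.0)     # 400 rounds, 1200.0 s
```

Under this model, doubling the thread count halves the estimated duration, which is why the plan-level thread ceiling matters for large batches.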
Nested data in the JSON output
When a page contains repeated sub-elements within a single result — multiple images, a list of features, several review snippets — the API returns these as nested arrays inside the result object. Each top-level result maps to one row, and nested fields appear as arrays within that row.
If your pipeline needs flat rows, you handle the unnesting on your side. The API preserves the structure as it exists on the page rather than flattening it by default, which gives you more control over how you model the data downstream.
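Unnesting on your side can be as simple as repeating the parent fields once per nested element. The sketch below uses an invented result object; the field names title, price, and images are hypothetical and the real response schema may differ.

```python
def unnest(row, field):
    """Explode one nested array field into one flat row per element."""
    values = row.get(field) or [None]  # keep the row even if the array is empty
    base = {key: value for key, value in row.items() if key != field}
    return [{**base, field: value} for value in values]


# Hypothetical single result with one nested array.
result = {
    "title": "Oak desk",
    "price": "249.00",
    "images": ["a.jpg", "b.jpg"],
}

flat_rows = unnest(result, "images")
# flat_rows now has two rows, one per image, each repeating title and price.
```

Rows with an empty or missing array are kept with a None value so that no parent record disappears during flattening; whether that is the right choice depends on how the downstream table is modelled.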
Skipping live crawling with file_urls
If you have already collected raw HTML — from your own crawler, a cache, or a prior fetch — you can pass it directly to the API using the file_urls parameter instead of urls. The API will run extraction against the supplied HTML without making any live requests to the target site.
This is useful in two situations: when you want to separate the crawling and extraction steps in your pipeline, and when you want to re-extract from pages you have already fetched without consuming additional crawl credits.
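A payload builder can make the choice between live crawling and pre-fetched HTML explicit. This sketch assumes file_urls takes a list of locations where the stored HTML can be read and that it replaces urls in the body, as the wording "instead of urls" suggests; the exact value format is not given in the article, and the storage URL below is invented.

```python
def build_payload(scraper_id, columns, urls=None, file_urls=None):
    """Build a request body using either live URLs or pre-fetched HTML files."""
    if bool(urls) == bool(file_urls):
        raise ValueError("pass exactly one of urls or file_urls")
    payload = {"scraper_id": scraper_id, "columns": columns}
    if urls:
        payload["urls"] = list(urls)            # pages fetched live by the API
    else:
        # Assumption: a list of locations of already-collected HTML.
        payload["file_urls"] = list(file_urls)
    return payload


# Re-extracting from a cached page (hypothetical storage location).
cached = build_payload(
    4731,
    "top_20",
    file_urls=["https://storage.example.com/html/page-1.html"],
)
```

Rejecting a call that supplies both lists, or neither, catches a common mix-up before the request is sent.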
How the API signals problems
When a page cannot be extracted — because it did not load, the structure did not match, or the scraper found nothing — the API returns an explicit signal rather than a silent empty result. This means your pipeline can detect failures and handle them: retry the URL, log it for review, or route it to a fallback process.
Silent failures are the hardest kind to catch in a data pipeline. A system that returns wrong data without flagging it requires downstream validation to catch errors that should never have passed through. The Minexa API avoids this by being explicit about what failed and why.
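On the consuming side, explicit failure signals make it straightforward to route each page to the right follow-up. The per-page shape used below (url, status, error) and the reason code load_failed are entirely hypothetical, since the article does not show the error format; the real field names should be taken from the API reference.

```python
# Hypothetical reason code for a page that did not load and is worth retrying.
RETRYABLE = {"load_failed"}


def triage(items):
    """Split per-page results into extracted rows, retry URLs, and review URLs."""
    ok, retry, review = [], [], []
    for item in items:
        if item.get("status") == "ok":
            ok.append(item)
        elif item.get("error") in RETRYABLE:
            retry.append(item["url"])    # transient: fetch again later
        else:
            review.append(item["url"])   # structural: needs a human look
    return ok, retry, review


# Invented sample of per-page results.
items = [
    {"url": "https://example.com/a", "status": "ok", "data": {"title": "A"}},
    {"url": "https://example.com/b", "status": "failed", "error": "load_failed"},
    {"url": "https://example.com/c", "status": "failed", "error": "no_match"},
]

ok, retry, review = triage(items)
```

Separating transient failures from structural ones keeps retries cheap and sends layout mismatches to review instead of looping on them.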
Two-layer extraction via the API
The Chrome extension makes it easy to configure detail-page scraping visually. Via the API, you handle the two-layer pattern yourself: first run a job to extract URLs from list pages, then pass those URLs into a second job using a scraper trained on the detail page structure.
This gives you full control over the flow. You can filter, deduplicate, or prioritise URLs between the two steps. The API does not impose a fixed pipeline shape — it processes whatever URLs you give it using whatever scraper you point it at.
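The step between the two jobs might look like the sketch below, which filters and deduplicates URLs from the list-page output before building the body for the detail-page job. The field name detail_url, the second scraper ID 5120, and the filtering rule are all hypothetical.

```python
def prepare_detail_urls(list_rows, url_field="detail_url", keep=lambda url: True):
    """Collect unique detail URLs from list-page rows, preserving order."""
    seen = set()
    urls = []
    for row in list_rows:
        url = row.get(url_field)
        if url and url not in seen and keep(url):
            seen.add(url)
            urls.append(url)
    return urls


# Invented output of the first job (list pages).
list_rows = [
    {"title": "Flat A", "detail_url": "https://example.com/item/1"},
    {"title": "Flat A (dup)", "detail_url": "https://example.com/item/1"},
    {"title": "Flat B", "detail_url": "https://example.com/sold/2"},
    {"title": "Flat C", "detail_url": "https://example.com/item/3"},
]

detail_urls = prepare_detail_urls(list_rows, keep=lambda url: "/sold/" not in url)

# Body for the second job, using a scraper trained on the detail page.
detail_payload = {
    "scraper_id": 5120,  # hypothetical ID of the detail-page scraper
    "columns": "top_20",
    "urls": detail_urls,
}
```

Because the intermediate step is ordinary code, prioritisation or rate limiting can be added here without changing either scraper.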
Putting it together
The Minexa API is designed to slot into an existing pipeline without requiring you to rebuild around it. You train a scraper once, reference it by ID in every subsequent call, and get structured JSON back. The scraping configuration layer handles JavaScript rendering, proxy routing, and anti-bot handling for you; you opt in per request rather than configuring it globally.
For developers who want to go further, the full API reference covers every parameter, response schema, and error format in detail.
