
10 capabilities of the Minexa API that most extraction pipelines never use

Most developers who integrate the Minexa API use it the same way: train a scraper, pass some URLs, get structured JSON back. That covers the basics. But there is a wider set of capabilities built into the API that rarely gets used, either because it is not obvious from the docs or because the default setup already works well enough that no one goes looking further.

This article covers ten of those capabilities, with enough detail to know when each one is worth reaching for.

1. The scraper ID is the only thing you need to reuse a trained scraper forever

When you train a scraper using the Minexa Chrome extension, the result is a stable numeric ID. That ID is what you pass in every API call. There is no re-training step, no schema to define, and no selector to maintain. The scraper ID encodes the full extraction configuration, and it works on any page that shares the same structure as the one it was trained on. One training session, one ID, unlimited reuse.

2. You can request auto-ranked fields without knowing what the page contains

The columns parameter accepts two modes. The first is a named list, where you specify exactly which fields you want by name. The second is a top-N mode, where you ask Minexa to surface the most relevant fields it finds and return the top results ranked by relevance. If you are building a pipeline against a new page type and are not yet sure what fields are available, passing something like top_30 lets you discover the schema before committing to a fixed field list. This is useful during development and for exploratory pipelines.
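As a rough illustration of the two modes, the helper below builds a value for the columns parameter. Only the parameter name and the top_30 form come from this article; the helper itself and the way a named list is encoded are assumptions, so check the API docs for the exact wire format.

```python
def columns_value(names=None, top_n=None):
    """Return a value for `columns` in one of its two modes.

    Hypothetical helper: the list encoding for named fields is an assumption.
    """
    if (names is None) == (top_n is None):
        raise ValueError("pass exactly one of names or top_n")
    if top_n is not None:
        if top_n < 1:
            raise ValueError("top_n must be a positive integer")
        return f"top_{top_n}"  # top-N mode, e.g. "top_30"
    return list(names)  # named mode: the explicit field names


discovery = columns_value(top_n=30)                 # explore a new page type
production = columns_value(names=["title", "price"])  # fixed field list
```

A typical flow would be to run the discovery form once, inspect which fields come back, then switch to the named form for the production pipeline.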

3. A single API request can process up to 50,000 URLs

Batch processing is built into the API at a scale that removes the need for queue management in most cases. You can pass up to 50,000 URLs in a single request, and the API handles distribution across threads based on your plan limits. For large jobs like processing an entire product catalog or a full directory export, this means you can send one request and retrieve paginated results rather than managing batches yourself.
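If a job is larger than the per-request limit, you still need a small amount of client-side splitting. The sketch below is one way to do that; the 50,000 figure is the limit stated above, while the function name and the example URLs are made up for illustration.

```python
from typing import Iterator, List, Sequence

MAX_URLS_PER_REQUEST = 50_000  # per-request limit stated in this article


def chunk_urls(urls: Sequence[str],
               size: int = MAX_URLS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


# Example: a catalog slightly larger than two full requests.
catalog = [f"https://example.com/product/{i}" for i in range(120_001)]
request_sizes = [len(chunk) for chunk in chunk_urls(catalog)]
```

Each chunk then becomes the URL list of one request, so a catalog of this size needs three requests instead of a queue.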

Ready to build your first pipeline? The Minexa API documentation covers request structure, authentication, and response formats in full detail. Read the API docs or explore Minexa to get started.

4. Credit consumption scales with rendering complexity, not just page count

One credit equals one page at baseline. Pages that need JavaScript rendering, load heavy dynamic content, or sit behind anti-bot protection consume more credits per page. The API exposes parameters that control which rendering mode is used, so if you are processing a mix of simple and complex pages, you can apply the right mode per URL rather than using the most expensive setting across the board. Knowing this lets you estimate costs more accurately before running large jobs.
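A back-of-the-envelope estimate can be scripted before a large run. In the sketch below, only the baseline of one credit per page comes from this article. The mode names and the other two rates are placeholders invented for the example, so replace them with the figures for your plan.

```python
from collections import Counter

# Only "baseline": 1 is taken from the article.
# The other mode names and rates are PLACEHOLDERS, not real pricing.
CREDITS_PER_PAGE = {"baseline": 1, "js_rendered": 5, "anti_bot": 10}


def estimate_credits(page_modes, rates=CREDITS_PER_PAGE):
    """Sum the per-page rate over an iterable of rendering-mode labels."""
    counts = Counter(page_modes)
    return sum(rates[mode] * n for mode, n in counts.items())


# Example: 100 simple pages plus 10 pages that need JavaScript rendering.
job = ["baseline"] * 100 + ["js_rendered"] * 10
total = estimate_credits(job)
```

Comparing this total against the cost of running every page in the most expensive mode shows how much a per-URL mode choice would save.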

5. You can supply pre-scraped HTML to skip live crawling entirely

The file_urls parameter lets you pass the location of HTML files you have already collected rather than having the API fetch live pages. This is useful when you have your own crawling infrastructure, when you need to extract from archived HTML, or when you want to separate the crawling and extraction steps for cost or control reasons. The API applies the same extraction logic against the supplied HTML, so the output format is identical to a live crawl.

6. List page and detail page data merge into a single unified output

When a scraper is trained to follow links from a list page into individual detail pages, the API returns both layers of data combined. Each row in the output contains the fields from the list entry alongside the fields extracted from that entry's detail page. There is no join step required on your end. The nesting is reflected in the JSON structure, and accessing a nested value follows a straightforward path into the response object.
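To make the path access concrete, here is a small helper that walks into a merged row. The row shown is invented for illustration, including the "detail" key and every field name; the real response shape is defined by your scraper and the API docs.

```python
from typing import Any, Sequence


def get_path(row: dict, path: Sequence[str], default: Any = None) -> Any:
    """Follow `path` key by key into nested dicts, or return `default`."""
    current: Any = row
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical merged row: list-page fields at the top level,
# detail-page fields nested under an assumed "detail" key.
row = {
    "title": "Blue Kettle",
    "price": "24.99",
    "detail": {"sku": "BK-100", "specs": {"capacity": "1.7 L"}},
}

capacity = get_path(row, ["detail", "specs", "capacity"])
missing = get_path(row, ["detail", "warranty"], default="n/a")
```

Returning a default instead of raising keeps a long export running when some detail pages lack a field.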

7. Paginated API responses use a next token for checkpoint-safe retrieval

For large jobs, the API returns results in pages rather than all at once. Each response includes a next token that you pass in the following request to retrieve the next batch. This makes it straightforward to write a loop that saves results incrementally, which matters for jobs where you cannot afford to lose progress if something interrupts mid-run. A checkpoint-based Python script that saves each batch to JSON or CSV as it arrives is a common pattern for this.
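One possible shape for that loop is sketched below. The HTTP call is passed in as a function so the checkpoint logic stays separate from the request details. The response keys "results" and "next" are assumptions made for the example, not confirmed field names.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def fetch_all(fetch_page: Callable[[Optional[str]], dict],
              out_path: Path, state_path: Path) -> int:
    """Append each batch to a JSON Lines file, saving the next token as a checkpoint.

    `fetch_page(token)` must return a dict with "results" (a list of rows)
    and "next" (a token, or a falsy value on the last page). Both key names
    are assumed here.
    """
    token = None
    if state_path.exists():  # resume from an interrupted run
        token = json.loads(state_path.read_text())["next"]
    saved = 0
    while True:
        page = fetch_page(token)
        with out_path.open("a", encoding="utf-8") as f:
            for row in page.get("results", []):
                f.write(json.dumps(row) + "\n")
                saved += 1
        token = page.get("next")
        if not token:
            state_path.unlink(missing_ok=True)  # finished, clear checkpoint
            break
        state_path.write_text(json.dumps({"next": token}))
    return saved
```

Because rows are written before the token is saved, an interruption between those two steps can repeat one batch on resume, so deduplicate downstream if that matters for your job.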

8. Geo-targeted content is accessible without building proxy infrastructure

Some pages return different content depending on where the request originates. The Minexa API handles geo-targeting through a built-in parameter that accepts ISO country codes. You specify the country, and the API routes the request accordingly. There is no proxy setup, no IP pool to manage, and no additional configuration. This is particularly relevant for price monitoring across regions, where the same product URL returns different pricing depending on the visitor's location.
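For regional price monitoring, this usually means sending the same URL once per country. The sketch below builds one request body per country code. The key names "scraper_id", "urls", and "country" are assumptions, as is the choice of two-letter ISO 3166-1 codes; the article only says that ISO country codes are accepted.

```python
import re


def payloads_by_country(scraper_id, url, countries):
    """Build one request body per country code for the same product URL.

    Key names and the two-letter code format are assumptions for this sketch.
    """
    bodies = {}
    for code in countries:
        code = code.upper()
        if not re.fullmatch(r"[A-Z]{2}", code):
            raise ValueError(f"not a two-letter country code: {code!r}")
        bodies[code] = {
            "scraper_id": scraper_id,
            "urls": [url],
            "country": code,  # hypothetical name for the geo parameter
        }
    return bodies


bodies = payloads_by_country(
    123, "https://example.com/product/42", ["us", "de", "jp"]
)
```

Keying the result by country code makes it easy to line up the returned prices by region afterwards.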

9. JavaScript-heavy and dynamically loaded pages are handled automatically

Pages that require JavaScript execution to render their content are handled by the API without any extra configuration on your end. You do not need to run a headless browser, manage a rendering service, or adjust your request structure. The API determines the appropriate rendering approach based on the page and the parameters you pass. For pages with post-load delays, there is a wait parameter that holds the request until dynamic content has finished loading before extraction begins.

Want to see how the API fits into a production pipeline? The full request structure, parameter reference, and response schema are documented at minexa.stoplight.io.

10. Crawling, rendering, and extraction run as a single API call

A conventional extraction pipeline requires separate components: a crawler to fetch pages, a renderer to handle JavaScript, a proxy layer for access, and an extraction layer to parse the output. The Minexa API consolidates all of these into a single POST request. You pass a scraper ID, a list of URLs, and your column preferences. The response is structured JSON. There is no infrastructure to maintain, no rendering service to configure, and no parsing logic to write. For teams that want extraction without the surrounding stack, this is the practical value of the API approach.
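In code, the whole pipeline reduces to assembling one POST request. The sketch below uses only the Python standard library and stops short of sending anything. The endpoint URL, the bearer-token header, and the JSON key names are placeholders, since the article does not specify them; the real values are in the API documentation.

```python
import json
import urllib.request

# Placeholder endpoint, NOT the real Minexa URL.
API_URL = "https://api.example.com/extract"


def build_request(api_key, scraper_id, urls, columns):
    """Assemble a single POST request carrying scraper ID, URLs, and columns.

    The key names and the auth header scheme are assumptions for this sketch.
    """
    body = json.dumps({
        "scraper_id": scraper_id,
        "urls": list(urls),
        "columns": columns,
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_request("YOUR_API_KEY", 123,
                    ["https://example.com/listing"], "top_30")
```

Passing `req` to `urllib.request.urlopen` would send it; everything a conventional stack splits across a crawler, renderer, proxy layer, and parser is expressed in that one request body.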

These capabilities are all available today within the standard API. The documentation at minexa.stoplight.io covers each parameter in detail, including request and response examples for the patterns described above.
