How the Minexa API turns any webpage into structured data at scale

Most data extraction pipelines start the same way: someone needs structured data from a website, and the first instinct is to write a scraper. Then comes the selector logic, the edge cases, the JavaScript rendering layer, the proxy setup, and eventually a fragile script that breaks when the site updates. The Minexa API was built to replace that entire process with a single trained scraper and a POST request.

This guide walks through exactly how that works, from training your first scraper to running batch extractions in production.

Step 1: Train a scraper in the browser

Before making any API call, you need a scraper_id. This is generated through the Minexa Chrome extension, not through code. The process takes 2 to 5 minutes.

Install the Minexa Chrome extension, open a page that contains the data you want to extract, and select the HTML container that wraps the full data block. Select the parent element that holds all the fields rather than clicking individual fields one by one. Minexa analyzes the structure, automatically discovers the relevant data points inside that container, and assigns column labels.

Once the scraper is created, click API Request in the top right. You will see pre-generated Python code with your scraper_id already filled in. Copy it and you are ready for step two.

Step 2: Understand the API request structure

All extractions go through a single endpoint:

POST https://api.minexa.ai/data/

The request body uses a batches array, which means you can submit multiple scraper configurations in one call. Here is a standard request body:

{
  "batches": [
    {
      "scraper_id": 4821,
      "columns": ["top_30"],
      "urls": ["https://example.com/product/101", "https://example.com/product/102"],
      "scraping": {
        "js_render": true,
        "timeout": 30,
        "js_code": [
          { "wait_time": 2 },
          { "page_init": true },
          { "wait_time": 4 }
        ],
        "proxy": "verified",
        "retry": 3
      }
    }
  ],
  "threads": 5
}

A few things worth understanding here:

  • scraper_id is the ID generated during training. Every URL in the batch must share the same page structure the scraper was trained on. If a mismatched URL is submitted, Minexa returns an explicit error rather than extracting incorrect data.

  • columns accepts either a top_N shorthand (e.g. top_30 returns the 30 highest-ranked fields) or an explicit list of column names like ["price", "availability", "brand"]. Both approaches cost the same.

  • threads controls how many URLs are processed in parallel. Higher values mean faster throughput up to your plan limit.

  • urls contains the live pages to scrape. Up to 50,000 URLs can be submitted in a single batch request.
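The request body described above can be assembled programmatically. The following is a minimal sketch, not the script the extension generates: the helper names chunk_urls and build_payloads are hypothetical, and it conservatively reads the 50,000-URL limit as applying to each request, so it produces one body per chunk. It only builds the bodies; send each one using the code copied from the extension, which already contains your authentication details.

```python
MAX_URLS_PER_REQUEST = 50_000  # limit stated in the text above


def chunk_urls(urls, size=MAX_URLS_PER_REQUEST):
    """Split a URL list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_payloads(scraper_id, urls, columns=("top_30",), threads=5,
                   scraping=None, max_urls=MAX_URLS_PER_REQUEST):
    """Return one request body per chunk of URLs (hypothetical helper)."""
    if scraping is None:
        # Same values as the standard request body shown earlier.
        scraping = {
            "js_render": True,
            "timeout": 30,
            "js_code": [{"wait_time": 2}, {"page_init": True}, {"wait_time": 4}],
            "proxy": "verified",
            "retry": 3,
        }
    return [
        {
            "batches": [{
                "scraper_id": scraper_id,
                # Either a top_N shorthand or explicit column names.
                "columns": list(columns),
                "urls": chunk,
                "scraping": scraping,
            }],
            "threads": threads,
        }
        for chunk in chunk_urls(list(urls), max_urls)
    ]
```

For example, build_payloads(4821, urls, columns=["price", "availability", "brand"]) returns a list with a single body as long as urls holds 50,000 entries or fewer.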

Step 3: Choose your scraping configuration

The scraping object controls how Minexa fetches each page. The right configuration depends on the target site.

For most standard pages, js_render: true with proxy: "verified" and a few wait steps is enough. For heavily protected sites, you may need to switch the provider. Three provider options are available: service1, service2, and service3. Of the three, service2 is the most capable at anti-bot unblocking but also the most credit-intensive, while service3 is a good middle ground when service1 does not get through. The bypass parameter for anti-bot handling only works when provider is set to service2.
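Because bypass is only honored together with service2, it can help to check a configuration before spending credits on it. The sketch below is a hypothetical local helper, not part of the Minexa API. It assumes that provider and bypass are keys inside the scraping object, and the value True for bypass in the usage note is a placeholder, since the accepted values are not described here.

```python
KNOWN_PROVIDERS = {"service1", "service2", "service3"}


def check_scraping_config(scraping):
    """Return a list of problems found in a scraping object (empty if none)."""
    problems = []
    provider = scraping.get("provider")
    if provider is not None and provider not in KNOWN_PROVIDERS:
        problems.append(f"unknown provider: {provider!r}")
    # Assumption: bypass lives in the same object as provider.
    if "bypass" in scraping and provider != "service2":
        problems.append("bypass only works when provider is 'service2'")
    return problems
```

Calling check_scraping_config({"provider": "service3", "bypass": True}) returns one problem, while the same call with provider set to service2 returns an empty list.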

If you already have the HTML stored as files (for example on AWS CloudFront), you can skip live crawling entirely using the file_urls parameter. Set js_render: false and pass your stored HTML URLs in file_urls, with the original source URLs in urls for mapping purposes. This is the lowest-credit configuration available.
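A batch for stored HTML might be put together as in the sketch below. Two points are assumptions rather than documented behavior: that file_urls sits next to urls inside the batch object, and that the two lists are matched by position. Check the API documentation or the pre-built scenario in the extension before relying on either. The helper name build_file_batch and the example URLs are hypothetical.

```python
def build_file_batch(scraper_id, source_urls, file_urls, columns=("top_30",)):
    """Build one batch that parses stored HTML instead of crawling live pages."""
    if len(source_urls) != len(file_urls):
        # Assumption: each stored file maps to the source URL at the same index.
        raise ValueError("source_urls and file_urls must have the same length")
    return {
        "scraper_id": scraper_id,
        "columns": list(columns),
        "urls": list(source_urls),      # original pages, kept for mapping
        "file_urls": list(file_urls),   # assumed placement: next to "urls"
        "scraping": {"js_render": False},
    }


batch = build_file_batch(
    4821,
    ["https://example.com/product/101"],
    ["https://cdn.example.com/html/product-101.html"],
)
```

The resulting batch goes into the batches array of the request body in the same way as a live-crawl batch.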

The extension drop-down under API Request offers pre-built scraping scenarios you can copy directly. This is the fastest way to find the right configuration without reading through every parameter.

Step 4: Run the Python script and save your data

The ready-to-run Python script handles pagination across the API response automatically, writing checkpoint files at each iteration so data is never lost mid-run. It saves output as JSON, CSV, and Excel simultaneously.
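If you want to adapt the checkpointing to your own pipeline, the idea can be sketched with the standard library alone. This is not the generated script: save_checkpoint and the file naming are hypothetical, rows are assumed to be a list of dictionaries, and Excel output is left out because it needs a third-party package.

```python
import csv
import json
from pathlib import Path


def save_checkpoint(rows, out_dir, iteration):
    """Write the rows collected so far as JSON and CSV; return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / f"checkpoint_{iteration:04d}.json"
    csv_path = out_dir / f"checkpoint_{iteration:04d}.csv"

    json_path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")

    # Use every key seen in any row, in first-seen order, as the CSV header.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            # Nested values (lists, dicts) are stored as JSON text in the cell.
            writer.writerow({
                key: value if value is None or isinstance(value, (str, int, float))
                else json.dumps(value, ensure_ascii=False)
                for key, value in row.items()
            })
    return json_path, csv_path
```

Writing a full snapshot at each iteration costs some disk space, but an interrupted run can then resume from the latest file.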

The key loop pattern uses the meta.next token returned in each response. As long as a next value is present, the script continues fetching the next batch of results. When it is absent, the job is complete.
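The loop can be sketched as follows. How the next token is sent back to the API is not covered in this guide, so the sketch leaves that to a fetch_page callable that you supply. It also assumes the rows of each response sit under a "data" key, which is a guess; only meta.next comes from the description above.

```python
def collect_all(fetch_page, on_batch=None):
    """Fetch result pages until the response carries no meta.next token.

    fetch_page(next_token) must return the decoded response as a dict;
    next_token is None on the first call.
    """
    rows = []
    next_token = None
    while True:
        response = fetch_page(next_token)
        batch = response.get("data", [])  # assumption: rows live under "data"
        rows.extend(batch)
        if on_batch is not None:
            on_batch(batch)  # e.g. print a preview or write a checkpoint
        next_token = (response.get("meta") or {}).get("next")
        if not next_token:
            break  # no next value: the job is complete
    return rows
```

Keeping the HTTP call outside the loop makes the stop condition easy to test with a fake fetch_page before running it against real pages.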

Set your api_key, update scraper_id and urls in the request body, and run. The script prints a preview of each batch as it arrives.

The full API documentation covers each parameter if you need to look one up in detail.

Step 5: Handle nested data in the output

Most columns return flat string values. When the extracted content is structurally nested in the HTML, Minexa returns a list of objects instead. Each object includes a value field along with metadata like tag, type, and attribute.

For example, a field containing multiple location values might return:

{"locations": [{"tag": "span", "type": "text", "value": "Berlin"}, {"tag": "span", "type": "text", "value": "Munich"}]}

In Python, extracting the values is straightforward:

values = [item["value"] for item in data["locations"]]

This is worth planning for when your target data includes long text blocks, multi-value fields, or article content where the full text is distributed across multiple tags.
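Since a row can mix flat strings with nested lists, a small normalization step before export keeps downstream code simple. The helpers below are a sketch with hypothetical names. They rely only on the value field shown in the example above and join multiple values with a separator of your choice.

```python
def flatten_field(value, separator=", "):
    """Return flat values unchanged; join the "value" entries of nested lists."""
    if not isinstance(value, list):
        return value
    parts = []
    for item in value:
        text = item.get("value") if isinstance(item, dict) else item
        if text is not None:
            parts.append(str(text))
    return separator.join(parts)


def flatten_row(row, separator=", "):
    """Apply flatten_field to every column of one extracted row."""
    return {key: flatten_field(value, separator) for key, value in row.items()}


row = {
    "title": "Backend Engineer",
    "locations": [
        {"tag": "span", "type": "text", "value": "Berlin"},
        {"tag": "span", "type": "text", "value": "Munich"},
    ],
}
flat = flatten_row(row)
```

For article text spread across several tags, passing separator="\n" keeps each fragment on its own line.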

Step 6: Know what to do when a scraper needs retraining

Minexa fails loudly. If a page structure changes and the trained scraper no longer matches, affected fields return null or an explicit error. No silently wrong values are returned.

When a site redesigns, open an affected page in the extension, select the updated container, and create a new scraper. This takes the same 2 to 5 minutes as the original setup. Retraining generates a new scraper_id. The only required code change is updating that ID in your request body and verifying that the column names you depend on are still present.
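Both checks, spotting fields that have started returning null and confirming that the columns you depend on still exist after retraining, can be automated on the extracted rows. The sketch below uses hypothetical helper names and assumes rows are dictionaries keyed by column name. The 0.5 threshold is an arbitrary starting point, not a Minexa recommendation.

```python
def missing_columns(rows, required_columns):
    """Required columns that do not appear as a key in any row."""
    present = {key for row in rows for key in row}
    return [column for column in required_columns if column not in present]


def null_rates(rows, required_columns):
    """Share of rows in which each required column is absent or None."""
    total = len(rows)
    return {
        column: (sum(1 for row in rows if row.get(column) is None) / total) if total else 0.0
        for column in required_columns
    }


def columns_needing_attention(rows, required_columns, threshold=0.5):
    """Columns whose null rate reaches the threshold (a hint to retrain)."""
    rates = null_rates(rows, required_columns)
    return [column for column, rate in rates.items() if rate >= threshold]
```

Running columns_needing_attention on each finished job gives an early signal that a site has changed, and missing_columns is a quick check after swapping in a new scraper_id.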

If you are running many different URLs across different page types, the recommended approach is to set up your own cron jobs and call the API with the appropriate scraper_id and URL list for each job. This gives you full control over scheduling and volume without any platform-level constraints.

The Minexa API documentation covers every parameter in detail. If you are starting out, the fastest path is to train a scraper through the extension, copy the generated Python code, and run it against your first batch of URLs. Everything else follows from there.

Get started at minexa.ai and have your first structured dataset running in under ten minutes.
