10 questions developers ask before integrating a web extraction API (answered)

Minexa.ai
6 days ago

Before committing an external API to a production pipeline, developers ask specific questions: not vague ones about "ease of use" or "scalability", but concrete ones about how the system behaves under real conditions. This article answers ten of those questions for the Minexa API, the programmatic interface to Minexa's deterministic DOM-based extraction engine.

1. Do I have to write CSS selectors or XPath to define what to extract?

No. The Minexa API does not require selectors. Field discovery happens during the scraper training step, which is done once through the Chrome extension. Minexa detects repeating patterns on the page and identifies all data points automatically. Once confirmed, the scraper is saved and assigned a stable scraper_id. Every API call after that references the scraper by ID. You never write or maintain selectors in your code.

2. How does the Chrome extension connect to the API?

The extension is where you train the scraper. The API is where you run it at scale. You open the target page in Chrome, Minexa detects the structure, you confirm the fields, and the scraper is saved to your account. From that point, the scraper is accessible via API using its scraper_id. The two interfaces share the same backend. Training happens visually once; extraction happens programmatically from then on.

3. What does a basic API request look like?

Requests go to https://api.minexa.ai/data as a POST with a JSON body. The minimum required fields are your scraper_id and a list of urls to process. You can also pass a columns parameter to control which fields come back. Here is a minimal example in Python:

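import requests

# Replace the API key and scraper_id with the values from your own account.
response = requests.post(
    "https://api.minexa.ai/data",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "scraper_id": 4817,
        "urls": ["https://example.com/listings"],
    },
    timeout=60,  # requests has no default timeout, so set one explicitly
)
response.raise_for_status()  # surface HTTP errors before parsing the body
print(response.json())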

Read the full API reference to see all available request parameters.

4. How does the columns parameter work?

You have two options. Pass top_N, where N is any number between 10 and 100 (for example, top_30), to get the N highest-ranked fields Minexa identified during training, ranked by relevance. Or pass a list of named fields if you already know exactly which columns you want. A top_N value such as top_20 is useful during development, when you want to explore what is available. Named fields are better for production, when your downstream schema is fixed.
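The two modes can be sketched as a small helper that builds the request body. This is an illustration under assumptions, not the documented wire format: it assumes columns accepts either a top_N string or a JSON array of names, and the field names title and price are hypothetical. Check the API reference for the exact shape.

```python
import re


def build_payload(scraper_id, urls, columns):
    """Build a request body; columns is a 'top_N' string or a list of field names."""
    if isinstance(columns, str):
        # The article gives 10 to 100 as the allowed range for N.
        match = re.fullmatch(r"top_(\d+)", columns)
        if not match or not 10 <= int(match.group(1)) <= 100:
            raise ValueError("columns string must be top_N with N between 10 and 100")
    elif not (isinstance(columns, list) and columns
              and all(isinstance(name, str) for name in columns)):
        raise ValueError("columns must be a top_N string or a non-empty list of names")
    return {"scraper_id": scraper_id, "urls": list(urls), "columns": columns}


# Development: explore what the scraper found.
explore = build_payload(4817, ["https://example.com/listings"], "top_20")

# Production: pin the schema (hypothetical field names).
production = build_payload(4817, ["https://example.com/listings"], ["title", "price"])
```

Validating the value client-side keeps a typo such as top_5 from costing a round trip.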

5. How are credits consumed when using the API?

One credit equals one page scraped at baseline. However, pages that require JavaScript rendering, proxy routing, or heavy anti-bot handling may consume more than one credit per page. This differs from the Chrome extension, where every page always costs exactly one credit regardless of complexity. When building a pipeline, it is worth testing a sample of your target URLs first to understand the credit cost per page before running large batches.
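The arithmetic behind that sample test might look like the sketch below. It assumes you have already recorded the credits consumed by each sampled page; how the API or your account reports that figure is not covered here, so the sample values are hypothetical.

```python
import math


def estimate_batch_credits(sample_credits, total_pages):
    """Project the credits for a full batch from per-page credits seen in a sample."""
    if not sample_credits:
        raise ValueError("need at least one sampled page")
    # Average cost per sampled page, scaled to the full batch and rounded up.
    return math.ceil(sum(sample_credits) * total_pages / len(sample_credits))


# Hypothetical sample of 10 pages: most cost 1 credit, two heavier pages cost 3.
sample = [1, 1, 1, 3, 1, 1, 1, 3, 1, 1]
projected = estimate_batch_credits(sample, 50_000)
print(projected)
```

The estimate is only as good as the sample, so draw it from the same mix of page types as the full batch.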

6. Can I send a large batch of URLs in one request?

Yes. The Minexa API supports batch processing of up to 50,000 URLs in a single request. This is the core pattern for large-scale pipelines: collect your URL list, pass it in one call, and retrieve structured results. For very large jobs, the response is paginated. You handle pagination on the retrieval side using a next_token returned in each response until the result set is exhausted.
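A pagination loop of this kind can be sketched as follows. Only next_token comes from the description above; the results key and the way the token is sent back to the API are assumptions, so the HTTP call is kept behind a fetch_page callable that you would implement against the API reference.

```python
def collect_all_results(fetch_page):
    """Call fetch_page(next_token) until no next_token comes back; return all rows."""
    rows = []
    token = None  # the first call carries no token
    while True:
        page = fetch_page(token)
        rows.extend(page.get("results", []))  # "results" is an assumed key name
        token = page.get("next_token")
        if not token:
            return rows


# Stand-in for the real HTTP call, so the loop can be exercised offline.
fake_pages = {
    None: {"results": [{"title": "A"}, {"title": "B"}], "next_token": "t1"},
    "t1": {"results": [{"title": "C"}]},
}
all_rows = collect_all_results(lambda token: fake_pages[token])
print(len(all_rows))
```

Keeping the transport separate from the loop also makes it easy to add retries or rate limiting inside fetch_page later.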

7. What does the output look like when data is nested?

When a page contains nested structures (for example, a list of reviews inside a product page), Minexa returns them as nested JSON objects within the result. Each top-level result is one row. Nested lists appear as arrays within that row. To access a specific value inside a nested object in Python, you reference it by key path: result["reviews"][0]["rating"]. The structure mirrors what Minexa found on the page, so it is predictable once you have seen the output from a test run.
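As a small illustration of working with a nested row, the sketch below averages the ratings inside reviews. The row itself is made up for the example; real field names depend on what your scraper found during training.

```python
def average_rating(row):
    """Return the mean of reviews[*].rating, or None when the row has no ratings."""
    reviews = row.get("reviews") or []  # tolerate a missing or null reviews field
    ratings = [r["rating"] for r in reviews if r.get("rating") is not None]
    return sum(ratings) / len(ratings) if ratings else None


# Hypothetical result row with a nested list of reviews.
row = {
    "title": "Desk lamp",
    "reviews": [
        {"rating": 4, "text": "Solid"},
        {"rating": 5, "text": "Great"},
    ],
}

first_rating = row["reviews"][0]["rating"]  # direct key-path access
mean_rating = average_rating(row)
```

Direct key-path access raises KeyError or IndexError when a nested list is absent or empty, so a guarded helper like this is safer for rows where the nested block is optional.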

8. Can I supply HTML I already have instead of triggering a live crawl?

Yes. The API supports a file_urls parameter that lets you pass pre-scraped HTML directly. Minexa applies the trained scraper to that HTML without making any live request to the target site. This is the lowest-credit-cost extraction mode and is useful when you already have a crawling layer in your infrastructure or when you are processing archived pages. The scraper still needs to have been trained on the same page structure.

9. What happens when a page returns no data?

If Minexa processes a URL and the page structure does not match the trained scraper (for example, after a site redesign), it returns an empty result for that URL rather than silently inserting incorrect data. This is intentional. An empty result is detectable and actionable in your pipeline. It will not corrupt your dataset with fabricated values. You can build retry logic or alerting around empty results the same way you would handle any explicit null response.
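One way to act on empty results is to separate them from successful extractions before anything is written downstream. The sketch assumes results have already been grouped into a dict mapping each URL to its list of rows; the actual response layout may differ, so treat that shape as hypothetical.

```python
def split_empty_results(results_by_url):
    """Separate URLs that produced rows from URLs that came back empty."""
    extracted, empty = {}, []
    for url, rows in results_by_url.items():
        if rows:
            extracted[url] = rows
        else:
            empty.append(url)  # candidate for retry, alerting, or retraining
    return extracted, empty


# Hypothetical batch: the second page no longer matches the trained scraper.
batch = {
    "https://example.com/listings?page=1": [{"title": "A"}],
    "https://example.com/listings?page=2": [],
}
extracted, empty = split_empty_results(batch)
if empty:
    print(f"{len(empty)} of {len(batch)} URLs returned no data: {empty}")
```

A sudden rise in the share of empty URLs for one site is a reasonable signal that the page structure changed and the scraper needs retraining.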

10. Do I need to build my own rendering or proxy layer?

No. Minexa handles JavaScript rendering, geo-targeted content, and dynamic pages as part of the extraction process. You do not need a separate headless browser setup, a proxy provider integration, or custom rendering infrastructure. The API call handles all of it. This means your pipeline code stays simple: send URLs, receive structured JSON. The complexity of rendering and anti-bot handling is managed on Minexa's side.

If you are evaluating the Minexa API for a production pipeline, the fastest way to validate fit is to train one scraper on your target page type and run a small batch. The API documentation covers all parameters, response formats, and error handling in detail. Start there, then scale once the output matches your schema.
