10 scraping parameters in the Minexa API that most developers overlook

Most developers get a Minexa.ai API pipeline running quickly. The default request body works for a large share of sites, and the Chrome extension generates ready-to-use Python code in minutes. But a handful of parameters sit quietly in the request schema and go unused, even when they would directly solve a problem the developer is already fighting with.

Here are ten parameters worth knowing before you hit your first wall.

1. bypass: anti-bot handling that only activates on service2

The bypass parameter enables advanced anti-bot handling, but it only works when provider is set to service2. Setting bypass on any other provider has no effect. If a site is returning empty or blocked responses and service1 is not getting through, switch to service2 and enable bypass together. Note that service2 is the most expensive provider, so use it selectively.
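The provider and bypass pairing is easy to guard in client code. Here is a minimal sketch, assuming the request body is a flat Python dict that uses the field names from this post. The helper names and the example URL are hypothetical, and the real schema may nest these fields differently.

```python
def enable_bypass(body: dict) -> dict:
    """Return a copy of the request body with bypass switched on.

    bypass only takes effect on service2, so both fields are set together.
    """
    updated = dict(body)
    updated["provider"] = "service2"
    updated["bypass"] = True
    return updated


def bypass_is_effective(body: dict) -> bool:
    """True only when bypass is enabled and the provider is service2."""
    return bool(body.get("bypass")) and body.get("provider") == "service2"


# Hypothetical body: the flat layout and the URL are assumptions for illustration.
blocked = {
    "urls": ["https://example.com/listing"],
    "provider": "service1",
    "bypass": True,  # silently ignored, because the provider is not service2
}
fixed = enable_bypass(blocked)
```

A check like bypass_is_effective can run before each job is submitted, so a bypass flag paired with the wrong provider is caught before it costs a failed run.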

2. country: geo-targeted content via ISO code

Some pages return different prices, listings, or content depending on the visitor's location. The country parameter sets the proxy's geolocation using a standard ISO country code. Without it, the proxy location is uncontrolled, which can produce inconsistent data across runs when the target site serves region-specific content.
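One way to keep runs consistent is to pin the country explicitly and build one request body per region. The sketch below assumes two-letter ISO 3166-1 alpha-2 codes and a flat request body; the function name and the URL are hypothetical.

```python
import re

# Assumption: two-letter ISO 3166-1 alpha-2 codes such as "US" or "DE".
ISO_ALPHA2 = re.compile(r"^[A-Z]{2}$")


def per_country_bodies(base: dict, countries: list[str]) -> list[dict]:
    """Build one request body per country so regional results can be compared."""
    bodies = []
    for raw in countries:
        code = raw.strip().upper()
        if not ISO_ALPHA2.match(code):
            raise ValueError(f"not a two-letter country code: {raw!r}")
        # Copy the base body so each region gets its own independent config.
        bodies.append({**base, "country": code})
    return bodies


base = {"urls": ["https://example.com/pricing"]}
bodies = per_country_bodies(base, ["us", "de"])
```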

3. reset: force a fresh crawl instead of cached HTML

Minexa may return cached HTML for pages it has recently fetched. If your pipeline tracks time-sensitive data like prices, stock levels, or job postings, set reset to true to force a live crawl. Skipping this on frequently updated pages is a common source of stale data that is difficult to diagnose.

4. method, data, and headers: non-GET requests

Not every target URL is a simple GET request. The method field lets you specify POST or other HTTP methods. The data field carries the request body, and headers lets you pass custom HTTP headers. This combination is useful when the target page requires specific request parameters or content-type declarations to return the correct HTML.
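As an illustration, a JSON POST to a search page might be described as follows. This is a sketch under assumptions: the body is flat, data is taken to be the serialized request body as a string, and the URL and payload fields are made up.

```python
import json


def post_body(url: str, payload: dict) -> dict:
    """Describe a JSON POST to the target page using method, data, and headers."""
    return {
        "urls": [url],
        "method": "POST",
        # Assumption: data accepts the serialized request body as a string.
        "data": json.dumps(payload),
        # Tell the target site how to interpret the body.
        "headers": {"Content-Type": "application/json"},
    }


# Hypothetical search page and form fields.
body = post_body("https://example.com/search", {"query": "laptops", "page": 2})
```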

5. cookies: access pages behind soft authentication

Some pages display different content to logged-in users, such as full contact details in a directory or complete pricing on a SaaS site. The cookies field lets you pass session cookies directly in the scraping config. This does not bypass hard authentication walls, but it handles cases where a valid session cookie is sufficient to unlock the full page content.
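Session cookies are usually copied from a logged-in browser as a single Cookie header string. The sketch below parses that string with Python's standard library and attaches the result to the config. That the cookies field accepts a name-to-value mapping is an assumption, and the cookie names, values, and URL are placeholders.

```python
from http.cookies import SimpleCookie


def cookies_from_header(cookie_header: str) -> dict:
    """Parse a browser Cookie header string into a name -> value mapping."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder values; real ones come from a logged-in browser session.
session = cookies_from_header("sessionid=abc123; csrftoken=xyz789")

body = {
    "urls": ["https://example.com/directory/42"],
    # Assumption: cookies accepts a name -> value mapping.
    "cookies": session,
}
```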

6. load_images and load_medias: cut credit cost on heavy pages

When JavaScript rendering is enabled, the browser fetches all page assets by default. Setting load_images and load_medias to false prevents image and media loading. If the data you need is text-based, this reduces bandwidth and can lower credit consumption on pages that load large visual assets.
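For text-only extraction, both flags can be switched off in one place. A small sketch, assuming a flat request body; the helper name and URL are hypothetical.

```python
def text_only(body: dict) -> dict:
    """Return a copy of the body that skips image and media loading."""
    updated = dict(body)
    updated["load_images"] = False
    updated["load_medias"] = False
    return updated


# Hypothetical JS-rendered gallery page where only the captions are needed.
heavy = {"urls": ["https://example.com/gallery"], "js_render": True}
light = text_only(heavy)
```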

7. wait_page_load: delay for post-load dynamic content

Some pages inject content into the DOM after the initial load event fires. A standard page load wait is not always enough. The wait_page_load parameter adds a delay after the page load event before extraction begins. This is different from the wait_time steps inside js_code. Use it when content appears visually but extraction returns null on fields that should be populated.

8. retry: automatic re-attempts on transient failures

Network timeouts, temporary blocks, and intermittent rendering failures are common at scale. The retry parameter sets how many times Minexa will automatically re-attempt a failed request before marking it as an error. Without it, a single transient failure produces a gap in your dataset that requires a separate re-run to fill.

9. threads: parallel processing up to your plan limit

The threads field controls how many URLs are processed simultaneously. Higher values shorten total job time roughly in proportion, up to the maximum allowed by your plan; the Business plan supports up to 100 threads. If your jobs are running slower than expected, check whether threads is set below your plan's ceiling. Leaving it at the default when your plan supports more is a straightforward throughput loss.
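A rough way to see the effect is to count how many rounds of parallel work a job needs. The sketch below assumes every URL takes about the same time, which real jobs only approximate. The 100-thread ceiling is the Business plan figure from this post; the function names are hypothetical.

```python
import math

BUSINESS_PLAN_MAX_THREADS = 100  # figure from this post; other plans differ


def effective_threads(requested: int, plan_max: int) -> int:
    """Clamp the requested thread count to the range 1..plan_max."""
    return max(1, min(requested, plan_max))


def rounds_needed(url_count: int, threads: int) -> int:
    """Rounds of parallel work, assuming each URL takes about the same time."""
    return math.ceil(url_count / threads)


threads = effective_threads(250, BUSINESS_PLAN_MAX_THREADS)  # clamped to 100
slow = rounds_needed(1000, 10)       # 100 rounds at 10 threads
fast = rounds_needed(1000, threads)  # 10 rounds at 100 threads
```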

10. file_urls: decouple scraping from extraction entirely

If you already have HTML stored on AWS CloudFront, GitHub Gist, or any accessible URL, you can pass those locations in file_urls and skip live crawling altogether. Each entry in file_urls maps 1-to-1 with the corresponding entry in urls, which carries the original source URL for attribution. Set js_render to false when using this approach since no rendering is needed. This is the lowest-credit scraping configuration available and is useful for pipelines that already handle their own HTML collection.
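Because the two lists are matched by position, a length mismatch is worth catching before the request is sent. A minimal sketch, assuming a flat request body; the helper name and both URLs are hypothetical.

```python
def stored_html_body(source_urls: list[str], file_urls: list[str]) -> dict:
    """Build a body that extracts from stored HTML instead of crawling live."""
    if len(source_urls) != len(file_urls):
        raise ValueError(
            f"urls and file_urls must pair 1-to-1: "
            f"{len(source_urls)} vs {len(file_urls)}"
        )
    return {
        "urls": list(source_urls),      # original pages, kept for attribution
        "file_urls": list(file_urls),   # where the saved HTML actually lives
        "js_render": False,             # stored HTML needs no rendering
    }


# Hypothetical source page and storage location.
body = stored_html_body(
    ["https://example.com/products/1"],
    ["https://cdn.example.net/html/product-1.html"],
)
```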

These parameters do not change the core extraction logic, which remains deterministic and DOM-based regardless of configuration. But they directly affect success rate, data freshness, cost, and throughput. Most are one-line additions to an existing request body.
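To show how small those additions are, here is a sketch that layers three of them onto an existing body. The flat layout, the URL, and the specific values are assumptions for illustration; in particular, the unit of wait_page_load is not stated in this post and should be checked against the API documentation.

```python
# Hypothetical existing body.
base = {"urls": ["https://example.com/jobs"]}

tuned = {
    **base,
    "reset": True,        # force a live crawl instead of cached HTML
    "retry": 3,           # re-attempt transient failures; value is illustrative
    "wait_page_load": 2,  # extra delay after the load event; unit is an assumption
}
```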
