10 web data extraction mistakes that quietly kill your pipeline
- Minexa.ai

- 6 days ago
Most data extraction pipelines do not fail dramatically. They degrade quietly, one bad row at a time, until someone notices the numbers look wrong. These ten mistakes are responsible for the majority of that silent damage.
1. Selecting individual fields instead of the parent container
When training a scraper, the instinct is to click the exact field you want, like a price or a title. The problem is that individual fields shift position constantly as layouts change. The correct approach is to select the parent HTML element that wraps the entire data block. A tool like Minexa.ai is built around this principle: you point to the container, and all the data points inside it are discovered automatically. Clicking individual fields one by one is slower, more fragile, and unnecessary.
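The container-first idea can be sketched with Python's standard library. This is a minimal illustration, not a description of how Minexa.ai works internally: the markup, the class names, and the `extract_cards` helper are all hypothetical, and the snippet is well-formed so that `xml.etree.ElementTree` can parse it (real pages usually need a proper HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed listing markup. The second card has an extra
# badge element, which is the kind of shift that breaks positional selectors.
PAGE = """
<div id="results">
  <div class="card"><h2>Blue Mug</h2><span class="price">$12.00</span></div>
  <div class="card"><h2>Red Mug</h2><span class="badge">New</span><span class="price">$14.50</span></div>
</div>
"""

def extract_cards(markup: str) -> list[dict]:
    """Anchor on the parent container, then resolve fields relative to it."""
    root = ET.fromstring(markup)
    rows = []
    for card in root.findall(".//div[@class='card']"):
        title = card.find("h2")
        price = card.find("span[@class='price']")
        rows.append({
            # A missing field becomes None instead of a neighbour's value.
            "title": title.text if title is not None else None,
            "price": price.text if price is not None else None,
        })
    return rows

rows = extract_cards(PAGE)
```

Because every field is looked up inside its own card, the extra badge in the second card does not shift the price lookup.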
2. Using LLM-based extraction at any meaningful volume
LLMs work for quick one-off extractions. At scale, the economics collapse. A full HTML page averages around 572,000 tokens. At 120,000 pages per month, even the cheapest nano-class models cost hundreds to thousands of dollars more than a flat-rate DOM-based approach. Beyond cost, LLMs introduce probabilistic output: the same page can return different field values across runs depending on model temperature or prompt drift. For production pipelines, that variability creates downstream validation work that compounds with volume.
3. Treating silent failures as acceptable
Selector-based scrapers and LLM pipelines both fail silently in different ways. A CSS selector that drifts to the wrong element returns a value, just the wrong one. An LLM filling a missing price field may return a plausible number that was never on the page. Neither raises an error. A well-designed extraction system should return null or an explicit error when a field is absent or the page structure does not match, not a fabricated or misattributed value. Silent failures are the hardest to catch because the pipeline keeps running and the data keeps looking reasonable.
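One way to make a field parser fail loudly is sketched below. The names `StructureMismatch` and `parse_price`, the dict-shaped data block, and the price format are assumptions made for illustration. The point is the three distinct outcomes: a value, an explicit null, or an error, and never a guess.

```python
import re
from typing import Optional

class StructureMismatch(Exception):
    """Raised when the page does not contain the expected data block."""

# Assumed price format for this sketch: optional "$", digits, optional cents.
PRICE_RE = re.compile(r"^\$?(\d+(?:\.\d{2})?)$")

def parse_price(block: Optional[dict]) -> Optional[float]:
    if block is None:
        # The container itself is missing: the page structure does not match.
        raise StructureMismatch("expected data block not found")
    raw = block.get("price")
    if raw is None or raw.strip() == "":
        # The field is absent: return an explicit null, never a default.
        return None
    match = PRICE_RE.match(raw.strip())
    if match is None:
        # Something was captured, but it is not a price (selector drift).
        raise ValueError(f"price field has unexpected content: {raw!r}")
    return float(match.group(1))
```

A drifted selector that captures "Free shipping" instead of a price now raises an error rather than flowing into the dataset.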
4. Hardcoding XPath or CSS selectors by hand
Writing selectors manually is time-consuming and brittle. A single class name change on the target site breaks the extractor, and the breakage is usually discovered after bad data has already been collected. Beyond maintenance, hand-written selectors require someone with DOM knowledge to build and update them. Automated field discovery, where the system evaluates and ranks candidate selectors for structural stability across multiple pages, tends to produce more reliable results than a single hand-picked path and removes most of that engineering work.
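A toy version of ranking candidate selectors by stability might look like the following. The scoring rule (a selector is stable on a page if it matches exactly one non-empty element) is an assumption chosen for illustration, not a documented algorithm, and the sample pages and selectors are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical product pages whose wrapper class names differ per build.
PAGES = [
    '<main><div class="p-9f3"><span itemprop="price">$10</span></div></main>',
    '<main><div class="p-a71"><span itemprop="price">$12</span></div></main>',
]

CANDIDATES = [
    ".//div[@class='p-9f3']/span",   # tied to a generated class name
    ".//span[@itemprop='price']",    # tied to a semantic attribute
]

def stability(selector: str, pages: list[str]) -> float:
    """Fraction of pages where the selector matches exactly one non-empty element."""
    hits = 0
    for markup in pages:
        matches = ET.fromstring(markup).findall(selector)
        if len(matches) == 1 and (matches[0].text or "").strip():
            hits += 1
    return hits / len(pages)

def rank(candidates: list[str], pages: list[str]) -> list[str]:
    """Order candidates from most to least stable across the sample pages."""
    return sorted(candidates, key=lambda s: stability(s, pages), reverse=True)

best = rank(CANDIDATES, PAGES)[0]
```

The class-based selector only works on the page it was written against, which is exactly the failure a hand-picked path hides until the site changes.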
5. Running extractions single-threaded
Processing URLs one at a time is the default for many scraping scripts, and it turns a job that could finish in minutes into one that runs for hours. Extraction is mostly network-bound, so a sequential script spends most of its time waiting on responses. If your extraction layer supports configurable concurrency, use it. At 50,000 URLs per batch, running multiple threads simultaneously is the difference between a pipeline that fits inside a reasonable time window and one that misses it.
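A minimal sketch of a concurrent batch using Python's `concurrent.futures` follows. The `extract` function is a hypothetical stand-in for a network-bound extraction call, and the worker count is an arbitrary example value.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def extract(url: str) -> dict:
    # Hypothetical stand-in for a network-bound extraction request.
    time.sleep(0.05)
    return {"url": url, "ok": True}

def run_batch(urls: list[str], max_workers: int = 8) -> list[dict]:
    # Threads suit this workload because each call mostly waits on I/O.
    # pool.map returns results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, urls))

urls = [f"https://example.com/p/{i}" for i in range(16)]
results = run_batch(urls)
```

Keep `max_workers` within whatever limits the target site and your extraction service allow. More threads than the rate limit permits only converts waiting time into failed requests.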
6. Ignoring nested data structures
Not all extracted values are flat strings. A field like study locations or investor names may return a list of objects, each with a tag, type, and value. Pipelines that are not built to handle this either drop the field entirely or store a raw JSON blob that nothing downstream can parse. The correct approach is to extract the value key from each object in the list, and in cases where multiple object types exist, filter by tag or attribute to isolate the right one. This requires explicit handling in your processing code, and skipping it means losing structured data that is actually there.
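The handling described above can be sketched as a small helper. The object shape (`tag`, `type`, `value`) follows the description in this section; the sample data and the `values_for` name are hypothetical.

```python
def values_for(field, **match):
    """Flatten a nested field into a list of plain values.

    `field` may be None, a flat string, or a list of objects that each carry
    a 'value' key. Keyword arguments filter on object keys such as tag or type.
    """
    if field is None:
        return []
    if isinstance(field, str):
        return [field]
    return [
        item["value"]
        for item in field
        if isinstance(item, dict)
        and "value" in item
        and all(item.get(key) == wanted for key, wanted in match.items())
    ]

# Hypothetical nested field as an extraction result might return it.
investors = [
    {"tag": "a", "type": "text", "value": "Acme Ventures"},
    {"tag": "a", "type": "href", "value": "/investors/acme"},
    {"tag": "span", "type": "text", "value": "Lead investor"},
]

names = values_for(investors, tag="a", type="text")
```

Here `names` holds only the link text, while calling `values_for(investors)` without filters returns all three values.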
7. Skipping anti-bot configuration on protected sites
Many sites render content with JavaScript or sit behind CAPTCHA systems and bot detection that block standard HTTP requests entirely. Sending a plain GET request to these pages returns an empty shell, an error page, or a challenge screen, not the content you need. Effective extraction on these sites requires JavaScript rendering, appropriate proxy types, and sometimes residential IPs or specialized unblockers. The configuration matters: using a lighter scraping mode on a heavily protected site will produce empty or failed results, while using the heaviest mode on a simple static page wastes credits. Matching the scraping configuration to the target site is a step that is easy to skip and expensive to ignore.
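One possible way to match configuration to the site is to start with the lightest mode and escalate only when the response looks blocked. The sketch below assumes a caller-supplied `fetch(url, mode)` function; the mode names and the block-detection phrases are hypothetical and do not correspond to any specific product's settings.

```python
# Hypothetical mode names, ordered from cheapest to most expensive.
MODES = ["plain_http", "js_render", "residential_unblocker"]

def looks_blocked(html: str) -> bool:
    """Crude check for an empty response or a challenge screen."""
    text = html.lower()
    return not text.strip() or "captcha" in text or "verify you are human" in text

def fetch_with_escalation(url: str, fetch, modes=MODES):
    """Try each mode in order and return the first (mode, html) that is not blocked."""
    for mode in modes:
        html = fetch(url, mode)
        if not looks_blocked(html):
            return mode, html
    raise RuntimeError(f"all modes blocked for {url}")

def fake_fetch(url: str, mode: str) -> str:
    # Test double: the plain request hits a challenge page, heavier modes succeed.
    if mode == "plain_http":
        return "<html>Please complete the CAPTCHA</html>"
    return "<html><h1>Product</h1></html>"

mode, html = fetch_with_escalation("https://example.com/item/1", fake_fetch)
```

In practice the mode that works for a domain can be recorded once and reused, so the escalation cost is paid per site and not per URL.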
8. Passing URLs to the wrong scraper
A scraper trained on a product detail page will not work correctly on a search results page, even if both come from the same domain. The HTML structure is different, the data container is different, and the columns do not map. Submitting the wrong URL type to a scraper should produce a clear error, not a partially filled or empty result. Minexa.ai is designed to return an explicit mismatch error in this situation rather than attempting extraction on an incompatible page. If your current setup silently returns empty rows when the URL type is wrong, you have no signal that anything went wrong until you audit the output manually.
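If your current setup does not raise a mismatch error on its own, a client-side pre-check can supply the missing signal. This is a sketch under stated assumptions: the scraper identifiers, the domain, and the URL patterns are hypothetical, and the check is separate from whatever the extraction service itself returns.

```python
import re

class ScraperMismatch(Exception):
    """Raised when a URL does not match the page type a scraper was trained on."""

# Hypothetical scraper identifiers and URL patterns for a single domain.
SCRAPERS = {
    "product_detail": re.compile(r"^https://shop\.example\.com/product/[\w-]+$"),
    "search_results": re.compile(r"^https://shop\.example\.com/search\?q=.+$"),
}

def check_url(scraper_id: str, url: str) -> str:
    """Return the URL unchanged if it fits the scraper, otherwise raise."""
    pattern = SCRAPERS[scraper_id]
    if not pattern.match(url):
        raise ScraperMismatch(f"{url!r} does not look like a {scraper_id} page")
    return url
```

Calling `check_url` before submission means a search URL sent to the product scraper stops the batch with a named error in place of a row of empty columns.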
9. Not retraining after a site redesign
Websites update their layouts. When that happens, an existing scraper trained on the old structure will start returning null values or errors on the affected fields. The correct response is to retrain: open an updated page, select the new container, and generate a new scraper. This typically takes two to five minutes. The mistake is waiting too long to act, either because the errors are not monitored or because the team assumes the issue is temporary. After retraining, the only required code change is updating the scraper identifier in the API request and verifying that the column names you depend on have not changed in the new version.
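Both halves of this step, noticing that a redesign happened and verifying column names after retraining, can be automated with a few lines. The helpers, the 20% threshold, and the column names below are assumptions for illustration.

```python
def null_rate(rows: list[dict], column: str) -> float:
    """Share of rows where the column is missing or None."""
    if not rows:
        return 0.0  # an empty batch deserves its own alert elsewhere
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)

def columns_needing_attention(rows, required, threshold=0.2):
    """Required columns whose null rate exceeds the (assumed) threshold."""
    return sorted(c for c in required if null_rate(rows, c) > threshold)

def missing_columns(required, new_columns):
    """Required columns that the retrained scraper no longer produces."""
    return sorted(set(required) - set(new_columns))

# Hypothetical batch collected after a site redesign moved the price element.
rows = [
    {"title": "A", "price": None},
    {"title": "B", "price": None},
    {"title": "C", "price": "3.00"},
]
flagged = columns_needing_attention(rows, ["title", "price"])
renamed = missing_columns(["title", "price"], ["title", "cost"])
```

Running the first check on every batch turns a redesign into an alert, and running the second before switching scraper identifiers catches renamed columns before they reach downstream code.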
10. Underestimating the real cost of token-based extraction
The per-page cost of an LLM looks small in isolation. At low volumes with stripped HTML, the cheapest models can be competitive. But most real-world pipelines do not use stripped HTML, because stripping requires preprocessing logic that may accidentally remove markup containing needed data. Full HTML pages are roughly 15 times larger in token count, and that multiplier applies directly to cost. At 120,000 full pages per month, even the cheapest available model costs over $3,400, compared to a flat monthly rate for the same volume using DOM-based extraction. The gap is not a rounding error. It is a structural cost difference that grows with every page added to the pipeline.
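The arithmetic behind these figures can be reproduced in a few lines. The page size, volume, and 15x ratio come from this article; the price per million input tokens is an assumption chosen for illustration, not a quoted rate, and output tokens are ignored.

```python
TOKENS_PER_FULL_PAGE = 572_000    # average full-page size quoted under mistake 2
PAGES_PER_MONTH = 120_000         # monthly volume used in this article
FULL_TO_STRIPPED_RATIO = 15       # full HTML vs stripped HTML, per this section
PRICE_PER_MILLION_TOKENS = 0.05   # ASSUMPTION: USD per 1M input tokens

def monthly_cost(tokens_per_page: float, pages: int, price_per_million: float) -> float:
    """Input-token cost in USD for one month of pages."""
    return tokens_per_page * pages / 1_000_000 * price_per_million

full_cost = monthly_cost(TOKENS_PER_FULL_PAGE, PAGES_PER_MONTH, PRICE_PER_MILLION_TOKENS)
stripped_cost = monthly_cost(
    TOKENS_PER_FULL_PAGE / FULL_TO_STRIPPED_RATIO,
    PAGES_PER_MONTH,
    PRICE_PER_MILLION_TOKENS,
)

print(f"full HTML:     ${full_cost:,.2f} per month")
print(f"stripped HTML: ${stripped_cost:,.2f} per month")
```

Under that assumed rate the full-HTML total comes to roughly $3,432 per month, in line with the figure above, and the stripped-HTML total to roughly $229. Swap in the current price of whichever model you are evaluating to rerun the comparison.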
Get your first structured dataset in under 10 minutes. Install the Minexa.ai Chrome extension, select a page, and let the scraper build itself. No selectors, no code, no schema required.
Each of these mistakes is fixable. The ones that matter most are the ones that produce wrong data without any error signal, because those are the ones that stay hidden the longest. Building a pipeline that fails loudly, uses the right configuration per site, and separates scraper training from production execution covers the majority of what goes wrong in practice.
