How to scrape pharmaceutical and biotech data from the electronic Medicines Compendium
- Minexa.ai

- 6 days ago
The electronic Medicines Compendium (medicines.org.uk) is one of the most complete publicly accessible references for UK-authorised medicines. Every listed product links to its Summary of Product Characteristics (SmPC), its Patient Information Leaflet (PIL), and where applicable, risk minimisation materials. For pharmaceutical researchers, biotech analysts, and regulatory data teams, this is a dense, structured source worth extracting at scale.
This walkthrough covers how to pull that data programmatically using the Minexa API: train a scraper once via the browser extension, then call the API to extract thousands of medicine records without writing selectors or maintaining custom parsing logic.
What data is available
Each row in the browse listing exposes the following fields:
- medicine_description: the full product name including dosage and form
- active_ingredients: one or more active substances per product
- manufacturer: the marketing authorisation holder
- company_link: relative path to the manufacturer profile on emc
- document_types: nested objects containing SmPC and PIL links per product
- product_information_link: direct path to the PIL document
- risk_material_links: populated only when risk minimisation materials exist for that product
- related_links: all href values associated with the listing row
The risk_material_links field is particularly useful: it is empty for most products and populated only when risk minimisation materials have been published for that product. This makes it a clean signal for filtering products that carry additional risk minimisation measures.
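Because the link fields come back as relative paths, it can help to resolve them to absolute URLs before storing or following them. The sketch below is one way to do that with the standard library. It assumes the paths resolve against https://www.medicines.org.uk and that each link field holds a single string, as in the sample output further down; absolutise_links and the two constants are illustrative names, not part of the Minexa API.

```python
from urllib.parse import urljoin

# Assumption: relative paths in the extracted rows resolve against this host.
EMC_BASE_URL = "https://www.medicines.org.uk"

# Fields from the list above that are assumed to hold a single relative path.
LINK_FIELDS = ("company_link", "product_information_link", "risk_material_links")


def absolutise_links(record: dict) -> dict:
    """Return a copy of the record with non-empty link fields made absolute."""
    resolved = dict(record)
    for field in LINK_FIELDS:
        value = resolved.get(field)
        if value:  # leave empty or missing fields untouched
            resolved[field] = urljoin(EMC_BASE_URL, value)
    return resolved


example = absolutise_links({
    "medicine_description": "Abacavir Mylan 300 mg Film-coated Tablets",
    "company_link": "/emc/company/3338",
    "product_information_link": "/emc/product/9079/pil",
    "risk_material_links": "/emc/product/9079/rmms",
})
print(example["product_information_link"])
# https://www.medicines.org.uk/emc/product/9079/pil
```

Empty values are left as they are, so an empty risk_material_links still signals that no risk materials exist.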
Step 1: Train the scraper in the browser extension
Navigate to medicines.org.uk/emc/browse-medicines, open the Minexa Chrome extension, and confirm you are on the right page.
The extension detects pagination automatically. Confirm the pagination logic and choose whether to scrape the list only or follow linked detail pages.
Minexa locks onto the listing container and identifies all data columns automatically. Once the scraper is created, click API Request to get your pre-generated Python code including your scraper_id.
Step 2: Call the Minexa API
Once your scraper is trained, pass your list of URLs to the API. The endpoint is https://api.minexa.ai/data/. Here is a working Python example:
import requests

url = "https://api.minexa.ai/data/"
api_key = "YOUR_API_KEY"  # replace with your own key

data = {
    "batches": [{
        "scraper_id": 6231,  # generated when the scraper was trained
        "columns": ["top_40"],
        "urls": ["https://www.medicines.org.uk/emc/browse-medicines"],
        "scraping": {
            "js_render": True,  # the listing is rendered with JavaScript
            "timeout": 30,
            "js_code": [{"wait_time": 2}, {"page_init": True}, {"wait_time": 4}],
            "proxy": "verified",
            "retry": 3
        }
    }],
    "threads": 5
}

headers = {"Content-Type": "application/json", "api-key": api_key}
response = requests.post(url, json=data, headers=headers)
response.raise_for_status()  # fail loudly on HTTP errors before parsing
print(response.json())
The scraper_id is generated once during training and reused across all subsequent API calls. Update the urls list to include every page you want to process.
Sample extracted data
[
  {
    "medicine_description": "Abacavir 300mg Film-coated tablets",
    "active_ingredients": "abacavir sulfate",
    "manufacturer": "Aurobindo Pharma - Milpharm Ltd.",
    "company_link": "/emc/company/3006",
    "product_information_link": "/emc/product/12475/pil",
    "risk_material_links": ""
  },
  {
    "medicine_description": "Abacavir Mylan 300 mg Film-coated Tablets",
    "active_ingredients": "abacavir",
    "manufacturer": "Viatris (formerly Mylan or Upjohn)",
    "company_link": "/emc/company/3338",
    "product_information_link": "/emc/product/9079/pil",
    "risk_material_links": "/emc/product/9079/rmms"
  }
]
Notice that the second record has a populated risk_material_links value. At scale, filtering on this field gives a fast view of which products carry active risk minimisation programmes.
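As a rough illustration of that filter, the snippet below keeps only the records whose risk_material_links value is non-empty. It runs on a trimmed copy of the two sample records above; has_risk_materials is a hypothetical helper name, and the check assumes an empty field arrives as an empty string (or is missing), as in the sample.

```python
def has_risk_materials(record: dict) -> bool:
    """True when the record carries a non-empty risk_material_links value."""
    value = record.get("risk_material_links")
    if isinstance(value, str):
        value = value.strip()  # treat whitespace-only strings as empty
    return bool(value)


# Trimmed copy of the sample records shown above.
records = [
    {
        "medicine_description": "Abacavir 300mg Film-coated tablets",
        "risk_material_links": "",
    },
    {
        "medicine_description": "Abacavir Mylan 300 mg Film-coated Tablets",
        "risk_material_links": "/emc/product/9079/rmms",
    },
]

flagged = [r for r in records if has_risk_materials(r)]
for r in flagged:
    print(r["medicine_description"], "->", r["risk_material_links"])
# Abacavir Mylan 300 mg Film-coated Tablets -> /emc/product/9079/rmms
```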
Scale and credit considerations
The emc browse listing spans many alphabetical pages. When running at scale via the API, you will need to construct the full URL list and pass it in batches. Up to 50,000 URLs can be submitted in a single API request. Because the listing is rendered with JavaScript, enable js_render (True in the Python payload) and allow an adequate timeout. If you encounter blocking, switching to a residential proxy or a stronger proxy setting will improve success rates at the cost of additional credits per page.
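One way to respect the 50,000-URL limit is to split the full URL list into chunks and build one payload per chunk. The sketch below reuses the payload settings from the Python example in Step 2. chunk_urls and build_payload are hypothetical helpers, and the example.com URLs are placeholders standing in for the real browse-page URLs, whose pattern is not covered here.

```python
from typing import Iterator

MAX_URLS_PER_REQUEST = 50_000  # per-request limit stated above


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_REQUEST) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_payload(urls: list[str], scraper_id: int) -> dict:
    """Build one request body, mirroring the settings of the Step 2 example."""
    return {
        "batches": [{
            "scraper_id": scraper_id,
            "columns": ["top_40"],
            "urls": urls,
            "scraping": {
                "js_render": True,
                "timeout": 30,
                "js_code": [{"wait_time": 2}, {"page_init": True}, {"wait_time": 4}],
                "proxy": "verified",
                "retry": 3,
            },
        }],
        "threads": 5,
    }


# Placeholder URLs only; substitute the real list of pages to scrape.
all_urls = [f"https://example.com/page/{n}" for n in range(120_000)]
payloads = [build_payload(chunk, scraper_id=6231) for chunk in chunk_urls(all_urls)]
print([len(p["batches"][0]["urls"]) for p in payloads])
# [50000, 50000, 20000]
```

Each payload can then be posted to the endpoint in turn, exactly as in the single-request example.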
The meta__unique_hash field in the raw output is useful for deduplication when the same product appears across multiple alphabetical pages or when re-running extractions over time.
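A minimal deduplication pass along those lines might look like the following. It keeps the first record seen for each meta__unique_hash value. The hash values in the example are made up for illustration, and dedupe_by_hash is a hypothetical helper; records without a hash are kept rather than silently dropped.

```python
def dedupe_by_hash(records: list[dict], key: str = "meta__unique_hash") -> list[dict]:
    """Keep the first record for each hash value, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        digest = record.get(key)
        if digest is None:
            unique.append(record)  # no hash to compare on, so keep it
            continue
        if digest in seen:
            continue  # duplicate of an earlier record
        seen.add(digest)
        unique.append(record)
    return unique


# Made-up hash values, for illustration only.
rows = [
    {"medicine_description": "Abacavir 300mg Film-coated tablets", "meta__unique_hash": "h1"},
    {"medicine_description": "Abacavir Mylan 300 mg Film-coated Tablets", "meta__unique_hash": "h2"},
    {"medicine_description": "Abacavir 300mg Film-coated tablets", "meta__unique_hash": "h1"},
]

unique_rows = dedupe_by_hash(rows)
print(len(rows), "->", len(unique_rows))
# 3 -> 2
```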
Results can be exported as Excel or JSON directly from the Minexa dashboard, or consumed programmatically via the paginated API response using the next token in the response metadata.
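The token-based loop can be sketched generically. The exact shape of the paginated response is not shown in this walkthrough, so the code below does not assume any field names: it takes a caller-supplied fetch_page function that accepts a token (None for the first page) and returns the page's records together with the next token, or None when there are no more pages. iter_records, fetch_page, and the in-memory fake pages are all hypothetical and exist only to show the control flow.

```python
from typing import Callable, Iterator, Optional

# (records on this page, token for the next page or None)
Page = tuple[list[dict], Optional[str]]


def iter_records(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield records page by page until no next token is returned."""
    token: Optional[str] = None
    while True:
        records, token = fetch_page(token)
        yield from records
        if not token:
            break


# In-memory stand-in for the real API call, for illustration only.
FAKE_PAGES = {
    None: ([{"medicine_description": "A"}, {"medicine_description": "B"}], "t1"),
    "t1": ([{"medicine_description": "C"}], None),
}


def fake_fetch(token: Optional[str]) -> Page:
    return FAKE_PAGES[token]


all_records = list(iter_records(fake_fetch))
print([r["medicine_description"] for r in all_records])
# ['A', 'B', 'C']
```

In real use, fetch_page would wrap the HTTP request and read the records and the next token from wherever the response places them.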
