How to scrape pharmaceutical and biotech data from the electronic Medicines Compendium

Minexa.ai
6 days ago
3 min read

The electronic Medicines Compendium (medicines.org.uk) is one of the most complete publicly accessible references for UK-authorised medicines. Every listed product links to its Summary of Product Characteristics (SmPC), its Patient Information Leaflet (PIL), and where applicable, risk minimisation materials. For pharmaceutical researchers, biotech analysts, and regulatory data teams, this is a dense, structured source worth extracting at scale.

This walkthrough covers how to pull that data programmatically using the Minexa API: train a scraper once via the browser extension, then call the API to extract thousands of medicine records without writing selectors or maintaining custom parsing logic.

What data is available

Each row in the browse listing exposes the following fields:

medicine_description: the full product name including dosage and form
active_ingredients: one or more active substances per product
manufacturer: the marketing authorisation holder
company_link: relative path to the manufacturer profile on emc
document_types: nested objects containing SmPC and PIL links per product
product_information_link: direct path to the PIL document
risk_material_links: populated only when risk minimisation materials exist for that product
related_links: all href values associated with the listing row

The risk_material_links field is particularly useful: it is empty for most products and populated only when the manufacturer has filed additional safety documentation. This makes it a clean signal for filtering products that carry additional risk minimisation measures.
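The link fields listed above come back as site-relative paths, so they usually need resolving before they can be fetched or shared. The following is a minimal sketch, assuming each link field holds a single string (as in the sample output later in this post) and that paths resolve against https://www.medicines.org.uk. The names EMC_BASE, LINK_FIELDS, and absolutise_links are illustrative and not part of the Minexa API.

```python
from urllib.parse import urljoin

# Assumption: relative paths resolve against the emc site root.
EMC_BASE = "https://www.medicines.org.uk"

# Link fields that hold a single relative path (assumed to be plain strings).
LINK_FIELDS = ("company_link", "product_information_link", "risk_material_links")


def absolutise_links(record: dict) -> dict:
    """Return a copy of the record with non-empty link fields made absolute."""
    resolved = dict(record)
    for field in LINK_FIELDS:
        value = resolved.get(field)
        if value:  # leave empty strings and missing fields untouched
            resolved[field] = urljoin(EMC_BASE, value)
    return resolved


example = {
    "medicine_description": "Abacavir 300mg Film-coated tablets",
    "company_link": "/emc/company/3006",
    "product_information_link": "/emc/product/12475/pil",
    "risk_material_links": "",
}
resolved_example = absolutise_links(example)
print(resolved_example["product_information_link"])
```

Empty values are left as they are, so an empty risk_material_links still means "no risk minimisation materials" after resolution.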

Step 1: Train the scraper in the browser extension

Navigate to medicines.org.uk/emc/browse-medicines, open the Minexa Chrome extension, and confirm you are on the right page.

The extension detects pagination automatically. Confirm the pagination logic and choose whether to scrape the list only or follow linked detail pages.

Minexa locks onto the listing container and identifies all data columns automatically. Once the scraper is created, click API Request to get your pre-generated Python code including your scraper_id.

Step 2: Call the Minexa API

Once your scraper is trained, pass your list of URLs to the API. The endpoint is https://api.minexa.ai/data/. Here is a working Python example:

import requests

url = "https://api.minexa.ai/data/"
api_key = "YOUR_API_KEY"  # replace with your own key

data = {
  "batches": [{
    "scraper_id": 6231,  # generated once when the scraper is trained
    "columns": ["top_40"],
    "urls": ["https://www.medicines.org.uk/emc/browse-medicines"],
    "scraping": {
      "js_render": True,  # the listing is rendered with JavaScript
      "timeout": 30,
      "js_code": [{"wait_time": 2},{"page_init": True},{"wait_time": 4}],
      "proxy": "verified",
      "retry": 3
    }
  }],
  "threads": 5
}

headers = {"Content-Type": "application/json", "api-key": api_key}
response = requests.post(url, json=data, headers=headers)
response.raise_for_status()  # stop on HTTP errors before parsing the body
print(response.json())

The scraper_id is generated once during training and reused across all subsequent API calls. Update the urls list to include every page you want to process.
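Since only the urls list changes between calls, it can help to wrap the request body in a small function. This is a sketch that reuses the exact settings from the example above; build_payload is a hypothetical helper name, not something the Minexa client provides, and the URL pattern for further listing pages is not given in this post, so only the known browse URL appears here.

```python
def build_payload(scraper_id: int, urls: list[str], threads: int = 5) -> dict:
    """Build a request body shaped like the example above for a list of URLs."""
    return {
        "batches": [{
            "scraper_id": scraper_id,
            "columns": ["top_40"],
            "urls": list(urls),  # copy so later changes to the caller's list do not leak in
            "scraping": {
                "js_render": True,
                "timeout": 30,
                "js_code": [{"wait_time": 2}, {"page_init": True}, {"wait_time": 4}],
                "proxy": "verified",
                "retry": 3,
            },
        }],
        "threads": threads,
    }


# Add further listing-page URLs to this list as needed.
page_urls = ["https://www.medicines.org.uk/emc/browse-medicines"]
payload = build_payload(6231, page_urls)
print(len(payload["batches"][0]["urls"]))
```

The resulting dictionary can be passed as the json argument of requests.post in the same way as data in the example above.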

Read the full API docs

Sample extracted data

[
  {
    "medicine_description": "Abacavir 300mg Film-coated tablets",
    "active_ingredients": "abacavir sulfate",
    "manufacturer": "Aurobindo Pharma - Milpharm Ltd.",
    "company_link": "/emc/company/3006",
    "product_information_link": "/emc/product/12475/pil",
    "risk_material_links": ""
  },
  {
    "medicine_description": "Abacavir Mylan 300 mg Film-coated Tablets",
    "active_ingredients": "abacavir",
    "manufacturer": "Viatris (formerly Mylan or Upjohn)",
    "company_link": "/emc/company/3338",
    "product_information_link": "/emc/product/9079/pil",
    "risk_material_links": "/emc/product/9079/rmms"
  }
]

Notice that the second record has a populated risk_material_links value. At scale, filtering on this field gives a fast view of which products carry active risk minimisation programmes.
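That filter can be sketched in a few lines of Python. The example below assumes records shaped like the sample above, where an empty string means no risk minimisation materials; has_risk_materials is an illustrative helper name.

```python
def has_risk_materials(record: dict) -> bool:
    """True when the record carries a non-empty risk_material_links value."""
    return bool(record.get("risk_material_links"))


# Two records trimmed from the sample output above.
records = [
    {
        "medicine_description": "Abacavir 300mg Film-coated tablets",
        "risk_material_links": "",
    },
    {
        "medicine_description": "Abacavir Mylan 300 mg Film-coated Tablets",
        "risk_material_links": "/emc/product/9079/rmms",
    },
]

flagged = [r["medicine_description"] for r in records if has_risk_materials(r)]
print(flagged)  # ['Abacavir Mylan 300 mg Film-coated Tablets']
```

Using record.get rather than direct indexing means rows that lack the field entirely are treated the same as rows where it is empty.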

https://www.youtube.com/watch?v=nQ5-eOlZU1Q

Scale and credit considerations

The emc browse listing spans many alphabetical pages. When running at scale via the API, you will need to construct the full URL list and pass it in batches. Up to 50,000 URLs can be submitted in a single API request. Because the page is rendered with JavaScript, set js_render to True (true in raw JSON) and allow an adequate timeout. If you encounter blocking, switching to a residential proxy or a stronger provider setting can improve success rates at the cost of additional credits per page.
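Splitting a long URL list to respect the 50,000-URL limit is straightforward. Here is a minimal sketch; chunk_urls and MAX_URLS_PER_REQUEST are illustrative names, and the placeholder URLs exist only to demonstrate the chunk sizes, they are not real emc pages.

```python
from typing import Iterator

MAX_URLS_PER_REQUEST = 50_000  # per-request limit stated in this post


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_REQUEST) -> Iterator[list[str]]:
    """Yield consecutive slices of urls, each holding at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Placeholder URLs, used only to show how a long list is split.
placeholder_urls = [f"https://example.invalid/page/{i}" for i in range(120_000)]
chunk_sizes = [len(chunk) for chunk in chunk_urls(placeholder_urls)]
print(chunk_sizes)  # [50000, 50000, 20000]
```

Each chunk can then be sent as the urls list of its own API request.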

The meta__unique_hash field in the raw output is useful for deduplication when the same product appears across multiple alphabetical pages or when re-running extractions over time.
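A deduplication pass over the raw output might look like the sketch below. It assumes each record is a dictionary carrying meta__unique_hash as a top-level key; the hash values and the dedupe_by_hash name are made up for illustration.

```python
def dedupe_by_hash(records: list[dict], key: str = "meta__unique_hash") -> list[dict]:
    """Keep the first record seen for each hash value, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value is None:
            unique.append(record)  # keep records without a hash rather than drop them
            continue
        if value in seen:
            continue
        seen.add(value)
        unique.append(record)
    return unique


# Hypothetical hash values, for illustration only.
raw = [
    {"medicine_description": "Abacavir 300mg Film-coated tablets", "meta__unique_hash": "a1"},
    {"medicine_description": "Abacavir Mylan 300 mg Film-coated Tablets", "meta__unique_hash": "b2"},
    {"medicine_description": "Abacavir 300mg Film-coated tablets", "meta__unique_hash": "a1"},
]
deduped = dedupe_by_hash(raw)
print(len(deduped))  # 2
```

Keeping the first occurrence preserves the original listing order, which is convenient when comparing one extraction run against an earlier one.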

Results can be exported as Excel or JSON directly from the Minexa dashboard, or consumed programmatically via the paginated API response using the next token in the response metadata.
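Token-based pagination generally follows the loop sketched below. The response layout is an assumption here: rows under a "data" key and the token under meta.next are guesses based on the description above, so check the API docs for the real field names. fetch_page is a hypothetical callable you would implement around your own request code, and the fake pages stand in for real responses.

```python
from typing import Callable, Iterator, Optional


def iter_results(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield rows from every page, following the next token until none is returned."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("data", [])          # assumed key for the row list
        token = page.get("meta", {}).get("next")  # assumed location of the token
        if not token:
            break


# Stand-in pages keyed by token, used instead of real API responses.
FAKE_PAGES = {
    None: {"data": [{"id": 1}, {"id": 2}], "meta": {"next": "t2"}},
    "t2": {"data": [{"id": 3}], "meta": {}},
}


def fake_fetch(token: Optional[str]) -> dict:
    return FAKE_PAGES[token]


all_rows = list(iter_results(fake_fetch))
print(len(all_rows))  # 3
```

Passing the fetch step in as a callable keeps the loop independent of the exact request format, so only fetch_page needs adjusting once the real field names are confirmed.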

Install the Minexa Chrome extension to get started
