How to scrape SIC codes and filings data from Companies House using the Minexa API
- Minexa.ai
The Companies House SIC code list is a complete reference of Standard Industrial Classification codes used by UK companies when registering or filing. Every active company on the register is assigned at least one SIC code, making this page a foundational lookup table for anyone building company intelligence pipelines, sector filters, or compliance tooling.
This guide walks through how to extract the full SIC code dataset from resources.companieshouse.gov.uk/sic/ using the Minexa API. The workflow has two phases: train a scraper once using the Minexa Chrome extension, then call the Minexa API programmatically to pull the data whenever needed.
Step-by-step: training the scraper
Open the Minexa home page after installing the extension. This is the starting point before navigating to the target page.
Navigate to resources.companieshouse.gov.uk/sic/. The page loads a flat, paginated table of all SIC codes organised by section. This is the page the scraper will be trained on.
Click the Minexa extension icon. The popup appears and confirms the current page. Click I'm on the right page to proceed.
Minexa detects the pagination structure on the page and shows the available pagination options. Review them and click Continue. Note that when using the API directly, pagination is not handled automatically. You will need to write a JS code scenario that defines what needs to be clicked to move between pages.
After validating pagination, the extension presents the scraping mode options. For a reference table like this one, the default list mode is the right choice.
The start scraping screen confirms the mode selection before the highlighting step begins.
Hover over the data container on the page. Minexa highlights the full list block. Click to confirm the selection. You are selecting the parent container, not individual rows. Minexa discovers all columns within it automatically.
Once the scraper is created, all extracted data points appear in the review panel. You can scroll through the columns and verify the values before proceeding.
Click API Request in the top right corner. The extension generates ready-to-use JSON and Python code pre-filled with your scraper ID and the correct endpoint.
What the extracted data looks like
Each row returns two primary fields. The section_code field holds either an alphabetical section label (such as Section A) or a five-digit numeric SIC code (such as 01110). The description field holds the plain-text activity name for that row. Section header rows and individual code rows are both captured in sequence, preserving the full hierarchy of the reference table.
[
  {
    "description": "Agriculture, Forestry and Fishing",
    "section_code": "Section A"
  },
  {
    "description": "Growing of cereals (except rice), leguminous crops and oil seeds",
    "section_code": "01110"
  },
  {
    "description": "Mining and Quarrying",
    "section_code": "Section B"
  },
  {
    "description": "Deep coal mines",
    "section_code": "05101"
  }
]
API request and Python code
Once the scraper is trained, copy the generated Python code from the extension. Update the scraper_id with your own value and add the URLs you want to process. The example below shows the standard request structure.
import requests

url = "https://api.minexa.ai/data/"
api_key = "YOUR_API_KEY"

data = {
    "batches": [{
        "scraper_id": 6190,  # replace with your own scraper ID
        "columns": ["top_20"],  # top-ranked fields; explicit column names also work
        "urls": ["https://resources.companieshouse.gov.uk/sic/"],
        "scraping": {
            "js_render": True,
            "proxy": "verified"
        }
    }],
    "threads": 3  # number of URLs processed in parallel
}

headers = {"Content-Type": "application/json", "api-key": api_key}
response = requests.post(url, json=data, headers=headers)
print(response.json())
The scraper_id ties every API call to the trained page structure. Setting columns to top_20 returns the top-ranked fields by relevance, so you do not need to define a schema upfront. Once you know which fields you need, you can pass explicit column names instead. Both approaches cost the same.
The threads value controls how many URLs are processed in parallel. For a single-page reference like this one, a low thread count is sufficient. For larger batches of Companies House filing pages, increasing threads reduces total processing time proportionally up to your plan limit.
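The request body described above can be assembled by a small helper, which makes it easier to switch between top_20 and explicit columns or to adjust threads per job. This is a minimal sketch rather than anything generated by Minexa: build_payload is a hypothetical helper, it only mirrors the fields shown in the example request, and the explicit column names are an assumption based on the field names in the sample output.

```python
from typing import Optional


def build_payload(scraper_id: int, urls: list[str],
                  columns: Optional[list[str]] = None,
                  threads: int = 3) -> dict:
    """Assemble a request body in the shape of the example request."""
    if not urls:
        raise ValueError("at least one URL is required")
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return {
        "batches": [{
            "scraper_id": scraper_id,
            # Fall back to the top_20 selector when no explicit columns are given.
            "columns": list(columns) if columns else ["top_20"],
            "urls": list(urls),
            "scraping": {"js_render": True, "proxy": "verified"},
        }],
        "threads": threads,
    }


# Assumed column names: they mirror the sample output and may differ
# for your own scraper.
payload = build_payload(
    6190,
    ["https://resources.companieshouse.gov.uk/sic/"],
    columns=["section_code", "description"],
    threads=1,  # a single URL gains nothing from more parallelism
)
```

The resulting dictionary can be passed as the json argument of requests.post, exactly as in the example request.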
Once the job finishes, the full dataset is available for export. You can download it as Excel or JSON directly from the Minexa interface.
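Because section headers and code rows arrive in one flat sequence, a JSON export usually needs a small post-processing step before it can drive a sector filter. The sketch below groups code rows under the most recent section header. It assumes the export has the same shape as the sample rows shown earlier; group_by_section is a hypothetical helper, and the inline sample stands in for a file you would load yourself.

```python
import json


def group_by_section(rows):
    """Nest code rows under the most recent 'Section X' header row."""
    sections = []
    current = None
    for row in rows:
        label = row["section_code"].strip()
        if label.startswith("Section"):
            current = {"section": label, "title": row["description"], "codes": []}
            sections.append(current)
        elif current is not None:
            # Codes stay as strings so leading zeros (e.g. 01110) are preserved.
            current["codes"].append(
                {"code": label, "description": row["description"]}
            )
        # Code rows that appear before any section header are skipped.
    return sections


# Inline sample copied from the rows shown earlier in this guide.
sample = json.loads("""[
  {"description": "Agriculture, Forestry and Fishing", "section_code": "Section A"},
  {"description": "Growing of cereals (except rice), leguminous crops and oil seeds", "section_code": "01110"},
  {"description": "Mining and Quarrying", "section_code": "Section B"},
  {"description": "Deep coal mines", "section_code": "05101"}
]""")

grouped = group_by_section(sample)
for section in grouped:
    print(section["section"], section["title"], len(section["codes"]))
```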
The scraper trained here can be reused across any structurally similar Companies House reference page without modification. The same scraper_id works across all future API calls as long as the page structure remains unchanged.
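Since reuse depends on the page structure staying the same, it can help to sanity-check each result before loading it downstream. The following is a rough sketch, not a Minexa feature: find_problems is a hypothetical helper, and the two patterns (a five-digit code or a "Section" label with one capital letter) are assumptions taken from the sample rows in this guide.

```python
import re

# Assumed formats, inferred from the sample output.
CODE_RE = re.compile(r"^\d{5}$")
SECTION_RE = re.compile(r"^Section [A-Z]$")


def find_problems(rows):
    """Return human-readable messages for rows that do not match the expected shape."""
    problems = []
    for i, row in enumerate(rows):
        code = row.get("section_code")
        desc = row.get("description")
        code_ok = isinstance(code, str) and (
            CODE_RE.match(code) or SECTION_RE.match(code)
        )
        if not code_ok:
            problems.append(f"row {i}: unexpected section_code {code!r}")
        if not isinstance(desc, str) or not desc.strip():
            problems.append(f"row {i}: missing description")
    return problems


rows = [
    {"description": "Mining and Quarrying", "section_code": "Section B"},
    {"description": "Deep coal mines", "section_code": "05101"},
]
issues = find_problems(rows)
print("structure looks unchanged" if not issues else issues)
```

An empty result does not prove the page is unchanged, but a sudden run of mismatches is a reasonable signal that the scraper needs retraining.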
For a related example using a different filings data source, see how to scrape documents and filings data from OpenCorporates using the Minexa API.
