How to scrape institution data from Carnegie Classifications using the Minexa API

The Carnegie Classifications directory lists every accredited higher education institution in the United States, each tagged with a classification label, a state, and a student access and earnings designation. That combination of fields makes it one of the more useful structured sources for education research, policy work, and institutional benchmarking. The challenge is that the data lives across paginated HTML pages, not in a downloadable file.

This post walks through how to extract that data programmatically using the Minexa API. The workflow has two phases: train a scraper once using the Minexa Chrome extension, then call the API to pull structured records at scale.

What the extracted data looks like

Here is a sample of what the Minexa API returns from the Carnegie Classifications institutions listing:

[
  {
    "institution_name": "Aaniiih Nakoda College",
    "institutional_classification": "Professions-focused Associate Small",
    "location": "MT",
    "student_access_and_earnings": "Opportunity Colleges and Universities-Higher Access, Higher Earnings",
    "website_url": "https://carnegieclassifications.acenet.edu/institution/aaniiih-nakoda-college/"
  },
  {
    "institution_name": "Abilene Christian University",
    "institutional_classification": "Professions-focused Undergraduate/Graduate-Doctorate Medium",
    "location": "TX",
    "student_access_and_earnings": "Lower Access, Medium Earnings",
    "website_url": "https://carnegieclassifications.acenet.edu/institution/abilene-christian-university/"
  },
  {
    "institution_name": "Albany College of Pharmacy and Health Sciences",
    "institutional_classification": "Special Focus: Other Health Professions",
    "location": "NY",
    "student_access_and_earnings": "Lower Access, Higher Earnings",
    "website_url": "https://carnegieclassifications.acenet.edu/institution/albany-college-of-pharmacy-and-health-sciences/"
  }
]

Each record maps to exactly one institution, with every field already separated out, so no HTML parsing is required on your end.

What each field captures

The institution_name field gives the full official name as listed in the Carnegie directory. This is the primary identifier for deduplication and joins with other datasets.

The institutional_classification field encodes the full Carnegie classification string. It distinguishes institution type (Special Focus, Professions-focused, Mixed), degree level (Associate, Baccalaureate, Master's, Doctorate), and size category (Small, Medium, Large) in a single value. For filtering or segmenting a dataset by institution type, this is the field to use.

The location field returns a two-letter US state code. It is compact but sufficient for state-level aggregation, geographic filtering, or mapping workflows.

The student_access_and_earnings field is one of the more analytically interesting outputs. It surfaces the Carnegie equity designation for each institution, including labels like 'Opportunity Colleges and Universities-Higher Access, Higher Earnings', 'Lower Access, Medium Earnings', and 'Not Classified'. This field alone can anchor a policy-focused analysis of which institution types serve higher-access student populations and what earnings outcomes those students see.

The website_url field links to each institution's individual Carnegie profile page. This enables downstream enrichment: a second extraction pass on those URLs can pull additional classification details, historical data, or methodology notes for any subset of institutions.
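As a rough illustration of how these fields can be used once extracted, the sketch below deduplicates on institution_name, aggregates by location, and filters on the student_access_and_earnings label. The records are the three samples shown earlier (with website_url omitted for brevity); the helper names are hypothetical and not part of the Minexa API.

```python
from collections import Counter

# The three sample records from the output above (website_url omitted).
records = [
    {
        "institution_name": "Aaniiih Nakoda College",
        "institutional_classification": "Professions-focused Associate Small",
        "location": "MT",
        "student_access_and_earnings": "Opportunity Colleges and Universities-Higher Access, Higher Earnings",
    },
    {
        "institution_name": "Abilene Christian University",
        "institutional_classification": "Professions-focused Undergraduate/Graduate-Doctorate Medium",
        "location": "TX",
        "student_access_and_earnings": "Lower Access, Medium Earnings",
    },
    {
        "institution_name": "Albany College of Pharmacy and Health Sciences",
        "institutional_classification": "Special Focus: Other Health Professions",
        "location": "NY",
        "student_access_and_earnings": "Lower Access, Higher Earnings",
    },
]


def dedupe_by_name(rows):
    """Keep the first record seen for each institution_name."""
    seen = {}
    for row in rows:
        seen.setdefault(row["institution_name"], row)
    return list(seen.values())


def count_by_state(rows):
    """Count institutions per two-letter state code."""
    return Counter(row["location"] for row in rows)


def filter_by_access_label(rows, needle):
    """Return rows whose access and earnings label contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [r for r in rows if needle in r["student_access_and_earnings"].lower()]


# Simulate a duplicate row, as can happen when pages are fetched twice.
unique = dedupe_by_name(records + records[:1])
state_counts = count_by_state(unique)
higher_access = filter_by_access_label(unique, "Higher Access")
```

The substring match is deliberately simple. For a real analysis you would probably normalise the labels into separate access and earnings columns first.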

Video walkthrough

Watch the full extraction setup in the accompanying video before going through the steps below.

How to train the scraper

Open the Minexa Chrome extension and navigate to the Minexa home page to get started.

Navigate to carnegieclassifications.acenet.edu/institutions/ in your browser. The extension detects the page structure automatically once it loads.

Click 'I'm on the right page' in the extension popup to confirm the target URL and begin detection.

The extension identifies the pagination method used by the directory and presents it for confirmation. Click 'Continue' to proceed.

At this stage you can choose to scrape only the listing data, or follow each institution link and extract detail page content as well. For most pipelines, the listing fields are sufficient.

The extension highlights the repeating institution container automatically. Confirm the selection and click 'Create scraper'.

All extracted data points appear with navigation controls so you can review each column before finalising the scraper configuration.

Click 'API request' to view the generated code samples. Note your scraper ID here; you will need it for every API call.

Calling the Minexa API

Once the scraper is trained, use your scraper ID to call the https://api.minexa.ai/data endpoint. Here is a working Python example:

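import requests

# Minexa data endpoint and API-key header.
url = "https://api.minexa.ai/data"
headers = {
    "Content-Type": "application/json",
    "x-api-key": "YOUR_API_KEY"  # replace with your own key
}
payload = {
    "scraper_id": 4821,  # use the scraper ID you noted during training
    "urls": ["https://carnegieclassifications.acenet.edu/institutions/"],
    "columns": "top_25"
}

# Send the request; the timeout keeps a stalled call from hanging the script.
response = requests.post(url, json=payload, headers=headers, timeout=60)
response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
print(response.json())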

To extract across multiple pages, pass each paginated URL as a separate entry in the urls array. The scraper ID stays the same across all calls since the page structure is consistent throughout the directory.
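A minimal sketch of assembling such a multi-page request body is shown here. The payload keys mirror the request example above. The "page/{page}/" URL template is an assumption made purely for illustration: check the directory's real pagination URLs in your browser and pass the correct template. The function names are hypothetical.

```python
BASE_URL = "https://carnegieclassifications.acenet.edu/institutions/"


def paginated_urls(last_page, template=BASE_URL + "page/{page}/"):
    """Return the listing URL for page 1 plus one URL per later page.

    ASSUMPTION: the default template is a placeholder, not the directory's
    confirmed pagination scheme. Replace it with the pattern you observe.
    """
    return [BASE_URL] + [template.format(page=n) for n in range(2, last_page + 1)]


def build_payload(scraper_id, page_urls, columns="top_25"):
    """Assemble the JSON body: one scraper ID, many URLs."""
    return {"scraper_id": scraper_id, "urls": list(page_urls), "columns": columns}


# Same scraper ID as in the single-page example, three pages of URLs.
payload = build_payload(4821, paginated_urls(3))
```

The resulting payload can be posted exactly as in the single-page example. Whether one call should carry a few URLs or many is not stated in this post, so keeping batches small is a cautious default.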

The full Carnegie Classifications directory spans thousands of institutions. Building a complete dataset means paginating through all available pages and collecting results into a single file. A checkpoint-based script that saves output after each page is a practical approach for runs of that size.
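One way to structure such a checkpoint-based run is sketched below. It is an illustration under stated assumptions, not Minexa's own tooling: fetch_page is a hypothetical callable you supply, which takes one URL and returns a list of record dictionaries (for example by wrapping the requests.post call shown earlier). The checkpoint file layout is likewise invented for this example.

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return (records, done_urls) from a previous run, or empty values."""
    p = Path(path)
    if not p.exists():
        return [], set()
    state = json.loads(p.read_text(encoding="utf-8"))
    return state["records"], set(state["done_urls"])


def save_checkpoint(path, records, done_urls):
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp = Path(str(path) + ".tmp")
    state = {"records": records, "done_urls": sorted(done_urls)}
    tmp.write_text(json.dumps(state), encoding="utf-8")
    tmp.replace(path)


def run_with_checkpoints(page_urls, fetch_page, path):
    """Fetch each page once, saving progress after every page.

    `fetch_page` is a hypothetical callable: URL in, list of records out.
    """
    records, done = load_checkpoint(path)
    for url in page_urls:
        if url in done:
            continue  # already collected in an earlier run
        records.extend(fetch_page(url))
        done.add(url)
        save_checkpoint(path, records, done)
    return records
```

If the script is interrupted, rerunning it with the same checkpoint path skips the pages already saved and continues from the first missing one. How records are pulled out of response.json() inside fetch_page depends on the response shape, which this post does not document, so inspect a real response before writing that part.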

To get started, visit minexa.ai and install the Chrome extension to train your first scraper on the Carnegie Classifications directory.
