Resume Filtering¶
This notebook analyzes a collection of resumes, extracting the education info (university, degree, etc.) from each, and filters the resumes according to a criterion.
This notebook shows how to use Semlib with local models via Ollama. It employs a model cascade: a higher-capacity model extracts the information, and a smaller model turns it into structured data (to work around this bug with gpt-oss in Ollama).
The processing is implemented with the following pipeline:
- (third-party tool) Convert PDF to Markdown with Marker.
- (map) Use gpt-oss:20b to extract education information from the resume Markdown content.
- (map) Use qwen3:8b to turn the education information into structured data.
- (non-semantic filter) Filter for the resumes that have master's degrees.
Install and configure dependencies¶
Ollama¶
This notebook relies on Ollama, which you can use to run LLMs on your local machine. Download Ollama and start it before you proceed.
We use two different open-source LLMs, gpt-oss and qwen3.
You will need a reasonably powerful machine to run these models locally. If they fail to run, or run too slowly, consider trying smaller open-source models instead, or using a hosted model (e.g., via the OpenAI API) to run this notebook.
First, we make sure these models are present on your local machine (if they aren't, this is about a 20 GB download).
!ollama pull gpt-oss:20b
!ollama pull qwen3:8b
%pip install semlib marker-pdf
We start by initializing a Semlib Session. A session provides a context for performing Semlib operations. We configure the session to cache LLM responses on disk in cache.db, and we set the default model to the open-source gpt-oss:20b via the local provider ollama_chat/.
from semlib import OnDiskCache, Session
session = Session(cache=OnDiskCache("cache.db"), model="ollama_chat/gpt-oss:20b")
Download and preprocess dataset¶
!curl -s -L -o resume-dataset.zip https://www.kaggle.com/api/v1/datasets/download/snehaanbhawal/resume-dataset
!unzip -q -o resume-dataset.zip
This dataset contains resumes in PDF format (feel free to examine them in your PDF viewer: the resumes are in the data/ directory).
We use Marker to convert these to Markdown. To reduce processing time, we sub-sample the dataset, considering only 10 resumes.
The first time you use Marker, it needs to download some ML models (up to about 3 GB of data).
The following cell takes about 2 minutes to run on an M3 MacBook Pro.
import os

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

converter = PdfConverter(
    artifact_dict=create_model_dict(),
)

directory = "data/data/ENGINEERING"
files = sorted(os.listdir(directory))[:10]
texts = []
for file in files:
    rendered = converter(os.path.join(directory, file))
    texts.append(rendered.markdown)
Now, we can preview what one of these resume texts looks like. We note that there are some parsing errors (the PDFs are not high-quality to begin with, and there are additional errors introduced in the conversion to Markdown). LLMs end up being pretty effective at processing data like this, though.
print(f"{texts[0][:1000]}...")
## ENGINEERINGLABTECHNICIAN Career Focus Mymain objectivein seeking employment withTriumphActuation Systems Inc. is to work in a professionalatmosphere whereIcan utilize my skillsand continueto gain experiencein theaerospaceindustry to advanceinmy career. ProfessionalExperience EngineeringLab TechnicianOct 2016 to Current CompanyNameï¼ City , State - Responsiblefor testing various seatstructures to meetspecificcertification requirements. Â - Maintain and calibratetest instruments to ensuretesting capabilitiesare maintained. - Ensure dataiscaptured and recorded correctly forcertification test reports. - Dutiesalso dynamictestset-up and staticsuitetesting. EngineeringLab Technician, Sr. Specialist Apr 2012 to Oct 2016 CompanyNameï¼ City , State - Utilized skills learned fromLabViewCourse 1 training to constructand maintainLabViewVI programs. - Responsiblefor fabricating and maintaining hydraulic/electricaltestequipment to complete developmentand qualification programs. - Apply engine...
Filter resumes¶
Extract education information¶
We begin with a semantic map to extract education information from the resume. We use the high-capacity gpt-oss:20b model (set as the default in the Session constructor above). At this time, there is a bug that prevents structured outputs from this model in Ollama, so we just use it to extract a textual description of the education information as a first step.
The following cell takes about 2 minutes to run on an M3 MacBook Pro.
all_education_texts = await session.map(
    texts,
    """
Given a resume, extract the university, graduation year, degree, and area of study for the most advanced degree the individual has.
If some of this information is not present, omit it. If no university education is present, return "(none)".
Resume:
{}
""".strip(),
)
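For intuition, a map over a template string behaves like formatting the template with each item and sending each resulting prompt to the LLM. The following non-LLM sketch (with placeholder strings, ignoring concurrency and caching) shows just the templating step:

```python
# Conceptual sketch of the templating step of a map (no model call).
# The texts here are placeholders standing in for the resume Markdown.
template = """
Given a resume, extract the education information.
Resume:
{}
""".strip()

texts = ["resume one", "resume two"]
prompts = [template.format(t) for t in texts]
print(prompts[0])  # the template with "resume one" substituted for {}
```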
Some of the resumes don't have education information present, in which case the LLM returns "(none)". We filter these out using a non-semantic filter, and preview what one of the education infos looks like.
education_texts = [i for i in all_education_texts if i != "(none)"]
print(education_texts[0])
Forsyth Technical Community College, 2011, Associates, Applied Science, Electronics Engineering
Extract structured data¶
We begin by defining a Pydantic model that describes the structured data we want to get. For the degree field, we use a typing.Literal annotation to restrict the set of values.
from typing import Literal
import pydantic
class EducationInfo(pydantic.BaseModel):
    university: str | None
    graduation_year: int | None
    degree: Literal["Associate", "Bachelor", "Master", "Doctorate"] | None
    area: str | None
Now, we call qwen3:8b, a smaller-capacity LLM (but one that supports structured outputs in Ollama), to convert the text-based descriptions of education information into the structured data type we defined above.
The following cell takes about 30 seconds to run on an M3 MacBook Pro.
educations = await session.map(
    education_texts,
    """
Given the following description of an individual's education, extract the university, graduation year, degree, and area of study.
{}
""".strip(),
    return_type=EducationInfo,
    model="ollama_chat/qwen3:8b",
)
We can take a look at what one of these items looks like.
educations[0]
EducationInfo(university='Forsyth Technical Community College', graduation_year=2011, degree='Associate', area='Electronics Engineering')
Filter for resumes with master's degrees¶
As a first step, we construct an all_educations list whose entries line up with the resumes in files and texts (the educations list doesn't necessarily line up with these, as we filtered out the "(none)" cases).
all_educations: list[EducationInfo | None] = []
i = 0
for text in all_education_texts:
    if text != "(none)":
        all_educations.append(educations[i])
        i += 1
    else:
        all_educations.append(None)
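The manual index bookkeeping above can also be written with an iterator over the structured results, which consumes one entry per non-"(none)" text. A minimal sketch of the same re-alignment, using toy placeholder data in place of the map outputs:

```python
# Re-alignment via an iterator (toy placeholder data; in the notebook,
# these lists come from the two semantic maps above).
all_education_texts = ["Example U, 2011, Associate, EE", "(none)", "Example Tech, 2014, Master, IE"]
educations = ["parsed-1", "parsed-2"]  # one structured result per non-"(none)" entry

it = iter(educations)
all_educations = [next(it) if text != "(none)" else None for text in all_education_texts]
print(all_educations)  # ['parsed-1', None, 'parsed-2']
```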
masters = []
for file, edu in zip(files, all_educations, strict=False):
    if edu is not None and edu.degree == "Master":
        masters.append((file, edu))
Results¶
print(f"Found {len(masters)} resumes with a Master's degree:\n")
for file, edu in masters:
    print(f"- {os.path.join(directory, file)}: {edu.university}, {edu.graduation_year}, {edu.area}")
Found 6 resumes with a Master's degree:

- data/data/ENGINEERING/10624813.pdf: Union College, 1989, Computer Science
- data/data/ENGINEERING/10985403.pdf: Illinois Institute of Technology, 2017, Mechanical & Aerospace Engineering
- data/data/ENGINEERING/11890896.pdf: San Francisco State University, 2007, Decision Sciences
- data/data/ENGINEERING/11981094.pdf: Illinois Institute of Technology, None, Computer Science
- data/data/ENGINEERING/12011623.pdf: University of New Hampshire, 2017, Analytics
- data/data/ENGINEERING/12022566.pdf: University at Buffalo, 2014, Industrial Engineering