This project scrapes myHarvard and the QGuide. Current results are at release. Archived results are at archive.
The project was initially built for hugems.net to find gems, but you can use the CSV for anything you like.
If you found it useful, you can
Course ratings correlate well with recommendation scores.
Course ratings also correlate well with lecturer scores, but with more scatter.
Sentiment analysis of the course comments also agrees well with the average course rating.
Most high-scoring courses have low workload.
Harvard classes tend to have high ratings. It is rare to get a low score.
Most Harvard classes have a workload demand of around 5 hours per week outside of classes, though the distribution is skewed so some classes have much higher workloads.
There is little correlation between the number of students in the class and the score of the class.
This project uses uv for fast dependency management. Python 3.11 is required for numpy compatibility.
First, install uv if you haven't already:
# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Then install dependencies:
uv sync
This will automatically create a virtual environment with Python 3.11 and install all dependencies.
You probably don't need to follow the steps below, since the results can be found at release (or archive for older results). If you want to replicate the data release, or if you are maintaining this repo for future data releases, you can follow the steps below.
The code for this section is at src/qguide.
- First, the program needs to discover all the QGuide links for that year and term. Navigate to this link
  https://qreports.fas.harvard.edu/browse/index?school=FAS&calTerm=YEAR%20SEMESTER
  where you replace YEAR with the year you want the QGuide for (e.g. 2025) and SEMESTER with either Spring or Fall. It requires login.
- Download the webpage (ctrl+s or cmd+s) as an HTML-only file. Keep the default name QReports.html and put it in this folder, replacing the old file.
- Make sure you are in the right folder; if not, run cd src/qguide. Then run uv run scraper.py to scrape the links for the QGuides for each course. The generated links will be stored in courses.csv.
- Visit the first QGuide link scraped in courses.csv. Be careful in VSCode, since it will concatenate the other fields and produce an invalid URL, so don't cmd+click; copy and paste the link instead.
- Open the Developer Console, go to Application and click on the Cookie tab. Get the values for ASP.NET_SessionId and CookieName and paste them into src/qguide/secret_cookie.txt in the following format:
  ASP.NET_SessionId=YOUR_VALUE_HERE;CookieName=YOUR_VALUE_HERE
- Make sure you delete the current QGuides folder, if it exists, to start afresh.
- Run uv run downloader.py to use your cookies to download all the QGuides with the links scraped in the previous step. The QGuides will be stored in the QGuides folder. This takes about 6 minutes.
- Run uv run analyzer.py to generate course_ratings.csv. If you run into a course with bugs, you can copy that FAS string and paste it into the demo or debug section of the code. My usual debugging process is to search for that file in the IDE (cmd+p and paste in the course code that begins with FAS-; the file should show up), reveal it in Finder, open it in Chrome and see what's up. It's fine to ignore some files with errors if, for example, they only contain the response ratio and nothing else.
- Once that's done, rename course_ratings.csv to YEAR_TERM.csv (e.g. 2025_Fall.csv) and put it in release/qguide.
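Two of the manual steps above are easy to get wrong by hand: filling in the browse URL and formatting the cookie file. The sketch below shows one way to build the URL and to parse a cookie line in the NAME=VALUE;NAME=VALUE format. The function names qreports_url and parse_cookie_line are made up for illustration and are not part of src/qguide.

```python
from urllib.parse import quote

BROWSE_URL = "https://qreports.fas.harvard.edu/browse/index"


def qreports_url(year: int, term: str) -> str:
    """Build the QReports browse URL for a year and term (Spring or Fall)."""
    if term not in ("Spring", "Fall"):
        raise ValueError(f"term must be 'Spring' or 'Fall', got {term!r}")
    cal_term = f"{year} {term}"
    # quote() percent-encodes the space as %20, matching the URL template.
    return f"{BROWSE_URL}?school=FAS&calTerm={quote(cal_term)}"


def parse_cookie_line(line: str) -> dict[str, str]:
    """Split 'A=1;B=2' into {'A': '1', 'B': '2'}.

    Only the first '=' in each part separates name from value,
    so values that themselves contain '=' are kept intact.
    """
    cookies = {}
    for part in line.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        cookies[name] = value
    return cookies


print(qreports_url(2025, "Fall"))
print(parse_cookie_line("ASP.NET_SessionId=YOUR_VALUE_HERE;CookieName=YOUR_VALUE_HERE"))
```

Checking that the parsed cookie line contains both ASP.NET_SessionId and CookieName before starting the download can save a failed 6-minute run.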
The code for this section is at src/myharvard.
- Specify the year and term at the bottom of get_myharvard_url_chunks.py and run it (uv run get_myharvard_url_chunks.py) to get the URL chunks of the courses that will be offered. This will generate course_urls.txt and takes around 3 minutes.
- Run uv run get_all_course_data.py to get all_courses.csv.
- Rename this to YEAR_TERM.csv (e.g. 2026_Spring.csv) and put it in release/myharvard.
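The rename-and-move step could be scripted instead of done by hand. Here is a rough sketch, assuming the release folder layout described above (release/qguide and release/myharvard); the helper name publish_release is hypothetical and does not exist in the repo.

```python
import shutil
from pathlib import Path


def publish_release(src, release_root, source: str, year: int, term: str) -> Path:
    """Copy a generated CSV to <release_root>/<source>/<YEAR>_<TERM>.csv.

    `source` is 'qguide' or 'myharvard'; `term` is 'Spring' or 'Fall'.
    The original file is left in place.
    """
    if source not in ("qguide", "myharvard"):
        raise ValueError(f"unknown source: {source!r}")
    if term not in ("Spring", "Fall"):
        raise ValueError(f"unknown term: {term!r}")
    dest_dir = Path(release_root) / source
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{year}_{term}.csv"
    shutil.copyfile(src, dest)
    return dest


# Example call (paths are illustrative):
# publish_release("src/myharvard/all_courses.csv", "release", "myharvard", 2026, "Spring")
```

Copying rather than moving keeps the working file around in case the analysis needs to be re-run.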
hugems.net combines the myHarvard course offerings for a given semester with the QGuide feedback reports from the previous year. Note that hugems.net does not provide any information on courses whose offerings are spaced two years apart. This section describes how the data release was made for hugems.net.
The code for this section is at src/hugems.
- Specify the years and terms for myHarvard and QGuide in combine.py and run it (uv run combine.py) to get qguide_myharvard.csv automatically in the release folder. The CSV inner joins the myHarvard records with the QGuide records using course_id.
- Edit the year and terms in course_ratings_analysis.ipynb and run the notebook. This will generate the graphs above and the rest of the data release at release/hugems. Follow through the notebook and play around!
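To make the inner join on course_id concrete, here is a small standard-library sketch. It is not the code in combine.py; apart from course_id, the column names (title, course_score) are invented for the example.

```python
def inner_join(left_rows, right_rows, key="course_id"):
    """Inner join two lists of dict rows on `key`.

    Only keys present on both sides are kept. If the right side has
    repeated keys, this sketch keeps the first occurrence only.
    """
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], row)

    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            # Left-side columns win if both sides share a column name.
            joined.append({**match, **row})
    return joined


# Made-up rows: course 102 has no QGuide report, course 103 is not offered.
myharvard = [
    {"course_id": "101", "title": "Intro"},
    {"course_id": "102", "title": "New Course"},
]
qguide = [
    {"course_id": "101", "course_score": "4.5"},
    {"course_id": "103", "course_score": "3.9"},
]

combined = inner_join(myharvard, qguide)
print(combined)
```

Because the join is an inner join, a course offered this term with no QGuide report from the chosen year does not appear in the combined output.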
In the QGuide release, we added columns that have the phrase gem_probability. This is not actually a probability, and can be thought of as a score instead (it is not bounded by 0 and 1). A refactoring in the future would be desirable.
- There is a course catalog PDF at the beta myHarvard. We can use that to generate the myHarvard URLs instead of cycling through the actual website. This will cut down the waiting time from about 10 minutes to near instant, and also save some traffic from hitting Harvard's server.
- HDS and XREG have a bug where the catalog number from the pagination process has a suffix that doesn't appear in the actual URL. Right now, we catch this error when it happens and remove that suffix on the fly. There might be a better way to do this.
- The src/qguide code is ancient (pre-Cursor) and can benefit from better design. For example, one can get a better methodology for the gems, especially given LLMs nowadays.
- There are duplicates in the myHarvard pagination. For example, for 2025 Fall, you can find similar classes (e.g. see Orchestra) on page 29 and on page 47. Right now we drop duplicate rows in get_all_course_data.py, but there might be a better way to do this.
- Sometimes, the unique code in qguide is not unique when two people with the same last name teach the course together (see GENED 1069 - Courtney Lamberth, Fall 2024). Currently, we simply drop duplicates in src/hugems/combine.py.
- Instead of manually copying over the release files, we can do that programmatically.
- At the time of writing (March 31, 2025), the beta myHarvard search doesn't show the course level (though it allows filtering by it). Implement course level scraping somehow (the old myHarvard has it).
- The new search groups the EXPOS 20 courses under a single course as different sections, suffixing the URL with 201, 202, etc., though they have different course IDs. We currently don't have EXPOS 20 scraped because we scrape by assuming all the URLs begin with 001.
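Two of the items above rely on dropping duplicate rows. As a point of reference, here is a minimal sketch of order-preserving de-duplication on a chosen set of key columns. It is not the code in get_all_course_data.py or combine.py, and the field names in the example are assumptions.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row for each distinct combination of `key_fields`.

    Row order is preserved; later rows with an already-seen key are dropped.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


# Made-up rows: the same course listed on two different result pages.
rows = [
    {"course_id": "201", "title": "Orchestra", "page": "29"},
    {"course_id": "305", "title": "Seminar", "page": "30"},
    {"course_id": "201", "title": "Orchestra", "page": "47"},
]

deduped = drop_duplicates(rows, ["course_id"])
print(deduped)
```

Choosing the key columns explicitly makes it visible which rows count as the same course, which matters when two listings differ only in an incidental field such as the page they were found on.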







