A simple tool to scrape posts and comments from Reddit subreddits.
Instead of brute-forcing our way through Reddit's pages, we use a clever approach that's both efficient and respectful of Reddit's servers:
First, we visit the subreddit's top posts page and scroll through it just like a human would. As we scroll, we collect the IDs of all the posts we want to save. This is quick and lightweight - we're just gathering a list of what we want to look at later.
Then, for each post we're interested in, we use Reddit's own API to get all the data in a clean, structured format. We do this by appending `.json` to the end of any Reddit post URL, which gives us everything we need in one go: the post itself, all its comments, and all the metadata.
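For example, here is a minimal sketch of that second step using the `requests` library and a hypothetical post ID (the scraper's actual implementation may differ):

```python
import requests

# Hypothetical post ID, as if collected during the scroll phase.
POST_ID = "abc123"
url = f"https://www.reddit.com/r/python/comments/{POST_ID}.json"

# Reddit rejects requests that lack a descriptive User-Agent header.
resp = requests.get(url, headers={"User-Agent": "reddit-scraper-demo/0.1"})
resp.raise_for_status()

# The .json endpoint returns a two-element list:
# the post listing, then the comment listing.
listing = resp.json()
post = listing[0]["data"]["children"][0]["data"]
comments = listing[1]["data"]["children"]
print(post["title"], "-", len(comments), "top-level comments")
```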
This approach has several advantages:
- It's much faster than downloading and parsing entire HTML pages
- We get clean, structured data instead of messy HTML
- We can easily resume if something goes wrong
- We're less likely to trigger Reddit's rate limits
Reddit doesn't like it when people make too many requests too quickly. Our scraper is smart about this:
- If Reddit tells us we're asking for too much too quickly, we save our progress and wait
- We can pick up right where we left off when we run the scraper again
- We add small delays between requests to be nice to Reddit's servers
This makes our scraper more reliable and less likely to get blocked.
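As a rough illustration of the retry-and-delay idea (a sketch only: the function name `polite_get` and the backoff numbers are assumptions, and progress saving is omitted):

```python
import time
import requests

def polite_get(url, max_retries=5, delay=1.0):
    """Fetch a URL, backing off when Reddit signals rate limiting (HTTP 429)."""
    headers = {"User-Agent": "reddit-scraper-demo/0.1"}
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers)
        if resp.status_code == 429:
            # Honor Retry-After if Reddit sends it; otherwise back off exponentially.
            wait = float(resp.headers.get("Retry-After", delay * 2 ** attempt))
            time.sleep(wait)
            continue
        resp.raise_for_status()
        time.sleep(delay)  # small courtesy delay between successful requests
        return resp.json()
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")
```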
To install from source:

```bash
git clone https://github.com/yourusername/reddit-scraper.git
cd reddit-scraper
pip install -e .
```
- Create a `subreddits.json` file with your target subreddits:
  ```json
  [
    "programming",
    "python",
    "physics",
    "biology"
  ]
  ```
- Run the scraper:
  ```bash
  reddit-scraper -d month -s subreddits.json
  ```
- To limit posts per subreddit:
  ```bash
  reddit-scraper -d month -s subreddits.json -l 50
  ```
Available options:

- `-d, --duration`: Time period (`month`/`year`)
- `-l, --post-limit`: Max posts per subreddit
- `-s, --subreddits-file`: Path to the subreddits config file
Data is saved in JSON files under the `data/` directory, one file per subreddit. Example structure:
```json
[
  {
    "post_body": "This is the title of the post",
    "post_user": "username123",
    "post_time": "2023-04-15T14:30:45",
    "comments": [
      {
        "body": "This is a top-level comment",
        "user": "commenter456",
        "time": "2023-04-15T15:20:10",
        "replies": [
          {
            "body": "This is a reply to the comment",
            "user": "replier789",
            "time": "2023-04-15T16:05:30",
            "replies": []
          }
        ]
      }
    ]
  },
  {
    "post_body": "Another post title",
    "post_user": "anotheruser",
    "post_time": "2023-04-14T09:15:22",
    "comments": []
  }
]
```
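Since replies nest recursively, a small recursive walk is the natural way to consume this format. A hypothetical example (the filename `data/python.json` assumes the per-subreddit naming described above):

```python
import json
from pathlib import Path

def count_comments(comments):
    """Count comments at every nesting depth."""
    return sum(1 + count_comments(c["replies"]) for c in comments)

posts = json.loads(Path("data/python.json").read_text())
total = sum(count_comments(p["comments"]) for p in posts)
print(f"{len(posts)} posts, {total} comments")
```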
```
reddit-scraper/
├── reddit_scraper/           # Main package
│   ├── core/                 # Core functionality
│   │   ├── models.py         # Data models
│   │   ├── scraper.py        # Scraping logic
│   │   └── data_processor.py # Data processing
│   ├── utils/                # Utilities
│   │   └── config.py         # Configuration
│   ├── __init__.py           # Package initialization
│   └── __main__.py           # Entry point
├── tests/                    # Test suite
├── setup.py                  # Package setup
├── requirements.txt          # Dependencies
└── README.md                 # Documentation
```
To set up and run the pre-commit hooks:

```bash
pre-commit install
pre-commit run --all-files
```

To run the test suite:

```bash
pytest
```
For coverage information:
```bash
pytest --cov=. --cov-report=term-missing
```
Disclaimer: This tool is provided for educational and research purposes only. Please use it responsibly and in accordance with Reddit's Terms of Service and API Guidelines. The author of this tool is not responsible for any misuse, abuse, or violations of Reddit's policies that may occur when using this software. Users are solely responsible for ensuring their use of this tool complies with all applicable laws, regulations, and platform terms of service.