A command-line tool that fetches posts from Substack using the Substack API wrapper and converts them to Markdown format.
This tool allows you to easily download and convert Substack posts to Markdown format, making it useful for:
- Creating local backups of your Substack content
- Migrating content to other platforms
- Analyzing or processing Substack content offline
- Archiving posts in a portable, plain-text format
- Accessing and converting private/subscriber-only content
- Python 3.7 or higher
- pip (Python package installer)
```bash
# Clone the repository
git clone https://github.com/nelsojona/substack.git
cd substack

# Install dependencies
pip install -r requirements.txt

# Create .env file from example
cp .env.example .env
# Edit the .env file with your credentials and configuration
```
The tool supports loading configuration from environment variables in a `.env` file. This is especially useful for storing sensitive information such as authentication credentials and proxy settings.
To use environment variables:
1. Copy the `.env.example` file to `.env`:

   ```bash
   cp .env.example .env
   ```

2. Edit the `.env` file with your credentials and configuration:

   ```
   # Substack Authentication
   [email protected]
   SUBSTACK_PASSWORD=your-password
   SUBSTACK_TOKEN=your-auth-token

   # Oxylabs Proxy Configuration
   OXYLABS_USERNAME=your-username
   OXYLABS_PASSWORD=your-password
   OXYLABS_COUNTRY=US
   OXYLABS_CITY=new_york
   OXYLABS_STATE=us_new_york
   OXYLABS_SESSION_ID=random-session-id
   OXYLABS_SESSION_TIME=10

   # General Configuration
   DEFAULT_OUTPUT_DIR=./markdown_output
   DEFAULT_IMAGE_DIR=./images
   DEFAULT_MAX_IMAGE_WORKERS=4
   DEFAULT_IMAGE_TIMEOUT=10
   ```

3. The tool will automatically load these environment variables when run.
Note: Command-line arguments take precedence over environment variables.
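This precedence rule can be implemented by feeding environment values in as argparse defaults, so an explicit flag always wins. A minimal sketch (the variable names mirror the `.env` example above; the tool's actual wiring may differ):

```python
import argparse
import os

def build_parser(env=os.environ):
    """Build a parser whose defaults come from the environment."""
    parser = argparse.ArgumentParser()
    # The env value becomes the default; an explicit flag overrides it.
    parser.add_argument(
        "--output",
        default=env.get("DEFAULT_OUTPUT_DIR", "./markdown_output"))
    parser.add_argument("--token", default=env.get("SUBSTACK_TOKEN"))
    return parser

# With no flags, the environment value wins; with a flag, the flag wins.
args = build_parser({"DEFAULT_OUTPUT_DIR": "./from_env"}).parse_args([])
override = build_parser({"DEFAULT_OUTPUT_DIR": "./from_env"}).parse_args(
    ["--output", "./from_cli"])
```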
The Substack to Markdown tool provides a comprehensive command-line interface (CLI) for converting Substack posts to Markdown format. Here's how to use it effectively:
The tool now provides a unified interface through the main.py script:
python main.py [command] [options]
Available commands:
`direct`: The direct downloader offers better performance and uses sitemap.xml for more reliable post discovery:

python main.py direct --author <author_identifier>

`batch`: Process multiple authors in parallel using a configuration file:

python main.py batch --config <config_file_path>

`optimized`: Use the optimized downloader:

python main.py optimized download --author <author_identifier>

`classic`: Use the classic downloader:

python main.py classic --author <author_identifier>
Where `<author_identifier>` is the Substack author's username or subdomain (e.g., "big" for "big.substack.com", Matt Stoller's BIG newsletter).
The general command structure for the direct downloader:
python main.py direct --author <author> [options]
For batch processing:
python main.py batch --config <config_file> [options]
The examples below use the recommended direct downloader command.
Fetch all posts from a specific author and save them to the default output directory:
python main.py direct --author big
Save posts to a specific directory:
python main.py direct --author big --output ./my_posts
Fetch only the 5 most recent posts:
python main.py direct --author big --max-posts 5
Enable verbose mode to see detailed progress information:
python main.py direct --author big --verbose
Process a specific post by its URL:
python main.py direct --author big --url https://big.substack.com/p/how-to-get-rich-sabotaging-nuclear
By default, the direct downloader uses sitemap.xml for efficient post discovery. You can disable this feature if needed:
python main.py direct --author big --no-sitemap
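The idea behind sitemap-based discovery can be sketched in a few lines: fetch the newsletter's sitemap.xml and keep only the `/p/` post URLs. This is purely illustrative; the downloader's actual parsing may differ.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def post_urls_from_sitemap(sitemap_xml):
    """Extract post URLs (those containing /p/) from sitemap XML."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text for loc in root.iter(SITEMAP_NS + "loc")]
    return [u for u in urls if "/p/" in u]

# A trimmed-down sitemap for illustration.
sample = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://big.substack.com/p/first-post</loc></url>
  <url><loc>https://big.substack.com/about</loc></url>
</urlset>"""
```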
Adjust the number of concurrent downloads for better performance:
python main.py direct --author big --max-concurrency 10 --max-image-concurrency 20
Force refresh existing posts:
python main.py direct --author big --force
Filter posts by date range:
python main.py direct --author big --start-date 2023-01-01 --end-date 2023-12-31
This will only download posts published between January 1, 2023 and December 31, 2023.
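The filter itself is simple: a post is kept when its publication date falls inside the inclusive range, and a missing bound means "no limit". A sketch of that logic (not the tool's exact code):

```python
from datetime import date

def in_range(published, start=None, end=None):
    """Return True if published falls within [start, end], inclusive.

    A bound of None means that side of the range is open.
    """
    if start is not None and published < start:
        return False
    if end is not None and published > end:
        return False
    return True
```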
By default, the direct downloader saves images locally. You can disable this:
python main.py direct --author big --no-images
Control the number of concurrent image downloads:
python main.py direct --author big --max-image-concurrency 15
To include post comments in the Markdown output:
python main.py direct --author big --include-comments
This will add a "Comments" section at the end of each post with all comments and replies properly formatted.
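Threaded comments map naturally onto nested Markdown list items, with each reply level indented one step further. The field names below (`author`, `body`, `replies`) are illustrative assumptions, not the tool's actual schema:

```python
def comments_to_markdown(comments, depth=0):
    """Render a comment tree as nested Markdown list items."""
    lines = []
    for comment in comments:
        indent = "  " * depth  # two spaces per nesting level
        lines.append(indent + "- **" + comment["author"] + "**: " + comment["body"])
        # Recurse into replies one level deeper.
        lines.extend(comments_to_markdown(comment.get("replies", []), depth + 1))
    return lines

thread = [{"author": "alice", "body": "Great post!",
           "replies": [{"author": "bob", "body": "Agreed."}]}]
```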
To access private/subscriber-only content with the direct downloader, you can use an authentication token:
python main.py direct --author big --token your-auth-token --url https://big.substack.com/p/private-post-slug
To obtain a Substack authentication token, you can use the provided script:
python scripts/get_substack_token.py
This will guide you through the process of obtaining an authentication token from your browser session.
Argument | Description | Required | Default |
---|---|---|---|
`--author` | Substack author identifier (username or subdomain) | Yes | - |
`--output` | Output directory for Markdown files | No | Current directory |
`--limit` | Maximum number of posts to fetch | No | All posts |
`--verbose` | Enable verbose output | No | False |
`--use-post-objects` | Use enhanced mode with direct Post object methods | No | False |
`--url` | Process a single post by URL | No | - |
`--slug` | Process a single post by slug | No | - |
`--async-mode` | Use async/aiohttp for downloading | No | False |
`--processes` | Number of processes to use for multiprocessing | No | 2 |
`--min-delay` | Minimum delay between requests in seconds | No | 0.5 |
`--max-delay` | Maximum delay between requests in seconds | No | 5.0 |
`--incremental` | Only download new or updated content | No | False |
Argument | Description | Required | Default |
---|---|---|---|
`--author` | Substack author identifier | No | "tradecompanion" |
`--output` | Output directory | No | "output" |
`--max-pages` | Maximum number of archive pages to scan | No | Scan all pages |
`--max-posts` | Maximum number of posts to download | No | All posts |
`--force` | Force refresh of already downloaded posts | No | False |
`--verbose` | Enable verbose output | No | False |
`--url` | Download a specific URL instead of scanning archive | No | - |
`--no-images` | Skip downloading images | No | False |
`--min-delay` | Minimum delay between requests in seconds | No | 0.5 |
`--max-delay` | Maximum delay between requests in seconds | No | 5.0 |
`--max-concurrency` | Maximum concurrent requests | No | 5 |
`--max-image-concurrency` | Maximum concurrent image downloads | No | 10 |
`--token` | Substack authentication token for private content | No | - |
`--incremental` | Only download new or updated content | No | False |
`--async-mode` | Use async/aiohttp for downloading | No | True |
`--clear-cache` | Clear cache before starting | No | False |
`--use-sitemap` | Use sitemap.xml for post discovery | No | True |
`--no-sitemap` | Skip using sitemap.xml for post discovery | No | False |
`--include-comments` | Include post comments in the output | No | False |
`--start-date` | Start date for filtering posts (YYYY-MM-DD) | No | - |
`--end-date` | End date for filtering posts (YYYY-MM-DD) | No | - |
Argument | Description | Required | Default |
---|---|---|---|
`--email` | Substack account email | No | - |
`--password` | Substack account password | No | - |
`--token` | Substack authentication token | No | - |
`--cookies-file` | Path to a file containing cookies | No | - |
`--save-cookies` | Save cookies to a file after authentication | No | - |
`--private` | Indicate that the post is private and requires authentication | No | False |
Argument | Description | Required | Default |
---|---|---|---|
`--use-proxy` | Use Oxylabs proxy for requests | No | False |
`--proxy-username` | Oxylabs username | No | - |
`--proxy-password` | Oxylabs password | No | - |
`--proxy-country` | Country code for proxy (e.g., US, GB, DE) | No | - |
`--proxy-city` | City name for proxy (e.g., london, new_york) | No | - |
`--proxy-state` | US state for proxy (e.g., us_california, us_new_york) | No | - |
`--proxy-session-id` | Session ID to maintain the same IP across requests | No | - |
`--proxy-session-time` | Session time in minutes (max 30) | No | - |
Argument | Description | Required | Default |
---|---|---|---|
`--download-images` | Download and embed images in the Markdown | No | False |
`--image-dir` | Directory to save downloaded images | No | images |
`--image-base-url` | Base URL for image references in Markdown | No | Relative paths |
`--max-image-workers` | Maximum number of concurrent image downloads | No | 4 |
`--image-timeout` | Timeout for image download requests in seconds | No | 10 |
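Bounded-concurrency image downloading, as controlled by `--max-image-workers`, can be sketched with a thread pool. The `fetch` callable is injected here so the sketch stays testable offline; the tool's real implementation may differ:

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, max_workers=4):
    """Download all URLs with at most max_workers concurrent fetches.

    fetch(url) -> bytes. Returns a dict mapping each URL to its content;
    pool.map preserves input order, so zip() pairs them correctly.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))

# Stand-in fetcher: echo the URL as bytes instead of hitting the network.
result = download_all(["a", "b"], lambda u: u.encode(), max_workers=2)
```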
The batch processing feature allows you to download posts from multiple Substack authors in parallel. This is especially useful for backing up or migrating content from multiple newsletters.
You can create an example configuration file with:
python main.py batch --config authors.json --create-example
This will generate a JSON file with the following structure:
```json
{
  "authors": [
    {
      "identifier": "mattstoller",
      "output_dir": "output/mattstoller",
      "max_posts": 10,
      "include_comments": true,
      "no_images": false,
      "incremental": true,
      "verbose": true
    },
    {
      "identifier": "tradecompanion",
      "max_posts": 5,
      "include_comments": false,
      "token": "your-auth-token-here"
    },
    {
      "identifier": "another-author",
      "max_concurrency": 10,
      "max_image_concurrency": 20
    }
  ],
  "global_settings": {
    "min_delay": 1.0,
    "max_delay": 5.0,
    "max_concurrency": 5,
    "max_image_concurrency": 10
  }
}
```
You can also use YAML format by specifying a `.yaml` or `.yml` extension:
python main.py batch --config authors.yaml --create-example
To process all authors in the configuration file:
python main.py batch --config authors.json
You can specify the output directory and the number of parallel processes:
python main.py batch --config authors.json --output custom_output --processes 4
Each author in the configuration can have the following options:
- `identifier` (required): The Substack author identifier
- `output_dir`: Custom output directory for this author
- `max_posts`: Maximum number of posts to download
- `include_comments`: Whether to include comments in the output
- `no_images`: Skip downloading images
- `token`: Authentication token for private content
- `incremental`: Only download new or updated content
- `force`: Force refresh of already downloaded posts
- `verbose`: Enable verbose output
- `min_delay`: Minimum delay between requests
- `max_delay`: Maximum delay between requests
- `max_concurrency`: Maximum concurrent requests
- `max_image_concurrency`: Maximum concurrent image downloads
- `no_sitemap`: Skip using sitemap.xml for post discovery
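Combining `global_settings` with per-author options comes down to a dictionary merge in which per-author keys win. A sketch of that merge (illustrative, not the tool's exact code):

```python
def merge_author_settings(config):
    """Merge global_settings into each author entry.

    Per-author keys take precedence over global defaults.
    """
    global_settings = config.get("global_settings", {})
    return [{**global_settings, **author} for author in config["authors"]]

config = {
    "authors": [{"identifier": "mattstoller", "max_concurrency": 10}],
    "global_settings": {"max_concurrency": 5, "min_delay": 1.0},
}
merged = merge_author_settings(config)
```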
- Fetches all posts for a specified Substack author
- Filters posts by date range
- Converts HTML content to Markdown format
- Preserves formatting, links, and images
- Handles pagination to retrieve all available posts
- Provides progress reporting and error handling
- Saves each post as a separate Markdown file
- Supports direct integration with the Substack API wrapper
- Allows processing individual posts by URL or slug
- Includes caching for improved performance
- Supports authenticated access to private/subscriber-only content
- Downloads and embeds images locally for offline viewing
- Extracts and includes post comments with proper threading
- Extracts newsletter metadata for better organization
- Utilizes optimized performance with async, multiprocessing, and adaptive throttling
- Offers incremental sync to efficiently update content
- Implements robust error handling and recovery mechanisms
- Supports batch processing of multiple authors in parallel
- Provides proxy support for avoiding rate limits and accessing geo-restricted content
The tool supports an enhanced mode that uses direct Post object methods from the Substack API wrapper. This mode provides several advantages:
- More direct integration with the Substack API
- Better error handling and retries
- Support for processing individual posts by URL or slug
- Caching of Post objects for improved performance
To use enhanced mode, add the `--use-post-objects` flag to your command:
python substack_to_md.py --author mattstoller --use-post-objects
The tool supports several methods for authenticating with Substack to access private content:
You can provide your Substack account email and password directly:
python substack_to_md.py --author mattstoller --email [email protected] --password your-password --private
Note: This method may not always work due to Substack's authentication flow, which may include CSRF tokens, captchas, or other security measures.
If you have a Substack authentication token, you can use it directly:
python substack_to_md.py --author mattstoller --token your-auth-token --private
You can use a cookies file exported from your browser:
python substack_to_md.py --author mattstoller --cookies-file path/to/cookies.txt --private
The cookies file should be in the Mozilla/Netscape format, which can be exported using browser extensions like "Cookie Quick Manager" for Firefox or "EditThisCookie" for Chrome.
After authenticating, you can save the cookies for future use:
python substack_to_md.py --author mattstoller --email [email protected] --password your-password --save-cookies path/to/cookies.txt --private
This will save the cookies to the specified file, which you can then use for future authentication.
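Python's standard library can read this Netscape/Mozilla cookie format directly. A minimal sketch of loading such a file (the tool's own loader may differ; in practice you would attach the jar to your HTTP session, e.g. `requests.Session().cookies.update(jar)`):

```python
import http.cookiejar
import os
import tempfile

# A minimal cookies file in Netscape format (the magic header line is
# required by MozillaCookieJar). Fields are tab-separated:
# domain, include-subdomains flag, path, secure, expires, name, value.
NETSCAPE = (
    "# Netscape HTTP Cookie File\n"
    ".substack.com\tTRUE\t/\tTRUE\t2147483647\tsubstack.sid\tabc123\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(NETSCAPE)
    cookie_path = f.name

jar = http.cookiejar.MozillaCookieJar(cookie_path)
jar.load(ignore_discard=True, ignore_expires=True)
cookies = {cookie.name: cookie.value for cookie in jar}
os.unlink(cookie_path)  # clean up the temporary file
```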
- substack_api: Python wrapper for the Substack API - used for fetching post metadata and content
- markdownify: Library for converting HTML to Markdown
- argparse: Standard library for parsing command-line arguments
- requests: HTTP library for Python
- tqdm: Progress bar library for Python
- beautifulsoup4: Library for parsing HTML
- aiohttp: Asynchronous HTTP client/server framework
- python-dotenv: Library for loading environment variables from .env files
The tool includes robust error handling for:
- API connection issues
- Rate limiting
- Invalid author identifiers
- File system errors
- HTML parsing and conversion issues
- Post retrieval errors
- Authentication failures
The tool now supports custom Markdown templates for post conversion. This allows you to customize the format and structure of the generated Markdown files.
You can create custom templates using the `template` command:
python main.py template --create-examples --output-dir templates
This will create example templates in the specified directory:
- `basic.template`: A simple template with title, content, and comments
- `academic.template`: A template formatted for academic citations
- `blog.template`: A template with HTML formatting for blog posts
To use a custom template when downloading posts:
python main.py direct --author big --template-dir templates --template basic
This will use the `basic` template from the `templates` directory for all downloaded posts.
Templates use the Python string.Template format with the following variables:
- `${title}`: Post title
- `${date}`: Publication date
- `${author}`: Author name
- `${url}`: Original post URL
- `${content}`: Post content in Markdown format
- `${comments}`: Post comments in Markdown format (if included)
- `${additional_frontmatter}`: Additional metadata fields
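Since these are standard `string.Template` placeholders, rendering a post looks roughly like this (the template text and post values are made up for illustration; `safe_substitute` leaves any unused placeholder intact rather than raising):

```python
from string import Template

# A toy template using the variables listed above.
template = Template("# ${title}\n\nBy ${author} on ${date}\n\n${content}\n")

rendered = template.safe_substitute(
    title="How to Get Rich",
    author="Matt Stoller",
    date="2023-01-01",
    content="Post body in Markdown...",
)
```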
The tool now supports exporting Markdown files to other formats using the `convert` command:
python main.py convert --input output/big --format html --output-dir converted
Supported formats:

- `html`: Export to HTML format
- `pdf`: Export to PDF format
- `epub`: Export to EPUB format
python main.py convert --input output/big/2023-01-01_post-slug.md --format pdf --title "Custom Title" --author "Custom Author" --css custom.css
Available options:
- `--input`: Input Markdown file or directory
- `--format`: Output format (html, pdf, epub)
- `--output-dir`: Output directory
- `--recursive`: Process directories recursively
- `--title`: Title for the output document
- `--author`: Author name for the output document
- `--css`: Path to CSS file for styling HTML and PDF output
- `--cover-image`: Path to cover image for EPUB output
- `--check-deps`: Check for required dependencies
The format conversion feature requires the following external dependencies:
- Pandoc: For converting Markdown to other formats
- wkhtmltopdf: For PDF generation
You can check if these dependencies are installed:
python main.py convert --check-deps --input dummy --format html
The tool supports using Oxylabs proxy service to route requests through different IP addresses. This can help avoid rate limiting and access geo-restricted content.
To use a proxy with the direct downloader:
python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password
You can configure various aspects of the proxy:
```bash
# Using a specific country
python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password --proxy-country US

# Using a specific city
python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password --proxy-country GB --proxy-city london

# Using a session ID to maintain the same IP
python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password --proxy-session-id abc12345

# Setting a session time
python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password --proxy-session-id abc12345 --proxy-session-time 10
```
You can also configure the proxy using environment variables in your `.env` file:
```
OXYLABS_USERNAME=your-username
OXYLABS_PASSWORD=your-password
OXYLABS_COUNTRY=US
OXYLABS_CITY=new_york
OXYLABS_STATE=us_new_york
OXYLABS_SESSION_ID=random-session-id
OXYLABS_SESSION_TIME=10
```
Then use the proxy without specifying credentials on the command line:
python main.py direct --author big --use-proxy
You can also configure proxies in your batch configuration file:
```json
{
  "authors": [
    {
      "identifier": "mattstoller",
      "use_proxy": true,
      "proxy_country": "US"
    }
  ],
  "global_settings": {
    "use_proxy": true,
    "proxy_username": "your-username",
    "proxy_password": "your-password"
  }
}
```
The direct downloader script (`substack_direct_downloader.py`) includes several performance optimizations:
- Sitemap-based Discovery: Uses sitemap.xml to efficiently find all posts
- Async/Concurrent Requests: Uses asyncio/aiohttp for non-blocking concurrent downloads
- Caching: Implements caching layer to speed up repeated fetches and reduce API load
- Adaptive Throttling: Dynamically adjusts delays based on response times and rate limits
- Parallel Image Processing: Downloads images concurrently with configurable limits
- Connection Pooling: Reuses connections for better performance
- Incremental Sync: Only downloads new or updated content
- Database Optimizations: Uses bulk operations and indexing for faster metadata retrieval
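Adaptive throttling, for example, boils down to a feedback loop: back off after rate-limited or slow responses, speed up after fast ones, clamped to the configured `--min-delay`/`--max-delay` range. A toy sketch of that idea (not the downloader's actual code; the 2-second "slow" threshold is an arbitrary illustration):

```python
class AdaptiveThrottle:
    """Adjust inter-request delay based on observed responses."""

    def __init__(self, min_delay=0.5, max_delay=5.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.delay = min_delay

    def record(self, status, elapsed):
        """Update and return the delay after one response.

        Double the delay on HTTP 429 or slow responses; otherwise
        decay it gently back toward min_delay.
        """
        if status == 429 or elapsed > 2.0:
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            self.delay = max(self.delay * 0.9, self.min_delay)
        return self.delay

throttle = AdaptiveThrottle()
```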
We welcome contributions to the Substack to Markdown CLI! Please see CONTRIBUTING.md for details on how to contribute.
If you discover a security vulnerability, please follow our security policy.
This project is licensed under the MIT License - see the LICENSE file for details.
See the CHANGELOG.md file for details on changes between versions.