Substack to Markdown CLI

A command-line tool that fetches posts from Substack using the Substack API wrapper and converts them to Markdown format.

Overview

This tool allows you to easily download and convert Substack posts to Markdown format, making it useful for:

Creating local backups of your Substack content
Migrating content to other platforms
Analyzing or processing Substack content offline
Archiving posts in a portable, plain-text format
Accessing and converting private/subscriber-only content

Installation

Prerequisites

Python 3.7 or higher
pip (Python package installer)

Install from source

# Clone the repository
git clone https://github.com/nelsojona/substack.git
cd substack

# Install dependencies
pip install -r requirements.txt

# Create .env file from example
cp .env.example .env
# Edit .env file with your credentials and configuration

Environment Variables

The tool supports loading configuration from environment variables using a .env file. This is especially useful for storing sensitive information like authentication credentials and proxy configuration.

To use environment variables:

Copy the .env.example file to .env:
```
cp .env.example .env
```

Edit the .env file with your credentials and configuration:

# Substack Authentication
SUBSTACK_EMAIL=your-email@example.com
SUBSTACK_PASSWORD=your-password
SUBSTACK_TOKEN=your-auth-token

# Oxylabs Proxy Configuration
OXYLABS_USERNAME=your-username
OXYLABS_PASSWORD=your-password
OXYLABS_COUNTRY=US
OXYLABS_CITY=new_york
OXYLABS_STATE=us_new_york
OXYLABS_SESSION_ID=random-session-id
OXYLABS_SESSION_TIME=10

# General Configuration
DEFAULT_OUTPUT_DIR=./markdown_output
DEFAULT_IMAGE_DIR=./images
DEFAULT_MAX_IMAGE_WORKERS=4
DEFAULT_IMAGE_TIMEOUT=10

The tool will automatically load these environment variables when run.

Note: Command-line arguments take precedence over environment variables.
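
The precedence rule above can be illustrated with a minimal sketch. This is not the tool's actual implementation: the helper name `resolve_output_dir` and the final fallback value `"output"` are assumptions made for illustration, and the real tool would typically populate the environment from `.env` (for example via python-dotenv) before this step.

```python
import argparse
import os


def resolve_output_dir(argv, environ=None):
    """Return the output directory, preferring --output over DEFAULT_OUTPUT_DIR."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # A default of None lets us tell "flag not given" apart from an explicit value.
    parser.add_argument("--output", default=None)
    args, _unknown = parser.parse_known_args(argv)
    if args.output is not None:
        return args.output  # the command line wins
    # Fall back to the environment, then to a hard-coded default (assumed here).
    return environ.get("DEFAULT_OUTPUT_DIR", "output")


env = {"DEFAULT_OUTPUT_DIR": "./markdown_output"}
print(resolve_output_dir(["--output", "./my_posts"], env))
print(resolve_output_dir([], env))
```

The first call resolves to the command-line value and the second to the environment value.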

CLI Instructions

The Substack to Markdown tool provides a comprehensive command-line interface (CLI) for converting Substack posts to Markdown format. Here's how to use it effectively:

Basic Usage

The tool provides a unified interface through the main.py script:

python main.py [command] [options]

Available commands:

1. Direct Downloader (Recommended)

The direct downloader offers better performance and uses sitemap.xml for more reliable post discovery:

python main.py direct --author <author_identifier>

2. Batch Processing

Process multiple authors in parallel using a configuration file:

python main.py batch --config <config_file_path>

3. Optimized CLI

python main.py optimized download --author <author_identifier>

4. Classic Interface

python main.py classic --author <author_identifier>

Here, <author_identifier> is the Substack author's username or subdomain (for example, "big" for big.substack.com, Matt Stoller's BIG newsletter).

Command Structure

The general command structure for the direct downloader:

python main.py direct --author <author> [options]

For batch processing:

python main.py batch --config <config_file> [options]

Common Usage Patterns

The examples below use the recommended direct downloader command.

1. Basic Fetching

Fetch all posts from a specific author and save them to the default output directory:

python main.py direct --author big

2. Specifying Output Location

Save posts to a specific directory:

python main.py direct --author big --output ./my_posts

3. Limiting Post Count

Fetch only the 5 most recent posts:

python main.py direct --author big --max-posts 5

4. Detailed Output

Enable verbose mode to see detailed progress information:

python main.py direct --author big --verbose

5. Single Post Processing

Process a specific post by its URL:

python main.py direct --author big --url https://big.substack.com/p/how-to-get-rich-sabotaging-nuclear

6. Using Sitemap for Efficient Post Discovery

By default, the direct downloader uses sitemap.xml for efficient post discovery. You can disable this feature if needed:

python main.py direct --author big --no-sitemap

7. Controlling Concurrency

Adjust the number of concurrent downloads for better performance:

python main.py direct --author big --max-concurrency 10 --max-image-concurrency 20

8. Force Refresh

Re-download posts that have already been downloaded:

python main.py direct --author big --force

9. Filtering Posts by Date Range

Download only posts published within a given date range (dates use the YYYY-MM-DD format):

python main.py direct --author big --start-date 2023-01-01 --end-date 2023-12-31

This will only download posts published between January 1, 2023 and December 31, 2023.
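
As a rough sketch of how such a filter might work, the snippet below keeps posts whose publication date falls inside the range. It assumes both bounds are inclusive, which matches the example above but is not confirmed by the source; the `posts` records and the `in_date_range` helper are hypothetical.

```python
from datetime import date


def in_date_range(published, start=None, end=None):
    """Return True if `published` falls within [start, end], both inclusive."""
    if start is not None and published < start:
        return False
    if end is not None and published > end:
        return False
    return True


# Hypothetical post records; the real tool's data model may differ.
posts = [
    {"slug": "old-post", "date": date(2022, 12, 31)},
    {"slug": "new-year", "date": date(2023, 1, 1)},
    {"slug": "year-end", "date": date(2023, 12, 31)},
    {"slug": "next-year", "date": date(2024, 1, 1)},
]

start = date.fromisoformat("2023-01-01")
end = date.fromisoformat("2023-12-31")
selected = [p["slug"] for p in posts if in_date_range(p["date"], start, end)]
print(selected)
```

Leaving either bound as `None` makes the range open-ended on that side, which mirrors passing only one of the two date options.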

Handling Images

By default, the direct downloader saves images locally. You can disable this:

python main.py direct --author big --no-images

Control the number of concurrent image downloads:

python main.py direct --author big --max-image-concurrency 15

Including Comments

To include post comments in the Markdown output:

python main.py direct --author big --include-comments

This will add a "Comments" section at the end of each post with all comments and replies properly formatted.
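
To give a feel for what threaded comment output could look like, here is a small sketch that renders a nested comment tree as an indented Markdown list. The data shape (`author`, `body`, `replies`) and the `render_comments` helper are assumptions for illustration; the tool's actual comment format may differ.

```python
def render_comments(comments, depth=0):
    """Render a nested comment tree as lines of an indented Markdown list."""
    lines = []
    for comment in comments:
        indent = "  " * depth  # two spaces per nesting level
        lines.append(f"{indent}- **{comment['author']}**: {comment['body']}")
        # Replies are rendered one level deeper, directly under their parent.
        lines.extend(render_comments(comment.get("replies", []), depth + 1))
    return lines


# Hypothetical comment thread.
thread = [
    {
        "author": "alice",
        "body": "First",
        "replies": [{"author": "bob", "body": "Reply"}],
    },
    {"author": "carol", "body": "Second"},
]

lines = render_comments(thread)
print("## Comments\n\n" + "\n".join(lines))
```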

Accessing Private Content

To access private/subscriber-only content with the direct downloader, you can use an authentication token:

python main.py direct --author big --token your-auth-token --url https://big.substack.com/p/private-post-slug

To obtain a Substack authentication token, you can use the provided script:

python scripts/get_substack_token.py

This will guide you through the process of obtaining an authentication token from your browser session.

Command-line Arguments

For substack_to_md.py

Argument	Description	Required	Default
`--author`	Substack author identifier (username or subdomain)	Yes	-
`--output`	Output directory for Markdown files	No	Current directory
`--limit`	Maximum number of posts to fetch	No	All posts
`--verbose`	Enable verbose output	No	False
`--use-post-objects`	Use enhanced mode with direct Post object methods	No	False
`--url`	Process a single post by URL	No	-
`--slug`	Process a single post by slug	No	-
`--async-mode`	Use async/aiohttp for downloading	No	False
`--processes`	Number of processes to use for multiprocessing	No	2
`--min-delay`	Minimum delay between requests in seconds	No	0.5
`--max-delay`	Maximum delay between requests in seconds	No	5.0
`--incremental`	Only download new or updated content	No	False

For substack_direct_downloader.py (Recommended)

Argument	Description	Required	Default
`--author`	Substack author identifier	No	"tradecompanion"
`--output`	Output directory	No	"output"
`--max-pages`	Maximum number of archive pages to scan	No	Scan all pages
`--max-posts`	Maximum number of posts to download	No	All posts
`--force`	Force refresh of already downloaded posts	No	False
`--verbose`	Enable verbose output	No	False
`--url`	Download a specific URL instead of scanning archive	No	-
`--no-images`	Skip downloading images	No	False
`--min-delay`	Minimum delay between requests in seconds	No	0.5
`--max-delay`	Maximum delay between requests in seconds	No	5.0
`--max-concurrency`	Maximum concurrent requests	No	5
`--max-image-concurrency`	Maximum concurrent image downloads	No	10
`--token`	Substack authentication token for private content	No	-
`--incremental`	Only download new or updated content	No	False
`--async-mode`	Use async/aiohttp for downloading	No	True
`--clear-cache`	Clear cache before starting	No	False
`--use-sitemap`	Use sitemap.xml for post discovery	No	True
`--no-sitemap`	Skip using sitemap.xml for post discovery	No	False
`--include-comments`	Include post comments in the output	No	False
`--start-date`	Start date for filtering posts (YYYY-MM-DD)	No	-
`--end-date`	End date for filtering posts (YYYY-MM-DD)	No	-

Authentication Arguments

Argument	Description	Required	Default
`--email`	Substack account email	No	-
`--password`	Substack account password	No	-
`--token`	Substack authentication token	No	-
`--cookies-file`	Path to a file containing cookies	No	-
`--save-cookies`	Save cookies to a file after authentication	No	-
`--private`	Indicate that the post is private and requires authentication	No	False

Proxy Arguments

Argument	Description	Required	Default
`--use-proxy`	Use Oxylabs proxy for requests	No	False
`--proxy-username`	Oxylabs username	No	-
`--proxy-password`	Oxylabs password	No	-
`--proxy-country`	Country code for proxy (e.g., US, GB, DE)	No	-
`--proxy-city`	City name for proxy (e.g., london, new_york)	No	-
`--proxy-state`	US state for proxy (e.g., us_california, us_new_york)	No	-
`--proxy-session-id`	Session ID to maintain the same IP across requests	No	-
`--proxy-session-time`	Session time in minutes (max 30)	No	-

Image Downloading Arguments

Argument	Description	Required	Default
`--download-images`	Download and embed images in the Markdown	No	False
`--image-dir`	Directory to save downloaded images	No	`images`
`--image-base-url`	Base URL for image references in Markdown	No	Relative paths
`--max-image-workers`	Maximum number of concurrent image downloads	No	4
`--image-timeout`	Timeout for image download requests in seconds	No	10

Batch Processing

The batch processing feature allows you to download posts from multiple Substack authors in parallel. This is especially useful for backing up or migrating content from multiple newsletters.

Creating a Batch Configuration File

You can create an example configuration file with:

python main.py batch --config authors.json --create-example

This will generate a JSON file with the following structure:

{
  "authors": [
    {
      "identifier": "mattstoller",
      "output_dir": "output/mattstoller",
      "max_posts": 10,
      "include_comments": true,
      "no_images": false,
      "incremental": true,
      "verbose": true
    },
    {
      "identifier": "tradecompanion",
      "max_posts": 5,
      "include_comments": false,
      "token": "your-auth-token-here"
    },
    {
      "identifier": "another-author",
      "max_concurrency": 10,
      "max_image_concurrency": 20
    }
  ],
  "global_settings": {
    "min_delay": 1.0,
    "max_delay": 5.0,
    "max_concurrency": 5,
    "max_image_concurrency": 10
  }
}

You can also use YAML format by specifying a .yaml or .yml extension:

python main.py batch --config authors.yaml --create-example

Running Batch Processing

To process all authors in the configuration file:

python main.py batch --config authors.json

You can specify the output directory and the number of parallel processes:

python main.py batch --config authors.json --output custom_output --processes 4

Configuration Options

Each author in the configuration can have the following options:

identifier (required): The Substack author identifier
output_dir: Custom output directory for this author
max_posts: Maximum number of posts to download
include_comments: Whether to include comments in the output
no_images: Skip downloading images
token: Authentication token for private content
incremental: Only download new or updated content
force: Force refresh of already downloaded posts
verbose: Enable verbose output
min_delay: Minimum delay between requests
max_delay: Maximum delay between requests
max_concurrency: Maximum concurrent requests
max_image_concurrency: Maximum concurrent image downloads
no_sitemap: Skip using sitemap.xml for post discovery
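
The sketch below shows one plausible way per-author options and global_settings could be combined. It assumes that a value set on an author entry overrides the same key in global_settings; the source does not state this rule, and the `effective_settings` helper is hypothetical.

```python
import json

CONFIG_TEXT = """
{
  "authors": [
    {"identifier": "mattstoller", "max_posts": 10},
    {"identifier": "another-author", "max_concurrency": 10}
  ],
  "global_settings": {"min_delay": 1.0, "max_delay": 5.0, "max_concurrency": 5}
}
"""


def effective_settings(config):
    """Yield one merged settings dict per author (assumed: author keys override globals)."""
    global_settings = config.get("global_settings", {})
    for author in config.get("authors", []):
        if "identifier" not in author:
            raise ValueError("each author entry needs an 'identifier'")
        # Later keys win in a dict merge, so per-author values take priority.
        yield {**global_settings, **author}


merged = list(effective_settings(json.loads(CONFIG_TEXT)))
for entry in merged:
    print(entry["identifier"], entry["max_concurrency"])
```

Under this assumption the first author inherits the global concurrency limit, while the second keeps its own value.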

Features

Fetches all posts for a specified Substack author
Filters posts by date range
Converts HTML content to Markdown format
Preserves formatting, links, and images
Handles pagination to retrieve all available posts
Provides progress reporting and error handling
Saves each post as a separate Markdown file
Supports direct integration with the Substack API wrapper
Allows processing individual posts by URL or slug
Includes caching for improved performance
Supports authenticated access to private/subscriber-only content
Downloads and embeds images locally for offline viewing
Extracts and includes post comments with proper threading
Extracts newsletter metadata for better organization
Utilizes optimized performance with async, multiprocessing, and adaptive throttling
Offers incremental sync to efficiently update content
Implements robust error handling and recovery mechanisms
Supports batch processing of multiple authors in parallel
Provides proxy support for avoiding rate limits and accessing geo-restricted content

Enhanced Mode

The tool supports an enhanced mode that uses direct Post object methods from the Substack API wrapper. This mode provides several advantages:

More direct integration with the Substack API
Better error handling and retries
Support for processing individual posts by URL or slug
Caching of Post objects for improved performance

To use enhanced mode, add the --use-post-objects flag to your command:

python substack_to_md.py --author mattstoller --use-post-objects

Authentication Methods

The tool supports several methods for authenticating with Substack to access private content:

Email and Password

You can provide your Substack account email and password directly:

python substack_to_md.py --author mattstoller --email your-email@example.com --password your-password --private

Note: This method may not always work due to Substack's authentication flow, which may include CSRF tokens, captchas, or other security measures.

Authentication Token

If you have a Substack authentication token, you can use it directly:

python substack_to_md.py --author mattstoller --token your-auth-token --private

Cookies File

You can use a cookies file exported from your browser:

python substack_to_md.py --author mattstoller --cookies-file path/to/cookies.txt --private

The cookies file should be in the Mozilla/Netscape format, which can be exported using browser extensions like "Cookie Quick Manager" for Firefox or "EditThisCookie" for Chrome.
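
For reference, a file in this format can be read with Python's standard `http.cookiejar.MozillaCookieJar`. The sketch below writes a one-cookie sample file and loads it back. The cookie name `example_session` and its value are placeholders, not Substack's real cookie, and whether the tool itself uses this class is an assumption.

```python
import http.cookiejar
import os
import tempfile

# A one-cookie file in Netscape format. Fields are tab-separated:
# domain, include-subdomains flag, path, secure flag, expiry (Unix time), name, value.
# The cookie name "example_session" is a placeholder, not Substack's real cookie name.
SAMPLE = (
    "# Netscape HTTP Cookie File\n"
    ".substack.com\tTRUE\t/\tTRUE\t4102444800\texample_session\tplaceholder-value\n"
)


def load_cookies(path):
    """Load a Mozilla/Netscape cookies.txt file into a cookie jar."""
    jar = http.cookiejar.MozillaCookieJar(path)
    # Keep session cookies and expired cookies so nothing is silently dropped.
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar


with tempfile.TemporaryDirectory() as tmp:
    cookie_path = os.path.join(tmp, "cookies.txt")
    with open(cookie_path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    jar = load_cookies(cookie_path)
    names = [cookie.name for cookie in jar]

print(names)
```

If loading fails with a `LoadError`, the usual causes are a missing header line or fields separated by spaces instead of tabs.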

Saving Cookies

After authenticating, you can save the cookies for future use:

python substack_to_md.py --author mattstoller --email your-email@example.com --password your-password --save-cookies path/to/cookies.txt --private

This will save the cookies to the specified file, which you can then use for future authentication.

Dependencies

substack_api: Python wrapper for the Substack API - used for fetching post metadata and content
markdownify: Library for converting HTML to Markdown
argparse: Standard library for parsing command-line arguments
requests: HTTP library for Python
tqdm: Progress bar library for Python
beautifulsoup4: Library for parsing HTML
aiohttp: Asynchronous HTTP client/server framework
python-dotenv: Library for loading environment variables from .env files

Error Handling

The tool includes robust error handling for:

API connection issues
Rate limiting
Invalid author identifiers
File system errors
HTML parsing and conversion issues
Post retrieval errors
Authentication failures

Custom Markdown Templates

The tool supports custom Markdown templates for post conversion. This allows you to customize the format and structure of the generated Markdown files.

Creating Templates

You can create custom templates using the template command:

python main.py template --create-examples --output-dir templates

This will create example templates in the specified directory:

basic.template: A simple template with title, content, and comments
academic.template: A template formatted for academic citations
blog.template: A template with HTML formatting for blog posts

Using Templates

To use a custom template when downloading posts:

python main.py direct --author big --template-dir templates --template basic

This will use the basic template from the templates directory for all downloaded posts.

Template Format

Templates use the Python string.Template format with the following variables:

${title}: Post title
${date}: Publication date
${author}: Author name
${url}: Original post URL
${content}: Post content in Markdown format
${comments}: Post comments in Markdown format (if included)
${additional_frontmatter}: Additional metadata fields
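
To show how these variables are substituted, here is a minimal sketch using Python's standard `string.Template`. The template body and field values are made up for illustration; the generated example templates may look different.

```python
from string import Template

# A hypothetical template body; the generated example templates may differ.
TEMPLATE_TEXT = """\
---
title: "${title}"
date: ${date}
author: ${author}
source: ${url}
${additional_frontmatter}
---

# ${title}

${content}

${comments}
"""

fields = {
    "title": "Example Post",
    "date": "2023-01-01",
    "author": "Example Author",
    "url": "https://example.substack.com/p/example-post",
    "content": "Body text in Markdown.",
    "additional_frontmatter": "tags: [example]",
}

# safe_substitute leaves unknown placeholders (here ${comments}) untouched
# instead of raising KeyError, which helps when optional fields are missing.
rendered = Template(TEMPLATE_TEXT).safe_substitute(fields)
print(rendered)
```

Note that `string.Template` treats `$` as special, so a literal dollar sign in a template must be written as `$$`.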

Export to Other Formats

The tool supports exporting Markdown files to other formats using the convert command:

python main.py convert --input output/big --format html --output-dir converted

Supported Formats

html: Export to HTML format
pdf: Export to PDF format
epub: Export to EPUB format

Conversion Options

python main.py convert --input output/big/2023-01-01_post-slug.md --format pdf --title "Custom Title" --author "Custom Author" --css custom.css

Available options:

--input: Input Markdown file or directory
--format: Output format (html, pdf, epub)
--output-dir: Output directory
--recursive: Process directories recursively
--title: Title for the output document
--author: Author name for the output document
--css: Path to CSS file for styling HTML and PDF output
--cover-image: Path to cover image for EPUB output
--check-deps: Check for required dependencies

Dependencies

The format conversion feature requires the following external dependencies:

Pandoc: For converting Markdown to other formats
wkhtmltopdf: For PDF generation

You can check if these dependencies are installed:

python main.py convert --check-deps --input dummy --format html
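
If you want to check for these programs independently of the tool, a sketch like the following looks them up on your PATH with the standard `shutil.which`. The executable names `pandoc` and `wkhtmltopdf` are the usual command names for these programs; the `missing_tools` helper is hypothetical, and how `--check-deps` performs its own check is not documented here.

```python
import shutil


def missing_tools(tools=("pandoc", "wkhtmltopdf")):
    """Return the names of executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All conversion dependencies found.")
```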

Using Proxies

The tool supports using Oxylabs proxy service to route requests through different IP addresses. This can help avoid rate limiting and access geo-restricted content.

Basic Proxy Usage

To use a proxy with the direct downloader:

python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password

Proxy Configuration Options

You can configure various aspects of the proxy:

# Using a specific country
python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password --proxy-country US

# Using a specific city
python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password --proxy-country GB --proxy-city london

# Using a session ID to maintain the same IP
python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password --proxy-session-id abc12345

# Setting a session time
python main.py direct --author big --use-proxy --proxy-username your-username --proxy-password your-password --proxy-session-id abc12345 --proxy-session-time 10

Environment Variables for Proxy

You can also configure the proxy using environment variables in your .env file:

OXYLABS_USERNAME=your-username
OXYLABS_PASSWORD=your-password
OXYLABS_COUNTRY=US
OXYLABS_CITY=new_york
OXYLABS_STATE=us_new_york
OXYLABS_SESSION_ID=random-session-id
OXYLABS_SESSION_TIME=10

Then use the proxy without specifying credentials on the command line:

python main.py direct --author big --use-proxy

Batch Processing with Proxies

You can also configure proxies in your batch configuration file:

{
  "authors": [
    {
      "identifier": "mattstoller",
      "use_proxy": true,
      "proxy_country": "US"
    }
  ],
  "global_settings": {
    "use_proxy": true,
    "proxy_username": "your-username",
    "proxy_password": "your-password"
  }
}

Performance Optimizations

The direct downloader script (substack_direct_downloader.py) includes several performance optimizations:

Sitemap-based Discovery: Uses sitemap.xml to efficiently find all posts
Async/Concurrent Requests: Uses asyncio/aiohttp for non-blocking concurrent downloads
Caching: Implements caching layer to speed up repeated fetches and reduce API load
Adaptive Throttling: Dynamically adjusts delays based on response times and rate limits
Parallel Image Processing: Downloads images concurrently with configurable limits
Connection Pooling: Reuses connections for better performance
Incremental Sync: Only downloads new or updated content
Database Optimizations: Uses bulk operations and indexing for faster metadata retrieval
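
As an illustration of the concurrency limiting idea behind options such as `--max-concurrency`, the sketch below caps the number of simultaneous tasks with an `asyncio.Semaphore`. It simulates network requests with `asyncio.sleep` rather than using aiohttp, and the `fetch` and `download_all` helpers are hypothetical, so treat it as a sketch of the general technique and not as the downloader's actual code.

```python
import asyncio


async def fetch(url, semaphore, state):
    """Simulate one download while tracking how many run at once."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for a network request
        state["active"] -= 1
        return url


async def download_all(urls, max_concurrency=5):
    """Run all downloads, never more than max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fetch(u, semaphore, state) for u in urls))
    return results, state["peak"]


urls = [f"https://example.substack.com/p/post-{i}" for i in range(20)]
results, peak = asyncio.run(download_all(urls, max_concurrency=5))
print(len(results), peak)
```

`asyncio.gather` returns results in the same order as the input, so the output list lines up with `urls` even though tasks finish at different times.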

Contributing

We welcome contributions to the Substack to Markdown CLI! Please see CONTRIBUTING.md for details on how to contribute.

Security

If you discover a security vulnerability, please follow our security policy.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

See the CHANGELOG.md file for details on changes between versions.
