Cloud Storage Integration Guide
This guide covers integrating EdgarTools with cloud storage providers (AWS S3, Google Cloud Storage, Azure Blob Storage) and S3-compatible services (Cloudflare R2, MinIO, DigitalOcean Spaces).
Why Cloud Storage?
Cloud storage provides several advantages over local storage:
| Benefit | Description |
|---|---|
| Scalability | Store terabytes of SEC data without local disk constraints |
| Team Sharing | Multiple users/services access the same dataset |
| Durability | Cloud providers offer 99.999999999% durability |
| Cost Efficiency | Pay only for storage used; cheaper than provisioning servers |
| Global Access | Access data from anywhere, any environment |
Integration Approaches
EdgarTools supports cloud storage through three mechanisms:
- use_cloud_storage() - Native cloud integration via fsspec for reading and writing (recommended)
- EDGAR_DATA_URL - Point to any HTTP endpoint for reading data
- EDGAR_LOCAL_DATA_DIR + FUSE - Mount cloud storage as a local path (legacy)
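At a glance, each mechanism is enabled as follows. This is a minimal sketch using placeholder bucket names and mount paths; in practice you pick one mechanism, and each is covered in detail in the sections below.
import os
import edgar
# 1. Native integration (recommended): read and write through fsspec
edgar.use_cloud_storage('s3://my-edgar-bucket/')
# 2. Read-only access over HTTP
os.environ['EDGAR_DATA_URL'] = 'https://your-bucket.s3.amazonaws.com/edgar-data/'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'
# 3. FUSE mount (legacy): mount the bucket first (e.g. goofys my-edgar-bucket /mnt/edgar-data),
#    then point EdgarTools at the mount path
os.environ['EDGAR_LOCAL_DATA_DIR'] = '/mnt/edgar-data'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'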
Approach Comparison
| Feature | Native (use_cloud_storage) | EDGAR_DATA_URL | FUSE Mount |
|---|---|---|---|
| Setup Complexity | Simple | Simple | Complex |
| Read Data | Yes | Yes | Yes |
| Write Data | Yes | No | Yes |
| Requires Mount | No | No | Yes |
| Platform Support | All | All | Linux/macOS |
| Best For | Full cloud integration | Read-only HTTP | Legacy systems |
Approach 1: EDGAR_DATA_URL (Read-Only)
This is the simplest approach for read-only access: point EdgarTools to an HTTP endpoint that serves your SEC data.
How It Works
import os
os.environ['EDGAR_DATA_URL'] = 'https://your-bucket.s3.amazonaws.com/edgar-data/'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'
from edgar import Company
company = Company("AAPL") # Fetches from your S3 bucket
Setting Up S3 Static Website Hosting
Step 1: Create and Configure S3 Bucket
# Create bucket
aws s3 mb s3://my-edgar-data --region us-east-1
# Enable static website hosting
aws s3 website s3://my-edgar-data \
--index-document index.html \
--error-document error.html
Step 2: Set Bucket Policy for Public Read
Create bucket-policy.json:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "PublicReadGetObject",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-edgar-data/*"
}
]
}
Apply the policy:
aws s3api put-bucket-policy \
--bucket my-edgar-data \
--policy file://bucket-policy.json
Step 3: Upload Your Data
# Sync local edgar data to S3
aws s3 sync ~/.edgar s3://my-edgar-data/ --storage-class STANDARD_IA
Step 4: Configure EdgarTools
import os
# S3 static website URL format
os.environ['EDGAR_DATA_URL'] = 'http://my-edgar-data.s3-website-us-east-1.amazonaws.com/'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'
from edgar import Company
company = Company("AAPL") # Now reads from S3
Google Cloud Storage Setup
# Create bucket with uniform access
gsutil mb -l us-central1 gs://my-edgar-data
# Make bucket publicly readable
gsutil iam ch allUsers:objectViewer gs://my-edgar-data
# Upload data
gsutil -m rsync -r ~/.edgar gs://my-edgar-data/
Configure EdgarTools:
import os
os.environ['EDGAR_DATA_URL'] = 'https://storage.googleapis.com/my-edgar-data/'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'
Azure Blob Storage Setup
# Create storage account and container
az storage account create --name myedgardata --resource-group mygroup
az storage container create --name edgar --account-name myedgardata --public-access blob
# Upload data
az storage blob upload-batch \
--account-name myedgardata \
--destination edgar \
--source ~/.edgar
Configure EdgarTools:
import os
os.environ['EDGAR_DATA_URL'] = 'https://myedgardata.blob.core.windows.net/edgar/'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'
Adding CloudFront CDN (Recommended for Production)
For better performance and reduced S3 costs, add CloudFront:
# Create CloudFront distribution pointing to S3
aws cloudfront create-distribution \
--origin-domain-name my-edgar-data.s3.amazonaws.com \
--default-root-object index.html
Then use your CloudFront URL:
os.environ['EDGAR_DATA_URL'] = 'https://d1234567890.cloudfront.net/'
Approach 2: FUSE Mount (Read/Write)
For full read/write access, mount cloud storage as a local filesystem using FUSE (Filesystem in Userspace).
FUSE Tool Comparison
| Tool | Provider | Performance | Caching | Notes |
|---|---|---|---|---|
| s3fs-fuse | AWS S3 | Moderate | Basic | Most compatible |
| goofys | AWS S3 | Fast | Aggressive | Performance-focused |
| rclone mount | All providers | Good | Configurable | Most versatile |
| gcsfuse | Google Cloud | Good | Metadata | Official GCS tool |
| blobfuse2 | Azure | Good | File cache | Official Azure tool |
s3fs-fuse Setup (AWS S3)
Installation
# Ubuntu/Debian
sudo apt-get install s3fs
# macOS
brew install s3fs
# From source
git clone https://github.com/s3fs-fuse/s3fs-fuse.git
cd s3fs-fuse && ./autogen.sh && ./configure && make && sudo make install
Configuration
Create credentials file:
echo "ACCESS_KEY_ID:SECRET_ACCESS_KEY" > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs
Mount the Bucket
# Create mount point
mkdir -p /mnt/edgar-data
# Mount with caching for better performance
s3fs my-edgar-bucket /mnt/edgar-data \
-o passwd_file=~/.passwd-s3fs \
-o url=https://s3.amazonaws.com \
-o use_cache=/tmp/s3fs-cache \
-o ensure_diskfree=1024 \
-o parallel_count=15
Configure EdgarTools
import os
os.environ['EDGAR_LOCAL_DATA_DIR'] = '/mnt/edgar-data'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'
from edgar import download_filings
download_filings("2025-01-15") # Writes directly to S3!
goofys Setup (High Performance S3)
goofys offers better performance than s3fs at the cost of some POSIX compliance.
Installation
# Download binary
wget https://github.com/kahing/goofys/releases/latest/download/goofys
chmod +x goofys
sudo mv goofys /usr/local/bin/
Mount
# Uses standard AWS credentials (~/.aws/credentials)
goofys my-edgar-bucket /mnt/edgar-data
# With specific profile
goofys --profile production my-edgar-bucket /mnt/edgar-data
# With caching
goofys --stat-cache-ttl 1h --type-cache-ttl 1h my-edgar-bucket /mnt/edgar-data
rclone mount (Multi-Provider)
rclone supports 40+ cloud storage providers with a unified interface.
Installation
# Linux
curl https://rclone.org/install.sh | sudo bash
# macOS
brew install rclone
Configure Provider
# Interactive configuration
rclone config
# Example: Configure S3
# Name: edgar-s3
# Type: s3
# Provider: AWS
# Access key: (your key)
# Secret key: (your secret)
# Region: us-east-1
Mount
# Basic mount
rclone mount edgar-s3:my-edgar-bucket /mnt/edgar-data
# With VFS caching (recommended)
rclone mount edgar-s3:my-edgar-bucket /mnt/edgar-data \
--vfs-cache-mode full \
--vfs-cache-max-age 24h \
--vfs-read-ahead 128M \
--buffer-size 128M \
--daemon
gcsfuse Setup (Google Cloud)
# Installation
export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s)
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install gcsfuse
# Mount
gcsfuse --implicit-dirs my-edgar-bucket /mnt/edgar-data
blobfuse2 Setup (Azure)
# Installation
wget https://packages.microsoft.com/config/ubuntu/22.04/packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
sudo apt-get update && sudo apt-get install blobfuse2
# Create config file
cat > ~/blobfuse2.yaml << EOF
allow-other: true
logging:
type: syslog
level: log_warning
components:
- libfuse
- file_cache
- attr_cache
- azstorage
libfuse:
attribute-expiration-sec: 120
entry-expiration-sec: 120
file_cache:
path: /tmp/blobfuse2
timeout-sec: 120
max-size-mb: 4096
azstorage:
type: block
account-name: myedgardata
account-key: YOUR_ACCOUNT_KEY
container: edgar
EOF
# Mount
blobfuse2 mount /mnt/edgar-data --config-file=~/blobfuse2.yaml
Systemd Service (Auto-Mount on Boot)
Create /etc/systemd/system/edgar-s3.service:
[Unit]
Description=Mount S3 Edgar Data
After=network-online.target
[Service]
Type=forking
User=edgar
ExecStart=/usr/local/bin/goofys -o allow_other my-edgar-bucket /mnt/edgar-data
ExecStop=/bin/fusermount -u /mnt/edgar-data
Restart=on-failure
[Install]
WantedBy=multi-user.target
Enable:
sudo systemctl enable edgar-s3
sudo systemctl start edgar-s3
S3-Compatible Services
Cloudflare R2
R2 offers S3-compatible storage with zero egress fees.
Key Configuration
R2 uses an account-specific endpoint URL, and S3 API clients must also set the region to 'auto' (as shown in the native example later in this guide).
# s3fs with R2
echo "R2_ACCESS_KEY:R2_SECRET_KEY" > ~/.passwd-r2
s3fs my-bucket /mnt/edgar-data \
-o passwd_file=~/.passwd-r2 \
-o url=https://ACCOUNT_ID.r2.cloudflarestorage.com \
-o use_path_request_style
rclone Configuration for R2
rclone config
# Name: edgar-r2
# Type: s3
# Provider: Cloudflare
# access_key_id: (R2 access key)
# secret_access_key: (R2 secret key)
# endpoint: https://ACCOUNT_ID.r2.cloudflarestorage.com
# acl: private
Mount:
rclone mount edgar-r2:my-edgar-bucket /mnt/edgar-data \
--vfs-cache-mode full
EDGAR_DATA_URL with R2
For read-only access via R2's public URL:
import os
# Enable public access on your R2 bucket first
os.environ['EDGAR_DATA_URL'] = 'https://pub-xxxxx.r2.dev/'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'
MinIO
MinIO is perfect for on-premises or private cloud deployments.
# s3fs with MinIO
s3fs my-bucket /mnt/edgar-data \
-o passwd_file=~/.passwd-minio \
-o url=https://minio.example.com \
-o use_path_request_style
# rclone config
# Provider: Minio
# Endpoint: https://minio.example.com
DigitalOcean Spaces
# rclone config
# Provider: DigitalOcean
# Endpoint: nyc3.digitaloceanspaces.com
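Spaces also works with the native use_cloud_storage() integration described later in this guide. The sketch below follows the same client_kwargs pattern as the R2 and MinIO examples; the Space name, keys, and regional endpoint are placeholders.
import edgar
edgar.use_cloud_storage(
    's3://my-edgar-space/',
    client_kwargs={
        'endpoint_url': 'https://nyc3.digitaloceanspaces.com',
        'aws_access_key_id': 'SPACES_ACCESS_KEY',
        'aws_secret_access_key': 'SPACES_SECRET_KEY'
    }
)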
Hybrid Architecture Pattern
Combine the best of both approaches for optimal performance:
┌─────────────────────────────────────────────────────────────┐
│ Hybrid Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ WRITES (Download/Sync) READS (Analysis) │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ FUSE Mount │ │ EDGAR_DATA_URL │ │
│ │ (s3fs/rclone) │ │ + CloudFront │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ S3 Bucket (Origin) │ │
│ │ my-edgar-data │ │
│ └──────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Implementation
Download Server (writes to S3):
# Mount S3 for writing
goofys my-edgar-bucket /mnt/edgar-data
# Configure EdgarTools
export EDGAR_LOCAL_DATA_DIR=/mnt/edgar-data
export EDGAR_USE_LOCAL_DATA=1
from edgar import download_filings
download_filings("2025-01-01:2025-01-31") # Writes to S3
Analysis Clients (reads via HTTP):
import os
# Fast reads via CloudFront
os.environ['EDGAR_DATA_URL'] = 'https://d1234567890.cloudfront.net/'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'
from edgar import Company, get_filings
# All reads go through CloudFront CDN
filings = get_filings(form="10-K", year=2024)
Sync Strategies
Initial Bulk Upload
# Parallel upload with rclone
rclone copy ~/.edgar edgar-s3:my-edgar-bucket \
--transfers 32 \
--checkers 16 \
--progress
# Or with AWS CLI
aws s3 sync ~/.edgar s3://my-edgar-bucket \
--storage-class STANDARD_IA
Incremental Daily Sync
Create a cron job for daily updates:
# /etc/cron.d/edgar-sync
0 6 * * * edgar /usr/local/bin/edgar-daily-sync.sh
edgar-daily-sync.sh:
#!/bin/bash
set -e
# Download yesterday's filings locally first
export EDGAR_LOCAL_DATA_DIR=/tmp/edgar-staging
python -c "
from edgar import download_filings
from datetime import datetime, timedelta
yesterday = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
download_filings(yesterday)
"
# Sync to S3
rclone sync /tmp/edgar-staging/filings edgar-s3:my-edgar-bucket/filings \
--transfers 16 \
--progress
# Cleanup
rm -rf /tmp/edgar-staging
Bidirectional Sync
For teams with multiple download nodes:
# Use rclone bisync for two-way sync
rclone bisync /mnt/local-edgar edgar-s3:my-edgar-bucket \
--resync \
--verbose
Performance Optimization
Caching Recommendations
| Scenario | Tool | Cache Settings |
|---|---|---|
| Frequent reads | goofys | --stat-cache-ttl 1h |
| Large file writes | rclone | --vfs-cache-mode full --vfs-cache-max-size 10G |
| Mixed workload | s3fs | -o use_cache=/tmp/s3cache -o ensure_diskfree=2048 |
Compression
Filings are already compressed by EdgarTools. Additional S3 compression isn't necessary.
Lifecycle Policies
Reduce storage costs with lifecycle rules:
{
"Rules": [
{
"ID": "MoveToIA",
"Status": "Enabled",
"Filter": {"Prefix": "filings/"},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 180,
"StorageClass": "GLACIER"
}
]
}
]
}
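Assuming the JSON above is saved as lifecycle.json (the filename is arbitrary), apply it with the AWS CLI:
# Apply the lifecycle rules to the bucket
aws s3api put-bucket-lifecycle-configuration \
--bucket my-edgar-bucket \
--lifecycle-configuration file://lifecycle.json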
Troubleshooting
Common Issues
"Transport endpoint not connected"
# FUSE mount crashed - remount
sudo fusermount -u /mnt/edgar-data
goofys my-edgar-bucket /mnt/edgar-data
Slow performance with s3fs
# Enable parallel requests and caching
s3fs bucket /mnt/data \
-o parallel_count=20 \
-o multipart_size=52 \
-o use_cache=/tmp/s3cache \
-o max_stat_cache_size=100000
Permission denied on mount
# Add user_allow_other to /etc/fuse.conf
echo "user_allow_other" | sudo tee -a /etc/fuse.conf
# Mount with allow_other
s3fs bucket /mnt/data -o allow_other
R2 connection issues
# R2 supports only AWS Signature Version 4; use your account endpoint with path-style requests
s3fs bucket /mnt/data \
-o url=https://ACCOUNT_ID.r2.cloudflarestorage.com \
-o use_path_request_style \
-o sigv4
Debugging
# s3fs debug mode
s3fs bucket /mnt/data -d -f -o dbglevel=info
# rclone debug
rclone mount remote:bucket /mnt/data -vv --log-file=/tmp/rclone.log
# Check mount status
mount | grep fuse
df -h /mnt/edgar-data
Security Best Practices
IAM Policies (AWS)
Least-privilege policy for EdgarTools:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-edgar-bucket",
"arn:aws:s3:::my-edgar-bucket/*"
]
}
]
}
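As a sketch, save the document above as edgar-policy.json (an arbitrary filename), create a managed policy from it, and attach it to the IAM identity that runs EdgarTools. The policy name, user name, and account ID below are placeholders.
# Create the managed policy
aws iam create-policy \
--policy-name EdgarToolsS3Access \
--policy-document file://edgar-policy.json
# Attach it to the user (or role) that runs EdgarTools
aws iam attach-user-policy \
--user-name edgar \
--policy-arn arn:aws:iam::ACCOUNT_ID:policy/EdgarToolsS3Access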
Encryption
# Enable server-side encryption
aws s3api put-bucket-encryption \
--bucket my-edgar-bucket \
--server-side-encryption-configuration \
'{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
Private Access (No Public URLs)
For internal-only access, skip the static website hosting and use:
- FUSE mount with IAM credentials
- VPC endpoints for AWS (see the sketch below)
- Private connectivity for GCP/Azure
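For example, a gateway VPC endpoint keeps S3 traffic on the AWS network without any public URLs; the VPC and route table IDs below are placeholders.
# Create a gateway endpoint for S3 in your VPC
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0123456789abcdef0 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-0123456789abcdef0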
Native Cloud Support
EdgarTools provides native cloud storage support via fsspec, enabling seamless integration with S3, Google Cloud Storage, Azure Blob Storage, and S3-compatible services.
Installation
Install the cloud storage dependencies for your provider:
# AWS S3, Cloudflare R2, MinIO, DigitalOcean Spaces
pip install "edgartools[s3]"
# Google Cloud Storage
pip install "edgartools[gcs]"
# Azure Blob Storage
pip install "edgartools[azure]"
# All cloud providers
pip install "edgartools[all-cloud]"
Basic Usage
import edgar
# AWS S3 (uses default credentials from ~/.aws or environment)
edgar.use_cloud_storage('s3://my-edgar-bucket/')
# Now all operations use cloud storage
company = edgar.Company("AAPL")
filings = company.get_filings(form="10-K")
Provider Examples
AWS S3
import edgar
# Using default AWS credentials
edgar.use_cloud_storage('s3://my-edgar-bucket/')
# With explicit credentials
edgar.use_cloud_storage(
's3://my-edgar-bucket/',
client_kwargs={
'aws_access_key_id': 'YOUR_ACCESS_KEY',
'aws_secret_access_key': 'YOUR_SECRET_KEY',
'region_name': 'us-east-1'
}
)
Cloudflare R2
import edgar
edgar.use_cloud_storage(
's3://my-bucket/',
client_kwargs={
'endpoint_url': 'https://ACCOUNT_ID.r2.cloudflarestorage.com',
'region_name': 'auto'
}
)
Google Cloud Storage
import edgar
# Using default GCP credentials
edgar.use_cloud_storage('gs://my-edgar-bucket/')
# With explicit project
edgar.use_cloud_storage(
'gs://my-edgar-bucket/',
client_kwargs={'project': 'my-project'}
)
Azure Blob Storage
import edgar
edgar.use_cloud_storage(
'az://my-container/edgar/',
client_kwargs={
'account_name': 'myaccount',
'account_key': 'YOUR_ACCOUNT_KEY'
}
)
MinIO (Self-Hosted S3)
import edgar
edgar.use_cloud_storage(
's3://edgar-data/',
client_kwargs={
'endpoint_url': 'http://localhost:9000',
'aws_access_key_id': 'minioadmin',
'aws_secret_access_key': 'minioadmin'
}
)
Connection Verification
By default, use_cloud_storage() verifies the connection by listing the bucket. This catches configuration errors early:
import edgar
# Fails immediately if credentials are wrong or bucket doesn't exist
edgar.use_cloud_storage('s3://my-bucket/')
# Skip verification for faster startup (not recommended)
edgar.use_cloud_storage('s3://my-bucket/', verify=False)
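When verification fails the call fails immediately rather than silently falling back to local storage. The sketch below assumes it raises an exception and uses a broad handler, since the exact exception type depends on the provider and fsspec backend.
import edgar
try:
    edgar.use_cloud_storage('s3://my-bucket/')
except Exception as exc:  # exact exception type varies by provider/backend
    print(f"Cloud storage setup failed: {exc}")
    # Fix the credentials or bucket URL, or continue with local storage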
Disabling Cloud Storage
import edgar
# Revert to local storage
edgar.use_cloud_storage(disable=True)
Uploading Data to Cloud Storage
EdgarTools provides two ways to populate your cloud storage with SEC data:
Option 1: Download and Upload in One Step
Use the upload_to_cloud parameter with download_filings():
import edgar
# Configure cloud storage first
edgar.use_cloud_storage('s3://my-edgar-bucket/')
# Download filings and upload to cloud automatically
edgar.download_filings('2025-01-15', upload_to_cloud=True)
# Download a date range
edgar.download_filings('2025-01-01:2025-01-15', upload_to_cloud=True)
Option 2: Sync Existing Local Data
Use sync_to_cloud() to upload data you've already downloaded locally:
import edgar
# Configure cloud storage
edgar.use_cloud_storage('s3://my-edgar-bucket/')
# Sync all local filings to cloud
result = edgar.sync_to_cloud('filings')
print(f"Uploaded: {result['uploaded']}, Skipped: {result['skipped']}")
# Sync specific date directory
edgar.sync_to_cloud('filings/20250115')
# Preview what would be uploaded (dry run)
edgar.sync_to_cloud('filings', dry_run=True)
# Overwrite existing files in cloud
edgar.sync_to_cloud('filings', overwrite=True)
sync_to_cloud() Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| source_path | str | None | Subdirectory to sync (e.g., 'filings', 'filings/20250115') |
| pattern | str | '*/' | Glob pattern for files to sync |
| batch_size | int | 20 | Number of concurrent uploads |
| overwrite | bool | False | Overwrite existing files in cloud |
| dry_run | bool | False | Preview without uploading |
Return Value
sync_to_cloud() returns a dict with upload statistics:
{
'uploaded': 150, # Files successfully uploaded
'skipped': 50, # Files already in cloud (when overwrite=False)
'failed': 0, # Files that failed to upload
'errors': [] # Error messages for failed uploads
}
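A short sketch of acting on those statistics, using the keys shown above:
import edgar
edgar.use_cloud_storage('s3://my-edgar-bucket/')
result = edgar.sync_to_cloud('filings')
if result['failed']:
    # Inspect the error messages, then fix and rerun the sync
    for message in result['errors']:
        print("Upload failed:", message)
else:
    print(f"Uploaded {result['uploaded']} files, skipped {result['skipped']} already in cloud")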
Features
| Feature | Description |
|---|---|
| Cross-platform | Works on Windows, macOS, and Linux |
| No FUSE required | Pure Python implementation |
| Transparent compression | Handles .gz files automatically |
| Full read/write | Both reading and writing supported |
| Provider agnostic | Same API for all cloud providers |
Summary
| Use Case | Recommended Approach |
|---|---|
| Native cloud support | use_cloud_storage() (recommended) |
| Read-only HTTP access | EDGAR_DATA_URL + static website |
| Legacy FUSE mount | goofys or rclone mount |
| On-premises | MinIO + use_cloud_storage() |
| Zero egress costs | Cloudflare R2 |
Quick Start
Native cloud storage (recommended):
import edgar
# Install: pip install "edgartools[s3]"
edgar.use_cloud_storage('s3://my-edgar-bucket/')
# Read from cloud
company = edgar.Company("AAPL")
# Write to cloud
edgar.download_filings('2025-01-15', upload_to_cloud=True)
# Or sync existing local data
edgar.sync_to_cloud('filings')
Read-only via HTTP:
import os
os.environ['EDGAR_DATA_URL'] = 'https://your-bucket.s3.amazonaws.com/'
os.environ['EDGAR_USE_LOCAL_DATA'] = '1'
Legacy FUSE mount (Linux/macOS):
goofys my-edgar-bucket /mnt/edgar-data
export EDGAR_LOCAL_DATA_DIR=/mnt/edgar-data
export EDGAR_USE_LOCAL_DATA=1
For questions or feedback, see Discussion #507.