# Performance Optimization
Working with SEC data can be resource-intensive due to the volume of data, network latency, and SEC's rate limits. This guide provides strategies to optimize your edgartools workflows for better performance.
## Understanding How edgartools Fetches Data
To optimize performance, it's important to understand how edgartools retrieves data from the SEC EDGAR system.
### How get_filings() Works

The global `get_filings()` function operates as follows:
- It fetches quarterly filing indexes to cover the requested time period
- For the current year, it fetches complete data for the year to date
- For multiple years, it fetches quarterly indexes for each year
- Each quarterly index requires a separate HTTP request
For example, requesting filings for 2024 requires 4 HTTP requests (one for each quarter), while requesting filings for 2020-2024 requires 20 HTTP requests.
```python
from edgar import get_filings

# This makes 4 HTTP requests (one per quarter)
filings_2024 = get_filings(year=2024)

# This makes 20 HTTP requests (5 years × 4 quarters)
filings_multi_year = get_filings(start_date="2020-01-01", end_date="2024-12-31")
```
### How company.get_filings() Works

The `company.get_filings()` method works differently:
- It fetches the company's submission JSON file, which contains all available filings for that company
- This requires just one HTTP request, regardless of the date range
- The data is then filtered client-side based on your criteria
```python
from edgar import Company

# This makes just 1 HTTP request, regardless of date range
company = Company("AAPL")
company_filings = company.get_filings(form="10-K")
```
### Filing Content Retrieval
Both methods above only return filing metadata (indexes). When you access the actual content of a filing, an additional HTTP request is made:
```python
# This makes an additional HTTP request when you access the filing
filing = filings.latest()
filing_text = filing.text  # HTTP request happens here
```
## Choosing the Right Access Pattern
Based on your specific use case, choose the most efficient access pattern:
| If your query is... | Use this approach | Why |
|---|---|---|
| Focused on specific form types across companies | `get_filings(form="4")` | Efficiently filters by form type |
| Focused on a single company | `company.get_filings()` | Makes just one HTTP request |
| Across multiple specific companies | `get_filings().filter(cik=["0000320193", "0000789019"])` | Allows precise filtering |
| Limited to a specific year | `get_filings(year=2024)` | Minimizes the number of index requests |
| Focused on recent filings | `get_filings().latest(100)` | Gets only the most recent filings |
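The multi-company and recent-filings rows are not demonstrated elsewhere in this guide, so here is a short sketch combining them, using the `filter(cik=...)` and `latest(100)` calls exactly as they appear in the table above:

```python
from edgar import get_filings

# One year of index data, narrowed to two specific companies
# (CIKs as in the table: 0000320193 is Apple, 0000789019 is Microsoft)
filings_2024 = get_filings(year=2024)
two_companies = filings_2024.filter(cik=["0000320193", "0000789019"])

# Or keep only the 100 most recent filings across all companies
recent = get_filings(year=2024).latest(100)
```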
## Rate Limiting Considerations
By default, edgartools limits requests to a maximum of 10 per second to comply with SEC EDGAR's rate limits. Exceeding these limits can result in your IP being temporarily blocked.
```python
# Default rate limit is 10 requests per second
# You can adjust it if needed (use with caution)
from edgar import set_rate_limit

# Decrease the rate limit for a more conservative approach
set_rate_limit(5)  # 5 requests per second
```
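If you build your own request loops on top of edgartools (for example, fetching `filing.text` for many filings), a simple client-side throttle keeps you comfortably under the limit. The helper below is our own sketch, not part of the edgartools API:

```python
import time

def throttled(items, max_per_second=8):
    """Yield items no faster than max_per_second (our own helper, not an edgartools API)."""
    min_interval = 1.0 / max_per_second
    last = 0.0
    for item in items:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield item

# Example: fetch filing text without bursting past the SEC limit
# texts = [f.text for f in throttled(all_filings)]
```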
## Using Local Storage for Performance
One of the most effective ways to improve performance is to use local storage. This allows you to:
- Cache filings locally to avoid repeated HTTP requests
- Process filings offline without network latency
- Batch download filings for later analysis
### Setting Up Local Storage
```python
from edgar import Company, enable_local_storage

# Enable local storage
enable_local_storage("/path/to/storage")

# Now filings will be stored locally
company = Company("MSFT")
filings = company.get_filings(form="10-K")
filing = filings.latest()

# This will use the local copy if available, or download and cache it if not
text = filing.text
```
### Batch Downloading Filings
For large-scale analysis, batch download filings first, then process them offline:
```python
from edgar import Company, download_filings

# Get filing metadata
companies = ["AAPL", "MSFT", "GOOGL", "AMZN", "META"]
all_filings = []

for ticker in companies:
    company = Company(ticker)
    filings = company.get_filings(form="10-K").head(5)  # Last 5 10-Ks
    all_filings.extend(filings)

# Batch download all filings (this makes HTTP requests efficiently)
download_filings(all_filings, "/path/to/storage")

# Now process them offline (no HTTP requests)
for filing in all_filings:
    # Process filing without network latency
    text = filing.text  # Uses local copy
```
## Memory Optimization
When working with many filings or large filings, memory usage can become a concern.
### Processing Large Datasets
For large datasets, use generators and process filings one at a time:
```python
def process_filings_generator(filings):
    for filing in filings:
        # Process one filing at a time; only the current filing's
        # results are held in memory
        result = process_filing(filing)
        yield result

# Process filings one at a time
for result in process_filings_generator(all_filings):
    save_or_analyze(result)
```
### Working with Large Filings
For large filings (like 10-Ks), process sections individually:
```python
# Parse the latest 10-K into a Data Object
filing = company.get_filings(form="10-K").latest().obj()

# Process one section at a time
sections = ["business", "risk_factors", "management_discussion"]
for section_name in sections:
    if hasattr(filing, section_name):
        section = getattr(filing, section_name)
        # Process the section
        process_section(section_name, section)
        # Free memory before moving to the next section
        del section
```
## Parallel Processing
For computationally intensive tasks, consider parallel processing:
```python
from concurrent.futures import ThreadPoolExecutor
import time

def process_filing_with_delay(filing):
    # Add a delay to respect rate limits
    time.sleep(0.1)
    # Process the filing
    return {"accession": filing.accession_number, "text_length": len(filing.text)}

# Process filings in parallel with a thread pool
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(process_filing_with_delay, all_filings))
```
## Caching Strategies
Implement caching for expensive operations:
```python
import functools

@functools.lru_cache(maxsize=128)
def get_filing_sentiment(filing_accession):
    # Expensive operation to calculate sentiment
    filing = get_filing_by_accession(filing_accession)
    text = filing.text
    # Calculate sentiment (expensive operation)
    return calculate_sentiment(text)

# This will be cached after the first call
sentiment = get_filing_sentiment("0000320193-20-000096")
```
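`functools.lru_cache` only lives for the duration of the process. If you want cached results to survive between runs, a small disk-backed cache works as well; the sketch below reuses the same placeholder `calculate_sentiment()` from the example above and stores one JSON file per accession number:

```python
import json
from pathlib import Path

CACHE_DIR = Path("sentiment_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_filing_sentiment_cached(filing):
    """Disk-backed cache keyed by accession number (our own sketch, not an edgartools API)."""
    cache_file = CACHE_DIR / f"{filing.accession_number}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = calculate_sentiment(filing.text)  # same placeholder as above
    cache_file.write_text(json.dumps(result))
    return result
```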
## Performance Benchmarks
Here are some typical performance benchmarks to help you plan your workflows:
| Operation | Typical Time | Notes |
|---|---|---|
| `get_filings(year=2024)` | 2-5 seconds | Fetches 4 quarterly indexes |
| `company.get_filings()` | 1-2 seconds | Single HTTP request |
| Downloading a 10-K filing | 1-3 seconds | Depends on filing size |
| Parsing a 10-K as a Data Object | 2-5 seconds | First-time parsing |
| Accessing a locally stored filing | < 0.1 seconds | From disk cache |
| Processing 100 filings sequentially | 3-10 minutes | With rate limiting |
| Processing 100 filings in parallel | 1-3 minutes | With proper rate limiting |
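These numbers vary with network conditions, filing size, and local hardware, so it is worth measuring in your own environment. A minimal timing check:

```python
import time
from edgar import Company

start = time.perf_counter()
filings = Company("AAPL").get_filings(form="10-K")
print(f"company.get_filings() took {time.perf_counter() - start:.2f}s")
```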
## Best Practices Summary
- Choose the right access pattern based on your specific use case
- Use `company.get_filings()` when focusing on a single company
- Enable local storage to avoid repeated HTTP requests
- Batch download filings before processing them
- Process filings one at a time for large datasets
- Respect SEC rate limits to avoid being blocked
- Implement caching for expensive operations
- Use parallel processing carefully with appropriate delays
- Filter filings early in your pipeline to reduce the number of filings to process
- Monitor memory usage when working with large filings or datasets
By following these guidelines, you can significantly improve the performance of your edgartools workflows while respecting SEC EDGAR's rate limits and your system's resources.
## Advanced Techniques
### Custom Indexing
For repeated analysis of the same dataset, consider creating your own indexes:
```python
import pandas as pd
from edgar import get_filings

# Create a custom index of filings
filings = get_filings(form=["10-K", "10-Q"], year=2024)

index_data = []
for filing in filings:
    index_data.append({
        "accession": filing.accession_number,
        "cik": filing.cik,
        "company": filing.company_name,
        "form": filing.form_type,
        "date": filing.filing_date,
        "path": filing.get_local_path() if filing.is_local() else None
    })

# Save as CSV for quick loading
index_df = pd.DataFrame(index_data)
index_df.to_csv("filings_index_2024.csv", index=False)

# Later, load the index instead of fetching again
loaded_index = pd.read_csv("filings_index_2024.csv")
```
### Incremental Updates
For ongoing analysis, implement incremental updates:
```python
import datetime
from edgar import get_filings

# Get the date of your last update
last_update = datetime.date(2024, 6, 1)
today = datetime.date.today()

# Only fetch filings since your last update
new_filings = get_filings(start_date=last_update, end_date=today)

# Process only the new filings
for filing in new_filings:
    process_filing(filing)

# Update your last update date
last_update = today
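```

Since `last_update` above only lives in memory, an ongoing job usually persists it between runs. A minimal sketch (the state file name is arbitrary):

```python
import datetime
from pathlib import Path

STATE_FILE = Path("last_update.txt")

def load_last_update(default=datetime.date(2024, 6, 1)):
    """Read the last processed date, falling back to a default on the first run."""
    if STATE_FILE.exists():
        return datetime.date.fromisoformat(STATE_FILE.read_text().strip())
    return default

def save_last_update(day):
    STATE_FILE.write_text(day.isoformat())

# Usage with the loop above:
#   last_update = load_last_update()
#   ...fetch and process new filings...
#   save_last_update(datetime.date.today())
```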
By implementing these performance optimization strategies, you can make your edgartools workflows more efficient, faster, and more resilient.