Company Subsets
The edgar.reference.company_subsets module provides tools for creating subsets of companies from SEC reference data. This is especially useful for research, analysis, teaching, and machine learning tasks that need specific groups of companies.
Key Features
- Exchange-based selection: Filter by NYSE, NASDAQ, OTC, CBOE
- Popularity-based selection: Get popular stocks, mega-cap companies, etc.
- Sampling capabilities: Random sampling, stratified sampling, top N selection
- Filtering and combination utilities: Include/exclude specific companies, combine sets
- Fluent interface: Chain operations for readable, flexible subset creation
- Consistent output: All functions return standardized DataFrames with ['cik', 'ticker', 'name', 'exchange'] columns
Quick Start
from edgar.reference.company_subsets import (
    CompanySubset,
    get_companies_by_exchanges,
    get_popular_companies,
    get_random_sample,
)
# Simple exchange-based selection
nyse_companies = get_companies_by_exchanges('NYSE')
print(f"Found {len(nyse_companies)} NYSE companies")
# Get popular companies
popular = get_popular_companies()
print(f"Found {len(popular)} popular companies")
# Random sampling
random_100 = get_random_sample(n=100, random_state=42)
print(f"Sampled {len(random_100)} random companies")
Fluent Interface with CompanySubset
The CompanySubset class provides a fluent interface for building complex company selections:
from edgar.reference.company_subsets import CompanySubset, PopularityTier
# Complex selection with method chaining
companies = (CompanySubset()
    .from_exchange(['NYSE', 'Nasdaq'])       # Major exchanges only
    .exclude_tickers(['JPM', 'GS', 'C'])     # Exclude some financials
    .sample(50, random_state=42)             # Take random sample
    .get())                                  # Get the DataFrame
print(f"Selected {len(companies)} companies")
print(companies.head())
# Popular tech companies
tech_subset = (CompanySubset()
    .from_popular(PopularityTier.POPULAR)    # Popular companies
    # Return the matching rows, not just a boolean mask
    .filter_by(lambda df: df[df['name'].str.contains('tech|software|computer', case=False)])
    .top(20, by='ticker')                    # Top 20 alphabetically
    .get())
Core Functions
Exchange-Based Selection
Filter companies by stock exchange:
from edgar.reference.company_subsets import get_companies_by_exchanges
# Single exchange
nyse_companies = get_companies_by_exchanges('NYSE')
nasdaq_companies = get_companies_by_exchanges('Nasdaq')
# Multiple exchanges
major_exchanges = get_companies_by_exchanges(['NYSE', 'Nasdaq'])
all_exchanges = get_companies_by_exchanges(['NYSE', 'Nasdaq', 'OTC', 'CBOE'])
print(f"NYSE: {len(nyse_companies)} companies")
print(f"NASDAQ: {len(nasdaq_companies)} companies")
print(f"Major exchanges: {len(major_exchanges)} companies")
Popular Companies
Access curated lists of popular and well-known companies:
from edgar.reference.company_subsets import get_popular_companies, PopularityTier
# All popular companies
all_popular = get_popular_companies()
# By popularity tier
mega_cap = get_popular_companies(PopularityTier.MEGA_CAP) # Top 10
popular = get_popular_companies(PopularityTier.POPULAR) # Top 50
mainstream = get_popular_companies(PopularityTier.MAINSTREAM) # Top 100
emerging = get_popular_companies(PopularityTier.EMERGING) # All available
print(f"Mega cap: {len(mega_cap)} companies")
print(f"Popular: {len(popular)} companies")
print(f"All popular: {len(all_popular)} companies")
Sampling Methods
Create representative samples from larger datasets:
from edgar.reference.company_subsets import (
    get_companies_by_exchanges,
    get_random_sample,
    get_stratified_sample,
    get_top_companies_by_metric,
)
# Random sampling
random_sample = get_random_sample(n=200, random_state=42)
# Stratified sampling (maintains exchange proportions)
stratified_sample = get_stratified_sample(
    n=100,
    stratify_by='exchange',
    random_state=42
)
# Top companies by name (alphabetical)
top_alphabetical = get_top_companies_by_metric(
    n=50,
    metric='name',
    ascending=True
)
# Sample from a specific subset
nyse_random = get_random_sample(
    get_companies_by_exchanges('NYSE'),
    n=100,
    random_state=42
)
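To make "maintains exchange proportions" concrete, here is a minimal pandas sketch of proportional stratified sampling on a toy table. It illustrates the idea only and is not the module's implementation; the helper name `proportional_sample` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy reference table (hypothetical rows): 6 NYSE and 4 Nasdaq companies.
toy = pd.DataFrame({
    "cik": range(1, 11),
    "ticker": [f"T{i}" for i in range(1, 11)],
    "name": [f"Company {i}" for i in range(1, 11)],
    "exchange": ["NYSE"] * 6 + ["Nasdaq"] * 4,
})


def proportional_sample(df, n, stratify_by, random_state=None):
    """Draw about n rows, giving each group a share proportional to its size."""
    parts = []
    for _, group in df.groupby(stratify_by):
        # Each group's quota is its fraction of the table times n, rounded.
        quota = round(len(group) * n / len(df))
        parts.append(group.sample(n=min(quota, len(group)), random_state=random_state))
    return pd.concat(parts).reset_index(drop=True)


sample = proportional_sample(toy, n=5, stratify_by="exchange", random_state=42)
counts = sample["exchange"].value_counts().to_dict()
print(counts)
```

Because each quota is rounded, the total can differ slightly from n when the group sizes do not divide evenly.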
Filtering and Combining
Include/Exclude Specific Companies
from edgar.reference.company_subsets import (
    exclude_companies,
    filter_companies,
    get_all_companies,
)
all_companies = get_all_companies()
# Include specific tickers (FAANG companies)
faang = filter_companies(
    all_companies,
    ticker_list=['META', 'AAPL', 'AMZN', 'NFLX', 'GOOGL']
)
# Include companies with names containing specific text
tech_companies = filter_companies(
    all_companies,
    name_contains='Technology'
)
# Include specific CIKs
specific_companies = filter_companies(
    all_companies,
    cik_list=[320193, 1018724, 1652044]  # AAPL, AMZN, GOOGL
)
# Exclude financial companies (simplified example)
non_financial = exclude_companies(
    all_companies,
    ticker_list=['JPM', 'GS', 'C', 'BAC', 'WFC']
)
# Exclude companies with 'Corp' in name
non_corp = exclude_companies(
    all_companies,
    name_contains='Corp'
)
Custom Filtering
Apply custom filtering logic:
from edgar.reference.company_subsets import (
    filter_companies,
    get_companies_by_exchanges,
    get_popular_companies,
)
# Custom filter function
def large_company_filter(df):
    """Filter to companies with longer names (proxy for larger companies)."""
    return df[df['name'].str.len() > 20]
# Apply custom filter
large_companies = filter_companies(
    get_companies_by_exchanges('NYSE'),
    custom_filter=large_company_filter
)
# Using lambda for simple filters
short_tickers = filter_companies(
    get_popular_companies(),
    custom_filter=lambda df: df[df['ticker'].str.len() <= 4]
)
Combining and Intersecting Sets
from edgar.reference.company_subsets import (
    combine_company_sets,
    filter_companies,
    get_all_companies,
    get_companies_by_exchanges,
    get_popular_companies,
    intersect_company_sets,
)
# Get different company sets
nyse_companies = get_companies_by_exchanges('NYSE')
popular_companies = get_popular_companies()
tech_companies = filter_companies(get_all_companies(), name_contains='Tech')
# Union: Combine multiple sets (removes duplicates)
combined = combine_company_sets([nyse_companies, popular_companies, tech_companies])
# Intersection: Find companies present in all sets
nyse_popular = intersect_company_sets([nyse_companies, popular_companies])
popular_tech = intersect_company_sets([popular_companies, tech_companies])
print(f"Combined: {len(combined)} companies")
print(f"NYSE + Popular intersection: {len(nyse_popular)} companies")
print(f"Popular + Tech intersection: {len(popular_tech)} companies")
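The union and intersection semantics described above can be sketched with plain pandas on two tiny tables. This is an illustration under the assumption that the CIK is the key used to identify duplicates; the rows are hypothetical and the code is not the module's implementation.

```python
import pandas as pd

COLUMNS = ["cik", "ticker", "name", "exchange"]

# Two small hypothetical sets that share one company (cik 2).
set_a = pd.DataFrame(
    [(1, "AAA", "Alpha Corp", "NYSE"), (2, "BBB", "Beta Inc", "Nasdaq")],
    columns=COLUMNS,
)
set_b = pd.DataFrame(
    [(2, "BBB", "Beta Inc", "Nasdaq"), (3, "CCC", "Gamma Ltd", "NYSE")],
    columns=COLUMNS,
)

# Union: stack the sets, then keep the first row seen for each CIK.
union = (
    pd.concat([set_a, set_b])
    .drop_duplicates(subset="cik")
    .reset_index(drop=True)
)

# Intersection: keep rows of set_a whose CIK also appears in set_b.
intersection = set_a[set_a["cik"].isin(set_b["cik"])].reset_index(drop=True)

print(union["ticker"].tolist())
print(intersection["ticker"].tolist())
```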
Convenience Functions
Pre-defined functions for common company groupings:
from edgar.reference.company_subsets import (
    get_faang_companies,
    get_tech_giants,
    get_dow_jones_sample,
)
# FAANG companies (Meta, Apple, Amazon, Netflix, Google)
faang = get_faang_companies()
# Major tech companies
tech_giants = get_tech_giants()
# Dow Jones Industrial Average sample
dow_sample = get_dow_jones_sample()
print(f"FAANG: {len(faang)} companies")
print(f"Tech Giants: {len(tech_giants)} companies")
print(f"Dow Sample: {len(dow_sample)} companies")
# Display the companies
print("\nFAANG Companies:")
for _, company in faang.iterrows():
    print(f"  {company['ticker']}: {company['name']}")
Advanced Examples
Research Dataset Creation
Create a balanced research dataset:
from edgar.reference.company_subsets import (
    CompanySubset,
    PopularityTier,
    combine_company_sets,
)
# Build a research dataset from companies in different tiers
# Get 20 mega-cap companies
mega_cap = (CompanySubset()
    .from_popular(PopularityTier.MEGA_CAP)
    .sample(20, random_state=42)
    .get())
# Get 30 popular mid-tier companies
mid_tier = (CompanySubset()
    .from_popular(PopularityTier.POPULAR)
    .exclude_tickers(mega_cap['ticker'].tolist())  # Don't overlap
    .sample(30, random_state=42)
    .get())
# Get 50 random companies from major exchanges
random_companies = (CompanySubset()
    .from_exchange(['NYSE', 'Nasdaq'])
    .exclude_tickers(mega_cap['ticker'].tolist() + mid_tier['ticker'].tolist())
    .sample(50, random_state=42)
    .get())
# Combine all for final research set
research_companies = combine_company_sets([mega_cap, mid_tier, random_companies])
print(f"Research dataset: {len(research_companies)} companies")
# Analyze composition
exchange_dist = research_companies['exchange'].value_counts()
print("\nExchange distribution:")
print(exchange_dist)
Sector-Based Analysis
Create industry-focused subsets:
from edgar.reference.company_subsets import (
    combine_company_sets,
    filter_companies,
    get_companies_by_exchanges,
    get_top_companies_by_metric,
)
# Create sector-based subsets (simplified approach using name patterns)
sectors = {
    'technology': ['tech', 'software', 'computer', 'digital'],
    'financial': ['bank', 'financial', 'insurance', 'capital'],
    'healthcare': ['health', 'medical', 'pharma', 'bio'],
    'energy': ['energy', 'oil', 'gas', 'power'],
    'retail': ['retail', 'store', 'market', 'shop']
}
sector_companies = {}
all_companies = get_companies_by_exchanges(['NYSE', 'Nasdaq'])
for sector, keywords in sectors.items():
    # Create a regex pattern that matches any of the keywords
    pattern = '|'.join(keywords)
    # Bind the pattern as a default argument so each lambda keeps its own value
    sector_subset = filter_companies(
        all_companies,
        custom_filter=lambda df, p=pattern: df[df['name'].str.contains(p, case=False)]
    )
    sector_companies[sector] = sector_subset
    print(f"{sector.title()}: {len(sector_subset)} companies")
# Get top 10 from each sector for analysis
analysis_set = []
for sector, companies in sector_companies.items():
    top_10 = get_top_companies_by_metric(companies, n=10, metric='ticker')
    analysis_set.append(top_10)
final_analysis_set = combine_company_sets(analysis_set)
print(f"\nFinal analysis set: {len(final_analysis_set)} companies across sectors")
Machine Learning Dataset Preparation
Prepare balanced datasets for ML training:
from edgar.reference.company_subsets import (
    get_popular_companies,
    get_stratified_sample,
)
# Create training/test split with stratification
all_popular = get_popular_companies()
# Training set (70% of data, stratified by exchange)
training_companies = get_stratified_sample(
    all_popular,
    n=int(len(all_popular) * 0.7),
    stratify_by='exchange',
    random_state=42
)
# Test set (remaining companies)
test_companies = all_popular[~all_popular['cik'].isin(training_companies['cik'])]
print(f"Training set: {len(training_companies)} companies")
print(f"Test set: {len(test_companies)} companies")
# Verify stratification worked
print("\nTraining exchange distribution:")
print(training_companies['exchange'].value_counts(normalize=True))
print("\nTest exchange distribution:")
print(test_companies['exchange'].value_counts(normalize=True))
Data Structure
All functions return a standardized pandas DataFrame with these columns:
- cik (int): SEC Central Index Key, the unique company identifier
- ticker (str): Stock ticker symbol (e.g., 'AAPL', 'MSFT')
- name (str): Official company name
- exchange (str): Stock exchange ('NYSE', 'Nasdaq', 'OTC', 'CBOE', etc.)
from edgar.reference.company_subsets import get_random_sample
# Example output structure (illustrative; the rows depend on the sample drawn)
companies = get_random_sample(n=5)
print(companies)
#        cik ticker  name                   exchange
# 0   320193 AAPL    Apple Inc.             Nasdaq
# 1  1018724 AMZN    Amazon.com, Inc.       Nasdaq
# 2  1652044 GOOGL   Alphabet Inc.          Nasdaq
# 3   789019 MSFT    Microsoft Corporation  Nasdaq
# 4  1326801 META    Meta Platforms, Inc.   Nasdaq
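A quick way to confirm that a DataFrame follows this layout is a small validation helper. The sketch below uses only pandas; `check_company_frame` is a hypothetical name introduced here for illustration and is not part of the module.

```python
import pandas as pd

EXPECTED_COLUMNS = ["cik", "ticker", "name", "exchange"]


def check_company_frame(df):
    """Return a list of problems found; an empty list means the frame looks right."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "cik" in df.columns and not pd.api.types.is_integer_dtype(df["cik"]):
        problems.append("cik is not an integer column")
    return problems


good = pd.DataFrame(
    [(320193, "AAPL", "Apple Inc.", "Nasdaq")], columns=EXPECTED_COLUMNS
)
# Break the frame on purpose: drop a column and turn the CIK into text.
bad = good.drop(columns=["exchange"]).astype({"cik": str})

print(check_company_frame(good))
print(check_company_frame(bad))
```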
Error Handling
The module includes robust error handling and logging:
# Functions gracefully handle errors and return empty DataFrames
empty_result = get_companies_by_exchanges('INVALID_EXCHANGE')
print(f"Invalid exchange result: {len(empty_result)} companies")
# Check for empty results
companies = get_random_sample(n=10)
if companies.empty:
    print("No companies found")
else:
    print(f"Found {len(companies)} companies")
# All functions include logging for debugging
import logging
logging.basicConfig(level=logging.DEBUG)
# Now function calls will show debug information
companies = get_popular_companies()
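Since an invalid input yields an empty DataFrame rather than an exception, downstream code may prefer to fail fast. The following is a minimal sketch of such a guard; `require_companies` is a hypothetical helper, not something the module provides, and the frames are built by hand so the example runs on its own.

```python
import pandas as pd


def require_companies(df, label):
    """Raise a clear error if a subset is empty, otherwise return it unchanged."""
    if df.empty:
        raise ValueError(f"No companies found for {label}")
    return df


columns = ["cik", "ticker", "name", "exchange"]
empty = pd.DataFrame(columns=columns)
one = pd.DataFrame(
    [(789019, "MSFT", "Microsoft Corporation", "Nasdaq")], columns=columns
)

try:
    require_companies(empty, "exchange 'INVALID_EXCHANGE'")
    outcome = "passed"
except ValueError as err:
    outcome = str(err)

print(outcome)
print(len(require_companies(one, "popular companies")))
```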
Performance Considerations
- Caching: get_all_companies() uses LRU cache for performance
- Lazy evaluation: CompanySubset operations are efficient and don't duplicate data unnecessarily
- Memory efficient: Functions work with DataFrame views when possible
- Batch operations: Use combine/intersect functions instead of loops for better performance
# Efficient: Use batch operations
company_sets = [
    get_companies_by_exchanges('NYSE'),
    get_companies_by_exchanges('Nasdaq'),
    get_popular_companies()
]
combined = combine_company_sets(company_sets)
# Less efficient: Multiple individual operations in loops
# combined = pd.DataFrame()
# for exchange in ['NYSE', 'Nasdaq']:
#     exchange_companies = get_companies_by_exchanges(exchange)
#     combined = pd.concat([combined, exchange_companies])  # Avoid this pattern
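The caching point can be illustrated with the standard library's `functools.lru_cache`. This sketch shows the general mechanism only; `load_reference_rows` is a hypothetical stand-in for an expensive load and says nothing about how get_all_companies() is actually written.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def load_reference_rows():
    """Hypothetical stand-in for an expensive reference-data load."""
    # A tuple of tuples is immutable, so callers cannot corrupt the cached value.
    return (
        (320193, "AAPL", "Apple Inc.", "Nasdaq"),
        (789019, "MSFT", "Microsoft Corporation", "Nasdaq"),
    )


first = load_reference_rows()   # cache miss: the function body runs
second = load_reference_rows()  # cache hit: the stored result is returned

info = load_reference_rows.cache_info()
print(info.misses, info.hits)
print(first is second)
```

A cached call returns the same object every time, so when the cached value is mutable (a DataFrame, for example) it is prudent to work on a copy before modifying it.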
Integration with Edgar Tools
Company subsets integrate seamlessly with other Edgar tools:
from edgar import Company
from edgar.reference.company_subsets import get_tech_giants
# Get tech companies and analyze their latest filings
tech_companies = get_tech_giants()
for _, company_info in tech_companies.head(5).iterrows():
    try:
        company = Company(company_info['ticker'])
        latest_filing = company.get_filings(form='10-K').latest()
        print(f"{company_info['ticker']}: Latest 10-K filed {latest_filing.filing_date}")
    except Exception:
        # Catch ordinary errors only; a bare except would also swallow KeyboardInterrupt
        print(f"{company_info['ticker']}: No recent 10-K found")
Best Practices
- Use appropriate sample sizes: Don't sample more companies than you need for analysis
- Set random seeds: Use the random_state parameter for reproducible results
- Handle empty results: Always check if returned DataFrames are empty
- Combine operations efficiently: Use method chaining with CompanySubset for readable code
- Cache results: Store company subsets if you'll reuse them multiple times
- Validate data: Check that your filters return expected results
# Good: Reproducible and efficient
companies = (CompanySubset()
    .from_exchange('NYSE')
    .sample(100, random_state=42)
    .get())
# Store for reuse
cached_companies = companies.copy()
# Good: Check for empty results
if not companies.empty:
    print(f"Analysis ready with {len(companies)} companies")
else:
    print("No companies found matching criteria")
This module provides a comprehensive toolkit for creating company subsets tailored to your specific research, analysis, or educational needs. The combination of simple functions and the powerful fluent interface makes it easy to create both simple selections and complex, multi-criteria company datasets.