Build and maintain a data warehouse with U.S. stock market data by running multiple ETL pipelines.
The app is designed to provide accurate and reliable US financial stock market data by leveraging publicly available APIs and free-tier services instead of resorting to web scraping. This approach ensures compliance with data usage policies while delivery a more sustainable, easy to maintain solution.
The app can run at any frequency. Each execution ensures all tables are updated with the latest data, regardless of when the last update occurred. This means that you can either use an orchestrator to keep the data up to date at a desired frequency, or run it manually, on-demand, to perform ad-hoc analyses.
The warehouse was designed following the medallion architecture, each layer is isolated in a different DB schema:
-
BRONZE_LAYER: This is the ingestion layer. Tables here have the data in the most close to raw state. For example, API responses flattened with some extra metadata. Here, data is only incremented (inserts only, no updates or deletions) ensuring an immutable and easily auditable history.
-
SILVER_LAYER: Transformation layer, prepared for analytical consumption. (Not yet implemented)
-
GOLD_LAYER: Distribution layer - Business-ready data, optimized for reporting and decision-making. (Not yet implemented)
The application uses the Factory Design Pattern to create pipelines. Note that there is one pipeline factory and one step factory - each pipeline is composed of multiple steps, and different pipelines can share the same steps. This design allows us to build isolated and reusable modules, improving code reusability, scalability and maintainability.
This pattern works particularly well for implementing multiple ETL pipelines, as many steps can be abstracted and reused across different workflows. Examples include common tasks like data validation and loading. This approach reduces duplication, simplifies the development of new pipelines, and ensures consistency in shared processes.
The application has 100% unit test coverage, ensuring high reliability and quick identification of issues. There is also integration tests for every pipeline, validating that all components work as expected. Additionally, the app is fully dockerized, ensuring consistency across environments by containerizing all dependencies and services.
Finally, we use GitHub Actions for the CI/CD pipeline, ensuring that unit and integration tests pass successfully, along with linting checks for code consistency, formatting, and type checking. Deployment is not present in the CI/CD pipeline, as the application can be executed locally at any frequency.
To use the application locally simply clone this repository:
git clone https://github.com/FelipeMezzarana/StockAnalyzer.git
Before running the application create a copy of the file secrets.example and rename it to secrets.env. Then, replace the values with you keys. Note that if you pretend to use a local Postgres instance (see more bellow) you can keep the default values for client specific variables.
Variables required for all clients:
POLYGON_KEY=your_polygon_key
FRED_KEY=your_fred_key
CLIENT=<POSTGRES or SQLITE>
For POSTGRES:
POSTGRES_USER=root
POSTGRES_PASSWORD=root
POSTGRES_DB=STOCK_ANALYZER
POSTGRES_HOST=pgdatabase
For SQLITE:
# Path to folder you have/want to create your .db file.
DB_PATH=folder1/folder2/
The recommended approach is to run the app through Docker. In this case, it is not necessary to install dependencies.
If you want to run locally just be sure to install the dependencies in requirements.txt in you environment.
As for the database, although SQLLite is supported, we recommend using Postgres. If you intend to use a local instance you can use the boilerplate solution provided. Use the shell script bellow to set up an local Postgres instance and easily query it:
./setup_local_db.sh
The script will
-
Setup a dockernized Postgres database "STOCK_ANALYZER" with volume mapped to /Documents/postgres/stock_analyzer_volume
-
Setup PgAdmin local server, to easily query "STOCK_ANALYZER". It that can be accessed at http://localhost:8080/ with credentials:
-
Setup a network to run the app through Docker.
- PgAdmin e-mail: admin@admin.com
- PgAdmin password: root
- host: pgdatabase
- Postgres username: root
- Postgres password: root
Remember to shutdown services with:
docker compose down
The application is executed via a CLI command. To run use:
# Run through Docker (recommended)
./run.sh <scope> --sub-scope <sub_scope> --skip <table_name>
# Run without Docker
source secrets.env && python3 -m src.run --sub-scope <scope> --skip <sub_scope>
<scope> and <sub_scope> defines which table(s) will be updated. We have tree main options:
- <scope> = table -> In this case <sub_scope> must be a valid table name, and this single table will be updated.
- <scope> = schema -> In this case <sub_scope> must be a valid schema name (bronze_layer, silver_layer or gold_layer), and all tables under the schema will be updated.
- <scope> = all -> In this case <sub_scope> can be omitted and all tables under all schemas will be updated.
<skip> is an optional argument that may be used for skipping table(s) update. Tables names must be separated by comma.
Usage examples:
# Update all tables
./run.sh all
# Updata bronze layer tables skipping two tables.
./run.sh schema --sub-scope bronze_layer --skip 'index_daily_close, stock_company_details'
Notes:
- Due to APIs rate limits the first bronze_layer update should take a long time to finish (~6h). Also, the lower the update frequency the longer the execution time for subsequent updates. Other factors can also influence runtime. For example, financial data will be updated every four months, increasing this specific runtime considerable. Therefore, for the first run it is recommended to updates tables individually.
- When using the table scope it is important to note that pipelines can have dependencies between themselves, which may affect the target table update if the dependency is not updated. Dependencies can be found below:
- BRONZE_LAYER.STOCK_COMPANY_DETAILS -> Depends on STOCK_DAILY_PRICES to know which ticker get details from;
- BRONZE_LAYER.FINANCIALS_[BALANCE_SHEET, CASH_FLOW_STATEMENT, INCOME_STATEMENT, COMPREHENSIVE_INCOME] -> Depends on SP500_COMPANY_DETAILS to know which ticker get financials data from;
To run unit tests:
./run_unit_tests.sh
To run integration tests:
./run_integration_test.sh
To run linting checks:
./run_linting.sh
STOCK_DAILY_PRICES
Daily open, high, low, and close (OHLC) for the entire stocks/equities markets.Field | Description | Example |
---|---|---|
date | Trading date | 2024-06-21 |
exchange_symbol | The exchange symbol that this item is traded under. | LE |
trading_volume | The trading volume of the symbol in the given time period. | 113682.0 |
volume_weighted_avg | The volume weighted average price. | 13.428 |
open_price | The open price for the symbol in the given time period. | 13.48 |
close_price | The close price for the symbol in the given time period. | 13.35 |
highest_price | The highest price for the symbol in the given time period. | 13.69 |
lowest_price | The lowest price for the symbol in the given time period. | 13.27 |
end_window | End of trading window | 2024-06-21 20:00:00 |
n_transaction | The number of transactions in the aggregate window. | 1308 |
updated_at | Row update date. | 2024-06-22 12:46:25 |
STOCK_COMPANY_DETAILS
Basic information about tickers found in STOCK_DAILY_PRICES.Field | Description | Example |
---|---|---|
active | Whether or not the asset is actively traded. False means the asset has been delisted. | 1 |
cik | The CIK number for this ticker. Find more information here. | 0000353278 |
composite_figi | The composite OpenFIGI number for this ticker. | BBG000BQBKR3 |
currency_name | The name of the currency that this asset is traded with. | usd |
last_updated_utc | The last date that the asset was updated. | 2024-04-26 00:00:00 |
locale | The locale of the asset. | us |
market | The market type of the asset. | stocks |
name | The name of the asset. | Novo-Nordisk A/S |
primary_exchange | The ISO code of the primary listing exchange for this asset. | XNYS |
share_class_figi | The share Class OpenFIGI number for this ticker | BBG001S5TSK0 |
exchange_symbol | The exchange symbol that this item is traded under. | NVO |
type | The type of the asset. | ADRC |
updated_at | Row update date. | 2024-04-29 21:47:11 |
SP500_COMPANY_DETAILS
Basic details about S&P500 companies.Field | Description | Example |
---|---|---|
exchange_symbol | The exchange symbol that this item is traded under. | GOOGL |
name | The name of the currency that this asset is traded with. | Alphabet Inc. (Class A) |
sector | Industry sector by GICS Standard | Communication Services |
sub_industry | Industry sub-sector by GICS Standard | Interactive Media & Services |
hq_location | Headquarters Location | Mountain View, California |
date_added | Date added to S&P500 | 2014-04-03 |
cik | The CIK number for this ticker | 0001652044 |
founded | Industry fundation date | 1998 |
url | Industry website | https://www.nasdaq.com/market-activity/stocks/googl |
updated_at | Row update date. | 2024-05-05 11:57:48 |
INDEX_DAILY_CLOSE
Daily closing value of the most popular stock indexes to track US market performance.Field | Description | Example |
---|---|---|
date | Trading date | 2024-06-21 |
index_code | Index id according to fred.stlouisfed.org | SP500 |
close_value | Closing index value | 5464.62 |
updated_at | Row update date. | 2024-06-22 14:07:20 |
SP500_FINANCIALS_BALANCE_SHEET
This table includes key financial information such as assets, liabilities, equity, and company details, covering both current and non-current items For S&P500 companies.Description | Example | |
---|---|---|
start_date | Start date of the financial report period. | 2024-04-01 00:00:00 |
end_date | End date of the financial report period. | 2024-06-30 00:00:00 |
timeframe | The timeframe of the report. | quarterly |
fiscal_period | Fiscal period covered by the report. | Q2 |
fiscal_year | Fiscal year associated with the financial data. | 2024 |
exchange_symbol_search | Exchange symbol or ticker used in search to identify the company (tickers). | RSG |
tickers | The list of ticker symbols for the company. | ['RSG'] |
company_name | Company name. | REPUBLIC SERVICES, INC. |
updated_at | Date and time when the row data was updated. | 2024-01-15 10:00:00 |
assets | Total assets of the company, combining current and non-current assets. | 31934100000.00 |
current_assets | Assets that can be converted into cash within one year. | 2659700000.00 |
cash | Cash and cash equivalents held by the company. | 50000000.00 |
accounts_receivable | Amounts owed to the company by customers or other entities. | 20000000.00 |
inventory | Value of raw materials, work-in-progress, and finished goods. | 25000000.00 |
prepaid_expenses | Payments made in advance for services or goods. | 1000000.00 |
other_current_assets | Other current assets not classified in standard categories. | 2000000.00 |
noncurrent_assets | Assets expected to provide value beyond one year. | 230000000.00 |
long_term_investments | Long-term investments, including securities or bonds. | 50000000.00 |
fixed_assets | Fixed assets like property, plant, and equipment (PPE). | 100000000.00 |
intangible_assets | Intangible assets, such as patents, trademarks, or goodwill. | 30000000.00 |
noncurrent_prepaid_expense | Prepaid expenses expected to be consumed beyond one year. | 2000000.00 |
other_noncurrent_assets | Non-current assets not classified under standard categories. | 3000000.00 |
liabilities | Total liabilities of the company, including current and non-current. | 200000000.00 |
current_liabilities | Short-term liabilities expected to be settled within one year. | 80000000.00 |
accounts_payable | Outstanding amounts payable to suppliers or vendors. | 35000000.00 |
interest_payable | Interest expenses incurred but not yet paid. | 5000000.00 |
wages | Salaries and wages payable to employees. | 4000000.00 |
other_current_liabilities | Other current liabilities not falling under standard categories. | 36000000.00 |
noncurrent_liabilities | Liabilities due beyond one year. | 120000000.00 |
long_term_debt | Long-term debt obligations like bonds or loans. | 80000000.00 |
other_noncurrent_liabilities | Other non-current liabilities not classified elsewhere. | 20000000.00 |
commitments_and_contingencies | Contingencies or commitments affecting financial position. | 3000000.00 |
redeemable_noncontrolling_interest | Non-controlling interest in a redeemable equity instrument. | 15000000.00 |
redeemable_noncontrolling_interest_common | Redeemable non-controlling interest in common shares. | 50000000.00 |
redeemable_noncontrolling_interest_other | Redeemable non-controlling interest in other securities. | 15000000.00 |
redeemable_noncontrolling_interest_preferred | Redeemable non-controlling interest in preferred stock. | 20000000.00 |
equity | Ownership equity of the company. | 2500000.00 |
equity_attributable_to_noncontrolling_interest | Equity attributable to minority or non-controlling interest. | 4500000.00 |
equity_attributable_to_parent | Equity attributable to the parent company. | 5500000.00 |
temporary_equity | Temporary equity not classified as permanent equity. | 3500000.00 |
temporary_equity_attributable_to_parent | Temporary equity attributable to the parent company. | 2500000.00 |
liabilities_and_equity | Sum of liabilities and equity, ensuring the balance sheet equation. | 320000000.00 |
SP500_FINANCIALS_CASH_FLOW
This table tracks the cash flow from operating, investing, and financing activities, including both continuing and discontinued operations, along with exchange gains/losses For S&P500 companies.Description | Example | |
---|---|---|
start_date | Start date of the financial report period. | 2024-04-01 00:00:00 |
end_date | End date of the financial report period. | 2024-06-30 00:00:00 |
timeframe | The timeframe of the report. | quarterly |
fiscal_period | Fiscal period covered by the report. | Q2 |
fiscal_year | Fiscal year associated with the financial data. | 2024 |
exchange_symbol_search | Exchange symbol or ticker used in search to identify the company (tickers). | RSG |
tickers | The list of ticker symbols for the company. | ['RSG'] |
company_name | Full legal name of the company. | REPUBLIC SERVICES, INC. |
updated_at | Date and time when the financial data was last updated. | 2024-01-15 10:00:00 |
net_cash_flow | Net cash flow generated or used by the company during the period. | 318800000 |
net_cash_flow_continuing | Net cash flow from continuing operations during the period. | 318800000 |
net_cash_flow_discontinued | Net cash flow from discontinued operations during the period. | 200000 |
net_cash_flow_from_operating_activities | Net cash flow from operating activities during the period. | 400000 |
net_cash_flow_from_operating_activities_continuing | Net cash flow from continuing operating activities during the period. | 250000 |
net_cash_flow_from_operating_activities_discontinued | Net cash flow from discontinued operating activities during the period. | 150000 |
net_cash_flow_from_investing_activities | Net cash flow from investing activities during the period. | 100000 |
net_cash_flow_from_investing_activities_continuing | Net cash flow from continuing investing activities during the period. | 50000 |
net_cash_flow_from_investing_activities_discontinued | Net cash flow from discontinued investing activities during the period. | 150000 |
net_cash_flow_from_financing_activities | Net cash flow from financing activities during the period. | 100000 |
net_cash_flow_from_financing_activities_continuing | Net cash flow from continuing financing activities during the period. | 50000 |
net_cash_flow_from_financing_activities_discontinued | Net cash flow from discontinued financing activities during the period. | 10000 |
exchange_gains_losses | Gains or losses due to changes in exchange rates. | 40000 |
SP500_FINANCIALS_INCOME
This table contains detailed financial data related to a company's income and expenses during a specific period. It includes key metrics such as revenues, cost of goods sold, operating expenses, and net income, providing insights into the company’s profitability and financial performance. For S&P500 companies.Description | Example | |
---|---|---|
start_date | Start date of the financial report period. | 2024-04-01 00:00:00 |
end_date | End date of the financial report period. | 2024-06-30 00:00:00 |
timeframe | The timeframe of the report. | quarterly |
fiscal_period | Fiscal period covered by the report. | Q2 |
fiscal_year | Fiscal year associated with the financial data. | 2024 |
exchange_symbol_search | Exchange symbol or ticker used in search to identify the company (tickers). | RSG |
tickers | The list of ticker symbols for the company. | ['RSG'] |
company_name | Full legal name of the company. | REPUBLIC SERVICES, INC. |
updated_at | Date and time when the financial data was last updated. | 2024-01-15 10:00:00 |
revenues | The total revenues earned by the company. | 17912600000 |
benefits_costs_expenses | The benefits, costs, and expenses of the company. | 15570500000 |
cost_of_revenue | The total cost of revenue, including both goods and services. | 9214300000 |
cost_of_revenue_goods | The cost associated with revenue from goods sold. | 50000 |
cost_of_revenue_services | The cost associated with revenue from services provided. | 150000 |
costs_and_expenses | The total costs and expenses of the company. | 300000 |
gain_loss_on_sale_properties_net_tax | Gain or loss on sale of properties, net of tax. | 400000 |
gross_profit | The profit made after deducting cost of goods and services sold. | 150000 |
nonoperating_income_loss | Income or loss not related to core operations. | 75000 |
operating_expenses | Expenses related to the core business operations. | 60000 |
selling_general_and_administrative_expenses | General and administrative expenses, including selling and marketing. | 20000 |
depreciation_and_amortization | The depreciation and amortization expenses. | 10000 |
research_and_development | Expenditures on research and development activities. | 500000 |
other_operating_expenses | Other operating expenses outside core activities. | 100000 |
operating_income_loss | Income or loss resulting from core operating activities. | 200000 |
other_operating_income_expenses | Other income or expenses not related to core activities. | 300000 |
income_loss_before_equity_method_investments | Income or loss before equity method investments. | 50000 |
income_loss_from_continuing_operations_after_tax | Income or loss from continuing operations after tax. | 100000 |
income_loss_from_continuing_operations_before_tax | Income or loss from continuing operations before tax. | 50000 |
income_loss_from_discontinued_operations_net_of_tax | Income or loss from discontinued operations after tax. | 20000 |
income_loss_from_discontinued_operations_net_of_tax_adjustment_to_prior_year_gain_loss_on_disposal | Adjustment to prior year gain/loss on disposal of discontinued operations. | 150000 |
income_loss_from_discontinued_operations_net_of_tax_during_phase_out | Income or loss from discontinued operations during phase-out. | 30000 |
income_loss_from_discontinued_operations_net_of_tax_gain_loss_on_disposal | Gain or loss on disposal of discontinued operations. | 200000 |
income_loss_from_discontinued_operations_net_of_tax_provision_for_gain_loss_on_disposal | Provision for gain/loss on disposal of discontinued operations. | 10000 |
income_loss_from_equity_method_investments | Income or loss from equity method investments. | 5000 |
income_tax_expense_benefit | Income tax expense or benefit. | 15000 |
income_tax_expense_benefit_current | Current income tax expense or benefit. | 25000 |
income_tax_expense_benefit_deferred | Deferred income tax expense or benefit. | 40000 |
interest_and_debt_expense | Interest and debt-related expenses. | 100000 |
interest_and_dividend_income_operating | Income from interest and dividends related to operations. | 50000 |
interest_expense_operating | Interest expenses related to operations. | 5000 |
interest_income_expense_after_provision_for_losses | Interest income/expense after provision for losses. | 10000 |
interest_income_expense_operating_net | Net operating interest income/expenses. | 500000 |
noninterest_expense | Noninterest-related expenses. | 150000 |
noninterest_income | Noninterest-related income. | 50000 |
net_income_loss | The net income or loss for the company. | 5000 |
net_income_loss_attributable_to_noncontrolling_interest | Net income or loss attributable to noncontrolling interest. | 10000 |
net_income_loss_attributable_to_nonredeemable_noncontrolling_interest | Net income or loss attributable to nonredeemable noncontrolling interest. | 500000 |
net_income_loss_attributable_to_parent | Net income or loss attributable to the parent company. | 150000 |
net_income_loss_attributable_to_redeemable_noncontrolling_interest | Net income or loss attributable to redeemable noncontrolling interest. | 50000 |
net_income_loss_available_to_common_stockholders_basic | Net income available to common stockholders, basic. | 5000 |
participating_securities_distributed_and_undistributed_earnings_loss_basic | Distributed and undistributed earnings/loss from participating securities. | 10000 |
preferred_stock_dividends_and_other_adjustments | Dividends and adjustments for preferred stock. | 500000 |
provision_for_loan_lease_and_other_losses | Provisions for losses from loans, leases, and other obligations. | 150000 |
undistributed_earnings_loss_allocated_to_participating_securities_basic | Undistributed earnings/loss allocated to participating securities. | 50000 |
basic_earnings_per_share | Basic earnings per share (EPS). | 300000 |
diluted_earnings_per_share | Diluted earnings per share (EPS). | 100000 |
basic_average_shares | The basic average number of shares outstanding. | 3000 |
diluted_average_shares | The diluted average number of shares outstanding. | 300000 |
common_stock_dividends | Dividends paid on common stock. | 100000 |
SP500_FINANCIALS_COMPREHENSIVE_INCOME
This table contains a company's total financial performance during a specific period, combining net income (from regular operations) with other comprehensive income items, which include unrealized gains or losses. For S&P500 companies.Description | Example | |
---|---|---|
start_date | Start date of the financial report period. | 2024-04-01 00:00:00 |
end_date | End date of the financial report period. | 2024-06-30 00:00:00 |
timeframe | The timeframe of the report. | quarterly |
fiscal_period | Fiscal period covered by the report. | Q2 |
fiscal_year | Fiscal year associated with the financial data. | 2024 |
exchange_symbol_search | Exchange symbol or ticker used in search to identify the company (tickers). | RSG |
tickers | Ticker symbols for the company. | ['RSG'] |
company_name | Full legal name of the company. | REPUBLIC SERVICES, INC. |
updated_at | Date and time when the financial data was last updated. | 2024-01-15 10:00:00 |
comprehensive_income_loss | Net income plus unrealized profits (or losses) in the same period. | 460500000 |
comprehensive_income_loss_attributable_to_noncontrolling_interest | The comprehensive income or loss attributable to noncontrolling interest. | -100000 |
comprehensive_income_loss_attributable_to_parent | The comprehensive income or loss attributable to the parent company. | 460600000 |
other_comprehensive_income_loss | Other comprehensive income or loss. | 6800000 |
other_comprehensive_income_loss_attributable_to_noncontrolling_interest | Other comprehensive income or loss attributable to noncontrolling interest. | 2800000 |
other_comprehensive_income_loss_attributable_to_parent | Other comprehensive income or loss attributable to the parent company. | 3800000 |
Showing Pk and Fk fields only.
Provides financial market data for stocks, options, crypto, and forex. We use endpoints included in the free tier (limited to 5 Calls/Minute).
https://fred.stlouisfed.org/docs/api/fred/
The FRED API provides access to economic data, including historical data on economic indicators, interest rates, and more. The API is hosted by the Economic Research Division of the Federal Reserve Bank of St. Louis, and is entirely free.
https://www.mediawiki.org/wiki/API:Main_page#Quick_Start
MediaWiki is the official Wikipedia API. Currently we use it only to retrieve an table with details about S&P 500 companies.