Skip to content

Create and maintain a data warehouse in PostgreSQL or SQLite with multiple tables related to US financial stock market. Only free and reliable data sources are used, without web scraping.

Notifications You must be signed in to change notification settings

FelipeMezzarana/StockAnalyzer

Repository files navigation

StockAnalyzer

Build and maintain a data warehouse with U.S. stock market data by running multiple ETL pipelines.

Table of Contents

  1. Introduction
  2. Usage
  3. Data Catalog
  4. Relationship Diagrams
  5. Sources

Introduction

The app is designed to provide accurate and reliable US financial stock market data by leveraging publicly available APIs and free-tier services instead of resorting to web scraping. This approach ensures compliance with data usage policies while delivery a more sustainable, easy to maintain solution.

The app can run at any frequency. Each execution ensures all tables are updated with the latest data, regardless of when the last update occurred. This means that you can either use an orchestrator to keep the data up to date at a desired frequency, or run it manually, on-demand, to perform ad-hoc analyses.

The Warehouse

The warehouse was designed following the medallion architecture, each layer is isolated in a different DB schema:

  • BRONZE_LAYER: This is the ingestion layer. Tables here have the data in the most close to raw state. For example, API responses flattened with some extra metadata. Here, data is only incremented (inserts only, no updates or deletions) ensuring an immutable and easily auditable history.

  • SILVER_LAYER: Transformation layer, prepared for analytical consumption. (Not yet implemented)

  • GOLD_LAYER: Distribution layer - Business-ready data, optimized for reporting and decision-making. (Not yet implemented)

The Application

The application uses the Factory Design Pattern to create pipelines. Note that there is one pipeline factory and one step factory - each pipeline is composed of multiple steps, and different pipelines can share the same steps. This design allows us to build isolated and reusable modules, improving code reusability, scalability and maintainability.

This pattern works particularly well for implementing multiple ETL pipelines, as many steps can be abstracted and reused across different workflows. Examples include common tasks like data validation and loading. This approach reduces duplication, simplifies the development of new pipelines, and ensures consistency in shared processes.

The application has 100% unit test coverage, ensuring high reliability and quick identification of issues. There is also integration tests for every pipeline, validating that all components work as expected. Additionally, the app is fully dockerized, ensuring consistency across environments by containerizing all dependencies and services.

Finally, we use GitHub Actions for the CI/CD pipeline, ensuring that unit and integration tests pass successfully, along with linting checks for code consistency, formatting, and type checking. Deployment is not present in the CI/CD pipeline, as the application can be executed locally at any frequency.

Usage

Setup:

To use the application locally simply clone this repository:

git clone https://github.com/FelipeMezzarana/StockAnalyzer.git

Requirements:

Secrets

Before running the application create a copy of the file secrets.example and rename it to secrets.env. Then, replace the values with you keys. Note that if you pretend to use a local Postgres instance (see more bellow) you can keep the default values for client specific variables.

Variables required for all clients:

POLYGON_KEY=your_polygon_key
FRED_KEY=your_fred_key
CLIENT=<POSTGRES or SQLITE>

For POSTGRES:

POSTGRES_USER=root
POSTGRES_PASSWORD=root
POSTGRES_DB=STOCK_ANALYZER
POSTGRES_HOST=pgdatabase

For SQLITE:

# Path to folder you have/want to create your .db file.
DB_PATH=folder1/folder2/

Dependencies

The recommended approach is to run the app through Docker. In this case, it is not necessary to install dependencies.

If you want to run locally just be sure to install the dependencies in requirements.txt in you environment.

Set up a local Postgres DB (Optional)

As for the database, although SQLLite is supported, we recommend using Postgres. If you intend to use a local instance you can use the boilerplate solution provided. Use the shell script bellow to set up an local Postgres instance and easily query it:

./setup_local_db.sh 

The script will

  • Setup a dockernized Postgres database "STOCK_ANALYZER" with volume mapped to /Documents/postgres/stock_analyzer_volume

  • Setup PgAdmin local server, to easily query "STOCK_ANALYZER". It that can be accessed at http://localhost:8080/ with credentials:

  • Setup a network to run the app through Docker.

    • PgAdmin e-mail: admin@admin.com
    • PgAdmin password: root
    • host: pgdatabase
    • Postgres username: root
    • Postgres password: root

Remember to shutdown services with:

docker compose down

Running the application:

The application is executed via a CLI command. To run use:

# Run through Docker (recommended)
 ./run.sh <scope> --sub-scope <sub_scope> --skip <table_name>

# Run without Docker
source secrets.env && python3 -m src.run --sub-scope <scope> --skip <sub_scope>

<scope> and <sub_scope> defines which table(s) will be updated. We have tree main options:

  • <scope> = table -> In this case <sub_scope> must be a valid table name, and this single table will be updated.
  • <scope> = schema -> In this case <sub_scope> must be a valid schema name (bronze_layer, silver_layer or gold_layer), and all tables under the schema will be updated.
  • <scope> = all -> In this case <sub_scope> can be omitted and all tables under all schemas will be updated.

<skip> is an optional argument that may be used for skipping table(s) update. Tables names must be separated by comma.

Usage examples:

# Update all tables
./run.sh all 
# Updata bronze layer tables skipping two tables.
./run.sh schema --sub-scope bronze_layer --skip 'index_daily_close, stock_company_details'

Notes:

  • Due to APIs rate limits the first bronze_layer update should take a long time to finish (~6h). Also, the lower the update frequency the longer the execution time for subsequent updates. Other factors can also influence runtime. For example, financial data will be updated every four months, increasing this specific runtime considerable. Therefore, for the first run it is recommended to updates tables individually.
  • When using the table scope it is important to note that pipelines can have dependencies between themselves, which may affect the target table update if the dependency is not updated. Dependencies can be found below:
    • BRONZE_LAYER.STOCK_COMPANY_DETAILS -> Depends on STOCK_DAILY_PRICES to know which ticker get details from;
    • BRONZE_LAYER.FINANCIALS_[BALANCE_SHEET, CASH_FLOW_STATEMENT, INCOME_STATEMENT, COMPREHENSIVE_INCOME] -> Depends on SP500_COMPANY_DETAILS to know which ticker get financials data from;

To run unit tests:

./run_unit_tests.sh

To run integration tests:

./run_integration_test.sh 

To run linting checks:

./run_linting.sh 

Data Catalog

BRONZE LAYER

STOCK_DAILY_PRICES Daily open, high, low, and close (OHLC) for the entire stocks/equities markets.
Field Description Example
date Trading date 2024-06-21
exchange_symbol The exchange symbol that this item is traded under. LE
trading_volume The trading volume of the symbol in the given time period. 113682.0
volume_weighted_avg The volume weighted average price. 13.428
open_price The open price for the symbol in the given time period. 13.48
close_price The close price for the symbol in the given time period. 13.35
highest_price The highest price for the symbol in the given time period. 13.69
lowest_price The lowest price for the symbol in the given time period. 13.27
end_window End of trading window 2024-06-21 20:00:00
n_transaction The number of transactions in the aggregate window. 1308
updated_at Row update date. 2024-06-22 12:46:25
STOCK_COMPANY_DETAILS Basic information about tickers found in STOCK_DAILY_PRICES.
Field Description Example
active Whether or not the asset is actively traded. False means the asset has been delisted. 1
cik The CIK number for this ticker. Find more information here. 0000353278
composite_figi The composite OpenFIGI number for this ticker. BBG000BQBKR3
currency_name The name of the currency that this asset is traded with. usd
last_updated_utc The last date that the asset was updated. 2024-04-26 00:00:00
locale The locale of the asset. us
market The market type of the asset. stocks
name The name of the asset. Novo-Nordisk A/S
primary_exchange The ISO code of the primary listing exchange for this asset. XNYS
share_class_figi The share Class OpenFIGI number for this ticker BBG001S5TSK0
exchange_symbol The exchange symbol that this item is traded under. NVO
type The type of the asset. ADRC
updated_at Row update date. 2024-04-29 21:47:11
SP500_COMPANY_DETAILS Basic details about S&P500 companies.
Field Description Example
exchange_symbol The exchange symbol that this item is traded under. GOOGL
name The name of the currency that this asset is traded with. Alphabet Inc. (Class A)
sector Industry sector by GICS Standard Communication Services
sub_industry Industry sub-sector by GICS Standard Interactive Media & Services
hq_location Headquarters Location Mountain View, California
date_added Date added to S&P500 2014-04-03
cik The CIK number for this ticker 0001652044
founded Industry fundation date 1998
url Industry website https://www.nasdaq.com/market-activity/stocks/googl
updated_at Row update date. 2024-05-05 11:57:48
INDEX_DAILY_CLOSE Daily closing value of the most popular stock indexes to track US market performance.
Field Description Example
date Trading date 2024-06-21
index_code Index id according to fred.stlouisfed.org SP500
close_value Closing index value 5464.62
updated_at Row update date. 2024-06-22 14:07:20

SP500_FINANCIALS_BALANCE_SHEET This table includes key financial information such as assets, liabilities, equity, and company details, covering both current and non-current items For S&P500 companies.
Description Example
start_date Start date of the financial report period. 2024-04-01 00:00:00
end_date End date of the financial report period. 2024-06-30 00:00:00
timeframe The timeframe of the report. quarterly
fiscal_period Fiscal period covered by the report. Q2
fiscal_year Fiscal year associated with the financial data. 2024
exchange_symbol_search Exchange symbol or ticker used in search to identify the company (tickers). RSG
tickers The list of ticker symbols for the company. ['RSG']
company_name Company name. REPUBLIC SERVICES, INC.
updated_at Date and time when the row data was updated. 2024-01-15 10:00:00
assets Total assets of the company, combining current and non-current assets. 31934100000.00
current_assets Assets that can be converted into cash within one year. 2659700000.00
cash Cash and cash equivalents held by the company. 50000000.00
accounts_receivable Amounts owed to the company by customers or other entities. 20000000.00
inventory Value of raw materials, work-in-progress, and finished goods. 25000000.00
prepaid_expenses Payments made in advance for services or goods. 1000000.00
other_current_assets Other current assets not classified in standard categories. 2000000.00
noncurrent_assets Assets expected to provide value beyond one year. 230000000.00
long_term_investments Long-term investments, including securities or bonds. 50000000.00
fixed_assets Fixed assets like property, plant, and equipment (PPE). 100000000.00
intangible_assets Intangible assets, such as patents, trademarks, or goodwill. 30000000.00
noncurrent_prepaid_expense Prepaid expenses expected to be consumed beyond one year. 2000000.00
other_noncurrent_assets Non-current assets not classified under standard categories. 3000000.00
liabilities Total liabilities of the company, including current and non-current. 200000000.00
current_liabilities Short-term liabilities expected to be settled within one year. 80000000.00
accounts_payable Outstanding amounts payable to suppliers or vendors. 35000000.00
interest_payable Interest expenses incurred but not yet paid. 5000000.00
wages Salaries and wages payable to employees. 4000000.00
other_current_liabilities Other current liabilities not falling under standard categories. 36000000.00
noncurrent_liabilities Liabilities due beyond one year. 120000000.00
long_term_debt Long-term debt obligations like bonds or loans. 80000000.00
other_noncurrent_liabilities Other non-current liabilities not classified elsewhere. 20000000.00
commitments_and_contingencies Contingencies or commitments affecting financial position. 3000000.00
redeemable_noncontrolling_interest Non-controlling interest in a redeemable equity instrument. 15000000.00
redeemable_noncontrolling_interest_common Redeemable non-controlling interest in common shares. 50000000.00
redeemable_noncontrolling_interest_other Redeemable non-controlling interest in other securities. 15000000.00
redeemable_noncontrolling_interest_preferred Redeemable non-controlling interest in preferred stock. 20000000.00
equity Ownership equity of the company. 2500000.00
equity_attributable_to_noncontrolling_interest Equity attributable to minority or non-controlling interest. 4500000.00
equity_attributable_to_parent Equity attributable to the parent company. 5500000.00
temporary_equity Temporary equity not classified as permanent equity. 3500000.00
temporary_equity_attributable_to_parent Temporary equity attributable to the parent company. 2500000.00
liabilities_and_equity Sum of liabilities and equity, ensuring the balance sheet equation. 320000000.00

SP500_FINANCIALS_CASH_FLOW This table tracks the cash flow from operating, investing, and financing activities, including both continuing and discontinued operations, along with exchange gains/losses For S&P500 companies.
Description Example
start_date Start date of the financial report period. 2024-04-01 00:00:00
end_date End date of the financial report period. 2024-06-30 00:00:00
timeframe The timeframe of the report. quarterly
fiscal_period Fiscal period covered by the report. Q2
fiscal_year Fiscal year associated with the financial data. 2024
exchange_symbol_search Exchange symbol or ticker used in search to identify the company (tickers). RSG
tickers The list of ticker symbols for the company. ['RSG']
company_name Full legal name of the company. REPUBLIC SERVICES, INC.
updated_at Date and time when the financial data was last updated. 2024-01-15 10:00:00
net_cash_flow Net cash flow generated or used by the company during the period. 318800000
net_cash_flow_continuing Net cash flow from continuing operations during the period. 318800000
net_cash_flow_discontinued Net cash flow from discontinued operations during the period. 200000
net_cash_flow_from_operating_activities Net cash flow from operating activities during the period. 400000
net_cash_flow_from_operating_activities_continuing Net cash flow from continuing operating activities during the period. 250000
net_cash_flow_from_operating_activities_discontinued Net cash flow from discontinued operating activities during the period. 150000
net_cash_flow_from_investing_activities Net cash flow from investing activities during the period. 100000
net_cash_flow_from_investing_activities_continuing Net cash flow from continuing investing activities during the period. 50000
net_cash_flow_from_investing_activities_discontinued Net cash flow from discontinued investing activities during the period. 150000
net_cash_flow_from_financing_activities Net cash flow from financing activities during the period. 100000
net_cash_flow_from_financing_activities_continuing Net cash flow from continuing financing activities during the period. 50000
net_cash_flow_from_financing_activities_discontinued Net cash flow from discontinued financing activities during the period. 10000
exchange_gains_losses Gains or losses due to changes in exchange rates. 40000

SP500_FINANCIALS_INCOME This table contains detailed financial data related to a company's income and expenses during a specific period. It includes key metrics such as revenues, cost of goods sold, operating expenses, and net income, providing insights into the company’s profitability and financial performance. For S&P500 companies.
Description Example
start_date Start date of the financial report period. 2024-04-01 00:00:00
end_date End date of the financial report period. 2024-06-30 00:00:00
timeframe The timeframe of the report. quarterly
fiscal_period Fiscal period covered by the report. Q2
fiscal_year Fiscal year associated with the financial data. 2024
exchange_symbol_search Exchange symbol or ticker used in search to identify the company (tickers). RSG
tickers The list of ticker symbols for the company. ['RSG']
company_name Full legal name of the company. REPUBLIC SERVICES, INC.
updated_at Date and time when the financial data was last updated. 2024-01-15 10:00:00
revenues The total revenues earned by the company. 17912600000
benefits_costs_expenses The benefits, costs, and expenses of the company. 15570500000
cost_of_revenue The total cost of revenue, including both goods and services. 9214300000
cost_of_revenue_goods The cost associated with revenue from goods sold. 50000
cost_of_revenue_services The cost associated with revenue from services provided. 150000
costs_and_expenses The total costs and expenses of the company. 300000
gain_loss_on_sale_properties_net_tax Gain or loss on sale of properties, net of tax. 400000
gross_profit The profit made after deducting cost of goods and services sold. 150000
nonoperating_income_loss Income or loss not related to core operations. 75000
operating_expenses Expenses related to the core business operations. 60000
selling_general_and_administrative_expenses General and administrative expenses, including selling and marketing. 20000
depreciation_and_amortization The depreciation and amortization expenses. 10000
research_and_development Expenditures on research and development activities. 500000
other_operating_expenses Other operating expenses outside core activities. 100000
operating_income_loss Income or loss resulting from core operating activities. 200000
other_operating_income_expenses Other income or expenses not related to core activities. 300000
income_loss_before_equity_method_investments Income or loss before equity method investments. 50000
income_loss_from_continuing_operations_after_tax Income or loss from continuing operations after tax. 100000
income_loss_from_continuing_operations_before_tax Income or loss from continuing operations before tax. 50000
income_loss_from_discontinued_operations_net_of_tax Income or loss from discontinued operations after tax. 20000
income_loss_from_discontinued_operations_net_of_tax_adjustment_to_prior_year_gain_loss_on_disposal Adjustment to prior year gain/loss on disposal of discontinued operations. 150000
income_loss_from_discontinued_operations_net_of_tax_during_phase_out Income or loss from discontinued operations during phase-out. 30000
income_loss_from_discontinued_operations_net_of_tax_gain_loss_on_disposal Gain or loss on disposal of discontinued operations. 200000
income_loss_from_discontinued_operations_net_of_tax_provision_for_gain_loss_on_disposal Provision for gain/loss on disposal of discontinued operations. 10000
income_loss_from_equity_method_investments Income or loss from equity method investments. 5000
income_tax_expense_benefit Income tax expense or benefit. 15000
income_tax_expense_benefit_current Current income tax expense or benefit. 25000
income_tax_expense_benefit_deferred Deferred income tax expense or benefit. 40000
interest_and_debt_expense Interest and debt-related expenses. 100000
interest_and_dividend_income_operating Income from interest and dividends related to operations. 50000
interest_expense_operating Interest expenses related to operations. 5000
interest_income_expense_after_provision_for_losses Interest income/expense after provision for losses. 10000
interest_income_expense_operating_net Net operating interest income/expenses. 500000
noninterest_expense Noninterest-related expenses. 150000
noninterest_income Noninterest-related income. 50000
net_income_loss The net income or loss for the company. 5000
net_income_loss_attributable_to_noncontrolling_interest Net income or loss attributable to noncontrolling interest. 10000
net_income_loss_attributable_to_nonredeemable_noncontrolling_interest Net income or loss attributable to nonredeemable noncontrolling interest. 500000
net_income_loss_attributable_to_parent Net income or loss attributable to the parent company. 150000
net_income_loss_attributable_to_redeemable_noncontrolling_interest Net income or loss attributable to redeemable noncontrolling interest. 50000
net_income_loss_available_to_common_stockholders_basic Net income available to common stockholders, basic. 5000
participating_securities_distributed_and_undistributed_earnings_loss_basic Distributed and undistributed earnings/loss from participating securities. 10000
preferred_stock_dividends_and_other_adjustments Dividends and adjustments for preferred stock. 500000
provision_for_loan_lease_and_other_losses Provisions for losses from loans, leases, and other obligations. 150000
undistributed_earnings_loss_allocated_to_participating_securities_basic Undistributed earnings/loss allocated to participating securities. 50000
basic_earnings_per_share Basic earnings per share (EPS). 300000
diluted_earnings_per_share Diluted earnings per share (EPS). 100000
basic_average_shares The basic average number of shares outstanding. 3000
diluted_average_shares The diluted average number of shares outstanding. 300000
common_stock_dividends Dividends paid on common stock. 100000
SP500_FINANCIALS_COMPREHENSIVE_INCOME This table contains a company's total financial performance during a specific period, combining net income (from regular operations) with other comprehensive income items, which include unrealized gains or losses. For S&P500 companies.
Description Example
start_date Start date of the financial report period. 2024-04-01 00:00:00
end_date End date of the financial report period. 2024-06-30 00:00:00
timeframe The timeframe of the report. quarterly
fiscal_period Fiscal period covered by the report. Q2
fiscal_year Fiscal year associated with the financial data. 2024
exchange_symbol_search Exchange symbol or ticker used in search to identify the company (tickers). RSG
tickers Ticker symbols for the company. ['RSG']
company_name Full legal name of the company. REPUBLIC SERVICES, INC.
updated_at Date and time when the financial data was last updated. 2024-01-15 10:00:00
comprehensive_income_loss Net income plus unrealized profits (or losses) in the same period. 460500000
comprehensive_income_loss_attributable_to_noncontrolling_interest The comprehensive income or loss attributable to noncontrolling interest. -100000
comprehensive_income_loss_attributable_to_parent The comprehensive income or loss attributable to the parent company. 460600000
other_comprehensive_income_loss Other comprehensive income or loss. 6800000
other_comprehensive_income_loss_attributable_to_noncontrolling_interest Other comprehensive income or loss attributable to noncontrolling interest. 2800000
other_comprehensive_income_loss_attributable_to_parent Other comprehensive income or loss attributable to the parent company. 3800000

Relationship Diagrams

Showing Pk and Fk fields only.

BRONZE LAYER

png

Sources

Polygon API

https://polygon.io/

Provides financial market data for stocks, options, crypto, and forex. We use endpoints included in the free tier (limited to 5 Calls/Minute).

FRED API

https://fred.stlouisfed.org/docs/api/fred/

The FRED API provides access to economic data, including historical data on economic indicators, interest rates, and more. The API is hosted by the Economic Research Division of the Federal Reserve Bank of St. Louis, and is entirely free.

MediaWiki

https://www.mediawiki.org/wiki/API:Main_page#Quick_Start

MediaWiki is the official Wikipedia API. Currently we use it only to retrieve an table with details about S&P 500 companies.

About

Create and maintain a data warehouse in PostgreSQL or SQLite with multiple tables related to US financial stock market. Only free and reliable data sources are used, without web scraping.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published