DataSQRL is a data streaming framework for building incremental and real-time data processing applications. Ingest data from various sources; integrate, transform, and store it; and serve the results as data APIs, LLM tooling, or Iceberg views - all with the simplicity of SQL.
Data engineers use DataSQRL to quickly build production-ready data pipelines that:
- Create real-time data APIs,
- Expose enriched data for LLM tooling,
- Materialize data into Iceberg tables and catalog views for querying in Snowflake, DuckDB, AWS Athena, etc.
You define the data processing in SQL and DataSQRL compiles the deployment artifacts for Apache Kafka, Flink, Postgres, Iceberg, GraphQL API, and LLM tooling. It generates the glue code, schemas, and mappings to automatically integrate and configure these components into a coherent data pipeline that is highly available, consistent, scalable, observable, and fast. DataSQRL supports quick local iteration, end-to-end pipeline testing, and deployment to Kubernetes or cloud-managed services.
- 🔗 Eliminate glue code: DataSQRL generates connectors, schemas, data mappings, SQL dialect translation, and configurations. Do more with less.
- 🚀 Develop faster: Local development, CI/CD support, logging framework, reusable components, and composable architecture for quick iteration cycles.
- 🛡️ Reliable Data: Consistent data processing with exactly-once or at-least-once guarantees, a testing framework, and data lineage.
- 🔒 Production-grade: Robust, highly available, scalable, observable, and executed by trusted OSS technologies (Kafka, Flink, Postgres, DuckDB).
- 🤖 AI-native: Support for vector embeddings, LLM invocation, ML model inference, and LLM tooling interfaces.
To learn more about DataSQRL, check out the documentation.
This example builds a data pipeline that captures user token consumption via API, exposes consumption alerts via subscription, and aggregates the data for query access.
```sql
/*+no_query */
CREATE TABLE UserTokens (
  userid BIGINT NOT NULL,
  tokens BIGINT NOT NULL,
  request_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp'
);

/*+query_by_all(userid) */
TotalUserTokens := SELECT userid, sum(tokens) as total_tokens,
                          count(tokens) as total_requests
                   FROM UserTokens GROUP BY userid;

UsageAlert := SUBSCRIBE SELECT * FROM UserTokens WHERE tokens > 100000;
```
Create a file `usertokens.sqrl` with the content above and run it with:

```bash
docker run -it --rm -p 8888:8888 -p 8081:8081 -p 9092:9092 -v $PWD:/build datasqrl/cmd:latest run usertokens.sqrl
```

(Use `${PWD}` in PowerShell on Windows.)
The pipeline is exposed through a GraphQL API that you can access at http://localhost:8888/graphiql/ in your browser.
- `UserTokens` is exposed as a mutation for adding data.
- `TotalUserTokens` is exposed as a query for retrieving the aggregated data.
- `UsageAlert` is exposed as a subscription for real-time alerts.
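For instance, you can exercise the API from GraphiQL with operations along these lines. This is an illustrative sketch only: the exact operation, argument, and field names depend on the GraphQL schema DataSQRL generates for the script, which you can inspect in GraphiQL.

```graphql
# Illustrative only - check the generated schema for the exact names.
mutation AddTokens {
  UserTokens(event: { userid: 1, tokens: 150000 }) {
    request_time
  }
}

query GetTotals {
  TotalUserTokens(userid: 1) {
    total_tokens
    total_requests
  }
}

subscription WatchAlerts {
  UsageAlert {
    userid
    tokens
  }
}
```

Since the mutation above records more than 100000 tokens, it would also trigger an event on the `UsageAlert` subscription.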
Once you are done, terminate the pipeline with `CTRL-C`.
To build the deployment assets for the data pipeline, execute:

```bash
docker run --rm -v $PWD:/build datasqrl/cmd:latest compile usertokens.sqrl
```
The `build/deploy` directory contains the Flink compiled plan, Kafka topic definitions, PostgreSQL schema and view definitions, server queries, and the GraphQL data model.
Read the full Getting Started tutorial or check out the DataSQRL Examples repository for more examples, including Iceberg views, chatbots, and data APIs.
As data engineers, we got frustrated by all the data plumbing we had to implement, the lack of developer tooling, and the limited automation.
Web developers have frameworks to eliminate the busywork. We are building the DataSQRL framework to do the same for data engineers.
DataSQRL compiles the SQRL scripts and data source/sink definitions into a data processing DAG (Directed Acyclic Graph) according to the configuration. The cost-based optimizer cuts the DAG into segments executed by different engines (e.g. Flink, Kafka, Postgres, Vert.x), generating the necessary physical plans, schemas, and connectors for a fully integrated and streamlined data pipeline. In the example above, for instance, the mutation lands in a Kafka topic, Flink incrementally maintains the `TotalUserTokens` aggregate, Postgres stores the result for query access, and the API server pushes `UsageAlert` events to subscribers. These deployment assets are then executed in Docker, Kubernetes, or by a managed cloud service.
DataSQRL gives you full visibility and control over the generated data pipeline and uses proven open-source technologies to execute the generated deployment assets.
Learn more about DataSQRL in the documentation.
We aim to enable data engineers to build data pipelines quickly and eliminate the data plumbing busy work. Your feedback is invaluable in achieving this goal. Let us know what works and what doesn't by filing GitHub issues or in the DataSQRL Slack community.
We welcome code contributions. For more details, check out CONTRIBUTING.md.