RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
An orchestration platform for the development, production, and observation of data assets.
Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift) in real-time.
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
The ultimate open-source RAG framework
Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
Main repo including core data model, data marts, reference data, terminology, and the clinical concept library
Cloud native open-source end-to-end data / AI / ML platform
🧙 Build, run, and manage data pipelines for integrating and transforming data.
Apache DolphinScheduler is a modern data orchestration platform for agile creation of high-performance, low-code workflows.
Move your data with ease.
The first open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Bruin is a data pipeline tool designed to be easy to use. It allows building data pipelines using SQL and Python, and has built-in data quality checks.
Lean and mean distributed stream processing system written in Rust and WebAssembly.
Best practices for data workflows, integrations with the Modern Data Stack (MDS), Infrastructure as Code (IaC), Cloud Provider Services
Low-code ETL for structured and unstructured data. Generates Python code you can deploy anywhere.
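The projects above all orchestrate some variant of the extract-transform-load pattern. A minimal, library-free sketch of that pattern in plain Python (all names are illustrative and not tied to any specific project listed here):

```python
# Minimal sketch of the extract-transform-load (ETL) pattern that the
# pipeline tools listed above orchestrate at scale. Names are illustrative.

def extract():
    # Stand-in for reading raw records from a source system (API, database, file).
    return [{"id": 1, "value": " 42 "}, {"id": 2, "value": "7"}]

def transform(rows):
    # Clean and type-cast each record; a real tool would add
    # data-quality checks and schema validation at this stage.
    return [{"id": r["id"], "value": int(r["value"].strip())} for r in rows]

def load(rows, sink):
    # Stand-in for writing to a warehouse or data lake; returns rows written.
    sink.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract()), warehouse)
print(loaded)        # → 2
print(warehouse[0])  # → {'id': 1, 'value': 42}
```

Real tools differ mainly in how they schedule, parallelize, and monitor these three stages, not in the shape of the stages themselves.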