Airflow - Metagenomics Pipeline

Apache Airflow is a widely used data orchestration tool in the data engineering community. It is fully featured: it interfaces with numerous databases and cloud-native technologies, and it can distribute jobs using Kubernetes. As more companies adopt cloud-native technologies, Apache Airflow has great potential for applications beyond data engineering. In bioinformatics, most research organizations still use outdated batch processing systems, which run counter to the promises of the cloud. Because cloud-native technologies require considerable technical expertise, it is understandable that many organizations have not fully embraced the cloud and instead use intermediate solutions such as DNAnexus. The goal of this project is to demo an end-to-end pipeline for the processing and analysis of metagenomic data.

Metagenomics is the study of microbial populations using next-generation sequencing data. Although it is not as popular in the bioinformatics field as RNA-Seq, it is of growing interest. Metagenomic data can give an overview of the types of bacteria growing in one’s mouth, on the skin, or at other body sites. It could potentially serve as a diagnostic for whether a bacterial population is healthy, and it could provide insight into why drugs have different effects.

In this project, I demo Apache Airflow on a single EC2 instance connected to AWS S3: data is pulled from S3, processed locally with metagenomics pipelines, and the final results are pushed back to S3. Although not present in the code base, the results were then ingested into AWS RDS and visualized using Tableau to demonstrate how good architecture design can lead to simple automation.
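
The pull, process, and push flow described above can be sketched in plain Python. This is a minimal illustration and not the code from this repository: the function names, the `build_command` callable, and the output file naming are all hypothetical. The `s3` argument is assumed to be any object exposing `download_file` and `upload_file` with the same argument order as a boto3 S3 client.

```python
import subprocess
from pathlib import Path


def pull_from_s3(s3, bucket, key, workdir):
    """Download one raw sequencing file from S3 into the local work directory."""
    local_path = Path(workdir) / Path(key).name
    local_path.parent.mkdir(parents=True, exist_ok=True)
    # Same argument order as the boto3 S3 client: (Bucket, Key, Filename).
    s3.download_file(bucket, key, str(local_path))
    return local_path


def run_pipeline(input_path, output_path, build_command):
    """Run a metagenomics tool locally.

    `build_command` is a hypothetical callable that returns the argv list
    for whichever tool is being used, given the input and output paths.
    """
    argv = build_command(input_path, output_path)
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(argv, check=True)
    return Path(output_path)


def push_to_s3(s3, local_path, bucket, key):
    """Upload a result file to S3 and return its URI."""
    # Same argument order as the boto3 S3 client: (Filename, Bucket, Key).
    s3.upload_file(str(local_path), bucket, key)
    return f"s3://{bucket}/{key}"


def process_sample(s3, bucket, raw_key, result_key, workdir, build_command):
    """Chain the three steps for a single sample."""
    raw_path = pull_from_s3(s3, bucket, raw_key, workdir)
    # Hypothetical naming scheme for the local result file.
    result_path = Path(workdir) / (Path(raw_key).stem + ".result")
    run_pipeline(raw_path, result_path, build_command)
    return push_to_s3(s3, result_path, bucket, result_key)
```

In an Airflow DAG, each of the three step functions could become its own task (for example through `PythonOperator`), with `boto3.client("s3")` supplying the client. Keeping the steps as separate functions means a failed processing step can be retried without repeating the download.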