Data Engineering

Airflow - Metagenomics Pipeline

Apache Airflow is a widely used open-source workflow orchestration platform. In this project, I prototype Apache Airflow as a proof of concept for bioinformatics, using a common metagenomics pipeline. Although not featured here, this was part of a larger architecture in which the data is loaded into Amazon RDS and visualized using Tableau.


Annotating single nucleotide polymorphisms (SNPs) is important for understanding how a mutation at a given location can cause downstream effects. Oncolonnator was built to take in Variant Call Format (VCF) files and annotate the mutations with potential effects using the ExAC REST API.
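
As a rough illustration of the first step in that kind of workflow (this is not Oncolonnator's actual code), the sketch below reads VCF data lines and builds one dash-separated `CHROM-POS-REF-ALT` key per alternate allele. The function name `vcf_variant_ids` is hypothetical, and the key format is an assumption about what an ExAC-style variant lookup expects; the API endpoint itself is left out because it is not given here.

```python
from typing import Iterable, Iterator


def vcf_variant_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield CHROM-POS-REF-ALT keys, one per alternate allele.

    Hypothetical helper: the dash-separated key format is an assumption,
    not something confirmed by the project description.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, meta-information (##) and the header row (#CHROM).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A VCF data line has at least CHROM, POS, ID, REF, ALT.
        if len(fields) < 5:
            continue
        chrom, pos, _variant_id, ref, alt = fields[:5]
        # Normalize "chr1" to "1" so keys are consistent across callers.
        if chrom.lower().startswith("chr"):
            chrom = chrom[3:]
        # Multi-allelic sites list several ALT alleles separated by commas.
        for allele in alt.split(","):
            if allele in (".", "*"):
                continue  # no alternate allele, or an upstream deletion
            yield f"{chrom}-{pos}-{ref}-{allele}"


if __name__ == "__main__":
    sample = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t935222\t.\tC\tA\t50\tPASS\t.",
        "2\t1000\trs1\tG\tT,GA\t.\t.\t.",
    ]
    for key in vcf_variant_ids(sample):
        print(key)
```

Each key could then be sent to the annotation service, one request per variant or in batches. Splitting multi-allelic sites first keeps every lookup to a single reference/alternate pair.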

HLA-PRG-LA Docker Container

HLA-PRG-LA is an algorithm built to genotype human leukocyte antigen (HLA) types from whole-genome and whole-exome next-generation sequencing data. The installation is quite involved and the algorithm is resource intensive, so I containerized it to make it quick to deploy and scale across potentially large compute clusters.