Daniel Chen

Senior Bioinformatics Associate 1

Gilead Sciences Inc


I am a data science professional with 5+ years of experience and steadily increasing responsibilities and technical demands. My current responsibilities are highly cross-functional and include data modeling, machine learning, software development, cloud architecture development, business development, and more. I like to stay on top of trends in order to prepare proactively for the future of data science.

My current professional interests include machine learning, big data, and applied machine learning. When not busy coding, I like to run, cook, travel, and practice archery. I am multilingual, with varying levels of proficiency in Cantonese, Japanese, Mandarin, and Korean.

I am always open to discussing science and technology, conversing in other languages, or exploring new opportunities.



  • Machine Learning
  • Deep Learning
  • Software Development
  • Robotics
  • Foreign Languages
  • Archery
  • Cooking


  • MBA, 2023

    University of Southern California

  • MLA in Data Science, 2021

    Harvard University Extension School

  • BSc in Biotechnology, 2011

    University of California, Davis



Senior Bioinformatics Associate 1

Gilead Sciences Inc

Nov 2018 – Present California
  • Architect and collaboratively build an automated Kubernetes Airflow deployment for enterprise-level ETL jobs, saving 50-80% in cloud costs by going cloud native for analysis
  • Manage vendor relationships and provide technical input for rapid POCs of potential technology acquisitions for the company's digital transformation
  • Support development of an image recognition model using U-Net to classify cells for potential clinical trial usage
  • Statistically model and engineer a framework that uses a linear model to classify the risk of R packages, supporting regulatory compliance and cost savings from a potential technology partnership
  • Develop and maintain full stack data applications that give nontechnical users access to visualizations and results

Biostatistician 1

Gilead Sciences (Akraya Inc)

Nov 2017 – Nov 2018 California
  • Analyzed next-generation sequencing data to select a sequencing technology vendor and replace a contract research organization
  • Created analytical applets and workflows on the DNAnexus platform and pioneered reproducible Docker container-based workflows to save development time and costs
  • Developed an early data warehouse and query API for genomic data to transition towards database technologies for visualization and analysis workflows
  • Collaborated cross-functionally to create full stack web applications that visualize data from different therapeutic areas for non-technical audiences

Scientist 1

Roche Sequencing Solutions Santa Clara (Aerotek Inc)

Apr 2017 – Nov 2017 California
  • Planned and executed experiments to meet company goals for product launch
  • Created and maintained Shiny web applications for data visualization and interactivity
  • Developed automation bash scripts to perform sequencing experiments more efficiently, saving minutes per experiment
  • Implemented a more efficient SOP for experiments to preemptively deal with potential chipset failures
  • Provided d3.js visualizations, R analysis and visualization, and Python-based workflows
  • Optimized existing data workflows from days to seconds
  • Built and maintained a PostgreSQL data warehouse for efficient data management
  • Supported efforts to make data more accessible to other groups in the company

Freelance Web Developer


Aug 2015 – Apr 2017 California
  • Designed and executed experiments as well as analyzed novel proteins to find relevant sequencing characteristics
  • Created and maintained Shiny web applications that give non-programmers across functions access to sequencing data visualizations and analyses
  • Developed automation bash scripts to simplify pre-sequencing steps in experiments and save time per sequencing run
  • Designed and implemented an improved SOP for experiments to preemptively deal with potential sequencing failures due to wafer defects
  • Optimized existing ETLs from days to seconds, freeing computational resources for downstream analysis
  • Built a PostgreSQL data warehouse to simplify and centralize data for analysis and visualization

Bioinformatics Intern

University of California, San Francisco

Aug 2015 – Apr 2017 San Francisco, California
  • Analyzed NGS data to look for significant differences in expression in a knockout-model skin cancer mouse line
  • Created statistical models and algorithms to process large volumes of microarray data, including mining and interpreting the results to help guide knockout experiments
  • Visualized and reported results in a clear and understandable manner for non-technical audiences

CIRM Bridges to Stem Cell Research Intern

University of California, San Francisco

Aug 2013 – May 2014 San Francisco, California
  • Conducted research into the feasibility of isolating spermatogonial stem cells from human testicular cancers through analysis of cellular and protein markers
  • Supported coworkers' research by assisting in experiments on azoospermia and spermatogonial stem cell differentiation
  • Maintained common lab areas to ensure a productive environment
  • Actively increased sample supply by collecting and processing primary tissue for downstream diagnostic work
  • Followed safety protocols for handling human specimens in BSL2 conditions

Comparative Respiratory Research Intern

University of California, Davis

Apr 2011 – Dec 2011 Davis, California
  • Studied the effect of smoke on the human lung cancer line A549 and the immortalized cell line HBE-1 (human bronchial epithelial) by examining specific cellular markers for DNA damage
  • Conducted experiments to quantify the protein expression levels of DNA repair markers of interest and to assess morphological changes in cells in response to induced DNA damage
  • Identified several treatments that diminished the expression of several DNA repair markers


Data Science Graduate Level Certificate

Derive predictive insights by applying advanced statistics, modeling, and programming skills. Acquire in-depth knowledge of machine learning and computational techniques. Unearth important questions and intelligence for a range of industries, from product design to finance.

Japanese Language Proficiency Test N3

Standardized test for Japanese Language administered by the Ministry of Education, Culture, Sports, Science, and Technology of Japan. Certificate of accomplishment for demonstrating intermediate level competence in Japanese language.

Python for Data Analysis

Data analysis training for Python including pandas, matplotlib, and scikit-learn.

Data Science Specialization

Basic coursework in Python, covering core Python fundamentals including data structures, web scraping, databases, and visualization.

Python for Everybody Specialization

Basic coursework in Python, covering core Python fundamentals including data structures, web scraping, databases, and visualization.

Chinese Proficiency Test (HSK2)

Standardized test for Mandarin administered by the Ministry of Education of the People’s Republic of China. Certificate of accomplishment for demonstrating basic level competence in Mandarin Chinese.



Face Mask Detection with Yolo v4

Face masks are one of the key strategies recommended by the CDC to prevent the spread of COVID-19. However, not everybody follows those guidelines. Using an automated machine learning approach, it is possible to detect whether people are wearing face masks. This can also lead to better safety measures, since remote systems can be used to monitor mask compliance. In this post, we use Yolo v4 to perform object detection on face masks.

Q Learning with Atari

Q-learning is an off-policy reinforcement learning algorithm popularized by deep Q-networks (DQN), which learned to play Atari games. OpenAI hosts a variety of environments for reinforcement learning models to play in. This post gives an example of how to implement a Q-learning algorithm for Atari games.

Amazon Fine Foods Review Analysis

Given longitudinal data, one should be able to understand how things change over time. Using a longitudinal dataset of Amazon reviews, I attempt to understand and visualize food trends over the years.

Fake Job Classification

According to the US Department of Labor, the unemployment rate in the United States was 11.1% as of June 2020. Because job postings are now made online, most companies either post directly to job boards or have their job data pulled by job aggregators. However, not all job postings are genuine: some are fraudulent postings used to harvest data or other sensitive information from desperate job seekers. Using natural language processing, we built a predictive model to classify potentially fraudulent jobs.

Seoul Pollution Forecasting

Air pollution is a growing problem around the world. Many fast-growing countries are increasingly encountering air pollution problems due to the rapid urbanization and modernization of their societies. The metropolitan government of Seoul released data from its air pollution monitoring system covering a period of 3 years. We attempt to forecast future pollution levels of various analytes using a vector autoregression model.

NMT Zeroshot

Neural machine translation (NMT) is a relatively new approach to machine translation. This project is an attempt to build a translation model using the seq2seq architecture to perform zero-shot translation between three different languages.

Airflow - Metagenomics Pipeline

Apache Airflow is a highly rated data orchestration tool. In this project, I prototype the use of Apache Airflow as a proof of concept for bioinformatics using a common metagenomic pipeline. Although not featured here, this was part of a greater architecture in which the data is passed into Amazon RDS and visualized using Tableau.


Oncolonnator

Variant annotation of single nucleotide polymorphisms is very important in understanding how a mutation at a given location can cause downstream effects. Oncolonnator was built to take in variant call (VCF) files and annotate the mutations with potential effects using the ExAC REST API.

HLA-PRG-LA Docker Container

HLA-PRG-LA is an algorithm built to genotype human leukocyte antigen (HLA) types from whole genome and whole exome next-generation sequencing data. The installation is quite involved and the algorithm is resource intensive. The algorithm was containerized so that it can scale quickly, with potentially large compute clusters in mind.

Recent Posts

Explainable NLP with LIME

NLP models are typically black boxes because of the large feature space that stems from the complexity of language. Explainable AI methods, however, seek to clarify what the models are doing and how the classifiers in a given model work.

Hugo on AWS Amplify

Hugo is a static web page generator which offers fast rendering of pages and simple content management, since everything is written in Markdown. At the time of this writing, the Hugo GitHub repository has 44.