Daniel Chen

Senior Bioinformatics Associate 1

Gilead Sciences Inc

Biography

I am a data science professional with more than five years of experience and steadily increasing technical responsibilities. My current role is highly cross-functional and includes data modeling, machine learning, software development, cloud architecture, business development, and more. I stay on top of industry trends in order to proactively prepare for the future of data science.

My current professional interests include machine learning, big data, and applied data science. When not busy coding, I like to run, cook, travel, and practice archery. I am multilingual, with varying levels of Cantonese, Japanese, Mandarin, and Korean.

I am always open to discussing science and technology, a discussion in other languages, or new opportunities.

I speak Chinese.
I can speak Japanese.

Interests

  • Machine Learning
  • Deep Learning
  • Software Development
  • Robotics
  • Foreign Languages
  • Archery
  • Cooking

Education

  • MBA, 2023

    University of Southern California

  • ALM in Data Science, 2021

    Harvard University Extension School

  • BSc in Biotechnology, 2011

    University of California, Davis

Experience


Senior Bioinformatics Associate 1

Gilead Sciences Inc

Nov 2018 – Present California
  • Architect and collaboratively build an automated Airflow-on-Kubernetes deployment for enterprise-level ETL jobs, saving 50–80% in cloud costs by moving analyses cloud-native
  • Manage vendor relationships and provide technical input for rapid POCs of potential technology acquisitions supporting the company's digital transformation
  • Support development of an image recognition model using a U-Net architecture to classify cells for potential clinical trial use
  • Statistically model and engineer a linear-model framework to classify the regulatory-compliance risk of arbitrary R packages, enabling cost savings from a potential technology partnership
  • Develop and maintain full-stack data applications that give non-technical users access to visualizations and results

Biostatistician 1

Gilead Sciences (Akraya Inc)

Nov 2017 – Nov 2018 California
  • Analyzed next-generation sequencing data for the selection of a sequencing technology vendor and the replacement of a contract research organization
  • Created analytical applets and workflows on the DNAnexus platform and pioneered reproducible Docker-based workflows to save development time and costs
  • Developed an early data warehouse for genomic data and a querying API to transition visualization and analysis workflows toward database technologies
  • Collaborated cross-functionally to create full-stack web applications visualizing data from different therapeutic areas for non-technical audiences

Scientist 1

Roche Sequencing Solutions Santa Clara (Aerotek Inc)

Apr 2017 – Nov 2017 California
  • Planned and executed experiments to meet company goals for product launch
  • Created and maintained Shiny web applications for data visualization and interactivity
  • Developed automation bash scripts to efficiently perform sequencing experiments, saving minutes per experiment
  • Implemented a more efficient SOP to preemptively deal with potential chipset failures during experiments
  • Provided d3.js visualizations, R analysis and visualization, and Python-based workflows
  • Optimized existing data workflows, reducing runtimes from days to seconds
  • Built and maintained a PostgreSQL data warehouse for efficient data management
  • Supported efforts to make data more accessible to other groups in the company

Freelance Web Developer

Self

Aug 2015 – Apr 2017 California
  • Designed and executed experiments as well as analyzed novel proteins to find relevant sequencing characteristics
  • Created and maintained shiny web applications to make data accessible cross functionally to allow non-programmers access to sequencing data visualization and analyses
  • Developed automation bash scripts to simplify pre-sequencing steps in experiments and to save time per sequencing run
  • Implemented and designed improved SOP for experiments to preemptively deal with potential sequencing failures due to wafer defects
  • Optimized existing ETLs from days to seconds, freeing computational resources for downstream analysis
  • Built a PostgreSQL data warehouse to simplify and centralize data for analysis and visualization

Bioinformatics Intern

University of California, San Francisco

Aug 2015 – Apr 2017 San Francisco, California
  • Analyzed NGS data to identify significant differences in expression in a knockout skin cancer mouse model
  • Created statistical models and algorithms using microarray data to process large volumes of data, including mining and interpreting results to help guide knockout experiments
  • Visualized and reported results clearly and understandably for non-technical audiences

CIRM Bridges to Stem Cell Research Intern

University of California, San Francisco

Aug 2013 – May 2014 San Francisco, California
  • Conducted research into the feasibility of isolating spermatogonial stem cells from human testicular cancers through analysis of cellular and protein markers
  • Supported coworkers' research by assisting in experiments on azoospermia and spermatogonial stem cell differentiation
  • Maintained common lab areas to ensure a productive environment
  • Actively increased sample supply by collecting and processing primary tissue for downstream diagnostic work
  • Followed safety protocols for handling human specimens under BSL-2 conditions

Comparative Respiratory Research Intern

University of California, Davis

Apr 2011 – Dec 2011 Davis, California
  • Studied the effect of smoke on the human lung cancer line A549 and the immortalized cell line HBE-1 (human bronchial epithelial) by examining specific cellular markers for DNA damage
  • Conducted experiments to quantify protein expression levels of DNA repair markers of interest and to assess morphological changes of cells in response to induced DNA damage
  • Identified several treatments that diminished the expression of several DNA repair markers

Certifications

Data Science Graduate Level Certificate

Derive predictive insights by applying advanced statistics, modeling, and programming skills. Acquire in-depth knowledge of machine learning and computational techniques. Unearth important questions and intelligence for a range of industries, from product design to finance.

Japanese Language Proficiency Test N3

Standardized test for Japanese Language administered by the Ministry of Education, Culture, Sports, Science, and Technology of Japan. Certificate of accomplishment for demonstrating intermediate level competence in Japanese language.

Python for Data Analysis

Data analysis training for Python including pandas, matplotlib, and scikit-learn.

Data Science Specialization

Introductory Python coursework covering core fundamentals, including data structures, web scraping, databases, and visualization.
See certificate

Python for Everybody Specialization

Introductory Python coursework covering core fundamentals, including data structures, web scraping, databases, and visualization.
See certificate

Chinese Proficiency Test (HSK 2)

Standardized test for Mandarin administered by the Ministry of Education of the People's Republic of China. Certificate of accomplishment for demonstrating basic-level competence in Mandarin Chinese.

Projects

Face Mask Detection with Yolo v4

Face masks are one of the key strategies recommended by the CDC to prevent the spread of COVID-19. However, not everybody follows those guidelines. Using an automated machine learning approach, it is possible to detect face mask compliance, and remote systems can use such detection to improve safety monitoring. In this post, we use YOLOv4 to perform object detection on face masks.
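A detector like YOLOv4 emits many overlapping candidate boxes, so the standard post-processing step is confidence filtering followed by non-maximum suppression (NMS). Below is a dependency-free sketch of that step; the box format `(x1, y1, x2, y2)` and the thresholds are illustrative, not taken from the project itself.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, score_thresh=0.5, iou_thresh=0.4):
    # detections: list of (box, score) pairs. Keep the highest-scoring
    # boxes and drop any box overlapping an already-kept one.
    kept = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        if score < score_thresh:
            continue
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept

detections = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8),
              ((20, 20, 30, 30), 0.7), ((0, 0, 5, 5), 0.3)]
result = nms(detections)
```

In practice a framework's built-in NMS would be used; the sketch just shows what happens to the raw detections.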

Q Learning with Atari

Q-learning is an off-policy reinforcement learning algorithm popularized by Deep Q-Networks (DQN) and by game-playing agents such as those for Go and Dota. OpenAI Gym hosts a variety of environments for experimenting with reinforcement learning models. This post gives an example of how to implement a Q-learning algorithm in Atari games.
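The core of Q-learning, before any deep network is involved, is the tabular update rule Q(s, a) ← Q(s, a) + α(r + γ max Q(s', ·) − Q(s, a)). A minimal sketch on a toy corridor environment (the environment and hyperparameters are illustrative, not the Atari setup from the post):

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning on a toy corridor: states 0..n-1, actions
    # 0 = left, 1 = right; reward 1 for reaching the rightmost state.
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy action selection.
            a = random.randrange(2) if random.random() < epsilon else \
                max((0, 1), key=lambda x: Q[s][x])
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Off-policy update: bootstrap from the greedy value in s2.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

DQN replaces the table with a neural network over pixels, but the update being approximated is the same one.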

Amazon Fine Foods Review Analysis

Given longitudinal data, one should be able to understand how things change over time. Using a longitudinal dataset of Amazon reviews, I attempt to understand and visualize trends in food over the years.

Fake Job Classification

The unemployment rate in the United States, according to the US Department of Labor, was 11.1% as of June 2020. Since job postings are now done online, most companies post directly to job boards or have their listings pulled in by job aggregators. However, not all postings are genuine: some are fraudulent listings used to harvest data or other sensitive information from desperate job seekers. Using natural language processing, we built a predictive model to classify potentially fraudulent jobs.

Seoul Pollution Forecasting

Air pollution is a growing problem around the world. Many fast-growing countries increasingly encounter air pollution due to the rapid urbanization and modernization of their societies. The metropolitan government of Seoul released data from its air pollution monitoring system covering a period of three years. We attempt to forecast future pollution levels of various analytes using a vector autoregression model.
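A vector autoregression forecasts each series as a linear combination of the lagged values of every series, which lets the pollutants inform each other's forecasts. A dependency-free sketch of a one-step VAR(1) forecast, y_t = c + A·y_{t-1}; the coefficients and analyte values below are made up for illustration (in practice a library such as statsmodels would estimate c and A from the data):

```python
def var1_forecast(y_prev, intercept, A):
    # One-step VAR(1) forecast: y_t = c + A @ y_{t-1},
    # written out as plain lists to keep the sketch dependency-free.
    return [c + sum(a * y for a, y in zip(row, y_prev))
            for c, row in zip(intercept, A)]

# Illustrative two-analyte example: series = [PM2.5, NO2].
c = [1.0, 0.5]
A = [[0.8, 0.1],
     [0.05, 0.9]]
y_prev = [35.0, 0.04]

forecast = var1_forecast(y_prev, c, A)
```

Multi-step forecasts come from feeding each prediction back in as the next lag.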

NMT Zeroshot

Neural machine translation (NMT) is a relatively new approach to machine translation. This project is an attempt to build a translation model using the seq2seq architecture to perform zero-shot translation between three different languages.

Airflow - Metagenomics Pipeline

Apache Airflow is a widely used data orchestration platform. In this project, I prototype Apache Airflow as a proof of concept for bioinformatics using a common metagenomics pipeline. Although not featured here, this was part of a larger architecture in which the data is passed into Amazon RDS and visualized using Tableau.
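In Airflow, a pipeline like this is expressed as a DAG of tasks with explicit dependencies. The configuration sketch below shows the general shape only: the task names, tools, and commands are placeholders, not the actual pipeline from the project.

```python
# Sketch of a metagenomics DAG (Airflow 2.x); commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "metagenomics_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    qc = BashOperator(task_id="fastqc", bash_command="fastqc ...")
    trim = BashOperator(task_id="trim", bash_command="trimmomatic ...")
    classify = BashOperator(task_id="classify", bash_command="kraken2 ...")
    report = BashOperator(task_id="report", bash_command="multiqc ...")

    # Linear dependency chain: QC, then trimming, then classification.
    qc >> trim >> classify >> report
```

Running the tasks as containers on Kubernetes (via the Kubernetes executor or KubernetesPodOperator) is what makes this scale for cloud-native ETL.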

Oncolonnator

Variant annotation of single nucleotide polymorphisms is very important for understanding how a mutation at a given location can cause downstream effects. Oncolonnator was built to take in variant call (VCF) files and annotate the mutations with potential effects using the ExAC REST API.

HLA-PRG-LA Docker Container

HLA-PRG-LA is an algorithm built to genotype human leukocyte antigen (HLA) types from whole genome and whole exome next-generation sequencing data. The installation is quite involved and the algorithm is resource-intensive, so it was containerized in order to scale quickly on potentially large compute clusters.

Recent Posts

Explainable NLP with LIME

NLP models are typically black boxes due to the large feature space stemming from the complexity of language. Explainable AI techniques, however, seek to make clearer what the models are doing and how the classifiers in a given model reach their decisions.
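LIME explains a single prediction by perturbing the input and fitting a simple local model to the perturbed outputs. The dependency-free sketch below captures the underlying idea with a cruder leave-one-word-out attribution (LIME proper samples many perturbations and fits a weighted linear model via the `lime` package); the toy classifier is entirely hypothetical.

```python
def word_importance(text, predict):
    # Crude LIME-style attribution: score each word by how much the
    # model's output drops when that single word is removed.
    words = text.split()
    base = predict(text)
    scores = {}
    for i, w in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        scores[w] = base - predict(perturbed)
    return scores

# Toy "classifier": spam probability rises with a trigger keyword.
def toy_predict(text):
    return min(1.0, 0.2 + 0.4 * text.lower().split().count("refund"))

scores = word_importance("instant refund guaranteed", toy_predict)
```

Words whose removal changes the prediction most are the ones the model is leaning on, which is the intuition behind LIME's local explanations.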

Hugo on AWS Amplify

Hugo is a static site generator that offers fast page rendering and simple site management by using Markdown to write everything. At the time of this writing, the Hugo GitHub repository has 44.