Fake Job Classification

Daniel Chen

May 17, 2020 Machine Learning, Data Analytics


Introduction

According to the US Department of Labor, the unemployment rate in the United States stood at 11.1% as of June 2020¹. Many factors contribute to this rate, and many people in the US and worldwide need to look for new jobs because of job loss and other financial hardship. Since job postings are now published online, most companies either post directly to job boards or have their listings pulled in by job aggregators. However, not every posting is genuine: some are fraudulent postings used to harvest personal data or other sensitive information from desperate job seekers². Using advances in natural language processing (NLP), it should be possible to build a classifier that detects potentially fake jobs. This is an important task because many people will be applying for jobs online given the current unemployment situation, and reducing the number of people who fall victim to fraudulent postings will matter for economic recovery.

Using data gathered by the University of the Aegean on job postings from many different job sites, it should be possible to build a classifier for potentially fake job postings. Job posts are typically text rich, with a general job description and some additional information such as telecommuting options and employment type. With these features, it should be possible to build either a complex model that applies NLP to the job description or a simpler model that relies on simpler features. This project uses three models, with the simplest serving as the baseline to compare against.

Simple model using non-NLP features, such as the correlation between job function and fraudulent postings
Intermediate model using NLP tokenization features and more traditional machine learning algorithms, such as logistic regression or SVMs, to model fraudulent job postings
Complex model using state-of-the-art transformer neural networks to predict fraudulent job postings
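As a rough illustration of the first, non-NLP approach, the sketch below computes the share of fraudulent postings within each job function. The rows are made-up toy data, and the field names `function` and `fraudulent` are assumptions about how the dataset might be laid out rather than something stated here.

```python
from collections import defaultdict

# Toy rows standing in for the real dataset. The field names "function"
# and "fraudulent" are assumed, and the values are invented for illustration.
postings = [
    {"function": "Engineering", "fraudulent": 0},
    {"function": "Engineering", "fraudulent": 0},
    {"function": "Administrative", "fraudulent": 1},
    {"function": "Administrative", "fraudulent": 0},
    {"function": "Sales", "fraudulent": 0},
]


def fraud_rate_by_category(rows, key):
    """Return {category: share of rows in that category labelled fraudulent}."""
    totals = defaultdict(int)
    fakes = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        fakes[row[key]] += row["fraudulent"]
    return {category: fakes[category] / totals[category] for category in totals}


rates = fraud_rate_by_category(postings, "function")
for category, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{category}: {rate:.0%} fraudulent")
```

Categories whose rate sits well above the overall rate are candidates for a simple rule-based or single-feature baseline.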

Data Description

The dataset was downloaded from Kaggle and originally comes from the University of the Aegean. It is publicly available and contains almost 18,000 job ads, published between 2012 and 2014, that humans have labeled as real or fake. About 17,000 of the ads are real and about 900 are fake.
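To see why these counts make plain accuracy a misleading metric, the short calculation below scores a trivial classifier that labels every posting as real. It uses only the approximate counts given above; the variable names are illustrative.

```python
# Approximate class counts from the dataset description.
n_real, n_fake = 17_000, 900

# A trivial classifier that labels every posting as real.
true_negatives = n_real    # real postings correctly labelled real
false_negatives = n_fake   # fake postings that are missed
true_positives = 0         # no posting is ever flagged as fake

accuracy = (true_positives + true_negatives) / (n_real + n_fake)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.3f}, recall={recall:.1f}")
```

The do-nothing classifier reaches roughly 95% accuracy while catching none of the fake postings, which is why recall and F1 on the fake class are more informative here.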

Modeling

Because roughly 95% of the postings are real, some work was needed to keep the false negative rate down. nlpaug was used to augment the text data and increase the number of fake jobs: new fake postings were generated from existing ones by substituting synonyms, introducing minor typos, or inserting minor additional adjectives. In addition, the real jobs were downsampled so that the class balance became about 2:1, real to fake. Several models were built for this binary classification problem, including SVM, logistic regression, random forest, and deep learning models based on RoBERTa.
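The rebalancing idea can be sketched in plain Python. Note that this is not nlpaug: the character-swap function below is a deliberately crude stand-in for the synonym and typo augmenters, and the function names, the number of copies per fake posting, and the toy data are all assumptions made for illustration.

```python
import random


def swap_adjacent_chars(text, rng):
    """Stand-in augmentation: swap two adjacent characters to mimic a minor typo."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def rebalance(real, fake, copies_per_fake=2, real_to_fake=2, seed=0):
    """Augment the fake postings, then downsample the real ones to the target ratio."""
    rng = random.Random(seed)
    augmented = list(fake)
    for text in fake:
        for _ in range(copies_per_fake):
            augmented.append(swap_adjacent_chars(text, rng))
    n_real = min(len(real), real_to_fake * len(augmented))
    return rng.sample(real, n_real), augmented


# Toy data with the same 95:5 imbalance as the dataset.
real = [f"real posting {i}" for i in range(95)]
fake = [f"fake posting {i}" for i in range(5)]

real_kept, fake_all = rebalance(real, fake)
print(len(real_kept), len(fake_all))
```

Augmentation and downsampling of this kind are normally applied to the training split only, so that augmented copies of a posting do not leak into the evaluation data.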

Results

We can classify whether a job is fraudulent reasonably well using multiple features, including the company profile from the job description. Although the models suggest that the company profile is the strongest feature, other features such as the job description, requirements, and benefits also have good predictive power. Even the basic models get good predictive power from text lengths alone. The basic NLP-based models achieve very good accuracy and F1 scores with minimal hyperparameter tuning. More complex models such as RoBERTa still perform very well, but comparatively worse and with much longer training times. For the final model, an ensemble using the best parameters was chosen in order to minimize potential bias from any single variable.
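For readers who want a concrete picture of a text-based ensemble, here is a minimal scikit-learn sketch. It is not the project's actual model: the toy texts, the choice of hard voting, and all hyperparameters are assumptions, and the real ensemble may combine features and estimators differently.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy postings; 1 = fraudulent, 0 = real.
texts = [
    "earn money fast from home no experience needed send bank details",
    "work from home earn cash quickly wire transfer required",
    "quick cash no interview send your bank details today",
    "software engineer with python experience for our platform team",
    "registered nurse needed for night shifts at regional hospital",
    "marketing manager to lead campaigns at an established retailer",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feed three classifiers whose predictions are combined
# by majority (hard) vote.
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("svm", LinearSVC()),
            ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        voting="hard",
    ),
)
ensemble.fit(texts, labels)

predictions = ensemble.predict([
    "send bank details to earn cash from home",
    "senior python engineer for hospital software team",
])
print(predictions)
```

Hard voting is used here because `LinearSVC` does not expose class probabilities; a soft-voting variant would need estimators that all implement `predict_proba`.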

Model Deployment

The resulting scikit-learn models were pickled and loaded into a FastAPI server, which was set up to serve predictions from the ensemble model for a given input.

References

Bureau of Labor Statistics, US Department of Labor. The Employment Situation - June 2020. Accessed 07/26/2020. https://www.bls.gov/news.release/pdf/empsit.pdf
USC Career Center. Avoid Fraudulent Job Postings. Accessed 07/26/2020. https://careers.usc.edu/students/find-a-job/avoid-fraudulent-job-postings/
Rajapakse, Thilina. Simple Transformers - Introducing the Easiest Way To Use BERT, RoBERTa, XLNet, and XLM. Accessed 07/26/2020. https://towardsdatascience.com/simple-transformers-introducing-the-easiest-bert-roberta-xlnet-and-xlm-library-58bf8c59b2a3
