Dutch AI monitor 2024

Technical annex to chapter 6

This technical annex provides a step-by-step explanation of the entire process surrounding the ML classification model outlined in chapter 6. It offers a detailed look at the process and the decisions involved. For the purposes of this text, we assume that readers are familiar with the basic terminology of ML classification models.

T.2.1 Manual labelling

In order to create the first test and training dataset, one set of vacancies was labelled by hand using the categories AI vacancy and non-AI vacancy. Since AI vacancies make up only a minuscule part of the entire dataset (roughly 0.1 percent), it was impossible to create a usable test and training dataset from a random sample: the resulting training dataset would have contained too few positive cases (AI vacancies) to train the models effectively. To avoid this problem, we actively looked for AI vacancies when compiling the initial set of vacancies. We did this by sampling filtered populations based on search terms and/or ISCO codes. We varied the search terms in order to cover the complete spectrum of AI vacancies. The initial dataset also needed to include vacancies with no relationship to AI whatsoever. For that purpose, we added a random sample and labelled it non-AI; for these vacancies, we only checked the titles.

T.2.2 Encoding

We used encoding because the ML classification models require numerical inputs, whereas the goal was to classify vacancy texts. Two types of encoding were tested: TF-IDF and PPMI. After encoding, additional features were added to the resulting numeric vectors. Before applying the encoding methods, we cleaned the data. This involved removing special characters and filler words, converting all letters to lowercase, and stemming: a process in which words are reduced to their linguistic roots, e.g. ‘calculating’ is reduced to ‘calculate’.
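
As an illustration, the following minimal Python sketch shows what such a cleaning step can look like, using NLTK's snowball stemmer and a small illustrative stop-word list; the actual cleaning rules and word lists used for this project are not reproduced here.

```python
import re

from nltk.stem.snowball import SnowballStemmer  # requires the nltk package

# Illustrative filler-word list; the project's actual word lists are not reproduced here.
STOP_WORDS = {"de", "het", "een", "en", "the", "a", "and", "of"}
stemmer = SnowballStemmer("dutch")  # assumption: most ad texts are in Dutch

def clean_text(text: str) -> list[str]:
    """Lowercase, strip special characters, drop filler words and stem the remaining words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

print(clean_text("Wij zoeken een data scientist met ervaring in machine learning!"))
```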

TF-IDF

The encoding method known as ‘term frequency–inverse document frequency’ (TF-IDF) is a combination of two techniques. First, the terms that appear in a vacancy text are catalogued and tallied (TF). Next, those values are weighted based on how often the terms appear across all ad texts (IDF). Terms that appear frequently are weighted lower than terms that are less common; consequently, terms unique to specific texts are weighted higher. For each term, the weighting used in IDF was determined by counting how many texts in the entire training dataset used the term at least once. We used a parameter m to limit the number of features, i.e. the dimensionality, of the resulting numeric vector. Terms that appeared in fewer than m texts across the training dataset received a weight of zero and were consequently not adopted as features in the ML classification model. The value of parameter m can vary; for this project, we tested the values {5, 10, 15, 20, 25, 30}.
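
A minimal sketch of this kind of frequency threshold is shown below, using scikit-learn's TfidfVectorizer on a toy corpus; the min_df argument plays the role of parameter m. The exact TF-IDF weighting used in the project may differ from scikit-learn's default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned training texts.
train_texts = [
    "machine learning engineer python",
    "data scientist machine learning",
    "python developer web",
    "truck driver logistics",
]

m = 2  # minimum number of training texts a term must appear in

# min_df=m gives terms occurring in fewer than m texts a weight of zero
# (they are simply dropped from the vocabulary), limiting the dimensionality.
vectorizer = TfidfVectorizer(min_df=m)
X_train = vectorizer.fit_transform(train_texts)

print(vectorizer.get_feature_names_out())  # only 'learning', 'machine' and 'python' survive
print(X_train.shape)
```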

PPMI

Positive pointwise mutual information (PPMI) encoding uses the entire training dataset to determine the probability that two terms both appear within a certain window size, while also taking into account the probability that each term appears separately in the text. For this project, we opted for a window size of two, meaning that we looked at a range of two words before and after the term in question. That probability is recorded in a matrix, whose elements are calculated as follows:

$$pmi(x,y)=\log\frac{p(x,y)}{p(x)p(y)}$$

where x and y are two different terms, p(x,y) is the probability that term y appears within the window around term x, and p(x) is the probability that term x appears at all; the same holds for p(y) and term y. All negative values are set to zero, hence the ‘positive’ in PPMI. As was the case with TF-IDF, PPMI only considers words that appear in more than m texts.
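
The sketch below illustrates how such a PPMI matrix can be built from co-occurrence counts within a window of two, on a toy corpus. The probability estimates are simplified and, for brevity, the m-threshold is not applied; this does not reproduce the project's exact implementation.

```python
from collections import Counter

import numpy as np

texts = [["data", "scientist", "machine", "learning"],
         ["machine", "learning", "engineer"],
         ["truck", "driver", "logistics"]]
window = 2  # two words before and after the term in question

vocab = sorted({w for t in texts for w in t})
idx = {w: i for i, w in enumerate(vocab)}

pair_counts = Counter()
word_counts = Counter()
total_pairs = 0
for tokens in texts:
    word_counts.update(tokens)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pair_counts[(w, tokens[j])] += 1   # count the pair in both directions
            pair_counts[(tokens[j], w)] += 1   # so that the matrix is symmetric
            total_pairs += 2

total_words = sum(word_counts.values())
ppmi = np.zeros((len(vocab), len(vocab)))
for (x, y), c in pair_counts.items():
    p_xy = c / total_pairs
    p_x = word_counts[x] / total_words
    p_y = word_counts[y] / total_words
    # Negative PMI values are clipped to zero: the 'positive' in PPMI.
    ppmi[idx[x], idx[y]] = max(0.0, np.log(p_xy / (p_x * p_y)))
```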

The resulting PPMI matrix is used to express every vacancy as a numeric vector. Every word in the vacancy that appears in more than m texts corresponds to a row in the PPMI matrix. For each vacancy, we collected all rows from the matrix that correspond to words in the ad text and then calculated the L2 norm over this collection. For example, for an ad text with 300 words that appear in more than m texts, the 300 corresponding rows were gathered from the PPMI matrix, and the L2 norm over those 300 rows is the numeric vector used as input.
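
The annex does not spell out exactly how the L2 norm over this collection of rows is taken; one plausible reading is a column-wise L2 norm, which yields one fixed-length vector per vacancy. The sketch below assumes that reading and uses a stand-in PPMI matrix and vocabulary index.

```python
import numpy as np

# Stand-ins for the PPMI matrix and the word-to-row index built on the training data.
rng = np.random.default_rng(0)
vocab_size = 5
ppmi = rng.random((vocab_size, vocab_size))
idx = {"machine": 0, "learning": 1, "python": 2, "data": 3, "scientist": 4}

def encode_vacancy(tokens: list[str]) -> np.ndarray:
    """Collect the PPMI rows of all retained words and take the column-wise L2 norm."""
    rows = np.array([ppmi[idx[t]] for t in tokens if t in idx])
    if rows.size == 0:
        return np.zeros(ppmi.shape[1])
    # Assumed reading: the L2 norm is taken per column over the collected rows,
    # giving one fixed-length vector per vacancy regardless of text length.
    return np.linalg.norm(rows, axis=0)

print(encode_vacancy(["machine", "learning", "engineer"]))  # 'engineer' is not in the index
```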

Additional features

The resulting TF-IDF and PPMI vectors can be further enhanced by adding additional vacancy information. For this project, we added the following features to the numeric vectors for each method:

  • A binary value that shows whether the ad text was written in Dutch or English.
  • A binary value that indicates whether the job title contained at least one of the following terms: ‘AI’, ‘artificial intelligence’, ‘ML’, ‘machine learning’, ‘NLP’, ‘natural language processing’, or ‘AI/ML’.
  • A numeric value that shows the level of education corresponding to the vacancy.
  • A binary value that shows whether the level of education is known or unknown.
  • A numeric value that shows the size of the organisation associated with the vacancy.
  • A binary value that shows whether the size of the organisation is known or unknown.
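
The sketch below shows how such additional values could be appended to an encoded vector; the field names and the numeric codes for education level and organisation size are illustrative assumptions, not the project's actual definitions.

```python
import re

import numpy as np

SHORT_TERMS = {"ai", "ml", "nlp", "ai/ml"}
PHRASES = ("artificial intelligence", "machine learning", "natural language processing")

def extra_features(title, language, education_level, org_size):
    """Additional feature values for one vacancy (field names are illustrative)."""
    title_lc = title.lower()
    tokens = set(re.findall(r"[a-z/]+", title_lc))
    has_ai_term = bool(tokens & SHORT_TERMS) or any(p in title_lc for p in PHRASES)
    return np.array([
        1.0 if language == "nl" else 0.0,                        # Dutch (1) or English (0)
        1.0 if has_ai_term else 0.0,                              # AI-related term in the title
        float(education_level) if education_level is not None else 0.0,
        1.0 if education_level is not None else 0.0,              # education level known?
        float(org_size) if org_size is not None else 0.0,
        1.0 if org_size is not None else 0.0,                     # organisation size known?
    ])

tfidf_vector = np.zeros(1000)  # stand-in for a TF-IDF or PPMI vector
full_vector = np.concatenate([tfidf_vector, extra_features("AI engineer", "en", 6, 250)])
```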

We also tested whether adding word2vec (w2v) values improved model performance. Word2vec is a technique that maps words to numeric vectors intended to capture their meaning. For this project, we applied w2v with a vector dimensionality of 200. To summarise the w2v information of a text, we took the average of the w2v vectors of all individual words in that text. Using this method, we constructed two w2v vectors: one for the ad text as a whole, and one for just the ad title. This means that, over the course of the process, we tested four types of encoding:

  1. TF-IDF
  2. TF-IDF + w2v for the entire ad text
  3. TF-IDF + w2v for the vacancy title
  4. PPMI
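
As an illustration of the w2v step, the following sketch trains a gensim Word2Vec model with dimensionality 200 and averages the word vectors of a text. The window size, minimum count and other training settings shown here are assumptions, since the annex only specifies the dimensionality.

```python
import numpy as np
from gensim.models import Word2Vec  # requires gensim >= 4

# Toy corpus of tokenised ad texts standing in for the real training data.
sentences = [["machine", "learning", "engineer"],
             ["data", "scientist", "machine", "learning"],
             ["truck", "driver", "logistics"]]

w2v = Word2Vec(sentences, vector_size=200, window=5, min_count=1, seed=0)

def average_w2v(tokens: list[str], model: Word2Vec) -> np.ndarray:
    """Summarise a text as the average of the w2v vectors of its known words."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

text_vector = average_w2v(["machine", "learning", "engineer"], w2v)   # whole ad text
title_vector = average_w2v(["machine", "learning"], w2v)              # ad title only
```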

T.2.3 Machine learning classification models

For this project, we tested six different ML classification models: regularised logit, decision tree, random forest, support vector machines (SVM), and two different multilayer perceptron (MLP) configurations.1) Each of these models used the numeric vector from the previous step as input, and each of them output a label: AI vacancy or non-AI vacancy. Additionally, every model except SVM also gave a probability score between 0 and 1: the higher the score, the greater the likelihood, according to the model, that the vacancy concerned was an AI vacancy.

Every model is, essentially, a decision rule based on the manually labelled training dataset. That rule is created by adjusting a set of parameters in such a way that the model classifies the training data as accurately as possible. Parameter selection is based on mathematical optimisation. The decision rule can then be applied to new, unlabelled data. The exact workings of a rule, as well as the parameters involved in the training process, are unique to each type of ML classification model. See the scikit-learn documentation for more information on each method; scikit-learn is the Python package that was used to program the models.
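
The following sketch shows how the six model types can be instantiated in scikit-learn. The hyperparameter values are placeholders (the actual values followed from the search described in T.2.4), and the single-layer MLP size uses the formula from note 1).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

f = 2000  # number of input features; depends on the encoding method and m

models = {
    "regularised logit": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
    "svm": SVC(),
    "mlp (2 layers of 100)": MLPClassifier(hidden_layer_sizes=(100, 100)),
    "mlp (1 layer)": MLPClassifier(hidden_layer_sizes=(int((2 / 3) * (f + 2)),)),
}

# X_train, y_train: encoded vectors and manual labels (not shown here).
# for name, model in models.items():
#     model.fit(X_train, y_train)
```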

T.2.4 First round of training

In the first round, we tested all combinations of the four encoding methods and the six ML classification models. Additionally, we tested five different values of m ({5, 10, 15, 20, 25}) for each combination. The variable m determines the number of texts, out of the entire training dataset, in which a word needs to appear to be taken into account for the numeric vector. This means that the value of m is essential in determining the dimensionality of the input vector. Taken together, this means we tested 4 × 6 × 5 = 120 unique combinations.

We split the initial manually labelled dataset to test the performance of those combinations. One-fifth was used as testing data and the rest as training data. The models' predicted classifications could then be compared to the known labels of the test data. Balanced accuracy was used to measure model performance: the percentage of correct predictions for the AI vacancy group and for the non-AI vacancy group were first calculated separately, before taking the unweighted average of the two values. As a result, although there are fewer AI vacancies in the training data, errors in this group were weighted more heavily.
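
A minimal sketch of this evaluation step on synthetic stand-in data follows; whether the split was stratified by label is an assumption, as is the use of scikit-learn's train_test_split and balanced_accuracy_score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded vacancies; the real data is far more imbalanced.
X, y = make_classification(n_samples=5000, n_features=50, weights=[0.95, 0.05],
                           random_state=0)

# One-fifth held out as test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Balanced accuracy is the unweighted mean of the per-class accuracy, so errors on
# the small AI-vacancy class weigh as heavily as errors on the large non-AI class.
print(balanced_accuracy_score(y_test, y_pred))
```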

Each ML classification model contains hyperparameters: parameters that determine the course of the model’s training process. The values of these parameters must therefore be determined before training, and they can have a major impact on the final results. The best set of hyperparameters was determined for each combination of m, encoding method, and ML classification model. For this project, we opted to select the hyperparameters based on a combination of randomised search and 10-fold cross-validation. We chose a larger or smaller number of iterations for the randomised search ({1000, 400, 200}), depending on the computing power required for the model. Please see the relevant pages on scikit-learn for more information on these two techniques.
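
A sketch of how randomised search with 10-fold cross-validation can be set up in scikit-learn for the regularised logit model is shown below; the parameter distributions are hypothetical, as the annex does not list the actual search grids.

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; the grids actually used in the project are not reproduced here.
param_distributions = {
    "C": loguniform(1e-3, 1e3),
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=200,                     # 1000, 400 or 200 depending on the model's cost
    cv=10,                          # 10-fold cross-validation
    scoring="balanced_accuracy",
    random_state=0,
)
# search.fit(X_train, y_train)
# print(search.best_params_)
```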

After the first round of training, we concluded that the PPMI encoding method consistently performed worse than TF-IDF. Additionally, the decision tree and random forest models consistently performed worse than the other ML classification options. We therefore did not test these methods in the subsequent steps.

T.2.5 Process of iteration

Based on the results of the first round, we selected the best-performing models. That set of models was then applied to 25,000 new unlabelled vacancies from the full dataset. We then manually labelled all vacancies identified as AI vacancies by at least one model, all vacancies assigned a probability score above 0.25 by at least one model, and all vacancies with ‘AI’, ‘ML’, ‘NLP’ or their fully written equivalents in the title. The resulting set consists of vacancies that the models identified as potential AI vacancies. Manually labelling this set provided insight into the types of vacancies that the models responded to. It also exposed several consistent errors in the models.
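
A sketch of how this selection of vacancies for manual labelling could be expressed follows; the column names, the toy scores and the 0.5 threshold used to stand in for ‘identified as an AI vacancy’ are illustrative assumptions.

```python
import pandas as pd

# Hypothetical table: one probability score per model plus the vacancy title.
scores = pd.DataFrame({
    "title": ["AI engineer", "machine learning specialist", "truck driver"],
    "model_a": [0.91, 0.40, 0.01],
    "model_b": [0.85, 0.20, 0.03],
})
score_cols = ["model_a", "model_b"]

classified_ai = (scores[score_cols] > 0.5).any(axis=1)   # identified as AI by >= 1 model
high_score = (scores[score_cols] > 0.25).any(axis=1)     # score above 0.25 for >= 1 model
pattern = (r"\b(?:ai|ml|nlp)\b|artificial intelligence|machine learning|"
           r"natural language processing")
ai_in_title = scores["title"].str.lower().str.contains(pattern)

to_label_manually = scores[classified_ai | high_score | ai_in_title]
```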

The results of this manual labelling were added to the training dataset. We also added several more examples of vacancies that were incorrectly classified in the previous round. By training the various model combinations on this new dataset, each model’s decision rule was calibrated in such a way that the aforementioned errors became much less common.

The process described in T.2.4 was performed a second time on the supplemented dataset, with two differences: the methods rejected earlier were not tested again, and variable m took the values {5, 10, 15, 20, 25, 30}. This new round of tests resulted in a new set of best-performing model combinations. Those models were then used to classify 25,000 new unlabelled vacancies, after which the potential AI vacancies discovered were once again labelled manually. Afterwards, the models were trained one final time on the supplemented training dataset. Based on this final test, coupled with the insights provided by the manual labelling, we concluded that the regularised logit model was the best-performing model. When combined with TF-IDF and w2v over the entire ad text, it had the highest balanced accuracy score. Compared to the other models, it also made few errors that were difficult to explain.

T.2.6 Compiling the logit ensemble

In order to create a more robust model, we opted to use an ensemble of logit models. Such an ensemble involves having multiple model combinations each make their own classifications, which are then combined into a single classification. For every ad text, each logit model output a probability score between 0 and 1. We decided to label a text as an AI vacancy if the average probability score of all models in the ensemble was greater than 0.5. If one specific model makes an error that is not replicated by the other models, that error is less likely to be adopted into the final classification when the average is used.
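
A sketch of this averaging rule, assuming a list of fitted logit models with matching encoders (each trained with its own m value and training dataset), is shown below; the function structure is illustrative, not the project's code.

```python
import numpy as np

def classify_ensemble(models, encoders, texts, threshold=0.5):
    """Average the AI-probability of each logit model in the ensemble and apply the threshold."""
    scores = np.column_stack([
        model.predict_proba(encoder.transform(texts))[:, 1]   # P(AI vacancy) per model
        for model, encoder in zip(models, encoders)
    ])
    mean_score = scores.mean(axis=1)
    return mean_score > threshold, mean_score
```

Because each model is paired with its own encoder, a single outlying score is diluted by the other models before the 0.5 threshold is applied.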

The best-performing ensemble consisted of four regularised logit models that all used TF-IDF and w2v across the entire ad text. Each model was trained on a unique training dataset, and each model had its own m value. The various training datasets we used were the initial dataset, the dataset after the first iteration, the dataset after the second iteration, and that same dataset with an additional 200 randomly selected non-AI vacancies.

T.2.7 Applying the model to the full dataset

To be able to apply a model to a new, unlabelled vacancy, the vacancy first needs to be encoded. Afterwards, it is a simple matter of applying the decision rule of the best-performing model; in this case, the logit ensemble trained on different versions of the training dataset. This is a time-consuming process, as a large number of vacancies needs to be encoded. The end result of the classification is a dataset containing all the vacancies that the model identifies as AI vacancies; consequently, all vacancies not in that dataset are identified as non-AI vacancies. The dataset containing AI vacancies can then be linked to relevant available information on those vacancies, which can be used to generate statistics about the characteristics of AI vacancies.

1) The first configuration has two layers, each containing 100 perceptrons; the second configuration has only a single layer containing (2/3)*(f+2) perceptrons, where f is the number of features in the input (which depends on the encoding method and m).