Hey Neo, can you find a French restaurant near me?
How to train an intent classification model?
Natural language understanding is the key for chatbots to comprehend user requests. To classify the question “Where is the restaurant?” with the correct intent, “Navigation”, an intuitive approach is to train a model on a sufficient amount of paired request-intent data. This paired data allows the model to learn to produce the most probable intent for a request based on past examples. In this post, we introduce the training process for an intent classification task.
How does the model learn?
During training, a machine learning model extracts information from the historical data and stores it in the model weights (i.e., parameters). Training searches for the best set of weights by minimizing the average loss over the training data, as measured by a defined loss function. The learning process is therefore cast as an “optimization” problem, and an algorithm is used to navigate the possible sets of weights so that the model makes good predictions.
It is important to point out that a good model should generalize well. The goal of training the model is to apply it to classify new requests from users that are likely unseen in the historical data. Thus, we not only want the model to perform well on the training data, we also want it to make good predictions when it sees new inputs. Therefore, the model performance needs to be tested on a different set of data that are unseen during training.
The first step of training a model is to divide the dataset into a training set and a test set. We always train candidate models (learning weights and choosing hyperparameters) on the training set and measure model performance on the test set. Note that the test set should never be involved in the training process. Hence, in order to tune the hyperparameters of the candidate models, the training set is further divided into a training set and a validation set, so that the hyperparameters can be selected based on the loss on the validation set. Next, we briefly introduce the key components of model training: hyperparameters, loss functions, and optimization algorithms.
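A minimal sketch of such a three-way split is shown here, using only the Python standard library. The function name `split_dataset`, the 80/10/10 proportions, and the toy request-intent pairs are assumptions made for illustration; the post does not prescribe specific proportions.

```python
import random

def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut the data into train / validation / test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # held out, never used for training
    val = shuffled[n_test:n_test + n_val]    # used to select hyperparameters
    train = shuffled[n_test + n_val:]        # used to learn the weights
    return train, val, test

# Toy request-intent pairs, made up for illustration.
data = [(f"request {i}", "Navigation" if i % 2 else "Weather") for i in range(100)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

Because the three lists are cut from one shuffled copy, no example can appear in more than one set.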
Hyperparameters for intent classification
Intent classification involves extracting features from text with an encoder and then assigning the text to a class based on these features with a classifier (e.g., logistic regression). There are two main methods to build such models: the feature-based approach and transformer-based fine-tuning. Both approaches require hyperparameter tuning to help the model learn its parameters adequately.
How to extract features - encoder
Since the encoder is involved in the training process, we have to choose which encoder (i.e., language model) to use and for which language. In a feature-based approach, for instance, we can apply the fastText language model to encode our text data. Pretrained fastText word embeddings are available in 294 languages and can be downloaded directly from the fastText website [1]. For the fine-tuning approach, there are several transformer-based pretrained models available on HuggingFace [2]. After deciding which language model to use, we can start fine-tuning the model on the intent classification task. Note that the model weights in the layers of these pretrained transformer-based models can also be used as input word embeddings to train classifiers via feature-based methods.
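To make the feature-based idea concrete, here is a small sketch of an encoder that averages word vectors into one feature vector per request. The 3-dimensional vectors in `TOY_VECTORS` are invented for illustration and merely stand in for pretrained embeddings; averaging is one simple pooling choice assumed here, not necessarily what a given library does internally.

```python
# Made-up 3-dimensional word vectors; pretrained embeddings would be loaded
# from a file and have many more dimensions.
TOY_VECTORS = {
    "where": [0.1, 0.8, 0.0],
    "is": [0.0, 0.1, 0.1],
    "the": [0.0, 0.0, 0.1],
    "restaurant": [0.9, 0.2, 0.1],
}

def encode(text, vectors, dim=3):
    """Return the average of the known word vectors in the text."""
    tokens = text.lower().replace("?", "").split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

features = encode("Where is the restaurant?", TOY_VECTORS)
print(features)
```

The resulting `features` list is what a classifier such as logistic regression would receive as input.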
What to minimize - loss function
To minimize the loss, we first need to define a loss function so that the model knows what to minimize. The loss function measures how far the prediction is from the true class label. The standard loss function for intent classification tasks is the cross-entropy loss. We can think of the model output as a probability distribution: one probability for each possible intent. The higher the predicted probability assigned to the true label, the lower the cross-entropy loss.
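The relationship between predicted probability and loss can be sketched in a few lines. The intent names and raw scores are made up for illustration, and the softmax step is an assumption about how the scores are turned into probabilities.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution over intents."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Cross-entropy for one example: negative log-probability of the true label."""
    return -math.log(probs[true_index])

intents = ["Navigation", "Weather", "Music"]
probs = softmax([2.0, 0.5, 0.1])  # the model favors "Navigation"

loss_if_navigation = cross_entropy(probs, 0)  # true label is the favored intent
loss_if_music = cross_entropy(probs, 2)       # true label got little probability
print(loss_if_navigation, loss_if_music)
```

The first loss is the smaller of the two, because the model put most of its probability on "Navigation".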
How to approach the local minimum - optimizer and learning rate
Imagine you’re up on a hill and want to go down to the bottom. You will first observe which direction seems to be going downhill. You’ll then take a few steps in that direction, observe again, take a few steps in the (new) direction, and repeat this process until you are at the bottom. This is essentially how gradient descent, a widely applied optimizer, works: it finds the direction of descent and moves the model weights in that direction with a certain step size (i.e., the learning rate). If the learning rate is too small, the model moves relatively slowly towards the local minimum; if it is too large, the model could “overshoot” and miss the local minimum. In the gradient descent algorithm, the learning rate is fixed for all parameters. Adaptive learning rate optimizers, on the other hand, adapt the learning rate to each parameter. Adam [3], for example, is a popular optimizer in this category.
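The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the one-parameter loss (w - 3)^2, which is chosen purely for illustration; the three learning rates are likewise arbitrary example values.

```python
def gradient_descent(learning_rate, start=0.0, steps=50):
    """Minimize the toy loss (w - 3)^2, whose minimum is at w = 3."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)            # derivative of (w - 3)^2
        w = w - learning_rate * gradient  # step against the gradient
    return w

w_good = gradient_descent(learning_rate=0.1)     # ends very close to 3
w_slow = gradient_descent(learning_rate=0.001)   # still far from 3 after 50 steps
w_large = gradient_descent(learning_rate=1.1)    # overshoots further on every step
print(w_good, w_slow, w_large)
```

With the largest rate, each step jumps past the minimum and lands further away than before, which is the "overshoot" described above.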
When to stop the training process - batch size, epoch, and early stopping
The model approaches the local minimum by iteratively updating the parameters. At each iteration, the model takes a subset of the training examples and updates the weights. The size of the subset is determined by the batch size. The entire training set is divided into batches based on the batch size, and an epoch is complete when all batches have been seen by the model once. Thus, the number of epochs tells the model how many times to pass over the entire training set. This hyperparameter also controls the training time of the model.
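The bookkeeping of batches and epochs can be sketched as a pair of nested loops. The dataset size, batch size, and number of epochs are arbitrary example values, and the weight update itself is replaced by a counter.

```python
import math

def iterate_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

train_examples = list(range(1000))  # stand-in for 1000 training examples
batch_size = 32
num_epochs = 3

updates = 0
for epoch in range(num_epochs):                 # one epoch = one full pass
    for batch in iterate_batches(train_examples, batch_size):
        updates += 1                            # placeholder for one weight update

batches_per_epoch = math.ceil(len(train_examples) / batch_size)
print(batches_per_epoch, updates)
```

The last batch of each epoch is smaller than the others whenever the dataset size is not a multiple of the batch size.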
Fixing the training time (i.e., the number of epochs) is sometimes risky. As we mentioned previously, the model needs to generalize well on unseen data. If the number of epochs is too large, the model will likely reach the local minimum well before the last epoch, and continuing to train leads to overfitting the training data. Therefore, we often set an early stopping criterion to prevent the model from overfitting. One common early stopping criterion is based on comparing the training loss with the validation loss. When the validation loss starts rising, the current model weights are no longer suitable for unseen data, which is a sign of overfitting. Thus, we can set the early stopping criterion as patience = 2, meaning the training process stops once the validation loss has failed to improve for 2 consecutive epochs.
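One way to implement a patience-based criterion is sketched below. It assumes that "patience" counts consecutive epochs without a new best validation loss, which is a common convention but only one possible reading; the list of validation losses is invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop early
    return len(val_losses) - 1, best_epoch  # never triggered: ran all epochs

# Validation loss falls for three epochs (0, 1, 2), then starts rising.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```

In practice the weights from `best_epoch` are usually the ones kept, since they gave the lowest validation loss.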
In summary, the model training process for intent classification includes learning the model parameters and tuning the hyperparameters that affect the learning process. Hyperparameter tuning is done by searching over sets of possible values and choosing the best set based on the validation loss. If the hyperparameters are not tuned properly, the model could yield poor results. After the training process is completed, the model is ready to predict the intents of new requests.
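The search over hyperparameter values can be sketched as a simple grid search. Note that `train_and_validate` is a hypothetical placeholder with a made-up formula: a real version would train a model with the given settings and return its loss on the validation set. The candidate values are also arbitrary examples.

```python
import itertools
import math

def train_and_validate(learning_rate, batch_size):
    """Hypothetical placeholder for 'train a model, return its validation loss'.

    The formula is invented so that the example runs; it is not a real result.
    """
    return (math.log10(learning_rate) + 3) ** 2 + abs(batch_size - 32) / 100

learning_rates = [1e-4, 1e-3, 1e-2]
batch_sizes = [16, 32, 64]

# Try every combination and keep the one with the lowest validation loss.
best_config = min(
    itertools.product(learning_rates, batch_sizes),
    key=lambda config: train_and_validate(*config),
)
print(best_config)
```

The test set plays no role in this search; it is only used once, at the end, to measure the selected model.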
[1] Pretrained fastText embeddings are available in multiple languages: https://fasttext.cc/docs/en/pretrained-vectors.html
[2] Pretrained transformer-based language models are available on HuggingFace: https://huggingface.co/models
[3] Diederik Kingma and Jimmy Ba. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).