I regularly scan and read articles, such as those published as part of F# Advent. I bookmark the ones I find interesting or want to save for later using Pocket. One of its neat features is tagging: you can add as many tags as you want to organize bookmarked links. When I first started using the service, I was fairly good at adding tags. However, I’ve gotten lazy and don’t do it as much. It would be nice if bookmarked links could be categorized automatically, without my having to provide the tags manually. Machine learning can automate this task. In this writeup, I will show how to use ML.NET, an open-source, cross-platform machine learning framework for .NET, to build a model that automatically categorizes web links and articles.
This application was built on a Windows 10 PC, but should work cross-platform.
Make a new directory and create a solution by using the .NET CLI.
Then, create an F# Console application.
dotnet new console -o FsAdvent2019 -lang f#
Navigate to the console application directory and install the Microsoft.ML NuGet package.
The data contains information about several articles that are separated into four categories: business (b), science and technology (t), entertainment (e) and health (h). Visit the UCI Machine Learning repository website to learn more about the dataset.
Below is a sample of the data.
ID Title Url Publisher Category Story Hostname Timestamp
Inside the console application directory, create a new directory called data and copy the newsCorpora.csv file to it.
Open the Program.fs file and add the following open statements at the top.
Directly below the open statements, define the data schema of the input and output of the machine learning model as records called ModelInput and ModelOutput.
As input, only the Title, Url, Publisher and Hostname columns are used to train the machine learning model and make predictions. The label or value to predict in this case is the Category. When a prediction is output by the model, its value is stored in a column called PredictedLabel.
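A minimal sketch of the two records follows. The LoadColumn indices are an assumption: they are zero-based positions inferred from the sample header shown earlier (ID, Title, Url, Publisher, Category, Story, Hostname, Timestamp), not taken from the original listing. The snippet also assumes the Microsoft.ML package is referenced.

```fsharp
open Microsoft.ML.Data

// Input schema. LoadColumn indices are zero-based column positions in
// newsCorpora.csv, assumed from the sample header (ID is column 0).
[<CLIMutable>]
type ModelInput =
    { [<LoadColumn(1)>] Title: string
      [<LoadColumn(2)>] Url: string
      [<LoadColumn(3)>] Publisher: string
      [<LoadColumn(4)>] Category: string
      [<LoadColumn(6)>] Hostname: string }

// Output schema. Only the predicted category is read back from the model.
[<CLIMutable>]
type ModelOutput =
    { PredictedLabel: string }
```

The CLIMutable attribute gives each record a default constructor and settable properties, which ML.NET relies on when it creates instances of these types.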
MLContext is the entry point of all ML.NET applications. It ties together tasks such as data loading, data transformations, model training, model evaluation, and model saving/loading.
Inside of the main function, create an instance of MLContext.
let mlContext = MLContext()
Once MLContext is initialized, use the LoadFromTextFile function and provide the path to the file containing the data.
let data = mlContext.Data.LoadFromTextFile<ModelInput>("data/newsCorpora.csv")
It’s often good practice to split the data into train and test sets. The goal of a machine learning model is to accurately make predictions on data it has not seen before. Therefore, making predictions using inputs that are the same as those it was trained on may provide misleading accuracy metrics.
Use TrainTestSplit to split the data into train and test sets, with 90% of the data used for training and 10% used for testing.
let datasets = mlContext.Data.TrainTestSplit(data,testFraction=0.1)
Now that the data is split, define the set of transformations to be applied to the data. The purpose of transforming the data is to convert it into numbers which are more easily processed by machine learning algorithms.
The preprocessing pipeline contains the series of transformations that take place before training the model. To create a pipeline, initialize an EstimatorChain and append the desired transformations to it.
let preProcessingPipeline =
In this preprocessing pipeline, the following transformations are taking place:
- Convert the Title, Url, Publisher and Hostname columns into numbers and store the transformed value into the FeaturizedTitle, FeaturizedUrl, FeaturizedPublisher and FeaturizedHost columns respectively.
- Combine the FeaturizedTitle, FeaturizedUrl, FeaturizedPublisher and FeaturizedHost into one column called Features.
- Create a mapping of the text value contained in the Category column to a numerical key and store the result into a new column called Label.
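The three transformations listed above can be sketched as follows. This is a hedged reconstruction rather than the original listing: it assumes FeaturizeText, Concatenate and MapValueToKey are the ML.NET transforms behind the three steps, and it spells out the EstimatorChain type argument explicitly so the starting type is unambiguous.

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

let mlContext = MLContext()

// 1. Featurize each text column into a numeric vector.
// 2. Concatenate the four vectors into a single Features column.
// 3. Map the text in Category to a numeric key stored in Label.
let preProcessingPipeline =
    EstimatorChain<ITransformer>()
        .Append(mlContext.Transforms.Text.FeaturizeText("FeaturizedTitle", "Title"))
        .Append(mlContext.Transforms.Text.FeaturizeText("FeaturizedUrl", "Url"))
        .Append(mlContext.Transforms.Text.FeaturizeText("FeaturizedPublisher", "Publisher"))
        .Append(mlContext.Transforms.Text.FeaturizeText("FeaturizedHost", "Hostname"))
        .Append(mlContext.Transforms.Concatenate("Features", [| "FeaturizedTitle"; "FeaturizedUrl"; "FeaturizedPublisher"; "FeaturizedHost" |]))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Category"))
```

Nothing is computed at this point. The chain only describes the transformations, which run when the pipeline is fitted to data.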
The algorithm pipeline contains the algorithm used to train the machine learning model. In this application, the multiclass classification algorithm used is LbfgsMaximumEntropy. To learn more about the algorithm, see the ML.NET LbfgsMaximumEntropy multiclass trainer API documentation.
let algorithm =
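The body of this definition is not shown here. A minimal sketch, assuming the trainer reads the Label and Features columns produced by the preprocessing step and keeps its default hyperparameters:

```fsharp
open Microsoft.ML

let mlContext = MLContext()

// Label and Features are assumed to be the column names created by the
// preprocessing pipeline; all other trainer settings are left at their defaults.
let algorithm =
    mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features")
```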
The postprocessing pipeline contains the series of transformations that put the output of training into a more readable format. The only transformation performed in this pipeline is mapping the numerical key of the predicted value back to its text form.
let postProcessingPipeline =
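The body of this definition is not shown here either. A minimal sketch, assuming MapKeyToValue is the transform used and that it converts the PredictedLabel column in place:

```fsharp
open Microsoft.ML

let mlContext = MLContext()

// Convert the numeric key in PredictedLabel back to the original category text.
// With no input column given, the same column is used for input and output.
let postProcessingPipeline =
    mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel")
```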
Once the pipelines are defined, combine them into a single pipeline which applies all of the transformations to the data with a single function call.
let trainingPipeline =
Use the Fit function to train the model by applying the set of transformations defined by trainingPipeline to the training dataset.
let model =
Once the model is trained, evaluate how well it performs against the test dataset. First, use the trained model to get the predicted category by using the Transform function. Then, pass the test dataset containing the predictions to the Evaluate function, which calculates the model’s performance metrics by comparing the predicted category to the actual category. Finally, print some of the metrics out.
let metrics =
Create a list of ModelInput items and use the Transform method to get the predicted category.
let predictions =
Then, create a sequence of ModelOutput values and print out the PredictedLabel values.
The final Program.fs file should look as follows:
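The complete listing is not reproduced here. What follows is a hedged reconstruction assembled from the steps described in this writeup, not the original file. The LoadColumn indices are inferred from the sample header, the EstimatorChain type argument is written out explicitly, and the input used for the prediction is a made-up placeholder rather than one of the links from the original post.

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

[<CLIMutable>]
type ModelInput =
    { [<LoadColumn(1)>] Title: string
      [<LoadColumn(2)>] Url: string
      [<LoadColumn(3)>] Publisher: string
      [<LoadColumn(4)>] Category: string
      [<LoadColumn(6)>] Hostname: string }

[<CLIMutable>]
type ModelOutput =
    { PredictedLabel: string }

[<EntryPoint>]
let main (_argv: string[]) =
    let mlContext = MLContext()

    // Load the data and hold out 10% of it for testing.
    let data = mlContext.Data.LoadFromTextFile<ModelInput>("data/newsCorpora.csv")
    let datasets = mlContext.Data.TrainTestSplit(data, testFraction = 0.1)

    // Featurize the text columns, combine them, and map Category to a key.
    let preProcessingPipeline =
        EstimatorChain<ITransformer>()
            .Append(mlContext.Transforms.Text.FeaturizeText("FeaturizedTitle", "Title"))
            .Append(mlContext.Transforms.Text.FeaturizeText("FeaturizedUrl", "Url"))
            .Append(mlContext.Transforms.Text.FeaturizeText("FeaturizedPublisher", "Publisher"))
            .Append(mlContext.Transforms.Text.FeaturizeText("FeaturizedHost", "Hostname"))
            .Append(mlContext.Transforms.Concatenate("Features", [| "FeaturizedTitle"; "FeaturizedUrl"; "FeaturizedPublisher"; "FeaturizedHost" |]))
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Category"))

    let algorithm =
        mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features")

    let postProcessingPipeline =
        mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel")

    // Preprocessing, training and postprocessing as one pipeline.
    let trainingPipeline =
        preProcessingPipeline.Append(algorithm).Append(postProcessingPipeline)

    // Train on the training set, then score and evaluate the test set.
    let model = trainingPipeline.Fit(datasets.TrainSet)

    let metrics =
        mlContext.MulticlassClassification.Evaluate(model.Transform(datasets.TestSet))

    printfn "Log Loss: %f | MacroAccuracy: %f" metrics.LogLoss metrics.MacroAccuracy

    // Hypothetical input for illustration only.
    let inputs =
        [ { Title = "Example technology headline"
            Url = "https://example.com/tech-story"
            Publisher = "Example Publisher"
            Category = ""
            Hostname = "example.com" } ]

    let predictions =
        model.Transform(mlContext.Data.LoadFromEnumerable<ModelInput>(inputs))

    mlContext.Data.CreateEnumerable<ModelOutput>(predictions, reuseRowObject = false)
    |> Seq.iter (fun p -> printfn "Predicted Value: %s" p.PredictedLabel)

    0
```

Running it requires the Microsoft.ML package and the newsCorpora.csv file in the data directory, as set up earlier.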
This particular model achieved a macro-accuracy of about 0.93, where closer to 1 is better, and a log loss of about 0.20, where closer to 0 is better.
Log Loss: 0.200502 | MacroAccuracy: 0.927742
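To make the two metrics concrete, here is a small sketch of how they are commonly defined. It is an illustration using plain F# lists and made-up values, not ML.NET's implementation, and it assumes the natural logarithm for log loss. The function names macroAccuracy and logLoss are hypothetical helpers.

```fsharp
// Macro-accuracy: compute the accuracy separately for each actual class,
// then average those per-class accuracies so every class counts equally.
let macroAccuracy (actual: string list) (predicted: string list) =
    List.zip actual predicted
    |> List.groupBy fst
    |> List.averageBy (fun (_, pairs) ->
        let correct = pairs |> List.filter (fun (a, p) -> a = p) |> List.length
        float correct / float (List.length pairs))

// Log loss: the average of -ln(p), where p is the probability the model
// assigned to the true class of each example.
let logLoss (trueClassProbabilities: float list) =
    trueClassProbabilities |> List.averageBy (fun p -> -(log p))

// Hypothetical labels and probabilities, for illustration only.
let actual = [ "b"; "b"; "t"; "e" ]
let predicted = [ "b"; "t"; "t"; "e" ]

printfn "MacroAccuracy: %f" (macroAccuracy actual predicted)
printfn "Log Loss: %f" (logLoss [ 0.9; 0.4; 0.8; 0.95 ])
```

In this toy example, category b is predicted correctly once out of two times while t and e are always correct, so the macro-accuracy is (0.5 + 1 + 1) / 3, roughly 0.83. A confident wrong prediction drives the log loss up, which is why it complements accuracy.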
The predicted values are the following:
Predicted Value: t
Upon inspection, they appear to be correct: science and technology for the first link and business for the second link.
In this writeup, I showed how to build a machine learning multiclass classification model that categorizes web links using ML.NET. Now that you have a model trained, you can save it and deploy it in another application (desktop, web) that bookmarks links. This model can be further improved and personalized by using data from Pocket which has already been tagged. Happy coding!