As machine learning matures, the field is starting to adopt practices that have been common in software development for some time, such as application lifecycle management. One of the challenges application lifecycle management addresses in machine learning is reproducibility.
Machine learning is extremely experimental in nature. Therefore, in order to find the “best” model, various algorithms and hyper-parameters need to be tested. This can sometimes be a manual process. At Build 2019, Automated ML was announced for ML.NET. In addition to automating the training process, this framework will try to find the best model by iterating over various algorithms and hyper-parameters until it finds the “best” model based on the selected optimization metric. The output will consist of results for the best run along with results for all other runs. These runs contain performance metrics, learned model parameters, the training pipeline used and the trained model for the respective run. This information can then be used for auditing purposes as well as to reproduce results.
The results from running Automated ML can be persisted locally or in a database. Another option is MLFlow, an open source machine learning lifecycle management platform launched in 2018. Since its announcement, MLFlow has seen adoption throughout the industry, and most recently Microsoft announced native support for it inside of Azure ML. Although MLFlow does not natively support .NET, it has a REST API that allows it to be extended to languages it does not support natively. This means that if MLFlow has already been adopted in Python or R applications across your enterprise or projects, you can use the REST API to integrate it into your ML.NET applications as well.
In this writeup, I will go over how to automatically build an ML.NET classification model that predicts iris flower types using Automated ML and then integrate MLFlow to log the results generated by the best run. The code for this sample can be found on GitHub.
This project was built on an Ubuntu 18.04 PC but should work on Windows and Mac. Note that MLFlow does not natively run on Windows at the time of this writing. To run it on Windows use Windows Subsystem for Linux (WSL).
First we’ll start off by creating a solution for our project. In the terminal enter the following commands:
mkdir MLNETMLFlowSample && cd MLNETMLFlowSample
dotnet new sln
Once the solution is created, from the root solution directory, enter the following commands into the terminal to create a console application.
dotnet new console -o MLNETMLFlowSampleConsole
Then, navigate into the newly created console application folder.
cd MLNETMLFlowSampleConsole
For this solution, you’ll need the following NuGet packages:
From the console application directory, enter the following commands into the terminal to install the packages.
dotnet add package Microsoft.ML
The data used in this sample comes from the UCI Machine Learning Repository and looks like the data below:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
First, create a directory for the data inside the console application directory:
mkdir Data
Then, download and save the file into the Data directory:
curl -o Data/iris.data https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Finally, make sure that the data file is copied into your output directory. To do so, add a Content item for Data/iris.data, with CopyToOutputDirectory set to PreserveNewest (or Always), inside the Project element of your console application's csproj file.
MLFlow.NET requires two settings: the base URL that the MLFlow server listens on and the API endpoint. In this case, since the server will be running locally, the base URL is http://localhost:5000. In the console application directory, create a file called appsettings.json and add the following content:
To make sure that your appsettings.json file is also copied into your output directory, add a matching Content entry for it to your csproj file, under the Content tags that include the iris.data file.
In this sample, the MLFlow.NET service is registered and used via dependency injection. However, in order to use dependency injection in our console application, it first needs to be configured. In the console application directory, create a new file called Startup.cs and add the following code to it:
The Startup class loads configuration settings from the appsettings.json file in its constructor. Then, the ConfigureServices method registers the MLFlow.NET service and configures it using the settings defined in the appsettings.json file. Once this is set up, the service can be used throughout the application.
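The pattern described above can be sketched without relying on MLFlow.NET's own registration API, which is not reproduced here. The sketch below uses only the Microsoft.Extensions configuration and dependency injection packages; the TrackingServerSettings class and its section name are hypothetical stand-ins for whatever settings type MLFlow.NET actually expects, so treat this as an illustration of the wiring rather than the sample's exact code.

```csharp
using System.IO;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical settings class used only in this sketch.
// MLFlow.NET ships its own configuration type, which is not shown here.
public class TrackingServerSettings
{
    public string BaseUrl { get; set; }
    public string ApiBase { get; set; }
}

public class Startup
{
    private readonly IConfiguration _configuration;

    public Startup()
    {
        // Load appsettings.json from the current working directory.
        _configuration = new ConfigurationBuilder()
            .SetBasePath(Directory.GetCurrentDirectory())
            .AddJsonFile("appsettings.json", optional: false)
            .Build();
    }

    public void ConfigureServices(IServiceCollection services)
    {
        // Bind the JSON section named after the settings class to that class.
        services.Configure<TrackingServerSettings>(
            _configuration.GetSection(nameof(TrackingServerSettings)));

        // The MLFlow.NET service registration would go here (not shown).
    }
}
```

This assumes the Microsoft.Extensions.Configuration.Json, Microsoft.Extensions.Options.ConfigurationExtensions and Microsoft.Extensions.DependencyInjection packages are installed. A caller would typically create a ServiceCollection, pass it to ConfigureServices, and then call BuildServiceProvider to resolve the registered services.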
When working with ML.NET, it often helps to create data models or classes that define the data’s schema.
For this sample, there are four columns with float values which will be the input data or features. The last column is the type of iris flower which will be used as the label or the value to predict.
First, create a new directory called Domain inside the console application directory to store the data model classes.
Inside the Domain directory, create a new file called IrisData.cs and add the following contents to the file:
Using attributes in the schema, we define two properties: Features and IrisType. Data from columns in positions 0-3 will be loaded as a float vector of size four into the Features property. ML.NET will then reference that column by the name Features. Data in the last column will be loaded into the IrisType property and be referenced by the name Label. The reason for setting column names is that ML.NET algorithms have default column names, and renaming properties at the schema level removes the need to define the feature and label columns in the pipeline.
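A data model matching this description could look roughly like the sketch below. It is reconstructed from the explanation above rather than copied from the sample repository, so details such as attribute ordering are assumptions.

```csharp
using Microsoft.ML.Data;

public class IrisData
{
    // Columns 0-3 are loaded as one float vector of size four,
    // which ML.NET refers to by the default feature column name "Features".
    [LoadColumn(0, 3), VectorType(4), ColumnName("Features")]
    public float[] Features { get; set; }

    // Column 4 holds the iris type and is exposed under the
    // default label column name "Label".
    [LoadColumn(4), ColumnName("Label")]
    public string IrisType { get; set; }
}
```

Because the column names match ML.NET's defaults, the experiment can be executed without passing explicit feature or label column names.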
By default, running the application won't display progress information. However, a ProgressHandler object can be passed into the Execute method of an experiment, which calls the implemented Report method after each run.
Inside the console application directory, create a new file called ProgressHandler.cs and add the following code:
The ProgressHandler class implements the IProgress<T> interface, which requires the implementation of the Report method. The object passed into the Report method after each run is a RunDetail<MulticlassClassificationMetrics> object. Each time a run is complete, the Report method is called and the code inside it executes.
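As a rough illustration of the handler described above, the following sketch prints one line per completed run. The message format and the failure check are assumptions made for this example, not necessarily what the sample repository does.

```csharp
using System;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public class ProgressHandler : IProgress<RunDetail<MulticlassClassificationMetrics>>
{
    // Called by Automated ML each time a run finishes.
    public void Report(RunDetail<MulticlassClassificationMetrics> run)
    {
        // A failed run carries an exception and has no validation metrics.
        if (run.Exception != null || run.ValidationMetrics == null)
        {
            Console.WriteLine($"{run.TrainerName} failed: {run.Exception?.Message}");
            return;
        }

        Console.WriteLine(
            $"Ran {run.TrainerName} in {run.RuntimeInSeconds:0.##} " +
            $"with Log Loss {run.ValidationMetrics.LogLoss:0.####}");
    }
}
```

Guarding against a missing ValidationMetrics value keeps a single failed trainer from crashing the whole experiment loop.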
Open the Program.cs file and add the following using statements:
Then, inside of the Program class, define the _mlFlowService field:
private readonly static IMLFlowService _mlFlowService;
Directly after that, create a static constructor, which is where _mlFlowService will be instantiated.
Then, add a method called RunExperiment inside the Program class that contains the following code:
public static async Task RunExperiment()
The RunExperiment method does the following:
- Creates an MLContext object.
- Loads data from the iris.data file into an IDataView.
- Configures the experiment to run for 30 seconds and optimize the Log Loss metric.
- Creates a new Automated ML.NET experiment.
- Creates a new experiment in MLFlow to log information to.
- Runs the Automated ML.NET experiment and provides an instance of ProgressHandler to output progress to the console.
- Uses the LogRun method to log the results of the best run to MLFlow.
- Creates a directory inside the MLModels directory using the name of the experiment and saves the trained model inside it.
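The Automated ML portion of these steps can be sketched as follows. This is a minimal illustration, not the sample's actual method: it assumes the IrisData and ProgressHandler classes described in this writeup, leaves out the MLFlow calls entirely, and uses a hypothetical model.zip file name and AutoMLSketch class name.

```csharp
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public static class AutoMLSketch
{
    public static ExperimentResult<MulticlassClassificationMetrics> Run(string experimentName)
    {
        var mlContext = new MLContext();

        // Load the comma-separated iris file using the IrisData schema.
        IDataView data = mlContext.Data.LoadFromTextFile<IrisData>(
            "Data/iris.data", separatorChar: ',');

        // Run for 30 seconds and pick the best model by Log Loss.
        var settings = new MulticlassExperimentSettings
        {
            MaxExperimentTimeInSeconds = 30,
            OptimizingMetric = MulticlassClassificationMetric.LogLoss
        };

        var experiment = mlContext.Auto()
            .CreateMulticlassClassificationExperiment(settings);

        // Creating the MLFlow experiment would happen here (not shown).

        // Report is invoked on the handler after every completed run.
        ExperimentResult<MulticlassClassificationMetrics> result =
            experiment.Execute(data, progressHandler: new ProgressHandler());

        // Save the best model under MLModels/<experimentName>/.
        // The file name "model.zip" is an assumption of this sketch.
        string modelDirectory = Path.Combine("MLModels", experimentName);
        Directory.CreateDirectory(modelDirectory);
        mlContext.Model.Save(
            result.BestRun.Model, data.Schema,
            Path.Combine(modelDirectory, "model.zip"));

        return result;
    }
}
```

Returning the full ExperimentResult keeps both the best run and the details of every other run available for logging or auditing.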
After the RunExperiment method, create the LogRun method and add the following code to it:
static async Task LogRun(int experimentId, ExperimentResult<MulticlassClassificationMetrics> experimentResults)
The LogRun method takes in the experiment ID and results. Then, it does the following:
- Configures a local RunRequest object to log in MLFlow.
- Creates a run in MLFlow using the predefined configuration.
- Logs the name of the machine learning algorithm used to train the best model in MLFlow.
- Logs the number of seconds it took to train the model in MLFlow.
- Logs various performance metrics in MLFlow.
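To make these steps more concrete without guessing at MLFlow.NET's method signatures, the sketch below performs equivalent calls directly against the MLFlow REST API using HttpClient. This is a swapped-in technique, not the sample's code: the class and method names are hypothetical, the server address is assumed to be http://localhost:5000, and older server versions may expose the same endpoints under a preview path instead of api/2.0/mlflow/.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class MlflowRestSketch
{
    // Assumption: an MLFlow tracking server is listening locally on port 5000.
    private static readonly HttpClient Http =
        new HttpClient { BaseAddress = new Uri("http://localhost:5000/") };

    // Posts a JSON payload to an MLFlow endpoint and returns the response body.
    private static async Task<string> PostAsync(string endpoint, object payload)
    {
        var content = new StringContent(
            JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json");

        using (HttpResponseMessage response =
            await Http.PostAsync("api/2.0/mlflow/" + endpoint, content))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }

    public static async Task LogBestRunAsync(
        string experimentId, string trainerName, double runtimeInSeconds, double logLoss)
    {
        long now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();

        // Create the run and read its ID from the response.
        string created = await PostAsync("runs/create",
            new { experiment_id = experimentId, start_time = now });

        string runId;
        using (JsonDocument document = JsonDocument.Parse(created))
        {
            runId = document.RootElement
                .GetProperty("run").GetProperty("info").GetProperty("run_id").GetString();
        }

        // Log the algorithm name as a parameter.
        await PostAsync("runs/log-parameter",
            new { run_id = runId, key = "Algorithm", value = trainerName });

        // Log the training time and one performance metric.
        await PostAsync("runs/log-metric",
            new { run_id = runId, key = "RuntimeInSeconds", value = runtimeInSeconds, timestamp = now });
        await PostAsync("runs/log-metric",
            new { run_id = runId, key = "LogLoss", value = logLoss, timestamp = now });

        // Mark the run as finished.
        await PostAsync("runs/update",
            new { run_id = runId, status = "FINISHED", end_time = now });
    }
}
```

The method returns a Task so that the caller can await it and be sure logging has completed before the process exits.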
Finally, in the Main method of the Program class, call the RunExperiment method:
static async Task Main(string[] args)
Since the Main method in the Program class will be async, you need to use the latest version of C# (async Main requires C# 7.1 or later). To do so, set the LangVersion property to latest inside the PropertyGroup section of your csproj file.
In the terminal, from the console application directory, enter the following command to start the MLFlow Server:
mlflow server
Then, navigate to http://localhost:5000 in your browser. This will load the MLFlow UI.
Then, in another terminal, from the console application directory, enter the following command to run the experiment:
dotnet run
As the experiment is running, the progress handler should be posting output to the console after each run.
Ran AveragedPerceptronOva in 1.24 with Log Loss 0.253
Once the experiment finishes, navigate to http://localhost:5000 in your browser. You should then see the results of your experiment and runs there.
The detailed view of the best run for that experiment looks like the image below:
You’ll also notice that two directories were created inside the console application directory. One is an MLModels directory, inside of which a nested directory with the name of the experiment contains the trained model. The other is called mlruns and contains the results of the experiments logged in MLFlow.
In this writeup, we built an ML.NET classification model using Automated ML. Then, the results of the best run generated by Automated ML were logged to an MLFlow server for later analysis. Note that although some of these tools are still in their nascent stage, as open source software, the opportunities for extensibility and best practice implementations are there. Happy coding!