Navigating Your AutoML Journey: A Comprehensive Guide

Chapter 1: The Rise of AutoML

The conventional, hand-built approach to developing machine learning models is giving ground as AutoML takes center stage. AutoML has made machine learning accessible to business users with limited technical expertise, and data scientists increasingly use AutoML tools to generate models efficiently. These tools aim to produce a high-performing model tailored to your dataset. With AutoML, users can build models using both traditional methods and artificial neural networks (ANNs), construct data pipelines, and improve model accuracy through ensemble techniques.

Selecting the right AutoML tool is a significant challenge. This article aims to evaluate several tools, outlining their advantages and disadvantages to assist you in making an informed choice. This overview is particularly beneficial for beginners or non-data scientists seeking a swift introduction to the AutoML landscape.

How We Conducted Our Testing

While the AutoML process is fairly consistent across tools, each has distinct data preparation requirements and output formats. To highlight these differences, we used the same two datasets with every tool: one for regression and one for classification, both sourced from the UCI repository.

The regression dataset consists of 81 numerical features predicting the critical temperature (target) of a superconductor, containing 21,263 instances, making it suitable for regression tasks. Conversely, the classification dataset includes 41 attributes (molecular descriptors) to determine whether a chemical is biodegradable. This binary classification dataset contains 1,055 instances.

Both datasets are available in our GitHub repository, so you won’t need to upload them to your Google Drive. You can easily download them using wget for your project. Once downloaded, typical cleansing tasks like removing null values can be performed, although most AutoML tools include built-in data cleansing functions.

For each task, you need to extract the features and target and create training and testing datasets. For the classification task, I made sure the dataset was balanced, as some tools perform poorly on unbalanced data. Since these preparation steps are standard for model training, they are not detailed here; the project-specific data preparation code can be found in the associated downloadable projects.
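The preparation steps described above can be sketched as follows. This is a minimal illustration on a small synthetic table, not the code from the downloadable projects: the column names, the balance_classes helper, and the 80/20 split are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def balance_classes(df, target, seed=42):
    """Downsample every class to the size of the smallest class."""
    smallest = df[target].value_counts().min()
    return df.groupby(target).sample(n=smallest, random_state=seed)


# Synthetic stand-in for the classification data (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "label": [1] * 70 + [0] * 30,  # deliberately unbalanced
})

balanced = balance_classes(df, "label")

# Features are every column except the target.
X = balanced.drop(columns="label")
y = balanced["label"]

# Hold out 20% for validation, keeping the class ratio in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(y.value_counts().to_dict())
print(X_train.shape, X_val.shape)
```

Downsampling is only one way to balance a dataset; it discards rows, so on a small dataset oversampling or class weights may be a better fit.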

First, let's explore auto-sklearn.

Auto-sklearn Overview

Auto-sklearn is among the pioneers in the AutoML space, launched in November 2017, and it is built on the well-known sklearn machine learning library. The latest update was released in November 2021, indicating ongoing support.

To integrate this library into your project, execute the following commands:

sudo apt-get install build-essential swig

pip install auto-sklearn==0.14.3

Next, I will demonstrate its usage through the auto-regressor and classifier.

Auto-regressor Implementation

To employ the auto-regressor, use the following code snippet:

import autosklearn

from autosklearn.regression import AutoSklearnRegressor

model_auto_reg = AutoSklearnRegressor(time_left_for_this_task=10*60,

per_run_time_limit=30,

n_jobs=-1)

model_auto_reg.fit(X_train_regressor, label_train_regressor)

Because AutoML tools can take considerable time to find the best model, auto-sklearn lets you set time limits. The time_left_for_this_task parameter is the total budget, in seconds, for the whole search; we allocated 10 minutes (10*60). When the budget runs out, the search stops and returns the results of the evaluations completed so far.

The per_run_time_limit parameter caps a single model evaluation and is also given in seconds, so here it is set to 30 seconds. A run that exceeds this limit is terminated and reported among the algorithms that exceeded the time limit.

The n_jobs parameter, when set to -1, directs the machine to utilize all available cores.

After evaluating all algorithms, you can print the execution statistics:

print(model_auto_reg.sprint_statistics())

From my testing, I received the following results:

auto-sklearn results:

Dataset name: 646225b0-8422-11ec-8195-0242ac1c0002

Metric: r2

Best validation score: 0.909665

Number of target algorithm runs: 80

Number of successful target algorithm runs: 18

Number of crashed target algorithm runs: 33

Number of target algorithms that exceeded the time limit: 5

Number of target algorithms that exceeded the memory limit: 24

As you can see, of the 80 algorithm runs, 18 completed successfully, 33 crashed, 5 exceeded the time limit, and 24 exceeded the memory limit. The whole evaluation finished within the 10-minute budget.
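To check such a report programmatically, the text returned by sprint_statistics can be parsed into a dictionary. The sketch below is a hypothetical helper, not part of auto-sklearn; it assumes the report keeps the "key: value" line layout shown above, and the indentation in the sample string is a guess.

```python
def parse_statistics(text):
    """Turn 'key: value' lines into a dict, converting numbers where possible."""
    stats = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            continue  # skip the heading line and anything without a value
        try:
            stats[key.strip()] = int(value)
        except ValueError:
            try:
                stats[key.strip()] = float(value)
            except ValueError:
                stats[key.strip()] = value
    return stats


# Report text copied from the regression run above.
report = """
auto-sklearn results:
  Dataset name: 646225b0-8422-11ec-8195-0242ac1c0002
  Metric: r2
  Best validation score: 0.909665
  Number of target algorithm runs: 80
  Number of successful target algorithm runs: 18
  Number of crashed target algorithm runs: 33
  Number of target algorithms that exceeded the time limit: 5
  Number of target algorithms that exceeded the memory limit: 24
"""

stats = parse_statistics(report)
total = stats["Number of target algorithm runs"]
ok = stats["Number of successful target algorithm runs"]
failed = (stats["Number of crashed target algorithm runs"]
          + stats["Number of target algorithms that exceeded the time limit"]
          + stats["Number of target algorithms that exceeded the memory limit"])

print(f"successful: {ok}/{total} ({ok / total:.1%})")
print(f"failed:     {failed}/{total} ({failed / total:.1%})")
```

In this run the four outcome counts add up to the 80 total runs, and fewer than a quarter of the runs succeeded. With 24 runs hitting the memory limit, raising auto-sklearn's memory_limit parameter may be worth trying before extending the time budget.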

To view the final model, invoke the show_models method:

model_auto_reg.show_models()

This will display the ensemble model comprising the top-performing models. If you wish to fine-tune the model further, first check the error metrics:

y_pred_reg = model_auto_reg.predict(X_val_regressor)

error_metrics(y_pred_reg, label_val_regressor)

The output from my run was:

MSE: 89.62974521966439

RMSE: 9.467298728764419

Coefficient of determination: 0.9151071664787114
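The error_metrics helper called above is defined in the downloadable project and is not shown in this article. A minimal sketch of what such a helper might look like follows; the argument order (predictions first, labels second) is inferred from the calls in this article, and the rest is an assumption.

```python
import numpy as np


def error_metrics(y_pred, y_true):
    """Print and return MSE, RMSE and R² for a set of predictions."""
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    y_true = np.asarray(y_true, dtype=float).ravel()

    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))

    # R² = 1 - (residual sum of squares / total sum of squares)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    print("MSE:", mse)
    print("RMSE:", rmse)
    print("Coefficient of determination:", r2)
    return mse, rmse, r2


# Tiny illustrative call with made-up numbers.
mse, rmse, r2 = error_metrics([2.0, 5.0, 9.0], [3.0, 5.0, 7.0])
```

RMSE is the square root of MSE, which is why the two values reported above (89.63 and 9.47) move together.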

With an R² score exceeding 91%, further tuning may not be necessary. Now, let’s look at the auto-classifier.

Auto-classifier Usage

You can apply the auto-classifier using the following code snippet:

from autosklearn.classification import AutoSklearnClassifier

model_auto_class = AutoSklearnClassifier(time_left_for_this_task=10*60,

per_run_time_limit=30,

n_jobs=-1)

model_auto_class.fit(X_train_classifier, label_train_classifier)

print(model_auto_class.sprint_statistics())

The parameters are consistent with those used in the auto-regressor.

The statistics from my run were as follows:

auto-sklearn results:

Dataset name: fa958b64-8420-11ec-8195-0242ac1c0002

Metric: accuracy

Best validation score: 0.899729

Number of target algorithm runs: 81

Number of successful target algorithm runs: 59

Number of crashed target algorithm runs: 16

Number of target algorithms that exceeded the time limit: 3

Number of target algorithms that exceeded the memory limit: 3

From 81 algorithms, 59 were successful, with only 3 exceeding the time limit. The generated model can be checked with the show_models method, and the classification report can be printed with:

from sklearn.metrics import classification_report

y_pred_class = model_auto_class.predict(X_val_classifier)

print(classification_report(label_val_classifier, y_pred_class))

The output from my run was:

Classification report generated from auto-sklearn

The source code for this project is available in our GitHub repository.

Next, I will discuss how to use AutoKeras with both datasets.

AutoKeras Overview

AutoKeras takes a neural network approach to model development: it searches automatically for a suitable network architecture, including the number of layers and nodes.

Installation can be accomplished with the following command:

pip install autokeras

To begin the auto-regression process, I defined a callback for adjusting the learning rate during training:

from tensorflow.keras.callbacks import ReduceLROnPlateau

lr_reduction = ReduceLROnPlateau(monitor='mean_squared_error',

patience=1,

verbose=1,

factor=0.5,

min_lr=0.000001)
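With these settings, the callback halves the learning rate whenever the monitored metric fails to improve for one epoch, and never lets it fall below 0.000001. The simulation below is a simplified sketch of that rule, not the Keras implementation: it ignores the cooldown option, and the starting learning rate of 0.001 and the metric values are made up for illustration.

```python
def simulate_reduce_on_plateau(metric_history, initial_lr=1e-3, factor=0.5,
                               patience=1, min_lr=1e-6, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    lrs = []
    for value in metric_history:
        if value < best - min_delta:
            # The metric improved: remember it and reset the counter.
            best = value
            wait = 0
        else:
            # No improvement: after `patience` such epochs, shrink the rate.
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs


# Hypothetical per-epoch mean squared error values.
schedule = simulate_reduce_on_plateau([10.0, 8.0, 8.0, 8.0, 7.0])
print(schedule)
```

In this made-up history the metric stalls at epochs 3 and 4, so the rate is halved twice and then held once the metric improves again.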

You can apply the auto-regressor using:

from autokeras import StructuredDataRegressor

regressor = StructuredDataRegressor(max_trials=3,

loss='mean_absolute_error')

regressor.fit(x=X_train_regressor, y=label_train_regressor,

callbacks=[lr_reduction],

verbose=0, epochs=20)

As shown in the code, you simply need to specify the number of trials and the loss function for regression.

After the model has been trained, you can predict and print the error metrics:

MSE: 163.16712072898235

RMSE: 12.7736886109292

Coefficient of determination: 0.8515213997277571

For the classifier, use the following code:

from autokeras import StructuredDataClassifier

classifier = StructuredDataClassifier(max_trials=5, num_classes=2)

classifier.fit(x=X_train_classifier, y=label_train_classifier,

verbose=0, epochs=20)

After creating the model, predictions can be made, and the classification report can be printed:

Classification report generated from AutoKeras

The source code for this project is available in our GitHub repository.

Next, let’s explore TPOT.

TPOT Overview

TPOT uses genetic programming to search for a good scikit-learn pipeline. To install it, use:

pip install tpot

For the regression model, apply it as follows:

from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=1)

from tpot import TPOTRegressor

model_reg = TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error', cv=cv, verbosity=2, random_state=1, n_jobs=-1)

model_reg.fit(X_train_regressor, label_train_regressor)

The evaluation metrics from my test run were:

MSE: 78.55015022333929

RMSE: 8.862852262299045

Coefficient of determination: 0.9260222950313017

For the auto-classifier, the code is similar:

from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=2, random_state=1)

from tpot import TPOTClassifier

model_class = TPOTClassifier(generations=3, population_size=50, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)

model_class.fit(X_train_classifier, label_train_classifier)

The classification report from my test run:

Classification report generated from TPOT
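The generations and population_size settings used above largely determine how long a TPOT run takes. According to the TPOT documentation, a run evaluates population_size + generations × offspring_size pipelines, where offspring_size defaults to population_size, and each pipeline is fitted once per cross-validation split. The helper below is a hypothetical back-of-the-envelope estimate, not a TPOT function.

```python
def tpot_budget(generations, population_size, n_splits, n_repeats,
                offspring_size=None):
    """Estimate the pipelines evaluated and model fits for a TPOT run."""
    if offspring_size is None:
        offspring_size = population_size  # TPOT's documented default
    pipelines = population_size + generations * offspring_size
    # RepeatedKFold and RepeatedStratifiedKFold yield n_splits * n_repeats folds.
    fits = pipelines * n_splits * n_repeats
    return pipelines, fits


# Settings from the regressor and the classifier above.
reg_pipelines, reg_fits = tpot_budget(generations=5, population_size=50,
                                      n_splits=2, n_repeats=2)
clf_pipelines, clf_fits = tpot_budget(generations=3, population_size=50,
                                      n_splits=2, n_repeats=2)

print("regressor: ", reg_pipelines, "pipelines,", reg_fits, "fits")
print("classifier:", clf_pipelines, "pipelines,", clf_fits, "fits")
```

Treat these figures as rough upper bounds; the actual number of fits can be lower, for example when pipelines fail or are skipped.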

The source code for this project is available in our GitHub repository.

Next, let’s discuss MLBox.

MLBox Overview

MLBox requires additional data preparation compared to previous tools. It uses CSV files for both training and testing datasets. Here’s how to prepare the data for regression:

import pandas as pd

from sklearn.model_selection import train_test_split

regressor_df = pd.read_csv('/content/superconductors.csv')

features_regressor = regressor_df.iloc[:,:-1]

label_regressor = regressor_df.iloc[:,-1]

X_train_regressor, X_test_regressor, label_train_regressor, label_test_regressor = train_test_split(features_regressor, label_regressor, test_size=0.2, random_state=42)
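MLBox reads its input from files, so the split has to be written to disk before training. The sketch below shows one way to do that with pandas, using tiny made-up frames in place of the real split and a temporary directory as an assumed output location. The file names match those passed to the Reader in this article. MLBox treats a file that lacks the target column as test data, so the testing file contains features only.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-ins for the split produced by train_test_split (made-up values).
X_train_regressor = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [0.1, 0.2, 0.3]})
label_train_regressor = pd.Series([10.0, 20.0, 30.0], name="critical_temp")
X_test_regressor = pd.DataFrame({"f1": [4.0], "f2": [0.4]})

out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "training1_file.csv")
test_path = os.path.join(out_dir, "testing1_file.csv")

# Training file: features plus the target column.
train_df = pd.concat([X_train_regressor, label_train_regressor], axis=1)
train_df.to_csv(train_path, index=False)

# Testing file: features only, with no target column.
X_test_regressor.to_csv(test_path, index=False)

train_columns = pd.read_csv(train_path).columns.tolist()
test_columns = pd.read_csv(test_path).columns.tolist()
print(train_columns)
print(test_columns)
```

Keep the held-out labels (label_test_regressor) in memory so the predictions can be scored later.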

Once the training and testing CSV files are created, MLBox's Reader loads them and separates out the target column:

from mlbox.preprocessing import Reader

paths=['training1_file.csv', 'testing1_file.csv']

rd = Reader(sep = ',')

df = rd.train_test_split(paths, target_name='critical_temp')

The Drift_thresholder step identifies and removes ID columns and variables whose distributions drift between the training and testing data:

from mlbox.preprocessing import Drift_thresholder

dft = Drift_thresholder()

df = dft.fit_transform(df)

The output indicates that no variables were dropped in this run.

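After defining the hyperparameter ranges in a dictionary (space in the code), you can optimize them; the third argument, 20, is the maximum number of evaluations: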

from mlbox.optimization import Optimiser

opt=Optimiser(n_folds=3)

best=opt.optimise(space,df,20)
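The space dictionary passed to optimise is defined in the downloadable project and not reproduced here. The sketch below shows the general shape MLBox expects, following the pattern in the MLBox documentation: keys are prefixed by pipeline step (ne for the missing-value encoder, ce for the categorical encoder, fs for feature selection, est for the estimator), and each value names a search type and its candidate values. The specific entries and ranges are illustrative assumptions, not the ones used for the results in this article.

```python
# Illustrative hyperparameter space for MLBox's Optimiser (assumed values).
space = {
    # How missing numerical values are filled.
    "ne__numerical_strategy": {"search": "choice", "space": [0, "mean"]},
    # How categorical columns are encoded.
    "ce__strategy": {"search": "choice",
                     "space": ["label_encoding", "random_projection"]},
    # Feature selection method and the share of features to drop.
    "fs__strategy": {"search": "choice",
                     "space": ["variance", "rf_feature_importance"]},
    "fs__threshold": {"search": "uniform", "space": [0.01, 0.3]},
    # Estimator and two of its parameters.
    "est__strategy": {"search": "choice", "space": ["LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [3, 5, 7]},
    "est__subsample": {"search": "uniform", "space": [0.6, 0.9]},
}

for name, spec in space.items():
    print(name, "->", spec["search"], spec["space"])
```

A narrower space converges faster within the 20 evaluations allowed; widening it usually calls for a larger evaluation budget.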

Finally, use the best model for predictions:

from mlbox.prediction import Predictor

prd = Predictor()

prd.fit_predict(best, df)

You can then evaluate the model's performance with metrics.

The data preparation for the classifier follows a similar process. You create training and testing CSV files, load them into the Reader, and proceed with the training.

The source code for this project is available in our GitHub repository.

Next, let's review mljar.

mljar Overview

mljar is user-friendly and requires minimal setup. To install, simply use:

pip install mljar-supervised

To apply the regression model, prepare your datasets and run:

from supervised.automl import AutoML

automl_reg = AutoML(total_time_limit=2*60)

automl_reg.fit(X_train_regressor, label_train_regressor)

The model will generate output indicating which algorithms were used and the performance metrics.

You can predict using:

prediction_reg_ml = automl_reg.predict_all(X_test_regressor)

And evaluate the results:

error_metrics(prediction_reg_ml, label_test_regressor)

For the classification task, follow the same steps as with regression.

The source code for this project is available in our GitHub repository.

Now, we will discuss H2O.

H2O Overview

H2O is a powerful open-source machine learning platform that requires specific data formatting. To install, run:

apt-get install default-jre

pip install h2o

After starting the H2O server, prepare your data similarly to previous tools but convert it into H2OFrame format.

Run the autoML using:

import h2o

from h2o.automl import H2OAutoML

h2o.init()

h2o_train1=h2o.H2OFrame(pd.concat([X_train_regressor, label_train_regressor], axis=1))

Once set up, you can train the model and view the leaderboard.

The source code for this project is available in our GitHub repository.

Finally, let’s explore BlobCity AutoAI.

BlobCity AutoAI Overview

This recent addition to the AutoML space offers valuable features for data scientists. Install it using:

pip install blobcity

After installation, the data preparation remains consistent with auto-sklearn. To fit the regression model:

import blobcity as bc

model_reg = bc.train(df=pd.concat([X_train_regressor, label_train_regressor], axis=1), target="critical_temp")

You can visualize feature importance and predictions easily, and the source code can be generated for documentation.

The source code for this project is available in our GitHub repository.

Consolidation Report

The following table summarizes the performance of various tools on classification tasks:

Performance table for classification tasks

Similarly, the table for regression tasks is as follows:

The discrepancies in execution time arise from the varying details and reports each tool generates. Below is a summary of features that may assist you in selecting your preferred tool:

Summary

With the growing popularity of AutoML, even those without data science backgrounds can develop machine learning models. This article provides a consolidated evaluation of some leading tools, although the vast number of available tools means not all could be included. This guide offers an excellent starting point for your AutoML journey and aids in tool selection. Most tools offer free versions, with some being open-source, differing in their modeling and reporting capabilities. Notably, BlobCity AutoAI stands out for providing full project source code, a highly sought-after feature for data scientists.

References:

Superconductor dataset: Hamidieh, K. (2018). A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154, 346–354

QSAR biodegradation Data Set: Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V. (2013). Quantitative Structure-Activity Relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53, 867–878

Credits

Pooja Gramopadhye — Copy editing

George Saavedra — Program development

Disclaimer: This article is for educational purposes. The author disclaims responsibility for any errors or omissions in the content, provided "as is" without guarantees of completeness, accuracy, usefulness, or timeliness.
