Navigating Your AutoML Journey: A Comprehensive Guide
Chapter 1: The Rise of AutoML
The conventional approach to developing machine learning models is becoming obsolete as AutoML takes center stage. AutoML has democratized machine learning, making it accessible to business users with limited technical expertise. Data scientists are increasingly leveraging AutoML tools to generate their machine learning models efficiently. These tools excel at producing high-performing models tailored to your dataset. With AutoML, users can construct models employing both traditional methods and artificial neural networks (ANN), build effective data pipelines, and enhance model accuracy through ensemble techniques.
Selecting the right AutoML tool is a significant challenge. This article aims to evaluate several tools, outlining their advantages and disadvantages to assist you in making an informed choice. This overview is particularly beneficial for beginners or non-data scientists seeking a swift introduction to the AutoML landscape.
How We Conducted Our Testing
While the AutoML process is fairly consistent across tools, each has distinct data preparation requirements and output formats. To highlight these differences, we ran every tool on the same two datasets: one for regression and one for classification, both sourced from the UCI repository.
The regression dataset consists of 81 numerical features predicting the critical temperature (target) of a superconductor, containing 21,263 instances, making it suitable for regression tasks. Conversely, the classification dataset includes 41 attributes (molecular descriptors) to determine whether a chemical is biodegradable. This binary classification dataset contains 1,055 instances.
Both datasets are available in our GitHub repository, so you won’t need to upload them to your Google Drive. You can easily download them using wget for your project. Once downloaded, typical cleansing tasks like removing null values can be performed, although most AutoML tools include built-in data cleansing functions.
For each dataset, you need to separate the features from the target and create training and testing sets. For the classification task, I made sure the dataset was balanced, since some tools perform poorly on unbalanced data. Because these preparation steps are standard for model training, they are not detailed here; the project-specific data preparation code can be found in the associated downloadable projects.
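The preparation steps described above can be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, not the article's actual preparation code: the helper names, the column name label, and the choice of downsampling as the balancing method are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def balance_by_downsampling(df, target_col, seed=42):
    """Downsample every class to the size of the smallest class (one possible balancing method)."""
    smallest = df[target_col].value_counts().min()
    parts = [group.sample(n=smallest, random_state=seed)
             for _, group in df.groupby(target_col)]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


def split_features_target(df, target_col, test_size=0.2, seed=42):
    """Separate features from the target, then create training and testing sets."""
    features = df.drop(columns=[target_col])
    target = df[target_col]
    return train_test_split(features, target, test_size=test_size, random_state=seed)


# Toy data (hypothetical): 7 rows of class 0 and 3 rows of class 1.
toy = pd.DataFrame({
    "f1": range(10),
    "f2": [x * 0.5 for x in range(10)],
    "label": [0] * 7 + [1] * 3,
})
balanced = balance_by_downsampling(toy, "label")
X_train, X_test, y_train, y_test = split_features_target(balanced, "label", test_size=0.5)
print(balanced["label"].value_counts().to_dict())
```

Downsampling discards data, so on a small dataset an alternative such as class weighting may be preferable; the sketch only shows the mechanics.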
First, let's explore auto-sklearn.
Auto-sklearn Overview
Auto-sklearn is among the pioneers in the AutoML space, launched in November 2017, and it is built on the well-known sklearn machine learning library. The latest update was released in November 2021, indicating ongoing support.
To integrate this library into your project, execute the following commands:
sudo apt-get install build-essential swig
pip install auto-sklearn==0.14.3
Next, I will demonstrate its usage through the auto-regressor and classifier.
Auto-regressor Implementation
To employ the auto-regressor, use the following code snippet:
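import autosklearn
from autosklearn.regression import AutoSklearnRegressor

model_auto_reg = AutoSklearnRegressor(
    time_left_for_this_task=10 * 60,  # total search budget, in seconds (10 minutes)
    per_run_time_limit=30,            # limit for a single model run, in seconds
    n_jobs=-1,                        # use all available cores
)
model_auto_reg.fit(X_train_regressor, label_train_regressor)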
Because AutoML tools can take a long time to find a good model, auto-sklearn lets you cap the execution time. The time_left_for_this_task parameter sets the total budget for the whole search, in seconds; we allocated 10 minutes (10*60). When the budget runs out, the search stops and returns the results of the evaluations completed so far.
The per_run_time_limit parameter caps a single model evaluation, also in seconds; here it is set to 30 seconds. A run that exceeds this limit is terminated and reported as having exceeded the time limit, and the search moves on to the next candidate.
The n_jobs parameter, when set to -1, directs the machine to utilize all available cores.
After evaluating all algorithms, you can print the execution statistics:
print(model_auto_reg.sprint_statistics())
From my testing, I received the following results:
auto-sklearn results:
Dataset name: 646225b0-8422-11ec-8195-0242ac1c0002
Metric: r2
Best validation score: 0.909665
Number of target algorithm runs: 80
Number of successful target algorithm runs: 18
Number of crashed target algorithm runs: 33
Number of target algorithms that exceeded the time limit: 5
Number of target algorithms that exceeded the memory limit: 24
As you can see, of the 80 runs, only 18 completed successfully; 33 crashed, 24 exceeded the memory limit, and 5 exceeded the time limit. The whole search finished within the 10-minute budget.
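The run counts in the statistics can be cross-checked with a few lines of code. The sketch below parses the reported text (copied from the regression run above, with the dataset name omitted) and confirms that the four outcome categories add up to the total number of runs. The helper name parse_run_counts is hypothetical, and the parsing assumes the "Number of ...: N" line format shown in the output.

```python
import re

# Statistics copied from the regression run reported above.
STATS_TEXT = """auto-sklearn results:
  Metric: r2
  Best validation score: 0.909665
  Number of target algorithm runs: 80
  Number of successful target algorithm runs: 18
  Number of crashed target algorithm runs: 33
  Number of target algorithms that exceeded the time limit: 5
  Number of target algorithms that exceeded the memory limit: 24
"""


def parse_run_counts(text):
    """Collect every 'Number of <label>: <int>' line into a dict keyed by label."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*Number of (.+): (\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_run_counts(STATS_TEXT)
total = counts.pop("target algorithm runs")   # the remaining entries are outcomes
outcome_sum = sum(counts.values())
success_rate = counts["successful target algorithm runs"] / total
print(f"{outcome_sum} of {total} runs accounted for; {success_rate:.1%} succeeded")
```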
To view the final model, invoke the show_models method:
model_auto_reg.show_models()
This will display the ensemble model comprising the top-performing models. If you wish to fine-tune the model further, first check the error metrics:
y_pred_reg = model_auto_reg.predict(X_val_regressor)
error_metrics(y_pred_reg, label_val_regressor)
The output from my run was:
MSE: 89.62974521966439
RMSE: 9.467298728764419
Coefficient of determination: 0.9151071664787114
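The error_metrics helper called above is not defined in the article. A minimal sketch of what such a helper might compute, using only the standard library, is shown below; the argument order (predictions first, true values second) follows the calls in the article, while the implementation itself is an assumption.

```python
import math


def error_metrics(y_pred, y_true):
    """Print and return MSE, RMSE, and the coefficient of determination (R^2)."""
    y_pred = [float(v) for v in y_pred]
    y_true = [float(v) for v in y_true]
    n = len(y_true)

    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares

    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1.0 - ss_res / ss_tot

    print(f"MSE: {mse}")
    print(f"RMSE: {rmse}")
    print(f"Coefficient of determination: {r2}")
    return mse, rmse, r2


# Toy check on made-up numbers (not the article's data).
mse, rmse, r2 = error_metrics([1.5, 2.0, 2.5, 4.0], [1.0, 2.0, 3.0, 4.0])
```

The same relationship holds for the reported figures: the RMSE of 9.467 is the square root of the MSE of 89.63.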
With an R² of about 0.915 on the held-out data, further tuning may not be necessary. Now, let’s look at the auto-classifier.
Auto-classifier Usage
You can apply the auto-classifier using the following code snippet:
from autosklearn.classification import AutoSklearnClassifier

model_auto_class = AutoSklearnClassifier(
    time_left_for_this_task=10 * 60,
    per_run_time_limit=30,
    n_jobs=-1,
)
model_auto_class.fit(X_train_classifier, label_train_classifier)
print(model_auto_class.sprint_statistics())
The parameters are consistent with those used in the auto-regressor.
The statistics from my run were as follows:
auto-sklearn results:
Dataset name: fa958b64-8420-11ec-8195-0242ac1c0002
Metric: accuracy
Best validation score: 0.899729
Number of target algorithm runs: 81
Number of successful target algorithm runs: 59
Number of crashed target algorithm runs: 16
Number of target algorithms that exceeded the time limit: 3
Number of target algorithms that exceeded the memory limit: 3
Of the 81 runs, 59 were successful, 16 crashed, 3 exceeded the time limit, and 3 exceeded the memory limit. The generated model can be inspected with the show_models method, and the classification report can be printed with:
from sklearn.metrics import classification_report

y_pred_class = model_auto_class.predict(X_val_classifier)
print(classification_report(label_val_classifier, y_pred_class))
The output from my run was:
The source code for this project is available in our GitHub repository.
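The classification report printed above lists per-class precision, recall, and F1 alongside overall accuracy. As a rough illustration of how these figures are derived, here is a minimal standard-library sketch on made-up labels (not the article's data). The function name binary_report is hypothetical; in practice scikit-learn's classification_report computes these for you.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 for the positive class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = correct / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 1 = positive class, 0 = negative class.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]
report = binary_report(y_true_toy, y_pred_toy)
print(report)
```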
Next, I will discuss how to use AutoKeras with both datasets.
AutoKeras Overview
AutoKeras takes a neural network approach to model development: it searches over network architectures, including the number of layers and nodes, for one that suits the data.
Installation can be accomplished with the following command:
pip install autokeras
Before running the regressor, I defined a callback that reduces the learning rate when training stops improving:
from tensorflow.keras.callbacks import ReduceLROnPlateau

lr_reduction = ReduceLROnPlateau(monitor='mean_squared_error',
                                 patience=1,      # epochs without improvement before reducing
                                 verbose=1,
                                 factor=0.5,      # new_lr = lr * factor
                                 min_lr=0.000001)
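To make the effect of these settings concrete, the sketch below simulates the schedule in plain Python. It is a simplified illustration of the reduce-on-plateau idea, not the Keras implementation: it ignores the cooldown option, and the starting learning rate of 1e-3 and the min_delta of 1e-4 are assumptions, as are the made-up metric values.

```python
def simulate_reduce_lr(metric_per_epoch, initial_lr=1e-3, factor=0.5,
                       patience=1, min_lr=1e-6, min_delta=1e-4):
    """Return the learning rate in effect after each epoch (lower metric is better)."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    history = []
    for value in metric_per_epoch:
        if value < best - min_delta:
            # The monitored metric improved: remember it and reset the counter.
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate, but not below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Made-up mean squared error values for six epochs.
lrs = simulate_reduce_lr([100.0, 90.0, 95.0, 96.0, 80.0, 85.0])
print(lrs)
```

With patience=1 the rate is halved after every single epoch that fails to improve, which is quite aggressive; a larger patience tolerates noisier metrics.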
You can apply the auto-regressor using:
from autokeras import StructuredDataRegressor

regressor = StructuredDataRegressor(max_trials=3,
                                    loss='mean_absolute_error')
regressor.fit(x=X_train_regressor, y=label_train_regressor,
              callbacks=[lr_reduction],
              verbose=0, epochs=20)
As shown in the code, you simply need to specify the number of trials and the loss function for regression.
After the model has been trained, you can predict on the held-out data and print the error metrics. My run produced:
MSE: 163.16712072898235
RMSE: 12.7736886109292
Coefficient of determination: 0.8515213997277571
For the classifier, use the following code:
from autokeras import StructuredDataClassifier

classifier = StructuredDataClassifier(max_trials=5, num_classes=2)
classifier.fit(x=X_train_classifier, y=label_train_classifier,
               verbose=0, epochs=20)
After creating the model, predictions can be made, and the classification report can be printed:
The source code for this project is available in our GitHub repository.
Next, let’s explore TPOT.
TPOT Overview
To install TPOT, use:
pip install tpot
For the regression model, apply it as follows:
from sklearn.model_selection import RepeatedKFold
from tpot import TPOTRegressor

cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=1)
model_reg = TPOTRegressor(generations=5, population_size=50,
                          scoring='neg_mean_absolute_error', cv=cv,
                          verbosity=2, random_state=1, n_jobs=-1)
model_reg.fit(X_train_regressor, label_train_regressor)
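The generations and population_size settings determine how much work TPOT does. According to the TPOT documentation, it evaluates population_size + generations × offspring_size pipelines in total, with offspring_size defaulting to population_size. The sketch below turns that into a rough estimate for the settings used in this article; the function name is hypothetical, and the count of model fits is an upper-bound estimate that assumes each pipeline is fitted once per cross-validation fold.

```python
def tpot_evaluation_budget(generations, population_size, n_splits, n_repeats,
                           offspring_size=None):
    """Estimate the number of pipelines TPOT evaluates and the resulting model fits."""
    if offspring_size is None:
        offspring_size = population_size  # TPOT's documented default
    pipelines = population_size + generations * offspring_size
    fits = pipelines * n_splits * n_repeats  # one fit per fold per repeat
    return pipelines, fits


# Settings from the regression and classification runs in this article.
reg_pipelines, reg_fits = tpot_evaluation_budget(5, 50, n_splits=2, n_repeats=2)
clf_pipelines, clf_fits = tpot_evaluation_budget(3, 50, n_splits=2, n_repeats=2)
print(reg_pipelines, reg_fits)
print(clf_pipelines, clf_fits)
```

Estimates like these help when choosing settings for a larger dataset, since run time grows roughly in proportion to the number of fits.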
The evaluation metrics from my test run were:
MSE: 78.55015022333929
RMSE: 8.862852262299045
Coefficient of determination: 0.9260222950313017
For the auto-classifier, the code is similar:
from sklearn.model_selection import RepeatedStratifiedKFold
from tpot import TPOTClassifier

cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=2, random_state=1)
model_class = TPOTClassifier(generations=3, population_size=50, cv=cv,
                             scoring='accuracy', verbosity=2,
                             random_state=1, n_jobs=-1)
model_class.fit(X_train_classifier, label_train_classifier)
The classification report from my test run:
The source code for this project is available in our GitHub repository.
Next, let’s discuss MLBox.
MLBox Overview
MLBox requires additional data preparation compared to the previous tools. It reads both the training and testing datasets from CSV files. Here’s how to prepare the data for regression:
import pandas as pd
from sklearn.model_selection import train_test_split

regressor_df = pd.read_csv('/content/superconductors.csv')
features_regressor = regressor_df.iloc[:, :-1]  # every column except the last
label_regressor = regressor_df.iloc[:, -1]      # the last column is the target
X_train_regressor, X_test_regressor, label_train_regressor, label_test_regressor = train_test_split(
    features_regressor, label_regressor, test_size=0.2, random_state=42)
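The split above leaves the data in memory, whereas MLBox reads from files. A minimal sketch of writing the two CSV files is shown below, assuming (as I understand the MLBox Reader) that the training file must contain the target column and the testing file must not. The helper name write_mlbox_csvs and the toy data are hypothetical; the file names and the target name critical_temp match the ones used with the Reader in this article.

```python
import os
import tempfile

import pandas as pd


def write_mlbox_csvs(X_train, y_train, X_test, out_dir, target_name="critical_temp"):
    """Write a training CSV (features + target) and a testing CSV (features only)."""
    train_df = pd.concat([X_train, y_train.rename(target_name)], axis=1)
    train_path = os.path.join(out_dir, "training1_file.csv")
    test_path = os.path.join(out_dir, "testing1_file.csv")
    train_df.to_csv(train_path, index=False)
    X_test.to_csv(test_path, index=False)  # no target column in the testing file
    return [train_path, test_path]


# Toy frames (made up) standing in for the real train/test split.
X_train_toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [0.1, 0.2, 0.3]})
y_train_toy = pd.Series([10.0, 20.0, 30.0])
X_test_toy = pd.DataFrame({"f1": [4.0], "f2": [0.4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = write_mlbox_csvs(X_train_toy, y_train_toy, X_test_toy, tmp_dir)
    train_columns = list(pd.read_csv(paths[0]).columns)
    test_columns = list(pd.read_csv(paths[1]).columns)

print(train_columns, test_columns)
```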
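MLBox expects the training and testing splits as CSV files on disk (training1_file.csv and testing1_file.csv here). Once they are written, you can read them and proceed with the regression model training.
The files are loaded with the Reader, which separates them into training and testing data based on the target column:
from mlbox.preprocessing import Reader

paths = ['training1_file.csv', 'testing1_file.csv']
rd = Reader(sep=',')
df = rd.train_test_split(paths, target_name='critical_temp')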
The Drift_thresholder transformation identifies and drops unwanted columns, such as IDs and variables whose distribution drifts between the training and testing sets:
from mlbox.preprocessing import Drift_thresholder
dft = Drift_thresholder()
df = dft.fit_transform(df)
The output indicates that no variables were dropped in this run.
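After defining the hyper-parameter search space (the space dictionary passed below), you can optimize over it. The third argument is the maximum number of evaluations:
from mlbox.optimization import Optimiser

opt = Optimiser(n_folds=3)
best = opt.optimise(space, df, 20)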
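The space argument is not defined in the article. In MLBox, it is a dictionary whose keys name a pipeline step and parameter (prefixes such as ne, ce, fs, and est) and whose values describe what to search. The sketch below follows the format shown in the MLBox documentation; the specific parameters and ranges are illustrative assumptions, not the values used for the results reported here.

```python
# Hypothetical search space in the MLBox format: each value gives a "space" of
# candidates and, optionally, a "search" strategy ("choice" or "uniform").
space = {
    # How to fill missing numerical values.
    "ne__numerical_strategy": {"search": "choice", "space": [0, "mean"]},
    # How to encode categorical features.
    "ce__strategy": {"search": "choice",
                     "space": ["label_encoding", "random_projection"]},
    # Feature selection: fraction of features to drop.
    "fs__threshold": {"search": "uniform", "space": [0.01, 0.3]},
    # Estimator and two of its hyper-parameters.
    "est__strategy": {"search": "choice", "space": ["LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [3, 5, 7]},
    "est__subsample": {"search": "uniform", "space": [0.6, 0.9]},
}

# Sanity check on the structure before handing it to the optimiser.
for name, spec in space.items():
    assert "__" in name and "space" in spec
    if spec.get("search") == "uniform":
        low, high = spec["space"]  # uniform ranges are [min, max]
        assert low < high
```

For "choice" entries the list enumerates the candidates to try, while for "uniform" entries it gives the lower and upper bounds of the range to sample from.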
Finally, use the best model for predictions:
from mlbox.prediction import Predictor
prd = Predictor()
prd.fit_predict(best, df)
You can then evaluate the model's performance with metrics.
The data preparation for the classifier follows a similar process. You create training and testing CSV files, load them into the Reader, and proceed with the training.
The source code for this project is available in our GitHub repository.
Next, let's review mljar.
mljar Overview
mljar is user-friendly and requires minimal setup. To install, simply use:
pip install mljar-supervised
To apply the regression model, prepare your datasets and run:
from supervised.automl import AutoML

automl_reg = AutoML(total_time_limit=2 * 60)  # total training budget, in seconds
automl_reg.fit(X_train_regressor, label_train_regressor)
The model will generate output indicating which algorithms were used and the performance metrics.
You can predict using:
prediction_reg_ml = automl_reg.predict_all(X_test_regressor)
And evaluate the results:
error_metrics(prediction_reg_ml, label_test_regressor)
For the classification task, follow the same steps as with regression.
The source code for this project is available in our GitHub repository.
Now, we will discuss H2O.
H2O Overview
H2O is a powerful open-source machine learning platform that requires specific data formatting. To install, run:
apt-get install default-jre
pip install h2o
After starting the H2O server, prepare your data similarly to previous tools but convert it into H2OFrame format.
Run the autoML using:
import h2o
import pandas as pd
from h2o.automl import H2OAutoML

h2o.init()  # starts (or connects to) a local H2O server
h2o_train1 = h2o.H2OFrame(pd.concat([X_train_regressor, label_train_regressor], axis=1))
Once set up, you can train the model and view the leaderboard.
The source code for this project is available in our GitHub repository.
Finally, let’s explore BlobCity AutoAI.
BlobCity AutoAI Overview
This recent addition to the AutoML space offers valuable features for data scientists. Install it using:
pip install blobcity
After installation, the data preparation remains consistent with auto-sklearn. To fit the regression model:
import blobcity as bc
import pandas as pd

model_reg = bc.train(df=pd.concat([X_train_regressor, label_train_regressor], axis=1),
                     target="critical_temp")
You can visualize feature importance and predictions easily, and the source code can be generated for documentation.
The source code for this project is available in our GitHub repository.
Consolidation Report
The following table summarizes the performance of various tools on classification tasks:
Similarly, the table for regression tasks is as follows:
The discrepancies in execution time arise from the varying details and reports each tool generates. Below is a summary of features that may assist you in selecting your preferred tool:
Summary
With the growing popularity of AutoML, even those without data science backgrounds can develop machine learning models. This article provides a consolidated evaluation of some leading tools, although the vast number of available tools means not all could be included. This guide offers an excellent starting point for your AutoML journey and aids in tool selection. Most tools offer free versions, with some being open-source, differing in their modeling and reporting capabilities. Notably, BlobCity AutoAI stands out for providing full project source code, a highly sought-after feature for data scientists.
References:
Superconductor dataset: Hamidieh, K. (2018). A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154, 346–354.
QSAR biodegradation dataset: Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V. (2013). Quantitative Structure-Activity Relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53, 867–878.
Credits
Pooja Gramopadhye — Copy editing
George Saavedra — Program development
Disclaimer: This article is for educational purposes. The author disclaims responsibility for any errors or omissions in the content, provided "as is" without guarantees of completeness, accuracy, usefulness, or timeliness.