Hi DSP’ers,

Hope you are safe and fine!

Today, we are going to learn about a new AutoML framework called evalml, created by Alteryx, a well-known company in the analytics field.

EvalML is an AutoML library which builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

Key Functionality

  • Automation – Makes machine learning easier. Avoid training and tuning models by hand. Includes data quality checks, cross-validation and more.
  • Data Checks – Catches and warns of problems with your data and problem setup before modeling.
  • End-to-end – Constructs and optimizes pipelines that include state-of-the-art preprocessing, feature engineering, feature selection, and a variety of modeling techniques.
  • Model Understanding – Provides tools to understand and introspect on models, to learn how they’ll behave in your problem domain.
  • Domain-specific – Includes repository of domain-specific objective functions and an interface to define your own.

How to Install:

pip install evalml

In a Jupyter notebook, prefix the command with an exclamation mark:

!pip install evalml

 

We are going to use the Bike Sharing dataset from the UCI Machine Learning Repository for regression analysis.

Dataset information:

– instant: record index

– dteday : date

– season : season (1:winter, 2:spring, 3:summer, 4:fall)

– yr : year (0: 2011, 1:2012)

– mnth : month ( 1 to 12)

– hr : hour (0 to 23)

– holiday : whether the day is a holiday or not (extracted from [Web Link])

– weekday : day of the week

– workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.

– weathersit :

– 1: Clear, Few clouds, Partly cloudy

– 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

– 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

– 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

– temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)

– atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)

– hum: Normalized humidity. The values are divided by 100 (max)

– windspeed: Normalized wind speed. The values are divided by 67 (max)

– casual: count of casual users

– registered: count of registered users

– cnt: count of total rental bikes including both casual and registered
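The normalized columns above can be mapped back to physical units by inverting the stated formulas. Here is a minimal sketch, assuming the constants listed in the dataset description. The helper names and the sample values are made up for illustration.

```python
# Invert the scaling described in the dataset notes.
# Constants come from the dataset description; the function names are
# illustrative and are not part of pandas or evalml.

def denormalize_min_max(value, t_min, t_max):
    """Undo (t - t_min) / (t_max - t_min)."""
    return value * (t_max - t_min) + t_min

def denormalize_by_max(value, max_value):
    """Undo division by the column maximum."""
    return value * max_value

temp_celsius = denormalize_min_max(0.5, t_min=-8, t_max=39)     # temp
atemp_celsius = denormalize_min_max(0.5, t_min=-16, t_max=50)   # atemp
humidity_pct = denormalize_by_max(0.8, 100)                     # hum
wind_speed = denormalize_by_max(0.25, 67)                       # windspeed

print(temp_celsius, atemp_celsius, humidity_pct, wind_speed)
```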

 

Let’s read the file


import pandas as pd
data = pd.read_csv('hour.csv')
data.head()

We are going to drop the features ‘instant’ and ‘dteday’ which are not going to be useful for the regression analysis.

data = data.drop(['instant', 'dteday'], axis=1)

Let’s create separate variables for the independent and dependent variables. Here cnt (the count of bicycles rented) is the target variable.

X = data.drop('cnt', axis=1)
y = data['cnt']

Now we import the evalml library and split the data 70:30 into training and test sets.

In evalml you need to provide the problem_type; for us it is a regression problem.

import evalml
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='regression', test_size=0.3)
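Conceptually, a 70:30 split shuffles the row indices and holds out 30% of them for testing. The sketch below uses only the standard library to illustrate the idea. train_test_indices is a hypothetical helper, and this is not how evalml implements split_data internally.

```python
import random

def train_test_indices(n_rows, test_size=0.3, seed=0):
    """Shuffle row indices and hold out a test_size fraction for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_indices(10, test_size=0.3)
print(len(train_idx), len(test_idx))
```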

 

To check all the problem types that evalml supports, type:

evalml.problem_types.ProblemTypes.all_problem_types

Evalml supports binary classification, multiclass classification, regression, time series regression, time series binary classification and time series multiclass classification problems.

With the help of AutoMLSearch we will search for the best algorithm for our problem statement. Here, too, we need to provide the problem_type.

We set max_batches to 1 so that each model is run once, with 3-fold cross-validation (CV).

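from evalml.automl import AutoMLSearch
# optimize_thresholds only matters for binary classification; it has no effect on this regression search
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='regression',
                      max_batches=1, optimize_thresholds=True)
automl.search()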

The search output shows that evalml ran 7 ML models on our regression problem, and that linear regression with imputation and standard scaler transformation is the best model.

This model reaches an R2 score of 1 on the training data. (In the real world this is rarely possible; here the features ‘casual’ and ‘registered’ add up exactly to the target ‘cnt’, so the relationship is trivially linear.)
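A score of 1 is what you would expect when the target can be rebuilt exactly from the features. The dataset description defines cnt as the total of casual and registered users, and both columns are still in X. Here is a quick sketch of that check on made-up rows. The numbers are invented for illustration; only the column names come from the dataset description.

```python
# Made-up rows for illustration only; column names follow the dataset notes.
rows = [
    {"hr": 0, "temp": 0.24, "casual": 3, "registered": 13, "cnt": 16},
    {"hr": 1, "temp": 0.22, "casual": 8, "registered": 32, "cnt": 40},
    {"hr": 2, "temp": 0.22, "casual": 0, "registered": 1, "cnt": 1},
]

# If this holds for every row, a linear model with weights (1, 1) on these
# two columns reproduces the target exactly.
target_is_sum = all(r["casual"] + r["registered"] == r["cnt"] for r in rows)
print(target_is_sum)

# One option is to leave the leaky columns out of the features together with the target.
excluded = {"cnt", "casual", "registered"}
features = [{k: v for k, v in r.items() if k not in excluded} for r in rows]
print(features[0])
```

On the real data the same check would be (data['casual'] + data['registered'] == data['cnt']).all(), and the two columns could be dropped together with 'cnt' when building X.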

We can also rank the models to find the best ones:

automl.rankings

The resulting table lists the models from best to worst, with columns such as:

  • Mean CV score – the average R2 score over all CV iterations.
  • Standard deviation – the standard deviation of the score over all CV iterations.
  • Validation score – the score on the first CV fold.
  • High variance CV – flags whether the scores differ a lot between CV folds, which can point to an unstable or overfitting model.
  • Parameters – all the hyperparameters used in each model, which can help us explain the model to end users.
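The mean CV score and its standard deviation are plain summary statistics over the per-fold scores. Here is a minimal sketch with three made-up fold scores. The sample standard deviation is used as an assumption; whether evalml uses the sample or the population version is not checked here.

```python
from statistics import mean, stdev

fold_scores = [0.91, 0.89, 0.93]    # invented R2 scores for 3 CV folds

mean_cv_score = mean(fold_scores)
std_cv_score = stdev(fold_scores)   # sample standard deviation (assumption)
print(round(mean_cv_score, 4), round(std_cv_score, 4))
```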

Now store the best model in a variable; automl.best_pipeline automatically selects the best model found by the search.

best_pipeline = automl.best_pipeline

 

The best model’s parameters are shown below. You can see that the imputation strategy is the most frequent value for categorical columns and the mean for numerical columns. Also, evalml has standardized the data with a standard scaler while training.

GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'Linear Regressor':{'fit_intercept': True, 'normalize': False, 'n_jobs': -1},})
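To make those parameters concrete, here is a rough standard-library sketch of what 'mean' and 'most_frequent' imputation and standard scaling do to a column. The helper names are hypothetical and the values are made up. evalml's own components handle all of this for you.

```python
from collections import Counter
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None with the mean of the observed numeric values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_most_frequent(values):
    """Replace None with the most common observed category."""
    fill = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

numeric = impute_mean([1.0, None, 3.0])                  # None -> 2.0
category = impute_most_frequent(["a", "b", None, "a"])   # None -> "a"
scaled = standard_scale(numeric)
print(numeric, category, scaled)
```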

Also, if we want to see the complete description of the best model, evalml can provide that too:

automl.describe_pipeline(automl.rankings.iloc[0]["id"])

To find out the performance of the model on the test data, we can score the pipeline.

Here, too, we need to specify that the objectives are the regression ones.

scores = best_pipeline.score(X_test, y_test, objectives=evalml.objectives.get_core_objectives('regression'))

print(f'MAE regression: {scores["MAE"]:.10f}')

We can also check all the performance metric scores:

scores
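MAE, the metric printed above, is the average absolute difference between the actual and the predicted values. A minimal sketch with invented numbers:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [10, 20, 30]        # made-up rental counts
predicted = [12, 18, 33]     # made-up predictions
mae = mean_absolute_error(actual, predicted)
print(mae)
```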

Now, let’s predict on the test dataset. An evalml pipeline has a predict method that generates predictions for the test dataset.


pipeline = automl.best_pipeline

prediction = pipeline.predict(X_test)

 

Once the prediction is done, we cannot use it directly as a pandas object, because it has the type woodwork.datacolumn.DataColumn, which is an Alteryx data type.

Hence, we need to convert it to a pandas Series to append the result to our data, or to compute a confusion matrix in a classification task.

type(prediction)

woodwork.datacolumn.DataColumn

 

We will convert the prediction to a pandas Series:

output = prediction.to_series()

output

Now this output can be appended to your original dataset and can also be used to compute performance metrics for the model.

Let’s find the R2 score using sklearn metrics:


from sklearn.metrics import r2_score
r2_score(y_test.to_series(), output)
Output: 1.0 (the features determine the target through a simple linear relationship)
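For reference, R2 compares the squared prediction errors with the spread of the target around its mean: R2 = 1 - SS_res / SS_tot. Here is a standard-library sketch on made-up values showing why exact predictions give 1.0, while always predicting the mean gives 0.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [16, 40, 32, 13]                    # invented counts, mean is 25.25
perfect = r_squared(actual, actual)          # every prediction is exact
baseline = r_squared(actual, [25.25] * 4)    # always predicting the mean
print(perfect, baseline)
```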

 

Lastly, we can save the model for later use:

best_pipeline.save("model.pkl")

 

This pickle (PKL) file can be loaded later for prediction.

How to load the model file:

mymodel = evalml.pipelines.PipelineBase.load('model.pkl')

 

If we have new test data, we can predict on it using:

mymodel.predict(X_test)
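Saving and loading boils down to serializing the fitted object to disk and reading it back. The sketch below shows that round trip with the standard pickle module on a stand-in object (a plain dict with made-up values). It illustrates the idea only and is not evalml's own save/load code.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: a plain dict with made-up coefficients.
toy_model = {"name": "toy linear model", "coef": [1.0, 1.0], "intercept": 0.0}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

with open(path, "wb") as f:       # save to disk
    pickle.dump(toy_model, f)

with open(path, "rb") as f:       # load it back
    restored = pickle.load(f)

print(restored == toy_model)
```

As with any pickle file, only load files that come from a source you trust.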

Comparison with PyCaret:

In my experience, PyCaret has a higher number of models available compared to evalml, and it also has GPU support. But evalml has a good pipeline structure and a clear explanation of the parameters used in the model. In my runs, evalml was also faster than PyCaret.

 

Complete code can be found in my GitHub repo.

Reference and official GitHub page for evalml
https://github.com/alteryx/evalml

 
