Isn’t it fun to create your own ML model without needing expertise in data pre-processing, feature extraction, dimensionality reduction, hyperparameter optimization, and so on?
But what is AutoML?
Automated Machine Learning (AutoML) automates the whole pipeline, from the preparation of data to prediction on real-world data; it is the end-to-end process of applying machine learning to real-world problems and generating ML models. Several steps must be completed before selecting a model for a dataset, such as data pre-processing, feature engineering, feature extraction, and feature selection. Hyperparameter tuning must also be performed to increase predictive performance. AutoML simplifies this whole application of machine learning, especially for a greenhorn in ML.
What is PyCaret?
PyCaret is an open-source, low-code ML library that takes the user from preparing the data to deploying a model within a few lines of code, allowing conclusions to be reached faster by reducing the cycle time from hypothesis to insight. It is essentially a wrapper around other ML libraries, providing a low-code solution.
Installation
pip install pycaret
# To upgrade an existing version
pip install --upgrade pycaret
# To create a conda environment using anaconda prompt
conda create --name envname python=3.6
conda activate envname
pip install pycaret
All the dependencies are installed along with pycaret; the full list is available in the PyCaret documentation.
Import the required module based on the task. There are six modules available in pycaret; if the regression module is imported, the environment is set up to perform regression tasks only.
from pycaret.classification import * # Classification model
from pycaret.regression import * # Regression model
from pycaret.clustering import * # Clustering
from pycaret.anomaly import * # Anomaly Detection
from pycaret.nlp import * # NLP
from pycaret.arules import * # Association Rule Mining
Setup Function
Setup is the initial and mandatory step before building a model. A few pre-processing tasks are performed by default, and in addition pycaret provides other pre-processing options to increase performance.
Syntax for setup
_setup = setup(data, target=' ')  # data: the dataset (a pandas DataFrame); target: name of the target column
Default pre-processing tasks
Data Type Specification
Setup infers the data type of each feature in the dataset. After setup executes, a dialogue appears displaying every feature alongside its inferred data type. If the types are accurate, press “Enter” to continue; if there is any discrepancy between a feature and its data type, type “quit”.
The figure above shows the features of the blood dataset with their inferred data types.
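If setup infers a type incorrectly, it can also be overridden through the categorical_features and numeric_features parameters instead of typing “quit”. Below is a minimal sketch on a hypothetical toy dataset (the column names and values are illustrative, not from the blood dataset):
import pandas as pd
from pycaret.classification import *
# Hypothetical toy data: 'code' is stored as numbers but is really categorical
df = pd.DataFrame({
    'code':        [1, 2, 1, 2, 3, 1, 2, 3, 1, 2],
    'measurement': [0.5, 1.2, 0.7, 1.9, 0.3, 1.1, 0.8, 1.4, 0.6, 1.0],
    'Class':       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
# Override the inferred types; silent=True skips the confirmation dialogue (PyCaret 2.x)
clf = setup(data=df, target='Class',
            categorical_features=['code'],
            numeric_features=['measurement'],
            silent=True)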
Data Cleaning
Users sometimes forget to clean the data, i.e. to find and fill missing values, but setup solves this by identifying missing values automatically. For a numerical feature, a missing value is filled with the mean or median; for a categorical feature, it is filled with the most frequent value. The numeric_imputation and categorical_imputation parameters control these strategies.
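The imputation strategy can be changed when calling setup. Here is a minimal sketch using the blood dataset from pycaret (it has no missing values, so this only illustrates the parameters):
from pycaret.datasets import get_data
from pycaret.classification import *
dataset = get_data('blood')
# Fill missing numeric values with the median and missing
# categorical values with the most frequent value
clf = setup(data=dataset, target='Class',
            numeric_imputation='median',
            categorical_imputation='mode')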
Data Splitting
By default, the data is split into training and testing sets in a 70:30 ratio.
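The ratio is controlled by the train_size parameter of setup; for instance, a minimal sketch of an 80:20 split:
from pycaret.datasets import get_data
from pycaret.classification import *
dataset = get_data('blood')
# train_size=0.8 gives an 80:20 train/test split (default is 0.7)
clf = setup(data=dataset, target='Class', train_size=0.8)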
For a better understanding, let’s consider an example:
from pycaret.datasets import get_data
dataset = get_data('blood')  # load the sample 'blood' dataset
The same blood dataset is used for this example as well.
from pycaret.classification import *
classify = setup(data=dataset, target='Class')
The total size of the dataset is (748, 5); by default, this data is split in a 70:30 ratio. The target type is identified as binary, and the dataset consists of only numerical features. Only a few of the default tasks have been discussed here; let’s look at the other important pre-processing steps in detail.
Normalization
Normalization is an important technique for transforming numerical values onto a similar scale, without distorting differences or losing information. By default, ‘zscore’ is used in PyCaret.
normalize_method: string, default = ‘zscore’
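Normalization is disabled by default; it is enabled with the normalize flag, while normalize_method selects the technique. A minimal sketch:
from pycaret.datasets import get_data
from pycaret.classification import *
dataset = get_data('blood')
# Standardize features to zero mean and unit variance ('zscore');
# other documented options are 'minmax', 'maxabs' and 'robust'
clf = setup(data=dataset, target='Class',
            normalize=True,
            normalize_method='zscore')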