Machine Learning Lifecycle ML-2

Greetings! This blog covers the lifecycle of a machine learning model and the technical steps involved in it. In the previous blog, we discussed the types of machine learning models. This blog contains:

Machine learning lifecycle

This blog uses some technical terms. If something is not clear, don't worry: it will be explained in later blogs. I have tried to keep this blog as simple as possible.

Machine Learning Lifecycle :

Fig 1. Machine learning model lifecycle

Data Ingestion :

Data ingestion is the first step in creating a machine learning model. In this step, data is collected from various sources; in simple terms, it is the loading of data into the ML system. The data can come in any form and type. It can be a file such as a CSV or TSV file, it can be stored in a SQL or NoSQL database, or it can sit on a cloud platform such as AWS, Azure, or GCP. A data scientist is expected to be able to ingest data from any of these sources into the system.
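As a minimal sketch of ingestion, the snippet below reads one CSV source and one TSV source into rows using only Python's standard library. The data, the column names, and the `ingest` helper are made up for illustration; in practice a library such as pandas is commonly used for this step.

```python
import csv
import io

# Hypothetical in-memory stand-ins for a CSV file and a TSV file.
CSV_TEXT = "age,city,bought\n25,Pune,yes\n32,Delhi,no\n"
TSV_TEXT = "age\tcity\tbought\n41\tMumbai\tyes\n"

def ingest(text, delimiter):
    """Read delimited text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

csv_rows = ingest(CSV_TEXT, ",")
tsv_rows = ingest(TSV_TEXT, "\t")

print(csv_rows[0])
print(len(csv_rows), len(tsv_rows))
```

Note that every value is still text at this point; turning it into numbers is the job of preprocessing.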

Raw Data :

After ingesting data from multiple sources into the ML system, you need to combine all of it to create the raw dataset. In this process, data scientists have to make sure that

no data is lost
the data is properly arranged
values are aligned by column name (feature) when multiple data files are combined.

After this process, we have a single dataset file in our ML system. This file is generally known as the raw dataset. It can still contain missing values or text.
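The three checks above can be sketched roughly as follows. The two sources, the `COLUMNS` list, and the `combine` helper are hypothetical; the point is only that rows are aligned by column name rather than by position, and that the row count is verified afterwards.

```python
# Hypothetical rows from two sources; the columns appear in a different order.
source_a = [{"age": "25", "city": "Pune", "bought": "yes"}]
source_b = [{"city": "Delhi", "bought": "no", "age": ""}]  # age is missing here

COLUMNS = ["age", "city", "bought"]

def combine(*sources):
    """Stack rows from all sources, aligning values by column name."""
    raw = []
    for rows in sources:
        for row in rows:
            missing = set(COLUMNS) - set(row)
            if missing:
                raise ValueError(f"row is missing columns: {sorted(missing)}")
            raw.append([row[name] for name in COLUMNS])
    return raw

raw_dataset = combine(source_a, source_b)

# No data loss: the combined row count equals the sum of the inputs.
assert len(raw_dataset) == len(source_a) + len(source_b)
print(raw_dataset)
```

The empty string in the second row is an example of a missing value that the raw dataset still carries.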

Data Preprocessing :

In data preprocessing, we process the raw data. The main aim is to convert the raw data into numerical data that a model can work with. Text or labels are converted into numbers. Numbers on a large scale are brought down to a smaller scale, e.g. a value like 1000000 may be converted into a much smaller number such as 100. Scaling will be discussed in later blogs. Many more technical operations fall under data preprocessing, and all of these steps will be discussed in later blogs.
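The two conversions mentioned here can be sketched as below. This is only an illustration: the blog does not say which encoding or scaling method is meant, so integer label encoding and min-max scaling to the range 0 to 1 are assumptions, and the helper names and data are made up.

```python
def label_encode(values):
    """Map each distinct label to an integer, assigned in sorted order."""
    mapping = {label: index for index, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def min_max_scale(values):
    """Rescale numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # all values equal, nothing to spread out
    return [(v - low) / (high - low) for v in values]

cities = ["Pune", "Delhi", "Pune", "Mumbai"]
salaries = [1000000, 250000, 500000, 750000]

encoded, mapping = label_encode(cities)
scaled = min_max_scale(salaries)

print(mapping)
print(encoded)
print(scaled)
```

With this choice of scaling, the largest value becomes 1.0 and the smallest becomes 0.0; other scaling methods produce different ranges.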

Splitting into X and y :

After preprocessing, the dataset is divided into training, testing, and validation datasets. A dataset has questions, i.e. features, and answers, i.e. target features. The words "questions" and "answers" are used here only for simplicity. In later blogs, they will be referred to as features, or X, and target features or labels, i.e. y. The technical details will be covered in later blogs.

The main formula to remember is :

Dataset = Training dataset + Testing dataset + Validation dataset

The dataset is mainly divided into the following parts:

X training : Questions, i.e. features of the training dataset.
y training : Answers, i.e. target features of the training dataset.
X testing : Questions, i.e. features of the testing dataset.
y testing : Answers, i.e. target features of the testing dataset.
y predicted : Answers, i.e. target features predicted by the ML model when X testing is given to it.
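A rough sketch of both kinds of splitting is given below, using only the standard library. The dataset, the 60/20/20 proportions, the fixed seed, and the assumption that the target is the last column are all illustrative choices, not something the blog prescribes.

```python
import random

# Hypothetical processed dataset: each row is [feature_1, feature_2, label].
dataset = [[i, i * 2, i % 2] for i in range(10)]

def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and cut them into training, validation, and testing parts."""
    rows = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def split_xy(rows):
    """Separate the features (X) from the target (y), taken as the last column."""
    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y

train_rows, val_rows, test_rows = split_dataset(dataset)
X_train, y_train = split_xy(train_rows)
X_val, y_val = split_xy(val_rows)
X_test, y_test = split_xy(test_rows)

print(len(X_train), len(X_val), len(X_test))
```

Every row ends up in exactly one of the three parts, which is what the formula "Dataset = Training dataset + Testing dataset + Validation dataset" expresses.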

ML-Model :

For training, we feed X training and y training to the ML model. After training is complete, we feed X testing to the model, and the model gives predictions, denoted as y predicted. We then compare y predicted with y testing. This comparison is done with the help of evaluation metrics; accuracy is the one generally used.

Accuracy = (number of predictions where y predicted matches y testing) / (total number of predictions)
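To make the train, predict, and compare steps concrete, here is a small sketch. The blog does not name a model, so a 1-nearest-neighbour classifier is swapped in purely for illustration, with made-up data. Accuracy is computed as the fraction of predictions that match the true labels.

```python
def predict_1nn(X_train, y_train, x):
    """Return the label of the training point closest to x."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(range(len(X_train)),
                  key=lambda i: squared_distance(X_train[i], x))
    return y_train[nearest]

def accuracy(y_test, y_predicted):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_test, y_predicted) if t == p)
    return correct / len(y_test)

# Hypothetical training and testing data with two features per row.
X_train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]]
y_train = [0, 0, 1, 1]
X_test = [[0.9, 1.1], [5.2, 5.1], [3.2, 3.0], [1.0, 0.5]]
y_test = [0, 1, 0, 0]

y_predicted = [predict_1nn(X_train, y_train, x) for x in X_test]

print(y_predicted)                      # [0, 1, 1, 0]
print(accuracy(y_test, y_predicted))    # 0.75
```

Three of the four predictions match the true labels, so the accuracy is 3 / 4 = 0.75.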

Hyperparameter tuning :

After our model is trained, we do hyperparameter tuning to increase the accuracy of the ML model. In simple terms, hyperparameters are particular settings of the model whose best values generally change from problem to problem, and the accuracy varies depending on them. Selecting the values of such parameters is known as hyperparameter tuning. We will discuss more details and technical terms in later blogs.
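One simple way to tune a hyperparameter is to try a few candidate values and keep the one that scores best on the validation dataset. The sketch below does this for k, the number of neighbours in a k-nearest-neighbours classifier. The model, the candidate values, and the data are all assumptions chosen for illustration; the blog does not specify any of them.

```python
from collections import Counter

def predict_knn(X_train, y_train, x, k):
    """Majority vote among the k training points closest to x."""
    order = sorted(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def accuracy(y_true, y_predicted):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_predicted)) / len(y_true)

# Hypothetical data; the training label at 3.0 is deliberately noisy.
X_train = [[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0]]
y_train = [0, 0, 1, 0, 1, 1, 1]
X_val = [[2.9], [3.2], [6.5], [7.5]]
y_val = [0, 0, 1, 1]

candidates = [1, 3, 5]  # values of the hyperparameter k to try
scores = {}
for k in candidates:
    predictions = [predict_knn(X_train, y_train, x, k) for x in X_val]
    scores[k] = accuracy(y_val, predictions)

# Keep the first candidate that reaches the highest validation accuracy.
best_k = max(scores, key=scores.get)

print(scores)   # {1: 0.5, 3: 1.0, 5: 1.0}
print(best_k)   # 3
```

The scores are measured on the validation dataset rather than the testing dataset, so the testing dataset stays untouched for the final check of the chosen model.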

Deployment :

After our final model is ready, it is put to use in an application, such as a web application or other software, depending on the use case.
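Deployment can take many forms, and the blog does not describe a specific one. As one very simplified sketch, the snippet below saves a model's learned settings as text and loads them again inside what would be the application. The "model" here is a hypothetical one-parameter threshold rule, and `save_model` and `load_model` are made-up helper names.

```python
import json

def save_model(params):
    """Serialize the model's learned parameters to a JSON string."""
    return json.dumps(params)

def load_model(text):
    """Rebuild a predict function from saved parameters."""
    params = json.loads(text)

    def predict(X):
        # Hypothetical rule: predict 1 when the value exceeds the threshold.
        return [1 if x > params["threshold"] else 0 for x in X]

    return predict

# Training side: store the final model.
saved = save_model({"threshold": 4.5})

# Application side: load the model and answer new requests.
predict = load_model(saved)
print(predict([3.0, 6.0]))   # [0, 1]
```

In a real application the saved text would typically be written to a file or a model store, and the predict function would sit behind a web endpoint or a similar interface.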

Summary :

Data is collected from various sources, and the combination of all this data is the raw data.
Processing the raw data is known as data preprocessing; it gives us the processed data.
The dataset is divided into X and y, and into training, testing, and validation datasets.
The ML model is created from the training data and tested on the testing data.
After hyperparameter tuning, the model is sent for deployment.

-Santosh Saxena