Let’s take an example. You are designing an automation script to perform a particular activity. To develop an automation, you first need the data and a well-defined set of rules to follow to reach the desired output.
In Machine Learning, things are a little different. You don’t make the rules. You know what the input is, you know what the output is but don’t really know what operations helped in reaching the specific output.
Let’s take a very simple example to start with.
Input | Output |
1 | 2 |
2 | 4 |
3 | 8 |
4 | 16 |
5 | 32 |
6 | 64 |
In this example, you have an input and an output. If I ask you what the output would be when the input is 7, it is pretty straightforward for you to determine the operation here. It is 2^n, where n is the input number. With Machine Learning, we want to give a computer the same pattern-finding ability, so that it can find patterns that even we cannot.
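As a quick sanity check, the rule we spotted can be written down and tested against the table. This is only a minimal sketch; the names table and rule are made up for illustration.

```python
# Input/output pairs copied from the table above.
table = {1: 2, 2: 4, 3: 8, 4: 16, 5: 32, 6: 64}

def rule(n):
    """The pattern we spotted by eye: 2 raised to the power of the input."""
    return 2 ** n

# The rule reproduces every row of the table...
assert all(rule(n) == out for n, out in table.items())

# ...so we can apply it to an input we have not seen yet.
print(rule(7))  # 128
```

Here we wrote the rule ourselves. The whole point of Machine Learning is to have the computer come up with it from the rows alone.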
Here is a subset of data from a real-world problem being solved with Machine Learning. Try to find a pattern; you have 10 seconds.
Input A | Input B | Input C | Input D | Output |
30 | 70000 | 12000 | 11.11 | 0 |
22 | 33000 | 10000 | 11.12 | 1 |
22 | 56000 | 4000 | 13.35 | 0 |
25 | 25000 | 3500 | 13.49 | 1 |
37 | 35000 | 6000 | 11.49 | 0 |
Couldn’t find one, right? And that is with just 5 examples and only 4 input columns. Many real-world datasets are well over 100,000 examples and can span anywhere from 5 to 500 columns (no hard limit on the columns, just stating a number to make a point).
To solve such problems, we make use of Machine Learning. The concepts involved in Machine Learning go both wide and deep. At every turn of applying Machine Learning, you need to make a decision - which algorithm to use? how do I check if the algorithm gave a good output? what operations should I do on the data to make it easier for the algorithm to learn? These are just 3 of the questions you’d need answers to before you design a solution. There are tens of such questions, if not a hundred, and the number keeps increasing as you gain experience.
If you have read my previous posts, you’d know how much I emphasize a top-down approach to learning. However, it can’t be used here today. We need to touch on 2 concepts before we come back to the top-down approach for the remainder of this series. Please bear with me on this.
Concept 1 - Types of Machine Learning Algorithms
Primarily, there are 2 types of Machine Learning - Supervised and Unsupervised. There are more, but let’s not go into that rabbit hole at the start itself.
In Supervised Machine Learning, you know both the input and the output but don’t know how to reach the output from the input. The algorithms that learn this mapping from examples fall in the category of Supervised Machine Learning.
In Unsupervised Machine Learning, however, you don’t have the output either! Let’s take a real-world example, something I’ve worked on before, to understand what the heck goes into unsupervised machine learning. For any application, sending personalized notifications is extremely important. However, it is not really feasible to write a personalized notification for each user (at least it wasn’t before the ChatGPT era). So you use the information you have about the users to club them into “N” groups and write a notification for each group. Here, you don’t necessarily have predefined groups. You want the algorithm to determine on its own which group each user should be in.
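To make the grouping idea concrete, here is a minimal sketch using KMeans from scikit-learn (installable with pip install scikit-learn). The users, the two features, and the choice of 2 groups are all made up for illustration; this is not the setup from the real project mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical users: [age, sessions per week]. Note there is no output column.
users = np.array([
    [19, 25], [22, 30], [21, 28],   # younger, very active
    [45, 3],  [52, 2],  [48, 4],    # older, occasional
])

# Ask for N = 2 groups; random_state only makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(users)

# One group id per user. The ids themselves (0 or 1) are arbitrary;
# what matters is which users end up sharing an id.
print(groups)
```

Nobody told the algorithm which users belong together. It only saw the two columns and was asked for 2 groups.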
Concept 2 - Problem Types solved with Machine Learning
Under Supervised Machine Learning, we have 2 problem types - Classification and Regression. Under Unsupervised, we have Clustering (the same example we looked at before).
These problem types are mostly self-explanatory. For Classification problems, we develop an algorithm that is able to assign a data point to one of the classes the algorithm is designed for. A very famous example you’ll come across when starting with Machine Learning is a Dog vs Cat classifier, where you try to build an algorithm that can classify images of a pet into 2 categories - dog or cat. Regression problems, on the other hand, are the ones where the algorithm tries to predict a continuous value instead of a discrete class. An algorithm that estimates tomorrow’s temperature would fall in this category.
Clustering problems, a part of unsupervised machine learning, require you to make groups out of the data that you have. The notification example we discussed before is a prominent example of this.
Time to get our hands dirty
We’ll start with a toy dataset.
Input | Output |
9 | 0 |
12 | 1 |
13 | 1 |
4 | 0 |
6 | 0 |
11 | 1 |
7 | 0 |
This dataset is relatively easy. Just by looking at it, you can see that any input below 10 is mapped to 0 while anything larger than 10 is mapped to 1.
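For comparison, this is what the rule looks like when we write it ourselves. It is only a sketch: the threshold of 10 is read off the table by eye, and hand_written_rule is a made-up name.

```python
# Input/output pairs copied from the toy dataset above.
inputs = [9, 12, 13, 4, 6, 11, 7]
outputs = [0, 1, 1, 0, 0, 1, 0]

def hand_written_rule(x, threshold=10):
    """Return 1 for inputs above the threshold, otherwise 0."""
    return 1 if x > threshold else 0

# Apply the rule to every input and compare with the known outputs.
predictions = [hand_written_rule(x) for x in inputs]
print(predictions == outputs)  # True
```

Notice that the dataset has no row for exactly 10, so which side 10 belongs to is a guess on our part. A model trained on this data has to pick its own cut-off from the examples alone.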
Let’s implement a really simple script in python.
To do so, we’ll leverage a library called scikit-learn, which you can install by running the following command:
pip install scikit-learn
scikit-learn is one of the most popular Python libraries for Machine Learning. It implements most of the commonly used algorithms, along with optimizations and various other concepts that we’ll cover in later topics.
First, let’s get the data ready:
import numpy as np
X = np.array([9, 12, 13, 4, 6, 11, 7]).reshape(-1, 1)
y = np.array([0, 1, 1, 0, 0, 1, 0])
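This creates NumPy arrays for both the input and the output. The reshape(-1, 1) call turns the input into a single column with one row per example, because scikit-learn expects the input to be 2-dimensional.
Next, we’ll use the Logistic Regression algorithm (don’t worry, we’ll cover it) from scikit-learn.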
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
That’s basically it. You have implemented your first algorithm. [From now on, we’ll call Machine Learning algorithms ‘models’ to be more in line with Machine Learning terminology.]
In Machine Learning terms, when we feed existing data to the model, we are essentially training it. To train a model, every scikit-learn model has a .fit() method to which we pass our existing data.
So what we have done here is essentially train the Logistic Regression model on the data that we prepared earlier.
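If you are curious what training actually produced, you can peek inside the trained model. The sketch below repeats the setup so it runs on its own. coef_, intercept_ and score() are part of scikit-learn’s LogisticRegression; turning them into a single cut-off value is my own illustration and only works because we have exactly one input column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as above.
X = np.array([9, 12, 13, 4, 6, 11, 7]).reshape(-1, 1)
y = np.array([0, 1, 1, 0, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# With a single input column, the model learns one weight and one intercept.
weight = model.coef_[0][0]
intercept = model.intercept_[0]

# The predicted class flips where weight * x + intercept crosses zero.
boundary = -intercept / weight
print(f"learned cut-off is roughly {boundary:.2f}")

# Fraction of the training rows the model gets right.
print(model.score(X, y))
```

The exact numbers depend on the model’s default settings, so treat the printed cut-off as approximate. What matters is that it was derived from the data and not typed in by us.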
Now, to get predictions on a new set of data, you can just do:
X_test = np.array([8, 18]).reshape(-1, 1)  # predict() needs the same 2D shape as fit()
y_predictions = model.predict(X_test)
print(y_predictions)
# [0 1]
And that’s it. We have our first model, trained and doing predictions!
If this was helpful, you can follow the blog or bookmark the series and I’ll be sharing 1-2 articles every week on this series.