Classification : Machine Learning is used to label input data based on the training data provided. This labeling of data is called classification. Here, a record is classified into one of the possible groups by the algorithm. The output is a class label.
Consider the familiar email Spam Classification example. Here, initially, a set of spam and non-spam emails is used to train the model, and then any new email that hits your inbox is classified as either spam or not-spam. This is a Classifier Model in Machine Learning.
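To make that concrete, here is a rough sketch of such a classifier using scikit-learn's Naive Bayes. The tiny email corpus, its labels and the choice of library are my own illustration, not a fixed recipe:

```python
# A minimal spam-vs-not-spam classifier sketch using scikit-learn.
# The tiny corpus and labels below are invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",         # spam
    "claim your free lottery win",  # spam
    "meeting agenda for monday",    # not spam
    "lunch tomorrow with the team"  # not spam
]
labels = ["spam", "spam", "not-spam", "not-spam"]

# Turn each email into a vector of word counts (the feature variables).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train the classifier on the labelled examples.
model = MultinomialNB()
model.fit(X, labels)

# Classify a new, unseen email.
new_email = vectorizer.transform(["free prize waiting for you"])
print(model.predict(new_email))   # expected: ['spam']
```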
There are various classifier models in practice. The right classifier for a solution depends on various factors. The following are a few common classifier models and reasons to choose them:
- Boosting – often effective when a large amount of training data is available.
- Random trees – often very effective and can also perform regression.
- K-nearest neighbors – simplest thing you can do, often effective but slow and requires lots of memory.
- Neural networks – slow to train but very fast to run; still a top performer for letter recognition.
- SVM – among the best with limited data, losing to boosting or random trees only when large data sets are available.
Ref: An answer on Stack Overflow pointing to the “OpenCV” book.
Prediction/Regression : Unlike classification, regression is a type of problem where the algorithm finds a continuous number/value from the given input. A simple example would be predicting the price of a house, given the number of rooms, the area and the location. Here, a training set of houses with known prices is fed into the model. The algorithm comes up with an equation to apply on new inputs. Another example is predicting the price of a stock, given various input features.
The output here is a continuous value of the target variable.
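A minimal sketch of the house price example, assuming scikit-learn's LinearRegression and made-up room counts, areas and prices (location is left out to keep it simple):

```python
# A minimal house-price regression sketch with scikit-learn.
# The feature values and prices below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [number of rooms, area in sq.ft]; target: price.
X_train = np.array([[2, 800], [3, 1200], [3, 1500], [4, 2000]])
y_train = np.array([150000, 220000, 260000, 330000])

model = LinearRegression()
model.fit(X_train, y_train)        # the algorithm derives the equation here

# Predict a continuous price for a new house.
print(model.predict([[3, 1400]]))
```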
Training Data : This is the data set which has the feature variables and the target variable. It is used to train the algorithm to derive the classification/regression equation; in other words, training data is the data the algorithm learns from.
Test Data : Test data is the data set which is used to validate the trained algorithm. This data set will also have feature and target variables. The trained algorithm is executed on the records in the test data. Now, the actual value/label in the target variable and the output value/label from the algorithm can be compared to measure the accuracy of the trained algorithm. The smaller the difference, the higher the accuracy!
Notes : Test data can come from the same available data as the training data, but it is hidden from the algorithm during training and used fresh to test later. The test records can be randomly selected from the available data, or a certain set of records can be set aside as test data. Generally, a percentage (e.g. 20%) of randomly selected records from the available data is made the test data.
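For instance, with scikit-learn the 20% hold-out could be done like this (the toy arrays simply stand in for the real feature and target variables):

```python
# A sketch of holding out ~20% of the available data as test data.
# The toy arrays below stand in for the real feature variables (X) and target variable (y).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 records, 2 feature variables
y = np.arange(10)                  # target variable

# 20% of the records are randomly selected and hidden away as test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))   # 8 training records, 2 test records
```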
Linear regression is the process of identifying a line/curve (the hypothesis) from the training data that represents the relation between the feature variables and the target variable. In the example of determining the price of a house given its area, linear regression finds the relationship between the area and the price of the house. This line/curve should have minimum error.
The minimum error is determined using a loss function and the parameters. The parameters are varied to find the minimum value of the loss function. Basically, the loss function denotes the difference between the actual target variable value and the value computed through the hypothesis equation.
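As a small sketch, one common choice of loss function is the mean squared error. Assuming a one-feature linear hypothesis h(x) = theta0 + theta1 * x and made-up data, it looks like this:

```python
# A minimal sketch of a mean-squared-error loss for a linear hypothesis
# h(x) = theta0 + theta1 * x. The toy data below is made up for illustration.
import numpy as np

def loss(theta0, theta1, x, y):
    predictions = theta0 + theta1 * x          # value computed from the hypothesis
    return np.mean((predictions - y) ** 2)     # average squared difference from the actual value

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(loss(0.0, 2.0, x, y))   # 0.0 -- this choice of parameters fits the data perfectly
print(loss(0.0, 1.0, x, y))   # a larger loss for a worse choice of parameters
```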
Gradient Descent is one of the methods used in linear regression to find the minimum value of the loss function. It is an iterative method: the parameters are varied and the loss is recomputed until a minimum value is reached. This might lead to a local or a global minimum.
Gradient Descent – Iterative Descent to Minimum Value
The rate at which the steps are taken towards the minimum is determined by the learning rate. This has to be defined while training the algorithm. Once the hypothesis is finalized, it is applied to any new data passed to the algorithm and the value is calculated.
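Putting the pieces together, here is a rough sketch of gradient descent for that one-feature hypothesis; the data, the learning rate and the number of iterations are made up for illustration:

```python
# A sketch of gradient descent for the one-feature linear hypothesis above.
# The data and the learning rate are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])     # true relation: y = 2x + 1

theta0, theta1 = 0.0, 0.0              # start with arbitrary parameter values
learning_rate = 0.05                   # controls the size of each step towards the minimum

for step in range(2000):               # iterative descent
    predictions = theta0 + theta1 * x
    error = predictions - y
    # Move each parameter a small step against the gradient of the loss.
    theta0 -= learning_rate * error.mean()
    theta1 -= learning_rate * (error * x).mean()

print(theta0, theta1)                  # should approach 1 and 2
```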
Three basic terms to be learnt in machine learning are:
Concept : A concept is what the machine learns in the process. In a classification task, it learns how to classify; that is the concept.
Instances : Each row/record in the training data set is an instance. It is a collection of one or more attributes.
Attributes : Each column/field in the data set is an attribute. These are used by the algorithm to come up with the hypothesis from the data set.
The next post will be about Training and Test data. In parallel, I will also write about the topics I learn and the exercises I practice in the Coursera Machine Learning course. 🙂 🙂
Unsupervised learning is the method of finding hidden patterns or groupings within data on its own. Unlike supervised learning, there are no labels or labelled training data here. The data is clustered into groups by the algorithm using the similarity in the data's features. In most cases, we do not know the reason behind the formation of the clusters unless we analyse the features of the data in each cluster.
Commonly used unsupervised algorithms are:
- Self-organizing maps
- k-means clustering
- Hierarchical clustering
- Hidden Markov Models
- Gaussian mixture models
A good example would be clustering the fans/followers of a Facebook page or Twitter handle. The features would be the profile details of each user, and each cluster would have similar users grouped together.
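As a small sketch of that idea with k-means (the feature values below are invented stand-ins for user profile details, and scikit-learn is just one possible library):

```python
# A minimal k-means clustering sketch with scikit-learn.
# Each row stands in for a user's profile features (values invented for illustration).
import numpy as np
from sklearn.cluster import KMeans

users = np.array([
    [18, 1.0], [20, 0.9], [19, 1.1],   # one group of similar users
    [45, 5.0], [50, 4.8], [47, 5.2],   # another group of similar users
])

# No labels are given; the algorithm groups the rows by similarity alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(users)
print(kmeans.labels_)   # cluster id assigned to each user, e.g. [0 0 0 1 1 1]
```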
Workflow Diagram Reference for my last two posts : machine-learning-who-s-the-boss
In the next post, I will briefly discuss each of the algorithms in the supervised and unsupervised categories.
Supervised learning is the method of using labelled training data to train the algorithm. The training data will have an input part and its label (the output). The input will mostly be a vector of parameters. Using this, the algorithm trains itself, and when a new input is given, it classifies or predicts the output label.
The accuracy of the algorithm can be determined using a test data set similar to the training data. To improve accuracy, the training control parameters can be adjusted depending on the algorithm selected. A few points to remember while using supervised learning:
- The training data set should not be biased towards a particular output label
- Overfitting – the issue where the algorithm trains itself too closely to the training data, so the error on new data increases.
- The type of input vectors – numerical, categorical, etc.
A few of the most used supervised learning algorithms are Support Vector Machines, Neural Networks, Naive Bayes, decision trees, k-nearest neighbors, linear regression and logistic regression.
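To tie the pieces together, here is a rough sketch of the supervised workflow using one of these algorithms, k-nearest neighbors, on the Iris dataset that ships with scikit-learn (the dataset and parameter values are just for illustration):

```python
# A sketch of the supervised workflow: train on labelled data, then check
# accuracy on held-out test data. Uses the Iris dataset bundled with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # input vectors and their labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = KNeighborsClassifier(n_neighbors=3)   # n_neighbors is a training control parameter
model.fit(X_train, y_train)                   # the algorithm trains itself on labelled data

predictions = model.predict(X_test)           # classify new, unseen inputs
print(accuracy_score(y_test, predictions))    # compare predicted vs actual labels
```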
I will write about unsupervised learning in the next post.
Variables are the basic building blocks of an ML algorithm. Based on these variables, the algorithm identifies an equation which will be applied to new input data. These variables are mostly of two types (a small code sketch after the examples below shows both kinds):
- Categorical Variables
This variable represents a field which can be classified into categories or groups.
example : sex, favorite color, age group
- Numerical Variables
This variable represents a field which can be measured and sorted.
example : height, weight
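Here is the small sketch mentioned above, assuming pandas and a made-up table, just to show the two kinds of variables side by side:

```python
# A small sketch of the two variable types in a pandas DataFrame.
# The column names and values are invented for illustration.
import pandas as pd

data = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],                 # categorical: one of a few groups
    "favorite_color": ["red", "blue", "red", "green"],
    "height_cm": [172.0, 165.5, 158.0, 180.2],   # numerical: can be measured and sorted
    "weight_kg": [68.0, 54.5, 50.0, 82.3],
})

print(data.select_dtypes(include="object").columns.tolist())   # categorical columns
print(data.select_dtypes(include="number").columns.tolist())   # numerical columns
```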
Categorical variables are visualized using bar charts, frequency tables or pie charts.
visualizing categorical data
Numerical variables are visualized using scatter plots or line graphs.
visualizing numerical data
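A quick matplotlib sketch of both kinds of plots, with made-up values, could look like this:

```python
# A small visualization sketch with matplotlib; the values are made up for illustration.
import matplotlib.pyplot as plt

# Bar chart for a categorical variable: how many records fall into each group.
colors = ["red", "blue", "red", "green", "blue", "red"]
counts = {c: colors.count(c) for c in set(colors)}
plt.bar(list(counts.keys()), list(counts.values()))
plt.title("favorite color counts")
plt.show()

# Scatter plot for two numerical variables.
heights = [158.0, 165.5, 172.0, 180.2]
weights = [50.0, 54.5, 68.0, 82.3]
plt.scatter(heights, weights)
plt.xlabel("height (cm)")
plt.ylabel("weight (kg)")
plt.show()
```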
An interesting reference : Shodor – Numerical and Categorical data
In my next blog, I will be writing on Supervised Learning.