Evaluating the performance of a machine learning model is a crucial part of building an effective model. Several evaluation metrics exist for classification and regression problems, and the metric you choose shapes how model performance is judged. Here, I explain the different evaluation metrics with examples in Python.
Classification Metrics
- Classification Accuracy
- F1 Score
- Logarithmic Loss
- Area Under ROC Curve
- Confusion Matrix
- Classification Report
- Jaccard Score
Regression Metrics
- Mean Absolute Error
- Mean Squared Error
- Mean Squared Log Error
- R2 Score
Confusion Matrix
A confusion matrix is an evaluation metric for a classification task with two or more classes. Consider a binary example:
Class A: the person does not have cancer
Class B: the person has cancer
True Positive (TP): a cancer patient is classified as having cancer (class B).
True Negative (TN): a non-cancer patient is classified as not having cancer (class A).
False Positive (FP): a non-cancer patient is classified as having cancer (class B).
False Negative (FN): a cancer patient is classified as not having cancer (class A).
TP & TN: Model classifying data correctly as compared to the actual class.
FP & FN: Model classifying data incorrectly as compared to the actual class.
What's the Aim?
Ideally, the model should produce zero False Positives and zero False Negatives. But with real-life noisy data, this is impossible to achieve most of the time.
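As a minimal sketch of how these counts can be obtained in Python with scikit-learn (the labels and predictions below are made up purely for illustration):

```python
from sklearn.metrics import confusion_matrix

# 0 = class A (no cancer), 1 = class B (cancer); toy labels for illustration
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```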
Classification Accuracy
Classification accuracy is the ratio of the number of correct predictions to the total number of predictions made. It is the most familiar evaluation metric for classification tasks, and it works best when there is an equal number of samples belonging to each class.
Example:
Total number of patients = 100 (5 patients have cancer & 95 patients do not have cancer)
Here, 95% of the data belongs to a single class (class A), so the data is imbalanced.
Suppose the model correctly classifies 93 of the 95 non-cancer patients as not having cancer but wrongly flags the other 2 as having cancer. It correctly classifies only 1 of the 5 cancer patients as having cancer and classifies the remaining 4 cancer patients as not having cancer. This is a very poor model.
Let’s find the accuracy of the model:
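In terms of the confusion matrix counts of this example (TP = 1, TN = 93, FP = 2, FN = 4):
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (1 + 93) / 100 = 0.94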
Even though the trained model is very poor at predicting cancer, the accuracy of such a model is 94%.
Precision
Precision is the ratio of the number of correctly predicted positive samples to the total number of samples predicted as positive by the model.
Let’s calculate Precision in our above example,
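Precision = TP / (TP + FP) = 1 / (1 + 2) ≈ 0.33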
The precision of such a poor model is 33%.
Recall
The evaluation metric Recall answers the question: of the patients who actually have cancer, what percentage did the model diagnose as having cancer?
Recall is the ratio of the number of correctly predicted positive samples to the total number of actual positive samples.
Let’s calculate Recall in our above example,
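Recall = TP / (TP + FN) = 1 / (1 + 4) = 0.20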
The Recall of such a poor model is 20%.
Precision measures the performance with respect to False Positives (FP).
Recall measures the performance with respect to False Negatives (FN).
F1 Score
The F1-score is the harmonic mean of precision and recall. It indicates how precise your model is as well as how robust it is. The mathematical equation of the F1-score is
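F1 = 2 × (Precision × Recall) / (Precision + Recall)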
Let's calculate the F1-Score in our above example,
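F1 = 2 × (0.33 × 0.20) / (0.33 + 0.20) ≈ 0.25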
The F1-Score of such a poor model is approximately 25%.
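As a quick check, here is a minimal Python sketch that reconstructs this 100-patient example with scikit-learn (the exact label arrays are my own reconstruction of the counts above):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 5 cancer patients (1) and 95 non-cancer patients (0)
y_true = [1] * 5 + [0] * 95
# The model catches 1 of the 5 cancer patients (TP=1, FN=4) and
# wrongly flags 2 healthy patients as having cancer (FP=2, TN=93)
y_pred = [1] + [0] * 4 + [1] * 2 + [0] * 93

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.94
print("Precision:", precision_score(y_true, y_pred))  # ~0.33
print("Recall   :", recall_score(y_true, y_pred))     # 0.20
print("F1-Score :", f1_score(y_true, y_pred))         # 0.25
```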
Logarithmic Loss
Logarithmic Loss is also known as Log Loss. It is a good performance metric for classification tasks and is based on predicted probabilities: the classifier predicts a probability for each sample, and lower Log Loss values indicate better predictions. A perfect classifier would have a Log Loss of 0. The mathematical equation for Log Loss is
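Log Loss = -(1/N) × Σi Σj yij × log(pij)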
where N is the number of samples, yij indicates whether sample i belongs to class j or not, and
pij is the predicted probability that sample i belongs to class j.
Example:
The classifier predicts the probabilities of cancer [0.8, 0.75, 0.2] for 3 patients. The first two patients have cancer and the third patient does not, so the actual results of the three patients are [1, 1, 0].
1 indicates having cancer and 0 indicates not having cancer.
The classifier says that the first patient has an 80% chance of having cancer, the second patient has a 75% chance, and the third patient has only a 20% chance. Using a threshold of 50%, we can say that the first two patients have cancer and the last patient does not.
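A minimal sketch of this calculation with scikit-learn's log_loss, using the three-patient example above:

```python
from sklearn.metrics import log_loss

# Actual labels: first two patients have cancer (1), the third does not (0)
y_true = [1, 1, 0]
# Predicted probability of having cancer for each patient
y_prob = [0.8, 0.75, 0.2]

print("Log Loss:", log_loss(y_true, y_prob))  # ~0.24
```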
Accuracy vs Log Loss
Accuracy only measures whether each yes/no prediction made by the classifier is correct.
Log Loss, in contrast, takes the uncertainty of the prediction into account: how much the predicted probability differs from the actual label. This gives a more detailed view of the performance of a classifier.
Area Under ROC Curve (AUC)
Area Under the ROC Curve is a performance metric for binary classification and a popular choice for classification tasks. The larger the area under the curve, the better the model's performance. The AUC metric considers both the positive class and the negative class.
AUC is the area under the curve of the true positive rate (sensitivity) plotted against the false positive rate (1 - specificity) as the classification threshold varies over [0, 1].
Let’s understand the ROC Curve.
ROC (Receiver Operating Characteristic)
The ROC curve is typically used in binary classification to observe the output of a model. It plots the True Positive Rate (sensitivity) against the False Positive Rate (1 - specificity) at different classification thresholds.
True Positive Rate (Sensitivity): the proportion of positive samples that are correctly classified as positive, out of all actual positive samples. TPR = TP / (TP + FN).
False Positive Rate (1 - Specificity): the proportion of negative samples that are mistakenly classified as positive, out of all actual negative samples. FPR = FP / (FP + TN).
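A minimal sketch of computing the ROC curve and AUC with scikit-learn (the labels and predicted probabilities below are made up purely for illustration):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print("AUC:", roc_auc_score(y_true, y_prob))
```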
Mean Absolute Error
Mean Absolute Error (MAE) is the mean of the absolute differences between the predicted values and the actual values. It measures how far the predicted values are from the actual values. MAE gives the magnitude of the error but not its direction, i.e., whether the model is under-predicting or over-predicting. Its mathematical representation is
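MAE = (1/N) × Σ |actual_i - predicted_i|
A minimal sketch in Python with scikit-learn (the values below are made up purely for illustration):

```python
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted regression values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print("MAE:", mean_absolute_error(y_true, y_pred))  # 0.5
```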
Mean Squared Error (MSE)
Mean Squared Error is similar to Mean Absolute Error (MAE). The only difference is that MSE takes the average of the squared differences between the predicted and actual values. The advantage of using MSE is that, because the errors are squared before being averaged, larger errors have a disproportionately larger effect: MSE pays more attention to large errors and penalizes them more heavily.
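Its formula is MSE = (1/N) × Σ (actual_i - predicted_i)². A minimal sketch with the same made-up values as above:

```python
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print("MSE:", mean_squared_error(y_true, y_pred))  # 0.375
```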