Spot abnormal values
The Spot abnormal values task finds potentially abnormal values or values that look different from other values in the target column. See case #2 of the tutorial for a complete example.
This task is useful for understanding outliers and for spotting data of interest. For example, if your sheet contains records of client actions, clients that behave differently from the others can serve as leads for detecting fraud or novel behavior.
Note
Although it may seem counterintuitive, in many situations it is normal to have abnormal values, and it would be abnormal for all the rows to behave the same. A high abnormality score (defined below) therefore does not necessarily mean that the value contains an error, was misreported, or is the result of a conscious action.
Use this task as follows:
Make sure your data is well formatted.
In the “Column with abnormal values” field, select the column with possibly abnormal values.
(Optional, advanced) Remove some source columns. In most cases, leaving all the source columns will work best.
(Optional, advanced) Change the learning algorithm. Gradient Boosted Trees and Random Forests are both excellent on tabular data. The decision tree algorithm is more interpretable.
Click the “Spot abnormal value” button.
The task creates the following new columns:
“Pred:Abnormality:[target column]” is the abnormality score, between 0 and 1. A score of 0 indicates that the value is normal, while a score of 1 indicates that the value is abnormal.
“Pred:MostLikely:[target column]” is the most likely value. The abnormality score is generally high when the most likely value is not equal to the actual value.
How is abnormality computed?
Ten different models are trained using a 10-fold cross-validation protocol. Each value is then compared with the prediction of the corresponding model in the cross-validation. If the existing value and the predicted value do not match, the row is considered abnormal.
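The cross-validation protocol described above can be illustrated with a minimal sketch. It is not the tool's implementation: the learner here is a toy majority-vote lookup standing in for a real model, and the names `train_majority_model`, `out_of_fold_predictions`, and the `plan`/`status` columns are hypothetical. The point is only that each row is predicted by a model that never saw that row during training.

```python
from collections import Counter, defaultdict

def train_majority_model(rows, feature, target):
    """Toy stand-in for a real learner: most frequent target per feature value."""
    counts = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        counts[row[feature]][row[target]] += 1
        overall[row[target]] += 1
    fallback = overall.most_common(1)[0][0]

    def predict(row):
        seen = counts.get(row[feature])
        return seen.most_common(1)[0][0] if seen else fallback

    return predict

def out_of_fold_predictions(rows, feature, target, num_folds=10):
    """Predict every row with the model trained on the other folds."""
    predictions = [None] * len(rows)
    for fold in range(num_folds):
        # Rows whose index falls in this fold are held out of training.
        train = [r for i, r in enumerate(rows) if i % num_folds != fold]
        predict = train_majority_model(train, feature, target)
        for i, row in enumerate(rows):
            if i % num_folds == fold:
                predictions[i] = predict(row)
    return predictions

# Hypothetical sheet: 30 rows, one of which disagrees with its peers.
rows = [{"plan": "free", "status": "active"} for _ in range(15)]
rows += [{"plan": "paid", "status": "renewed"} for _ in range(15)]
rows[7] = {"plan": "free", "status": "renewed"}  # the odd one out

predictions = out_of_fold_predictions(rows, "plan", "status")
mismatches = [i for i, (r, p) in enumerate(zip(rows, predictions))
              if r["status"] != p]
print(mismatches)
```

Only the row whose existing value disagrees with its held-out prediction is reported as a mismatch.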
For classification, the abnormality score is the difference between the predicted probability of the predicted value and the predicted probability of the existing value.
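As a small worked illustration of the classification formula, the sketch below computes the score from a dictionary of predicted probabilities. The function name `classification_abnormality` and the example probabilities are hypothetical, and a value absent from the dictionary is assumed to have probability 0.

```python
def classification_abnormality(probabilities, existing_value):
    """Return (most likely value, abnormality score) for one row.

    probabilities maps each candidate value to its predicted probability.
    """
    predicted_value = max(probabilities, key=probabilities.get)
    # Assumption: a value the model never predicts has probability 0.
    existing_probability = probabilities.get(existing_value, 0.0)
    score = probabilities[predicted_value] - existing_probability
    return predicted_value, score

probs = {"active": 0.9, "renewed": 0.08, "cancelled": 0.02}

# The existing value is also the most likely one: the score is 0.
same_value, same_score = classification_abnormality(probs, "active")

# The existing value is unlikely: the score is the probability gap.
other_value, other_score = classification_abnormality(probs, "renewed")

print(same_value, same_score)
print(other_value, other_score)
```

When the existing value is the predicted value, the two probabilities are identical and the score is exactly 0, which matches the observation that the score is high mainly when the most likely value differs from the actual one.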
For regression, the abnormality score is close (but not equal) to one minus the p-value obtained by testing whether the prediction residual comes from the same distribution as all the other residuals.
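The exact statistical test is not specified here, so the following is only a rough sketch of the idea, assuming an empirical p-value: the fraction of other residuals that are at least as large in absolute value as the residual being tested. The function name `regression_abnormality` and the residual values are hypothetical, and the actual score may be computed differently.

```python
def regression_abnormality(residuals, index):
    """Approximate score: one minus an empirical p-value for one residual."""
    target = abs(residuals[index])
    others = [abs(r) for i, r in enumerate(residuals) if i != index]
    # Assumed test: share of other residuals at least as extreme as this one.
    # The +1 terms keep the p-value strictly positive.
    at_least_as_extreme = sum(1 for r in others if r >= target)
    p_value = (1 + at_least_as_extreme) / (1 + len(others))
    return 1.0 - p_value

# Hypothetical residuals (actual value minus predicted value), one per row.
residuals = [0.1, -0.2, 0.05, 0.15, -0.1, 3.0, 0.0, -0.05, 0.2]
scores = [regression_abnormality(residuals, i) for i in range(len(residuals))]

most_abnormal = max(range(len(scores)), key=scores.__getitem__)
print(most_abnormal, scores[most_abnormal])
```

With this approximation, the row whose residual is far larger than all the others receives the highest score, and a row with a zero residual receives a score of 0.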