Main outlier detection methods
The main outlier detection methods implemented in the UAHDataScienceO package are:
- boxandwhiskers()
- DBSCAN_method()
- knn()
- lof()
- mahalanobis_method()
- z_score_method()
This section shows how to use each of these implementations.
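The calls in this section operate on an object named inputData that is not defined here. Judging from the coordinates echoed in the tutorial output, it holds seven two-dimensional points. The sketch below rebuilds an equivalent object so the examples can be tried; the data-frame container and the column names x and y are assumptions made for illustration.

```r
# Coordinates copied from the distance listing printed by knn() in learn mode.
# The container type (data frame) and the column names are assumptions.
inputData <- data.frame(
  x = c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14),
  y = c(2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)
)

# Flattened column by column, the values come out in the same order in which
# the box and whiskers output reports them (positions 1 to 14).
flat <- unlist(inputData, use.names = FALSE)
print(flat)
```

Position 7 of the flattened vector is 14 and position 9 is 12, which are the two values the box and whiskers example reports.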
Box and Whiskers (boxandwhiskers())
With the learn mode deactivated and d=2:
boxandwhiskers(inputData,2,FALSE)
#> Obtained limits:
#> -0.100000000000001 10.4
#> The value in position 7 with value 14 has been detected as an outlier
#> It was detected as an outlier because it's value is higher than the top limit 10.4
#> --------------------------------------------------------------------------------------------
#> The value in position 9 with value 12 has been detected as an outlier
#> It was detected as an outlier because it's value is higher than the top limit 10.4
#> --------------------------------------------------------------------------------------------
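The limits and the two flagged positions reported above can be re-derived by hand. The following is a minimal sketch, not the package's code: it takes the 14 values in the order the output lists them, computes the first and third quartiles with a simple positional rule, and applies the interval (Q1 - d * (Q3 - Q1), Q3 + d * (Q3 - Q1)). The helper name simple_quantile is hypothetical.

```r
# The 14 values in the order reported by the output (positions 1 to 14).
x <- c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14, 2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)

# Hypothetical helper: positional quantile on the sorted data.
# If n * v is a whole number, average the two surrounding values;
# otherwise take the value just after floor(n * v).
simple_quantile <- function(data, v) {
  data <- sort(data)
  nc <- length(data) * v
  if (nc == floor(nc)) {
    (data[nc] + data[nc + 1]) / 2
  } else {
    data[floor(nc) + 1]
  }
}

d <- 2
q1 <- simple_quantile(x, 0.25)
q3 <- simple_quantile(x, 0.75)

# Interval outside of which a value counts as an outlier.
limits <- c(q1 - d * (q3 - q1), q3 + d * (q3 - q1))
outlier_positions <- which(x < limits[1] | x > limits[2])

print(limits)
print(outlier_positions)
```

With these values the quartiles are 4.1 and 6.2, the limits are approximately -0.1 and 10.4, and positions 7 and 9 are flagged. The whole-number test is written as nc == floor(nc) because is.integer() in base R checks the storage type rather than the value; for this data neither product is whole, so only the second branch runs.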
With the learn mode activated and d=2:
boxandwhiskers(inputData,2,TRUE)
#> The tutorial mode has been activated for the box and whiskers algorithm (outlier detection)
#> Before processing the data, we must understand the algorithm and the 'theory' behind it.
#> The algorithm is made up with 4 steps:
#>  Step 1: Determine the degree of outlier or distance at which an event is considered an outlier (arbitrary). We will name it 'd'
#>  Step 2: Sort the data and obtain quartiles
#>  Step 3: Calculate the interval limits for outliers using the equation:
#>      (Q_1 - d * (Q_3 - Q_1), Q_3 + d * (Q_3 - Q_1))
#>  Being Q_1 and Q_3 the 1st and 3rd quartile. Notice that here we use the value 'd' (it affects on the results so it must be carefully chosen)
#>  Step 4: Identify outliers as values that fall outside the interval calculated in step 3
#> Quantiles are elements that allow dividing an ordered set of data into equal-sized parts.
#>  -Quartiles: 4 equal parts
#>  -Deciles: 10 equal parts
#>  -Percentiles: 100 equal parts
#> The function quantile.R that has been developed gives a closer look into how quantiles are calculated:
#> function (data, v) 
#> {
#>     data = transform_to_vector(data)
#>     data = sort(data)
#>     nc = length(data) * v
#>     if (is.integer(nc)) {
#>         x = (data[nc] + data[nc + 1])/2
#>     }
#>     else {
#>         x = data[floor(nc) + 1]
#>     }
#>     return(x)
#> }
#> Now we will apply this knowledge to the data given to obtain the outliers
#> Calculating the quantiles with the function quantile() (available on this package)
#> First we calculate the 1st quartile (quantile(data,0.25))
#> 4.1
#> Now we calculate the 3rd quartile (quantile(data, 0.75))
#> 6.2
#> Using the formula given before, we obtain the interval limits:
#> -0.100000000000001 10.4
#> Now that we have calculated the limits, we will check if every single value is 'inside' those boundaries obtained.
#> If the value is not included inside the limits, it will be detected as an outlier
#> Checking value in the position 1. It's value is 3
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 2. It's value is 3.5
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 3. It's value is 4.7
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 4. It's value is 5.2
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 5. It's value is 7.1
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 6. It's value is 6.2
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 7. It's value is 14
#> The value in position 7 with value 14 has been detected as an outlier
#> It was detected as an outlier because it's value is higher than the top limit 10.4
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 8. It's value is 2
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 9. It's value is 12
#> The value in position 9 with value 12 has been detected as an outlier
#> It was detected as an outlier because it's value is higher than the top limit 10.4
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 10. It's value is 4.1
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 11. It's value is 4.9
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 12. It's value is 6.1
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 13. It's value is 5.2
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 14. It's value is 5.3
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> The algorithm has ended
DBSCAN (DBSCAN_method())
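The noise test used by DBSCAN can be checked independently of the package. The sketch below is a deliberate simplification, not DBSCAN_method() itself: it rebuilds the seven points from the coordinates printed elsewhere in this document, counts for each point how many points (itself included) lie at a Euclidean distance below eps, and flags those with fewer than min_pts. Cluster expansion is omitted, and the variable names are assumptions.

```r
# Seven points, taken from the coordinates shown in the tutorial output.
pts <- cbind(
  x = c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14),
  y = c(2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)
)

eps <- 4       # maximum distance for two points to be neighbours
min_pts <- 3   # minimum neighbourhood size (the point itself counts)

# Pairwise Euclidean distances; dist() uses the Euclidean metric by default.
dist_mat <- as.matrix(dist(pts))

# Size of each point's eps-neighbourhood, including the point itself.
neighbour_count <- unname(rowSums(dist_mat < eps))

# Points whose neighbourhood is too small are treated as noise.
noise <- which(neighbour_count < min_pts)

print(neighbour_count)
print(noise)
```

For eps = 4 and min_pts = 3 the neighbourhood sizes are 3, 1, 5, 5, 4, 4 and 1, so points 2 and 7 are flagged, in agreement with the output of the examples in this section.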
With the learn mode deactivated:
eps <- 4
min_pts <- 3
DBSCAN_method(inputData, eps, min_pts, FALSE)
#> The point 2 is an outlier
#> The point 7 is an outlier
With the learn mode activated:
eps <- 4
min_pts <- 3
DBSCAN_method(inputData, eps, min_pts, TRUE)
#> The tutorial mode has been activated for the DBSCAN algorithm (outlier detection)
#> Before processing the data, we must understand the algorithm and the 'theory' behind it.
#> The DBSCAN algorithm is based in this steps:
#>  Step 1: Initializing parameters
#>  Max distance threshold: 4
#>  MinPts: 3
#>  Step 2: Executing main loop
#>      If a point has already been visited, it skips to the next point.
#>      It then finds all neighbors of the current point within a distance of max_distance_threshold using the Euclidean distance function.
#>      If the number of neighbors is less than min_pts, the point is marked as noise (-1) and the loop proceeds to the next point.
#>      Otherwise, a new cluster is created, and the current point is assigned to this cluster.
#>      The algorithm then iterates over the neighbors of the current point, marking them as visited and recursively expanding the neighborhood.
#>      If a neighbor already belongs to a cluster, it assigns the same cluster id to the current point.
#>      After processing all points, the algorithm checks for outliers (points marked as -1) in the visited_array.
#>  Step 3: Identifying outliers
#>      If a point is marked as noise (-1), it is identified as an outlier.
#> With this simple steps explained, let's see how this is executed over the dataset given
#> Checking if the point 1 has already been visited
#> It has not been visited
#> Calculate the distance between this point and the rest of the points. This is the equivalent to the RangeQuery() functionality
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Point 1 neighbors:
#> 1 3 4
#> Is length of neighbors smaller than min_pts?
#> It's bigger, adding the point 1 to a cluster
#> Executing the expandCluster() functionality
#> Adding point 1 to cluster 1
#> Checking every single neighbor for the point
#> Neighbor 1 belongs to another cluster.
#> Checking every single neighbor for the point
#> Checking every single neighbor for the point
#> Process finished for this point, skipping to next point
#> ------------------------------------------------------
#> Checking if the point 2 has already been visited
#> It has not been visited
#> Calculate the distance between this point and the rest of the points. This is the equivalent to the RangeQuery() functionality
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Point 2 neighbors:
#> 2
#> Is length of neighbors smaller than min_pts?
#> It's smaller, classifying the point 2 as an outlier and skipping to next point
#> ------------------------------------------------------
#> Checking if the point 3 has already been visited
#> It has not been visited
#> Calculate the distance between this point and the rest of the points. This is the equivalent to the RangeQuery() functionality
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Point 3 neighbors:
#> 1 3 4 5 6
#> Is length of neighbors smaller than min_pts?
#> It's bigger, adding the point 3 to a cluster
#> Executing the expandCluster() functionality
#> Adding point 3 to cluster 2
#> Checking every single neighbor for the point
#> Neighbor 1 belongs to another cluster.
#> Checking every single neighbor for the point
#> Neighbor 3 belongs to another cluster.
#> Checking every single neighbor for the point
#> Checking every single neighbor for the point
#> Checking every single neighbor for the point
#> Process finished for this point, skipping to next point
#> ------------------------------------------------------
#> Checking if the point 4 has already been visited
#> It has not been visited
#> Calculate the distance between this point and the rest of the points. This is the equivalent to the RangeQuery() functionality
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Point 4 neighbors:
#> 1 3 4 5 6
#> Is length of neighbors smaller than min_pts?
#> It's bigger, adding the point 4 to a cluster
#> Executing the expandCluster() functionality
#> Adding point 4 to cluster 3
#> Checking every single neighbor for the point
#> Neighbor 1 belongs to another cluster.
#> Checking every single neighbor for the point
#> Neighbor 3 belongs to another cluster.
#> Checking every single neighbor for the point
#> Neighbor 4 belongs to another cluster.
#> Checking every single neighbor for the point
#> Checking every single neighbor for the point
#> Process finished for this point, skipping to next point
#> ------------------------------------------------------
#> Checking if the point 5 has already been visited
#> It has not been visited
#> Calculate the distance between this point and the rest of the points. This is the equivalent to the RangeQuery() functionality
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Point 5 neighbors:
#> 3 4 5 6
#> Is length of neighbors smaller than min_pts?
#> It's bigger, adding the point 5 to a cluster
#> Executing the expandCluster() functionality
#> Adding point 5 to cluster 4
#> Checking every single neighbor for the point
#> Neighbor 3 belongs to another cluster.
#> Checking every single neighbor for the point
#> Neighbor 4 belongs to another cluster.
#> Checking every single neighbor for the point
#> Neighbor 5 belongs to another cluster.
#> Checking every single neighbor for the point
#> Process finished for this point, skipping to next point
#> ------------------------------------------------------
#> Checking if the point 6 has already been visited
#> It has not been visited
#> Calculate the distance between this point and the rest of the points. This is the equivalent to the RangeQuery() functionality
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Point 6 neighbors:
#> 3 4 5 6
#> Is length of neighbors smaller than min_pts?
#> It's bigger, adding the point 6 to a cluster
#> Executing the expandCluster() functionality
#> Adding point 6 to cluster 5
#> Checking every single neighbor for the point
#> Neighbor 3 belongs to another cluster.
#> Checking every single neighbor for the point
#> Neighbor 4 belongs to another cluster.
#> Checking every single neighbor for the point
#> Neighbor 5 belongs to another cluster.
#> Checking every single neighbor for the point
#> Neighbor 6 belongs to another cluster.
#> Process finished for this point, skipping to next point
#> ------------------------------------------------------
#> Checking if the point 7 has already been visited
#> It has not been visited
#> Calculate the distance between this point and the rest of the points. This is the equivalent to the RangeQuery() functionality
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Bigger, not adding to neighbors
#> Checking if the euclidean distance is less than the max_distance_threshold
#> Smaller, adding to neighbors
#> Point 7 neighbors:
#> 7
#> Is length of neighbors smaller than min_pts?
#> It's smaller, classifying the point 7 as an outlier and skipping to next point
#> ------------------------------------------------------
#> Checking the visited array looking for the points classified as outliers
#> The point 2 is an outlier
#> The point 7 is an outlier
#> The algorithm has ended
KNN (knn())
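The decision rule of this method (a point is an outlier when its Kth nearest neighbour lies farther away than d) can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's knn(): the points are rebuilt from the coordinates printed in the tutorial output, and, as that output suggests, each point counts as its own nearest neighbour at distance 0, so K = 2 selects the closest other point.

```r
# Seven points, taken from the coordinates shown in the tutorial output.
pts <- cbind(
  x = c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14),
  y = c(2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)
)

d <- 3   # distance beyond which the Kth neighbour marks an outlier
K <- 2   # order of the neighbour to inspect (self is the first, at 0)

# Pairwise Euclidean distances.
dist_mat <- as.matrix(dist(pts))

# Sort each column and keep its Kth smallest entry.
kth_distance <- unname(apply(dist_mat, 2, function(column) sort(column)[K]))

# Points whose Kth neighbour is farther away than d.
outliers <- which(kth_distance > d)

print(round(kth_distance, 3))
print(outliers)
```

The Kth distances round to 2.702, 6.912, 0.943, 0.943, 1.273, 1.044 and 6.946, so with d = 3 only points 2 and 7 exceed the threshold.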
With the learn mode deactivated, K=2 and d=3:
With the learn mode activated, K=2 and d=3:
knn(inputData,3,2,TRUE)
#> The tutorial mode has been activated for the KNN algorithm (outlier detection)
#> Before processing the data, we must understand the algorithm and the 'theory' behind it.
#> The knn algorithm to detect outliers is a method based on proximity. This algorithm has 2 main steps:
#>  Step A: Determine the degree of outlier or distance at which an event is considered an outlier (arbitrary)
#>      Substep a: Arbitrarily determine the degree of outlier or distance at which an event is considered an outlier (we will name it 'd')
#>      Substep b: Arbitrarily determine the order number, or K, of the nearest neighbor for which an event must have a degree of outlier to be considered an outlier
#>  Step B: Identify outliers using the k-Nearest Neighbors (k-NN) algorithm
#>      Substep a: Calculate Euclidean distances between all data points
#>      Substep b: Sort the neighbors of each point until reaching K
#>      Substep c: Identify outliers as events whose Kth neighbor is at a distance greater than the defined degree of outlier
#> We must define euclidean distance between 2 points (point A & point B for example). The formula is:
#>  sqrt((B_x - A_x)^2 + (B_y-A_y)^2)
#> Being A_x and B_x the x components of the A and B points. A_y and B_y are the y components of the A and B points
#> Now that we know how the algorithm works, let's apply it to our data.
#> 
#> First we must calculate the euclidean distance between every single point in the data
#> Euclidean distance between point 1 (3,2) & point 1 (3,2): 0
#> Euclidean distance between point 1 (3,2) & point 2 (3.5,12): 10.012
#> Euclidean distance between point 1 (3,2) & point 3 (4.7,4.1): 2.702
#> Euclidean distance between point 1 (3,2) & point 4 (5.2,4.9): 3.64
#> Euclidean distance between point 1 (3,2) & point 5 (7.1,6.1): 5.798
#> Euclidean distance between point 1 (3,2) & point 6 (6.2,5.2): 4.525
#> Euclidean distance between point 1 (3,2) & point 7 (14,5.3): 11.484
#> Euclidean distance between point 2 (3.5,12) & point 1 (3,2): 10.012
#> Euclidean distance between point 2 (3.5,12) & point 2 (3.5,12): 0
#> Euclidean distance between point 2 (3.5,12) & point 3 (4.7,4.1): 7.991
#> Euclidean distance between point 2 (3.5,12) & point 4 (5.2,4.9): 7.301
#> Euclidean distance between point 2 (3.5,12) & point 5 (7.1,6.1): 6.912
#> Euclidean distance between point 2 (3.5,12) & point 6 (6.2,5.2): 7.316
#> Euclidean distance between point 2 (3.5,12) & point 7 (14,5.3): 12.456
#> Euclidean distance between point 3 (4.7,4.1) & point 1 (3,2): 2.702
#> Euclidean distance between point 3 (4.7,4.1) & point 2 (3.5,12): 7.991
#> Euclidean distance between point 3 (4.7,4.1) & point 3 (4.7,4.1): 0
#> Euclidean distance between point 3 (4.7,4.1) & point 4 (5.2,4.9): 0.943
#> Euclidean distance between point 3 (4.7,4.1) & point 5 (7.1,6.1): 3.124
#> Euclidean distance between point 3 (4.7,4.1) & point 6 (6.2,5.2): 1.86
#> Euclidean distance between point 3 (4.7,4.1) & point 7 (14,5.3): 9.377
#> Euclidean distance between point 4 (5.2,4.9) & point 1 (3,2): 3.64
#> Euclidean distance between point 4 (5.2,4.9) & point 2 (3.5,12): 7.301
#> Euclidean distance between point 4 (5.2,4.9) & point 3 (4.7,4.1): 0.943
#> Euclidean distance between point 4 (5.2,4.9) & point 4 (5.2,4.9): 0
#> Euclidean distance between point 4 (5.2,4.9) & point 5 (7.1,6.1): 2.247
#> Euclidean distance between point 4 (5.2,4.9) & point 6 (6.2,5.2): 1.044
#> Euclidean distance between point 4 (5.2,4.9) & point 7 (14,5.3): 8.809
#> Euclidean distance between point 5 (7.1,6.1) & point 1 (3,2): 5.798
#> Euclidean distance between point 5 (7.1,6.1) & point 2 (3.5,12): 6.912
#> Euclidean distance between point 5 (7.1,6.1) & point 3 (4.7,4.1): 3.124
#> Euclidean distance between point 5 (7.1,6.1) & point 4 (5.2,4.9): 2.247
#> Euclidean distance between point 5 (7.1,6.1) & point 5 (7.1,6.1): 0
#> Euclidean distance between point 5 (7.1,6.1) & point 6 (6.2,5.2): 1.273
#> Euclidean distance between point 5 (7.1,6.1) & point 7 (14,5.3): 6.946
#> Euclidean distance between point 6 (6.2,5.2) & point 1 (3,2): 4.525
#> Euclidean distance between point 6 (6.2,5.2) & point 2 (3.5,12): 7.316
#> Euclidean distance between point 6 (6.2,5.2) & point 3 (4.7,4.1): 1.86
#> Euclidean distance between point 6 (6.2,5.2) & point 4 (5.2,4.9): 1.044
#> Euclidean distance between point 6 (6.2,5.2) & point 5 (7.1,6.1): 1.273
#> Euclidean distance between point 6 (6.2,5.2) & point 6 (6.2,5.2): 0
#> Euclidean distance between point 6 (6.2,5.2) & point 7 (14,5.3): 7.801
#> Euclidean distance between point 7 (14,5.3) & point 1 (3,2): 11.484
#> Euclidean distance between point 7 (14,5.3) & point 2 (3.5,12): 12.456
#> Euclidean distance between point 7 (14,5.3) & point 3 (4.7,4.1): 9.377
#> Euclidean distance between point 7 (14,5.3) & point 4 (5.2,4.9): 8.809
#> Euclidean distance between point 7 (14,5.3) & point 5 (7.1,6.1): 6.946
#> Euclidean distance between point 7 (14,5.3) & point 6 (6.2,5.2): 7.801
#> Euclidean distance between point 7 (14,5.3) & point 7 (14,5.3): 0
#> The distances matrix obtained is:
#> 0 10.0124921972504 2.70185121722126 3.64005494464026 5.79827560572969 4.5254833995939 11.4843371598016
#> 10.0124921972504 0 7.99061950038919 7.30068489937759 6.91158447825099 7.31641988953614 12.4555208642594
#> 2.70185121722126 7.99061950038919 0 0.943398113205661 3.12409987036266 1.86010752377383 9.37709976485267
#> 3.64005494464026 7.30068489937759 0.943398113205661 0 2.24722050542442 1.04403065089105 8.8090862182181
#> 5.79827560572969 6.91158447825099 3.12409987036266 2.24722050542442 0 1.27279220613578 6.9462219947249
#> 4.5254833995939 7.31641988953614 1.86010752377383 1.04403065089105 1.27279220613578 0 7.80064099930256
#> 11.4843371598016 12.4555208642594 9.37709976485267 8.8090862182181 6.9462219947249 7.80064099930256 0
#> We order the distances by columns and show the outliers
#> The distances matrix sorted in step 1 is:
#> 0 2.70185121722126 3.64005494464026 4.5254833995939 5.79827560572969 10.0124921972504 11.4843371598016
#> 10.0124921972504 0 7.99061950038919 7.30068489937759 6.91158447825099 7.31641988953614 12.4555208642594
#> 2.70185121722126 7.99061950038919 0 0.943398113205661 3.12409987036266 1.86010752377383 9.37709976485267
#> 3.64005494464026 7.30068489937759 0.943398113205661 0 2.24722050542442 1.04403065089105 8.8090862182181
#> 5.79827560572969 6.91158447825099 3.12409987036266 2.24722050542442 0 1.27279220613578 6.9462219947249
#> 4.5254833995939 7.31641988953614 1.86010752377383 1.04403065089105 1.27279220613578 0 7.80064099930256
#> 11.4843371598016 12.4555208642594 9.37709976485267 8.8090862182181 6.9462219947249 7.80064099930256 0
#> The Kth neighbor for the point 1 has a value of 2.702
#> The distance is smaller than the value stablished in 'd' so it's not an outlier.
#> The point 1 is not an outlier
#> The distances matrix sorted in step 2 is:
#> 0 2.70185121722126 3.64005494464026 4.5254833995939 5.79827560572969 10.0124921972504 11.4843371598016
#> 0 6.91158447825099 7.30068489937759 7.31641988953614 7.99061950038919 10.0124921972504 12.4555208642594
#> 2.70185121722126 7.99061950038919 0 0.943398113205661 3.12409987036266 1.86010752377383 9.37709976485267
#> 3.64005494464026 7.30068489937759 0.943398113205661 0 2.24722050542442 1.04403065089105 8.8090862182181
#> 5.79827560572969 6.91158447825099 3.12409987036266 2.24722050542442 0 1.27279220613578 6.9462219947249
#> 4.5254833995939 7.31641988953614 1.86010752377383 1.04403065089105 1.27279220613578 0 7.80064099930256
#> 11.4843371598016 12.4555208642594 9.37709976485267 8.8090862182181 6.9462219947249 7.80064099930256 0
#> The Kth neighbor for the point 2 has a value of 6.912
#> The distance is greater than the value stablished in 'd' so it's an outlier.
#> The point 2 is an outlier
#> The distances matrix sorted in step 3 is:
#> 0 2.70185121722126 3.64005494464026 4.5254833995939 5.79827560572969 10.0124921972504 11.4843371598016
#> 0 6.91158447825099 7.30068489937759 7.31641988953614 7.99061950038919 10.0124921972504 12.4555208642594
#> 0 0.943398113205661 1.86010752377383 2.70185121722126 3.12409987036266 7.99061950038919 9.37709976485267
#> 3.64005494464026 7.30068489937759 0.943398113205661 0 2.24722050542442 1.04403065089105 8.8090862182181
#> 5.79827560572969 6.91158447825099 3.12409987036266 2.24722050542442 0 1.27279220613578 6.9462219947249
#> 4.5254833995939 7.31641988953614 1.86010752377383 1.04403065089105 1.27279220613578 0 7.80064099930256
#> 11.4843371598016 12.4555208642594 9.37709976485267 8.8090862182181 6.9462219947249 7.80064099930256 0
#> The Kth neighbor for the point 3 has a value of 0.943
#> The distance is smaller than the value stablished in 'd' so it's not an outlier.
#> The point 3 is not an outlier
#> The distances matrix sorted in step 4 is:
#> 0 2.70185121722126 3.64005494464026 4.5254833995939 5.79827560572969 10.0124921972504 11.4843371598016
#> 0 6.91158447825099 7.30068489937759 7.31641988953614 7.99061950038919 10.0124921972504 12.4555208642594
#> 0 0.943398113205661 1.86010752377383 2.70185121722126 3.12409987036266 7.99061950038919 9.37709976485267
#> 0 0.943398113205661 1.04403065089105 2.24722050542442 3.64005494464026 7.30068489937759 8.8090862182181
#> 5.79827560572969 6.91158447825099 3.12409987036266 2.24722050542442 0 1.27279220613578 6.9462219947249
#> 4.5254833995939 7.31641988953614 1.86010752377383 1.04403065089105 1.27279220613578 0 7.80064099930256
#> 11.4843371598016 12.4555208642594 9.37709976485267 8.8090862182181 6.9462219947249 7.80064099930256 0
#> The Kth neighbor for the point 4 has a value of 0.943
#> The distance is smaller than the value stablished in 'd' so it's not an outlier.
#> The point 4 is not an outlier
#> The distances matrix sorted in step 5 is:
#> 0 2.70185121722126 3.64005494464026 4.5254833995939 5.79827560572969 10.0124921972504 11.4843371598016
#> 0 6.91158447825099 7.30068489937759 7.31641988953614 7.99061950038919 10.0124921972504 12.4555208642594
#> 0 0.943398113205661 1.86010752377383 2.70185121722126 3.12409987036266 7.99061950038919 9.37709976485267
#> 0 0.943398113205661 1.04403065089105 2.24722050542442 3.64005494464026 7.30068489937759 8.8090862182181
#> 0 1.27279220613578 2.24722050542442 3.12409987036266 5.79827560572969 6.91158447825099 6.9462219947249
#> 4.5254833995939 7.31641988953614 1.86010752377383 1.04403065089105 1.27279220613578 0 7.80064099930256
#> 11.4843371598016 12.4555208642594 9.37709976485267 8.8090862182181 6.9462219947249 7.80064099930256 0
#> The Kth neighbor for the point 5 has a value of 1.273
#> The distance is smaller than the value stablished in 'd' so it's not an outlier.
#> The point 5 is not an outlier
#> The distances matrix sorted in step 6 is:
#> 0 2.70185121722126 3.64005494464026 4.5254833995939 5.79827560572969 10.0124921972504 11.4843371598016
#> 0 6.91158447825099 7.30068489937759 7.31641988953614 7.99061950038919 10.0124921972504 12.4555208642594
#> 0 0.943398113205661 1.86010752377383 2.70185121722126 3.12409987036266 7.99061950038919 9.37709976485267
#> 0 0.943398113205661 1.04403065089105 2.24722050542442 3.64005494464026 7.30068489937759 8.8090862182181
#> 0 1.27279220613578 2.24722050542442 3.12409987036266 5.79827560572969 6.91158447825099 6.9462219947249
#> 0 1.04403065089105 1.27279220613578 1.86010752377383 4.5254833995939 7.31641988953614 7.80064099930256
#> 11.4843371598016 12.4555208642594 9.37709976485267 8.8090862182181 6.9462219947249 7.80064099930256 0
#> The Kth neighbor for the point 6 has a value of 1.044
#> The distance is smaller than the value stablished in 'd' so it's not an outlier.
#> The point 6 is not an outlier
#> The distances matrix sorted in step 7 is:
#> 0 2.70185121722126 3.64005494464026 4.5254833995939 5.79827560572969 10.0124921972504 11.4843371598016
#> 0 6.91158447825099 7.30068489937759 7.31641988953614 7.99061950038919 10.0124921972504 12.4555208642594
#> 0 0.943398113205661 1.86010752377383 2.70185121722126 3.12409987036266 7.99061950038919 9.37709976485267
#> 0 0.943398113205661 1.04403065089105 2.24722050542442 3.64005494464026 7.30068489937759 8.8090862182181
#> 0 1.27279220613578 2.24722050542442 3.12409987036266 5.79827560572969 6.91158447825099 6.9462219947249
#> 0 1.04403065089105 1.27279220613578 1.86010752377383 4.5254833995939 7.31641988953614 7.80064099930256
#> 0 6.9462219947249 7.80064099930256 8.8090862182181 9.37709976485267 11.4843371598016 12.4555208642594
#> The Kth neighbor for the point 7 has a value of 6.946
#> The distance is greater than the value stablished in 'd' so it's an outlier.
#> The point 7 is an outlier
LOF simplified (lof())
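The average relative density (ARD) values reported in this part of the document can be reproduced with a short base R sketch. It is not the package's lof() implementation. It assumes Manhattan distances, and it assumes that the neighbourhood of a point is every other point no farther away than the Kth smallest entry of its distance row, with the point itself occupying the first position at distance 0. That neighbourhood convention is an inference: it was chosen because it reproduces the printed ARD values, and the variable names are hypothetical.

```r
# Seven points, taken from the coordinates shown in the tutorial output.
pts <- cbind(
  x = c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14),
  y = c(2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)
)

K <- 3
threshold <- 0.5
n <- nrow(pts)

# Pairwise Manhattan distances.
dist_mat <- as.matrix(dist(pts, method = "manhattan"))

# Neighbourhood N(x_i, K): other points within the Kth smallest distance
# of row i (assumed convention, self ranked first at distance 0).
neighbours <- lapply(seq_len(n), function(i) {
  kth <- sort(dist_mat[i, ])[K]
  setdiff(which(dist_mat[i, ] <= kth), i)
})

# Density: inverse of the mean distance to the neighbourhood.
density <- sapply(seq_len(n), function(i) {
  length(neighbours[[i]]) / sum(dist_mat[i, neighbours[[i]]])
})

# ARD: own density divided by the mean density of the neighbourhood.
ard <- sapply(seq_len(n), function(i) {
  density[i] / mean(density[neighbours[[i]]])
})

outliers <- which(ard < threshold)
print(round(ard, 4))
print(outliers)
```

Under these assumptions points 1, 2 and 7 fall below the 0.5 threshold, with ARD values of about 0.3506, 0.1743 and 0.2434.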
With the learn mode deactivated, K=3 and the threshold set to 0.5:
lof(inputData, 3, 0.5, FALSE)
#> Threshold selected: 0.5
#> The point 1 is an outlier because its ard is lower than 0.5
#> The point 1 has an average relative density of 0.350561797752809
#> The point 2 is an outlier because its ard is lower than 0.5
#> The point 2 has an average relative density of 0.174301675977654
#> The point 7 is an outlier because its ard is lower than 0.5
#> The point 7 has an average relative density of 0.243429487179487
With the learn mode activated and the same input parameters:
lof(inputData, 3, 0.5, TRUE)
#> The tutorial mode has been activated for the simplified LOF algorithm (outlier detection)
#> Before processing the data, we must understand the algorithm and the 'theory' behind it.
#> This is a simplified version of the LOF algorithm. This version detects outliers going though this steps:
#>  1) Calculate the degree of outlier of each point by obtaining the density of each point. This has 4 substeps:
#>      a. Determine the 'order number' (K) or closest neighbor that will be used to calculate the density of each number (arbitrary)
#>      b. Calculate the distance between each point and the resto of the points, this distance is calculated with the Manhattan distance equation/function:
#>          The equation is this: manhattanDistance(A,B) = |A_x - B_x| + |A_y - B_y|
#>      c.  Calculate the cardinal for each point: N is the set that contains the neighbors which distance xi is the same or less than the K nearest neighbor.
#>      d.  Calculate the density for each point. This is a technique very close to the proximity.
#>          The function to calculate the density is this: density*italic(x[i], K) == (frac(sum(italic(x[j]) %in% N(italic(x[i], K)), distance(italic(x[i]), italic(x[j]))), cardinalN(italic(x[i], K)))^-1
#>  2) Calculate the average relative density for each point using the next equation:
#>      ard*italic(x[i], K) == frac(density*italic(x[i], K), frac(sum(italic(x[j]) %in% N(italic(x[i], K)), density*italic(x[j], K)), cardinalN(italic(x[i], K))))
#>      This calculates the proportion between a point and the average mean of the densities of the set N that defines that point using the order number K. The average distance will tend to 0 on the outliers.
#>  3)  Obtain the outliers: will classify a point as an outlier when the average relative density is significantly smaller than the rest of the elements in the sample
#>       In the current LOF simplified implemented algorithm, it has been chosen to implement this last step with a threshold specified by the user
#>       This threshold value is compared to each ARD calculated for each point. If the value is smaller than the threshold, then the point is classified as an outlier
#>       On the other hand, if the value is greater or equal to the threshold, the point is classified as an  inlier (a normal point)
#> Now that we understand how the algorithm works, it will be executed to the input data with the parameters that have been set
#> Calculate Euclidean distances between all points:
#> Calculating distance between points (manhattan distance):
#> 1
#> 1
#> Calculated distance: 0
#> Calculating distance between points (manhattan distance):
#> 1
#> 2
#> Calculated distance: 10.5
#> Calculating distance between points (manhattan distance):
#> 1
#> 3
#> Calculated distance: 3.8
#> Calculating distance between points (manhattan distance):
#> 1
#> 4
#> Calculated distance: 5.1
#> Calculating distance between points (manhattan distance):
#> 1
#> 5
#> Calculated distance: 8.2
#> Calculating distance between points (manhattan distance):
#> 1
#> 6
#> Calculated distance: 6.4
#> Calculating distance between points (manhattan distance):
#> 1
#> 7
#> Calculated distance: 14.3
#> Calculating distance between points (manhattan distance):
#> 2
#> 1
#> Calculated distance: 10.5
#> Calculating distance between points (manhattan distance):
#> 2
#> 2
#> Calculated distance: 0
#> Calculating distance between points (manhattan distance):
#> 2
#> 3
#> Calculated distance: 9.1
#> Calculating distance between points (manhattan distance):
#> 2
#> 4
#> Calculated distance: 8.8
#> Calculating distance between points (manhattan distance):
#> 2
#> 5
#> Calculated distance: 9.5
#> Calculating distance between points (manhattan distance):
#> 2
#> 6
#> Calculated distance: 9.5
#> Calculating distance between points (manhattan distance):
#> 2
#> 7
#> Calculated distance: 17.2
#> Calculating distance between points (manhattan distance):
#> 3
#> 1
#> Calculated distance: 3.8
#> Calculating distance between points (manhattan distance):
#> 3
#> 2
#> Calculated distance: 9.1
#> Calculating distance between points (manhattan distance):
#> 3
#> 3
#> Calculated distance: 0
#> Calculating distance between points (manhattan distance):
#> 3
#> 4
#> Calculated distance: 1.3
#> Calculating distance between points (manhattan distance):
#> 3
#> 5
#> Calculated distance: 4.4
#> Calculating distance between points (manhattan distance):
#> 3
#> 6
#> Calculated distance: 2.6
#> Calculating distance between points (manhattan distance):
#> 3
#> 7
#> Calculated distance: 10.5
#> Calculating distance between points (manhattan distance):
#> 4
#> 1
#> Calculated distance: 5.1
#> Calculating distance between points (manhattan distance):
#> 4
#> 2
#> Calculated distance: 8.8
#> Calculating distance between points (manhattan distance):
#> 4
#> 3
#> Calculated distance: 1.3
#> Calculating distance between points (manhattan distance):
#> 4
#> 4
#> Calculated distance: 0
#> Calculating distance between points (manhattan distance):
#> 4
#> 5
#> Calculated distance: 3.1
#> Calculating distance between points (manhattan distance):
#> 4
#> 6
#> Calculated distance: 1.3
#> Calculating distance between points (manhattan distance):
#> 4
#> 7
#> Calculated distance: 9.2
#> Calculating distance between points (manhattan distance):
#> 5
#> 1
#> Calculated distance: 8.2
#> Calculating distance between points (manhattan distance):
#> 5
#> 2
#> Calculated distance: 9.5
#> Calculating distance between points (manhattan distance):
#> 5
#> 3
#> Calculated distance: 4.4
#> Calculating distance between points (manhattan distance):
#> 5
#> 4
#> Calculated distance: 3.1
#> Calculating distance between points (manhattan distance):
#> 5
#> 5
#> Calculated distance: 0
#> Calculating distance between points (manhattan distance):
#> 5
#> 6
#> Calculated distance: 1.8
#> Calculating distance between points (manhattan distance):
#> 5
#> 7
#> Calculated distance: 7.7
#> Calculating distance between points (manhattan distance):
#> 6
#> 1
#> Calculated distance: 6.4
#> Calculating distance between points (manhattan distance):
#> 6
#> 2
#> Calculated distance: 9.5
#> Calculating distance between points (manhattan distance):
#> 6
#> 3
#> Calculated distance: 2.6
#> Calculating distance between points (manhattan distance):
#> 6
#> 4
#> Calculated distance: 1.3
#> Calculating distance between points (manhattan distance):
#> 6
#> 5
#> Calculated distance: 1.8
#> Calculating distance between points (manhattan distance):
#> 6
#> 6
#> Calculated distance: 0
#> Calculating distance between points (manhattan distance):
#> 6
#> 7
#> Calculated distance: 7.9
#> Calculating distance between points (manhattan distance):
#> 7
#> 1
#> Calculated distance: 14.3
#> Calculating distance between points (manhattan distance):
#> 7
#> 2
#> Calculated distance: 17.2
#> Calculating distance between points (manhattan distance):
#> 7
#> 3
#> Calculated distance: 10.5
#> Calculating distance between points (manhattan distance):
#> 7
#> 4
#> Calculated distance: 9.2
#> Calculating distance between points (manhattan distance):
#> 7
#> 5
#> Calculated distance: 7.7
#> Calculating distance between points (manhattan distance):
#> 7
#> 6
#> Calculated distance: 7.9
#> Calculating distance between points (manhattan distance):
#> 7
#> 7
#> Calculated distance: 0
#> The calculated matrix of distances is:
#> 0 10.5 3.8 5.1 8.2 6.4 14.3
#> 10.5 0 9.1 8.8 9.5 9.5 17.2
#> 3.8 9.1 0 1.3 4.4 2.6 10.5
#> 5.1 8.8 1.3 0 3.1 1.3 9.2
#> 8.2 9.5 4.4 3.1 0 1.8 7.7
#> 6.4 9.5 2.6 1.3 1.8 0 7.9
#> 14.3 17.2 10.5 9.2 7.7 7.9 0
#> After calculating the distances between points, we calculate the cardinal for each point
#> To do this, we need to sort the distance matrix (by columns)
#> The distance matrix sorted by columns is as follows:
#> 0 3.8 5.1 6.4 8.2 10.5 14.3
#> 0 8.8 9.1 9.5 9.5 10.5 17.2
#> 0 1.3 2.6 3.8 4.4 9.1 10.5
#> 0 1.3 1.3 3.1 5.1 8.8 9.2
#> 0 1.8 3.1 4.4 7.7 8.2 9.5
#> 0 1.3 1.8 2.6 6.4 7.9 9.5
#> 0 7.7 7.9 9.2 10.5 14.3 17.2
#> We obtain a vector of the cardinals
#> In column
#> 1
#> Cardinal calculated:
#> 2
#> In column
#> 2
#> Cardinal calculated:
#> 2
#> In column
#> 3
#> Cardinal calculated:
#> 2
#> In column
#> 4
#> Cardinal calculated:
#> 2
#> In column
#> 5
#> Cardinal calculated:
#> 2
#> In column
#> 6
#> Cardinal calculated:
#> 2
#> In column
#> 7
#> Cardinal calculated:
#> 2
#> The cardinals vector resulting is:
#> 2 2 2 2 2 2 2
#> With the obtained cardinals, we get the densities of each point:
#> For point
#> 1
#> Value of density:
#> 0.225
#> For point
#> 2
#> Value of density:
#> 0.112
#> For point
#> 3
#> Value of density:
#> 0.513
#> For point
#> 4
#> Value of density:
#> 0.769
#> For point
#> 5
#> Value of density:
#> 0.408
#> For point
#> 6
#> Value of density:
#> 0.645
#> For point
#> 7
#> Value of density:
#> 0.128
#> All densities calculated:
#> 0.225 0.112 0.513 0.769 0.408 0.645 0.128
#> With the calculated densities, we are going to calculate the average relative density (ard) for each point:
#> For point:
#> 1
#> Average Relative Density calculated:
#> 0.351
#> For point:
#> 2
#> Average Relative Density calculated:
#> 0.175
#> For point:
#> 3
#> Average Relative Density calculated:
#> 0.726
#> For point:
#> 4
#> Average Relative Density calculated:
#> 1.328
#> For point:
#> 5
#> Average Relative Density calculated:
#> 0.577
#> For point:
#> 6
#> Average Relative Density calculated:
#> 1.096
#> For point:
#> 7
#> Average Relative Density calculated:
#> 0.243
#> All the ards calculated:
#> 0.351 0.175 0.726 1.328 0.577 1.096 0.243
#> The last step is to classify the outliers comparing the ards calculated with the threshold
#> Threshold selected: 0.5
#> The point 1 is an outlier because its ard is lower than 0.5
#> The point 1 has an average relative density of 0.351
#> The point 2 is an outlier because its ard is lower than 0.5
#> The point 2 has an average relative density of 0.175
#> The point 7 is an outlier because its ard is lower than 0.5
#> The point 7 has an average relative density of 0.243
Mahalanobis Method (mahalanobis_method())
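The six steps that the tutorial output in this section walks through (range check on alpha, column means, covariance matrix, squared Mahalanobis distances, chi-squared critical value, classification) can be condensed into a short base R sketch. It is an illustration under stated assumptions rather than the package's from-scratch implementation: `mahalanobis_sketch` is a hypothetical name, `inputData` is reconstructed from the values printed in this document, and passing `1 - alpha` to `qchisq()` is inferred from the printed critical value (0.7133 for alpha = 0.7 with 2 degrees of freedom).

```r
# Assumption: inputData reconstructed from the values printed in this document.
inputData <- matrix(c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14,
                      2, 12, 4.1, 4.9, 6.1, 5.2, 5.3), ncol = 2)

# Hypothetical helper, not part of UAHDataScienceO.
mahalanobis_sketch <- function(data, alpha) {
  # Step 1: alpha must lie in [0, 1]
  stopifnot(alpha >= 0, alpha <= 1)

  # Step 2: mean of each column (dimension)
  center <- colMeans(data)

  # Step 3: sample covariance matrix and its inverse
  cov_inv <- solve(cov(data))

  # Step 4: squared distances D^2 = (x - mean)' S^-1 (x - mean), one per row
  centered <- sweep(data, 2, center)
  squared_distance <- rowSums((centered %*% cov_inv) * centered)

  # Step 5: critical value with df equal to the number of columns.
  # Assumption: the quantile is taken at 1 - alpha.
  critical <- qchisq(1 - alpha, df = ncol(data))

  # Step 6: rows whose squared distance exceeds the critical value
  list(squared_distance = squared_distance,
       critical = critical,
       outliers = which(squared_distance > critical))
}

res <- mahalanobis_sketch(inputData, 0.7)
print(res$critical)
print(res$outliers)  # 1 2 7
```

With alpha = 0.7 the sketch flags observations 1, 2 and 7 on the reconstructed data, in line with the outputs of this section.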
With the learn mode deactivated and alpha set to 0.7:
mahalanobis_method(inputData, 0.7, FALSE)
#> Critical Value:
#> 0.713349887877465
#> The observation 1 is an outlier
#> The values of the observation are:
#> 3 2
#> The observation 2 is an outlier
#> The values of the observation are:
#> 3.5 12
#> The observation 7 is an outlier
#> The values of the observation are:
#> 14 5.3
With the learn mode activated and the same value of alpha:
mahalanobis_method(inputData, 0.7, TRUE)
#> The tutorial mode has been activated for the Mahalanobis Distance Outlier Detection Method
#> Before processing the data, we must understand the algorithm and the 'theory' behind it.
#> The algorithm is made up with 6 steps:
#>  1)Check if the input value 'alpha' is in the desired range
#>      If this is true (between 0 and 1), then continue to the next step. If the value is greater than 1 or smaller than 0, end the algorithm.
#>      The concept of the input parameter alpha is the proportion of observations used for the estimation of the critical value (distance value calculated with a chi-squared distribution using alpha)
#>  2)Calculate the mean for each column of the dataset.
#>      In other words, calculate the mean value for each 'dimension' of the dataset.
#>      This is done by adding all the values in every single column and then dividing by the number of elements that have been added.
#>      With this step, the algorithm now has available a vector of means (each position is the mean of the column of the vector/array position).
#>  3)Calculate the covariance matrix.
#>      The covariance matrix is a square matrix with diagonal elements that represent the variance and the non-diagonal components that express covariance.
#>      The covariance of a variable can take any real value (a positive covariance suggests that the two variables have a positive relationship. On the other hand, a negative value indicates that they don't have a positive relationship. If they don't vary together, they have a zero value).
#>      The implementation chosen for this algorithm due to the fact that it's not relevant the implementation of this function is with a R native function.
#>      It's important to know what is the covariance matrix but, because of the nature of the Outliers Learn R package, it's not crucial to implement this function from scratch (it's one of the only 2 functions that have not been implemented from scratch in the R package).
#>  4)Obtain the Mahalanobis squared distances vector.
#>      This is one of the most 'crucial' steps of the Mahalanobis distance method for outlier detection.
#>      It's important to highlight that the Mahalanobis distance function has been implemented from scratch due to the importance of it for the algorithm.
#>      Even though there is an implementation to obtain the Mahalanobis squared distances from a dataset in R, this function has been implemented because it's a really important key concept the reader has to be able to 'see' implemented and be able to use it.
#>      The implementation calculates the Mahalanobis distance from a point to the mean using the covariance matrix using this formula:
#>          D = sqrt((X-means)'*inverted_cov_matrix*(X-means))
#>      Going back to what to do in this step: calculate the Mahalanobis distance between each point and the 'center' using the mean vector and the covariance matrix calculated in steps 2) and 3) with the previous formula.
#>      With the distances calculated, elevate them to square so that the distances vector is D^2.
#>  5)Calculate the critical value
#>      With the Mahalanobis squared distances calculated, the next step is to calculate the critical value.
#>      This is done with a chi-squared distribution.
#>      The function used in the implementation is an R native function due to the complexity of it.
#>      The corresponding function returns the critical value such that the probability of a chi-squared random variable with degrees of freedom equal to the dimensions of the input dataset exceeding this value is alpha (explained briefly in the first step).
#>  6)Classify the points using the critical value
#>      With the critical value calculated, the last step is to check every single distance calculated and if the value is greater than the critical value, the point associated with the distance is classified as an outlier.
#>      If not, the point associated with the distance is classified as an inlier (not an outlier).
#> With the theory understood, we will apply this knowledge to the data given to obtain the outliers
#> ----------------------------------------------------------
#> Check if the input value alpha is smaller or equal to 1.
#> If this is true, then continue to the next step. If the value is greater than 1, end the algorithm.
#> Calculate the mean for each column of the dataset.
#> Calculated mean for column 1: 6.243
#> Calculated mean for column 2: 5.657
#> Mean vector calculated:
#> 6.24285714285714 5.65714285714286
#> Calculate the covariance matrix.
#> Covariance Matrix calculated:
#> 13.7361904761905 -0.786190476190477
#> -0.786190476190477 9.52285714285714
#> Obtain the Mahalanobis squared distances vector.
#> Mahalanobis distance for point 1: 1.524
#> Mahalanobis distance for point 2: 2.141
#> Mahalanobis distance for point 3: 0.677
#> Mahalanobis distance for point 4: 0.387
#> Mahalanobis distance for point 5: 0.281
#> Mahalanobis distance for point 6: 0.15
#> Mahalanobis distance for point 7: 2.093
#> The distances vector (D) is:
#> 1.52433572121499 2.14126108416345 0.677466157779252 0.386744224191126 0.281099966099815 0.149733802612816 2.0931872075874
#> Square the Mahalanobis distances.
#> The squared_distance vector (D^2) is:
#> 2.32359939097202 4.58499903055285 0.458960394936182 0.149571094945196 0.0790171909413172 0.0224202116448937 4.38143268600755
#> Calculate the critical value.
#> Degrees of freedom: 2
#> Alpha value: 0.7
#> 1-alpha = 0.3
#> Critical Value:
#> 0.713349887877465
#> Classify points based on the critical value
#> The observation 1 is an outlier (squared distance 2.324 is greater than the critical value 0.713
#> The values of the observation are:
#> 3 2
#> The observation 2 is an outlier (squared distance 4.585 is greater than the critical value 0.713
#> The values of the observation are:
#> 3.5 12
#> The observation 7 is an outlier (squared distance 4.381 is greater than the critical value 0.713
#> The values of the observation are:
#> 14 5.3
#> The algorithm has ended
Z-score method (z_score_method())
With the learn mode deactivated and d set to 2:
z_score_method(inputData,2,FALSE)
#> Limits:
#> -0.391586101734667 12.2915861017347
#> The value in position 7 with value 14 has been detected as an outlier
#> It was detected as an outlier because it's value is higher than the top limit 12.292
#> --------------------------------------------------------------------------------------------
With the learn mode activated and the same value of d:
z_score_method(inputData,2,TRUE)
#> The tutorial mode has been activated for the standard deviation method algorithm (outlier detection)
#> Before processing the data, we must understand the algorithm and the 'theory' behind it.
#> Identification of outliers using Statistics and Standard Deviation involves the following steps:
#>  1. Determination of the degree of outlier (We will call it 'd')
#>  2. Obtain the arithmetic mean with the following formula:
#>      mean = sum(x) / N
#>   We calculate the mean adding all the values from the data and dividing for the length of the data
#>  3. Obtain the standard deviation with the following formula:
#>      sd = sqrt(sum((x - mean)^2) / N)
#>  We calculate the sum of every single element of the data minus the mean elevated to 2. Then we divide it for the data length
#>  4. Calculate the interval limits using the following equation:
#>      (mean - d * sd, mean + d * sd)
#>  5. Identification of outliers as values that fall outside the interval calculated in step 4.
#> Now that we know how to apply this algorithm, we are going to see how it works with the given data:
#> 3 3.5 4.7 5.2 7.1 6.2 14 2 12 4.1 4.9 6.1 5.2 5.3
#> The degree of outlier selected ('d') selected is:
#> 2
#> First we calculate the mean using the formula described before:
#> 5.95
#> Now we calculate the standard deviation using the formula described before:
#> 3.17079305086733
#> With those values calculated, we obtain the limits:
#> First we calculate the lower limit
#>  mean-stddev * d
#> -0.391586101734667
#> Now we calculate the top limit
#>  mean+stddev*d
#> 12.2915861017347
#> This are the obtained limits
#> -0.391586101734667 12.2915861017347
#> Now that we have calculated the limits, we will check if every single value is 'inside' those boundaries obtained.
#> If the value is not included inside the limits, it will be detected as an outlier
#> Checking value in the position 1. It's value is 3
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 2. It's value is 3.5
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 3. It's value is 4.7
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 4. It's value is 5.2
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 5. It's value is 7.1
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 6. It's value is 6.2
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 7. It's value is 14
#> The value in position 7 with value 14 has been detected as an outlier
#> It was detected as an outlier because it's value is higher than the top limit 12.292
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 8. It's value is 2
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 9. It's value is 12
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 10. It's value is 4.1
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 11. It's value is 4.9
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 12. It's value is 6.1
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 13. It's value is 5.2
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> Checking value in the position 14. It's value is 5.3
#> Not an outlier, it's inside the limits
#> --------------------------------------------------------------------------------------------
#> The algorithm has ended
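The computation traced in the output above can be reproduced with a few lines of base R. The following is a minimal sketch under stated assumptions, not the package's code: `z_score_sketch` is a hypothetical name, `inputData` is reconstructed from the values printed in this document, and the standard deviation divides by N (as in the formula printed by the tutorial) rather than by N - 1 as R's `sd()` does.

```r
# Assumption: inputData reconstructed from the values printed in this document.
inputData <- matrix(c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14,
                      2, 12, 4.1, 4.9, 6.1, 5.2, 5.3), ncol = 2)

# Hypothetical helper, not part of UAHDataScienceO.
z_score_sketch <- function(data, d) {
  # Flatten the input column by column, matching the order of the printed values
  x <- as.vector(data)

  # Arithmetic mean: sum(x) / N
  m <- sum(x) / length(x)

  # Population standard deviation: sqrt(sum((x - mean)^2) / N)
  s <- sqrt(sum((x - m)^2) / length(x))

  # Interval limits: (mean - d * sd, mean + d * sd)
  limits <- c(m - d * s, m + d * s)

  # Positions of the values that fall outside the interval
  list(mean = m, sd = s, limits = limits,
       outliers = which(x < limits[1] | x > limits[2]))
}

z <- z_score_sketch(inputData, 2)
print(z$limits)
print(z$outliers)  # 7
```

With d = 2 only position 7 (value 14) falls outside the limits. The value 12 in position 9 stays inside them, whereas the box and whiskers run shown earlier flags it with the same d.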