COF calculates full distance matrix for all sample sizes
In the current implementation COF calculates the full X-X distance matrix, leading to O(n^2) memory complexity. When analyzing data with a large number of samples this will lead to a prohibitively large distance matrix, often causing memory errors.
At the cost of linear overhead, this can be easily circumvented by recalculating the distances where needed. Effectively leading to twice as many distance calculations.
I have tested a simple implementation of this memory optimization, leading to the following code:
def _cof(self, X):
"""
Connectivity-Based Outlier Factor (COF) Algorithm
This function is called internally to calculate the
Connectivity-Based Outlier Factor (COF) as an outlier
score for observations.
:return: numpy array containing COF scores for observations.
The greater the COF, the greater the outlierness.
"""
#dist_matrix = np.array(distance_matrix(X, X))
sbn_path_index, ac_dist, cof_ = [], [], []
for i in range(X.shape[0]):
#sbn_path = np.argsort(dist_matrix[i])
sbn_path = np.argsort(minkowski_distance(X[i,:],X,p=2))
sbn_path_index.append(sbn_path[1: self.n_neighbors_ + 1])
cost_desc = []
for j in range(self.n_neighbors_):
#cost_desc.append(
# np.min(dist_matrix[sbn_path[j + 1]][sbn_path][:j + 1]))
cost_desc.append(
np.min(minkowski_distance(X[sbn_path[j + 1]],X,p=2)[sbn_path][:j + 1]))
acd = []
for _h, cost_ in enumerate(cost_desc):
neighbor_add1 = self.n_neighbors_ + 1
acd.append(((2. * (neighbor_add1 - (_h + 1))) / (
neighbor_add1 * self.n_neighbors_)) * cost_)
ac_dist.append(np.sum(acd))
for _g in range(X.shape[0]):
cof_.append((ac_dist[_g] * self.n_neighbors_) /
np.sum(itemgetter(*sbn_path_index[_g])(ac_dist)))
return np.nan_to_num(cof_)
While I don't think this would be great to use in every case, due to the increased computational cost, it's seems to be that offering this option to the user would be worthwhile. I can write a pull request which implements a switch between these use-cases, but how this should be implemented I don't dare say. We can for example have the user set a parameter to use either one or the other method, or have them set a limit for the distance matrix size (in MB for example). I'd like to request some additional input on whether such a feature is desireable and how it should be implemented.