Clustering utility
The clustering utilities provide wrappers for sklearn clusterers. They make it possible to connect the cluster labels returned by a clusterer to the corresponding material ids.
- class madas.clustering.SimilarityMatrixClusterer(similarity_matrix: ~madas.similarity.SimilarityMatrix, clusterer: type = <class 'sklearn.cluster._dbscan.DBSCAN'>, clusterer_kwargs: dict = {'eps': 0.15, 'metric': 'precomputed'}, use_complement: bool = True)
A wrapper for clustering methods to directly apply them to SimilarityMatrix() objects.
Arguments:
- similarity_matrix: SimilarityMatrix()
Similarity matrix that should be clustered
Keyword arguments:
- clusterer: type
a class that implements a fit() method that is used to cluster a np.ndarray matrix
default: sklearn.cluster.DBSCAN
- clusterer_kwargs: dict
Keyword arguments to be passed to the clusterer upon initialization
default: {'metric': 'precomputed', 'eps': 0.15}
- use_complement: bool
Selects whether the similarity matrix (False) or the distance matrix (True) is used for clustering.
Some algorithms explicitly require similarity (sometimes called affinity) matrices, others require distances (also called dissimilarities or metrics). To treat them all on the same footing, this variable sets the behavior of the matrix property accordingly.
default: True
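The following is a minimal sketch of what the default configuration amounts to, written against sklearn directly rather than against madas. It assumes that the "complement" of a similarity matrix is 1 - similarity, which the text above does not state. The toy matrix is made up, and min_samples=2 is added only because sklearn's default of 5 would mark all four toy points as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy symmetric similarity matrix for four materials (made-up values in [0, 1]).
similarity = np.array([
    [1.00, 0.95, 0.10, 0.05],
    [0.95, 1.00, 0.12, 0.08],
    [0.10, 0.12, 1.00, 0.90],
    [0.05, 0.08, 0.90, 1.00],
])

# Assumption: the complement is 1 - similarity, turning similarities into distances.
distance = 1.0 - similarity

# eps and metric mirror the documented default clusterer_kwargs;
# min_samples=2 is an addition for this tiny data set.
clusterer = DBSCAN(eps=0.15, metric="precomputed", min_samples=2)
labels = clusterer.fit(distance).labels_

print(labels)
```

Clusterers that expect affinities instead of distances would be given the similarity matrix itself, which is the case use_complement=False is meant for.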
- cluster(**kwargs) object
Perform clustering on similarity matrix and return self.
Keyword arguments are passed to the fit function of the clusterer.
Returns:
- self: SimilarityMatrixClusterer
Self after calling fit of self.clusterer
- get_cluster_size(cluster_label: int) int
Get the size (i.e. the number of members) of a specific cluster.
Arguments:
- cluster_label: int
Label of cluster to get size of.
Returns:
- cluster_size: int
Number of members of the specified cluster.
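As an illustration of the quantity being returned, cluster sizes can be counted directly from a label sequence. This is a sketch with hypothetical labels, not the madas implementation.

```python
from collections import Counter

# Hypothetical cluster labels; DBSCAN uses -1 for points not assigned to any cluster.
labels = [0, 0, 1, -1, 1, 1]

# Count how often each label occurs.
sizes = Counter(labels)

print(sizes[1])  # number of members with label 1
```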
- get_cluster_sub_matrix(cluster_label: int) SimilarityMatrix
Return a sub matrix of the similarity matrix that contains only the elements from the specified cluster.
Arguments:
- cluster_label: int
Label of cluster
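Extracting such a sub matrix can be sketched with plain NumPy indexing. The helper name cluster_sub_matrix and the data are hypothetical, and the sketch returns a bare array rather than a SimilarityMatrix.

```python
import numpy as np

# Hypothetical labels and similarity values for four materials.
labels = np.array([0, 1, 0, 1])
similarity = np.array([
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.2],
    [0.1, 0.8, 0.2, 1.0],
])

def cluster_sub_matrix(similarity, labels, cluster_label):
    """Return the rows and columns that belong to one cluster."""
    idx = np.flatnonzero(labels == cluster_label)
    # np.ix_ builds an open mesh so that rows and columns are selected together.
    return similarity[np.ix_(idx, idx)]

sub = cluster_sub_matrix(similarity, labels, 0)
print(sub)
```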
- get_label_dict() dict
Get a dictionary of all materials, where for each material id the cluster label is stored.
Returns:
- label_dict: dict
Dictionary {mid1:label1, […]} mapping material ids to cluster labels.
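The shape of the returned dictionary can be sketched as follows. The material ids and labels are hypothetical placeholders.

```python
# Hypothetical material ids and the cluster labels assigned to them, in the same order.
mids = ["mid-a", "mid-b", "mid-c", "mid-d"]
labels = [0, 0, 1, -1]

# Pair every material id with its label.
label_dict = dict(zip(mids, labels))

print(label_dict)
```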
- get_mids_by_cluster_label(cluster_label: int) List[str]
Get mids of all materials that have the specified cluster label.
Arguments:
- cluster_label: int
Label of requested cluster.
Returns:
- mid_list: List[str]
Material ids of all materials in the specified cluster.
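A lookup of this kind can be sketched as a filter over paired ids and labels. The function name mids_by_cluster_label and the data are hypothetical and only mirror the documented behavior.

```python
def mids_by_cluster_label(mids, labels, cluster_label):
    """Collect the ids whose label equals cluster_label, keeping the input order."""
    return [mid for mid, label in zip(mids, labels) if label == cluster_label]

# Hypothetical material ids and labels.
mids = ["mid-a", "mid-b", "mid-c", "mid-d"]
labels = [0, 1, 0, -1]

members = mids_by_cluster_label(mids, labels, 0)
print(members)
```

The length of the returned list equals the cluster size.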
- get_mids_sorted_by_cluster_labels(remove_orphans: bool = False)
Get mids of the similarity matrix, sorted in ascending order by cluster label.
Keyword arguments:
- remove_orphans: bool
Remove all entries with cluster label -1.
default: False
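The sorting and the effect of remove_orphans can be sketched as below. The helper and data are hypothetical. How madas orders ids within the same cluster is not stated here, so this sketch simply keeps their original order.

```python
def mids_sorted_by_cluster_labels(mids, labels, remove_orphans=False):
    """Sort ids by ascending label; optionally drop entries labelled -1."""
    # sorted() is stable, so ids sharing a label keep their input order.
    pairs = sorted(zip(labels, mids), key=lambda pair: pair[0])
    return [mid for label, mid in pairs if not (remove_orphans and label == -1)]

# Hypothetical material ids and labels.
mids = ["mid-a", "mid-b", "mid-c", "mid-d", "mid-e"]
labels = [1, -1, 0, 1, 0]

all_sorted = mids_sorted_by_cluster_labels(mids, labels)
without_orphans = mids_sorted_by_cluster_labels(mids, labels, remove_orphans=True)
print(all_sorted)
print(without_orphans)
```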
- get_sorted_similarity_matrix()
Return a SimilarityMatrix where all entries are sorted by ascending cluster label. This is helpful for visualization.
Returns:
- similarity_matrix: SimilarityMatrix
Similarity matrix with sorted entries.
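Why sorting helps visualization can be seen in a small NumPy sketch: reordering rows and columns by label groups cluster members next to each other, so clusters show up as blocks along the diagonal. The data is hypothetical and the sketch works on a bare array, not on a SimilarityMatrix.

```python
import numpy as np

# Hypothetical labels and similarity values; clusters are interleaved on purpose.
labels = np.array([1, 0, 1, 0])
similarity = np.array([
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
])

# Stable argsort keeps the original order inside each cluster.
order = np.argsort(labels, kind="stable")

# Apply the same permutation to rows and columns.
sorted_similarity = similarity[np.ix_(order, order)]
sorted_labels = labels[order]

print(sorted_similarity)
```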
- property labels
Return labels of clusterer.
- static load(filename='SiMatClus.npy', filepath='.')
Load clusterer from numpy format.
WARNING: This allows for unpickling object files. Always make sure that the file you attempt to load is safe.
Keyword arguments:
- filename: str
Name of file to be loaded.
- filepath: str
Relative path of file to be loaded.
Returns:
- self: SimilarityMatrixClusterer
- property matrix: ndarray
Return the values of the matrix that is clustered.
IF self.use_complement == True: return complement of similarity matrix
ELSE: return similarity matrix
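The branch described above can be written as a small function. This is a sketch only: the function name is hypothetical, and it assumes the complement is 1 - similarity, which is not stated in this reference.

```python
import numpy as np

def matrix_for_clustering(similarity, use_complement=True):
    """Return distances (complement) or the similarities themselves."""
    if use_complement:
        # Assumption: complement means 1 - similarity.
        return 1.0 - similarity
    return similarity

# Hypothetical 2x2 similarity matrix.
similarity = np.array([[1.0, 0.25], [0.25, 1.0]])

as_distance = matrix_for_clustering(similarity, use_complement=True)
as_similarity = matrix_for_clustering(similarity, use_complement=False)
print(as_distance)
```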
- property mids: List[str]
List of mids associated with the similarity matrix.
- property nclusters
Get the number of clusters, i.e., the number of unique cluster labels.
- save(filename: str = 'SiMatClus.npy', filepath: str = '.')
Save clusterer to numpy format.
Keyword arguments:
- filename: str
Name of file to be written.
- filepath: str
Relative path of file to be written.
Returns:
None
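The unpickling warning given for load can be illustrated with a generic NumPy round trip. This sketch is an assumption about the general mechanism, not a description of how madas stores its state: a Python object saved in .npy format is pickled, and np.load only reads it back when allow_pickle=True is passed explicitly. The stored dictionary is a hypothetical stand-in for clusterer state.

```python
import os
import tempfile

import numpy as np

# Hypothetical state to persist; a real clusterer would hold more than this.
state = {"mids": ["mid-a", "mid-b", "mid-c"], "labels": np.array([0, 0, -1])}

with tempfile.TemporaryDirectory() as filepath:
    full_path = os.path.join(filepath, "SiMatClus.npy")

    # Non-array objects are wrapped in a 0-d object array and pickled.
    np.save(full_path, state, allow_pickle=True)

    # allow_pickle defaults to False on load, because unpickling can run arbitrary code.
    restored = np.load(full_path, allow_pickle=True).item()

print(restored["mids"])
```

Only files from a trusted source should be loaded this way.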
- set_clusterer_params(**kwargs) None
Set parameters of self.clusterer.
- property unique_labels: ndarray
Get list of all unique cluster labels.
Returns:
- unique_labels: np.ndarray
Unique cluster labels
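The relation between unique_labels and nclusters can be sketched with np.unique. The labels are hypothetical. Following the description of nclusters as the number of unique cluster labels, the sketch counts every distinct label, including -1 if present. Whether madas treats -1 differently is not stated here.

```python
import numpy as np

# Hypothetical cluster labels, including one point labelled -1.
labels = np.array([0, 0, 1, -1, 1])

# np.unique returns the distinct values in ascending order.
unique_labels = np.unique(labels)
nclusters = len(unique_labels)

print(unique_labels, nclusters)
```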