Clustering utility

The clustering utilities provide wrappers for sklearn clusterers. These wrappers connect the cluster labels returned by a clusterer to the corresponding material ids.

class madas.clustering.SimilarityMatrixClusterer(similarity_matrix: ~madas.similarity.SimilarityMatrix, clusterer: type = <class 'threshold_clusterer.threshold_clusterer.ThresholdClusterer'>, clusterer_kwargs: dict = {'threshold': 0.75}, use_complement: bool = False)

A wrapper for clustering methods to directly apply them to SimilarityMatrix() objects.

Arguments:

similarity_matrix: SimilarityMatrix(): Similarity matrix that should be clustered

Keyword arguments:

clusterer: type

A class that implements a fit() method, which is used to cluster an np.ndarray matrix.

default: threshold_clusterer.ThresholdClusterer from https://github.com/kubanmar/similarity_threshold_clusterer

clusterer_kwargs: dict

Keyword arguments to be passed to the clusterer upon initialization.

default: {'threshold': 0.75}

use_complement: bool

Selects whether the similarity matrix (False) or the distance matrix (True) is used for clustering.

Some algorithms explicitly require similarity (sometimes called affinity) matrices, while others require distances (also called dissimilarities or metrics). To treat them all on the same footing, this variable sets the behavior of the matrix property accordingly.

default: False
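As a rough illustration of the clusterer interface described above, the sketch below defines a hypothetical class with a fit() method that accepts a square matrix. The class name ToyThresholdClusterer, the labels_ attribute (borrowed from the sklearn convention), and the linkage logic are assumptions made for illustration; this is not the implementation of threshold_clusterer.ThresholdClusterer.

```python
class ToyThresholdClusterer:
    """Hypothetical clusterer: links entries whose similarity is >= threshold."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.labels_ = None  # assumed attribute name, following sklearn

    def fit(self, matrix):
        # `matrix` may be a nested list or a 2D array; only matrix[i][j] is used.
        n = len(matrix)
        labels = [-1] * n
        next_label = 0
        for start in range(n):
            if labels[start] != -1:
                continue
            # Collect everything reachable from `start` via links >= threshold.
            members, stack = {start}, [start]
            while stack:
                i = stack.pop()
                for j in range(n):
                    if j not in members and matrix[i][j] >= self.threshold:
                        members.add(j)
                        stack.append(j)
            if len(members) > 1:  # singletons stay orphans (label -1)
                for m in members:
                    labels[m] = next_label
                next_label += 1
        self.labels_ = labels
        return self


similarities = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.3],
    [0.2, 0.1, 0.3, 1.0],
]
toy = ToyThresholdClusterer(threshold=0.75).fit(similarities)
print(toy.labels_)  # [0, 0, -1, -1]
```

Any class with a compatible constructor and fit() could, in principle, be passed as clusterer, with its constructor arguments supplied through clusterer_kwargs.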

cluster(**kwargs) → object

Perform clustering on similarity matrix and return self.

Keyword arguments are passed to the fit function of the clusterer.

Returns:

self: SimilarityMatrixClusterer: Self after calling fit of self.clusterer

get_cluster_size(cluster_label: int) → int

Get the size (i.e. the number of members) of a specific cluster.

Arguments:

cluster_label: int: Label of cluster to get size of.

Returns:

cluster_size: int: Number of members of the specified cluster.
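Conceptually, a cluster's size is the number of times its label occurs in the label array. A minimal sketch using only the standard library, with made-up labels (how the actual method handles labels that do not occur is not specified here):

```python
from collections import Counter

labels = [0, 0, 1, -1, 1, 1]  # illustrative labels; -1 marks an orphan

sizes = Counter(labels)  # maps each label to its number of members
print(sizes[1])   # 3
print(sizes[0])   # 2
print(sizes[-1])  # 1
```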

get_cluster_sub_matrix(cluster_label: int) → SimilarityMatrix

Return a sub matrix of the similarity matrix that contains only the elements from the specified cluster.

Arguments:

cluster_label: int: Label of cluster
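The idea of a cluster sub matrix can be sketched as selecting the same set of row and column indices from the full matrix. The helper sub_matrix below is hypothetical and works on plain nested lists; the actual method returns a SimilarityMatrix object.

```python
def sub_matrix(matrix, labels, cluster_label):
    """Keep only rows and columns whose label equals cluster_label."""
    idx = [i for i, lab in enumerate(labels) if lab == cluster_label]
    return [[matrix[i][j] for j in idx] for i in idx]


full = [
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.9],
    [0.8, 0.3, 1.0, 0.4],
    [0.1, 0.9, 0.4, 1.0],
]
example_labels = [0, 1, 0, 1]

print(sub_matrix(full, example_labels, 0))  # [[1.0, 0.8], [0.8, 1.0]]
print(sub_matrix(full, example_labels, 1))  # [[1.0, 0.9], [0.9, 1.0]]
```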

get_label_dict() → dict

Get a dictionary of all materials, where for each material id the cluster label is stored.

Returns:

label_dict: dict: Dictionary {mid1:label1, […]} mapping material ids to cluster labels.

get_mids_by_cluster_label(cluster_label: int) → List[str]

Get mids of all materials that have the specified cluster label.

Arguments:

cluster_label: int: Label of requested cluster.

Returns:

mid_list: List[str]: Material ids of all materials in the specified cluster.
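Both this lookup and the dictionary returned by get_label_dict() amount to pairing each material id with the label at the same position. A minimal sketch with hypothetical material ids, assuming mids and labels share the same order:

```python
mids = ["mat_a", "mat_b", "mat_c", "mat_d"]  # hypothetical material ids
labels = [0, 1, 0, -1]                       # illustrative cluster labels

# Material id -> cluster label
label_dict = dict(zip(mids, labels))
print(label_dict)  # {'mat_a': 0, 'mat_b': 1, 'mat_c': 0, 'mat_d': -1}

# All material ids carrying a given label
mids_in_cluster_0 = [mid for mid, lab in label_dict.items() if lab == 0]
print(mids_in_cluster_0)  # ['mat_a', 'mat_c']
```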

get_mids_sorted_by_cluster_labels(remove_orphans: bool = False)

Get the mids of the similarity matrix, sorted in ascending order by cluster label.

Keyword arguments:

remove_orphans: bool

Remove all entries with cluster label -1.

default: False
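The sorting described above can be sketched as follows. The helper name and the material ids are hypothetical, and the order of entries within one cluster (here: original order, via a stable sort) is an assumption.

```python
def mids_sorted_by_label(mids, labels, remove_orphans=False):
    """Return mids ordered by ascending cluster label."""
    # sorted() is stable, so ties keep their original relative order.
    pairs = sorted(zip(labels, mids), key=lambda pair: pair[0])
    return [mid for lab, mid in pairs if not (remove_orphans and lab == -1)]


ids = ["mat_a", "mat_b", "mat_c", "mat_d"]  # hypothetical material ids
labs = [1, -1, 0, 1]

print(mids_sorted_by_label(ids, labs))
# ['mat_b', 'mat_c', 'mat_a', 'mat_d']
print(mids_sorted_by_label(ids, labs, remove_orphans=True))
# ['mat_c', 'mat_a', 'mat_d']
```

Because orphans carry the label -1, an ascending sort places them first unless they are removed.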

get_sorted_similarity_matrix(remove_orphans: bool = False)

Return a SimilarityMatrix where all entries are sorted by ascending cluster label. This is helpful for visualization.

Keyword arguments:

remove_orphans: bool

Return matrix containing only cluster members, no orphans. Orphans are identified by cluster label -1.

default: False

Returns:

similarity_matrix: SimilarityMatrix: Similarity matrix with sorted entries.
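To see why sorting helps visualization, the sketch below applies one permutation to both rows and columns so that members of a cluster become adjacent and form blocks along the diagonal. The helper is hypothetical and operates on nested lists rather than on SimilarityMatrix objects.

```python
def sort_matrix_by_label(matrix, labels, remove_orphans=False):
    """Reorder rows and columns by ascending cluster label."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    if remove_orphans:
        order = [i for i in order if labels[i] != -1]
    return [[matrix[i][j] for j in order] for i in order]


sims = [
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.9],
    [0.8, 0.3, 1.0, 0.4],
    [0.1, 0.9, 0.4, 1.0],
]

for row in sort_matrix_by_label(sims, [0, 1, 0, 1]):
    print(row)
# [1.0, 0.8, 0.2, 0.1]
# [0.8, 1.0, 0.3, 0.4]
# [0.2, 0.3, 1.0, 0.9]
# [0.1, 0.4, 0.9, 1.0]
```

In the reordered matrix, the high similarities (0.8 and 0.9) sit in 2x2 blocks on the diagonal, one block per cluster.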

property labels: Return labels of clusterer.

static load(filename='SiMatClus.npy', filepath='.')

Load clusterer from numpy format.

WARNING: This allows for unpickling object files. Always make sure that the file you attempt to load is safe.

Keyword arguments:

filename: str: Name of file to be loaded.
filepath: str: Relative path of file to be loaded.

Returns:

self : SimilarityMatrixClusterer
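One possible way to act on the warning above is to compare a file's checksum against a value obtained from a trusted source before loading it. The sketch below is not part of madas; the helper names are hypothetical, and a matching checksum only shows that the file is the one you expected, not that its contents are harmless.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Return True if the file's SHA-256 digest equals expected_hex."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

A caller would check matches_checksum(...) for the file in question and call load() only if the check returns True.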

property matrix: ndarray

Return the values of the similarity matrix that is clustered.

IF self.use_complement == True: return complement of similarity matrix

ELSE: return similarity matrix
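The switch can be pictured with the sketch below. It assumes that similarities lie in [0, 1] and that the complement is computed element-wise as 1 - s; the source does not state the exact definition, so treat this as an illustration rather than the library's behavior.

```python
def complement(matrix):
    """Element-wise 1 - s (assumed definition of the complement)."""
    return [[1.0 - value for value in row] for row in matrix]


def matrix_for_clustering(similarities, use_complement=False):
    """Mimic the IF/ELSE above on plain nested lists."""
    return complement(similarities) if use_complement else similarities


s = [
    [1.0, 0.75],
    [0.75, 1.0],
]
print(matrix_for_clustering(s))                       # [[1.0, 0.75], [0.75, 1.0]]
print(matrix_for_clustering(s, use_complement=True))  # [[0.0, 0.25], [0.25, 0.0]]
```

Under this assumption, identical entries (similarity 1) end up at distance 0, which is what distance-based algorithms expect.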

property mids: List[str]: List of mids associated with the similarity matrix.

property nclusters: Get the number of clusters, i.e., the number of unique cluster labels.

save(filename: str = 'SiMatClus.npy', filepath: str = '.')

Save clusterer to numpy format.

Keyword arguments:

filename: str: Name of file to be written.
filepath: str: Relative path of file to be written.

Returns:

None

set_clusterer_params(**kwargs) → None: Set parameters of self.clusterer.

property unique_labels: ndarray

Get an array of all unique cluster labels.

Returns:

unique_labels: np.ndarray: Unique cluster labels
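A small sketch of how unique labels relate to the cluster count, using made-up labels and plain Python instead of np.ndarray. Taking "the number of unique cluster labels" literally, the orphan label -1 is counted as well; whether the library treats it that way is an assumption here.

```python
cluster_labels = [0, 0, 1, -1, 1, 2]  # illustrative labels; -1 marks orphans

unique = sorted(set(cluster_labels))  # distinct labels in ascending order
print(unique)       # [-1, 0, 1, 2]
print(len(unique))  # 4
```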