Clustering utility

The clustering utilities provide wrappers for sklearn clusterers. They connect the cluster labels returned by a clusterer to the material ids.

class madas.clustering.SimilarityMatrixClusterer(similarity_matrix: ~madas.similarity.SimilarityMatrix, clusterer: type = <class 'threshold_clusterer.threshold_clusterer.ThresholdClusterer'>, clusterer_kwargs: dict = {'threshold': 0.75}, use_complement: bool = False)

A wrapper for clustering methods to directly apply them to SimilarityMatrix() objects.

Arguments:

similarity_matrix: SimilarityMatrix()

Similarity matrix that should be clustered

Keyword arguments:

clusterer: type

A class implementing a fit() method that clusters an np.ndarray matrix.

default: threshold_clusterer.ThresholdClusterer from https://github.com/kubanmar/similarity_threshold_clusterer

clusterer_kwargs: dict

Keyword arguments to be passed to the clusterer upon initialization.

default: {'threshold': 0.75}

use_complement: bool

Selects whether the similarity matrix (False) or the distance matrix (True) is used for clustering.

Some algorithms explicitly require similarity (sometimes called affinity) matrices; others require distances (also called dissimilarities or metrics). To treat them all on the same footing, this variable sets the behavior of the matrix property accordingly.

default: False
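The effect of use_complement can be sketched with plain NumPy. This is a minimal illustration, not the library's code: the helper select_matrix is hypothetical, and it assumes that the complement of a similarity matrix with values in [0, 1] is 1 - similarity, which the text above does not state explicitly.

```python
import numpy as np

# Toy symmetric similarity matrix with values in [0, 1] (illustrative data).
similarity = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])


def select_matrix(similarity: np.ndarray, use_complement: bool) -> np.ndarray:
    """Hypothetical helper mimicking the switch described above."""
    # Assumption: the complement is 1 - similarity, so identical entries
    # (similarity 1) end up at distance 0.
    return 1.0 - similarity if use_complement else similarity


distances = select_matrix(similarity, use_complement=True)
```

Under this assumption, a distance-based algorithm would receive zeros on the diagonal, while an affinity-based algorithm would receive the matrix unchanged.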

cluster(**kwargs) object

Perform clustering on the similarity matrix and return self.

Keyword arguments are passed to the fit function of the clusterer.

Returns:

self: SimilarityMatrixClusterer

The instance itself, after calling the fit() method of self.clusterer.
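To illustrate what kind of class can be passed as clusterer, here is a sketch of a class with a fit() method that accepts an np.ndarray. ToyThresholdClusterer is a hypothetical stand-in, not the ThresholdClusterer implementation: it simply groups entries connected by similarities at or above the threshold. Storing the result in a labels_ attribute and returning self from fit() follow the sklearn convention and are assumptions here.

```python
import numpy as np


class ToyThresholdClusterer:
    """Hypothetical stand-in; not the ThresholdClusterer algorithm."""

    def __init__(self, threshold: float = 0.75):
        self.threshold = threshold
        self.labels_ = None

    def fit(self, matrix: np.ndarray) -> "ToyThresholdClusterer":
        n = matrix.shape[0]
        labels = np.full(n, -1, dtype=int)  # -1 marks orphans
        next_label = 0
        for start in range(n):
            if labels[start] != -1:
                continue
            # Collect everything reachable from `start` via links >= threshold.
            stack, members = [start], {start}
            while stack:
                i = stack.pop()
                for j in np.flatnonzero(matrix[i] >= self.threshold):
                    if int(j) not in members:
                        members.add(int(j))
                        stack.append(int(j))
            # Entries without any partner keep the orphan label -1.
            if len(members) > 1:
                labels[sorted(members)] = next_label
                next_label += 1
        self.labels_ = labels
        return self


matrix = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
clusterer = ToyThresholdClusterer(threshold=0.75)
fitted = clusterer.fit(matrix)
```

Returning self from fit() is what allows calls to be chained, in the same way that cluster() returns the wrapper instance.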

get_cluster_size(cluster_label: int) int

Get the size (i.e. the number of members) of a specific cluster.

Arguments:

cluster_label: int

Label of cluster to get size of.

Returns:

cluster_size: int

Number of members of the specified cluster.

get_cluster_sub_matrix(cluster_label: int) SimilarityMatrix

Return a sub matrix of the similarity matrix that contains only the elements from the specified cluster.

Arguments:

cluster_label: int

Label of cluster
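Selecting the sub matrix of one cluster amounts to picking the same indices along both axes. The following sketch shows the idea on a raw np.ndarray; the function cluster_sub_matrix and the data are hypothetical, and the actual method returns a SimilarityMatrix rather than an array.

```python
import numpy as np

labels = np.array([1, -1, 0, 1])  # illustrative cluster labels
matrix = np.array([
    [1.0, 0.1, 0.2, 0.8],
    [0.1, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.1],
    [0.8, 0.2, 0.1, 1.0],
])


def cluster_sub_matrix(matrix, labels, cluster_label):
    """Hypothetical helper: keep only rows and columns of one cluster."""
    idx = np.flatnonzero(labels == cluster_label)
    # np.ix_ builds an open mesh so rows and columns are selected together.
    return matrix[np.ix_(idx, idx)]


sub = cluster_sub_matrix(matrix, labels, cluster_label=1)
```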

get_label_dict() dict

Get a dictionary of all materials, where for each material id the cluster label is stored.

Returns:

label_dict: dict

Dictionary {mid1:label1, […]} mapping material ids to cluster labels.
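The shape of such a dictionary can be sketched by pairing material ids with labels position by position. The ids and labels below are made up for illustration.

```python
mids = ["a", "b", "c", "d"]   # illustrative material ids
labels = [1, -1, 0, 1]        # illustrative cluster labels, same order as mids

# Pair each material id with the label at the same position.
label_dict = {mid: int(label) for mid, label in zip(mids, labels)}
```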

get_mids_by_cluster_label(cluster_label: int) List[str]

Get mids of all materials that have the specified cluster label.

Arguments:

cluster_label: int

Label of requested cluster.

Returns:

mid_list: List[str]

Material ids of all materials in the specified cluster.
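A minimal sketch of this lookup, using a hypothetical helper and made-up data, filters the material ids by their label. The length of the resulting list corresponds to the cluster size described for get_cluster_size.

```python
mids = ["a", "b", "c", "d"]   # illustrative material ids
labels = [1, -1, 0, 1]        # illustrative cluster labels, same order as mids


def mids_by_cluster_label(mids, labels, cluster_label):
    """Hypothetical helper: material ids whose label matches."""
    return [mid for mid, label in zip(mids, labels) if label == cluster_label]


members = mids_by_cluster_label(mids, labels, cluster_label=1)
cluster_size = len(members)
```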

get_mids_sorted_by_cluster_labels(remove_orphans: bool = False)

Get the mids of the similarity matrix, sorted by ascending cluster label.

Keyword arguments:

remove_orphans: bool

Remove all entries with cluster label -1.

default: False

get_sorted_similarity_matrix(remove_orphans: bool = False)

Return a SimilarityMatrix where all entries are sorted by ascending cluster label. This is helpful for visualization.

Keyword arguments:

remove_orphans: bool

Return matrix containing only cluster members, no orphans. Orphans are identified by cluster label -1.

default: False

Returns:

similarity_matrix: SimilarityMatrix

Similarity matrix with sorted entries.
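Sorting by cluster label, for both the material ids and the matrix, can be sketched with a stable argsort on the labels. The function sort_by_cluster_label is hypothetical and works on a raw np.ndarray; the use of a stable sort (keeping the original order within a cluster) is an assumption, not something the description above guarantees.

```python
import numpy as np

mids = ["a", "b", "c", "d"]       # illustrative material ids
labels = np.array([1, -1, 0, 1])  # illustrative labels, -1 marks orphans
matrix = np.array([
    [1.0, 0.1, 0.2, 0.8],
    [0.1, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.1],
    [0.8, 0.2, 0.1, 1.0],
])


def sort_by_cluster_label(matrix, mids, labels, remove_orphans=False):
    """Hypothetical helper: reorder entries by ascending cluster label."""
    labels = np.asarray(labels)
    order = np.argsort(labels, kind="stable")
    if remove_orphans:
        # Drop every index whose label is the orphan label -1.
        order = order[labels[order] != -1]
    sorted_mids = [mids[i] for i in order]
    return matrix[np.ix_(order, order)], sorted_mids


sorted_matrix, sorted_mids = sort_by_cluster_label(
    matrix, mids, labels, remove_orphans=True
)
```

After sorting, members of the same cluster sit next to each other, so clusters show up as blocks along the diagonal when the matrix is plotted.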

property labels

Return labels of clusterer.

static load(filename='SiMatClus.npy', filepath='.')

Load clusterer from numpy format.

WARNING: This allows for unpickling object files. Always make sure that the file you attempt to load is safe.

Keyword arguments:

filename: str

Name of file to be loaded.

filepath: str

Relative path of file to be loaded.

Returns:

self : SimilarityMatrixClusterer
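The warning above reflects how NumPy handles object arrays in .npy files: they are stored via pickle, and np.load refuses them unless allow_pickle=True is passed. The sketch below demonstrates that behavior with a made-up dictionary. It is not the library's save or load code, and the stored state is hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical state; the real file layout is not specified here.
state = {"labels": [0, 0, -1], "mids": ["a", "b", "c"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "SiMatClus.npy")
    # A 0-d object array forces np.save to pickle its content.
    np.save(path, np.array(state, dtype=object), allow_pickle=True)

    try:
        np.load(path)  # allow_pickle defaults to False
        refused = False
    except ValueError:
        refused = True

    # Opting in executes the pickle; only do this for trusted files.
    restored = np.load(path, allow_pickle=True).item()
```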

property matrix: ndarray

Return the values of the similarity matrix that is clustered.

IF self.use_complement == True: return complement of similarity matrix

ELSE: return similarity matrix

property mids: List[str]

List of mids associated with the similarity matrix.

property nclusters

Get the number of clusters, i.e., the number of unique cluster labels.

save(filename: str = 'SiMatClus.npy', filepath: str = '.')

Save clusterer to numpy format.

Keyword arguments:

filename: str

Name of file to be written.

filepath: str

Relative path of file to be written.

Returns:

None

set_clusterer_params(**kwargs) None

Set parameters of self.clusterer.

property unique_labels: ndarray

Get a list of all unique cluster labels.

Returns:

unique_labels: np.ndarray

Unique cluster labels
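Unique labels and their count can be sketched with np.unique on made-up labels. Note that counting unique labels as described for nclusters would include the orphan label -1 when it is present; whether the library treats it differently is not stated here.

```python
import numpy as np

labels = np.array([1, -1, 0, 1])  # illustrative labels, -1 marks orphans

# np.unique returns the sorted distinct values.
unique_labels = np.unique(labels)
n_unique = len(unique_labels)
```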