The similarity matrices

class madas.similarity.SimilarityMatrix(matrix: List[list] = [], mids: List[str] | None = None, dtype: type = <class 'numpy.float64'>)

A matrix that stores all similarities between materials, together with the corresponding material identifiers.

align(matrices)

Align the materials in this matrix and all provided matrices.

Arguments:

matrices: SimilarityMatrix or List[SimilarityMatrix]

Matrix or list of matrices to align

Returns:

None

WARNING! All matrices involved will be altered: entries that are unique to one matrix, i.e. not present in every matrix, will be dropped.
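
The effect of dropping entries can be illustrated with a minimal sketch in plain Python that does not use madas. It assumes that alignment keeps only the mids shared by all matrices; the helper common_mids and the ordering it produces (that of the first list) are hypothetical and not taken from the library.

```python
from typing import List


def common_mids(mid_lists: List[List[str]]) -> List[str]:
    """Return the mids present in every list, in the order of the first list."""
    shared = set(mid_lists[0]).intersection(*mid_lists[1:])
    return [mid for mid in mid_lists[0] if mid in shared]


# Hypothetical material ids of two matrices
mids_a = ["m1", "m2", "m3", "m4"]
mids_b = ["m3", "m5", "m1"]

aligned = common_mids([mids_a, mids_b])
print(aligned)  # ['m1', 'm3']
```

In this toy example m2, m4 and m5 each occur in only one list, so they would no longer be available after alignment.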

calculate(fingerprints: List[Fingerprint], mids: List[str] | None = None, multiprocess: int | None = -1, symmetric: bool = True, similarity_function: Callable | None = None)

Calculate similarity matrix.

Arguments:

fingerprints: List[Fingerprint]

Fingerprints to calculate similarities.

Keyword arguments:

mids: List[str]

Material ids of fingerprints.

default: None -> mids are extracted from fingerprints

multiprocess: int

Calculate similarities in parallel on the available processors. Set to -1 to use all available processors, to None for serial execution, or to any positive integer to use that many processes.

default: -1

symmetric: bool

Reduce the computational cost by calculating only the unique half of the symmetric matrix.

default: True

similarity_function: Callable or None

Similarity function to set to the fingerprints before calculating the matrix. This will make a copy of the fingerprints, using additional memory. When set to None, the similarity function of the fingerprints (Fingerprint.similarity_function) will be used.

default: None

Returns:

self: SimilarityMatrix

Populated similarity matrix.
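
The idea behind the symmetric option can be sketched in plain Python. This is not the madas implementation: pairwise_similarities and toy_similarity are hypothetical names, and the example only shows how evaluating pairs with i <= j and mirroring the result reduces the number of function calls from n * n to n * (n + 1) / 2.

```python
from itertools import combinations_with_replacement
from typing import Callable, List, Sequence


def pairwise_similarities(items: Sequence, similarity: Callable) -> List[List[float]]:
    """Fill a square matrix while evaluating `similarity` only for i <= j."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations_with_replacement(range(n), 2):
        value = similarity(items[i], items[j])
        matrix[i][j] = value
        matrix[j][i] = value  # mirror into the lower triangle
    return matrix


def toy_similarity(a: float, b: float) -> float:
    """Hypothetical symmetric similarity measure with values in (0, 1]."""
    return 1.0 / (1.0 + abs(a - b))


matrix = pairwise_similarities([0.0, 1.0, 3.0], toy_similarity)
```

The mirroring step is only valid if the similarity measure itself is symmetric; for an asymmetric measure every pair would have to be evaluated.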

property dataframe: DataFrame

Return pandas.DataFrame object containing similarities as values and material ids as indices.

get_cleared_matrix(leave_out_mids, copy=True)

Return a matrix where all materials with mids specified in leave_out_mids are removed from the matrix.

Arguments:

leave_out_mids: List[str]

mids of materials to leave out of the matrix

Keyword arguments:

copy: bool

Return a copy of the SimilarityMatrix(). If False, apply the changes to self.

default: True

Returns:

cleared_matrix: SimilarityMatrix

Copy of the similarity matrix if copy is True, self otherwise; in both cases without the materials in leave_out_mids

get_entry(mid1: str, mid2: str) float64

Get any entry of the matrix.

Arguments:

mid1, mid2: str

material ids of the two requested materials

Returns:

similarity: float

Similarity between both materials

Raises:

KeyError: No entry for material with given mid.

get_k_most_similar(ref_mid: str, k: int = 10, remove_self: bool = True) dict

Get the k most similar materials and the respective similarities for a material.

WARNING! Accurate results can only be obtained for symmetric similarity matrices. For asymmetric similarity measures, the assignment may be ambiguous.

Arguments:

ref_mid: str

Material id of the reference material

Keyword arguments:

k: int

Number of most similar materials to return

default: 10

remove_self: bool

Remove the requested material from the results list

default: True

Returns:

dict: {<1st_nearest_mid>: <similarity>, <2nd_nearest_mid>: <similarity>, …}
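
The shape of the returned dictionary can be illustrated with a small sketch in plain Python. The function k_most_similar below is a hypothetical stand-in that works on a list of mids and one matrix row; it is not the library's implementation and assumes that larger values mean greater similarity.

```python
from typing import Dict, List


def k_most_similar(mids: List[str], row: List[float], ref_mid: str,
                   k: int = 10, remove_self: bool = True) -> Dict[str, float]:
    """Return the k largest similarities in `row`, keyed by mid, in descending order."""
    pairs = [(mid, sim) for mid, sim in zip(mids, row)
             if not (remove_self and mid == ref_mid)]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(pairs[:k])


mids = ["m1", "m2", "m3", "m4"]
row_m1 = [1.0, 0.2, 0.9, 0.5]  # hypothetical similarities of m1 to all materials

print(k_most_similar(mids, row_m1, "m1", k=2))  # {'m3': 0.9, 'm4': 0.5}
```

Because Python dictionaries preserve insertion order, the first key is the most similar material.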

get_matching_matrices(second_matrix)

Match matrices such that they contain the same materials in the same order.

Arguments:

second_matrix: SimilarityMatrix

Matrix to match materials

Returns:

new_self, new_matrix: tuple(SimilarityMatrix)

Matching similarity matrices

get_metadata() dict

Get dictionary of fingerprint type and name.

get_overlap_matrix(column_mids: List[str], row_mids: List[str]) object

Get OverlapSimilarityMatrix() from matrix. This new matrix contains (mostly) off-diagonal elements of the original matrix.

Arguments:

row_mids: List[str]

mids associated with the rows of the new matrix

column_mids: List[str]

mids associated with the columns of the matrix

Returns:

overlap_matrix: OverlapSimilarityMatrix

New matrix with rows and columns corresponding to materials with identifier row_mids and column_mids

get_row(mid: str) ndarray

Get a row of the matrix, by mid or index.

Arguments:

mid: str or int

Material id or matrix index of requested matrix row

Returns:

row: np.ndarray

Similarities of material with given mid to all other materials in the matrix.

Raises:

KeyError: No entry with given mid.

IndexError: Matrix index out of range.

get_sub_matrix(mid_list: List[str], copy: bool = True) object

Get sub matrix of all elements in mid_list, sorted by their order of occurrence in mid_list.

Arguments:

mid_list: List[str]

List of (unique) mids of materials to include in sub matrix

Keyword arguments:

copy: bool

Return a new similarity matrix. If set to False, apply changes to self.

default: True

Returns:

SimilarityMatrix() object of the sub matrix if copy == True

self, restricted to and sorted by the elements in mid_list, if copy == False

get_symmetric_matrix()

Deprecated! Use SimilarityMatrix().matrix property! Get square matrix form.

Returns:

square_matrix: np.ndarray

matrix of similarities

get_unique_entries() ndarray

Return all entries of the upper triangular matrix.

Returns:

entries: np.ndarray

list of all unique entries of the matrix

static load(filename: str = 'similarity_matrix.npy', filepath: str = '.') object

Static method. Load SimilarityMatrix from file. If the target file was created from an OverlapSimilarityMatrix object, an OverlapSimilarityMatrix object is returned.

Warning: This method loads a pickled file. Only load files of known origin.

Keyword arguments:

filename: str

Name of a saved similarity matrix file

default: ‘similarity_matrix.npy’

filepath: str

Relative path to SimilarityMatrix files

default: ‘.’

Returns:

similarity_matrix: SimilarityMatrix or OverlapSimilarityMatrix

Raises:

IndexError: Wrong format of data in file. Can not load.

lookup_similarity(fp1: Fingerprint, fp2: Fingerprint)

Return similarity between two fingerprints from the matrix.

The expected use case is to pass this function to set_similarity_function of a Fingerprint object.

Arguments:

fp1, fp2: Fingerprint

Fingerprint objects to retrieve similarity

Returns:

similarity: float

Similarity between materials with mids fp1.mid and fp2.mid

Raises:

KeyError: Similarity matrix does not contain an entry with these mids.
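
The lookup pattern can be sketched without madas as follows. ToyFingerprint and ToyMatrix are hypothetical stand-ins; the only assumption carried over from the description above is that the lookup is keyed by the mid attribute of both fingerprints and raises a KeyError for unknown mids.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyFingerprint:
    """Hypothetical stand-in for a Fingerprint; only the `mid` attribute is used."""
    mid: str


class ToyMatrix:
    """Hypothetical container for precomputed similarities, indexed by mid."""

    def __init__(self, mids: List[str], values: List[List[float]]):
        self._index: Dict[str, int] = {mid: i for i, mid in enumerate(mids)}
        self._values = values

    def lookup_similarity(self, fp1: ToyFingerprint, fp2: ToyFingerprint) -> float:
        # Raises KeyError if either mid is not part of the matrix
        return self._values[self._index[fp1.mid]][self._index[fp2.mid]]


toy = ToyMatrix(["m1", "m2"], [[1.0, 0.4], [0.4, 1.0]])
similarity_function = toy.lookup_similarity  # a two-argument similarity function

print(similarity_function(ToyFingerprint("m1"), ToyFingerprint("m2")))  # 0.4
```

Used this way, an expensive similarity measure is replaced by a lookup of values that were already calculated.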

property matrix: ndarray

Matrix values.

property mids: ndarray

Material identifier, corresponding to rows and columns of the matrix.

plot(colorbar: bool = False, show: bool = True) None

Plot the similarity matrix.

Keyword arguments:

colorbar: bool

Show a colorbar.

default: False

show: bool

Show the plot.

default: True

save(filename: str = 'similarity_matrix.npy', filepath: str = '.', overwrite: bool = False) None

Save SimilarityMatrix to numpy binary file.

Keyword arguments:

filename: str

Name of the file

default: ‘similarity_matrix.npy’

filepath: str

Relative path to created files

default: ‘.’

overwrite: bool

Overwrite matrix file if it exists.

default: False

set_dataframe(dataframe: DataFrame) None

Set the dataframe of the matrix.

Arguments:

dataframe: pandas.DataFrame

Dataframe that should contain similarities as values and material identifiers as indices.

set_matrix(matrix: ndarray) None

Set the values of the similarity matrix.

Arguments:

matrix: np.ndarray

array of values for the matrix

Can be a square matrix or an upper triangular matrix.

set_metadata(metadata: dict) None

Set fingerprint type and name from dictionary. Ignores other keys than fp_type and fp_name.

set_mids(mids: list) None

Set the matrix row and column identifiers to given values.

Arguments:

mids: List[str]

List of identifiers to be set

train_test_split(train_mids: List[str], test_mids: List[str]) set

Split similarity matrix into a (symmetric) train matrix and an (off-diagonal) test matrix.

Arguments:

train_mids: List[str]

mids that identify materials of the training set

test_mids: List[str]

mids that identify materials in the test set

Returns:

train_matrix, test_matrix: Set[SimilarityMatrix, OverlapSimilarityMatrix]
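
The relation between the two returned matrices can be sketched on nested lists. The function split_matrix is hypothetical, and the orientation of the test block (test materials as rows, train materials as columns) is an assumption made for this illustration only.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def split_matrix(matrix: Matrix, mids: List[str],
                 train_mids: List[str], test_mids: List[str]) -> Tuple[Matrix, Matrix]:
    """Return the (train x train) and (test x train) blocks of a square matrix."""
    index = {mid: i for i, mid in enumerate(mids)}
    train_idx = [index[mid] for mid in train_mids]
    test_idx = [index[mid] for mid in test_mids]
    train_block = [[matrix[i][j] for j in train_idx] for i in train_idx]
    test_block = [[matrix[i][j] for j in train_idx] for i in test_idx]
    return train_block, test_block


full = [[1.0, 0.8, 0.3],
        [0.8, 1.0, 0.5],
        [0.3, 0.5, 1.0]]
train, test = split_matrix(full, ["a", "b", "c"], ["a", "b"], ["c"])
print(train)  # [[1.0, 0.8], [0.8, 1.0]]
print(test)   # [[0.3, 0.5]]
```

The train block is again a square, symmetric matrix, while the test block contains only similarities between materials of different sets.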

class madas.similarity.OverlapSimilarityMatrix(matrix: List[list] = [], row_mids: list = [], column_mids: list = [], dtype=<class 'numpy.float64'>)

A SimilarityMatrix that is used to store similarities between different sets of fingerprints.

align(matrices: List[SimilarityMatrix] | SimilarityMatrix) None

Align the materials in this matrix and all provided matrices.

Arguments:

matrices: OverlapSimilarityMatrix or List[OverlapSimilarityMatrix]

matrix object(s) to align with

Returns:

None

Warning! All matrices involved will be altered: entries that are unique to one matrix, i.e. not present in every matrix, will be dropped.

calculate(reference_fingerprints: List[Fingerprint], fingerprints: List[Fingerprint], mids: List[str] = [], reference_mids=[]) object

Calculate similarity of fingerprints to given reference fingerprints.

Arguments:

reference_fingerprints: List[Fingerprint]

Fingerprints that correspond to columns of the matrix

fingerprints: List[Fingerprint]

Fingerprints that correspond to rows of the matrix

Keyword arguments:

mids: List[str]

Material identifier for the rows of the matrix

default: []

reference_mids: List[str]

Material identifier for the columns of the matrix

default: []

If possible, mids and reference_mids are taken directly from the Fingerprint objects.

Returns:

self: OverlapSimilarityMatrix

Calculated matrix object

property column_mids: ndarray

Get mids corresponding to matrix columns.

get_column(mid: str | int) ndarray

Get a column of the matrix.

Arguments:

mid: str or int

Material id or matrix index of the requested column

Returns:

matrix_column: numpy.ndarray

Column of the matrix

get_entries() ndarray

Get all entries of the matrix as a list.

Returns:

entries: numpy.ndarray

All entries of the matrix in a (N*M, 1)-dim list

get_entry(row_mid: str, column_mid: str) float64

Get a single entry of the matrix.

Arguments:

row_mid: str

material id of the material in the row of the matrix

column_mid: str

material id of the material in the column of the matrix

Returns:

similarity: numpy.float64

similarity between both materials

get_row(mid: str | int) ndarray

Get a row of the matrix.

Arguments:

mid: str or int

Material id or matrix index of the requested row

Returns:

matrix_row: numpy.ndarray

Row of the matrix

get_sub_matrix(row_mid_list: List[str], column_mid_list: List[str], copy: bool = True) object

Get sub matrix.

Arguments:

column_mid_list: List[str]

list of mids of materials in matrix column to include in sub matrix

row_mid_list: List[str]

list of mids of materials in matrix row to include in sub matrix

Keyword arguments:

copy: bool

Return a new similarity matrix. If set to False, apply changes to self.

default: True

Returns:

matrix: OverlapSimilarityMatrix

if copy == True

self: OverlapSimilarityMatrix

if copy == False: self, restricted to and sorted by the elements in row_mid_list and column_mid_list

get_symmetric_matrix()

Not implemented for OverlapSimilarityMatrix().

This function has no meaning for overlap similarity matrices.

Raises:

NotImplementedError: upon function call

property mids: Tuple[List[str], List[str]]

Get mids corresponding to matrix rows and columns.

property row_mids: ndarray

Get mids corresponding to matrix rows.

save(filename: str = 'overlap_similarity_matrix.npy', filepath: str = '.', overwrite: bool = False) None

Save SimilarityMatrix to numpy binary file(s).

Keyword arguments:

filename: str

name of the matrix file

default: ‘overlap_similarity_matrix.npy’

filepath: str

relative path to created files

default: ‘.’

overwrite: bool

Overwrite matrix file if it exists.

default: False

set_matrix(matrix: ndarray) None

Set matrix values.

If the values do not fit the shape of the original matrix, values and mids will be overwritten.

Arguments:

matrix: np.ndarray

Matrix to be set

set_mids(row_mids: List[str], column_mids: List[str]) None

Set the mids for rows and columns of the matrix.

transpose() None

Exchange rows and columns of matrix.

class madas.similarity.BatchedSimilarityMatrix(root_path: str = '.', matrix_folder_name: str = 'batched_similarity_matrix', fingerprint_files_name: str = 'batch_similarity_fingerprints', load_from_file: bool = True, batch_size: int = 10000, n_tasks: int = 1, task_id: int = 0, size: int = 0, symmetric: bool = True, dtype=<class 'numpy.float32'>)

A similarity matrix for parallel computation distributed over different tasks.

The calculation of pairwise similarities between fingerprints can be distributed over different processors. To do so, the matrix is split into different independent sub-matrices (batches), which are calculated separately.

Each BatchedSimilarityMatrix stores all required data and metadata in a single folder.

To reduce memory consumption, the fingerprints are input via a separate file containing serialized Fingerprint objects. This file can be generated, e.g., via:

import json

long_fingerprint_list_data = [fp.serialize() for fp in long_fingerprint_list]

with open('path/to/long/fingerprint/file', 'w') as fingerprint_file:
    json.dump(long_fingerprint_list_data, fingerprint_file)

In a next step, a BatchedSimilarityMatrix object can be used to split this large file into separate batches for computation of the similarity matrix.

batched_similarity_matrix.fingerprint_file_batches('long_fingerprint_file_name',
                                                   'path/to/long/fingerprint/file')

Now all data is prepared and the matrix can be calculated:

batched_similarity_matrix.calculate(similarity_function)

Note that because the fingerprints were serialized before, the similarity function must be specified for calculating similarities.
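
A minimal sketch in plain Python illustrates why a function has to be supplied again after serialization. The record layout and toy_similarity are hypothetical and do not reflect the actual output of Fingerprint.serialize(); the sketch only relies on the fact that JSON stores plain data and cannot store Python functions.

```python
import json
from typing import Dict, List


def toy_similarity(a: List[float], b: List[float]) -> float:
    """Hypothetical similarity: fraction of positions with equal values."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical serialized fingerprints; the real layout may differ
records: List[Dict] = [{"mid": "m1", "data": [1, 0, 1]},
                       {"mid": "m2", "data": [1, 1, 1]}]

restored = json.loads(json.dumps(records))  # plain data survives the round trip

try:
    json.dumps({"similarity_function": toy_similarity})
    function_serializable = True
except TypeError:
    function_serializable = False  # functions are not JSON-serializable

# The function therefore has to be passed in again when similarities are computed
value = toy_similarity(restored[0]["data"], restored[1]["data"])
```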

Keyword arguments:

root_path: str

Path where the data folder shall be created.

default: “.”

matrix_folder_name: str

Name of the similarity matrix data folder.

This name should be unique and descriptive!

default: “batched_similarity_matrix”

fingerprint_files_name: str

Base name of fingerprint files to be generated.

load_from_file: bool

Load metadata from file and ignore further keyword arguments.

default: True

batch_size: int

Maximum number of fingerprints in a batch. Each sub-matrix (batch) of the similarity matrix therefore has at most batch_size ** 2 entries.

default: 10000

n_tasks: int

Total number of tasks that are used to compute the similarity matrix.

default: 1

task_id: int

Id of the current task. This number specifies which batches of the similarity matrix are calculated.

default: 0

size: int

Total number of fingerprints. This number is automatically set if the function fingerprint_file_batches is called.

default: 0

symmetric: bool

Assume that the similarity matrix is symmetric and calculate only unique batches. Setting this option to True reduces the number of batches that are calculated by roughly a factor of two, as off-diagonal batches are computed only once.

default: True
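
How batch_size, symmetric, n_tasks and task_id interact can be sketched in plain Python. All function names below are hypothetical, and the round-robin assignment of batches to tasks is an assumption made for illustration; the scheme actually used by the BatchIterator may differ.

```python
from itertools import product
from typing import List, Tuple

# ((row_start, row_end), (col_start, col_end)), end indices exclusive
Batch = Tuple[Tuple[int, int], Tuple[int, int]]


def batch_ranges(size: int, batch_size: int) -> List[Tuple[int, int]]:
    """Split range(size) into consecutive index ranges of at most batch_size."""
    return [(start, min(start + batch_size, size))
            for start in range(0, size, batch_size)]


def batches(size: int, batch_size: int, symmetric: bool = True) -> List[Batch]:
    """Enumerate sub-matrices; if symmetric, keep only blocks with row <= column."""
    ranges = batch_ranges(size, batch_size)
    return [(rows, cols) for rows, cols in product(ranges, repeat=2)
            if not symmetric or rows[0] <= cols[0]]


def batches_for_task(all_batches: List[Batch], n_tasks: int, task_id: int) -> List[Batch]:
    """Assign batches to tasks round-robin (an assumed scheme)."""
    return all_batches[task_id::n_tasks]


all_batches = batches(size=25, batch_size=10, symmetric=True)
print(len(all_batches))  # 6, compared to 9 without symmetry
print(batches_for_task(all_batches, n_tasks=2, task_id=0))
```

With 25 fingerprints and a batch size of 10 there are three index ranges, so the full matrix consists of nine blocks, of which six are unique for a symmetric matrix.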

Methods:

property batch_iterator: BatchIterator

BatchIterator object that is used to iterate over batches of the similarity matrix.

property batch_size: int

Number of fingerprints in a batch.

calculate(similarity_function: Callable, overwrite: bool = False)

Calculate similarity matrix entries and write them to files.

Arguments:

similarity_function: Callable

Function of two Fingerprint objects that returns their similarity.

Keyword arguments:

overwrite: bool

Recalculate and overwrite existing matrix batch files.

fingerprint_file_batches(fingerprint_file_name: str, fingerprint_file_path: str, overwrite: bool = False, write_mid_file: bool = True, save_metadata_updates: bool = True) None

Split a single, large fingerprint file into batches that can be read by the individual tasks of a BatchedSimilarityMatrix.

Arguments:

fingerprint_file_name: str

Name of the file containing a json-encoded list of serialized Fingerprint objects.

fingerprint_file_path: str

Path to fingerprint file.

Keyword arguments:

overwrite: bool

Overwrite batched fingerprint files if they exist. If this is set to False and (any of) the files exist(s), a FileExistsError is raised.

default: False

write_mid_file: bool

Write a file that contains an enumeration of all material ids in the similarity matrix.

default: True

save_metadata_updates: bool

Save the metadata after determining the total number of fingerprints.

default: True

fingerprints_from_batch_file(idx: int, jdx: int) List[Fingerprint]

Read fingerprints from file and deserialize. Generates a list of Fingerprints from the files written by the function fingerprint_file_batches.

Arguments:

idx, jdx: int

Start and end index of the fingerprints file batch.

Returns:

fingerprints: List[Fingerprint]

Fingerprint objects from file

property folder_path: str

Path to folder where all data corresponding to this matrix is stored.

gen_fingerprint_batch_file_name(idx: int, jdx: int) str

Generate name of a fingerprint batch file from indices.

Arguments:

idx, jdx: int

Start and end index of the fingerprints file batch.

Returns:

filename: str

Name of a file with given indices.

gen_matrix_batch_file_name(batch: List[List[int]]) str

Generate name of a matrix batch file from batch.

Arguments:

batch: List[List[int]]

batch, as returned by BatchIterator objects

Returns:

filename: str

Name of a file corresponding to given batch.

get_entry_histogram(bins=np.arange(0, 1.01, 0.01))

Generate a histogram of entries in the matrix.

Keyword arguments:

bins: np.ndarray

Bins of the histogram. Each bin contains the number of matrix entries in the range [bins[i], bins[i+1]).

default: np.arange(0,1.01,0.01)

Returns:

(entries, bins): (np.ndarray, np.ndarray)

entries[i] contains the absolute number of matrix entries with similarity in the range [bins[i], bins[i+1]).
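
The half-open bin convention can be illustrated with a short sketch in plain Python. The function half_open_histogram is hypothetical and not the library's implementation; in particular, how values equal to the last bin edge are treated by the library is not specified here, and this sketch simply ignores them.

```python
from bisect import bisect_right
from typing import List, Sequence


def half_open_histogram(values: Sequence[float], bins: Sequence[float]) -> List[int]:
    """Count values per bin, where bin i covers [bins[i], bins[i+1])."""
    counts = [0] * (len(bins) - 1)
    for value in values:
        i = bisect_right(bins, value) - 1  # index of the last edge <= value
        if 0 <= i < len(counts):           # values outside all bins are ignored
            counts[i] += 1
    return counts


bins = [0.0, 0.25, 0.5, 0.75, 1.0]  # coarser than the default, for readability
print(half_open_histogram([0.1, 0.25, 0.3, 0.75, 0.99], bins))  # [1, 2, 0, 2]
```

A value that coincides with a bin edge, such as 0.25 above, is counted in the bin that starts at that edge.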

get_row_by_mid(mid: str) List[float]

Get a row of the (already calculated) similarity matrix from the mid of a material.

Calling this function requires that the mid file was written by fingerprint_file_batches.

Arguments:

mid: str

Material id of the reference entry.

Returns:

similarities: List[float]

Similarities of all materials of this similarity matrix to the reference specified by mid

Raises:

FileNotFoundError: Mid file does not exist (or has a non-compatible name).

KeyError: No material of given mid in the mid file (and thus in the matrix).

property matrices_for_this_task

Return the number of similarity matrices that are calculated for this task.

property metadata: dict

Metadata for BatchedSimilarityMatrix.

property metadata_filename: str

Name of the metadata file.

property mid_filename: str

Name of the file where all material ids are stored.

property most_similar_materials_filename: str

(Base) name of file(s) where all most similar materials are stored.

property n_tasks: int

Total number of tasks used to calculate similarity matrix.

save_metadata(overwrite: bool = True) None

Save current metadata to self.folder_name/self.metadata_filename.

Keyword arguments:

overwrite: bool

Overwrite file if it exists.

default: True

property size: int

Total number of fingerprints that are used.

property symmetric: bool

Matrix is symmetric.

property task_id: int

Task id of the current task.

write_most_similar_materials_file(k: int = 10, remove_self: bool = True)

Write files containing all k most similar materials for each entry of the matrix.

Keyword arguments:

k: int

Number of most similar materials to find.

default: 10

remove_self: bool

Find most similar materials without self.

default: True