The similarity matrices
- class madas.similarity.SimilarityMatrix(matrix: List[list] = [], mids: List[str] | None = None, dtype: type = numpy.float64)
A matrix that stores all similarities between materials, together with the corresponding material identifiers.
- align(matrices)
Align the materials in this matrix and all provided matrices.
Arguments:
- matrices: SimilarityMatrix or List[SimilarityMatrix]
Matrix or list of matrices to align
Returns:
None
WARNING! Entries in both matrices will be altered, i.e. entries that are unique to one of the matrices will be dropped.
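The effect described in the warning can be illustrated without madas. The following is a minimal sketch, not the library's implementation: it assumes each matrix is a plain nested list with a matching list of mids, and the helper name `restrict_to_common` is hypothetical. Materials that are missing from either matrix are dropped, and both results share the same order.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def restrict_to_common(
    mids_a: Sequence[str], matrix_a: Matrix,
    mids_b: Sequence[str], matrix_b: Matrix,
) -> Tuple[List[str], Matrix, Matrix]:
    """Keep only materials present in both matrices, ordered as in mids_a."""
    in_b = set(mids_b)
    common = [mid for mid in mids_a if mid in in_b]

    def sub(mids: Sequence[str], matrix: Matrix) -> Matrix:
        # Map each mid to its row/column index, then pick the common ones.
        index = {mid: i for i, mid in enumerate(mids)}
        picks = [index[mid] for mid in common]
        return [[matrix[i][j] for j in picks] for i in picks]

    return common, sub(mids_a, matrix_a), sub(mids_b, matrix_b)


# Illustrative data (not from the original documentation).
mids_a = ["a", "b", "c"]
matrix_a = [[1.0, 0.2, 0.3], [0.2, 1.0, 0.4], [0.3, 0.4, 1.0]]
mids_b = ["c", "a", "d"]
matrix_b = [[1.0, 0.5, 0.6], [0.5, 1.0, 0.7], [0.6, 0.7, 1.0]]

common, aligned_a, aligned_b = restrict_to_common(mids_a, matrix_a, mids_b, matrix_b)
```

Here "b" and "d" each occur in only one matrix, so both are dropped and only "a" and "c" remain.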
- calculate(fingerprints: List[Fingerprint], mids: List[str] | None = None, multiprocess: int | None = -1, symmetric: bool = True, similarity_function: Callable | None = None)
Calculate similarity matrix.
Arguments:
- fingerprints: List[Fingerprint]
Fingerprints to calculate similarities.
Keyword arguments:
- mids: List[str]
Material ids of fingerprints.
default: None -> mids are extracted from fingerprints
- multiprocess: int
Calculate similarities on available processors. Set to -1 to use all available processors. Set to None for serial execution. Set to any positive integer to use that many processes.
default: -1
- symmetric: bool
Reduce computation cost by calculating only unique half of symmetric matrix
default: True
- similarity_function: Callable or None
Similarity function to set to the fingerprints before calculating the matrix. This will make a copy of the fingerprints, using additional memory. When set to None, the similarity function of the fingerprints (Fingerprint.similarity_function) will be used.
default: None
Returns:
- self: SimilarityMatrix
Populated similarity matrix.
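The saving behind the symmetric option can be sketched in plain Python. This is an illustration under stated assumptions, not the madas implementation: `pairwise_similarities` and `inverse_distance` are hypothetical names, and plain numbers stand in for Fingerprint objects. With `symmetric=True` only the pairs with j >= i are evaluated and each value is mirrored.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def pairwise_similarities(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    symmetric: bool = True,
) -> List[List[float]]:
    """Fill an n x n matrix; evaluate only the upper triangle if symmetric."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        start = i if symmetric else 0
        for j in range(start, n):
            value = similarity(items[i], items[j])
            matrix[i][j] = value
            if symmetric:
                matrix[j][i] = value  # mirror instead of recomputing
    return matrix


def inverse_distance(a: float, b: float) -> float:
    """Toy symmetric similarity measure used only for this example."""
    return 1.0 / (1.0 + abs(a - b))


values = [0.0, 1.0, 3.0]
matrix = pairwise_similarities(values, inverse_distance)
full = pairwise_similarities(values, inverse_distance, symmetric=False)
```

For n items the symmetric variant calls the similarity function n(n+1)/2 times instead of n² times. Both variants give the same matrix only if the similarity function itself is symmetric.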
- property dataframe: DataFrame
Return pandas.DataFrame object containing similarities as values and material ids as indices.
- get_cleared_matrix(leave_out_mids, copy=True)
Return a matrix where all materials with mids specified in leave_out_mids are removed from the matrix.
Arguments:
- leave_out_mids: List[str]
mids of materials to leave out of the matrix
Keyword arguments:
- copy: bool
Return a copy of the SimilarityMatrix. If False, apply the changes to self.
default: True
Returns:
- cleared_matrix: SimilarityMatrix
Similarity matrix without the materials in leave_out_mids: a copy if copy is True, self otherwise.
- get_entry(mid1: str, mid2: str) -> float64
Get any entry of the matrix.
Arguments:
- mid1, mid2: str
Material ids of the two requested materials
Returns:
- similarity: float
Similarity between both materials
Raises:
KeyError: No entry for material with given mid.
- get_k_most_similar(ref_mid: str, k: int = 10, remove_self: bool = True) -> dict
Get the k most similar materials and the respective similarities for a material.
WARNING! Accurate results can only be obtained for symmetric similarity matrices. For asymmetric similarity measures, the assignment may be ambiguous.
Arguments:
- ref_mid: str
Material id of the reference material
Keyword arguments:
- k: int
Number of most similar materials to return
default: 10
- remove_self: bool
Remove the requested material from the results list
default: True
Returns:
dict: {<1st_nearest_mid>: <similarity>, <2nd_nearest_mid>: <similarity>, …}
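The shape of the returned dictionary can be illustrated with a small stand-alone sketch. It is not the madas implementation: the function `k_most_similar` below is hypothetical and works on a single matrix row given as a plain list, with the mids in the same order.

```python
from typing import Dict, Sequence


def k_most_similar(
    mids: Sequence[str],
    row: Sequence[float],
    ref_mid: str,
    k: int = 10,
    remove_self: bool = True,
) -> Dict[str, float]:
    """Return the k highest similarities of one row, most similar first."""
    pairs = [
        (mid, similarity)
        for mid, similarity in zip(mids, row)
        if not (remove_self and mid == ref_mid)
    ]
    # Sort by similarity, highest first; dicts keep insertion order.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(pairs[:k])


# Illustrative row for reference material "a".
mids = ["a", "b", "c", "d"]
row_a = [1.0, 0.2, 0.9, 0.5]

nearest = k_most_similar(mids, row_a, "a", k=2)
with_self = k_most_similar(mids, row_a, "a", k=2, remove_self=False)
```

For an asymmetric measure the row and the column of a material differ, which is why the result is only unambiguous for symmetric matrices.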
- get_matching_matrices(second_matrix)
Match matrices such that they contain the same materials in the same order.
Arguments:
- second_matrix: SimilarityMatrix
Matrix to match materials
Returns:
- new_self, new_matrix: tuple(SimilarityMatrix)
Matching similarity matrices
- get_metadata() -> dict
Get dictionary of fingerprint type and name.
- get_overlap_matrix(column_mids: List[str], row_mids: List[str]) -> object
Get OverlapSimilarityMatrix() from matrix. This new matrix contains (mostly) off-diagonal elements of the original matrix.
Arguments:
- row_mids: List[str]
mids associated with the rows of the new matrix
- column_mids: List[str]
mids associated with the columns of the matrix
Returns:
- overlap_matrix: OverlapSimilarityMatrix
New matrix with rows and columns corresponding to materials with identifier row_mids and column_mids
- get_row(mid: str) -> ndarray
Get a row of the matrix, by mid or index.
Arguments:
- mid: str or int
Material id or matrix index of the requested matrix row
Returns:
- row: np.ndarray
Similarities of material with given mid to all other materials in the matrix.
Raises:
KeyError: No entry with given mid.
IndexError: Matrix index out of range.
- get_sub_matrix(mid_list: List[str], copy: bool = True) -> object
Get the sub matrix of all elements in mid_list, sorted by their order of occurrence in mid_list.
Arguments:
- mid_list: List[str]
List of (unique) mids of materials to include in sub matrix
Keyword arguments:
- copy: bool
Return a new similarity matrix. If set to False, apply changes to self.
default: True
Returns:
SimilarityMatrix() object of the sub matrix if copy == True
self, restricted to and sorted by the elements in mid_list, if copy == False
- get_symmetric_matrix()
Deprecated! Use SimilarityMatrix().matrix property! Get square matrix form.
Returns:
- square_matrix: np.ndarray
matrix of similarities
- get_unique_entries() -> ndarray
Return all entries of the upper triangular matrix.
Returns:
- entries: np.ndarray
list of all unique entries of the matrix
- static load(filename: str = 'similarity_matrix.npy', filepath: str = '.') -> object
Static method. Load a SimilarityMatrix from file. If the file was created from an OverlapSimilarityMatrix object, an OverlapSimilarityMatrix object is returned.
Warning: This method loads a pickled file. Only load files of known origin.
Keyword arguments:
- filename: str
Name of a saved similarity matrix file
default: ‘similarity_matrix.npy’
- filepath: str
Relative path to SimilarityMatrix files
default: ‘.’
Returns:
similarity_matrix: SimilarityMatrix or OverlapSimilarityMatrix
Raises:
IndexError: Wrong format of data in file. Can not load.
- lookup_similarity(fp1: Fingerprint, fp2: Fingerprint)
Return similarity between two fingerprints from the matrix.
The expected use case is to pass this function to set_similarity_function of a Fingerprint object.
Arguments:
- fp1, fp2: Fingerprint
Fingerprint objects to retrieve similarity
Returns:
- similarity: float
Similarity between materials with mids fp1.mid and fp2.mid
Raises:
KeyError: Similarity matrix does not contain an entry with these mids.
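The idea of a lookup-based similarity function can be sketched as follows. This is an assumption-laden illustration rather than madas code: `StubFingerprint` is a hypothetical stand-in that only carries a `mid` attribute, and `make_lookup` is a hypothetical helper that closes over a precomputed matrix stored as a nested list.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class StubFingerprint:
    """Hypothetical stand-in for a Fingerprint; only the mid is needed."""
    mid: str


def make_lookup(
    mids: Sequence[str], matrix: List[List[float]]
) -> Callable[[StubFingerprint, StubFingerprint], float]:
    """Build a function that reads similarities from a precomputed matrix."""
    index = {mid: i for i, mid in enumerate(mids)}

    def lookup(fp1: StubFingerprint, fp2: StubFingerprint) -> float:
        # Raises KeyError if either mid is not part of the matrix.
        return matrix[index[fp1.mid]][index[fp2.mid]]

    return lookup


mids = ["a", "b"]
matrix = [[1.0, 0.2], [0.2, 1.0]]
lookup = make_lookup(mids, matrix)
similarity_ab = lookup(StubFingerprint("a"), StubFingerprint("b"))
```

A function of this shape takes two fingerprint-like objects and returns a float, so it can be used wherever a similarity function is expected, without recomputing anything.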
- property matrix: ndarray
Matrix values.
- property mids: ndarray
Material identifiers corresponding to the rows and columns of the matrix.
- plot(colorbar: bool = False, show: bool = True) -> None
Plot the similarity matrix.
Keyword arguments:
- colorbar: bool
Show a colorbar.
default: False
- show: bool
Show the plot.
default: True
- save(filename: str = 'similarity_matrix.npy', filepath: str = '.', overwrite: bool = False) -> None
Save SimilarityMatrix to numpy binary file.
Keyword arguments:
- filename: str
Name of the file
default: ‘similarity_matrix.npy’
- filepath: str
Relative path to created files
default: ‘.’
- overwrite: bool
Overwrite matrix file if it exists.
default: False
- set_dataframe(dataframe: DataFrame) -> None
Set the dataframe of the matrix.
Arguments:
- dataframe: pandas.DataFrame
Dataframe that should contain similarities as values and material identifiers as indices.
- set_matrix(matrix: ndarray) -> None
Set the values of the similarity matrix.
Arguments:
- matrix: np.ndarray
array of values for the matrix
Can be a square matrix or an upper triangular matrix.
- set_metadata(metadata: dict) -> None
Set fingerprint type and name from a dictionary. Keys other than fp_type and fp_name are ignored.
- set_mids(mids: list) -> None
Set the matrix row and column identifiers to given values.
Arguments:
- mids: List[str]
List of identifiers to be set
- train_test_split(train_mids: List[str], test_mids: List[str]) -> set
Split similarity matrix into a (symmetric) train matrix and an (off-diagonal) test matrix.
Arguments:
- train_mids: List[str]
mids that identify materials of the training set
- test_mids: List[str]
mids that identify materials in the test set
Returns:
train_matrix, test_matrix: Set[SimilarityMatrix, OverlapSimilarityMatrix]
- class madas.similarity.OverlapSimilarityMatrix(matrix: List[list] = [], row_mids: list = [], column_mids: list = [], dtype=numpy.float64)
A SimilarityMatrix that is used to store similarities between different sets of fingerprints.
- align(matrices: List[SimilarityMatrix] | SimilarityMatrix) -> None
Align the materials in this matrix and all provided matrices.
Arguments:
- matrices: OverlapSimilarityMatrix or List[OverlapSimilarityMatrix]
matrix object(s) to align with
Returns:
None
Warning! Entries in both matrices will be altered, i.e. entries that are unique to one of the matrices will be dropped.
- calculate(reference_fingerprints: List[Fingerprint], fingerprints: List[Fingerprint], mids: List[str] = [], reference_mids=[]) -> object
Calculate similarity of fingerprints to given reference fingerprints.
Arguments:
- reference_fingerprints: List[Fingerprint]
Fingerprints that correspond to columns of the matrix
- fingerprints: List[Fingerprint]
Fingerprints that correspond to rows of the matrix
Keyword arguments:
- mids: List[str]
Material identifiers for the rows of the matrix
default: []
- reference_mids: List[str]
Material identifiers for the columns of the matrix
default: []
If possible, mids and reference_mids are taken directly from the Fingerprint objects.
Returns:
- self: OverlapSimilarityMatrix
Calculated matrix object
- property column_mids: ndarray
Get mids corresponding to matrix columns.
- get_column(mid: str | int) -> ndarray
Get a column of the matrix.
Arguments:
- mid: str or int
Material id or matrix index of the requested column
Returns:
- matrix_column: numpy.ndarray
Column of the matrix
- get_entries() -> ndarray
Get all entries of the matrix as a list.
Returns:
- entries: numpy.ndarray
All entries of the matrix in a (N*M, 1)-dim list
- get_entry(row_mid: str, column_mid: str) -> float64
Get a single entry of the matrix.
Arguments:
- row_mid: str
material id of the material in the row of the matrix
- column_mid: str
material id of the material in the column of the matrix
Returns:
- similarity: numpy.float64
similarity between both materials
- get_row(mid: str | int) -> ndarray
Get a row of the matrix.
Arguments:
- mid: str or int
Material id or matrix index of the requested row
Returns:
- matrix_row: numpy.ndarray
Row of the matrix
- get_sub_matrix(row_mid_list: List[str], column_mid_list: List[str], copy: bool = True) -> object
Get sub matrix.
Arguments:
- column_mid_list: List[str]
list of mids of materials in matrix column to include in sub matrix
- row_mid_list: List[str]
list of mids of materials in matrix row to include in sub matrix
Keyword arguments:
- copy: bool
Return a new similarity matrix. If set to False, apply changes to self.
default: True
Returns:
- matrix: OverlapSimilarityMatrix
if copy == True
- self: OverlapSimilarityMatrix
if copy == False; self, restricted to and sorted by the elements in row_mid_list and column_mid_list
- get_symmetric_matrix()
Not implemented for OverlapSimilarityMatrix().
This function has no meaning for overlap similarity matrices.
Raises:
NotImplementedError: upon function call
- property mids: Tuple[List[str], List[str]]
Get mids corresponding to matrix rows and columns.
- property row_mids: ndarray
Get mids corresponding to matrix rows.
- save(filename: str = 'overlap_similarity_matrix.npy', filepath: str = '.', overwrite: bool = False) -> None
Save SimilarityMatrix to numpy binary file(s).
Keyword arguments:
- filename: str
name of the matrix file
default: ‘overlap_similarity_matrix.npy’
- filepath: str
relative path to created files
default: ‘.’
- overwrite: bool
Overwrite matrix file if it exists.
default: False
- set_matrix(matrix: ndarray) -> None
Set matrix values.
If the values do not fit the shape of the original matrix, values and mids will be overwritten.
Arguments:
- matrix: np.ndarray
Matrix to be set
- set_mids(row_mids: List[str], column_mids: List[str]) -> None
Set the mids for rows and columns of the matrix.
- transpose() -> None
Exchange rows and columns of matrix.
- class madas.similarity.BatchedSimilarityMatrix(root_path: str = '.', matrix_folder_name: str = 'batched_similarity_matrix', fingerprint_files_name: str = 'batch_similarity_fingerprints', load_from_file: bool = True, batch_size: int = 10000, n_tasks: int = 1, task_id: int = 0, size: int = 0, symmetric: bool = True, dtype=numpy.float32)
A similarity matrix for parallel computation distributed over different tasks.
The calculation of pairwise similarities between fingerprints can be distributed over different processors. To do so, the matrix is split into different independent sub-matrices (batches), which are calculated separately.
Each BatchedSimilarityMatrix stores all required data and metadata in a single folder.
To reduce memory consumption, the fingerprints are input via a separate file containing serialized Fingerprint objects. This file can be generated via, _e.g._:
    import json

    long_fingerprint_list_data = [fp.serialize() for fp in long_fingerprint_list]
    with open('path/to/long/fingerprint/file', 'w') as fingerprint_file:
        json.dump(long_fingerprint_list_data, fingerprint_file)
In a next step, a BatchedSimilarityMatrix object can be used to split this large file into separate batches for computation of the similarity matrix.
batched_similarity_matrix.fingerprint_file_batches('long_fingerprint_file_name', 'path/to/long/fingerprint/file')
Now all data is prepared and the matrix can be calculated:
batched_similarity_matrix.calculate(similarity_function)
Note that because the fingerprints were serialized before, the similarity function must be specified for calculating similarities.
Keyword arguments:
- root_path: str
Path where the data folder shall be created.
default: “.”
- matrix_folder_name: str
Name of the similarity matrix data folder.
This name should be unique and descriptive!
default: “batched_similarity_matrix”
- fingerprint_files_name: str
Base name of fingerprint files to be generated.
default: “batch_similarity_fingerprints”
- load_from_file: bool
Load metadata from file and ignore further keyword arguments.
default: True
- batch_size: int
Maximum number of fingerprints in a batch. Each batch of the similarity matrix has at most batch_size ** 2 entries.
default: 10000
- n_tasks: int
Total number of tasks that are used to compute the similarity matrix.
default: 1
- task_id: int
Id of the current task. This number specifies which batches of the similarity matrix are calculated.
default: 0
- size: int
Total number of fingerprints. This number is automatically set if the function fingerprint_file_batches is called.
default: 0
- symmetric: bool
Assume that the similarity matrix is symmetric and calculate only unique batches. Setting this option to True reduces the number of batches that are calculated by roughly a factor of two, as off-diagonal elements are computed only once.
default: True
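How batch_size, n_tasks, task_id and symmetric interact can be pictured with a small sketch. It is written under explicit assumptions and is not the BatchIterator implementation: the helper names are hypothetical, each batch is represented as a pair of (start, end) index ranges, and the round-robin assignment of batches to tasks is an assumed scheme.

```python
from typing import List, Tuple

IndexRange = Tuple[int, int]
Batch = Tuple[IndexRange, IndexRange]


def batch_ranges(size: int, batch_size: int) -> List[IndexRange]:
    """Split range(size) into consecutive [start, end) chunks."""
    return [
        (start, min(start + batch_size, size))
        for start in range(0, size, batch_size)
    ]


def batches_for_task(
    size: int,
    batch_size: int,
    n_tasks: int = 1,
    task_id: int = 0,
    symmetric: bool = True,
) -> List[Batch]:
    """List the (row range, column range) batches handled by one task."""
    ranges = batch_ranges(size, batch_size)
    batches = [
        (rows, cols)
        for i, rows in enumerate(ranges)
        for j, cols in enumerate(ranges)
        if not symmetric or j >= i  # skip mirrored batches if symmetric
    ]
    # Assumed round-robin distribution of batches over tasks.
    return batches[task_id::n_tasks]


all_symmetric = batches_for_task(25, 10)
all_full = batches_for_task(25, 10, symmetric=False)
task_0 = batches_for_task(25, 10, n_tasks=2, task_id=0)
task_1 = batches_for_task(25, 10, n_tasks=2, task_id=1)
```

With 25 fingerprints and a batch size of 10 there are three index ranges, giving 6 batches in the symmetric case and 9 otherwise. For b index ranges the counts are b(b+1)/2 and b², which approaches a factor of two for large b.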
Methods:
- property batch_iterator: BatchIterator
BatchIterator object that is used to iterate over batches of the similarity matrix.
- property batch_size: int
Number of fingerprints in a batch.
- calculate(similarity_function: Callable, overwrite: bool = False)
Calculate similarity matrix entries and write them to files.
Arguments:
- similarity_function: Callable
Function of two Fingerprint objects that returns their similarity.
Keyword arguments:
- overwrite: bool
Recalculate and overwrite existing matrix batch files.
default: False
- fingerprint_file_batches(fingerprint_file_name: str, fingerprint_file_path: str, overwrite: bool = False, write_mid_file: bool = True, save_metadata_updates: bool = True) -> None
Split a single, large fingerprint file into batches that can be read by the individual tasks of a BatchedSimilarityMatrix.
Arguments:
- fingerprint_file_name: str
Name of the file containing a json-encoded list of serialized Fingerprint objects.
- fingerprint_file_path: str
Path to fingerprint file.
Keyword arguments:
- overwrite: bool
Overwrite batched fingerprint files if they exist. If this is set to False and (any of) the files exist(s), a FileExistsError is raised.
default: False
- write_mid_file: bool
Write a file that contains an enumeration of all material ids in the similarity matrix.
default: True
- save_metadata_updates: bool
Save the metadata after determining the total number of fingerprints.
default: True
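The splitting step can be sketched with the standard library alone. This is a hedged illustration, not the madas implementation: `split_json_list` and `demo_split` are hypothetical helpers, the file naming scheme is invented for the example, and plain dictionaries stand in for serialized Fingerprint objects.

```python
import json
import os
import tempfile
from typing import List


def split_json_list(source_path: str, target_dir: str, base_name: str,
                    batch_size: int, overwrite: bool = False) -> List[str]:
    """Split a JSON-encoded list into files of at most batch_size entries."""
    with open(source_path) as source:
        entries = json.load(source)
    # Hypothetical naming scheme: <base_name>_<start>_<end>.json
    chunks = []
    for start in range(0, len(entries), batch_size):
        end = min(start + batch_size, len(entries))
        path = os.path.join(target_dir, f"{base_name}_{start}_{end}.json")
        chunks.append((path, entries[start:end]))
    # Refuse to write anything if a target exists and overwrite is False.
    if not overwrite and any(os.path.exists(path) for path, _ in chunks):
        raise FileExistsError("batch files already exist")
    for path, chunk in chunks:
        with open(path, "w") as target:
            json.dump(chunk, target)
    return [path for path, _ in chunks]


def demo_split() -> List[int]:
    """Write five dummy entries, split them in batches of two, return sizes."""
    with tempfile.TemporaryDirectory() as folder:
        source_path = os.path.join(folder, "all_fingerprints.json")
        with open(source_path, "w") as source:
            json.dump([{"mid": f"m{i}"} for i in range(5)], source)
        paths = split_json_list(source_path, folder, "batch", batch_size=2)
        sizes = []
        for path in paths:
            with open(path) as handle:
                sizes.append(len(json.load(handle)))
        return sizes
```

Each task then only needs to read the batch files for the index ranges it is responsible for, rather than the complete fingerprint list.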
- fingerprints_from_batch_file(idx: int, jdx: int) -> List[Fingerprint]
Read fingerprints from file and deserialize. Generates a list of Fingerprints from the files written by the function fingerprint_file_batches.
Arguments:
- idx, jdx: int
Start and end index of the fingerprints file batch.
Returns:
- fingerprints: List[Fingerprint]
Fingerprint objects from file
- property folder_path: str
Path to folder where all data corresponding to this matrix is stored.
- gen_fingerprint_batch_file_name(idx: int, jdx: int) -> str
Generate name of a fingerprint batch file from indices.
Arguments:
- idx, jdx: int
Start and end index of the fingerprints file batch.
Returns:
- filename: str
Name of a file with given indices.
- gen_matrix_batch_file_name(batch: List[List[int]]) -> str
Generate name of a matrix batch file from batch.
Arguments:
- batch: List[List[int]]
batch, as returned by BatchIterator objects
Returns:
- filename: str
Name of a file corresponding to given batch.
- get_entry_histogram(bins=np.arange(0, 1.01, 0.01))
Generate a histogram of entries in the matrix.
Keyword arguments:
- bins: np.ndarray
Bins of the histogram. Each bin contains the number of matrix entries in the range [bins[i], bins[i+1]).
default: np.arange(0,1.01,0.01)
Returns:
- (entries, bins): (np.ndarray, np.ndarray)
entries[i] contains the absolute number of matrix entries with similarity in the range [bins[i], bins[i+1]).
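The half-open bin convention stated above can be made concrete with a short sketch. It is not the madas implementation: `entry_histogram` is a hypothetical helper that follows the documented range [bins[i], bins[i+1]) literally, so a value equal to the final edge is not counted. Whether madas treats the last bin as closed, as `numpy.histogram` does, is not stated here.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def entry_histogram(
    values: Sequence[float], bins: Sequence[float]
) -> Tuple[List[int], List[float]]:
    """Count values per half-open bin [bins[i], bins[i+1])."""
    counts = [0] * (len(bins) - 1)
    for value in values:
        # Index of the largest bin edge that is <= value.
        i = bisect_right(bins, value) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts, list(bins)


# Illustrative similarities and coarse bins.
similarities = [0.0, 0.25, 0.5, 0.75, 1.0]
counts, edges = entry_histogram(similarities, [0.0, 0.5, 1.0])
```

In this example 0.5 falls into the second bin, and 1.0 lies on the final edge and is therefore left out under the strict half-open reading.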
- get_row_by_mid(mid: str) -> List[float]
Get a row of the (already calculated) similarity matrix from the mid of a material.
Calling this function requires that the mid file was written by fingerprint_file_batches.
Arguments:
- mid: str
Material id of the reference entry.
Returns:
- similarities: List[float]
Similarities of all materials of this similarity matrix to the reference specified by mid
Raises:
FileNotFoundError: Mid file does not exist (or has a non-compatible name).
KeyError: No material of given mid in the mid file (and thus in the matrix).
- property matrices_for_this_task
Return the number of similarity matrices that are calculated for this task.
- property metadata: dict
Metadata for BatchedSimilarityMatrix.
- property metadata_filename: str
Name of the metadata file.
- property mid_filename: str
Name of the file where all material ids are stored.
- property most_similar_materials_filename: str
(Base) name of file(s) where all most similar materials are stored.
- property n_tasks: int
Total number of tasks used to calculate similarity matrix.
- save_metadata(overwrite: bool = True) -> None
Save current metadata to self.folder_name/self.metadata_filename.
Keyword arguments:
- overwrite: bool
Overwrite file if it exists.
default: True
- property size: int
Total number of fingerprints that are used.
- property symmetric: bool
Matrix is symmetric.
- property task_id: int
Task id of the current task.
- write_most_similar_materials_file(k: int = 10, remove_self: bool = True)
Write files containing all k most similar materials for each entry of the matrix.
Keyword arguments:
- k: int
Number of most similar materials to find.
default: 10
- remove_self: bool
Find most similar materials without self.
default: True