drawing 3d representations of molecules

Thoughts and Theory

Basic Molecular Representation for Machine Learning

From SMILES to Word Embedding and Graph Embedding

franky

Image by Author

Machine learning has been applied to many problems in cheminformatics and life science, for example, investigating molecular properties and developing new drugs. One critical step in the problem-solving pipeline for these applications is selecting a proper molecular representation that featurizes the target dataset and serves the downstream model. Figure 1 shows a conceptual framework for different molecular representations. Usually, a molecule is represented in a linear form as a SMILES string, or in a graph form as an adjacency matrix, possibly together with a node attribute matrix for atoms and an edge attribute matrix for bonds. A SMILES string can be further converted into different formats such as a molecular fingerprint, one-hot encoding, or word embedding. On the other hand, the graph form of molecular representation can be used directly by the downstream model or be converted into a graph embedding for the task.

Fig. 1. A conceptual framework of molecular representation

This post describes some code used in the implementation of the above conceptual framework, including:

  • reading, drawing, analyzing a molecule,
  • generating molecular fingerprint from a SMILES string,
  • generating one-hot encoding from a SMILES string,
  • generating word embedding from a SMILES string, and
  • generating molecular representation in graph.

Reading, Drawing, and Analyzing a Molecule

RDKit is an open-source library for cheminformatics. Figure 2 shows the code for reading the SMILES string of caffeine and drawing its molecular structure. Note that C is carbon, N is nitrogen, and O is oxygen in a SMILES string. A molecule can be displayed without labeling carbon atoms, as shown in Figure 3, or with carbon atoms labeled, as shown in Figure 4.

Fig. 2. Reading and drawing the molecule of caffeine. [ 3, 4 ]

Fig. 3. caffeine.png

Fig. 4. caffeine_with_prop.png
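The element letters in a SMILES string can be tallied directly from the text. The following is a minimal, dependency-free sketch and not the RDKit code from Figure 2. It assumes a bracket-free SMILES string restricted to the common organic subset, and the caffeine SMILES used here is a commonly quoted one that may differ from the string in the original figure. The names `ATOM_PATTERN` and `count_heavy_atoms` are made up for illustration.

```python
import re
from collections import Counter

# Matches atoms of the SMILES "organic subset" written without brackets.
# Two-letter symbols come first so that "Cl" is not read as carbon.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def count_heavy_atoms(smiles: str) -> Counter:
    """Count non-hydrogen atoms by element in a bracket-free SMILES string."""
    if "[" in smiles:
        raise ValueError("bracket atoms are not handled by this sketch")
    # Lowercase letters denote aromatic atoms; capitalize them to merge counts.
    return Counter(token.capitalize() for token in ATOM_PATTERN.findall(smiles))


# Assumption: a commonly quoted SMILES string for caffeine.
caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
counts = count_heavy_atoms(caffeine)
print(dict(counts))
```

The tally of 8 carbons, 4 nitrogens, and 2 oxygens agrees with the molecular formula of caffeine, C8H10N4O2. Hydrogens are implicit in SMILES and are not counted here.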

Figure 5 shows the code for displaying the atoms and bonds in the molecule of caffeine.

Fig. 5. Displaying the atoms and bonds in the molecule of caffeine. [ 5 ]

Figure 6 and Figure 7 show the details of atoms and bonds in the molecule of caffeine, respectively. Notes:

  1. As a simplification, the term "aromatic" can be read as "ring" in the following tables, although aromaticity is strictly a narrower property than ring membership. GetIsAromatic in Figure 6 indicates whether the atom is aromatic, and GetBondType in Figure 7 reports the bond type, which is AROMATIC for bonds in an aromatic ring.
  2. Figure 6 and Figure 7 can be regarded as a simple atom attribute matrix and a simple bond attribute matrix in the conceptual framework shown in Figure 1.
  3. The list of bonds in Figure 7 can represent the graph form of a molecule, i.e., the edge list corresponding to an adjacency matrix.
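The link between a bond list and an adjacency matrix can be sketched in a few lines of plain Python. The bond list below was derived by hand from a commonly quoted caffeine SMILES string, with atoms numbered in order of appearance. It is an assumption for illustration, not the output shown in Figure 7, and a toolkit such as RDKit may list the same bonds in a different order.

```python
# Bond list for caffeine, hand-derived from the SMILES string
# "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" with atoms numbered in order of appearance.
# Assumption: this is illustrative data, not copied from the original figure.
N_ATOMS = 14
BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (5, 6), (6, 7),
    (6, 8), (8, 9), (9, 10), (9, 11), (11, 4), (11, 12), (8, 13),
]


def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for begin, end in bonds:
        matrix[begin][end] = 1
        matrix[end][begin] = 1
    return matrix


adjacency = adjacency_matrix(N_ATOMS, BONDS)
# The row sums give the number of heavy-atom neighbors of each atom.
degrees = [sum(row) for row in adjacency]
print(degrees)
```

Each bond fills two symmetric cells, so the row sums add up to twice the number of bonds.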

Generating Molecular Fingerprint from a SMILES String

RDKit supports several fingerprint functions, whose outputs can be used for calculating molecular similarity or as inputs to downstream machine learning models. Figure 8 shows the code for retrieving the RDKit Fingerprint and the Morgan Fingerprint, and Figure 9 shows the results of these fingerprint functions.

Fig. 8. Retrieving RDKit Fingerprint and Morgan Fingerprint
Fig. 9. RDKit Fingerprint and Morgan Fingerprint.
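One common way to turn two bit-vector fingerprints into a similarity score is the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below uses plain Python sets of on-bit positions. The two fingerprints are invented values and do not come from any real fingerprint function, and `tanimoto` is a helper written for this example.

```python
def tanimoto(on_bits_a: set, on_bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit positions."""
    if not on_bits_a and not on_bits_b:
        return 1.0  # convention chosen in this sketch for two empty fingerprints
    shared = len(on_bits_a & on_bits_b)
    return shared / (len(on_bits_a) + len(on_bits_b) - shared)


# Hypothetical on-bit positions, not the output of a real fingerprint function.
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 42, 77, 100}
similarity = tanimoto(fp_a, fp_b)
print(similarity)  # 3 shared bits / 6 distinct bits = 0.5
```

A score of 1.0 means identical on-bits and 0.0 means no on-bits in common.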

Generating One-Hot Encoding from a SMILES String

Treating SMILES strings as text in a natural language, probably the simplest representation method is one-hot encoding at the character level. Figure 10 shows the code for generating a character-level one-hot encoding of a SMILES string.

Fig. 10. One-hot encoding at the character level of SMILES strings. [ 6 ]

Note that one-hot encoding can also be used at the atom level or in the atom/bond attribute matrix.
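Character-level one-hot encoding can be sketched without any library. The version below is a minimal illustration and not the code from Figure 10. The function names, the three example SMILES strings, and the zero-padding to a fixed `max_length` are all assumptions made for this example.

```python
def build_vocabulary(smiles_list):
    """Map every character seen in the SMILES strings to a column index."""
    characters = sorted(set("".join(smiles_list)))
    return {char: index for index, char in enumerate(characters)}


def one_hot_encode(smiles, vocabulary, max_length):
    """Return a max_length x len(vocabulary) matrix; unused rows stay all-zero."""
    if len(smiles) > max_length:
        raise ValueError("SMILES string is longer than max_length")
    matrix = [[0] * len(vocabulary) for _ in range(max_length)]
    for position, char in enumerate(smiles):
        matrix[position][vocabulary[char]] = 1
    return matrix


# Ethanol, formaldehyde, and methylamine as a tiny example corpus.
smiles_list = ["CCO", "C=O", "CN"]
vocabulary = build_vocabulary(smiles_list)
encoded = one_hot_encode("CCO", vocabulary, max_length=5)
for row in encoded:
    print(row)
```

Because the encoding works on single characters, a two-letter element such as Cl is split into two columns. Tokenizing at the atom level avoids this.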

Generating Word Embedding from a SMILES String

In the context of language modeling, a more sophisticated approach to generating a molecular representation is to apply word embedding to the substructures of a molecule. The code in Figure 11 shows the procedure of using mol2vec and word2vec to generate word embeddings for all the molecules in the HIV dataset. There are 41127 molecules in the dataset (Figure 12), and each molecule is encoded as a 300-dimensional vector (Figure 13). Note that the code is extracted from "Simple ML In Chemistry Research: RDkit & mol2vec", which explains the solution for predicting HIV activity in detail.

Fig. 11. Generating word embedding for the molecules in the HIV dataset
Fig. 12. The columns and rows of the HIV dataset.
Fig. 13. Mol2vec embeddings for the molecules in the HIV dataset
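The idea behind a mol2vec-style molecule vector can be illustrated with a toy example: each substructure is treated as a "word" with its own vector, and the molecule vector is obtained by summing the vectors of its substructures. The sketch below is not the code from Figure 11. The substructure identifiers, the 3-dimensional vectors, and the helper `molecule_vector` are all invented for illustration, whereas a real pipeline would take 300-dimensional vectors from a trained model.

```python
# Toy embedding table: substructure identifier -> vector.
# Identifiers and values are hypothetical; a trained model would supply them.
EMBEDDINGS = {
    "sub_a": [0.1, 0.2, 0.0],
    "sub_b": [0.0, -0.3, 0.5],
    "UNK": [0.0, 0.0, 0.0],  # fallback for substructures not seen in training
}


def molecule_vector(sentence, embeddings, unknown="UNK"):
    """Sum the vectors of a molecule's substructure 'words'."""
    total = [0.0] * len(embeddings[unknown])
    for word in sentence:
        vector = embeddings.get(word, embeddings[unknown])
        total = [running + value for running, value in zip(total, vector)]
    return total


# A molecule "sentence": the substructure identifiers found in one molecule.
sentence = ["sub_a", "sub_b", "sub_a", "sub_never_seen"]
vector = molecule_vector(sentence, EMBEDDINGS)
print(vector)
```

Summation keeps the vector length fixed regardless of how many substructures a molecule contains, which is what lets every molecule in a dataset share one input dimension.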

Generating Molecular Representation in Graph

The procedure of manipulating molecules, atoms, and bonds in RDKit provides the foundation for generating the graph form of molecular representation. Figure 5, Figure 6, and Figure 7 above have shown the adjacency matrix, the node attribute matrix, and the edge attribute matrix for caffeine. However, converting a molecule in RDKit into a graph in NetworkX (an open-source library for network analysis) makes it possible to leverage both traditional graph algorithms and modern graph models for investigating molecular structure and properties. Figure 14 shows the code for converting a molecule in RDKit into a graph in NetworkX. Figure 15 shows the molecular graphs drawn by RDKit and NetworkX.

Fig. 14. Converting a molecule in RDKit to a graph in NetworkX

Fig. 15. Molecular graphs by RDKit and NetworkX
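Once a molecule is held as a graph, standard graph algorithms apply to it directly. The sketch below is a dependency-free stand-in that uses plain dictionaries in place of a NetworkX graph, so it is not the conversion code from Figure 14. The atom symbols and bonds were derived by hand from a commonly quoted caffeine SMILES string, and the helpers `build_graph` and `bond_distances` are hypothetical names.

```python
from collections import deque

# Caffeine, hand-derived from "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" with atoms
# numbered in order of appearance (illustrative data, not toolkit output).
SYMBOLS = ["C", "N", "C", "N", "C", "C", "C", "O", "N", "C", "O", "N", "C", "C"]
BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (5, 6), (6, 7),
    (6, 8), (8, 9), (9, 10), (9, 11), (11, 4), (11, 12), (8, 13),
]


def build_graph(symbols, bonds):
    """Return (node attributes, adjacency lists) for an undirected graph."""
    nodes = {index: {"symbol": symbol} for index, symbol in enumerate(symbols)}
    neighbors = {index: [] for index in nodes}
    for begin, end in bonds:
        neighbors[begin].append(end)
        neighbors[end].append(begin)
    return nodes, neighbors


def bond_distances(neighbors, source):
    """Breadth-first search: bonds between source and each reachable atom."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in neighbors[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance


nodes, neighbors = build_graph(SYMBOLS, BONDS)
distances = bond_distances(neighbors, source=0)
# For a connected graph, the number of independent rings is bonds - atoms + 1.
ring_count = len(BONDS) - len(nodes) + 1
print(ring_count, max(distances.values()))
```

The ring count of 2 matches the two fused rings of caffeine. A graph library offers the same kind of queries, plus many more algorithms, on its own graph objects.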

One important research area in graph networks is graph embedding. Generally speaking, graph embedding covers three topics: node-level embedding (which encodes the nodes of a graph as vectors), edge-level embedding (which encodes the edges of a graph as vectors), and graph-level embedding (which encodes a whole graph as a vector). In this post, the term graph embedding means graph-level embedding, which finds a vector for a molecule that can be used as the input to downstream models. Figure 16 shows the code for converting molecules in RDKit to graphs in NetworkX and generating their graph embeddings via Graph2Vec in KarateClub. Graph2Vec is a graph embedding algorithm, and KarateClub is a package providing unsupervised machine learning models for graph data. Figure 17 shows the Graph2Vec embeddings for the molecules in the HIV dataset. KarateClub also covers several other graph embedding algorithms.

Fig. 16. Generating graph embedding for the molecules in the HIV dataset
Fig. 17. Graph2Vec embedding for the molecules in the HIV dataset
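Graph2Vec describes each graph by the Weisfeiler-Lehman (WL) subtree patterns found around its nodes and then learns one vector per graph from those patterns, much as a document embedding is learned from words. The sketch below shows only the feature-extraction step on two tiny hand-written molecular graphs. It is a simplified illustration under these assumptions, not KarateClub's implementation and not the code from Figure 16, and `wl_features` is a hypothetical helper.

```python
from collections import Counter


def wl_features(labels, neighbors, iterations):
    """Collect Weisfeiler-Lehman subtree labels for every node and depth."""
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        updated = {}
        for node, label in current.items():
            # A node's new label combines its own label with the sorted
            # labels of its neighbors from the previous round.
            neighborhood = sorted(current[n] for n in neighbors[node])
            updated[node] = f"{label}({','.join(neighborhood)})"
        current = updated
        features.update(current.values())
    return features


# Ethanol (CCO) and dimethyl ether (COC): same atoms, different connectivity.
ethanol = ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]})
ether = ({0: "C", 1: "O", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]})

ethanol_features = wl_features(*ethanol, iterations=1)
ether_features = wl_features(*ether, iterations=1)
print(sorted(ethanol_features.items()))
print(sorted(ether_features.items()))
```

The two molecules share the same atom labels at depth 0 but differ after one round of relabeling, which is the kind of structural signal a graph-level embedding can pick up.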

Conclusions

This post has described several molecular representations, including the string-based format, the graph-based format, and some variants such as word embedding and graph embedding. These molecular representations, together with different machine learning algorithms including deep learning models and graph neural networks, can serve as a baseline for approaching molecular machine learning problems.

Thanks for reading. If you have any comments, please feel free to drop me a note.


Source: https://towardsdatascience.com/basic-molecular-representation-for-machine-learning-b6be52e9ff76
