A Tutorial on SMILES and Canonical SMILES, Explained with Examples

My Background and Why I Wrote This Tutorial

Simplified Molecular Input Line Entry Specification — SMILES

Fig. 1: On the left side, a model compound is shown as an example of nested branching in SMILES. On the right, its IUPAC name and SMILES string are shown. The SMILES string features parentheses, which denote branching and nested branching.
Fig. 2: Cyclohexane as an example of a cyclic structure. First, the explicit hydrogens are replaced by implicit ones, and the ring is linearized by conceptually breaking the bond marked by the dashed line. The carbons connected by the dashed line are labelled with ‘1’s, and the resulting SMILES string is shown below the right-hand structure.[1]
Fig. 3: Cubane as an example of a structure with multiple cycles. On the left, the structure is shown without explicit hydrogen atoms. In the middle, the bonds that are artificially broken to linearize the molecule for the SMILES string are shown as dashed lines. On the far right, the skeletal structure mirrors the SMILES string, which is written below the molecular representations.[1]
Fig. 4: Different instances of aromatic nitrogen. Notice that the SMILES string of pyrrole contains an explicit hydrogen atom that follows the aromatic nitrogen. An aromatic atom is denoted by writing its symbol in lowercase.[1]
Fig. 5: Example of double-bond configuration in SMILES notation. On the left, (Z)-1,2-difluoroethene is shown, and on the right, (E)-1,2-difluoroethene, each with its SMILES notations. Both notations given for each structure are valid.[3]
Fig. 6: Example of enantiomeric SMILES strings.[3] Both mirror-image molecules are depicted in the same way. Above each depiction is the name of the compound, followed by a schematic. The eye indicates the point of view along the C-C bond. The resulting view of the structure is shown on the right of each sub-figure. Written below are two equally valid SMILES strings for each structure.
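The branching and ring-closure rules illustrated in Figs. 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not a full SMILES parser: the function name `parse_smiles` is hypothetical, and the sketch assumes single-letter atom symbols, single-digit ring labels, and no brackets, charges or stereo markers. Bonds between aromatic (lowercase) atoms are simply recorded with order 1. The cubane string used below is one possible way of writing the molecule and is not necessarily the one shown in Fig. 3.

```python
def parse_smiles(smiles):
    """Turn a small SMILES subset into (atoms, bonds).

    atoms: list of atom symbols (lowercase = aromatic)
    bonds: list of (i, j, order) tuples
    """
    orders = {"-": 1, "=": 2, "#": 3}
    atoms, bonds = [], []
    prev = None          # atom the next atom will be bonded to
    pending = None       # bond symbol waiting for its second atom
    branch_stack = []    # atoms to return to when a branch closes
    open_rings = {}      # ring digit -> (atom index, bond symbol)

    for ch in smiles:
        if ch == "(":                       # branch starts at the current atom
            branch_stack.append(prev)
        elif ch == ")":                     # branch ends, go back to its root
            prev = branch_stack.pop()
        elif ch in orders:
            pending = ch
        elif ch.isdigit():
            if ch in open_rings:            # second occurrence closes the ring
                start, symbol = open_rings.pop(ch)
                bonds.append((start, prev, orders[pending or symbol or "-"]))
            else:                           # first occurrence opens the ring
                open_rings[ch] = (prev, pending)
            pending = None
        elif ch.isalpha():
            atoms.append(ch)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, orders[pending or "-"]))
            pending = None
            prev = index
        else:
            raise ValueError(f"unsupported character: {ch!r}")

    if open_rings or branch_stack:
        raise ValueError("unclosed ring label or branch")
    return atoms, bonds


# Cyclohexane: the two '1' labels mark the conceptually broken bond.
atoms, bonds = parse_smiles("C1CCCCC1")
print(len(atoms), len(bonds))   # 6 6

# Nested branching: a branch inside a branch.
print(parse_smiles("CC(C(C)C)C")[1])

# One possible cubane string: five ring labels, two of them on a single atom.
atoms, bonds = parse_smiles("C12C3C4C1C5C2C3C45")
print(len(atoms), len(bonds))   # 8 12
```

A stack is a natural fit for the parentheses because a closing parenthesis always returns to the atom at which the most recently opened branch started, while the ring labels need a dictionary because rings may close in a different order than they were opened.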

Canonical SMILES — The CANGEN Algorithm

Fig. 7: Canonical labelling of 2-(acetyloxy)benzoic acid. The rows correspond to consecutive iterations of the CANON algorithm. The blackboard bold P denotes assigning the prime that corresponds to each rank. The Greek letter Π denotes taking the product of the primes of all neighbouring atoms, and the hash sign (#) denotes atom ranking. Bold numbers denote ranks that have reached invariance.[2]
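The three steps named in the caption of Fig. 7 (assign primes, multiply the primes of the neighbours, re-rank) can be sketched as a short refinement loop. This is a simplified illustration under stated assumptions, not a complete CANON implementation: the function names are hypothetical, the initial invariant is reduced to the tuple (number of connections, atomic number, attached hydrogens), and the tie-breaking step needed for symmetric molecules is left out, so symmetry-equivalent atoms keep the same rank.

```python
from math import prod


def first_primes(n):
    """Return the first n primes by trial division (enough for small molecules)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def dense_ranks(keys):
    """Map sortable keys to ranks 1, 2, 3, ...; equal keys share a rank."""
    order = {key: rank for rank, key in enumerate(sorted(set(keys)), start=1)}
    return [order[key] for key in keys]


def refine_ranks(invariants, neighbours):
    """Iterate prime assignment, neighbour products and re-ranking until stable.

    invariants: one sortable tuple per atom (assumed, simplified invariant)
    neighbours: one list of neighbour indices per atom
    """
    ranks = dense_ranks(invariants)                    # '#': initial ranking
    primes = first_primes(len(ranks))
    while True:
        # 'P' and 'Π': look up each neighbour's prime and multiply them.
        products = [
            prod(primes[ranks[j] - 1] for j in neighbours[i])
            for i in range(len(ranks))
        ]
        # '#': re-rank, using the old rank first and the product to split ties.
        new_ranks = dense_ranks(list(zip(ranks, products)))
        if new_ranks == ranks:                         # ranks are invariant
            return ranks
        ranks = new_ranks


# n-Pentane, heavy atoms only: the two ends and the two inner carbons are
# symmetry-equivalent, so they share ranks.
pentane_invariants = [(1, 6, 3), (2, 6, 2), (2, 6, 2), (2, 6, 2), (1, 6, 3)]
pentane_neighbours = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(refine_ranks(pentane_invariants, pentane_neighbours))   # [1, 2, 3, 2, 1]

# 2-(Acetyloxy)benzoic acid, heavy atoms numbered as in CC(=O)Oc1ccccc1C(=O)O.
aspirin_invariants = [
    (1, 6, 3), (3, 6, 0), (1, 8, 0), (2, 8, 0), (3, 6, 0), (2, 6, 1), (2, 6, 1),
    (2, 6, 1), (2, 6, 1), (3, 6, 0), (3, 6, 0), (1, 8, 0), (1, 8, 1),
]
aspirin_neighbours = [
    [1], [0, 2, 3], [1], [1, 4], [3, 5, 9], [4, 6], [5, 7],
    [6, 8], [7, 9], [8, 4, 10], [9, 11, 12], [10], [10],
]
print(refine_ranks(aspirin_invariants, aspirin_neighbours))
```

Using products of primes rather than sums means that two atoms receive the same value only if their neighbours carry the same multiset of ranks, which is what lets each iteration split previously tied atoms.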

Aftermath and Sources

Scientist interested in all sorts of futuristic technology, such as space, fusion and AI. I specialized in computational biochemistry and, to a lesser extent, experimental biochemistry.