SketchEl: Molecule Format
SketchEl preferentially uses its own data format to describe molecular sketches. The native format is structured plain ASCII text, consisting of some number of atoms and bonds. The format is minimalistic, rigid, and slightly extensible.
The SketchEl molecule format is also used natively by the Mobile Molecular DataSheet for BlackBerry smartphones.
The philosophy underlying the SketchEl molecule format can be summarised as: the fewest possible primitives needed to capture a molecule drawing, or drawing in progress, with only features which have cheminformatic semantics, preferably orthogonal. When it comes down to it, there are very few data features which fit these criteria. Molecular diagrams are frequently adorned with additional labels which are not directly connected to the molecular formula of the species being described, e.g. registration codes, enantiopurity, reaction stoichiometry, molecular orbital lobes, etc. Many formats have fields for information which, in principle, is derivable from more basic properties of the sketch, e.g. aromaticity, partial bond order, chiral parity, etc. Furthermore, many formats have a variety of fields which are specific to particular types of chemistry or downstream purposes, e.g. reactivity, query features, mixture data, etc.
The SketchEl native format has no place for all of this fancy additional markup. What it does have is everything that is needed to perform a basic interpretation of the cheminformatic properties of the molecular species being described. Anything that cannot be expressed with the available primitives must be provided as data which is supplementary to the molecular sketch - which is not as onerous as it sounds.
Escape codes (a backslash followed by four hexadecimal digits):
\0020   space
\005C   \
\002C   ,
\003B   ;
\000A   newline
A SketchEl character stream is described by the following pattern:
SketchEl!({#atoms},{#bonds})
{element}={x},{y};{charge},{unpaired}[,i{implicit}][,e{explicit}][,n{mapnum}][,...]
...
{from}-{to}={order},{type}[,...]
...
!End
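The pattern above can be read with very little code. The following is a minimal sketch in Python, not SketchEl's own reader. The function names `unescape` and `parse_sketchel` and the dictionary layout are hypothetical. It assumes that blocks are separated by whitespace (raw spaces and newlines inside a block being escaped), and it treats any optional field other than `i`, `e` and `n` followed by digits as an opaque extension string.

```python
import re

def unescape(text):
    # A backslash followed by four hex digits stands for one character.
    return re.sub(r"\\([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

def parse_sketchel(stream):
    """Parse a SketchEl character stream into (atoms, bonds) lists of dicts."""
    tokens = stream.split()  # assumption: blocks never contain raw whitespace
    header = re.fullmatch(r"SketchEl!\((\d+),(\d+)\)", tokens[0]) if tokens else None
    if header is None or tokens[-1] != "!End":
        raise ValueError("missing SketchEl header or !End terminator")
    num_atoms, num_bonds = int(header.group(1)), int(header.group(2))
    blocks = tokens[1:-1]
    if len(blocks) != num_atoms + num_bonds:
        raise ValueError("block count does not match header")

    atoms = []
    for block in blocks[:num_atoms]:
        label, rest = block.split("=", 1)
        position, properties = rest.split(";", 1)
        x, y = position.split(",")
        fields = properties.split(",")
        atom = {"element": unescape(label), "x": float(x), "y": float(y),
                "charge": int(fields[0]), "unpaired": int(fields[1]),
                "extra": []}
        for field in fields[2:]:
            key = {"i": "implicit", "e": "explicit", "n": "mapnum"}.get(field[:1])
            if key and field[1:].isdigit():
                atom[key] = int(field[1:])
            else:
                atom["extra"].append(unescape(field))  # extension field
        atoms.append(atom)

    bonds = []
    for block in blocks[num_atoms:]:
        ends, properties = block.split("=", 1)
        source, target = ends.split("-")
        fields = properties.split(",")
        bonds.append({"from": int(source), "to": int(target),
                      "order": int(fields[0]), "type": int(fields[1]),
                      "extra": [unescape(f) for f in fields[2:]]})
    return atoms, bonds
```

Because the header states the atom and bond counts up front, the sketch can check that the stream is complete before interpreting any block.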
A very simple example - ethanol with implicit hydrogens - is as follows:
SketchEl!(3,2)
C=-6.9500,6.5500;0,0,i3
C=-5.6510,7.3000;0,0,i2
O=-4.3519,6.5500;0,0,i1
1-2=1,0
2-3=1,0
!End
A second example, based on the same heavy-atom ethanol structure, shows how not to make a cheminformatically meaningful structure, but nonetheless demonstrates some features of the format:
SketchEl!(3,2)
C=-6.4000,2.3500;1,0,i2,xPERM1,yTEMP1
C=-5.1010,3.1000;0,0,e2,xPERM2,yTEMP2
\004F=-3.8019,2.3500;0,1,i0,xPERM3,yTEMP3
2-1=1,1,xPERM12,yTEMP12
2-3=1,2,xPERM23,yTEMP23
!End
All three atoms and both bonds carry an invariant expansion string and a dependent expansion string (the x-prefixed PERM and y-prefixed TEMP fields, respectively). The dependent strings will all be removed if the slightest modification is made to the structure. The third atom, oxygen, is written as an escape code (\004F, hex 4F), which is valid even when the raw character is allowed. The first atom has a charge of +1 and an automatically calculated hydrogen count, which came to 2. The second atom has its hydrogen count fixed at 2, which coincidentally is the same number that would be calculated automatically for a methylene. The third atom is a radical oxygen, with 1 unpaired electron and an automatic hydrogen count which happens to be 0.
Both bonds are ordered so that atom 2 is the source point, and the bond types are inclined and declined, respectively. These have no stereochemical meaning here, since the central carbon atom carries two identical hydrogen substituents.
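The removal of dependent expansion strings on modification could be sketched as follows. This is an illustration rather than SketchEl's own code. The function name `strip_dependent` is hypothetical, and it assumes, from the PERM/TEMP naming in the example, that the y-prefixed fields are the dependent ones and the x-prefixed fields are the invariant ones.

```python
def strip_dependent(block):
    """Return one atom or bond block with its y-prefixed fields removed."""
    if ";" in block:
        # Atom block: charge and unpaired follow the ';'.
        head, tail = block.split(";", 1)
        separator = ";"
    else:
        # Bond block: order and type follow the '='.
        head, tail = block.split("=", 1)
        separator = "="
    fields = tail.split(",")
    fixed = fields[:2]  # the two positional fields are always kept
    optional = [f for f in fields[2:] if not f.startswith("y")]
    return head + separator + ",".join(fixed + optional)
```

Applied to the first atom and the first bond of the example, this keeps `xPERM1` and `xPERM12` while dropping `yTEMP1` and `yTEMP12`.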
Bond orders: Nonzero bond orders are considered to carry significant meaning, and provide a fairly strong hint as to the pi localisation of electrons within the molecular structure, and resonance patterns thereof. Most organic species can be represented by using bond orders 1 through 3. In the vastness of chemistry, however, most possible structures have some number of bonds for which this assignment is misleading, and so a bond order of 0 should be used to denote a bonding interaction of indeterminate degree. Dative bonds, strong hydrogen bonds and multicentre bonds all qualify. This is a particularly useful interpretation hint for metal-organic structures, where part of the molecule has conventional bonding patterns (usually the organic part), while the interface (typically to the metal) would invalidate normal valence rules.
[Figure: fictional platinum complex drawn with zero-order (dotted) bonds]
Consider the fictional platinum complex above. If all of the connections were represented using double or single bonds, an interpretation of the bonding patterns would have to conclude that not all of the atoms have valid Lewis valences, and that none of the bond orders have useful meaning. With selected use of zero-order (dotted) bonds, however, considerable information about the structure falls into place quite easily.
The dimethylamine ligand has a coordination bond to the platinum metal, which means that it is reasonable to guess that the number of hydrogen atoms on the nitrogen atom is 1, which is correct. If the coordination bond were drawn as a single line, the ligand would be considered to be an anionic ligand, with no hydrogen atom. On the other side, the pyridine-platinum connection is also described as a zero order bond, which preserves the valence of the aromatic system. The pyridine ring can readily be interpreted as a relatively normal 6-ring aromatic system, rather than a hypervalent non-octet species. The platinum centre has four ligands, two of which are zero order, and two of which are single bonds. Combined with no overall charge, this suggests that the oxidation state is that of Pt(II), which is the commonly accepted designation. On the other side of the pyridine substituent, a hydroxy substituent is shown in a chelated H-bond arrangement with the adjacent ketone. Use of a zero-order bond allows this to be expressed, without drawing a divalent hydrogen atom, and thereby upsetting the otherwise neat and tidy valences.
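The effect of a zero-order bond on the hydrogen guess can be shown with a deliberately simple valence rule. The format does not prescribe any particular calculation, so the rule, the `DEFAULT_VALENCE` table and the function name `guess_hydrogens` below are assumptions made purely for illustration.

```python
# Illustrative default valences for a few neutral first-row atoms (assumption).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}

def guess_hydrogens(element, bond_orders):
    """Guess implicit hydrogens as default valence minus the sum of bond orders."""
    used = sum(bond_orders)  # a zero-order bond contributes nothing
    return max(0, DEFAULT_VALENCE[element] - used)

# Dimethylamine nitrogen: two N-C single bonds plus the bond to platinum.
with_zero_order = guess_hydrogens("N", [1, 1, 0])  # coordination bond, order 0
with_single = guess_hydrogens("N", [1, 1, 1])      # same bond drawn as a single line
```

Under this rule the nitrogen keeps one hydrogen when the bond to platinum is zero-order, and none when it is drawn as a single bond, which matches the neutral amine versus anionic ligand readings described above.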
Stereochemistry: All stereochemical features are represented by the atom position and bond style. There are no additional fields for chirality. Parity-style assignments, such as the CIP R/S or E/Z systems, must be calculated as needed. Part of the reason for this is that all sketches are considered to be potentially a work-in-progress. Assigning a definitive parity to an atom or bond loses meaning as the molecule is modified, permuted, rotated etc. For sketching purposes, it is far more practical to recalculate these properties from the sketch, and never encode them as fixed values.
For stereochemistry which is unknown, or mixed, the "unknown" bond type can be used. For chiral centres, the mere absence of inclined or declined "wedge bonds" is sufficient. The format draws no distinction between unresolved stereochemistry and mixtures. There is no inherent capability for describing multiple species within a single structure, whether it be stereochemistry, tautomers, isomers, or any other kind of 'mer. This is a deliberate design decision. Systems which try to encode a large amount of chemistry within a single sketch are inevitably either too complex, or too limiting. Mixtures of distinctly different molecular species should be drawn as individual sketches, which has the benefit of being specific, simple and foolproof, at the expense of convenience.
Implicit hydrogen atoms: One of the most problematic side effects of the most popular molecular sketch formats is the inability to reliably reconcile the drawing with the corresponding molecular formula, due to over-reliance on automatic calculation of the number of implicit hydrogens attached to the heavy atoms which are explicitly drawn as part of the sketch.
For most organic compounds, the number of implied hydrogens is quite simple to calculate, since most of the constituents are first row p-block atoms, and the valences are mostly Lewis-octet based, and it is usually obvious when this is not the case. However, in the absence of a way to mark an atom as not being eligible for automatic implicit hydrogens, problem cases quickly rack up when leaving this comfort zone.
For a good illustration, consider the following two sketches:
[Figure: two tin compounds, dimethyl tin (left) and dichloro tin (right)]
Both of these compounds show a tin atom connected by single bonds to two substituents. The dimethyl tin compound on the left could be dimethyl tin(II), or it could be dimethyl tin(IV) dihydride. Since the former is an extremely reactive intermediate, and the latter is commercially available, it would be quite reasonable to guess that two implicit hydrogen atoms should be added to top up the valence to make a tin(IV) compound. The dichloro tin compound on the right, however, is more likely to be tin(II) chloride, i.e. adding two implicit hydrogens would most likely be a mistake. However, no matter how good the algorithm is for picking the most likely state, whether to add or not to add is still just a guess, and the author of the sketch could have intended to represent the other possibility.
This is clearly unacceptable, since authors of sketches generally know how many hydrogens they want on each of their atoms. Being unable to correctly recalculate the molecular formula of the input structure is a case of unnecessary information loss.
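The SketchEl molecule format does not prescribe a method for calculating implicit hydrogens, although the formula that it uses internally is conservative. Rather, each atom is in one of two states: automatic, in which the hydrogen count is calculated and recorded as i{implicit}; or explicit, in which the count is fixed by the author and recorded as e{explicit}.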
In practice, automatic hydrogen calculation is very useful when sketching molecules, since most atoms are either part of a Lewis-compliant fragment, or are atoms for which implicit hydrogens are never automatically added. Explicit hydrogen atom specification is most often useful for specifying that a particular atom will always have 0 implicit hydrogens. If there are any hydrogens to be attached, they will be drawn in as actual atoms.
One of the functional requirements of the format is that the list of atoms plus the number of implicit hydrogens recorded in the atom blocks of the format must make up the entire molecular formula. Some or all of the hydrogen atom counts will usually be calculated automatically, but it is the author's task to ensure that this is overridden manually whenever this would have led to the wrong answer.
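Since every atom block records its hydrogen count, the molecular formula follows from a simple tally. The sketch below assumes the atoms have already been reduced to (element, hydrogen count) pairs. The function name `molecular_formula` is hypothetical, and Hill ordering (carbon, then hydrogen, then the remaining elements alphabetically) is a presentation choice made here, not something the format requires.

```python
from collections import Counter

def molecular_formula(atoms):
    """Build a formula from (element, hydrogens) pairs, one pair per drawn atom."""
    counts = Counter()
    for element, hydrogens in atoms:
        counts[element] += 1
        counts["H"] += hydrogens  # implicit hydrogens recorded on this atom
    present = {el: n for el, n in counts.items() if n > 0}
    if "C" in present:
        rest = sorted(el for el in present if el not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in present else []) + rest
    else:
        order = sorted(present)
    return "".join(el + (str(present[el]) if present[el] > 1 else "")
                   for el in order)

ethanol = molecular_formula([("C", 3), ("C", 2), ("O", 1)])
tin_ii = molecular_formula([("C", 3), ("Sn", 0), ("C", 3)])  # tin fixed at 0 hydrogens
tin_iv = molecular_formula([("C", 3), ("Sn", 2), ("C", 3)])  # tin fixed at 2 hydrogens
```

The two tin results differ only in the hydrogen count recorded on the tin atom, which is exactly the distinction that a purely automatic calculation cannot preserve.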