askill
datamol-cheminformatics

datamol-cheminformaticsSafety 100Repository

Pythonic wrapper around RDKit with simplified interface and sensible defaults for drug discovery cheminformatics. Use for SMILES parsing, molecular standardization, descriptor computation, fingerprints, similarity search, clustering, diversity selection, scaffold analysis, BRICS/RECAP fragmentation, 3D conformer generation, and molecular visualization. Returns native rdkit.Chem.Mol objects. Prefer datamol over raw RDKit for standard workflows; use RDKit directly for advanced control or custom parameters.

5 stars
1.2k downloads
Updated 2/20/2026

Package Files

Loading files...
SKILL.md

Datamol Cheminformatics Toolkit

Overview

Datamol provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies common drug discovery operations — SMILES parsing, standardization, descriptors, fingerprints, clustering, scaffolds, conformers, and visualization — with sensible defaults, built-in parallelization, and cloud storage support via fsspec. All molecular objects are native rdkit.Chem.Mol instances, ensuring full RDKit compatibility.

When to Use

  • Parsing, validating, and standardizing molecular structures from SMILES, SDF, or other formats
  • Computing molecular descriptors and fingerprints for ML featurization
  • Similarity searching and diversity selection from compound libraries
  • Clustering compounds by structural similarity (Butina clustering)
  • Scaffold analysis and scaffold-based train/test splitting for ML
  • BRICS/RECAP molecular fragmentation for fragment-based design
  • 3D conformer generation and analysis
  • Visualizing molecules as grids with alignment and highlighting
  • Batch processing molecular datasets with parallelization
  • For quick gene lookups use gget instead; for advanced substructure queries or custom fingerprints, use RDKit directly

Prerequisites

uv pip install datamol
import datamol as dm
import numpy as np
import pandas as pd

Quick Start

import datamol as dm

# Parse and standardize
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
mol = dm.standardize_mol(mol)
print(dm.to_smiles(mol))  # Canonical SMILES

# Compute descriptors
desc = dm.descriptors.compute_many_descriptors(mol)
print(f"MW: {desc['mw']:.1f}, LogP: {desc['logp']:.2f}, TPSA: {desc['tpsa']:.1f}")

# Generate fingerprint
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
print(f"Fingerprint shape: {fp.shape}")  # (2048,)

Core API

1. Molecular I/O & Standardization

Parsing molecules:

import datamol as dm

# From SMILES (returns None on failure)
mol = dm.to_mol("CCO")
if mol is None:
    print("Invalid SMILES")

# Format conversions
smiles = dm.to_smiles(mol, isomeric=True)  # Canonical SMILES
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)

Standardization (always recommended for external data):

mol = dm.standardize_mol(
    mol,
    disconnect_metals=True,
    normalize=True,
    reionize=True
)
clean_smiles = dm.standardize_smiles("C(C)O")  # From SMILES directly

File I/O:

# Reading (supports local, S3, GCS, HTTP via fsspec)
df = dm.read_sdf("compounds.sdf", mol_column='mol')
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
df = dm.open_df("file.sdf")  # Auto-detect format

# Writing
dm.to_sdf(df, "output.sdf", mol_column="mol")
dm.to_smi(mols, "output.smi")
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])  # Renders molecule images

# Remote files
df = dm.read_sdf("s3://bucket/compounds.sdf")
dm.to_sdf(mols, "s3://bucket/output.sdf")

2. Descriptors & Properties

import datamol as dm

mol = dm.to_mol("c1ccc(cc1)CCN")

# Standard descriptor set (single molecule)
desc = dm.descriptors.compute_many_descriptors(mol)
# Returns dict: {'mw': 121.18, 'logp': 1.41, 'hbd': 1, 'hba': 1,
#                'tpsa': 26.02, 'n_aromatic_atoms': 6, ...}

# Batch computation (parallel)
mols = [dm.to_mol(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
desc_df = dm.descriptors.batch_compute_many_descriptors(
    mols, n_jobs=-1, progress=True
)
print(desc_df.head())

# Specific descriptors
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
n_rigid = dm.descriptors.n_rigid_bonds(mol)

Drug-likeness filtering (Lipinski Rule of Five):

def is_druglike(mol):
    desc = dm.descriptors.compute_many_descriptors(mol)
    return (desc['mw'] <= 500 and desc['logp'] <= 5
            and desc['hbd'] <= 5 and desc['hba'] <= 10)

druglike = [m for m in mols if is_druglike(m)]
print(f"Drug-like: {len(druglike)}/{len(mols)}")

3. Fingerprints & Similarity

import datamol as dm

mol = dm.to_mol("c1ccc(cc1)CCN")

# Fingerprint types
fp_ecfp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)  # Morgan/ECFP
fp_maccs = dm.to_fp(mol, fp_type='maccs')    # MACCS keys (167 bits)
fp_topo = dm.to_fp(mol, fp_type='topological')  # Topological
fp_ap = dm.to_fp(mol, fp_type='atompair')    # Atom pairs

# Pairwise distances (Tanimoto distance = 1 - similarity)
mols = [dm.to_mol(s) for s in ["CCO", "CCCO", "c1ccccc1"]]
dist_matrix = dm.pdist(mols, n_jobs=-1)
print(f"Distance vector shape: {dist_matrix.shape}")

# Distances between two sets
query = [dm.to_mol("CCO")]
library = [dm.to_mol(s) for s in ["CCCO", "c1ccccc1", "CC(=O)O"]]
distances = dm.cdist(query, library, n_jobs=-1)
print(f"Query-library distances: {distances.shape}")

4. Clustering & Diversity Selection

import datamol as dm

mols = [dm.to_mol(s) for s in smiles_list]  # Assume smiles_list defined

# Butina clustering (suitable for ~1000 molecules, builds full distance matrix)
clusters = dm.cluster_mols(mols, cutoff=0.2, n_jobs=-1)
for i, cluster in enumerate(clusters[:5]):
    print(f"Cluster {i}: {len(cluster)} molecules")

# Diversity selection (works for larger libraries)
diverse_mols = dm.pick_diverse(mols, npick=100)
print(f"Selected {len(diverse_mols)} diverse molecules")

# Cluster centroids
centroids = dm.pick_centroids(mols, npick=50)
print(f"Selected {len(centroids)} centroids")

5. Scaffolds & Fragments

Murcko scaffold extraction:

import datamol as dm
from collections import Counter

mol = dm.to_mol("c1ccc(cc1)CCN")
scaffold = dm.to_scaffold_murcko(mol)
print(f"Scaffold: {dm.to_smiles(scaffold)}")

# Scaffold frequency analysis
scaffolds = [dm.to_scaffold_murcko(m) for m in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
counts = Counter(scaffold_smiles)
print(f"Top scaffolds: {counts.most_common(5)}")

# Scaffold-based train/test split (for ML)
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
    scaffold_to_mols.setdefault(scaf, []).append(mol)
scaffolds_list = list(scaffold_to_mols.keys())
split_idx = int(0.8 * len(scaffolds_list))
train_mols = [m for s in scaffolds_list[:split_idx] for m in scaffold_to_mols[s]]
test_mols = [m for s in scaffolds_list[split_idx:] for m in scaffold_to_mols[s]]

Fragmentation:

mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin

# BRICS (16 bond types, retrosynthetic)
brics_frags = dm.fragment.brics(mol)
print(f"BRICS fragments: {brics_frags}")  # Set of fragment SMILES with [1*] attachment points

# RECAP (11 bond types, combinatorial)
recap_frags = dm.fragment.recap(mol)

# MMPA (matched molecular pair analysis)
mmpa_frags = dm.fragment.mmpa_frag(mol)

6. 3D Conformers

import datamol as dm

mol = dm.to_mol("c1ccc(cc1)CCN")

# Generate conformers
mol_3d = dm.conformers.generate(
    mol,
    n_confs=50,           # Number to generate
    rms_cutoff=0.5,       # Filter similar (Angstroms)
    minimize_energy=True,  # UFF minimization
    method='ETKDGv3'      # Embedding method
)
print(f"Generated {mol_3d.GetNumConformers()} conformers")

# Access coordinates
conf = mol_3d.GetConformer(0)
positions = conf.GetPositions()  # Nx3 array
print(f"Atom positions shape: {positions.shape}")

# Cluster conformers by RMSD
clusters = dm.conformers.cluster(mol_3d, rms_cutoff=1.0)
centroids = dm.conformers.return_centroids(mol_3d, clusters)

# Solvent accessible surface area
sasa = dm.conformers.sasa(mol_3d, n_jobs=-1)
print(f"SASA values: {sasa[:3]}")

Key Concepts

Datamol vs RDKit Decision Guide

Use Datamol when...Use RDKit directly when...
Standard SMILES ↔ Mol conversionsCustom fingerprint definitions
Batch processing with parallelizationLow-level atom/bond manipulation
Quick descriptor computationSubstructure query optimization
File I/O (SDF, CSV, Excel, cloud)Reaction enumeration (large-scale)
Clustering & diversity selectionCustom force field parameters
Scaffold analysisAdvanced stereochemistry handling

Key Data Types

  • All molecules are native rdkit.Chem.Mol objects — fully compatible with RDKit functions
  • Fingerprints are numpy arrays (dense bit vectors)
  • DataFrames use pandas with a mol column containing Mol objects
  • Distance matrices use Tanimoto distance (0 = identical, 1 = completely different)

Parallelization

Functions supporting n_jobs parameter: dm.read_sdf, dm.descriptors.batch_compute_many_descriptors, dm.cluster_mols, dm.pdist, dm.cdist, dm.conformers.sasa. Use n_jobs=-1 for all cores, progress=True for progress bars.

Common Workflows

1. Drug Discovery Pipeline: Load → Filter → Cluster → Visualize

import datamol as dm

# 1. Load and standardize
df = dm.read_sdf("compounds.sdf")
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
df = df[df['mol'].notna()]
print(f"Loaded {len(df)} valid molecules")

# 2. Compute descriptors and filter by drug-likeness
desc_df = dm.descriptors.batch_compute_many_descriptors(
    df['mol'].tolist(), n_jobs=-1, progress=True
)
druglike = (desc_df['mw'] <= 500) & (desc_df['logp'] <= 5) & (desc_df['hbd'] <= 5) & (desc_df['hba'] <= 10)
filtered_df = df[druglike.values].reset_index(drop=True)
print(f"Drug-like compounds: {len(filtered_df)}")

# 3. Select diverse subset
diverse = dm.pick_diverse(filtered_df['mol'].tolist(), npick=100)

# 4. Visualize
dm.viz.to_image(diverse[:20], legends=[dm.to_smiles(m) for m in diverse[:20]],
                n_cols=5, mol_size=(300, 300), outfile="diverse_hits.png")

2. Virtual Screening: Query → Similarity → Rank

import datamol as dm
import numpy as np

# Query actives and screening library
actives = [dm.to_mol(s) for s in active_smiles]  # Known actives
library = [dm.to_mol(s) for s in library_smiles]  # Screening library

# Calculate distances (Tanimoto)
distances = dm.cdist(actives, library, n_jobs=-1)
min_distances = distances.min(axis=0)  # Best match to any active
similarities = 1 - min_distances

# Rank and select top hits
top_idx = np.argsort(similarities)[::-1][:100]
top_hits = [library[i] for i in top_idx]
top_scores = [similarities[i] for i in top_idx]
print(f"Top hit similarity: {top_scores[0]:.3f}")

# Visualize top hits
dm.viz.to_image(top_hits[:20],
    legends=[f"Sim: {s:.3f}" for s in top_scores[:20]],
    outfile="screening_hits.png")

3. SAR Analysis: Group by Scaffold → Compare Activities

import datamol as dm

# Group compounds by scaffold
scaffolds = [dm.to_scaffold_murcko(m) for m in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

sar_df = pd.DataFrame({
    'mol': mols, 'scaffold': scaffold_smiles, 'activity': activities
})

# Analyze each scaffold series
for scaffold, group in sar_df.groupby('scaffold'):
    if len(group) >= 3:
        print(f"Scaffold: {scaffold} | N={len(group)} | "
              f"Activity: {group['activity'].min():.2f}–{group['activity'].max():.2f}")
        dm.viz.to_image(group['mol'].tolist(), align=True,
            legends=[f"Act: {a:.2f}" for a in group['activity']])

Key Parameters

FunctionParameterDefaultDescription
dm.to_fpfp_type'ecfp'Fingerprint type: ecfp, maccs, topological, atompair
dm.to_fpradius2Morgan radius (ecfp only); radius=2 ≈ ECFP4
dm.to_fpn_bits2048Fingerprint length (ecfp, topological)
dm.cluster_molscutoff0.2Tanimoto distance threshold (0=identical, 1=different)
dm.pick_diversenpickrequiredNumber of diverse molecules to select
dm.conformers.generaten_confsNoneNumber of conformers (None = auto)
dm.conformers.generaterms_cutoffNoneRMSD filter threshold (Angstroms)
dm.conformers.generatemethod'ETKDGv3'Embedding: ETKDGv3, ETKDGv2, ETKDG
dm.standardize_moldisconnect_metalsFalseRemove metal-ligand bonds
dm.read_sdfsanitizeTrueApply molecule sanitization
dm.read_sdfremove_hsTrueRemove explicit hydrogens
dm.viz.to_imagealignFalseAlign molecules by MCS
dm.viz.to_imageuse_svgFalseOutput SVG (True) or PNG (False)

Best Practices

  1. Always standardize molecules from external sources — call dm.standardize_mol() with disconnect_metals=True, normalize=True, reionize=True before any analysis. Different SMILES representations of the same molecule will produce different fingerprints
  2. Check for None after parsingdm.to_mol() returns None for invalid SMILES. Filter these before batch operations to avoid crashes
  3. Use parallel processing for datasets — pass n_jobs=-1, progress=True to batch operations. Sequential processing of 10,000+ molecules is unnecessarily slow
  4. Choose fingerprints by use case — ECFP (Morgan): general structural similarity; MACCS: fast, smaller space; Atom pairs: distance-sensitive. ECFP with radius=2, n_bits=2048 is the most common default
  5. Mind clustering scale limits — Butina clustering (dm.cluster_mols) builds a full distance matrix. Use for ≤~1,000 molecules. For larger sets, use dm.pick_diverse() or hierarchical methods
  6. Use scaffold splitting for ML — random splits leak similar structures into train/test. Always use scaffold-based splitting for molecular property prediction models
  7. Leverage fsspec for cloud data — all I/O functions accept S3, GCS, and HTTP paths directly. Install s3fs or gcsfs for cloud support

Common Recipes

Recipe: Batch SMILES Validation and Standardization

When to use: Clean a list of SMILES strings before any downstream analysis.

import datamol as dm

smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "invalid_smiles", "CC(N)C(=O)O"]
mols = [dm.to_mol(s) for s in smiles_list]
valid = [(s, m) for s, m in zip(smiles_list, mols) if m is not None]
standardized = [(s, dm.standardize_mol(m)) for s, m in valid]
print(f"Valid: {len(valid)}/{len(smiles_list)}")
for orig, mol in standardized:
    print(f"  {orig} → {dm.to_smiles(mol)}")

Recipe: Pairwise Similarity Matrix

When to use: Compare a small compound set against each other or a reference library.

import datamol as dm
import numpy as np

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccc(cc1)C(=O)O", "CC(N)C(=O)O", "c1ccccc1"]
mols = [dm.to_mol(s) for s in smiles]
fps = [dm.to_fp(m) for m in mols]

# Pairwise Tanimoto similarity
n = len(fps)
sim_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sim_matrix[i, j] = dm.similarity.tanimoto(fps[i], fps[j])
print(f"Similarity matrix shape: {sim_matrix.shape}")
print(f"Most similar pair: {np.unravel_index(np.argsort(sim_matrix.ravel())[-3], (n, n))}")

Troubleshooting

ProblemCauseSolution
dm.to_mol() returns NoneInvalid or non-canonical SMILESTry dm.standardize_smiles() first; check for kekulization issues
MemoryError during clusteringFull distance matrix for large setUse dm.pick_diverse() instead of dm.cluster_mols for >1000 molecules
Slow conformer generationToo many conformers or large moleculeReduce n_confs, increase rms_cutoff, or limit molecule size
Remote file access failsMissing fsspec backendInstall s3fs (AWS), gcsfs (GCP), or adlfs (Azure)
Descriptor computation failsMolecule has no conformerStandardize first; some 3D descriptors need dm.conformers.generate()
dm.to_xlsx missing imagesopenpyxl not installeduv pip install openpyxl
Inconsistent fingerprintsDifferent SMILES for same moleculeStandardize all molecules before fingerprint computation
Scaffold extraction returns full moleculeNo ring system in moleculeMurcko scaffolds require at least one ring; acyclic molecules return themselves
Reaction product is NoneReactant doesn't match SMARTS patternVerify reactant matches reaction template; check atom mapping
Import error for dm.vizMissing visualization dependenciesuv pip install Pillow cairosvg

Related Skills

  • rdkit-cheminformatics — full RDKit API for advanced operations not covered by datamol's simplified interface
  • pubchem-compound-search — retrieve compound data by name, CID, or structure from PubChem
  • scikit-learn-machine-learning — ML model training using datamol-generated features
  • matplotlib-scientific-plotting — custom publication-quality molecular property plots

References

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

94/100Analyzed 2/23/2026

Comprehensive technical reference for Datamol cheminformatics library with excellent code examples, clear structure, and complete API coverage. Includes when-to-use guidance, decision tables, and complete workflows. Well-suited for drug discovery practitioners.

100
92
90
95
95

Metadata

Licenseunknown
Version-
Updated2/20/2026
Publisherjaechang-hits

Tags

apici-cdgithubtesting