31/10/2023
Cheminformatics algorithms
A few decades ago, drug design demanded massive amounts of time and many other, more tangible resources. In that respect, it was no different from many other processes in the wide world of chemical engineering. If you were looking for a substance with a specific set of characteristics, you were often in for a tiresome trial-and-error process that could last for years.
Nowadays, researchers have access to more powerful computers and complex software tools that allow them to do a significant part of their work purely in silico. And what fuels the whole revolution is the development and correct use of advanced algorithms.
In this post, I'll talk about how cheminformatics algorithms changed how we solve complex issues in chemistry and medicine.
Cheminformatics algorithms: the definition
Before we move on to what algorithms are used in cheminformatics and the process of drug design, we should define what exactly we mean by "cheminformatics algorithms."
Let's say you're using a screwdriver to disassemble your PC. Just because you're using that particular screwdriver doesn't mean it's a "PC disassembly screwdriver." Unless you got it in a package with your PC, it's safe to say that it wasn't created for that purpose, and it can indeed be used in many other ways.
The same goes for the vast majority of so-called cheminformatics algorithms. We primarily take existing solutions and put them to use in a medical or chemical context. For example, the ongoing trend in cheminformatics is to use machine learning and AI to predict the characteristics of molecules, such as their toxicity, or to identify candidates for potential medical use. Scientists and programmers work hand in hand to build solutions that combine the newest innovations from the world of technology with those from the world of science.
Types and examples of algorithms used in cheminformatics
We can classify the machine learning algorithms used in cheminformatics according to three criteria, the first of which depends on the type of available input data:
1. Human supervision during the learning process (supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning)
2. Real-time machine learning (online and batch learning)
3. Work methodology (instance-based, model-based)
However, the exact methods and solutions aren't necessarily exclusive to just one category. They can overlap with one another.
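To make the third category more concrete, here is a minimal sketch of instance-based learning: a k-nearest-neighbors classifier that keeps the training examples and compares new molecules to them by Tanimoto similarity. The fingerprints (sets of "on" bit indices) and the labels are made up for illustration, and the helper names are my own, not taken from any library.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def knn_predict(query, training_set, k=3):
    """Instance-based prediction: majority label among the k most similar stored examples."""
    ranked = sorted(training_set, key=lambda item: tanimoto(query, item[0]), reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return max(set(top_labels), key=top_labels.count)


# Hypothetical fingerprints with made-up activity labels.
training_set = [
    (frozenset({1, 2, 3, 4}), "active"),
    (frozenset({1, 2, 3, 5}), "active"),
    (frozenset({2, 3, 4, 6}), "active"),
    (frozenset({7, 8, 9}), "inactive"),
    (frozenset({7, 8, 10}), "inactive"),
]

prediction = knn_predict(frozenset({1, 2, 3}), training_set, k=3)
print(prediction)
```

A model-based method would instead compress the training data into a set of parameters and discard the examples; the k-NN sketch above needs the whole training set at prediction time.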
Human supervision during the learning process
In supervised learning, we deal with data that already have labels, so the target value is known. In this case, we use algorithms such as:
- Logistic regression
- Linear regression
- Classification
- Discriminant analysis
- Gaussian process regression
- Random forest
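As a rough illustration of supervised learning, the sketch below fits a logistic regression by plain gradient descent. The two descriptors and the "toxic" labels are invented for the example; in practice you would use real, properly scaled descriptors and a tested library rather than this hand-rolled loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one molecule."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(features, weights, bias) - label
            for j, f in enumerate(features):
                grad_w[j] += error * f
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# Made-up, pre-scaled descriptors: [lipophilicity, size]; label 1 means "toxic".
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.7, 0.8], [0.8, 0.7], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(X, y)
p_high = predict_proba([0.85, 0.8], weights, bias)
p_low = predict_proba([0.15, 0.2], weights, bias)
print(p_high > 0.5, p_low < 0.5)
```

The labels in `y` are what makes this supervised: the model is corrected against a known target on every pass.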
In unsupervised learning, the data have no labels and the correct results are unknown, so we use algorithms such as:
- Apriori algorithm
- Clustering
- Principal Component Analysis (PCA)
- Neural networks
- Deep learning
- Hierarchical clustering
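Here is a small sketch of the unsupervised case, using hierarchical (agglomerative) clustering. I chose single linkage and a Tanimoto distance because they are easy to write down; the fingerprints are hypothetical and the distance threshold is arbitrary.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_distance(c1, c2, fingerprints):
    """Single linkage: distance between the closest pair of members."""
    return min(tanimoto_distance(fingerprints[i], fingerprints[j]) for i in c1 for j in c2)


def single_linkage(fingerprints, threshold):
    """Merge the two closest clusters until no pair is within `threshold`."""
    clusters = [[i] for i in range(len(fingerprints))]
    while len(clusters) > 1:
        distance, a, b = min(
            (cluster_distance(clusters[a], clusters[b], fingerprints), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if distance > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return [sorted(c) for c in clusters]


# Hypothetical fingerprints; no labels are involved anywhere.
fingerprints = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 4, 5}),
    frozenset({7, 8, 9}),
    frozenset({7, 8, 10}),
]

clusters = single_linkage(fingerprints, threshold=0.6)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

The algorithm never sees a "correct" answer; the groups emerge only from how similar the molecules are to each other.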
In semi-supervised learning, the data is only partially labeled, so for the most part we use a combination of the two approaches mentioned above.
Deep learning cheminformatics algorithms
Deep learning is used specifically to model complex non-linear relationships. Several DL architectures are commonly used in cheminformatics. Here are some examples:
- FNN - feedforward neural network
- MLP - multilayer perceptron
- RNN - recurrent neural network
- CNN - convolutional neural network
For example, the last one (CNN) was used in 2018 by a group of Japanese scientists to create a model for compound classification, which can be used for advanced screening in the drug design process. The model applied a convolutional neural network to compounds written in the simplified molecular-input line-entry system (SMILES). While it wasn't the first project to use deep learning in a similar way, the resulting model outperformed existing solutions. It also shows how much there is still to discover and how much more new technologies have to offer the life sciences.
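To give a feel for the general idea (this is not the model from that study), the sketch below one-hot encodes a SMILES string and slides a single 1D convolution filter over it. It makes several simplifying assumptions: tokens are single characters, the vocabulary is tiny, and the filter weights are written by hand to respond to the pattern "C=O". In a real CNN the weights would be learned from labeled compounds.

```python
def one_hot_encode(smiles, vocabulary):
    """Encode each character as a one-hot row: len(smiles) rows, len(vocabulary) columns."""
    index = {ch: i for i, ch in enumerate(vocabulary)}
    matrix = []
    for ch in smiles:
        row = [0] * len(vocabulary)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix


def conv1d_relu(matrix, kernel, bias):
    """Slide the kernel along the sequence (no padding), add the bias, apply ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(matrix) - width + 1):
        total = sum(
            matrix[start + k][v] * kernel[k][v]
            for k in range(width)
            for v in range(len(kernel[k]))
        )
        outputs.append(max(0.0, total + bias))
    return outputs


def global_max_pool(values):
    """Keep the strongest response anywhere in the sequence."""
    return max(values) if values else 0.0


vocabulary = ["C", "N", "O", "="]
# Hand-written filter, three characters wide, matching "C", "=", "O" in that order.
kernel = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
bias = -2.0  # the filter fires only when all three positions match

acetaldehyde = global_max_pool(conv1d_relu(one_hot_encode("CC=O", vocabulary), kernel, bias))
ethanol = global_max_pool(conv1d_relu(one_hot_encode("CCO", vocabulary), kernel, bias))
print(acetaldehyde, ethanol)  # 1.0 0.0
```

The filter responds to acetaldehyde, which contains the "C=O" pattern, and stays silent for ethanol. A trained network stacks many such filters and learns which patterns matter for the classification task.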
Other examples of algorithms used in cheminformatics
Due to fierce competition in the pharmaceutical industry, the specific solutions used by the most innovative companies are often closely guarded. As a company that works on cutting-edge projects like Synthia by Merck corporation, we also can't talk in detail about our work.
However, when we look at publicly available solutions and scientific publications, it's clear that machine learning is at the center of most innovative approaches. For example, genetic algorithms are often used to generate poses in docking, and then machine learning helps create models to evaluate them.
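The genetic-algorithm part can be sketched in a few lines. In this toy version a "pose" is just a tuple (x, y, angle), and `score_pose` is a hypothetical hand-written stand-in for a docking score (lower is better), not a real scoring function or a learned model. The population size, mutation scale and number of generations are arbitrary choices.

```python
import math
import random


def score_pose(pose):
    """Hypothetical scoring function (lower is better), standing in for a docking score."""
    x, y, angle = pose
    return (x - 1.5) ** 2 + (y + 0.5) ** 2 + (1.0 - math.cos(angle - 0.8))


def crossover(parent_a, parent_b, rng):
    """Take each coordinate from one of the two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(parent_a, parent_b))


def mutate(pose, rng, scale=0.3):
    """Add small Gaussian noise to every coordinate."""
    return tuple(value + rng.gauss(0.0, scale) for value in pose)


def evolve(population, rng, generations=50, elite=2):
    """Keep the best poses unchanged (elitism) and breed the rest from the top half."""
    for _ in range(generations):
        population = sorted(population, key=score_pose)
        parents = population[: len(population) // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(len(population) - elite)
        ]
        population = population[:elite] + children
    return min(population, key=score_pose)


rng = random.Random(0)  # fixed seed so the run is repeatable
initial = [
    (rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(-math.pi, math.pi))
    for _ in range(20)
]
initial_best = min(score_pose(p) for p in initial)
best = evolve(initial, rng)
print(score_pose(best) <= initial_best)  # True
```

Because the best poses are carried over unchanged, the best score can never get worse from one generation to the next. In a real pipeline, the scoring step is where a machine learning model could be plugged in.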
The most cited and widely utilized algorithms include:
- The Floyd–Warshall algorithm computes the distance matrix of a molecular graph, from which topological indices such as the Wiener index, the Balaban index and many others are derived.
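- The Morgan algorithm assigns a canonical numbering to the atoms of a molecule and underlies circular fingerprints, which are used for searching and comparing chemical structures and reactions.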
- Dijkstra's algorithm finds the shortest paths between atoms within a molecule, which helps us better understand its topology and geometry.
- The Jarvis–Patrick algorithm and Ward's method are commonly used for clustering to improve QSAR analysis.
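As an example of the first item, the sketch below runs Floyd–Warshall on a hydrogen-suppressed molecular graph and sums the resulting distance matrix into the Wiener index. The atom numbering and the bond lists are my own way of writing down the two carbon skeletons, and every bond counts as distance 1.

```python
INF = float("inf")


def floyd_warshall(n_atoms, bonds):
    """All-pairs shortest path lengths (counted in bonds) for an unweighted molecular graph."""
    dist = [[0 if i == j else INF for j in range(n_atoms)] for i in range(n_atoms)]
    for a, b in bonds:
        dist[a][b] = dist[b][a] = 1
    for k in range(n_atoms):
        for i in range(n_atoms):
            for j in range(n_atoms):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def wiener_index(dist):
    """Sum of shortest-path distances over all unordered pairs of atoms."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))


# Carbon skeletons only, atoms numbered from 0.
n_butane = floyd_warshall(4, [(0, 1), (1, 2), (2, 3)])   # C-C-C-C chain
isobutane = floyd_warshall(4, [(0, 1), (0, 2), (0, 3)])  # central carbon with three neighbors

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers have the same atoms but different indices: the more compact, branched skeleton gets the lower value. This is the kind of structural information that topological indices feed into QSAR models.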
Putting scientists in the best position to succeed
No matter what algorithms we use, the primary role of cheminformatics, as a branch of computer science and information technology, is to deliver the most efficient and accurate solutions that make scientists' work easier, leaner and cleaner.
The technological revolution has given us massive amounts of data and the computational power to handle, manage and analyze it. The biggest challenge now is to find the best ways (algorithms) to do so efficiently.
I'm excited to take part in this as a cheminformatics developer and a chemistry enthusiast. And if you're looking for a team of cheminformatics experts to help you find the best algorithms to move your project forward, feel free to contact us.