House Price Prediction using Machine Learning in Python
https://mistergemba.com/house-price-prediction-using-machine-learning-in-python/
Wed, 06 Aug 2025 01:27:33 +0000

House price prediction helps the real estate industry make informed decisions. Using machine learning algorithms, we can predict the price of a house from features such as location, size, number of bedrooms, and other relevant factors. In this article we will build a machine learning model in Python to predict house prices and gain valuable insights into the housing market.

To tackle this problem, we will train a machine learning model on the House Price Prediction Dataset, which can be downloaded from the provided link. It includes 13 features:

  • Id – To count the records.
  • MSSubClass – Identifies the type of dwelling involved in the sale.
  • MSZoning – Identifies the general zoning classification of the sale.
  • LotArea – Lot size in square feet.
  • LotConfig – Configuration of the lot.
  • BldgType – Type of dwelling.
  • OverallCond – Rates the overall condition of the house.
  • YearBuilt – Original construction year.
  • YearRemodAdd – Remodel date (same as construction date if no remodeling or additions).
  • Exterior1st – Exterior covering on the house.
  • BsmtFinSF2 – Type 2 finished square feet.
  • TotalBsmtSF – Total square feet of basement area.
  • SalePrice – The target to be predicted.

Step 1: Importing Libraries and Dataset

In the first step we import the libraries needed for prediction:

  • Pandas – To load the DataFrame
  • Matplotlib – To visualize the data features (e.g., bar plots)
  • Seaborn – To see the correlation between features using a heatmap

Once the data is imported, the shape attribute shows the dimensions of the dataset.

dataset.shape

Output:

(2919, 13)
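Loading the dataset and checking its shape might look like the sketch below. The file name is an assumption, so a tiny synthetic frame stands in for the real data here:

```python
import pandas as pd

# Hypothetical file name -- in practice the article's dataset would be loaded:
# dataset = pd.read_csv("HousePricePrediction.csv")

# Self-contained stand-in with a few of the 13 columns:
dataset = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500.0, 181500.0, 223500.0],
})

print(dataset.shape)  # (rows, columns); the full dataset gives (2919, 13)
```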

Step 2: Data Preprocessing

Now, we categorize the features by their datatype (int, float, object) and count how many there are of each.

Output: 

Categorical variables : 4
Integer variables : 6
Float variables : 3
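The dtype counts above can be computed like this (a sketch on a tiny stand-in frame, with one column per dtype of interest):

```python
import pandas as pd

# Tiny stand-in frame; the real dataset yields the counts quoted above.
dataset = pd.DataFrame({
    "MSZoning": ["RL", "RM"],            # object (categorical)
    "LotArea": [8450, 9600],             # integer
    "SalePrice": [208500.0, 181500.0],   # float
})

# dtype.kind: "O" = object, "i" = integer, "f" = float
object_cols = [c for c in dataset.columns if dataset[c].dtype == "object"]
int_cols = [c for c in dataset.columns if dataset[c].dtype.kind == "i"]
float_cols = [c for c in dataset.columns if dataset[c].dtype.kind == "f"]

print("Categorical variables :", len(object_cols))
print("Integer variables :", len(int_cols))
print("Float variables :", len(float_cols))
```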

Step 3: Exploratory Data Analysis

EDA refers to the deep analysis of data to discover patterns and spot anomalies. Before making inferences from the data, it is essential to examine all your variables. Let's start with a heatmap of feature correlations using the seaborn library.

To analyze the categorical features, let's draw a bar plot.

Output:

The plot shows that Exterior1st has around 16 unique categories while the other features have around 6. To find out the actual count of each category, we can plot a bar graph for each of the four features separately.
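The unique-category counts that these bar graphs visualize can be computed directly; a sketch on a tiny stand-in frame, with the plotting call shown as a comment:

```python
import pandas as pd

# Tiny stand-in for the real dataset's categorical columns.
dataset = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "BldgType": ["1Fam", "1Fam", "TwnhsE", "1Fam"],
})

unique_counts = dataset.select_dtypes(include="object").nunique()
print(unique_counts.to_dict())
# To visualize: sns.barplot(x=unique_counts.index, y=unique_counts.values)
```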

Output:

Step 4: Data Cleaning

Data cleaning means improving the data by removing incorrect, corrupted, or irrelevant records. Our dataset contains some columns that are unimportant and irrelevant for model training, so we can drop them before training. There are two approaches to dealing with empty/null values:

  • We can easily delete the column/row (if the feature or record is not much important).
  • Filling the empty slots with mean/mode/0/NA/etc. (depending on the dataset requirement).

As the Id column will not participate in any prediction, we can drop it.
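A minimal sketch of both cleaning steps on stand-in data, dropping the Id column and then applying the two null-handling strategies (mean fill and row removal):

```python
import pandas as pd

dataset = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, None, 11250],
    "SalePrice": [208500.0, 181500.0, None],
})

# Drop the Id column -- it carries no predictive signal.
dataset = dataset.drop(columns=["Id"])

# Strategy 2: fill empty slots with the column mean.
dataset["SalePrice"] = dataset["SalePrice"].fillna(dataset["SalePrice"].mean())

# Strategy 1: delete any rows that are still incomplete.
dataset = dataset.dropna()

print(dataset.isnull().sum().sum())  # 0
```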

Step 5: OneHotEncoder – Encoding the Categorical Features

One-hot encoding converts categorical data into binary vectors: each category becomes its own column containing 0 or 1. Using OneHotEncoder, we can convert object-dtype data into numeric form. To do so, we first collect all the features that have the object datatype using a loop.

Output:

Once we have the list of categorical features, we can apply OneHotEncoding to all of them.

Step 6: Splitting Dataset into Training and Testing

We split the data into X and Y (Y is the SalePrice column; all remaining columns form X).
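A sketch of the split using scikit-learn's train_test_split (tiny synthetic data stands in for the encoded dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    "OverallCond": [5, 8, 5, 5, 5, 5, 5, 6],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

# Y is SalePrice; everything else is X.
X = df.drop(columns=["SalePrice"])
Y = df["SalePrice"]

X_train, X_valid, Y_train, Y_valid = train_test_split(
    X, Y, train_size=0.8, test_size=0.2, random_state=0
)
print(X_train.shape, X_valid.shape)  # (6, 2) (2, 2)
```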

Step 7: Model Training and Accuracy

Since we are predicting continuous values, we will use the following regression models:

  • SVM – Support Vector Machine
  • Random Forest Regressor
  • Linear Regression

To calculate the loss, we will use the mean_absolute_percentage_error function, which can be imported from sklearn.metrics. The formula for Mean Absolute Percentage Error is:

MAPE = (1/n) * Σ |y_i − ŷ_i| / |y_i|

where y_i is the actual value, ŷ_i is the predicted value, and n is the number of samples.
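A compact sketch training all three models and reporting MAPE. The data is a tiny synthetic stand-in, so the errors will not match the outputs quoted in the sections below:

```python
import pandas as pd
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the real (encoded) dataset.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    "OverallCond": [5, 8, 5, 5, 5, 5, 5, 6],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})
X_train, X_valid, Y_train, Y_valid = train_test_split(
    df.drop(columns=["SalePrice"]), df["SalePrice"],
    train_size=0.8, test_size=0.2, random_state=0,
)

models = {
    "SVM": svm.SVR(),
    "Random Forest": RandomForestRegressor(n_estimators=10, random_state=0),
    "Linear Regression": LinearRegression(),
}
for name, model in models.items():
    model.fit(X_train, Y_train)
    error = mean_absolute_percentage_error(Y_valid, model.predict(X_valid))
    print(f"{name}: {error:.4f}")
```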

1. SVM – Support Vector Machine

Support Vector Machine is a supervised machine learning algorithm primarily used for classification, though it can also be used for regression (as SVR). It works by finding the hyperplane that best divides a dataset, with the goal of maximizing the margin between the data points and the hyperplane.

Output : 

0.18705129

2. Random Forest Regression

Random Forest is an ensemble learning algorithm used for both classification and regression tasks. It constructs multiple decision trees during training where each tree in the forest is built on a random subset of the data and features, ensuring diversity in the model. The final output is determined by averaging the outputs of individual trees (for regression) or by majority voting (for classification).

Output : 

0.1929469

3. Linear Regression

Linear Regression is a statistical method used for modeling the relationship between a dependent variable and one or more independent variables. The goal is to find the line that best fits the data. This is done by minimizing the sum of the squared differences between the observed and predicted values. Linear regression assumes that the relationship between variables is linear.

Output : 

0.187416838

Clearly the SVM model performs best, as its mean absolute percentage error is the lowest among the regressors (about 0.187). Ensemble learning techniques like bagging and boosting could be used to get even better results.

How Tripadvisor Delivers Real-Time Personalization at Scale with ML
https://mistergemba.com/how-tripadvisor-delivers-real-time-personalization-at-scale-with-ml/
Tue, 05 Aug 2025 22:50:23 +0000

See the engineering behind real-time personalization at Tripadvisor’s massive (and rapidly growing) scale

What kind of traveler are you? Tripadvisor tries to assess this as soon as you engage with the site, then offer you increasingly relevant information on every click—within a matter of milliseconds. This personalization is powered by advanced ML models acting on data that’s stored on ScyllaDB running on AWS.

In this article, Dean Poulin (Tripadvisor Data Engineering Lead on the AI Service and Products team) provides a look at how they power this personalization. Dean shares a taste of the technical challenges involved in delivering real-time personalization at Tripadvisor’s massive (and rapidly growing) scale.

It’s based on the following AWS re:Invent talk:


Pre-Trip Orientation

In Dean’s words …

Let’s start with a quick snapshot of who Tripadvisor is, and the scale at which we operate. Founded in 2000, Tripadvisor has become a global leader in travel and hospitality, helping hundreds of millions of travelers plan their perfect trips. Tripadvisor generates over $1.8 billion in revenue and is a publicly traded company on the NASDAQ stock exchange. Today, we have a talented team of over 2,800 employees driving innovation, and our platform serves a staggering 400 million unique visitors per month – a number that’s continuously growing.

On any given day, our system handles more than 2 billion requests from 25 to 50 million users. Every click you make on Tripadvisor is processed in real time. Behind that, we’re leveraging machine learning models to deliver personalized recommendations – getting you closer to that perfect trip. At the heart of this personalization engine is ScyllaDB running on AWS. This allows us to deliver millisecond-latency at a scale that few organizations reach. At peak traffic, we hit around 425K operations per second on ScyllaDB with P99 latencies for reads and writes around 1-3 milliseconds.

I’ll be sharing how Tripadvisor is harnessing the power of ScyllaDB, AWS, and real-time machine learning to deliver personalized recommendations for every user. We’ll explore how we help travelers discover everything they need to plan their perfect trip: whether it’s uncovering hidden gems, must-see attractions, unforgettable experiences, or the best places to stay and dine. This [article] is about the engineering behind that – how we deliver seamless, relevant content to users in real time, helping them find exactly what they’re looking for as quickly as possible.

Personalized Trip Planning

Imagine you’re planning a trip. As soon as you land on the Tripadvisor homepage, Tripadvisor already knows whether you’re a foodie, an adventurer, or a beach lover – and you’re seeing spot-on recommendations that seem personalized to your own interests. How does that happen within milliseconds?

As you browse around Tripadvisor, we start to personalize what you see using Machine Learning models which calculate scores based on your current and prior browsing activity. We recommend hotels and experiences that we think you would be interested in. We sort hotels based on your personal preferences. We recommend popular points of interest near the hotel you’re viewing. These are all tuned based on your own personal preferences and prior browsing activity.

Tripadvisor’s Model Serving Architecture

Tripadvisor runs on hundreds of independently scalable microservices in Kubernetes on-prem and in Amazon EKS. Our ML Model Serving Platform is exposed through one of these microservices.

This gateway service abstracts over 100 ML Models from the Client Services – which lets us run A/B tests to find the best models using our experimentation platform. The ML Models are primarily developed by our Data Scientists and Machine Learning Engineers using Jupyter Notebooks on Kubeflow. They’re managed and trained using ML Flow, and we deploy them on Seldon Core in Kubernetes. Our Custom Feature Store provides features to our ML Models, enabling them to make accurate predictions.

The Custom Feature Store

The Feature Store primarily serves User Features and Static Features. Static Features are stored in Redis because they don’t change very often. We run data pipelines daily to load data from our offline data warehouse into our Feature Store as Static Features.

User Features are served in real time through a platform called Visitor Platform. We execute dynamic CQL queries against ScyllaDB, and we do not need a caching layer because ScyllaDB is so fast.

Our Feature Store serves up to 5 million Static Features per second and half a million User Features per second.

What’s an ML Feature?

Features are input variables to the ML Models that are used to make a prediction. There are Static Features and User Features.

Some examples of Static Features are awards that a restaurant has won or amenities offered by a hotel (like free Wi-Fi, pet friendly or fitness center).

User Features are collected in real time as users browse around the site. We store them in ScyllaDB so we can get lightning fast queries. Some examples of user features are the hotels viewed over the last 30 minutes, restaurants viewed over the last 24 hours, or reviews submitted over the last 30 days.

The Technologies Powering Visitor Platform

ScyllaDB is at the core of Visitor Platform. We use Java-based Spring Boot microservices to expose the platform to our clients, deployed on AWS ECS Fargate. We run Apache Spark on Kubernetes for our daily data retention jobs and our offline-to-online jobs, which load data from our offline data warehouse into ScyllaDB so that it's available on the live site. We also use Amazon Kinesis for processing streaming user tracking events.

The Visitor Platform Data Flow

The following graphic shows how data flows through our platform in four stages: produce, ingest, organize, and activate.

Data is produced by our website and our mobile apps. Some of that data includes our Cross-Device User Identity Graph, Behavior Tracking events (like page views and clicks) and streaming events that go through Kinesis. Also, audience segmentation gets loaded into our platform.

Visitor Platform’s microservices are used to ingest and organize this data. The data in ScyllaDB is stored in two keyspaces:

  • The Visitor Core keyspace, which contains the Visitor Identity Graph
  • The Visitor Metric keyspace, which contains Facts and Metrics (the things that the people did as they browsed the site)

We use daily ETL processes to maintain and clean up the data in the platform. We produce Data Products, stamped daily, in our offline data warehouse – where they are available for other integrations and other data pipelines to use in their processing.

Here’s a look at Visitor Platform by the numbers:

Why Two Databases?

Our online database is focused on the real-time, live website traffic. ScyllaDB fills this role by providing very low latencies and high throughput. We use short term TTLs to prevent the data in the online database from growing indefinitely, and our data retention jobs ensure that we only keep user activity data for real visitors. Tripadvisor.com gets a lot of bot traffic, and we don’t want to store their data and try to personalize bots – so we delete and clean up all that data.

Our offline data warehouse retains historical data used for reporting, creating other data products, and training our ML Models. We don’t want large-scale offline data processes impacting the performance of our live site, so we have two separate databases used for two different purposes.

Visitor Platform Microservices

We use 5 microservices for Visitor Platform:

  • Visitor Core manages the cross-device user identity graph based on cookies and device IDs.
  • Visitor Metric is our query engine; it exposes facts and metrics for specific visitors through a domain-specific language called Visitor Query Language (VQL). A VQL query can, for example, retrieve the latest commerce click facts over the last three hours.
  • Visitor Publisher and Visitor Saver handle the write path, writing data into the platform. Besides saving data in ScyllaDB, we also stream data to the offline data warehouse. That’s done with Amazon Kinesis.
  • Visitor Composite simplifies publishing data in batch processing jobs. It abstracts Visitor Saver and Visitor Core to identify visitors and publish facts and metrics in a single API call.

Roundtrip Microservice Latency

This graph illustrates how our microservice latencies remain stable over time.

The average latency is only 2.5 milliseconds, and our P999 is under 12.5 milliseconds. This is impressive performance, especially given that we handle over 1 billion requests per day.

Our microservice clients have strict latency requirements. 95% of the calls must complete in 12 milliseconds or less. If they go over that, then we will get paged and have to find out what’s impacting the latencies.

ScyllaDB Latency

Here’s a snapshot of ScyllaDB’s performance over three days.

At peak, ScyllaDB is handling 340,000 operations per second (including writes and reads and deletes) and the CPU is hovering at just 21%. This is high scale in action!

ScyllaDB delivers microsecond writes and millisecond reads for us. This level of blazing fast performance is exactly why we chose ScyllaDB.

Partitioning Data into ScyllaDB

This image shows how we partition data into ScyllaDB.

The Visitor Metric Keyspace has two tables: Fact and Raw Metrics. The primary key on the Fact table is Visitor GUID, Fact Type, and Created At Date. The composite partition key is the Visitor GUID and Fact Type. The clustering key is Created At Date, which allows us to sort data in partitions by date. The attributes column contains a JSON object representing the event that occurred there. Some example Facts are Search Terms, Page Views, and Bookings.

We use ScyllaDB’s Leveled Compaction Strategy because:

  • It’s optimized for range queries
  • It handles high cardinality very well
  • It’s better for read-heavy workloads, and we have about 2-3X more reads than writes
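Putting the table description and compaction choice together, a hypothetical CQL sketch (reconstructed from the description above, not Tripadvisor's actual DDL) might look like:

```sql
CREATE TABLE visitor_metric.fact (
    visitor_guid    text,
    fact_type       text,
    created_at_date timestamp,
    attributes      text,  -- JSON object describing the event
    PRIMARY KEY ((visitor_guid, fact_type), created_at_date)
) WITH CLUSTERING ORDER BY (created_at_date DESC)
  AND compaction = {'class': 'LeveledCompactionStrategy'};
```

The composite partition key (visitor_guid, fact_type) keeps each visitor's facts of one type in a single partition, and the created_at_date clustering key sorts them by date within it.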

Why ScyllaDB?

Our solution was originally built using Cassandra on-prem, but as the scale increased, so did the operational burden: it required dedicated operations support to manage database upgrades, backups, and so on. Our solution also requires very low latencies for core components. Our User Identity Management system must identify the user within 30 milliseconds, and for the best personalization we require our Event Tracking platform to respond within 40 milliseconds. It's critical that our solution doesn't block rendering the page, so our SLAs are very tight. With Cassandra, garbage collection impacted performance, primarily the tail latencies (P999 and P9999).

We ran a Proof of Concept with ScyllaDB and found the throughput to be much better than Cassandra and the operational burden was eliminated. ScyllaDB gave us a monstrously fast live serving database with the lowest possible latencies.

We wanted a fully-managed option, so we migrated from Cassandra to ScyllaDB Cloud, following a dual write strategy. That allowed us to migrate with zero downtime while handling 40,000 operations or requests per second. Later, we migrated from ScyllaDB Cloud to ScyllaDB’s “Bring your own account” model, where you can have the ScyllaDB team deploy the ScyllaDB database into your own AWS account. This gave us improved performance as well as better data privacy.

This diagram shows what ScyllaDB’s BYOA deployment looks like.

In the center of the diagram, you can see a 6-node ScyllaDB cluster running on EC2. There are also two additional EC2 instances:

  • ScyllaDB Monitor gives us Grafana dashboards as well as Prometheus metrics.
  • ScyllaDB Manager takes care of infrastructure automation like triggering backups and repairs.

With this deployment, ScyllaDB could be co-located very close to our microservices to give us even lower latencies as well as much higher throughput and performance.

Wrapping up, I hope you now have a better understanding of our architecture, the technologies that power the platform, and how ScyllaDB plays a critical role in allowing us to handle Tripadvisor’s extremely high scale.

About Cynthia Dunlop

Cynthia is Senior Director of Content Strategy at ScyllaDB. She has been writing about software development and quality engineering for 20+ years.

How Computers “See” Molecules
https://mistergemba.com/how-computers-see-molecules/
Tue, 05 Aug 2025 22:37:13 +0000

To a computer, Edvard Munch’s The Scream is nothing more than a grid of pixel values. It has no sense of why swirling lines in a twilight sky convey the agony of a scream. That’s because (modern digital) computers fundamentally process only binary signals [1,2]; they don’t inherently comprehend the objects and emotions we perceive.

To mimic human intelligence, we first need an intermediate form (representation) to “translate” our sensory world into something a computer can handle. For The Scream, that might mean extracting edges, colors, shapes, etc. Likewise, in Natural Language Processing (NLP), a computer sees human language as an unstructured stream of symbols that must be turned into numeric vectors or other structured forms. Only then can it begin to map raw input to higher-level concepts (i.e., building a model).

Human intelligence also depends on internal representations.

In psychology, a representation refers to an internal mental symbol or image that stands for something in the outside world [3]. In other words, a representation is how information is encoded in the brain: the symbols we use (words, images, memories, artistic depictions, etc.) to stand for objects and ideas.

Our senses don’t simply put the external world directly into our brains; instead, they convert sensory input into abstract neural signals. For example, the eyes convert light into electrical signals on the retina, and the ears turn air vibrations into nerve impulses. These neural signals are the brain’s representation of the external world, which is used to reconstruct our perception of reality, essentially building a “model” in our mind.

Between ages one and two, children enter Piaget’s early preoperational stage [4]. This is when kids start using one thing to represent another: a toddler might hold a banana up to their ear and babble as if it’s a phone, or push a box around pretending it’s a car. This kind of symbolic play is important for cognitive development, because it shows the child can move beyond the here-and-now and project the concepts in their mind onto reality [5].

Without our senses translating physical signals into internal codes, we couldn’t perceive anything [5].

“Garbage in, garbage out”. The quality of a representation sets an upper bound on the performance of any model built on it [6,7].

Much of the progress in human intelligence has come from improving how we represent knowledge [8].

One of the core goals of education is to help students form effective mental representations of new knowledge. Seasoned educators use diagrams, animations, analogies and other tools to present abstract concepts in a vivid, relatable way. Richard Mayer argues that meaningful learning happens when learners form a coherent mental representation or model of the material, rather than just memorizing disconnected facts [8]. In meaningful learning, new information integrates into existing knowledge, allowing students to transfer and apply it in novel situations.

However, in practice, factors like limited model capacity and finite computing resources constrain how complex our representations can be. Compressing input data inevitably risks information loss, noise, and artifacts. So, as the first step, developing a “good enough” representation requires balancing several key properties:

  • It should retain the information critical to the task. (A clear problem definition helps filter out the rest.)
  • It should be as compact as possible: minimizing redundancy and keeping dimensionality low.
  • It should separate classes in feature space. Samples from the same class cluster together, while those from different classes stay far apart.
  • It should be robust to input noise, compression artifacts, and shifts in data modality.
  • Invariance. Representations should be invariant to task‑irrelevant changes (e.g. rotating or translating an image, or changing its brightness).
  • Generalizability.
  • Interpretability.
  • Transferability.

These limitations on representation complexity are somewhat analogous to the limited capacity of our own working memory.

Human short-term memory, on average, can only hold about 7±2 items at once [9]. When too many independent pieces of information arrive simultaneously (beyond what our cognitive load can handle), our brains bog down. Cognitive psychology research shows that with the right guidance (by adjusting how information is represented), people can reorganize information to overcome this apparent limit [10,11]. For example, we can remember a long string of digits more easily by chunking them into meaningful groups (which is why phone numbers are often split into shorter blocks).

Now, shifting from The Scream to the microscopic world of molecules, we face the same challenge: how can we translate real-world molecules into a form that a computer can understand? With the right representation, a computer can infer chemical properties or biological functions, and ultimately map those to higher‑level concepts (e.g., a drug’s activity or a molecule’s protein binding). In this article, we’ll explore the common methods that let computers “see” molecules.

Chemical Formula

Perhaps the most straightforward depiction of a molecule is its chemical formula, like C8H10N4O2 (caffeine), which tells us there are 8 carbon atoms, 10 hydrogen atoms, 4 nitrogen atoms and 2 oxygen atoms. However, its very simplicity is also its limitation: a formula conveys nothing about how those atoms are connected (the bonding topology), how they are arranged in space, or where functional groups are located. That’s why isomers (like ethanol and dimethyl ether) both share C2H6O yet differ completely in structure and properties.
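Atom counts are easy to recover from a formula programmatically. A minimal pure-Python sketch (handling simple formulas without parentheses or isotopes):

```python
import re
from collections import Counter

def parse_formula(formula: str) -> Counter:
    """Parse a simple chemical formula (no parentheses) into atom counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

print(parse_formula("C8H10N4O2"))  # caffeine
print(parse_formula("C2H6O"))      # ethanol *and* dimethyl ether --
                                   # the formula alone cannot tell them apart
```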

Linear String

Another common way to represent molecules is to encode them as a linear string of characters, a format widely adopted in databases [12,13].

SMILES

The most classic example is SMILES (Simplified Molecular Input Line Entry System) [14], developed by David Weininger in the 1980s. SMILES treats atoms as nodes and bonds as edges, then “flattens” them into a 1D string via a depth‑first traversal, preserving all the connectivity and ring information. Single, double, triple, and aromatic bonds are denoted by the symbols “-”, “=”, “#”, and “:”, respectively. Numbers are used to mark the start and end of rings, and branches off the main chain are enclosed in parentheses. For example, ethanol is written as CCO, its isomer dimethyl ether as COC, and benzene as c1ccccc1 (the paired digits marking the ring closure). (See more in SMILES – Wikipedia.)

SMILES is simple, intuitive, and compact for storage. Its extended syntax supports stereochemistry and isotopes. There is also a rich ecosystem of tools supporting it: most chemistry libraries let us convert between SMILES and other standard formats.

However, without an agreed-upon canonicalization algorithm, the same molecule can be written in multiple valid SMILES forms. This can potentially lead to inconsistencies or “data pollution”, especially when merging data from multiple sources.

InChI

Another widely used string format is InChI (International Chemical Identifier) [15], introduced by IUPAC in 2005, to generate globally standardized, machine-readable, and unique molecule identifiers. InChI strings, though longer than SMILES, encode more details in layers (including atoms and their bond connectivity, tautomeric state, isotopes, stereochemistry, and charge), each with strict rules and priority. (See more in InChI – Wikipedia.)

Because an InChI string can become very lengthy as a molecule grows more complex, it is often paired with a 27‑character InChIKey hash [15]. The InChIKeys aren’t human‑friendly, but they’re ideal for database indexing and for exchanging molecule identifiers across systems.

Molecular Descriptor

Many computational models require numeric inputs. Compared to linear string representations, molecular descriptors turn a molecule’s properties and patterns into a vector of numerical features, delivering satisfactory performance in many tasks [7, 16-18].

Todeschini and Consonni describe the molecular descriptor as the “final result of a logical and mathematical procedure, which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment” [16].

We can think of a set of molecular descriptors as a standardized “physical exam sheet” for a molecule, asking questions like:

  • Does it have a benzene ring?
  • How many carbon atoms does it have?
  • What’s the predicted octanol-water partition coefficient (LogP)?
  • Which functional groups are present?
  • What is its 3D conformation or electron distribution like?

Their answers can take various forms, such as numerical values, categorical flags, vectors, graph-based structures, tensors etc. Because every molecule in our dataset is described using the same set of questions (the same “physical exam sheet”), comparisons and model inputs become straightforward. And because each feature has a clear meaning, descriptors improve the interpretability of the model.

Of course, just as a physical exam sheet can’t capture absolutely everything about a person’s health, a finite set of molecular descriptors can never capture all aspects of a molecule’s chemical and physical nature. Computing descriptors is typically a non-invertible process, inevitably leading to a loss of information, and the results are not guaranteed to be unique. Therefore, there are different types of molecular descriptors, each focusing on different aspects.

Thousands of molecular descriptors have been developed over the years (for example, RDKit [19], CDK [20], Mordred [17], etc.). They can be broadly categorized by the dimensionality of information they encode (these categories aren’t strict divisions):

  • 0D: formula‑based properties independent of structure (e.g., atom counts or molecular weight).
  • 1D: sequence-based properties (e.g., counts of certain functional groups).
  • 2D: derived from the 2D topology (e.g., eccentric connectivity index [21]).
  • 3D: derived from 3D conformation, capturing geometric or spatial properties (e.g., charged partial surface area [22]).
  • 4D and higher: these incorporate additional dimensions such as time, ensemble, or environmental factors (e.g., descriptors derived from molecular dynamics simulations, or from quantum chemical calculations like HOMO/LUMO).
  • Descriptors obtained from other sources including experimental measurements.

Molecular fingerprints are a special kind of molecular descriptor that encode substructures into a fixed-length numerical vector [16]. This table summarizes some commonly used molecular fingerprints [23], such as MACCS [24], which is shown in the figure below.
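A toy illustration of the fingerprint idea (not MACCS or any real fingerprint scheme): hash short substrings of a SMILES string into a fixed-length bit vector, then compare molecules with Tanimoto similarity over the set bits.

```python
import hashlib

def toy_fingerprint(smiles: str, n_bits: int = 64) -> set:
    """Hash all 1- to 3-character SMILES substrings into bit positions."""
    bits = set()
    for size in (1, 2, 3):
        for i in range(len(smiles) - size + 1):
            fragment = smiles[i : i + size]
            digest = hashlib.sha1(fragment.encode()).digest()
            bits.add(int.from_bytes(digest[:4], "big") % n_bits)
    return bits

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity: shared bits over total bits."""
    return len(a & b) / len(a | b)

ethanol = toy_fingerprint("CCO")
propanol = toy_fingerprint("CCCO")
benzene = toy_fingerprint("c1ccccc1")

# Structurally similar molecules share more bits than dissimilar ones.
print(tanimoto(ethanol, propanol), tanimoto(ethanol, benzene))
```

Real fingerprints such as MACCS use curated chemical substructure keys rather than raw string fragments, but the fixed-length bit-vector comparison works the same way.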

Similarly, human fingerprints or product barcodes can also be seen as (or converted to) fixed-format numerical representations.

Different descriptors describe molecules from various aspects, so their contributions to different tasks naturally vary. In a task of predicting the aqueous solubility of drug-like molecules, over 4,000 computed descriptors were evaluated, but only about 800 made significant contributions to the prediction [7].

Point Cloud

Sometimes, we need our models to learn directly from a molecule’s 3D structure. For example, this is important when we’re interested in how two molecules might interact with each other [25], need to search the possible conformations of a molecule [26], or want to simulate its behavior in a certain environment [27].

One straightforward way to represent a 3D structure is as a point cloud of its atoms [28]. In other words, a point cloud is a collection of coordinates of the atoms in 3D space. However, while this representation shows which atoms are near each other, it doesn’t explicitly tell us which pairs of atoms are bonded. Inferring connectivity from interatomic distances (e.g., via cutoffs) can be error-prone and may miss higher-order chemistry like aromaticity or conjugation. Moreover, our model must account for changes in the raw coordinates under rotation or translation. (More on this later.)
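A minimal sketch of this representation, using illustrative coordinates for a water molecule and an arbitrary 1.2 Å distance cutoff, shows both the point cloud itself and how fragile cutoff-based bond inference is:

```python
import math

# A point cloud: atoms as (element, x, y, z) tuples. Coordinates here are
# illustrative, and the 1.2 Å cutoff is a heuristic, not a chemical rule.
water = [("O", 0.000, 0.000, 0.000),
         ("H", 0.957, 0.000, 0.000),
         ("H", -0.240, 0.927, 0.000)]

def infer_bonds(atoms, cutoff=1.2):
    """Guess bonds from pairwise distances: error-prone, as noted above."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            d = math.dist(atoms[i][1:], atoms[j][1:])
            if d <= cutoff:
                bonds.append((i, j))
    return bonds

print(infer_bonds(water))  # [(0, 1), (0, 2)]: both O-H pairs
```

A different cutoff would give different "bonds", and nothing here captures bond order or aromaticity — exactly the limitations discussed above.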

Graph

A molecule can also be represented as a graph, where atoms (nodes) are connected by bonds (edges). Graph representations elegantly handle rings, branches, and complex bonding arrangements. For example, in a SMILES string, a benzene ring must be “opened” and denoted by special symbols, whereas in a graph, it’s simply a cycle of nodes connected in a loop.

Molecules are commonly modeled as undirected graphs (since bonds have no inherent direction) [29-31]. We can further “decorate” the graph with additional domain-specific knowledge to make the representation more interpretable: tagging nodes with atom features (e.g., element type, charge, aromaticity) and edges with bond properties (e.g., order, length, strength). Therefore,

  • (uniqueness) each distinct molecular structure could correspond to a unique graph, and
  • (reversibility) we could reconstruct the original molecule from its graph representation
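A decorated molecular graph for ethanol might be sketched as follows; the particular node and edge features are illustrative, not a fixed standard:

```python
# A decorated molecular graph for ethanol (CCO): nodes carry atom features,
# edges carry bond properties. The feature choices are illustrative only.
nodes = {
    0: {"element": "C", "charge": 0, "aromatic": False},
    1: {"element": "C", "charge": 0, "aromatic": False},
    2: {"element": "O", "charge": 0, "aromatic": False},
}
edges = {
    (0, 1): {"order": 1},  # C-C single bond
    (1, 2): {"order": 1},  # C-O single bond
}

def neighbors(node, edges):
    """Undirected adjacency lookup: bonds have no inherent direction."""
    return sorted(b if a == node else a for a, b in edges if node in (a, b))

print(neighbors(1, edges))  # [0, 2]: atom 1 is bonded to atoms 0 and 2
```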

Chemical reactions essentially involve breaking bonds and forming new ones. Using graphs makes it easier to track these changes. Some reaction‑prediction models encode reactants and products as graphs and infer the transformation by comparing them [32,33].

Graph Neural Networks (GNNs) can directly process graphs and learn from them. Using molecular graph representation, these models can naturally handle molecules of arbitrary size and topology. In fact, many GNNs have outperformed models that only relied on descriptors or linear strings on many molecular tasks [7,30,34].

Often, when a GNN makes a prediction, we can inspect which parts of the graph were most influential. These “important bits” frequently correspond to actual chemical substructures or functional groups. In contrast, if we were looking at a particular substring of a SMILES, it’s not guaranteed to map neatly to a meaningful substructure.

A graph doesn’t always mean just the direct bonds connecting atoms. We can construct different kinds of graphs from molecular data depending on our needs, and sometimes these alternate graphs yield better results for particular applications. For example:

  • Complete graph: Every pair of nodes is connected by an edge. It could introduce redundant connections, but might be used to let a model consider all pairwise interactions.
  • Bipartite graph: Nodes are divided into two sets, and edges only connect nodes from one set to nodes from the other.
  • Nearest-neighbor graph: Each node is connected only to its nearest neighbors (according to some criterion), for controlling complexity.

Extensible Graph Representations

We can incorporate chemical rules or impose constraints within molecular graphs. In de novo molecular design, (early) SMILES-based generative models often proposed invalid molecules, because (1) assembling characters can break SMILES syntax, and (2) even a syntactically correct SMILES might encode an impossible structure. Graph-based generative models avoid both pitfalls by building molecules atom by atom and bond by bond (under user-specified chemical rules). Graphs also let us impose constraints: require or forbid specific substructures, enforce 3D shapes or chirality, and so on, guiding generation toward valid candidates that meet our goals [35,36].

Molecular graphs can also handle multiple molecules and their interactions (e.g., drug-protein binding, protein-protein interfaces). “Graph-of-graphs” treat each molecule as its own graph, then deploy a higher-level model to learn how they interact [37]. Or, we may merge the molecules into one composite graph, including all atoms from both partners and add special (dummy) edges or nodes to mark their contacts [38].

So far, we’ve been considering the standard graph of bonds (the 2D connectivity), but what if the 3D arrangement matters? Graph representations can certainly be augmented with 3D information: 3D coordinates could be attached to each node, or distances/angles could be added as attributes on the edges, to make models more sensitive to differences in 3D configurations. A better option is to use models like SE(3)-equivariant GNNs, which ensure their outputs (or key internal features) transform (or stay invariant) with any rotation or translation of the input.

In 3D space, the special Euclidean group SE(3) describes all possible rigid motions (any combination of rotations and translations). (It’s sometimes described as a semidirect product of the rotation group SO(3) with the translation group R³.) [28]

When we say a model or a function has SE(3) invariance, we mean that it gives the same result no matter how we rotate or translate the input in 3D. This kind of invariance is often an essential requirement for many molecular modeling tasks: a molecule floating in solution has no fixed reference frame (i.e., it can tumble around in space). So, if we predict some property of the molecule (say its binding affinity), that prediction should not be influenced by the molecule’s orientation or position.

Sequence Representations of Biomacromolecules

We’ve talked mostly about small molecules. But biological macromolecules (like proteins, DNA, and RNA) can contain thousands or even millions of atoms. Their SMILES or InChI strings become extremely long and complex, leading to massive computational, storage, and analysis costs.

This brings us back to the importance of defining the problem: for biomacromolecules, we’re often not interested in the precise position of every single atom or the exact bonds between each pair of atoms. Instead, we care about higher-level structural patterns and functional modules: like a protein’s amino acid backbone and its alpha‑helices or beta‑sheets, which fold into tertiary and quaternary structures. For DNA and RNA, we may care about nucleotide sequences and motifs.

We describe these biological polymers as sequences of their building blocks (i.e., primary structure): proteins as chains of amino acids, and DNA/RNA as strings of nucleotides. There are well-established codes for these building blocks (defined by IUPAC/IUBMB): for instance, in DNA, the letters A, C, G, T represent the bases adenine, cytosine, guanine, and thymine respectively.

Static Embeddings and Pretrained Embeddings

To convert a sequence into numerical vectors, we can use static embeddings: assigning a fixed vector to each residue (or k-mer fragment). The simplest static embedding is one-hot encoding (e.g., encode adenine A as [1,0,0,0]), turning a sequence into a matrix. Another approach is to learn dense (pretrained) embeddings by leveraging large databases of sequences. For example, ProtVec [39] breaks proteins into overlapping 3‑mers and trains a Word2Vec‑like model (commonly used in NLP) on a large corpus of sequences, assigning each 3-mer a 100D vector. These learned fragment embeddings are shown to capture biochemical and biophysical patterns: fragments with similar functions or properties cluster closer in the embedding space.

k-mer fragments (or k-mers) are substrings of length k extracted from a biological sequence.
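The two static encodings mentioned above can be sketched in a few lines; the sequences below are toy examples:

```python
# Two static sequence encodings: one-hot vectors per base, and overlapping
# k-mer extraction (the unit ProtVec-style models embed).
BASES = "ACGT"

def one_hot(seq):
    """Encode each base as a 4-dimensional indicator vector."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def kmers(seq, k=3):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(one_hot("AC"))       # [[1, 0, 0, 0], [0, 1, 0, 0]]
print(kmers("ACGTA", 3))   # ['ACG', 'CGT', 'GTA']
```

In a ProtVec-like pipeline, each of these k-mers would then be looked up in a learned embedding table instead of being one-hot encoded.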

Tokens

Inspired by NLP, we can treat a sequence as if it’s a sentence composed of tokens or words (i.e., residues or k-mer fragments), and then feed them into deep language models. Trained on massive collections of sequences, these models learn biology’s “grammar” and “semantics” just as they do in human language.

Transformers use self-attention to capture long-range dependencies in sequences; in effect, we use them to learn a “language of biology”. Meta’s ESM series of models [40-42], for example, trained Transformers on hundreds of millions of protein sequences. Similarly, DNABERT [43] tokenizes DNA into k-mers for BERT-style training on genomic data. The embeddings obtained this way have been shown to encapsulate a wealth of biological information, and in many cases they can be used directly for various tasks (i.e., transfer learning).

Descriptors

In practice, sequence-based models often combine their embeddings with physicochemical properties, statistical features, and other descriptors, such as the percentage of each amino acid in a protein, the GC content of a DNA sequence, or indices like hydrophobicity, polarity, charge, and molecular volume.
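Two of these descriptors, GC content and amino-acid composition, are simple enough to sketch directly (the sequences below are toy examples):

```python
from collections import Counter

# Simple sequence-level descriptors of the kind mentioned above.
def gc_content(dna):
    """Fraction of G and C bases in a DNA sequence."""
    return (dna.count("G") + dna.count("C")) / len(dna)

def aa_composition(protein):
    """Percentage of each amino acid in a protein sequence."""
    counts = Counter(protein)
    return {aa: 100 * n / len(protein) for aa, n in counts.items()}

print(gc_content("ATGCGC"))     # 4 of the 6 bases are G or C
print(aa_composition("MKKM"))   # {'M': 50.0, 'K': 50.0}
```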

Beyond the main categories above, there are some other unconventional ways to represent sequences. Chaos Game Representation (CGR) [44] maps DNA sequences to points in a 2D plane, creating distinctive image patterns for downstream analysis.

Structural Representations of Biomacromolecules

A protein’s complex structure determines its functions and specificities [28]. Simply knowing the linear sequence of residues is often not enough to fully understand a biomolecule’s function or mechanism (the so-called sequence-structure gap).

Structures tend to be more conserved than sequences [28, 45]. Two proteins might have very divergent sequences but still fold into highly similar 3D structures [46]. Solving the structure of a biomolecule can give insights that we wouldn’t get just from the sequence alone.

Granularity and Dimensionality Control

A single biomolecule may contain on the order of 10³-10⁵ atoms (or even more). Encoding every atom and bond explicitly into numerical form produces prohibitively high-dimensional, sparse representations.

Adding dimensions to the representation can quickly run into the curse of dimensionality. As we increase the dimensionality of our data, the “space” we’re asking our model to cover grows exponentially. Data points become sparser relative to that space (it’s like having a few needles in an ever-expanding haystack). This sparsity means a model might need vastly more training examples to find reliable patterns. Meanwhile, the computational cost of processing the data often grows polynomially or worse with dimensionality.

Not every atom is equally important for the question we care about, so we often adjust the granularity of our representation or reduce dimensionality in smart ways (such data often has a lower-dimensional effective representation that describes the system without significant performance loss [47]):

  • For proteins, each amino acid can be represented by the coordinates of just its alpha carbon (Cα). For nucleic acids, one might take each nucleotide and represent it by the position of its phosphate group or by the center of its base or sugar ring.
  • Another example of controlled granularity comes from how AlphaFold [49] represents a protein using backbone rigid groups (or frames). Essentially, for each amino acid, a small set of main-chain atoms, typically the N, Cα, and C (and sometimes O), is treated as a unit. The relative geometry of these atoms is almost fixed (covalent bond lengths and angles don’t vary significantly), so the unit can be treated as a rigid block. Instead of tracking each atom separately, the model tracks the position and orientation of that entire block in space, reducing the risks associated with excessive degrees of freedom [28] (i.e., errors from the internal movement of atoms within a residue).

  • If we have a large set of protein structures (or a long molecular dynamics trajectory), it can be useful to cluster those conformations into a few representative states. This is often done when building Markov state models: by clustering continuous states into a finite set of discrete “metastable” states, we can simplify a complex energy landscape into a network of a few states connected by transition probabilities.

Many coarse-grained molecular dynamics force fields, such as MARTINI [50] and UNRES [51], have been developed to represent structural details using fewer particles.

  • To capture side-chain effects without modelling all internal atoms or adding excessive degrees of freedom, a common approach is to represent each side-chain with a single point, typically its center of mass [52]. Such side-chain centroid models are often used in conjunction with backbone models.
  • The 3Di Alphabet introduced by Foldseek [53] defines a 3D interaction “alphabet” of 20 states that describe protein tertiary interactions. Thus, a protein’s 3D structure can be converted into a sequence over these 20 symbols, and two structures can be aligned by aligning their 3Di sequences.
  • We may spatially crop or focus on just part of a biomolecule. For instance, if we’re studying how a small drug molecule binds to a protein (say, in a dataset like PDBBind [54], which is full of protein-ligand complexes), we may only feed the pockets and drugs into our model.
  • Combining different granularities or modalities of data.

Point Cloud

We could model a biomacromolecule as a massive 3D point cloud of every atom (or residue). As noted earlier, the same limitations apply.

Distance Matrix

A distance matrix records all pairwise distances between certain key atoms (for proteins, commonly the Cα of each amino acid). It is inherently invariant to rotation and translation, since rigid motions preserve all pairwise distances. A contact map simplifies this further by indicating only which pairs of residues are “close enough” to be in contact. However, both representations lose directional information, so not all structural details can be recovered from them alone.
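A sketch with three toy Cα positions spaced at the typical ~3.8 Å backbone distance, using an 8 Å contact threshold (a common but not universal choice):

```python
import math

# Pairwise Cα distance matrix and a contact map derived from it.
# Coordinates and the 8.0 Å threshold are illustrative.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

n = len(ca_coords)
dist = [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)] for i in range(n)]
contact = [[1 if dist[i][j] <= 8.0 and i != j else 0 for j in range(n)] for i in range(n)]

# Rotating or translating all coordinates together leaves every pairwise
# distance (and hence both matrices) unchanged.
print(dist[0][2])   # 7.6
print(contact[0])   # [0, 1, 1]
```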

Graph

Just as with small molecules, we can use graphs for macromolecular structures [55,56]. Instead of atoms, each node might represent a larger unit (see Granularity and Dimensionality Control). To improve interpretability, additional knowledge, like residue descriptors and known interaction networks within a protein, may also be incorporated in nodes and edges. Note that the graph representation for biomacromolecules inherits many of the advantages we discussed for small molecules.

For macromolecules, edges are often pruned to keep the graph sparse and manageable in size: essentially a form of magnification that focuses on local substructures, while far-apart relationships are treated as background context.

General dimensionality reduction methods such as PCA, t-SNE and UMAP are also widely used to analyze the high-dimensional structural data of macromolecules. While they don’t give us representations for computation in the same sense as the others we’ve discussed, they help project complex data into lower dimensions (e.g., for visualization or insights).

Latent Space

When we train a model (especially a generative model), it often learns to encode data into a compressed internal representation. This representation lives in a lower-dimensional space known as the latent space. Think of London’s urban layout, dense and intricate; the latent space is like a “map” that captures its essence in a simplified form.

Latent spaces are usually not directly interpretable, but we can explore them by seeing how changes in latent variables map to changes in the output. In molecular generation, if a model maps molecules into a latent space, we can take two molecules (say, as two points in that space) and generate a path between them. Ochiai et al. [57] did this by taking two known molecules as endpoints, interpolating between their latent representations, and decoding the intermediate points. The result was a set of new molecules that blended features of both originals: hybrids that might have mixed properties of the two.
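The interpolation step itself is simple. Here is a sketch assuming we already have two latent vectors; in a real pipeline, a trained decoder (not shown) would map each intermediate point back to a molecule:

```python
# Linear interpolation between two latent vectors, in the spirit of the
# approach described above. The vectors here are toy 2D examples; real
# latent spaces typically have hundreds of dimensions.
def interpolate(z_a, z_b, steps=5):
    """Return `steps` evenly spaced points on the line from z_a to z_b."""
    return [
        [a + (b - a) * t / (steps - 1) for a, b in zip(z_a, z_b)]
        for t in range(steps)
    ]

path = interpolate([0.0, 0.0], [1.0, 2.0], steps=5)
print(path[0], path[2], path[4])  # [0.0, 0.0] [0.5, 1.0] [1.0, 2.0]
```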

 

]]>
5 Routine Tasks That ChatGPT Can Handle for Data Scientists https://mistergemba.com/5-routine-tasks-that-chatgpt-can-handle-for-data-scientists/?utm_source=rss&utm_medium=rss&utm_campaign=5-routine-tasks-that-chatgpt-can-handle-for-data-scientists Tue, 05 Aug 2025 22:00:44 +0000 https://mistergemba.com/?p=240917 According to the data science report by Anaconda, data scientists spend nearly 60%...]]>

According to the data science report by Anaconda, data scientists spend nearly 60% of their time on cleaning and organizing data. These routine, time-consuming tasks are ideal candidates for ChatGPT to take over.

In this article, we will explore five routine tasks that ChatGPT can handle if you use the right prompts, including cleaning and organizing data. To show how it works in practice, we’ll use a real data project from Gett, a London black-taxi app similar to Uber, that is used in their recruitment process.

Case Study: Analyzing Failed Ride Orders from Gett

In this data project, Gett asks you to analyze failed ride orders by examining key matching metrics to understand why some customers did not successfully get a car.

Here is the data description.

Now, let’s explore it by uploading the data to ChatGPT.

In the next five steps, we will walk through the routine tasks that ChatGPT can handle in a data project. The steps are shown below.

Step 1: Data Exploration and Analysis

In data exploration, we use the same functions every time, like head(), info(), or describe().
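For reference, those calls look like this on a toy pandas frame; the column names below merely mimic the Gett schema mentioned later in this article and are not the real dataset:

```python
import pandas as pd

# Toy frame shaped loosely like the Gett orders data (illustrative values).
df = pd.DataFrame({
    "order_status_key": [4, 9, 4],
    "m_order_eta": [60.0, None, 120.0],
})

print(df.head())      # first rows of the data
print(df.describe())  # summary statistics of numeric columns
df.info()             # dtypes and non-null (missing-value) counts
```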

When we ask ChatGPT, we’ll include the key functions in the prompt. We’ll also paste the project description and attach the dataset.

We will use the prompt below. Just replace the text inside the square brackets with the project description. You can find the project description here:

Here is the output.

As you can see, ChatGPT summarizes the dataset by highlighting key columns, missing values, and then creates a correlation heatmap to explore relationships.

Step 2: Data Cleaning

Both datasets contain missing values.

Let’s write a prompt to address this:

“Clean this dataset: identify and handle missing values appropriately (e.g., drop or impute based on context). Provide a summary of the cleaning steps.”

Here is the summary of what ChatGPT did:

ChatGPT converted the date column, dropped invalid orders, and imputed missing values in the m_order_eta column.

Step 3: Generate Visualizations

To make the most of your data, it is important to visualize the right things. Instead of generating random plots, we can guide ChatGPT by providing a link to a trusted source, an approach known as Retrieval-Augmented Generation (RAG).

We will use this article. Here is the prompt:

“Before generating visualizations, read this article on choosing the right plots for different data types and distributions: [LINK]. Then, show the most suitable visualizations for this dataset, explain why each was selected, and produce the plots in this chat by running code on the dataset.”

Here is the output.

We have six different graphs that we produced with ChatGPT.


For each visualization, you will see why it was selected, the graph itself, and an explanation of what it shows.

Step 4: Make your Dataset Ready for Machine Learning

Now that we have handled missing values and explored the dataset, the next step is to prepare it for machine learning. This involves steps like encoding categorical variables and scaling numerical features.

Here is our prompt.

Prepare this dataset for machine learning: encode categorical variables, scale numerical features, and return a clean DataFrame ready for modeling. Briefly explain each step.

 

Here is the output.

Now your features have been scaled and encoded, so your dataset is ready to apply a machine learning model.
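Under the hood, the two preparation steps amount to something like the hand-rolled sketch below; in a real project you would reach for scikit-learn’s OneHotEncoder and a scaler instead, and the example values here are illustrative:

```python
# Minimal sketch of the two preparation steps: one-hot encoding for a
# categorical column and min-max scaling for a numerical one.
def one_hot_encode(values):
    """Map each value to an indicator vector over the sorted categories."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def min_max_scale(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(one_hot_encode(["client", "system", "client"]))  # [[1, 0], [0, 1], [1, 0]]
print(min_max_scale([10, 20, 30]))                     # [0.0, 0.5, 1.0]
```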

Step 5: Applying a Machine Learning Model

Let’s move on to machine learning modeling. We will use the following prompt structure to apply a basic machine learning model.

Use this dataset to predict [target variable]. Apply [model type] and report machine learning evaluation metrics like [accuracy, precision, recall, F1-score]. Use only relevant 5 features and explain your modeling steps.

Let’s update this prompt based on our project.

Use this dataset to predict order_status_key. Apply a multiclass classification model (e.g., Random Forest), and report evaluation metrics like accuracy, precision, recall, and F1-score. Use only the 5 most relevant features and explain your modeling steps.

 

Now, paste this into the ongoing conversation and review the output.

Here is the output.

As you can see, the model performed well, perhaps too well?

Bonus: Gemini CLI

 
Google has launched Gemini CLI, an open-source agent that you can interact with from your terminal. The free tier allows 60 model requests per minute and 1,000 requests per day.

Besides ChatGPT, you can also use Gemini CLI to handle routine data science tasks such as cleaning and exploration, and even to build a dashboard that automates them.

The Gemini CLI provides a straightforward command-line interface and is available at no cost. Let’s start by installing it using the code below.

sudo npm install -g @google/gemini-cli

 

After running the code above, open your terminal and paste the following code to start building with it:

gemini

 

Once you run the commands above, you’ll see the Gemini CLI as shown in the screenshot below.

Gemini CLI lets you run code, ask questions, or even build apps directly from your terminal. In this case, we will use Gemini CLI to build a Streamlit app that automates everything we’ve done so far, EDA, cleaning, visualization, and modeling.

To build a Streamlit app, we will use a prompt that covers all steps. It’s shown below.

Build a Streamlit app that automates EDA and data cleaning, creates automatic data visualizations, prepares the dataset for machine learning, and applies a machine learning model after the user selects the target variable.

Step 1 – Basic EDA:
• Display .head(), .info(), and .describe()
• Show missing values per column
• Show correlation heatmap of numerical features
Step 2 – Data Cleaning:
• Detect columns with missing values
• Handle missing data appropriately (drop or impute)
• Display a summary of cleaning actions taken
Step 3 – Auto Visualizations
• Before plotting, use these visualization principles:
• Use histograms for numerical distributions
• Use bar plots for categorical distributions
• Use boxplots or violin plots to compare categories
• Use scatter plots for numerical relationships
• Use correlation heatmaps for multicollinearity
• Use line plots for time series (if applicable)
• Generate the most relevant plots for this dataset
• Explain why each plot was chosen
Step 4 – Machine Learning Preparation:
• Encode variables
• Scale numerical features
• Return a clean DataFrame ready for modeling
Step 5 – Apply Machine Learning Model:
• Offer the target variable to the user.
• Apply multiple machine learning models.
• Report evaluation metrics.
Each step should be displayed in a different tab. Run the Streamlit app after you build it.

 

It will prompt you for permission when creating the directory or running code on your terminal.

After a few approval steps, the Streamlit app will be ready, as shown below.

Now, let’s test it.

 


Final Thoughts

 
In this article, we first used ChatGPT to handle routine tasks, such as data cleaning, exploration, and data visualization. Next, we went one step further by using it to prepare our dataset for machine learning and to apply machine learning models.

Finally, we used Gemini CLI to create a Streamlit dashboard that performs all of these steps with just a click.

To demonstrate all of this, we have used a data project from Gett. Although AI is not yet entirely reliable for every task, you can leverage it to handle routine tasks, saving you a lot of time.

]]>
10 Python Libraries Every MLOps Engineer Should Know https://mistergemba.com/10-python-libraries-every-mlops-engineer-should-know/?utm_source=rss&utm_medium=rss&utm_campaign=10-python-libraries-every-mlops-engineer-should-know Tue, 05 Aug 2025 21:23:19 +0000 https://mistergemba.com/?p=240910 Learn about 10 essential Python libraries that support core MLOps tasks like...]]>

Learn about 10 essential Python libraries that support core MLOps tasks like versioning, deployment, and monitoring.

While machine learning continues to find applications across domains, the operational complexity of deploying, monitoring, and maintaining models continues to grow. And the difference between successful and struggling ML teams often comes down to tooling.
In this article, we go over essential Python libraries that address the core challenges of MLOps: experiment tracking, data versioning, pipeline orchestration, model serving, and production monitoring. Let’s get started!
 

1. MLflow: Experiment Tracking and Model Management

 
What it solves: Keeping track of hundreds of model runs and their results.

How it helps: When you’re tweaking hyperparameters and testing different algorithms, keeping track of what worked becomes impossible without proper tooling. MLflow acts like a lab notebook for your ML experiments. It captures your model parameters, performance metrics, and the actual model artifacts automatically. The best part? You can compare any two experiments side by side without digging through folders or spreadsheets.

What makes it useful: Works with any ML framework, stores everything in one place, and lets you deploy models with a single command.

Get started: MLflow Tutorials and Examples

2. DVC: Data Version Control

 
What it solves: Managing large datasets and complex data transformations.

How it helps: Git breaks when you try to version control large datasets. DVC fills this gap by tracking your data files and transformations separately while keeping everything synchronized with your code. Think of it as a better Git that understands data science workflows. You can recreate any experiment from months ago just by checking out the right commit.

What makes it useful: Integrates well with Git, works with cloud storage, and creates reproducible data pipelines.

Get started: Get Started with DVC

3. Kubeflow: ML Workflows on Kubernetes

 
What it solves: Running ML workloads at scale without becoming a Kubernetes expert.

How it helps: Kubernetes is powerful but complex. Kubeflow wraps that complexity in ML-friendly abstractions. You get distributed training, pipeline orchestration, and model serving without wrestling with YAML files. It’s particularly valuable when you need to train large models or serve predictions to thousands of users.

What makes it useful: Handles resource management automatically, supports distributed training, and includes notebook environments.

Get started: Installing Kubeflow

4. Prefect: Modern Workflow Management

 
What it solves: Building reliable data pipelines with less boilerplate code.

How it helps: Airflow can sometimes be verbose and rigid. Prefect, however, is much easier for developers to get started with. It handles retries, caching, and error recovery automatically. The library feels more like writing regular Python code than configuring a workflow engine. It’s particularly good for teams that want workflow orchestration without the learning curve.

What makes it useful: Intuitive Python API, automatic error handling, and modern architecture.

Get started: Introduction to Prefect

5. FastAPI: Turn Your Model Into a Web Service

 
What it solves: Building production-ready APIs for model serving.

How it helps: Once your model works, you need to expose it as a service. FastAPI makes this straightforward. It automatically generates documentation, validates incoming requests, and handles the HTTP plumbing. Your model becomes a web API with just a few lines of code.

What makes it useful: Automatic API documentation, request validation, and high performance.

Get started: FastAPI Tutorial & User Guide

6. Evidently: ML Model Monitoring

 
What it solves: Monitoring model performance and detecting drift.

How it helps: Models degrade over time. Data distributions shift. Performance drops. Evidently helps you catch these problems before they impact users. It generates reports showing how your model’s predictions change over time and alerts you when data drift occurs. Think of it as a health check for your ML systems.

What makes it useful: Pre-built monitoring metrics, interactive dashboards, and drift detection algorithms.

Get started: Getting Started with Evidently AI

7. Weights & Biases: Experiment Management

 
What it solves: Tracking experiments, optimizing hyperparameters, and collaborating on model development.

How it helps: When multiple devs work on the same model, experiment tracking becomes all the more important. Weights & Biases provides a central place for logging experiments, comparing results, and sharing insights. It includes hyperparameter optimization tools and integrates with popular ML frameworks. The collaborative features help teams avoid duplicate work and share knowledge.

What makes it useful: Automatic experiment logging, hyperparameter sweeps, and team collaboration features.

Get started: W&B Quickstart

8. Great Expectations: Data Quality Assurance

 
What it solves: Data validation and quality assurance for ML pipelines.

How it helps: Bad data breaks models. Great Expectations helps you define what good data looks like and automatically validates incoming data against these expectations. It generates data quality reports and catches issues before they reach your models. Think of it as unit tests for your datasets.

What makes it useful: Declarative data validation, automatic profiling, and comprehensive reporting.

Get started: Introduction to Great Expectations
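To make the “unit tests for data” idea concrete, here is a hand-rolled miniature of the concept. Great Expectations itself offers a much richer, declarative API; the column names and checks below are purely illustrative.

```python
# Toy data validation: run each expectation (a predicate) over every row
# and collect the failures, much like unit tests for a dataset.
def validate(rows, expectations):
    failures = []
    for column, check in expectations.items():
        for i, row in enumerate(rows):
            if not check(row[column]):
                failures.append((i, column, row[column]))
    return failures

rows = [{"age": 34, "country": "DE"}, {"age": -1, "country": "DE"}]
expectations = {
    "age": lambda v: 0 <= v <= 120,           # ages must be plausible
    "country": lambda v: isinstance(v, str),  # country codes are strings
}
print(validate(rows, expectations))  # [(1, 'age', -1)]
```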

9. BentoML: Package and Deploy Models Anywhere

 
What it solves: Standardizing model deployment across different platforms.

How it helps: Every deployment target has different requirements. BentoML abstracts these differences by providing a unified way to package models. Whether you’re deploying to Docker, Kubernetes, or cloud functions, BentoML handles the packaging and serving infrastructure. It supports models from different frameworks and optimizes them for production use.

What makes it useful: Framework-agnostic packaging, multiple deployment targets, and automatic optimization.

Get started: Hello World Tutorial | BentoML

10. Optuna: Automated Hyperparameter Tuning

 
What it solves: Finding optimal hyperparameters without manual guesswork.

How it helps: Hyperparameter tuning is time-consuming and often done poorly. Optuna automates this process using sophisticated optimization algorithms. It prunes unpromising trials early and supports parallel optimization. The library integrates with popular ML frameworks and provides visualization tools to understand the optimization process.

What makes it useful: Advanced optimization algorithms, automatic pruning, and parallel execution.
Get started: Optuna Tutorial
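For intuition, here is the naive baseline that Optuna improves upon: plain random search over one hyperparameter, with a stand-in objective function. Optuna adds smarter samplers, early pruning, and parallelism on top of this idea.

```python
import random

# Naive random search over a learning rate. The objective is a stand-in
# for a real training run; its score peaks at learning_rate = 0.1.
random.seed(42)  # fixed seed so the search is reproducible

def objective(learning_rate):
    return -(learning_rate - 0.1) ** 2

best = max((random.uniform(0.001, 1.0) for _ in range(100)), key=objective)
print(round(best, 2))
```

With 100 random trials, the best candidate lands close to the true optimum of 0.1; Optuna reaches comparable or better results with far fewer evaluations by sampling adaptively.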

Wrapping Up

 
These libraries address different aspects of the MLOps pipeline, from experiment tracking to model deployment. Start with the tools that address your most pressing challenges, then gradually expand your toolkit as your MLOps maturity increases.

Most successful MLOps implementations combine 3-5 of these libraries into a cohesive workflow. Consider your team’s specific needs, existing infrastructure, and technical constraints when selecting your toolkit.
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

]]>
Does the Code Work or Not?  https://mistergemba.com/does-the-code-work-or-not/?utm_source=rss&utm_medium=rss&utm_campaign=does-the-code-work-or-not Tue, 05 Aug 2025 20:57:42 +0000 https://mistergemba.com/?p=240880 A common misconception about the working state of code in data, AI...]]>

A common misconception about the working state of code in data, AI or software engineering fields

Recently, I wrote a mid-year reflection on AI developments, in which I mentioned the reported impact of AI on the unemployment rate for recent graduates, referencing a New York Times (NYT) article. Having read the article, and with its point well taken, there was one sentence in it that stuck with me: 

“Does the code work or not?” 

To avoid taking the sentence out of context, let me share the background on how the author arrived at it. 

The timely NYT story begins by discussing AI displacing the need for entry-level professionals, where the author notes that automating white-collar work has been a long-held ambition for many executives. However, the technology itself wasn’t sufficiently mature to manage the more complex and technical aspects of many jobs. At least, not until recent AI advancements, which prompted him to write:

That is starting to change, especially in fields such as software engineering, where there are clear markers of success and failure. (Such as: Does the code work or not?)

In the reporter’s defence, there are probably writing guidelines and principles that we don’t see, leaving “limited space” to explain the “code work/not work” part better. That said, no hard feelings intended with my callout here. Nonetheless, for all the non-tech people out there, I feel a need to elaborate on a common misconception: not everything we (tech people) produce is measurable in ‘1s’ or ‘0s’. 🙂

Before I deepen my explanation, let me share a story that’s been on my mind a lot lately.

More than 14 years ago, I was doing an internship on a High-Voltage Laboratory construction project at a transformer company. Being a part of the so-called Steerco, i.e., steering committee, my tasks were to push the project forward by resolving legal and budget-related hiccups. As is often the case on construction projects, budgetary and legal problems are not so unusual when you have different professions coming together — architects, civil engineers, electrical and mechanical engineers — to build a specialised manufacturing plant with a Faraday cage.

So, five months into the project (and with me a mere seven months out of Uni), the project sponsors (CEOs), who were engineers by profession, noticed that the actual cost of the facade was above the estimated budget. The cost of materials listed in the preliminary budget, which significantly impacted the overall project cost, was diverging by (if I recall correctly) 40%.

Because of this, the CEOs instructed my colleague and me to inform the architect, who had over 20 years of experience, that the costs were unacceptable and that a change in materials was expected to fit the project budget.

You can imagine what happened next: I was shot down in a second. 

I will never forget her saying:

We are not at a bazaar here; there is a reason why specific materials were picked for the facade. Just because everyone today has built their own storage unit at home, it doesn’t mean they are an architect who knows how to design a specialised building according to the relevant standards.

Then she added “…isolation, something, building physics, something, sub-surface, something, preliminary vs. main design, something, something…”

Of course, I relayed this message, “something, something” included, to the CEOs, and what happened next was that I was shot down for the second time that day with the counterstatement:

True, we are not at a bazaar, but we don’t need to pay the price for someone else’s mistake. The responsible party should take accountability and find a solution that fits our budget.

[So much for “don’t kill the messenger”, ha? ;)]

You see where I am going with this story. In the generative AI era, everyone has managed to experience what it means to be a “coder”. However, not everyone is a software engineer, nor willing to pay the salary of one if life is just as good with a “sub-optimal DIY storage unit that only you and your family members will use.” 😉

Returning to the NYT article, you can now understand why the sentence, “Does the code work or not?” stuck with me. In my head, this sentence sounded as if the coding task could be mostly simplified to: “If it (among other straightforward, binary outcomes) compiles, you can ship it.”

Again, while this was only an example (and not a false one), at least a dozen more questions should be asked to get a fuller picture and arrive at a true “working state” for any code change or implementation, such as:

…Is the current (data) architecture supporting this change? 
…Is the change approved by the legal and security team? 
…Is the code implemented according to development practices?
…Is the change performant?
…Is it end-to-end tested? 
…Is the CI/CD process in place? 
…Is the change affecting other features?
…Is the affected business team informed of the changes? 
…Is the change causing higher costs? 
…Is the change bringing value?

If we provide answers to all the above questions, there are still more queries to consider before we can conclude “the code works.” For example:

  • Did the project budget get approved for this development?
  • Who will act as a SPOC for this development? 
  • What’s the optimal balance between feature completeness and time to market?
  • How does this impact on-call responsibilities? 
  • How challenging will it be to retire this development? 
  • To what extent does this solution scale (with growing data or users)? 
  • What’s the rollback strategy if issues arise?
  • What documentation and knowledge transfer is needed?
  • …and many more…

This is why you’ll often see tech folks passionately posting or re-sharing statements like “Coding was never the problem.” It really never was, and the real problems never have binary answers.

In other words, the challenges I’ve seen and faced involved inheriting 10–20 years of accumulated technical debt, which resulted in dedicating months, sometimes years, of human resources to maintaining fragile legacy systems, all while attempting to modernise business processes that relied on outdated technology deeply rooted in organisational decision-making.

Pivoting back to architects, I have one more story to share at the end, and it concerns a recent conversation I had with my Uni roommate, who is — you guessed it — an architect.

The two of us were talking about her experience with generative AI, where she explained how she uploaded parcel plans and prompted an LLM to deliver an initial housing project with the “famous” budget. Her observation was that while the AI output looked “beautiful,” it was completely flawed on the technical side, and it only served to generate a couple of design ideas. Then she mentioned something interesting: 

You see, this was useful for me, but if my investor took the same AI tool and ran the same plan through it with the same prompt, he would never be able to deliver a standing building.

I smiled at her statement because it’s exactly what I think for now about AI delivering complete software. You can and will get the code that “works,” but without an experienced and/or knowledgeable data/AI/software engineer’s oversight and expert fixes, you’re heavily risking accumulating future technical debt.

The principles of scalability, security, and maintainability, or the “building physics” of software, are not (yet) manageable by leveraging (only) AI, and this is where the role of technical experts lies. 

There’s no question that generative AI is a powerful way to get code to a “working” state, and we should all be using it. But that isn’t where the real value lies in the development process. The real value is in the process of making sure that “working” code is doing what it is supposed to do, and has existing processes around it for avoiding problems in the dedicated (data) platforms.

With this in mind, I can only conclude that a “yes, but” will usually be the answer to “work/not work”-style questions (in any domain).👇🏼

Thank You for Reading!

Written By

]]>
Top Skills Data Scientists Should Learn in 2025 https://mistergemba.com/top-skills-data-scientists-should-learn-in-2025/?utm_source=rss&utm_medium=rss&utm_campaign=top-skills-data-scientists-should-learn-in-2025 Tue, 05 Aug 2025 20:23:32 +0000 https://mistergemba.com/?p=240871 Forget what you knew — these underrated data science skills will define...]]>

Forget what you knew — these underrated data science skills will define who wins for the rest of 2025.

 

Introduction

 
I understand that with the pace at which data science is growing, it’s getting harder for data scientists to keep up with all the new technologies, demands, and trends. If you think that knowing Python and machine learning will get the job done for you in 2025, then I’m sorry to break it to you, but it won’t.

To have a good chance in this competitive market, you will have to go beyond the basic skills.

 

I’m not only referring to tech skills but also to soft skills and business understanding. You might have come across such articles before, but trust me, this is not a clickbait article. I have actually done the research to highlight areas that are often overlooked. Please note that these recommendations are purely based on industry trends, research papers, and insights I gathered from talking to a few experts. So, let’s get started.

Technical Skills

 

// 1. Graph Analytics

Graph analytics is super underrated but so useful. It helps you understand relationships in data by turning them into nodes and edges. Graphs can be applied to fraud detection, recommendation systems, social networks, or anywhere things are connected. Most traditional machine learning models struggle with relational data, but graph techniques make it easier to catch patterns and outliers. Companies like PayPal use graph analytics to identify fraudulent transactions by analyzing relationships between accounts. Tools like Neo4j, NetworkX, and Apache AGE can help you visualize and work with this kind of data. If you’re serious about going deeper into areas like finance, cybersecurity, and e-commerce, this is one skill that’ll make you stand out.
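As a toy illustration of the fraud-detection idea (made-up accounts and transfers, using NetworkX rather than a graph database):

```python
import networkx as nx

# Toy transaction graph: accounts as nodes, money transfers as edges (fabricated data)
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
    ("mule", "alice"), ("mule", "bob"), ("mule", "carol"), ("mule", "dave"),
])

# An account linked to unusually many others is one simple fraud signal
centrality = nx.degree_centrality(G)
flagged = max(centrality, key=centrality.get)
print(flagged)  # → mule
```

Real systems use far richer signals (edge weights, timestamps, community detection), but the point stands: the suspicious account only pops out once you look at the relationships, not the rows.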

 

// 2. Edge AI Implementation

Edge AI is basically about running machine learning models directly on devices without relying on cloud servers. It’s super relevant now that everything from watches to tractors is getting smart. Why does this matter? It means faster processing, more privacy, and less dependency on internet speed. For example, in manufacturing, sensors on machines can predict failures before they happen. John Deere uses it to detect crop diseases in real-time. In healthcare, wearables process data instantly without needing a cloud server. If you’re interested in Edge AI, look into TensorFlow Lite, ONNX Runtime, and protocols like MQTT and CoAP. Also, think about Raspberry Pi and low-power optimization. According to Fortune Business Insights, the Edge AI market will grow from USD 27.01 billion in 2024 to USD 269.82 billion by 2032, so yeah, it’s not just hype.
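One building block behind those toolchains is easy to sketch in plain NumPy: post-training quantization maps float32 weights to int8 so the model is 4x smaller and runs on integer hardware. The random weights below stand in for a trained layer; real tools like TensorFlow Lite do this per layer, with calibration data:

```python
import numpy as np

# Fake layer weights standing in for a trained model
weights = np.random.default_rng(0).normal(size=1000).astype(np.float32)

# Symmetric int8 quantization: one scale maps the float range onto [-127, 127]
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 4x smaller than float32
restored = quantized.astype(np.float32) * scale

# Round-to-nearest keeps the error within half a quantization step
print(float(np.abs(weights - restored).max()) <= scale / 2 + 1e-6)  # → True
```

The trade-off is exactly what you would expect: a small, bounded loss of precision in exchange for memory, speed, and battery life on the device.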

 

// 3. Algorithm Interpretability

Let’s be real, building a powerful model is cool, but if you can’t explain how it works? Not that cool anymore. Especially in high-stakes industries like healthcare or finance, where explainability is a must. Tools like SHAP and LIME help break down decisions from complex models. For example, in healthcare, interpretability can highlight why an AI system flagged a patient as high-risk, which is critical for both ethical AI use and regulatory compliance. And sometimes it’s better to build something inherently interpretable like decision trees or rule-based systems. As Cynthia Rudin, an AI researcher at Duke University, puts it: “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.” In short, if your model affects real people, interpretability isn’t optional, it’s essential.
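SHAP and LIME sit on top of an already-trained model, but the “inherently interpretable” route from the quote can be as simple as a shallow tree whose feature importances you read off directly. A sketch using scikit-learn’s built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# A shallow tree is inherently interpretable: few splits, readable importances
data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Rank the features the model actually relied on
ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                key=lambda pair: -pair[1])
for name, importance in ranked[:3]:
    print(f"{name}: {importance:.2f}")
```

With three levels of splits, you can also print the full tree (`sklearn.tree.export_text`) and hand the whole decision logic to a regulator, something no black-box explainer can fully replace.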

 

// 4. Data Privacy, Ethics, and Security

This stuff isn’t just for legal teams anymore. Data scientists need to understand it too. One wrong move with sensitive data can lead to lawsuits or fines. With privacy laws like CCPA and GDPR, it’s now expected that you know about techniques like differential privacy, homomorphic encryption, and federated learning. Ethical AI is also getting serious attention. In fact, 78% of surveyed consumers believe companies must commit to ethical AI standards, and 75% say trust in a company’s data practices directly influences their purchasing decisions. Tools like IBM’s Fairness 360 can help you test bias in datasets and models. TL;DR: If you’re building anything that uses personal data, you better know how to protect it, and explain how you’re doing that.
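To make one of those techniques concrete, here is the classic differential privacy building block: answering a count query with calibrated Laplace noise, so no single person’s presence in the data can be inferred from the answer (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

true_count = 1000    # e.g. number of users matching some sensitive query
sensitivity = 1.0    # adding/removing one person changes the count by at most 1
epsilon = 0.5        # privacy budget: smaller means more privacy, more noise

# Laplace mechanism: noise scale = sensitivity / epsilon
noisy_count = true_count + rng.laplace(scale=sensitivity / epsilon)
print(round(noisy_count))
```

The answer stays useful (off by a handful, not hundreds), yet each released query spends some of the epsilon budget, which is why production systems track and cap it.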

 

// 5. AutoML

AutoML tools are becoming a solid asset for any data scientist. They automate tasks like model selection, training, and hyperparameter tuning, so you can focus more on the actual problem, rather than getting lost in repetitive tasks. Tools like H2O.ai, DataRobot, and Google AutoML help speed things up a lot. But don’t get it twisted, AutoML isn’t about replacing you, it’s about boosting your workflow. AutoML is a copilot, not the pilot. You still need the brains and context, but this can handle the grunt work.
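If you want a feel for what those platforms automate, scikit-learn’s randomized search is the same select-tune-validate loop in miniature (toy dataset and search space for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# AutoML in miniature: sample configurations, cross-validate, keep the best
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [25, 50, 100],
                         "max_depth": [2, 4, 8, None]},
    n_iter=5, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Full AutoML tools extend this loop across model families, feature preprocessing, and ensembling, but the copilot framing holds at every scale: the search is automated, the problem framing is still yours.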

Soft Skills

 

// 1. Environmental Awareness

This might surprise some, but AI has a carbon footprint. Training massive models takes up crazy amounts of energy and water. As a data scientist, you have a role in making tech more sustainable. Whether it’s optimizing code, choosing efficient models, or working on green AI projects, this is a space where tech meets purpose. Microsoft’s “Planetary Computer” is a great example of using AI for environmental good. As MIT Technology Review puts it: “AI’s carbon footprint is a wake-up call for data scientists.” In 2025, being a responsible data scientist includes thinking about your environmental impact as well.

 

// 2. Conflict Resolution

Data projects often involve a mix of people: engineers, product folks, business heads, and trust me, not everyone will agree all the time. That’s where conflict resolution comes in. Being able to handle disagreements without stalling progress is a big deal. It ensures that the team stays focused and moves forward as a unified group. Teams that can resolve conflicts efficiently are simply more productive. Agile thinking, empathy, and being solution-oriented are huge here.

 

// 3. Presentation Skills

You could build the most accurate model in the world, but if you can’t explain it clearly, it’s not going anywhere. Presentation skills, especially explaining complex ideas in simple terms, are what separate the great data scientists from the rest. Whether you’re talking to a CEO or a product manager, how you communicate your insights matters. In 2025, this isn’t just a “nice to have”, it’s a core part of the job.

Industry-Specific Skills

 

// 1. Domain Knowledge

Understanding your industry is key. You don’t need to be a finance expert or a doctor, but you do need to get the basics of how things work. This helps you ask better questions and build models that actually solve problems. For example, in healthcare, knowing about medical terminology and regulations like HIPAA makes a huge difference in building trustworthy models. In retail, customer behavior and inventory cycles matter. Basically, domain knowledge connects your technical skills to real-world impact.

 

// 2. Regulatory Compliance Knowledge

Let’s face it, data science is no longer a free-for-all. With GDPR, HIPAA, and now the EU’s AI Act, compliance is becoming a core skill. If you want your project to go live and stay live, you need to understand how to build with these regulations in mind. A lot of AI projects are delayed or blocked just because no one thought about compliance from the start. With 80% of AI projects in finance facing compliance delays, knowing how to make your systems auditable and regulation-friendly gives you a serious edge.

Wrapping Up

 
This was my breakdown based on the research I’ve been doing lately. If you’ve got more skills in mind or insights to add, I’d honestly love to hear them. Drop them in the comments below. Let’s learn from each other.

]]>
Why You Aren’t Getting Hired as a Data Science in 2025 https://mistergemba.com/why-you-arent-getting-hired-as-a-data-science-in-2025/?utm_source=rss&utm_medium=rss&utm_campaign=why-you-arent-getting-hired-as-a-data-science-in-2025 Tue, 05 Aug 2025 19:58:37 +0000 https://mistergemba.com/?p=240850 Some say data science is dying, while others are more concerned with...]]>

Some say data science is dying, while others are more concerned with the imminent death of their own career.

 
There was a time when data science and technology recruitment companies were thriving. However, over the years the recruitment process has changed so much that it has not only become harder to find the right talent, but companies are also putting up barriers that make it harder to recruit the right person.
 
Although you are constantly seeing data professional job vacancies opened up on LinkedIn, the sad truth is that some of these are fake. Some organisations are posting jobs to get recognition, whilst others have jobs posted just to make candidates jump through hoops only to be told no.
 

With this in mind, is data science still a solid go-to career choice moving forward? Over the years the data science market has been booming due to the value of data and businesses wanting to extract as many insights from it as possible. However, in 2025, a lot of you who are considering a career in data science may still be wondering if it’s the right choice to make. Further, for those who are already qualified and job-searching, what are the reasons why they aren’t getting hired?

There are 2 angles from which we will be looking at the job market today: Is data science still a good career choice? And why are you not getting hired as a data scientist?

Let’s first jump in and see why those who are currently looking are not getting hired and then turn our attention to those wondering about the future of data science.

 

Why You Aren’t Getting Hired

Your Resume

The sad truth is that a lot of people put a lot of effort into their resumes. If this is you, and you aren’t getting the attention you feel your skill set and experience deserve, perhaps you are falling victim to this: a lot of great talent gets hidden behind poorly structured resumes.

First things first: clutter. Although you may feel like you need to fill up all the white space possible, more often than not recruiters want to see white space. You want to mention the important things you have done, highlighted in a few words, not paragraphs and paragraphs.

Language is also important. Your choice of language will get you sussed out very quickly. For example, someone who has been riding the buzzword wave their entire career may start to find that these once-useful catch phrases no longer pull their weight, even with the use of fancy verbiage and intricate sentence structuring.

Long story short: A lot of words don’t always necessarily tell a great story. Here are some other common resume mistakes to avoid.

More to Life Than Work?

Although some employers don’t want to hear that there is more to life outside of work, the reality is that there is. And what you do outside of work reveals your personality and character, and whether you would genuinely be a good fit for the company.

However, a lot of recruiters and employers want to know your job doesn’t stop at 5 pm. For example, I have been a technical writer and content creator around the machine learning and artificial intelligence space for 4+ years now. However, outside of my 9-5, I write on Medium and — though it began with technical content — I started to go into other niches to explore my writing skills and how I can adapt to different audiences.

Let’s take a software developer for example, what are you doing to position yourself in the community or other aspects of self-development? Are you attending talks? Are you part of a community? Are you taking on speaking opportunities?

The sad truth is that recruiters do not care if you run yearly marathons or if you know how to crochet. Although it shows your character, the chances of you getting an interview based on these are slim to none. You’re not applying for college, you’re trying to build yourself a career.

Prep So You Don’t Fail

You would like to believe that candidates regularly prepare for the jobs they apply for, but it is shocking how many people think they can just wing it. The digital life we live in makes it much harder when you’re competing with 500+ applicants for your dream job. So, if you want it, make sure your application sticks.

Going back to your resume: not only do you need to ensure that your resume is simple and effective, but also that it tailors to the job that you are applying for. You also need to consider the company itself. You want to find these connection points and make sure they’re stated on your resume so that you stand out.

Give Answers That Have Value

Rather than sitting and struggling with answers that won’t have value to interview questions, prepare for this inevitability and say something that will intrigue your recruiter. The biggest mistake you can make is insulting them by spouting off useless answers, being unprepared, and wasting their time; they can see right through it.

Give them straightforward answers because that is what they want. They don’t want to have to ask you multiple questions before they are able to finesse the answer they actually want. Giving an answer — any response that has value, even if not the perfect answer — shows that you are confident in yourself and the right person for the role.

A reminder: people on the other side of the table hiring you aren’t so different and want exactly what we would want.

Data Science Career Prospects in 2025

 
Now let’s move on from the individual question of “why am I not getting hired as a data scientist right now?” and focus on something more broad: “will anyone be getting hired as a data scientist in the future?” Let’s have a look at some factors that will impact the continued success of the data science field and your place in it.

Is Data Science Still Sexy?

The choice to start a data science career is completely up to you; nobody else can make that decision for you. You want to analyse the requirements for becoming a data scientist and reflect on your ability to achieve these requirements and skills.

However, if you worry that data science is dying, I am here to tell you that it is not. Data science is set to remain a good career choice for the year 2025 and beyond. Choosing a new career means taking into consideration the security around the role and its relevance over the next 5-10 years. Although generative AI tools are dominating the market, they will not be left solely to their own devices in that time and will still need human intervention.

Do Your Interests Align?

There are three things you need to be interested in when taking on a data science career: coding, maths, and statistics. If these are not of interest to you, then you should not be considering a career in data science. I’m not saying you have to be an expert in these areas; you just need a basic interest which will drive your data science career.

Let’s start with coding. In the data science sector, one of the most popular languages you NEED to learn is Python. It is one of the easiest programming languages to learn and many data scientists use this as their main programming language. One way of testing your interest is doing short Python courses and seeing if you resonate with them. If you do, there is an interest there and you can continue to develop your Python skills.

Now, maths and statistics go hand-in-hand. You will need to use maths and statistics to analyse your data and find valuable insights to show to stakeholders. If you do not have an interest in these, you will fail to become a successful data scientist. A lot of people forget that maths and statistics are the foundations of data science, and that they will support your learning as well as your ability to put the pieces of the puzzle together.

If you are interested in a data science learning roadmap, check out A Free Data Science Learning Roadmap: For All Levels with IBM

Data Science Job Opportunities

Once you complete your data science learning roadmap and have created a portfolio of projects, the next thing you want to do is start your job-hunting process. The first section of this article covered why you might NOT be getting hired, but let’s look at more productive recommendations here.

DO PROJECTS! You need to showcase your learnt skills to your next employer and the best way to do this is by projects. Another reason why this is important is that some organisations may still be looking to hire professionals with traditional educational backgrounds such as a degree. However, if you can showcase your skills and prove that you have the same knowledge as somebody who went to university for 4 years, you may have a higher chance of landing a job.

Your choice of job depends on various factors, such as location, company, salary, etc. A lot of data science roles still remain WFH jobs, making it easier for more and more people around the world to access these roles from the comfort of their homes.

According to Glassdoor, the average data scientist base salary is USD 117K/yr, with a range from USD 95K to USD 145K/yr. As of January 2nd, 2025, there are 1,025 results for data science roles in the UK and 4,641 in the United States.

Wrapping Up

 
If you enjoy learning data science and have an interest in the typical day-to-day responsibilities of a data scientist, then a career in data science may be for you. The only way you will know this is by learning the fundamentals and putting your skills into practice with real-life projects. Data science is not a dying career; however, make sure it is a good career choice for you before you enter it.

Also keep in mind some of the hard truths covered in this article: if you’re looking for a new role in 2025, take the pointers in the Why You Aren’t Getting Hired section to heart, as they can be the difference between you getting your dream job and having to keep looking.

Good luck out there!

]]>
5 Common Data Science Resume Mistakes to Avoid https://mistergemba.com/5-common-data-science-resume-mistakes-to-avoid/?utm_source=rss&utm_medium=rss&utm_campaign=5-common-data-science-resume-mistakes-to-avoid Tue, 05 Aug 2025 19:33:17 +0000 https://mistergemba.com/?p=240843 Want to create data science resumes that land interview calls and jobs?...]]>

Want to create data science resumes that land interview calls and jobs? Avoid these common mistakes.

 
Having an effective and impressive resume is important if you want to land a data science role. However, many candidates make mistakes that prevent their resume from standing out and landing interview calls.
 

This guide will walk you through five common resume mistakes that aspiring data scientists often make. No worries, we’ll also go over actionable tips on how to avoid them.

Let’s get started.

 

1. Not Showcasing Practical and Impressive Projects

 

A major pitfall in many data science resumes is the absence of useful projects. While having certifications and degrees is important, hiring managers want to see how you apply your skills to real-world problems.

Why this matters

  • Without strong projects, recruiters are often left guessing if you can apply theoretical knowledge to real problems.
  • Projects are the best way to show the impact of your skills, such as how you’ve improved business processes or answered business questions.

How to avoid

  • Include at least 3-5 diverse projects on your resume. Work with real-world datasets. Focus on building and deploying machine learning models. And link to the project in your portfolio.
  • Be sure to highlight the tools you used (Python, R, and SQL), the libraries you’ve used, the size of the dataset, and specific results or business impacts.
  • Use metrics wherever possible. For example, “Built a predictive model that reduced customer churn by 15% using random forest algorithms on a dataset of 100K customer records.”

If you’re a beginner with no previous data science experience, start by contributing to open-source projects, participating in Kaggle competitions, and working on personal projects on weekends.

2. Adding Too Many Buzzwords Instead of Demonstrating Skills

 

A resume packed with data science jargon like “machine learning,” “deep learning,” or “big data” might seem impressive. But if it’s just a list of buzzwords without evidence, it can backfire.

Why this matters

  • Recruiters and hiring managers look for evidence of your skills, not just their mention as keywords.
  • Loading your skills section with all the tools and libraries you’re familiar with can work against you if you don’t have the experience or projects to speak of.

How to avoid

  • Instead of listing terms like “data cleaning” or “predictive modeling” generically, describe how you applied those skills in a specific project.
  • For example, instead of writing “proficient in machine learning,” you can say, “Developed a machine learning pipeline that identified high-value customers, leading to a 20% increase in sales conversion.”

In short, you should focus on tangible results and outcomes tied to your skill set rather than purely listing technical terms.

3. Not Customizing Your Resume Enough

 

One size does not fit all when it comes to data science resumes. Sending the same resume for every position you apply to can significantly decrease your chances of landing an interview.

Why this matters

  • Data science is a broad field, and each company will have different expectations and requirements depending on the industry.
  • If your resume is too generic, recruiters can tell that you didn’t take the time to understand their specific needs. A resume submitted to an ML engineer role at a medical imaging startup should not be identical to the one you submit for a data scientist role at a fintech company.

How to avoid

  • Customize your resume for each job by tailoring your projects, skills, and keywords to match the job description. But be honest and include only projects and skills that you’ve worked on.
  • Be sure to highlight experiences that directly align with the company’s industry. For example, for a finance-focused role, emphasize projects related to financial data or risk analysis.

This is possible only when you diversify and work on a range of projects suited to the industry in which you’d like to work as a data scientist.

4. Not Quantifying Impact and Achievements

 

A data scientist’s job revolves around numbers and data. So failing to quantify achievements on your resume is a missed opportunity 🙂. Numbers add credibility to your claims and demonstrate the real impact of your work.

Why this matters

  • Vague descriptions like “improved data accuracy” or “developed predictive models” don’t give the recruiter any sense of scale or success.
  • Quantifiable metrics are easy to digest and help make your contributions stand out.

How to avoid

  • Include metrics for every relevant project or job experience. Focus on things like accuracy improvements, cost savings, time reductions, or business impacts.
  • If you can’t share exact numbers, use approximations such as “approximately 10% improvement” or “reduced processing time by nearly half.”

This is super important, because even if you’ve worked on complex and interesting projects, you need to be able to speak to their impact.

5. Neglecting Soft Skills and Business Acumen

 
While data science is highly technical, companies are increasingly seeking candidates who can also demonstrate soft skills such as communication, teamwork, and most importantly, a good understanding of how businesses work.

Although soft skills mostly fall into the “show, don’t tell” category, focusing only on technical expertise and ignoring these areas can be detrimental.

Why this matters

  • As a data scientist, you should be able to communicate complex findings to non-technical stakeholders.
  • Companies want data scientists who can make data-driven decisions that align with business goals and solve business problems.

How to avoid

  • If needed, dedicate a section of your resume to soft skills. Mention any instances where you’ve presented the project to the team or collaborated across teams.
  • When possible, link your technical achievements to business outcomes. This shows you understand the broader impact of your work.

Oh, and no worries. There’s a lot of opportunity to demonstrate soft skills during later stages of the interview process. 🙂

Conclusion

 

Building a strong data science resume is more than just listing technical skills and describing projects. As discussed, it requires showcasing the real-world impact of your projects, adding metrics where possible, and customizing your experience to match job roles.

By avoiding these common mistakes and following the outlined tips, you’ll be able to create a resume that stands out in the data science job market.

]]>
Things I Wish I Had Known Before Starting Machine Learning https://mistergemba.com/things-i-wish-i-had-known-before-starting-machine-learning/?utm_source=rss&utm_medium=rss&utm_campaign=things-i-wish-i-had-known-before-starting-machine-learning Tue, 05 Aug 2025 19:18:49 +0000 https://mistergemba.com/?p=240836 Part 1: Data, Sales Pitches, Bugs, and Breakthroughs Pascal Janetzky Ahh, the...]]>

Part 1: Data, Sales Pitches, Bugs, and Breakthroughs

Ahh, the sea.

During a much-needed vacation on the Mediterranean Sea, I found myself lying on the beach, staring into the waves. Lady Luck was having a good day: the sun glared down from a blue and cloudless sky, heating the sand and salty sea around me. For the first time in a while, I had downtime. There was nothing related to ML in the remote region where I was; the rough roads would have scared away anybody used to the even pavements of Western countries.

Then, away from work and, partially, civilization, somewhere between zoning out and full-on daydreaming, my thoughts began to drift. In our day-to-day business, we are too, well, busy to spend time doing nothing. But nothing is a strong word here: as my thoughts drifted, I first recalled recent events, then pondered about work, and then, eventually, arrived at machine learning.

Maybe traces of my previous article—where I reflected on 6.5 years of “doing” ML—were still lingering in the back of my mind. Or maybe it was simply the complete absence of anything technical around me, where the sea was my only companion. Whatever the reason was, I mentally started rehearsing the years behind me. What had gone well? What had gone sideways? And—most importantly—what do I wish someone had told me at the beginning?

This post is a collection of those things. It’s not meant to be a list of dumb mistakes that I urge others to avoid at all costs. Instead, it’s my attempt to write down the things that would have made my journey a bit smoother (but only a bit; uncertainty is necessary to make the future just that: the future). Parts of my list overlap with my previous post, and for good reason: some lessons are worth repeating, and reading again.

Here’s Part 1 of that list. Part 2 is currently buried in my sandy, sea-water-stained notebook. My plan is to follow up with it in the next couple of weeks, once I have enough time to turn it into a quality article.

1. Doing ML Mostly Means Preparing Data

This is a point I try not to think too much about, or it will tell me: you did not do your homework.

When I started out, my internal monologue was something like: “I just want to do ML.” Whatever that meant: I had visions of plugging neural networks together, combining methods, and running large-scale training. While I did all of that at one point or another, I found that “doing ML” often means spending a lot of time just preparing the data so that you can eventually train a machine learning model. Model training, ironically, is often the shortest and last part of the whole process.

Thus, every time I finally get to the model training step, I mentally breathe a sigh of relief, because it means I’ve made it through the invisible part: preparing the data. There’s nothing “sellable” in merely preparing the data. In my experience, preparing the data is not noticeable in any way (as long as it’s done well enough).

Here’s the usual pattern for it:

  • You have a project.
  • You get a real-world dataset. (If you work with a well-curated benchmark dataset, then you’re lucky!)
  • You want to train a model.
  • But first… data cleaning, fixing, merging, validating.
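To make that pattern concrete, here’s a minimal sketch of what the “but first” step tends to look like in pandas. The column names, thresholds, and join keys are made up for illustration; the real ones depend entirely on your dataset:

```python
import pandas as pd

def prepare(raw: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """A minimal sketch of the clean -> fix -> merge -> validate loop.

    Column names ("station_id", "temp_c", "ndvi") are hypothetical.
    """
    # Cleaning: drop exact duplicates and rows with no identifier.
    df = raw.drop_duplicates().dropna(subset=["station_id"])

    # Fixing: discard obviously impossible sensor readings.
    df = df[df["temp_c"].between(-90, 60)]

    # Merging: join features with labels; an inner join drops unmatched rows.
    df = df.merge(labels, on="station_id", how="inner")

    # Validating: fail loudly instead of silently training on broken data.
    assert not df.empty, "merge produced no rows -- check join keys"
    assert df["ndvi"].between(-1, 1).all(), "NDVI outside valid range"
    return df
```

Every real project will need its own version of each step, and usually several passes through the loop.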

Let me give you a personal example, one that I’ve told as a funny story (which it is now. Back then, it meant redoing a few days of machine learning work under time pressure…).

I once worked on a project where I wanted to predict vegetation density (using the NDVI index) from ERA5 weather data. ERA5 is a massive gridded dataset, freely available from the European Centre for Medium-Range Weather Forecasts. I merged this dataset with NDVI satellite data from NOAA (basically, the American weather agency), carefully aligned the resolutions, and everything seemed fine—no shape mismatches, no errors were thrown.

Then, I called the data preparation done and trained a Vision Transformer model on the combined dataset. A few days later, I visualized the results and… surprise! The model thought Earth was upside down. Literally—my input data was right-side up, but the target vegetation density was flipped at the equator.

What had happened? A subtle bug in my resolution translation flipped the latitude orientation of the vegetation data. I hadn’t noticed it because I was spending a lot of time on data preparation already, and wanted to get to the “fun part” quickly.
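A cheap guard against exactly this class of bug is to normalize the latitude orientation explicitly when merging gridded datasets, rather than trusting that matching shapes mean the axes agree. Here’s a hedged NumPy sketch, assuming a (lat, lon) grid; ERA5, for instance, ships latitudes north-to-south by default, while other products are often south-to-north:

```python
import numpy as np

def align_latitudes(lats: np.ndarray, grid: np.ndarray):
    """Ensure the latitude axis runs south-to-north before merging.

    Assumes `grid` has shape (lat, lon); flips coordinates and data
    together so they stay consistent.
    """
    if lats[0] > lats[-1]:  # descending latitudes: flip both together
        return lats[::-1], grid[::-1, :]
    return lats, grid

def check_orientation(lats_a: np.ndarray, lats_b: np.ndarray):
    """End-of-pipeline sanity check: feature and target axes must agree."""
    assert (lats_a[0] < lats_a[-1]) == (lats_b[0] < lats_b[-1]), \
        "latitude axes run in opposite directions -- one dataset is flipped"
```

A check like this, plus plotting a single merged sample on a map, would have caught my upside-down Earth before days of training.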

This kind of mistake drives home an important point: real-world ML projects are data projects. Especially outside academic research, you’re not working with CIFAR or ImageNet. You’re working with messy, incomplete, partially labelled, multi-source datasets that require:

  • Cleaning
  • Aligning
  • Normalizing
  • Debugging
  • Visual inspection

And that list is non-exhaustive. Then you repeat all of the above.

Getting the data right is the work. Everything else builds on that (sadly invisible) foundation.

2. Writing Papers Is Like Preparing a Sales Pitch

Some papers just read well. You might not be able to explain why, but they have a flow, a logic, a clarity that’s hard to ignore. That’s rarely by accident*. For me, it turned out that writing papers resembles crafting a very specific kind of sales pitch. You’re selling your idea, your approach, your insight to a skeptical audience.

This was a surprising realization for me.

When I started out, I assumed most papers looked and felt the same. All of them were “scientific writing” to me. But over time, as I read more papers, I began to notice the differences. It’s like that saying: to outsiders, all sheep look the same; to the shepherd, each one is distinct.

For example, compare these two papers that I came across recently:

Both use machine learning. But they speak to different audiences, with different levels of abstraction, different narrative styles, and even different motivations. The first one assumes that technical novelty is central. The second one focuses on relevance for applications. Obviously, there is also the visual difference between the two.

The more papers you read, the more you realize: there’s not one way to write a “good” paper. There are many ways, and the way varies depending on the audience.

And unless you’re one of those very rare brilliant minds (think Terence Tao or someone of that caliber), you’ll likely need support to write well. Especially when tailoring a paper for a specific conference or journal. In practice, that means working closely with a senior ML person who understands the field.

Crafting a good paper is like preparing a sales pitch. You need to:

  • Frame the problem the right way
  • Understand your audience (i.e. target venue)
  • Emphasize the parts that resonate most
  • And polish until the message sticks

3. Bug Fixing Is the Way Forward

Years ago, I had that romantic idea of ML as exploring elegant models, inventing new activation functions, or crafting clever loss functions. That may be true for a small set of researchers. But for me, progress often looked like: “Why doesn’t this code run?” Or, even more frustrating: “That code just ran a few seconds ago; why does it no longer run now?”

Let’s say your project requires using Vision Transformers on environmental satellite data (i.e., the model side of Section 1 above). You have two options:

  1. Implement everything from scratch (not recommended unless you’re feeling particularly adventurous, or need to do it for course credits).
  2. Find an existing implementation and adapt it.

In 99% of the cases, option 2 is the obvious choice. But “just plug in your data” almost never works. You’ll run into:

  • Different compute environments
  • Assumptions about input shapes
  • Preprocessing quirks (such as data normalization)
  • Hard-coded dependencies (of which I am guilty, too)

Quickly, your day can become an endless series of debugging, backtracking, testing edge cases, modifying dataloaders, checking GPU memory**, and rerunning scripts. Then, slowly, things begin to work. Eventually, your model trains.

But it’s not fast. It’s bug fixing your way forward.
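One habit that shortens this debugging loop: check the borrowed code’s assumptions explicitly at the seam between your data and the adapted implementation, so failures are readable instead of cryptic shape errors deep in someone else’s forward pass. A minimal sketch, with parameter names and defaults that are purely illustrative, not from any particular repository:

```python
import numpy as np

def adapt_batch(batch: np.ndarray,
                expected_channels: int = 3,
                mean: float = 0.0,
                std: float = 1.0) -> np.ndarray:
    """Guard the seam between your data and borrowed model code.

    The defaults here are placeholders; the real values come from
    whatever normalization the upstream implementation assumes.
    """
    # Assumption checks: fail with a readable message, not a shape
    # error three layers deep inside the adapted code.
    if batch.ndim != 4:
        raise ValueError(f"expected (N, C, H, W), got shape {batch.shape}")
    if batch.shape[1] != expected_channels:
        raise ValueError(
            f"model expects {expected_channels} channels, got {batch.shape[1]}")

    # Reproduce the upstream preprocessing instead of hoping it matches.
    return (batch.astype(np.float32) - mean) / std
```

It doesn’t remove the debugging, but it turns silent mismatches into loud, specific ones.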

4. I (Very Certainly) Won’t Make That Breakthrough

You’ve definitely heard of them. The Transformer paper. The GANs. Stable Diffusion. There’s a small part in me that thinks: maybe I’ll be the one to write the next transformative paper. And sure, someone has to. But statistically, it probably won’t be me. Or you, apologies. And that’s fine.

The works that cause a field to change rapidly are exceptional by definition. That directly implies that most works, even good ones, are barely recognized. Sometimes, I still hope that one of my projects will “blow up.” But, so far, most didn’t. Some didn’t even get published. But, hey, that’s not failure—it’s the baseline. If you expect every paper to be a home run, then you are on the fast track to disappointment.

Closing thoughts

To me, machine learning often appears as a sleek, cutting-edge field—one where breakthroughs are just around the corner and where the “doing” means smart people making magic with GPUs and math. But in my day-to-day work, it’s rarely like that.

More often, my day-to-day work consists of:

  • Handling messy datasets
  • Debugging code pulled from GitHub
  • Redrafting papers, over and over
  • Not producing novel results, again

And that’s okay.

Footnotes

The previous article mentioned: https://towardsdatascience.com/lessons-learned-after-6-5-years-of-machine-learning/

* If you are interested, my favorite paper is this one: https://arxiv.org/abs/2103.09762. I read it one year ago on a Friday afternoon.

** To this day, I still get mail notifications about how clearing the GPU memory is impossible in TensorFlow. This 5-year-old GitHub issue gives the details.

]]>