The data science ecosystem moves fast, with new frameworks and tools emerging constantly.
This post provides a Hypermodern Data Science Toolbox - tools that are setting the standard for data science and machine learning in 2025.
Python 3.11 and 3.12 both brought performance improvements. We choose 3.11 for the 2025 toolbox because some popular data science libraries are still not fully stable on 3.12.
Python 3.11 added better tracebacks - the exact location of the error is pointed out in the traceback. This improves the information available during development and debugging.
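As a small illustration of the improved tracebacks, the sketch below triggers an error inside a chain of nested lookups. The `city_of` function and the `user` record are made up for this example. On Python 3.11 or newer, when run from a file, the printed traceback should underline the exact subscript that failed, not just name the line.

```python
import traceback


def city_of(user: dict) -> str:
    # One of the nested lookups returns None, so the next subscript fails.
    return user["profile"]["address"]["city"]


# Hypothetical record: "address" has no value.
user = {"profile": {"address": None}}

report = ""
try:
    city_of(user)
except TypeError:
    # format_exc() returns the same text the interpreter would print.
    report = traceback.format_exc()

print(report)
```

On older Python versions the same error is raised, but the traceback only reports the line, leaving you to work out which of the three lookups returned `None`.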
Tradeoffs:
Polars is a tool for tabular data manipulation in Python - it’s an alternative to Pandas or Spark.
Polars offers query optimization, parallel processing and can work with larger than memory datasets. It also has a syntax that many prefer to Pandas.
Query optimization allows multiple data transformations to be grouped together and optimized as a single plan. This cannot be done in eager-execution frameworks like Pandas, which run each operation as soon as it is called.
import polars as pl
# Create a lazy query that will be optimized before execution
# Lazy evaluation allows Polars to optimize the entire query plan
query: pl.LazyFrame = (
    df  # Start with the input DataFrame
    .lazy()  # Convert to lazy mode for query optimization
    .with_columns([  # Add new columns or transform existing ones
        # Parse the "date" column from string to Date type using strptime
        pl.col("date").str.strptime(pl.Date).alias("date"),
        # Calculate cumulative sum of sales and create new column
        pl.col("sales").cum_sum().alias("cumulative_sales"),
    ])
    .group_by("region")  # Group rows by the "region" column
    .agg([  # Aggregate functions to apply to each group
        # Calculate mean sales for each region
        pl.col("sales").mean().alias("avg_sales"),
        # Count number of records (days) for each region
        pl.col("sales").count().alias("n_days"),
    ])
)
# Execute the optimized query and materialize results to DataFrame
result: pl.DataFrame = query.collect()
Tradeoffs:
JAX is a numerical computing library - it’s an alternative to NumPy or PyTorch.
JAX combines a NumPy-compatible API with automatic differentiation (grad), just-in-time compilation (jit), and automatic vectorization (vmap).
import jax.numpy as jnp
from jax import grad, jit, vmap
from jax.typing import ArrayLike
from typing import Callable
def predict(params: ArrayLike, x: ArrayLike) -> ArrayLike:
    """Computes the dot product."""
    # Compute linear prediction: params · x (dot product)
    return jnp.dot(params, x)
# Create gradient function using automatic differentiation
# grad() transforms the function to compute gradients with respect to first argument (params)
grad_fn: Callable[[ArrayLike, ArrayLike], ArrayLike] = grad(predict)
# Create JIT-compiled version for faster execution
# jit() compiles the function to XLA for near-C performance
fast_predict: Callable[[ArrayLike, ArrayLike], ArrayLike] = jit(predict)
# Create vectorized version to operate on batches
# vmap() automatically vectorizes the function over the second argument (x)
# in_axes=(None, 0) means: don't vectorize params, vectorize x along axis 0
batch_predict: Callable[[ArrayLike, ArrayLike], ArrayLike] = vmap(predict, in_axes=(None, 0))
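A short usage sketch may help show what the three transformations return. The input arrays below are made up for illustration, and `predict` is redefined so the block runs on its own.

```python
import jax.numpy as jnp
from jax import grad, jit, vmap


def predict(params, x):
    """Dot product of parameters and a single input vector."""
    return jnp.dot(params, x)


# Illustrative inputs (not from the original post)
params = jnp.array([1.0, 2.0, 3.0])
x = jnp.array([0.5, -1.0, 2.0])
batch = jnp.stack([x, 2.0 * x])  # shape (2, 3): two input vectors

# Gradient of params · x with respect to params is x itself
gradient = grad(predict)(params, x)

# The JIT-compiled function returns the same value as the plain one
fast_value = jit(predict)(params, x)

# vmap applies predict to every row of the batch, sharing params
batch_values = vmap(predict, in_axes=(None, 0))(params, batch)

print(gradient, fast_value, batch_values)
```

The transformations compose, so something like `jit(vmap(grad(predict), in_axes=(None, 0)))` would give a compiled, batched gradient function.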
Tradeoffs:
PyTorch is a deep learning framework - it’s an alternative to TensorFlow or JAX for neural networks.
PyTorch offers dynamic computation graphs, making it intuitive for research and experimentation. It has become the standard for academic research and increasingly for production.
import torch
import torch.nn as nn
from torch import Tensor
class SimpleNet(nn.Module):
    """Feedforward neural network."""
    def __init__(self, input_size: int, hidden_size: int, output_size: int) -> None:
        # Initialize the parent Module class
        super().__init__()
        # Create a sequential container of layers
        self.layers = nn.Sequential(
            # First layer: linear transformation from input to hidden layer
            nn.Linear(input_size, hidden_size),
            # Activation function: ReLU introduces non-linearity
            nn.ReLU(),
            # Output layer: linear transformation from hidden to output
            nn.Linear(hidden_size, output_size),
        )
    def forward(self, x: Tensor) -> Tensor:
        """Forward pass through the network."""
        # Pass input through all layers sequentially
        return self.layers(x)
# Create model instance: 784 inputs (28x28 image), 128 hidden units, 10 outputs (classes)
model: SimpleNet = SimpleNet(784, 128, 10)
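The dynamic computation graph mentioned above can be illustrated with a module whose forward pass uses an ordinary Python loop. This is a minimal sketch: `DynamicDepthNet` and its `n_steps` argument are hypothetical names chosen for the example, not part of PyTorch.

```python
import torch
import torch.nn as nn
from torch import Tensor


class DynamicDepthNet(nn.Module):
    """Applies the same linear layer a caller-chosen number of times."""

    def __init__(self, size: int) -> None:
        super().__init__()
        self.layer = nn.Linear(size, size)

    def forward(self, x: Tensor, n_steps: int) -> Tensor:
        # Plain Python control flow: the graph is rebuilt on every call,
        # so its depth can differ from one call to the next.
        for _ in range(n_steps):
            x = torch.relu(self.layer(x))
        return x


torch.manual_seed(0)  # Fixed seed so the random inputs are repeatable
net = DynamicDepthNet(size=4)
batch = torch.randn(8, 4)

shallow = net(batch, n_steps=1)  # Graph with one layer application
deep = net(batch, n_steps=3)  # Graph with three layer applications

# Backpropagation follows whichever graph was actually built
loss = deep.sum()
loss.backward()
print(shallow.shape, deep.shape, net.layer.weight.grad.shape)
```

Because the graph is recorded as the code runs, standard debugging tools such as `print` or a debugger work inside `forward`.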
Tradeoffs:
spaCy is a Natural Language Processing (NLP) library - it’s an alternative to NLTK.
spaCy focuses on production-ready NLP with pre-trained models for multiple languages and tasks like named entity recognition, part-of-speech tagging, and dependency parsing.
import spacy
# Load pre-trained English language model (small version)
nlp: spacy.language.Language = spacy.load("en_core_web_sm")
# Process text through the NLP pipeline (tokenization, POS tagging, NER, etc.)
doc: spacy.tokens.Doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Extract named entities detected by the model
for ent in doc.ents:
    # Print entity text and its predicted label type
    print(ent.text, ent.label_)
# Output: Apple ORG, U.K. GPE, $1 billion MONEY
Tradeoffs:
Optuna is a hyperparameter optimization framework - it’s an alternative to grid search, random search, or scikit-learn’s hyperparameter tools.
Optuna uses sophisticated algorithms like Tree-Structured Parzen Estimator (TPE) and pruning to efficiently search hyperparameter spaces.
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
def objective(trial: optuna.Trial) -> float:
    """Objective function that Optuna will optimize."""
    # Suggest hyperparameter values from specified ranges
    n_estimators: int = trial.suggest_int('n_estimators', 10, 100)
    max_depth: int = trial.suggest_int('max_depth', 1, 10)
    # Create model with suggested hyperparameters
    clf: RandomForestClassifier = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42,  # Fixed seed for reproducibility
    )
    # Evaluate model performance using cross-validation
    # X and y (features and labels) are assumed to be defined elsewhere
    # Return mean accuracy score to maximize
    scores: np.ndarray = cross_val_score(clf, X, y, cv=3)
    return scores.mean()
# Create optimization study to maximize the objective
study: optuna.Study = optuna.create_study(direction='maximize')
# Run optimization for 100 trials using TPE algorithm
study.optimize(objective, n_trials=100)
Tradeoffs:
Ray is a distributed computing framework - it’s an alternative to Dask, multiprocessing, or Spark for scaling Python workloads.
Ray simplifies distributed computing with a clean API for parallel and distributed execution, plus libraries for ML (Ray Tune, Ray Train).
import ray
# Initialize Ray runtime (connects to existing cluster or starts local one)
ray.init()
# Decorator to make function executable as remote task
@ray.remote
def expensive_function(x: int) -> int:
    # Simulate expensive computation with sleep
    import time
    time.sleep(1)
    # Return computed result
    return x ** 2
# Execute function calls in parallel across available resources
# Each call returns a future (ObjectRef) immediately
futures: list[ray.ObjectRef] = [expensive_function.remote(i) for i in range(10)]
# Block until all parallel tasks complete and retrieve results
results: list[int] = ray.get(futures)
Tradeoffs:
Hugging Face provides pre-trained models and tools - it’s become the standard for accessing and fine-tuning transformer models.
Hugging Face offers the largest collection of pre-trained models with simple APIs for common NLP, computer vision, and audio tasks.
from transformers import Pipeline, pipeline
from typing import Any
# Create sentiment analysis pipeline with default pre-trained model
classifier: Pipeline = pipeline("sentiment-analysis")
# Analyze sentiment of input text
result: list[dict[str, Any]] = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Create question answering pipeline with default pre-trained model
qa: Pipeline = pipeline("question-answering")
# Define context text containing the information
context: str = "Paris is the capital of France."
# Define question to ask about the context
question: str = "What is the capital of France?"
# Extract answer from context using the model
answer: dict[str, Any] = qa(question=question, context=context)
Tradeoffs:
Pandera is a tool for data quality checks of tabular data - it’s an alternative to Great Expectations or assert statements.
Pandera allows you to define schemas for tabular data, catching data issues before they propagate through your analysis pipeline.
import pandera as pa
import polars as pl
from pandera.polars import DataFrameSchema, Column
# Define schema with column specifications and validation rules
schema: DataFrameSchema = DataFrameSchema({
    "sales": Column(
        int,  # Expected data type: integer
        # List of validation checks to apply
        checks=[pa.Check.greater_than(0), pa.Check.less_than(10000)],
    ),
    "region": Column(
        str,  # Expected data type: string
        # Check that values are in specified set
        checks=[pa.Check.isin(["North", "South", "East", "West"])],
    ),
})
# Validate data (a Polars DataFrame defined elsewhere) against the schema
# Returns the validated DataFrame; raises SchemaError if validation fails
validated_data: pl.DataFrame = schema.validate(data)
Tradeoffs:
Marimo is a Python notebook editor and format - it’s an alternative to Jupyter Lab.
Marimo offers reactive execution, Git-friendly storage, and interactive web apps from notebooks.
import marimo
# Version metadata for reproducibility
__generated_with: str = "0.10.12"
# Create Marimo app with medium width layout
app: marimo.App = marimo.App(width="medium")
# Define first cell: imports plus a markdown heading as its output
@app.cell
def _():
    import marimo as mo
    import polars as pl
    import altair as alt
    # Display markdown heading
    mo.md("# Data Analysis")
    # Return the imports so other cells can depend on them
    return alt, mo, pl
# Define second cell that depends on polars (pl) from first cell
@app.cell
def _(pl):
    # Load CSV data using Polars
    data = pl.read_csv("data.csv")
    # Return data to make it available to other cells
    return (data,)
# Run the app when script is executed directly
if __name__ == "__main__":
    app.run()
Tradeoffs:
Altair is a declarative statistical visualization library - it’s an alternative to matplotlib, seaborn, or plotly.
Altair is based on Vega-Lite and uses a grammar of graphics approach, making it easy to create complex, interactive visualizations with minimal code.
import altair as alt
import polars as pl
# Load sales data from CSV file
data: pl.DataFrame = pl.read_csv('sales_data.csv')
# Create interactive scatter plot using grammar of graphics
chart: alt.Chart = alt.Chart(data).mark_circle(size=60).encode(
    x='sales:Q',  # X-axis: sales (quantitative)
    y='profit:Q',  # Y-axis: profit (quantitative)
    color='region:N',  # Color by region (nominal)
    tooltip=['region', 'sales', 'profit'],  # Show data on hover
).interactive()  # Enable zoom and pan
# Display the chart
chart.show()
# Create faceted line chart (small multiples)
faceted: alt.FacetChart = alt.Chart(data).mark_line().encode(
    x='date:T',  # X-axis: date (temporal)
    y='sales:Q',  # Y-axis: sales (quantitative)
    color='product:N',  # Color by product (nominal)
).facet(
    column='region:N'  # Create separate subplot for each region
).resolve_scale(y='independent')  # Independent Y-axis scales per facet
Tradeoffs:
MLflow is an ML lifecycle management platform - it’s an alternative to Weights & Biases, Neptune, or custom experiment tracking.
MLflow provides experiment tracking, model packaging, model serving, and a model registry for managing the complete ML lifecycle.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
# Start MLflow tracking run to log experiment details
with mlflow.start_run():
    # Log hyperparameters for this experiment
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    # Create and train model with specified hyperparameters
    model: RandomForestClassifier = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    # Evaluate model and log performance metrics
    predictions: np.ndarray = model.predict(X_test)
    accuracy: float = accuracy_score(y_test, predictions)
    mlflow.log_metric("accuracy", accuracy)
    # Log trained model artifact for later deployment
    mlflow.sklearn.log_model(model, "random_forest_model")
    # Log additional files (plots, reports, etc.)
    mlflow.log_artifact("feature_importance.png")
Tradeoffs:
Darts is a time series forecasting library - it’s an alternative to statsmodels, Prophet, or custom implementations.
Darts provides a unified API for classical and modern time series forecasting methods, with built-in backtesting and model evaluation.
from darts import TimeSeries
from darts.models import ExponentialSmoothing, NBEATSModel
# Load time series data from CSV with specified time and value columns
ts: TimeSeries = TimeSeries.from_csv('sales_data.csv', time_col='date', value_cols=['sales'])
# Create and train classical forecasting model
exp_smoothing: ExponentialSmoothing = ExponentialSmoothing()
exp_smoothing.fit(ts)
# Create and train modern deep learning forecasting model
# input_chunk_length: historical window size, output_chunk_length: forecast horizon
nbeats: NBEATSModel = NBEATSModel(input_chunk_length=24, output_chunk_length=12)
nbeats.fit(ts)
# Generate forecasts using both models
forecast_classical: TimeSeries = exp_smoothing.predict(n=12)  # 12-period forecast
forecast_modern: TimeSeries = nbeats.predict(n=12)  # 12-period forecast
Tradeoffs:
The 2025 Hypermodern Data Science Toolbox is:
Thanks for reading!