Site Logo

Data Science South

Hypermodern Data Science Toolbox 2025

The data science ecosystem moves fast, with new frameworks and tools emerging constantly ๐Ÿš€

Descending from Renjo La Pass in the Himalayas, Nepal
Descending from Renjo La Pass in the Himalayas, Nepal in 2019

This post provides a Hypermodern Data Science Toolbox - tools that are setting the standard for data science and machine learning in 2025.

Python 3.11+

Python 3.11 and 3.12 have both brought performance improvements to Python. We choose 3.11 for the 2025 toolbox as 3.12 is still a bit unstable with some popular data science libraries.

Python 3.11 added better tracebacks - the exact location of the error is pointed out in the traceback. This improves the information available during development and debugging.

Tradeoffs:

  • โœ… Performance: Python 3.12+ offers 10-15% performance improvements for numerical computing
  • โŒ Compatibility: Some cutting-edge ML libraries may require newer Python versions
  • โœ… Stability: 3.11 provides the best balance of performance and ecosystem stability

Polars

Polars is a tool for tabular data manipulation in Python - it’s an alternative to Pandas or Spark.

Polars offers query optimization, parallel processing and can work with larger than memory datasets. It also has a syntax that many prefer to Pandas.

Query optimization allows multiple data transformations to be grouped together and optimized over. This cannot be done in eager-execution frameworks like Pandas.

import polars as pl

# Create a lazy query that will be optimized before execution
# Lazy evaluation allows Polars to optimize the entire query plan
query: pl.LazyFrame = (
    df  # Start with the input DataFrame
    .lazy()  # Convert to lazy mode for query optimization
    .with_columns([  # Add new columns or transform existing ones
        # Parse the "date" column from string to Date type using strptime
        pl.col("date").str.strptime(pl.Date).alias("date"),
        # Calculate cumulative sum of sales and create new column
        pl.col("sales").cum_sum().alias("cumulative_sales"),
    ])
    .group_by("region")  # Group rows by the "region" column
    .agg([  # Aggregate functions to apply to each group
        # Calculate mean sales for each region
        pl.col("sales").mean().alias("avg_sales"),
        # Count number of records (days) for each region
        pl.col("sales").count().alias("n_days"),
    ])
)
# Execute the optimized query and materialize results to DataFrame
result: pl.DataFrame = query.collect()

Tradeoffs:

  • โœ… Performance: Faster than Pandas for most operations
  • โœ… Optimization: Query optimization automatically combines operations reducing both code complexity and execution time
  • โŒ Maturity: Less mature framework than Pandas or PySpark

JAX

JAX is a numerical computing library - it’s an alternative to NumPy or PyTorch.

JAX combines NumPy-compatible API with automatic differentiation (grad), just-in-time compilation (jit), and automatic vectorization (vmap).

import jax.numpy as jnp
from jax import grad, jit, vmap
from jax.typing import ArrayLike
from typing import Callable

def predict(params: ArrayLike, x: ArrayLike) -> ArrayLike:
    """Computes the dot product."""
    # Compute linear prediction: params ยท x (dot product)
    return jnp.dot(params, x)

# Create gradient function using automatic differentiation
# grad() transforms the function to compute gradients with respect to first argument (params)
grad_fn: Callable[[ArrayLike, ArrayLike], ArrayLike] = grad(predict)

# Create JIT-compiled version for faster execution
# jit() compiles the function to XLA for near-C performance
fast_predict: Callable[[ArrayLike, ArrayLike], ArrayLike] = jit(predict)

# Create vectorized version to operate on batches
# vmap() automatically vectorizes the function over the second argument (x)
# in_axes=(None, 0) means: don't vectorize params, vectorize x along axis 0
batch_predict: Callable[[ArrayLike, ArrayLike], ArrayLike] = vmap(predict, in_axes=(None, 0))

Tradeoffs:

  • โœ… Performance: High performance numerical computing with automatic differentiation and JIT compilation
  • โœ… Transformations: Functional programming paradigm enables powerful transformations (vmap, pmap) impossible in imperative frameworks
  • โŒ Complexity: Steep learning curve and debugging challenges due to functional constraints and compilation overhead

PyTorch

PyTorch is a deep learning framework - it’s an alternative to TensorFlow or JAX for neural networks.

PyTorch offers dynamic computation graphs, making it intuitive for research and experimentation. It has become the standard for academic research and increasingly for production.

import torch
import torch.nn as nn
from torch import Tensor

class SimpleNet(nn.Module):
    """Feedforward neural network."""
    def __init__(self, input_size: int, hidden_size: int, output_size: int) -> None:
        # Initialize the parent Module class
        super().__init__()
        # Create a sequential container of layers
        self.layers = nn.Sequential(
            # First layer: linear transformation from input to hidden layer
            nn.Linear(input_size, hidden_size),
            # Activation function: ReLU introduces non-linearity
            nn.ReLU(),
            # Output layer: linear transformation from hidden to output
            nn.Linear(hidden_size, output_size)
        )
    
    def forward(self, x: Tensor) -> Tensor:
        """Forward pass through the network."""
        # Pass input through all layers sequentially
        return self.layers(x)

# Create model instance: 784 inputs (28x28 image), 128 hidden units, 10 outputs (classes)
model: SimpleNet = SimpleNet(784, 128, 10)

Tradeoffs:

  • โœ… Debugging: Intuitive eager execution makes debugging and experimentation straightforward
  • โœ… Ecosystem: Massive ecosystem with excellent community support and extensive pre-trained models
  • โŒ Memory management: Memory management issues and potential leaks during long training runs require careful monitoring

spaCy

spaCy is a Natural Language Processing (NLP) library - it’s an alternative to NLTK.

spaCy focuses on production-ready NLP with pre-trained models for multiple languages and tasks like named entity recognition, part-of-speech tagging, and dependency parsing.

import spacy

# Load pre-trained English language model (small version)
nlp: spacy.language.Language = spacy.load("en_core_web_sm")

# Process text through the NLP pipeline (tokenization, POS tagging, NER, etc.)
doc: spacy.tokens.Doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Extract named entities detected by the model
for ent in doc.ents:
    # Print entity text and its predicted label type
    print(ent.text, ent.label_)
# Output: Apple ORG, U.K. GPE, $1 billion MONEY

Tradeoffs:

  • โœ… Performance: Production-ready performance with optimized C extensions and efficient memory usage
  • โœ… Models: Comprehensive pre-trained models covering multiple languages and NLP tasks out of the box
  • โœ… Ecosystem: Strong ecosystem with extensions for custom pipelines, training, and deployment

Optuna

Optuna is a hyperparameter optimization framework - it’s an alternative to grid search, random search, or scikit-learn’s hyperparameter tools.

Optuna uses sophisticated algorithms like Tree-Structured Parzen Estimator (TPE) and pruning to efficiently search hyperparameter spaces.

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

def objective(trial: optuna.Trial) -> float:
    """Objective function that Optuna will optimize."""
    # Suggest hyperparameter values from specified ranges
    n_estimators: int = trial.suggest_int('n_estimators', 10, 100)
    max_depth: int = trial.suggest_int('max_depth', 1, 10)
    
    # Create and train model with suggested hyperparameters
    clf: RandomForestClassifier = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42  # Fixed seed for reproducibility
    )
    
    # Evaluate model performance using cross-validation
    # Return mean accuracy score to maximize
    scores: np.ndarray = cross_val_score(clf, X, y, cv=3)
    return scores.mean()

# Create optimization study to maximize the objective
study: optuna.Study = optuna.create_study(direction='maximize')
# Run optimization for 100 trials using TPE algorithm
study.optimize(objective, n_trials=100)

Tradeoffs:

  • โœ… Algorithms: Sophisticated optimization algorithms (TPE, pruning) significantly outperform grid/random search
  • โœ… Visualization: Built-in visualization and study management features streamline experiment tracking
  • โŒ Expertise: Hyperparameter space definition still requires deep domain expertise to be effective

Ray

Ray is a distributed computing framework - it’s an alternative to Dask, multiprocessing, or Spark for scaling Python workloads.

Ray simplifies distributed computing with a clean API for parallel and distributed execution, plus libraries for ML (Ray Tune, Ray Train).

import ray

# Initialize Ray runtime (connects to existing cluster or starts local one)
ray.init()

# Decorator to make function executable as remote task
@ray.remote
def expensive_function(x: int) -> int:
    # Simulate expensive computation with sleep
    import time
    time.sleep(1)
    # Return computed result
    return x ** 2

# Execute function calls in parallel across available resources
# Each call returns a future (ObjectRef) immediately
futures: list[ray.ObjectRef] = [expensive_function.remote(i) for i in range(10)]
# Block until all parallel tasks complete and retrieve results
results: list[int] = ray.get(futures)

Tradeoffs:

  • โœ… Scaling: Near-linear scaling across multiple machines with minimal code changes required
  • โœ… Integration: Excellent integration with ML workflows through Ray Tune, Ray Train, and Ray Serve
  • โŒ Complexity: Debugging distributed failures and cluster management complexity create significant operational overhead

Hugging Face

Hugging Face provides pre-trained models and tools - it’s become the standard for accessing and fine-tuning transformer models.

Hugging Face offers the largest collection of pre-trained models with simple APIs for common NLP, computer vision, and audio tasks.

from transformers import pipeline
from typing import Any

# Create sentiment analysis pipeline with default pre-trained model
classifier: transformers.pipelines.Pipeline = pipeline("sentiment-analysis")
# Analyze sentiment of input text
result: list[dict[str, Any]] = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Create question answering pipeline with default pre-trained model
qa: transformers.pipelines.Pipeline = pipeline("question-answering")
# Define context text containing the information
context: str = "Paris is the capital of France."
# Define question to ask about the context
question: str = "What is the capital of France?"
# Extract answer from context using the model
answer: dict[str, Any] = qa(question=question, context=context)

Tradeoffs:

  • โœ… Repository: Largest repository of pre-trained models with consistent APIs across modalities
  • โœ… Speed: Rapid access to state-of-the-art models often within days of paper publication
  • โŒ Quality: Inconsistent model quality and documentation with many models failing to reproduce claimed results

Pandera

Pandera is a tool for data quality checks of tabular data - it’s an alternative to Great Expectations or assert statements.

Pandera allows you to define schemas for tabular data, catching data issues before they propagate through your analysis pipeline.

import pandera as pa
from pandera.polars import DataFrameSchema, Column

# Define schema with column specifications and validation rules
schema: pandera.polars.DataFrameSchema = DataFrameSchema({
    "sales": Column(
        int,  # Expected data type: integer
        # List of validation checks to apply
        checks=[pa.Check.greater_than(0), pa.Check.less_than(10000)],
    ),
    "region": Column(
        str,  # Expected data type: string
        # Check that values are in specified set
        checks=[pa.Check.isin(["North", "South", "East", "West"])],
    ),
})

# Validate data against schema and return validated DataFrame
# Raises SchemaError if validation fails
validated_data: pl.DataFrame = schema(data)

Tradeoffs:

  • โœ… Validation: Type-safe data validation catches errors early in pipeline development preventing downstream failures
  • โœ… Integration: Excellent integration with both Pandas and Polars with clear, actionable error messages
  • โŒ Performance: Runtime validation overhead can significantly slow production pipelines processing large datasets

Marimo

Marimo is a Python notebook editor and format - it’s an alternative to Jupyter Lab.

Marimo offers reactive execution, Git-friendly storage, and interactive web apps from notebooks.

import marimo

# Version metadata for reproducibility
__generated_with: str = "0.10.12"
# Create Marimo app with medium width layout
app: marimo.App = marimo.App(width="medium")

# Define first cell with markdown output
@app.cell
def _(mo: marimo) -> None:
    import polars as pl
    import altair as alt
    # Display markdown heading
    mo.md("# Data Analysis")

# Define second cell that depends on polars (pl) from first cell
@app.cell  
def _(pl: polars) -> tuple[pl.DataFrame]:
    # Load CSV data using Polars
    data: pl.DataFrame = pl.read_csv("data.csv")
    # Return data to make it available to other cells
    return data,

# Run the app when script is executed directly
if __name__ == "__main__":
    app.run()

Tradeoffs:

  • โœ… Execution: Reactive execution model prevents common notebook pitfalls like out-of-order cell execution
  • โœ… Version control: Git-friendly Python file format enables proper version control and code review workflows
  • โŒ Ecosystem: Limited ecosystem compared to Jupyter with fewer extensions, widgets, and third-party integrations

Altair

Altair is a declarative statistical visualization library - it’s an alternative to matplotlib, seaborn, or plotly.

Altair is based on Vega-Lite and uses a grammar of graphics approach, making it easy to create complex, interactive visualizations with minimal code.

import altair as alt
import polars as pl

# Load sales data from CSV file
data: pl.DataFrame = pl.read_csv('sales_data.csv')

# Create interactive scatter plot using grammar of graphics
chart: altair.Chart = alt.Chart(data).mark_circle(size=60).encode(
    x='sales:Q',        # X-axis: sales (quantitative)
    y='profit:Q',       # Y-axis: profit (quantitative)
    color='region:N',   # Color by region (nominal)
    tooltip=['region', 'sales', 'profit']  # Show data on hover
).interactive()  # Enable zoom, pan, and selection

# Display the chart
chart.show()

# Create faceted line chart (small multiples)
faceted: altair.Chart = alt.Chart(data).mark_line().encode(
    x='date:T',         # X-axis: date (temporal)
    y='sales:Q',        # Y-axis: sales (quantitative)
    color='product:N'   # Color by product (nominal)
).facet(
    column='region:N'   # Create separate subplot for each region
).resolve_scale(y='independent')  # Independent Y-axis scales per facet

Tradeoffs:

  • โœ… Grammar: Grammar of graphics approach creates consistent, composable visualizations with minimal code
  • โœ… Interactivity: Built-in interactivity (zoom, pan, brush, link) works seamlessly without additional configuration
  • โŒ Performance: Performance degrades significantly with large datasets as rendering happens client-side in browser

MLflow

MLflow is an ML lifecycle management platform - it’s an alternative to Weights & Biases, Neptune, or custom experiment tracking.

MLflow provides experiment tracking, model packaging, model serving, and a model registry for managing the complete ML lifecycle.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Start MLflow tracking run to log experiment details
with mlflow.start_run():
    # Log hyperparameters for this experiment
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    
    # Create and train model with specified hyperparameters
    model: RandomForestClassifier = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    
    # Evaluate model and log performance metrics
    predictions: np.ndarray = model.predict(X_test)
    accuracy: float = accuracy_score(y_test, predictions)
    mlflow.log_metric("accuracy", accuracy)
    
    # Log trained model artifact for later deployment
    mlflow.sklearn.log_model(model, "random_forest_model")
    
    # Log additional files (plots, reports, etc.)
    mlflow.log_artifact("feature_importance.png")

Tradeoffs:

  • โœ… Lifecycle: Complete ML lifecycle management with experiment tracking, model registry, and serving capabilities
  • โœ… Open source: No vendor lock-in with open source codebase and ability to self-host all components
  • โŒ Infrastructure: Infrastructure management overhead and limited enterprise features compared to commercial alternatives

Darts

Darts is a time series forecasting library - it’s an alternative to statsmodels, Prophet, or custom implementations.

Darts provides a unified API for classical and modern time series forecasting methods, with built-in backtesting and model evaluation.

from darts import TimeSeries
from darts.models import ExponentialSmoothing, Prophet, NBEATSModel

# Load time series data from CSV with specified time and value columns
ts: darts.TimeSeries = TimeSeries.from_csv('sales_data.csv', time_col='date', value_cols=['sales'])

# Create and train classical forecasting model
exp_smoothing: darts.models.ExponentialSmoothing = ExponentialSmoothing()
exp_smoothing.fit(ts)

# Create and train modern deep learning forecasting model
# input_chunk_length: historical window size, output_chunk_length: forecast horizon
nbeats: darts.models.NBEATSModel = NBEATSModel(input_chunk_length=24, output_chunk_length=12)
nbeats.fit(ts)

# Generate forecasts using both models
forecast_classical: darts.TimeSeries = exp_smoothing.predict(n=12)  # 12-period forecast
forecast_modern: darts.TimeSeries = nbeats.predict(n=12)           # 12-period forecast

Tradeoffs:

  • โœ… API: Unified API across 50+ classical and modern forecasting methods enables easy model comparison
  • โœ… Evaluation: Built-in backtesting and evaluation metrics prevent common time series modeling mistakes
  • โŒ Performance: Individual model implementations often underperform compared to specialized libraries and frameworks

Summary

The 2025 Hypermodern Data Science Toolbox is:

  • Python 3.11+ for improved performance and error messages
  • Polars for fast, memory-efficient data manipulation
  • JAX for high-performance numerical computing and research
  • PyTorch for deep learning and neural networks
  • spaCy for production-ready NLP
  • Optuna for efficient hyperparameter optimization
  • Ray for scaling computations across multiple cores/machines
  • Hugging Face for pre-trained transformer models
  • Pandera for data quality validation
  • Marimo for reactive, reproducible notebooks
  • Altair for declarative statistical visualization
  • MLflow for ML experiment tracking and lifecycle management
  • Darts for comprehensive time series forecasting

Thanks for reading!