DataHaskell Roadmap 2026-2027

Version: 1.0
Date: November 2026
Coordinators: DataHaskell Community
Key Partners: dataframe, Hasktorch, distributed-process


Executive Summary

This roadmap outlines the strategic direction for building a complete, production-ready Haskell data science ecosystem. With three foundational libraries already in active development—dataframe (data manipulation), Hasktorch (deep learning), and distributed-process (distributed computing)—we are positioned to create a cohesive platform that rivals Python, R, and Julia for data science workloads.

Vision

By 2027, DataHaskell will provide a high-performance, end-to-end data science toolkit that enables practitioners to build reliable machine learning systems from data ingestion through model deployment.

Core Principles

  1. Interoperability: Seamless integration between ecosystem components
  2. Performance: Match or exceed Python/R performance benchmarks
  3. Ergonomics: Intuitive APIs that lower the barrier to entry
  4. Production Ready: Focus on reliability, monitoring, and deployment
  5. Type Safety: Leverage Haskell’s type system (where possible) to catch errors at compile time

Current State Assessment

Strengths

  • dataframe: Modern dataframe library with IHaskell integration
  • Hasktorch: Mature deep learning library built on libtorch (the PyTorch C++ backend) with GPU support
  • distributed-process: Battle-tested distributed computing framework
  • IHaskell: Haskell kernel for Jupyter notebooks
  • Strong functional programming foundations
  • Excellent parallelism and concurrency primitives

Gaps to Address

  • Small pool of active maintainers and contributors
  • Fragmented visualization ecosystem
  • Limited data I/O format support
  • Incomplete documentation and tutorials
  • Sparse integration examples between major libraries
  • Limited model deployment tooling

Critical Needs

  • Unified onboarding experience
  • Comprehensive benchmarking against Python/R
  • Production deployment patterns
  • Enterprise adoption case studies

Strategic Pillars

Pillar 1: Core Data Infrastructure

Phase 1 (Q1-Q2 2026) - Foundation

Owner: dataframe team

Goals:

  • Complete dataframe v1 release (March 2026)
  • Establish dataframe as the standard tabular data library
  • Performance parity with Pandas/Polars for common operations

Deliverables:

  1. dataframe v1.0.0
    • SQL-like API finalized
    • IHaskell integration complete
    • Type-safe column operations
    • Comprehensive test suite
    • Apache Arrow integration
  2. File Format Support
    • CSV/TSV (existing)
    • Parquet (high priority)
    • Arrow IPC format
    • Excel (xlsx)
    • JSON (nested structures)
    • HDF5 (coordination with scientific computing)
  3. Performance Benchmarks
    • Public benchmark suite comparing to:
      • Pandas
      • Polars
      • dplyr/tidyverse
    • Focus areas: filtering, grouping, joining, aggregations
    • Document optimization strategies
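The grouping and aggregation operations targeted by these benchmarks have simple reference semantics that any dataframe implementation must reproduce. A minimal sketch in plain Haskell (the names `sumByKey` and `salesByRegion` and the sample data are illustrative, not part of the dataframe API):

```haskell
import qualified Data.Map.Strict as Map

-- Group-by-key-and-sum: the reference semantics a dataframe
-- group-by/aggregate must match (and beat on speed).
sumByKey :: (Ord k, Num v) => [(k, v)] -> Map.Map k v
sumByKey = Map.fromListWith (+)

-- Illustrative sample data: (region, sale amount).
sales :: [(String, Double)]
sales = [("EU", 10), ("US", 20), ("EU", 5), ("APAC", 7)]

salesByRegion :: Map.Map String Double
salesByRegion = sumByKey sales  -- APAC -> 7, EU -> 15, US -> 20
```

The benchmark suite can use executable specifications like this as correctness oracles while measuring the optimized column-oriented implementations.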

Phase 2 (Q3-Q4 2026) - Expansion

Owner: dataframe + community

Goals:

  • Advanced data manipulation features
  • Out-of-core computation on datasets larger than memory
  • Integration with cloud database systems

Deliverables:

  1. Advanced Operations
    • Window functions
    • Rolling aggregations
    • Pivot/unpivot operations
    • Complex joins (anti, semi)
    • Reshaping operations (melt, cast)
  2. Cloud Database Connectivity
    • Read files from cloud object storage (AWS S3, GCS, Azure Blob)
    • PostgreSQL integration
    • SQLite support
    • Query pushdown optimization
    • Streaming query results
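Rolling aggregations, one of the advanced operations above, also have a compact reference semantics. A sketch in plain Haskell (illustrative spec, not the planned dataframe API; a real implementation would use a sliding accumulator rather than recomputing each window):

```haskell
-- Reference semantics for a size-k rolling sum over a column.
-- Windows shorter than k at the end of the column are dropped.
rollingSum :: Num a => Int -> [a] -> [a]
rollingSum k xs
  | k <= 0    = []
  | otherwise = map sum (windows xs)
  where
    windows ys
      | length ys < k = []
      | otherwise     = take k ys : windows (drop 1 ys)
```

For example, `rollingSum 3 [1, 2, 3, 4, 5]` yields the sums of the three overlapping windows.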

Pillar 2: Statistical Computing & Visualization

Phase 1 (Q2-Q3 2026) - Statistics Core

Owner: Community (needs maintainer)

Goals:

  • Create a unified machine learning library on top of Hasktorch and the statistics package
  • Create unified plotting API

Deliverables:

  1. statistics
    • Extend hypothesis testing (t-test, ANOVA)
    • Simple regression models (linear and logistic)
    • Generalized linear models (GLM)
    • Survival analysis basics
    • Integration with dataframe
  2. Plotting & Visualization
    • Option A: Extend hvega (Vega-Lite) with dataframe integration
    • Option B: Create native plotting library with backends
    • Priority features:
      • Scatter plots, line plots, bar charts
      • Histograms and distributions
      • Heatmaps and correlation plots
      • Interactive plots for notebooks
      • Export to PNG, SVG, PDF
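For the simple regression deliverable, ordinary least squares with one predictor has a closed form that makes a good executable specification (plain Haskell, no library dependencies; `fitLine` is an illustrative name, not a committed API):

```haskell
-- Ordinary least squares for y = a + b*x, one predictor.
-- Slope b = cov(x, y) / var(x); intercept a = mean y - b * mean x.
fitLine :: [Double] -> [Double] -> (Double, Double)  -- (intercept, slope)
fitLine xs ys = (a, b)
  where
    n  = fromIntegral (length xs)
    mx = sum xs / n
    my = sum ys / n
    b  = sum (zipWith (\x y -> (x - mx) * (y - my)) xs ys)
       / sum (map (\x -> (x - mx) * (x - mx)) xs)
    a  = my - b * mx
```

The library version would operate on dataframe columns and report standard errors and goodness-of-fit alongside the coefficients.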

Phase 2 (Q4 2026 - Q1 2027) - Advanced Analytics

Owner: Community

Deliverables:

  1. Advanced Statistical Methods
    • Mixed effects models
    • Time series analysis (ARIMA, state space models)
    • Bayesian inference (integration with existing libraries)
    • Causal inference methods
    • Spatial statistics
  2. Visualization Expansion
    • Grammar of graphics implementation
    • Geographic/mapping support
    • Network visualization
    • 3D plotting capabilities

Pillar 3: Machine Learning & Deep Learning

Phase 1 (Q1-Q2 2026) - Integration

Owners: Hasktorch + dataframe teams

Goals:

  • Improve dataframe → tensor pipeline
  • Example-driven documentation

Deliverables:

  1. dataframe ↔ Hasktorch Bridge
    • Zero-copy conversion where possible
    • Automatic type mapping
    • GPU memory management
    • Batch loading utilities
  2. ML Workflow Examples with new unified library
    • End-to-end classification (Iris, MNIST)
    • Regression examples (California Housing)
    • Time series forecasting
    • NLP pipeline (text classification)
    • Computer vision (image classification)
  3. Data Preprocessing
    • Feature scaling/normalization
    • One-hot encoding
    • Missing value imputation
    • Train/test splitting
    • Cross-validation utilities
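Two of the preprocessing utilities above can be sketched directly in plain Haskell (illustrative names; the shipped versions would work on dataframe columns and shuffle before splitting):

```haskell
-- Min-max scaling to [0, 1]; a constant column maps to all zeros.
minMaxScale :: [Double] -> [Double]
minMaxScale xs
  | hi == lo  = map (const 0) xs
  | otherwise = map (\x -> (x - lo) / (hi - lo)) xs
  where
    lo = minimum xs
    hi = maximum xs

-- Deterministic train/test split by fraction (shuffle rows first
-- in real use; kept deterministic here for clarity).
trainTestSplit :: Double -> [a] -> ([a], [a])
trainTestSplit frac xs = splitAt n xs
  where n = round (frac * fromIntegral (length xs))
```

For example, `trainTestSplit 0.75` on four rows yields a three-row training set and a one-row test set.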

Phase 2 (Q3-Q4 2026) - Classical ML

Owner: Community (coordinate with Hasktorch)

Goals:

  • Fill gap between dataframe and deep learning
  • Provide a scikit-learn equivalent

Deliverables:

  1. haskell-ml-toolkit (new library)
    • Decision trees and random forests
    • Gradient boosting (XGBoost integration or native)
    • Support Vector Machines
    • K-means and hierarchical clustering
    • Dimensionality reduction (PCA, t-SNE, UMAP)
    • Model evaluation metrics
    • Hyperparameter optimization
  2. Feature Engineering
    • Automatic feature generation
    • Feature selection methods
    • Polynomial features
    • Text feature extraction

Phase 3 (Q1-Q2 2027) - Model Management

Owners: Hasktorch + community

Deliverables:

  1. Model Serialization & Versioning
    • Standard model format
    • Version tracking
    • Metadata storage
    • Model registry concept
  2. Model Deployment
    • REST API server templates
    • Batch prediction utilities
    • Model monitoring hooks
    • ONNX export for interoperability

Pillar 4: Distributed & Parallel Computing

Phase 1 (Q2-Q3 2026) - Core Integration

Owners: distributed-process + dataframe teams

Goals:

  • Enable distributed data processing
  • Provide MapReduce-style operations

Deliverables:

  1. Distributed DataFrame Operations
    • Distributed CSV/Parquet reading
    • Parallel groupby and aggregations
    • Distributed joins
    • Shuffle operations
    • Fault tolerance mechanisms
  2. distributed-ml (new library)
    • Distributed model training
    • Parameter servers
    • Data parallelism primitives
    • Model parallelism support
    • Integration with Hasktorch
  3. Examples & Patterns
    • Multi-node data processing
    • Distributed hyperparameter search
    • Large-scale model training
    • Stream processing patterns
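The MapReduce-style operations above follow a map/shuffle/reduce pattern whose shape can be shown single-node in plain Haskell, with each chunk standing in for one node's partition (illustrative sketch; in a real deployment each `mapPhase` would run on a distributed-process worker and the partial maps would be merged at a coordinator):

```haskell
import qualified Data.Map.Strict as Map

-- Map phase: count words within one partition.
mapPhase :: String -> Map.Map String Int
mapPhase chunk = Map.fromListWith (+) [(w, 1) | w <- words chunk]

-- Reduce phase: merge per-partition counts by summing.
reducePhase :: [Map.Map String Int] -> Map.Map String Int
reducePhase = Map.unionsWith (+)

-- End-to-end word count over a list of partitions.
wordCount :: [String] -> Map.Map String Int
wordCount = reducePhase . map mapPhase
```

Because `reducePhase` is an associative, commutative merge, partial results can arrive from workers in any order and be combined incrementally, which is exactly the property the fault tolerance mechanisms rely on.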

Phase 2 (Q4 2026 - Q1 2027) - Production Features

Owner: distributed-process team

Deliverables:

  1. Cluster Management
    • Node discovery and registration
    • Health monitoring
    • Resource allocation
    • Job scheduling
  2. Cloud Integration
    • AWS backend
    • Google Cloud backend
    • Kubernetes deployment patterns
    • Docker containerization templates

Pillar 5: Developer Experience

Phase 1 (Q1-Q2 2026) - Documentation Blitz

Owner: All maintainers + community

Goals:

  • Lower barrier to entry
  • Comprehensive learning path

Deliverables:

  1. DataHaskell Website Revamp
    • Clear getting started guide
    • Library comparison matrix
    • Migration guides (from Python, R)
    • Success stories
  2. Tutorial Series
    • Installation and setup (all platforms)
    • Your first data analysis
    • DataFrames deep dive
    • Machine learning workflow
    • Distributed computing basics
    • Production deployment
  3. Notebook Gallery
    • 20+ example notebooks covering:
      • Data cleaning and exploration
      • Statistical analysis
      • ML model building
      • Visualization
      • Domain-specific examples (finance, biology, etc.)

Phase 2 (Q3-Q4 2026) - Tooling

Owner: Community

Deliverables:

  1. datahaskell-cli (new tool)
    • Project scaffolding
    • Dependency management presets
    • Environment setup automation
    • Example project templates
  2. IDE Support Improvements
    • VS Code IHaskell support with the DataHaskell stack working out of the box
    • HLS integration guides
    • Debugging workflows
    • IHaskell kernel improvements
  3. Testing & CI Templates
    • Property-based testing examples
    • Benchmark suites
    • GitHub Actions templates
    • Continuous deployment patterns

Pillar 6: Community & Ecosystem

Ongoing Initiatives

Goals:

  • Grow contributor base
  • Foster collaboration
  • Drive adoption

Deliverables:

  1. Community Building
    • Monthly community calls (starting Q1 2026)
    • Discord/Slack workspace
    • Quarterly virtual conferences
    • Mentorship program
  2. Contribution Framework
    • Good first issues across all projects
    • Contribution guidelines
    • Code review standards
    • Recognition program
  3. Outreach
    • Blog post series
    • Conference talks (Haskell Symposium, ZuriHac, etc.)
    • Academic collaborations
    • Industry partnerships
  4. Package Standards
    • Naming conventions
    • API design guidelines
    • Documentation requirements
    • Testing standards
    • Version compatibility matrix

Success Metrics

Q2 2026

  • dataframe v1 released
  • 3 complete end-to-end tutorials published
  • Performance benchmarks showing ≥70% of Pandas speed
  • 5 integration examples between major libraries

Q4 2026

  • 10,000+ total library downloads/month across the ecosystem
  • 5+ active contributors
  • Performance parity (≥90%) with Pandas for common operations
  • Complete ML workflow from data to deployment documented

Q2 2027

  • 2+ companies using DataHaskell in production
  • DataHaskell track at major Haskell conference
  • 3+ published case studies
  • Comprehensive distributed computing examples

Q4 2027

  • Feature parity with Python’s core data science stack
  • 5+ production ML systems case studies
  • Enterprise support offerings available

Resource Requirements

Maintainer Coordination

  • Monthly sync: All pillar leads (1 hour)
  • Quarterly planning: Full maintainer group (2 hours)

Funding Needs (Optional but Helpful)

  1. Infrastructure
    • Benchmark server (GPU-enabled)
    • CI/CD resources
    • Documentation hosting
  2. Developer Support
    • Part-time technical writer
    • Maintainer stipends or grants
    • Summer of Haskell projects
  3. Events
    • Quarterly virtual meetups
    • Annual in-person hackathon
    • Conference sponsorships

Risk Mitigation

Technical Risks

| Risk | Mitigation |
|------|------------|
| Performance doesn’t match Python | Early benchmarking, profiling, and optimization sprints |
| Integration complexity | Defined interfaces, versioning strategy, compatibility tests |
| Breaking changes in dependencies | Conservative version bounds, testing matrix |

Community Risks

| Risk | Mitigation |
|------|------------|
| Maintainer burnout | Distributed ownership, recognition program, funding support |
| Fragmentation | Regular coordination, shared roadmap, integration testing |
| Slow adoption | Marketing efforts, case studies, migration guides |

Ecosystem Risks

| Risk | Mitigation |
|------|------------|
| GHC changes break libraries | Test against multiple GHC versions, engage with GHC team |
| Competing projects | Focus on collaboration, clear differentiation |
| Limited contributor pool | Mentorship, good documentation, welcoming community |


Decision Framework

When to add new libraries

Criteria:

  1. Fills clear gap in ecosystem
  2. Has committed maintainer
  3. Integrates with existing components
  4. Follows API design guidelines
  5. Includes comprehensive tests and docs

When to deprecate/consolidate

Criteria:

  1. Unmaintained for >6 months
  2. Better alternative exists
  3. Creates confusion in ecosystem

Version Compatibility Policy

  • Support last 2 major GHC versions
  • Follow the Haskell Package Versioning Policy (PVP)
  • Deprecation warnings for 2 releases before removal
  • Compatibility matrix published on website
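In practice, the bounds policy above translates into dependency ranges like the following (hypothetical .cabal fragment; the version numbers are illustrative only):

```cabal
build-depends:
    base      >=4.18 && <4.21,
    dataframe >=0.1  && <0.2
```

Upper bounds are bumped, and the compatibility matrix updated, as each new GHC and library major version is validated.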

Communication Plan

Internal (Maintainers)

  • Discord channel: Daily async communication
  • GitHub Discussions: Technical decisions, RFCs
  • Monthly video call: Roadmap progress, blockers
  • Quarterly planning session: Next phase priorities

External (Community)

  • Blog: Monthly progress updates
  • Twitter/Social: Weekly highlights
  • Haskell Discourse: Major announcements
  • Newsletter: Quarterly ecosystem update
  • Documentation: Always up-to-date

How to Use This Roadmap

This is a living document. We will:

  • Review quarterly and adjust priorities
  • Track progress in GitHub projects
  • Celebrate milestones publicly
  • Adapt based on community feedback

Questions? Open a discussion on GitHub or join our community calls.


Let’s build the future of data science in Haskell together! 🚀