DataHaskell Roadmap 2026-2027

Version: 1.0
Date: November 2025
Coordinators: DataHaskell Community
Key Partners: dataframe, Hasktorch, distributed-process


Executive Summary

This roadmap outlines the strategic direction for building a complete, production-ready Haskell data science ecosystem. With three major libraries already in active development—dataframe (data manipulation), Hasktorch (deep learning), and distributed-process (distributed computing)—we are positioned to create a cohesive platform that rivals Python, R, and Julia for data science workloads.

Vision

By 2027, DataHaskell will provide a high-performance, end-to-end data science toolkit that enables practitioners to build reliable machine learning systems from data ingestion through model deployment.

Core Principles

  1. Interoperability: Seamless integration between ecosystem components
  2. Performance: Match or exceed Python/R performance benchmarks
  3. Ergonomics: Intuitive APIs that lower the barrier to entry
  4. Production Ready: Focus on reliability, monitoring, and deployment
  5. Type Safety: Leverage Haskell’s type system (where possible) to catch errors at compile time

Current State Assessment

Strengths

  • dataframe: Modern dataframe library with IHaskell integration
  • Hasktorch: Mature deep learning library with PyTorch backend and GPU support
  • distributed-process: Battle-tested distributed computing framework
  • IHaskell: Haskell kernel for Jupyter notebooks
  • Strong functional programming foundations
  • Excellent parallelism and concurrency primitives

Gaps to Address

  • Small pool of active maintainers and contributors
  • Fragmented visualization ecosystem
  • Limited data I/O format support
  • Incomplete documentation and tutorials
  • Sparse integration examples between major libraries
  • Limited model deployment tooling

Critical Needs

  • Unified onboarding experience
  • Comprehensive benchmarking against Python/R
  • Production deployment patterns
  • Enterprise adoption case studies

Strategic Pillars

Pillar 1: Core Data Infrastructure

Phase 1 (Q1-Q2 2026) - Foundation

Owner: dataframe team

Goals:

  • Complete dataframe v1 release (March 2026)
  • Establish dataframe as the standard tabular data library
  • Performance parity with Pandas/Polars for common operations

Deliverables:

  1. dataframe v1.0.0
    • SQL-like API finalized
    • IHaskell integration complete
    • Type-safe column operations
    • Comprehensive test suite
    • Apache Arrow integration
  2. File Format Support
    • CSV/TSV (existing)
    • Parquet (high priority)
    • Arrow IPC format
    • Excel (xlsx)
    • JSON (nested structures)
    • HDF5 (coordination with scientific computing)
  3. Performance Benchmarks
    • Public benchmark suite comparing to:
      • Pandas
      • Polars
      • dplyr/tidyverse
    • Focus areas: filtering, grouping, joining, aggregations
    • Document optimization strategies
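To make the benchmark plan concrete, here is a minimal sketch of how the public suite could be structured with the criterion library. The `groupAndSum` function is a toy stand-in, not the dataframe API; the real suite would benchmark dataframe's own filter, groupby, and join entry points against equivalent Pandas/Polars scripts.

```haskell
import Criterion.Main (defaultMain, bench, bgroup, nf)
import qualified Data.Map.Strict as Map

-- Toy stand-in for the operation under test: group values by key
-- and sum each group.
groupAndSum :: [(Int, Double)] -> Map.Map Int Double
groupAndSum = Map.fromListWith (+)

main :: IO ()
main = defaultMain
  [ bgroup "groupby"
      [ bench "10k rows"  (nf groupAndSum rows10k)
      , bench "100k rows" (nf groupAndSum rows100k)
      ]
  ]
  where
    rows10k  = [ (i `mod` 100, fromIntegral i) | i <- [1 .. 10000  :: Int] ]
    rows100k = [ (i `mod` 100, fromIntegral i) | i <- [1 .. 100000 :: Int] ]
```

criterion reports mean times with bootstrapped confidence intervals, which keeps the cross-language comparison statistically defensible.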

Phase 2 (Q3-Q4 2026) - Expansion

Owner: dataframe + community

Goals:

  • Advanced data manipulation features
  • Out-of-core processing of datasets larger than memory
  • Integration with cloud database systems

Deliverables:

  1. Advanced Operations
    • Window functions
    • Rolling aggregations
    • Pivot/unpivot operations
    • Complex joins (anti, semi)
    • Reshaping operations (melt, cast)
  2. Cloud Storage & Database Connectivity
    • Read files from AWS/GCP/Azure
    • PostgreSQL integration
    • SQLite support
    • Query pushdown optimization
    • Streaming query results
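As an illustration of the rolling-aggregation deliverable, here is a pure sketch of a windowed mean. The names and list-based types are illustrative only; the real feature would operate on dataframe columns.

```haskell
import Data.List (tails)

-- Rolling mean over a window of size n, dropping the ragged
-- partial windows at the end of the input.
rollingMean :: Int -> [Double] -> [Double]
rollingMean n xs =
  [ sum w / fromIntegral n
  | w <- map (take n) (tails xs)
  , length w == n
  ]
```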

Pillar 2: Statistical Computing & Visualization

Phase 1 (Q2-Q3 2026) - Statistics Core

Owner: Community (needs maintainer)

Goals:

  • Create a unified machine learning library on top of Hasktorch and the statistics package
  • Create unified plotting API

Deliverables:

  1. statistics
    • Extend hypothesis testing (t-test, ANOVA)
    • Simple regression models (linear and logistic)
    • Generalized linear models (GLM)
    • Survival analysis basics
    • Integration with dataframe
  2. Plotting & Visualization
    • Option A: Extend hvega (Vega-Lite) with dataframe integration
    • Option B: Create native plotting library with backends
    • Priority features:
      • Scatter plots, line plots, bar charts
      • Histograms and distributions
      • Heatmaps and correlation plots
      • Interactive plots for notebooks
      • Export to PNG, SVG, PDF
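To ground the regression deliverable, here is the closed-form ordinary-least-squares fit for simple linear regression as a pure function. This is a sketch, not the statistics package's actual API; a dataframe-integrated version would accept column names rather than lists.

```haskell
-- Fit y = a + b*x by ordinary least squares, returning (intercept, slope).
linearFit :: [Double] -> [Double] -> (Double, Double)
linearFit xs ys = (a, b)
  where
    n  = fromIntegral (length xs)
    mx = sum xs / n                 -- mean of x
    my = sum ys / n                 -- mean of y
    -- slope = covariance(x, y) / variance(x)
    b  = sum (zipWith (\x y -> (x - mx) * (y - my)) xs ys)
       / sum (map (\x -> (x - mx) ^ (2 :: Int)) xs)
    a  = my - b * mx
```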

Phase 2 (Q4 2026 - Q1 2027) - Advanced Analytics

Owner: Community

Deliverables:

  1. Advanced Statistical Methods
    • Mixed effects models
    • Time series analysis (ARIMA, state space models)
    • Bayesian inference (integration with existing libraries)
    • Causal inference methods
    • Spatial statistics
  2. Visualization Expansion
    • Grammar of graphics implementation
    • Geographic/mapping support
    • Network visualization
    • 3D plotting capabilities

Pillar 3: Machine Learning & Deep Learning

Phase 1 (Q1-Q2 2026) - Integration

Owners: Hasktorch + dataframe teams

Goals:

  • Improve dataframe → tensor pipeline
  • Example-driven documentation

Deliverables:

  1. dataframe ↔ Hasktorch Bridge
    • Zero-copy conversion where possible
    • Automatic type mapping
    • GPU memory management
    • Batch loading utilities
  2. ML Workflow Examples with new unified library
    • End-to-end classification (Iris, MNIST)
    • Regression examples (California Housing)
    • Time series forecasting
    • NLP pipeline (text classification)
    • Computer vision (image classification)
  3. Data Preprocessing
    • Feature scaling/normalization
    • One-hot encoding
    • Missing value imputation
    • Train/test splitting
    • Cross-validation utilities
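Two of the preprocessing utilities above can be sketched as pure functions; the real versions would work on dataframe columns and take a random seed for shuffling, but the shapes are the same.

```haskell
-- Deterministic train/test split: the first (ratio * length) elements
-- become the training set. A real utility would shuffle first.
trainTestSplit :: Double -> [a] -> ([a], [a])
trainTestSplit ratio xs = splitAt k xs
  where k = round (ratio * fromIntegral (length xs))

-- Min-max scaling to [0, 1]. Assumes a non-constant column
-- (hi /= lo); a real implementation would handle that case.
minMaxScale :: [Double] -> [Double]
minMaxScale xs = map (\x -> (x - lo) / (hi - lo)) xs
  where
    lo = minimum xs
    hi = maximum xs
```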

Phase 2 (Q3-Q4 2026) - Classical ML

Owner: Community (coordinate with Hasktorch)

Goals:

  • Fill the gap between dataframe manipulation and deep learning
  • Provide a scikit-learn equivalent

Deliverables:

  1. haskell-ml-toolkit (new library)
    • Decision trees and random forests
    • Gradient boosting (XGBoost integration or native)
    • Support Vector Machines
    • K-means and hierarchical clustering
    • Dimensionality reduction (PCA, t-SNE, UMAP)
    • Model evaluation metrics
    • Hyperparameter optimization
  2. Feature Engineering
    • Automatic feature generation
    • Feature selection methods
    • Polynomial features
    • Text feature extraction

Phase 3 (Q1-Q2 2027) - Model Management

Owners: Hasktorch + community

Deliverables:

  1. Model Serialization & Versioning
    • Standard model format
    • Version tracking
    • Metadata storage
    • Model registry concept
  2. Model Deployment
    • REST API server templates
    • Batch prediction utilities
    • Model monitoring hooks
    • ONNX export for interoperability

Pillar 4: Distributed & Parallel Computing

Phase 1 (Q2-Q3 2026) - Core Integration

Owners: distributed-process + dataframe teams

Goals:

  • Enable distributed data processing
  • Provide MapReduce-style operations

Deliverables:

  1. Distributed DataFrame Operations
    • Distributed CSV/Parquet reading
    • Parallel groupby and aggregations
    • Distributed joins
    • Shuffle operations
    • Fault tolerance mechanisms
  2. distributed-ml (new library)
    • Distributed model training
    • Parameter servers
    • Data parallelism primitives
    • Model parallelism support
    • Integration with Hasktorch
  3. Examples & Patterns
    • Multi-node data processing
    • Distributed hyperparameter search
    • Large-scale model training
    • Stream processing patterns
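The MapReduce-style goal can be prototyped single-node first. The sketch below uses only containers (which ships with GHC); in the distributed version, countChunk would run on one distributed-process worker per chunk, and the merge would happen on the coordinator. The word-count example and partitioning are illustrative.

```haskell
import qualified Data.Map.Strict as Map

-- Single-node sketch of the MapReduce shape: count words per chunk
-- (map phase), then merge the partial counts (reduce phase).
mapReduceWordCount :: [String] -> Map.Map String Int
mapReduceWordCount chunks =
  Map.unionsWith (+) (map countChunk chunks)   -- reduce
  where
    countChunk s = Map.fromListWith (+) [ (w, 1) | w <- words s ]  -- map
```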

Phase 2 (Q4 2026 - Q1 2027) - Production Features

Owner: distributed-process team

Deliverables:

  1. Cluster Management
    • Node discovery and registration
    • Health monitoring
    • Resource allocation
    • Job scheduling
  2. Cloud Integration
    • AWS backend
    • Google Cloud backend
    • Kubernetes deployment patterns
    • Docker containerization templates

Pillar 5: Developer Experience

Phase 1 (Q1-Q2 2026) - Documentation Blitz

Owner: All maintainers + community

Goals:

  • Lower barrier to entry
  • Comprehensive learning path

Deliverables:

  1. DataHaskell Website Revamp
    • Clear getting started guide
    • Library comparison matrix
    • Migration guides (from Python, R)
    • Success stories
  2. Tutorial Series
    • Installation and setup (all platforms)
    • Your first data analysis
    • DataFrames deep dive
    • Machine learning workflow
    • Distributed computing basics
    • Production deployment
  3. Notebook Gallery
    • 20+ example notebooks covering:
      • Data cleaning and exploration
      • Statistical analysis
      • ML model building
      • Visualization
      • Domain-specific examples (finance, biology, etc.)

Phase 2 (Q3-Q4 2026) - Tooling

Owner: Community

Deliverables:

  1. datahaskell-cli (new tool)
    • Project scaffolding
    • Dependency management presets
    • Environment setup automation
    • Example project templates
  2. IDE Support Improvements
    • VSCode IHaskell support, with the DataHaskell stack working out of the box
    • HLS integration guides
    • Debugging workflows
    • IHaskell kernel improvements
  3. Testing & CI Templates
    • Property-based testing examples
    • Benchmark suites
    • GitHub Actions templates
    • Continuous deployment patterns
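As a seed for the property-based-testing templates, here is a QuickCheck property over a hypothetical windowing helper. The invariant, that a window of size n over m elements yields max 0 (m - n + 1) windows, is stated once and QuickCheck searches for counterexamples; ecosystem libraries would state their own laws in the same mold.

```haskell
import Test.QuickCheck (quickCheck)

-- Property: the number of full windows of size n over xs is
-- max 0 (length xs - n + 1); window sizes below 1 are out of scope.
prop_windowCount :: Int -> [Double] -> Bool
prop_windowCount n xs =
  n < 1 || length (windows n xs) == max 0 (length xs - n + 1)
  where
    windows k ys
      | length ys < k = []
      | otherwise     = take k ys : windows k (tail ys)

main :: IO ()
main = quickCheck prop_windowCount
```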

Pillar 6: Community & Ecosystem

Ongoing Initiatives

Goals:

  • Grow contributor base
  • Foster collaboration
  • Drive adoption

Deliverables:

  1. Community Building
    • Monthly community calls (starting Q1 2026)
    • Discord/Slack workspace
    • Quarterly virtual conferences
    • Mentorship program
  2. Contribution Framework
    • Good first issues across all projects
    • Contribution guidelines
    • Code review standards
    • Recognition program
  3. Outreach
    • Blog post series
    • Conference talks (Haskell Symposium, ZuriHac, etc.)
    • Academic collaborations
    • Industry partnerships
  4. Package Standards
    • Naming conventions
    • API design guidelines
    • Documentation requirements
    • Testing standards
    • Version compatibility matrix

Success Metrics

Q2 2026

  • dataframe v1 released
  • 3 complete end-to-end tutorials published
  • Performance benchmarks showing ≥70% of Pandas speed
  • 5 integration examples between major libraries

Q4 2026

  • 10,000+ total library downloads/month across ecosystem
  • 5+ active contributors
  • Performance parity (≥90%) with Pandas for common operations
  • Complete ML workflow from data to deployment documented

Q2 2027

  • 2+ companies using DataHaskell
  • DataHaskell track at major Haskell conference
  • 3+ published case studies
  • Comprehensive distributed computing examples

Q4 2027

  • Feature completeness with Python’s core data science stack
  • 5+ production ML systems case studies
  • Enterprise support offerings available

Resource Requirements

Maintainer Coordination

  • Monthly sync: All pillar leads (1 hour)
  • Quarterly planning: Full maintainer group (2 hours)

Funding Needs (Optional but Helpful)

  1. Infrastructure
    • Benchmark server (GPU-enabled)
    • CI/CD resources
    • Documentation hosting
  2. Developer Support
    • Part-time technical writer
    • Maintainer stipends or grants
    • Summer of Haskell projects
  3. Events
    • Quarterly virtual meetups
    • Annual in-person hackathon
    • Conference sponsorships

Risk Mitigation

Technical Risks

  • Performance doesn’t match Python → early benchmarking, profiling, and optimization sprints
  • Integration complexity → defined interfaces, versioning strategy, compatibility tests
  • Breaking changes in dependencies → conservative version bounds, testing matrix

Community Risks

  • Maintainer burnout → distributed ownership, recognition program, funding support
  • Fragmentation → regular coordination, shared roadmap, integration testing
  • Slow adoption → marketing efforts, case studies, migration guides

Ecosystem Risks

  • GHC changes break libraries → test against multiple GHC versions, engage with GHC team
  • Competing projects → focus on collaboration, clear differentiation
  • Limited contributor pool → mentorship, good documentation, welcoming community

Decision Framework

When to add new libraries

Criteria:

  1. Fills clear gap in ecosystem
  2. Has committed maintainer
  3. Integrates with existing components
  4. Follows API design guidelines
  5. Includes comprehensive tests and docs

When to deprecate/consolidate

Criteria:

  1. Unmaintained for >6 months
  2. Better alternative exists
  3. Creates confusion in ecosystem

Version Compatibility Policy

  • Support last 2 major GHC versions
  • Follow the Haskell Package Versioning Policy (PVP)
  • Deprecation warnings for 2 releases before removal
  • Compatibility matrix published on website
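Illustratively, PVP-style bounds in a package's .cabal file look like this; the package names and version numbers below are placeholders, not a prescribed set. The upper bounds prevent a breaking major release of a dependency from being picked up silently.

```cabal
build-depends:
    base      >=4.18 && <4.21,
    dataframe >=0.1  && <0.2,
    hasktorch >=0.2  && <0.3
```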

Communication Plan

Internal (Maintainers)

  • Discord channel: Daily async communication
  • GitHub Discussions: Technical decisions, RFCs
  • Monthly video call: Roadmap progress, blockers
  • Quarterly planning session: Next phase priorities

External (Community)

  • Blog: Monthly progress updates
  • Twitter/Social: Weekly highlights
  • Haskell Discourse: Major announcements
  • Newsletter: Quarterly ecosystem update
  • Documentation: Always up-to-date

How to Use This Roadmap

This is a living document. We will:

  • Review quarterly and adjust priorities
  • Track progress in GitHub projects
  • Celebrate milestones publicly
  • Adapt based on community feedback

Questions? Open a discussion on GitHub or join our community calls.


Let’s build the future of data science in Haskell together! 🚀