DataHaskell Roadmap 2026-2027

Version: 1.0
Date: November 2026
Coordinators: DataHaskell Community
Key Partners: dataframe, Hasktorch, distributed-process


Executive Summary

This roadmap outlines the strategic direction for building a complete, production-ready Haskell data science ecosystem. With three foundational libraries already in active development—dataframe (data manipulation), Hasktorch (deep learning), and distributed-process (distributed computing)—we are positioned to create a cohesive platform that rivals Python, R, and Julia for data science workloads.

Vision

By 2027, DataHaskell will provide a high-performance, end-to-end data science toolkit that enables practitioners to build reliable machine learning systems from data ingestion through model deployment.

Core Principles

  1. Interoperability: Seamless integration between ecosystem components
  2. Performance: Match or exceed Python/R performance benchmarks
  3. Ergonomics: Intuitive APIs that lower the barrier to entry
  4. Production Ready: Focus on reliability, monitoring, and deployment
  5. Type Safety: Leverage Haskell’s type system (where possible) to catch errors at compile time

Current State Assessment

Strengths

  • dataframe: Modern dataframe library with IHaskell integration
  • Hasktorch: Mature deep learning library built on libtorch (the PyTorch C++ backend) with GPU support
  • distributed-process: Battle-tested distributed computing framework
  • IHaskell: Haskell kernel for Jupyter notebooks
  • Strong functional programming foundations
  • Excellent parallelism and concurrency primitives

Gaps to Address

  • Small pool of active maintainers and contributors
  • Fragmented visualization ecosystem
  • Limited data I/O format support
  • Incomplete documentation and tutorials
  • Sparse integration examples between major libraries
  • Limited model deployment tooling

Critical Needs

  • Unified onboarding experience
  • Comprehensive benchmarking against Python/R
  • Production deployment patterns
  • Enterprise adoption case studies

Strategic Pillars

Pillar 1: Core Data Infrastructure

Phase 1 (Q1-Q2 2026) - Foundation

Owner: dataframe team

Goals:

  • Complete dataframe v1 release (March 2026)
  • Establish dataframe as the standard tabular data library
  • Performance parity with Pandas/Polars for common operations

Deliverables:

  1. dataframe v1.0.0
    • SQL-like API finalized
    • IHaskell integration complete
    • Type-safe column operations
    • Comprehensive test suite
    • Apache Arrow integration
  2. File Format Support
    • CSV/TSV (existing)
    • Parquet (high priority)
    • Arrow IPC format
    • Excel (xlsx)
    • JSON (nested structures)
    • HDF5 (coordination with scientific computing)
  3. Performance Benchmarks
    • Public benchmark suite comparing to:
      • Pandas
      • Polars
      • dplyr/tidyverse
    • Focus areas: filtering, grouping, joining, aggregations
    • Document optimization strategies
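The grouping and aggregation operations targeted by these benchmarks have simple reference semantics that any dataframe implementation must reproduce. A minimal sketch in plain Haskell (the names `sumByKey` and `salesByRegion` and the sample data are illustrative, not part of the dataframe API):

```haskell
import qualified Data.Map.Strict as Map

-- Group-by-key-and-sum: the reference semantics a dataframe
-- group-by/aggregate must match (and beat on speed).
sumByKey :: (Ord k, Num v) => [(k, v)] -> Map.Map k v
sumByKey = Map.fromListWith (+)

-- Illustrative sample data: (region, sale amount).
sales :: [(String, Double)]
sales = [("EU", 10), ("US", 20), ("EU", 5), ("APAC", 7)]

salesByRegion :: Map.Map String Double
salesByRegion = sumByKey sales  -- APAC -> 7, EU -> 15, US -> 20
```

The benchmark suite can use executable specifications like this as correctness oracles while measuring the optimized column-oriented implementations.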

Phase 2 (Q3-Q4 2026) - Expansion

Owner: dataframe + community

Goals:

  • Advanced data manipulation features
  • Out-of-core computation on datasets larger than memory
  • Integration with cloud database systems

Deliverables:

  1. Advanced Operations
    • Window functions
    • Rolling aggregations
    • Pivot/unpivot operations
    • Complex joins (anti, semi)
    • Reshaping operations (melt, cast)
  2. Cloud Database Connectivity
    • Read files from cloud object storage (AWS S3, GCS, Azure Blob)
    • PostgreSQL integration
    • SQLite support
    • Query pushdown optimization
    • Streaming query results
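Rolling aggregations, one of the advanced operations above, also have a compact reference semantics. A sketch in plain Haskell (illustrative spec, not the planned dataframe API; a real implementation would use a sliding accumulator rather than recomputing each window):

```haskell
-- Reference semantics for a size-k rolling sum over a column.
-- Windows shorter than k at the end of the column are dropped.
rollingSum :: Num a => Int -> [a] -> [a]
rollingSum k xs
  | k <= 0    = []
  | otherwise = map sum (windows xs)
  where
    windows ys
      | length ys < k = []
      | otherwise     = take k ys : windows (drop 1 ys)
```

For example, `rollingSum 3 [1, 2, 3, 4, 5]` yields the sums of the three overlapping windows.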

Pillar 2: Statistical Computing & Visualization

Phase 1 (Q2-Q3 2026) - Statistics Core

Owner: Community (needs maintainer)

Goals:

  • Create a unified machine learning library on top of Hasktorch and the statistics package
  • Create unified plotting API

Deliverables:

  1. statistics
    • Extend hypothesis testing (t-test, ANOVA)
    • Simple regression models (linear and logistic)
    • Generalized linear models (GLM)
    • Survival analysis basics
    • Integration with dataframe
  2. Plotting & Visualization
    • Option A: Extend hvega (Vega-Lite) with dataframe integration
    • Option B: Create native plotting library with backends
    • Priority features:
      • Scatter plots, line plots, bar charts
      • Histograms and distributions
      • Heatmaps and correlation plots
      • Interactive plots for notebooks
      • Export to PNG, SVG, PDF
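For the simple regression deliverable, ordinary least squares with one predictor has a closed form that makes a good executable specification (plain Haskell, no library dependencies; `fitLine` is an illustrative name, not a committed API):

```haskell
-- Ordinary least squares for y = a + b*x, one predictor.
-- Slope b = cov(x, y) / var(x); intercept a = mean y - b * mean x.
fitLine :: [Double] -> [Double] -> (Double, Double)  -- (intercept, slope)
fitLine xs ys = (a, b)
  where
    n  = fromIntegral (length xs)
    mx = sum xs / n
    my = sum ys / n
    b  = sum (zipWith (\x y -> (x - mx) * (y - my)) xs ys)
       / sum (map (\x -> (x - mx) * (x - mx)) xs)
    a  = my - b * mx
```

The library version would operate on dataframe columns and report standard errors and goodness-of-fit alongside the coefficients.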

Phase 2 (Q4 2026 - Q1 2027) - Advanced Analytics

Owner: Community

Deliverables:

  1. Advanced Statistical Methods
    • Mixed effects models
    • Time series analysis (ARIMA, state space models)
    • Bayesian inference (integration with existing libraries)
    • Causal inference methods
    • Spatial statistics
  2. Visualization Expansion
    • Grammar of graphics implementation
    • Geographic/mapping support
    • Network visualization
    • 3D plotting capabilities

Pillar 3: Machine Learning & Deep Learning

Phase 1 (Q1-Q2 2026) - Integration

Owners: Hasktorch + dataframe teams

Goals:

  • Improve dataframe → tensor pipeline
  • Example-driven documentation

Deliverables:

  1. dataframe ↔ Hasktorch Bridge
    • Zero-copy conversion where possible
    • Automatic type mapping
    • GPU memory management
    • Batch loading utilities
  2. ML Workflow Examples with new unified library
    • End-to-end classification (Iris, MNIST)
    • Regression examples (California Housing)
    • Time series forecasting
    • NLP pipeline (text classification)
    • Computer vision (image classification)
  3. Data Preprocessing
    • Feature scaling/normalization
    • One-hot encoding
    • Missing value imputation
    • Train/test splitting
    • Cross-validation utilities
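Two of the preprocessing utilities above can be sketched directly in plain Haskell (illustrative names; the shipped versions would work on dataframe columns and shuffle before splitting):

```haskell
-- Min-max scaling to [0, 1]; a constant column maps to all zeros.
minMaxScale :: [Double] -> [Double]
minMaxScale xs
  | hi == lo  = map (const 0) xs
  | otherwise = map (\x -> (x - lo) / (hi - lo)) xs
  where
    lo = minimum xs
    hi = maximum xs

-- Deterministic train/test split by fraction (shuffle rows first
-- in real use; kept deterministic here for clarity).
trainTestSplit :: Double -> [a] -> ([a], [a])
trainTestSplit frac xs = splitAt n xs
  where n = round (frac * fromIntegral (length xs))
```

For example, `trainTestSplit 0.75` on four rows yields a three-row training set and a one-row test set.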

Phase 2 (Q3-Q4 2026) - Classical ML

Owner: Community (coordinate with Hasktorch)

Goals:

  • Fill gap between dataframe and deep learning
  • Provide a scikit-learn equivalent

Deliverables:

  1. haskell-ml-toolkit (new library)
    • Decision trees and random forests
    • Gradient boosting (XGBoost integration or native)
    • Support Vector Machines
    • K-means and hierarchical clustering
    • Dimensionality reduction (PCA, t-SNE, UMAP)
    • Model evaluation metrics
    • Hyperparameter optimization
  2. Feature Engineering
    • Automatic feature generation
    • Feature selection methods
    • Polynomial features
    • Text feature extraction

Phase 3 (Q1-Q2 2027) - Model Management

Owners: Hasktorch + community

Deliverables:

  1. Model Serialization & Versioning
    • Standard model format
    • Version tracking
    • Metadata storage
    • Model registry concept
  2. Model Deployment
    • REST API server templates
    • Batch prediction utilities
    • Model monitoring hooks
    • ONNX export for interoperability

Pillar 4: Distributed & Parallel Computing

Phase 1 (Q2-Q3 2026) - Core Integration

Owners: distributed-process + dataframe teams

Goals:

  • Enable distributed data processing
  • Provide MapReduce-style operations

Deliverables:

  1. Distributed DataFrame Operations
    • Distributed CSV/Parquet reading
    • Parallel groupby and aggregations
    • Distributed joins
    • Shuffle operations
    • Fault tolerance mechanisms
  2. distributed-ml (new library)
    • Distributed model training
    • Parameter servers
    • Data parallelism primitives
    • Model parallelism support
    • Integration with Hasktorch
  3. Examples & Patterns
    • Multi-node data processing
    • Distributed hyperparameter search
    • Large-scale model training
    • Stream processing patterns
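The MapReduce-style operations above follow a map/shuffle/reduce pattern whose shape can be shown single-node in plain Haskell, with each chunk standing in for one node's partition (illustrative sketch; in a real deployment each `mapPhase` would run on a distributed-process worker and the partial maps would be merged at a coordinator):

```haskell
import qualified Data.Map.Strict as Map

-- Map phase: count words within one partition.
mapPhase :: String -> Map.Map String Int
mapPhase chunk = Map.fromListWith (+) [(w, 1) | w <- words chunk]

-- Reduce phase: merge per-partition counts by summing.
reducePhase :: [Map.Map String Int] -> Map.Map String Int
reducePhase = Map.unionsWith (+)

-- End-to-end word count over a list of partitions.
wordCount :: [String] -> Map.Map String Int
wordCount = reducePhase . map mapPhase
```

Because `reducePhase` is an associative, commutative merge, partial results can arrive from workers in any order and be combined incrementally, which is exactly the property the fault tolerance mechanisms rely on.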

Phase 2 (Q4 2026 - Q1 2027) - Production Features

Owner: distributed-process team

Deliverables:

  1. Cluster Management
    • Node discovery and registration
    • Health monitoring
    • Resource allocation
    • Job scheduling
  2. Cloud Integration
    • AWS backend
    • Google Cloud backend
    • Kubernetes deployment patterns
    • Docker containerization templates

Pillar 5: Developer Experience

Phase 1 (Q1-Q2 2026) - Documentation Blitz

Owner: All maintainers + community

Goals:

  • Lower barrier to entry
  • Comprehensive learning path

Deliverables:

  1. DataHaskell Website Revamp
    • Clear getting started guide
    • Library comparison matrix
    • Migration guides (from Python, R)
    • Success stories
  2. Tutorial Series
    • Installation and setup (all platforms)
    • Your first data analysis
    • DataFrames deep dive
    • Machine learning workflow
    • Distributed computing basics
    • Production deployment
  3. Notebook Gallery
    • 20+ example notebooks covering:
      • Data cleaning and exploration
      • Statistical analysis
      • ML model building
      • Visualization
      • Domain-specific examples (finance, biology, etc.)

Phase 2 (Q3-Q4 2026) - Tooling

Owner: Community

Deliverables:

  1. datahaskell-cli (new tool)
    • Project scaffolding
    • Dependency management presets
    • Environment setup automation
    • Example project templates
  2. IDE Support Improvements
    • VS Code IHaskell support with the DataHaskell stack working out of the box
    • HLS integration guides
    • Debugging workflows
    • IHaskell kernel improvements
  3. Testing & CI Templates
    • Property-based testing examples
    • Benchmark suites
    • GitHub Actions templates
    • Continuous deployment patterns

Pillar 6: Community & Ecosystem

Ongoing Initiatives

Goals:

  • Grow contributor base
  • Foster collaboration
  • Drive adoption

Deliverables:

  1. Community Building
    • Monthly community calls (starting Q1 2026)
    • Discord/Slack workspace
    • Quarterly virtual conferences
    • Mentorship program
  2. Contribution Framework
    • Good first issues across all projects
    • Contribution guidelines
    • Code review standards
    • Recognition program
  3. Outreach
    • Blog post series
    • Conference talks (Haskell Symposium, ZuriHac, etc.)
    • Academic collaborations
    • Industry partnerships
  4. Package Standards
    • Naming conventions
    • API design guidelines
    • Documentation requirements
    • Testing standards
    • Version compatibility matrix

Success Metrics

Q2 2026

  • dataframe v1 released
  • 3 complete end-to-end tutorials published
  • Performance benchmarks showing ≥70% of Pandas speed
  • 5 integration examples between major libraries

Q4 2026

  • 10,000+ total library downloads/month across the ecosystem
  • 5+ active contributors
  • Performance parity (≥90%) with Pandas for common operations
  • Complete ML workflow from data to deployment documented

Q2 2027

  • 2+ companies using DataHaskell in production
  • DataHaskell track at major Haskell conference
  • 3+ published case studies
  • Comprehensive distributed computing examples

Q4 2027

  • Feature parity with Python’s core data science stack
  • 5+ production ML systems case studies
  • Enterprise support offerings available

Resource Requirements

Maintainer Coordination

  • Monthly sync: All pillar leads (1 hour)
  • Quarterly planning: Full maintainer group (2 hours)

Funding Needs (Optional but Helpful)

  1. Infrastructure
    • Benchmark server (GPU-enabled)
    • CI/CD resources
    • Documentation hosting
  2. Developer Support
    • Part-time technical writer
    • Maintainer stipends or grants
    • Summer of Haskell projects
  3. Events
    • Quarterly virtual meetups
    • Annual in-person hackathon
    • Conference sponsorships

Risk Mitigation

Technical Risks

| Risk | Mitigation |
|------|------------|
| Performance doesn’t match Python | Early benchmarking, profiling, and optimization sprints |
| Integration complexity | Defined interfaces, versioning strategy, compatibility tests |
| Breaking changes in dependencies | Conservative version bounds, testing matrix |

Community Risks

| Risk | Mitigation |
|------|------------|
| Maintainer burnout | Distributed ownership, recognition program, funding support |
| Fragmentation | Regular coordination, shared roadmap, integration testing |
| Slow adoption | Marketing efforts, case studies, migration guides |

Ecosystem Risks

| Risk | Mitigation |
|------|------------|
| GHC changes break libraries | Test against multiple GHC versions, engage with GHC team |
| Competing projects | Focus on collaboration, clear differentiation |
| Limited contributor pool | Mentorship, good documentation, welcoming community |


Decision Framework

When to add new libraries

Criteria:

  1. Fills clear gap in ecosystem
  2. Has committed maintainer
  3. Integrates with existing components
  4. Follows API design guidelines
  5. Includes comprehensive tests and docs

When to deprecate/consolidate

Criteria:

  1. Unmaintained for >6 months
  2. Better alternative exists
  3. Creates confusion in ecosystem

Version Compatibility Policy

  • Support last 2 major GHC versions
  • Follow the Haskell Package Versioning Policy (PVP)
  • Deprecation warnings for 2 releases before removal
  • Compatibility matrix published on website
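In practice, the bounds policy above translates into dependency ranges like the following (hypothetical .cabal fragment; the version numbers are illustrative only):

```cabal
build-depends:
    base      >=4.18 && <4.21,
    dataframe >=0.1  && <0.2
```

Upper bounds are bumped, and the compatibility matrix updated, as each new GHC and library major version is validated.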

Communication Plan

Internal (Maintainers)

  • Discord channel: Daily async communication
  • GitHub Discussions: Technical decisions, RFCs
  • Monthly video call: Roadmap progress, blockers
  • Quarterly planning session: Next phase priorities

External (Community)

  • Blog: Monthly progress updates
  • Twitter/Social: Weekly highlights
  • Haskell Discourse: Major announcements
  • Newsletter: Quarterly ecosystem update
  • Documentation: Always up-to-date

How to Use This Roadmap

This is a living document. We will:

  • Review quarterly and adjust priorities
  • Track progress in GitHub projects
  • Celebrate milestones publicly
  • Adapt based on community feedback

Questions? Open a discussion on GitHub or join our community calls.


Let’s build the future of data science in Haskell together! 🚀