DataHaskell Roadmap 2026-2027

Version: 1.0
Date: November 2026
Coordinators: DataHaskell Community
Key Partners: dataframe, Hasktorch, distributed-process


Executive Summary

This roadmap outlines the strategic direction for building a complete, production-ready Haskell data science ecosystem. With three major pillars already in active development (dataframe for data manipulation, Hasktorch for deep learning, and distributed-process for distributed computing), we are positioned to create a cohesive platform that rivals Python, R, and Julia for data science workloads.

Vision

By 2027, DataHaskell will provide a high-performance, end-to-end data science toolkit that enables practitioners to build reliable machine learning systems from data ingestion through model deployment.

Core Principles

  1. Type Safety First: Leverage Haskell's type system to catch errors at compile time
  2. Interoperability: Seamless integration between ecosystem components
  3. Performance: Match or exceed Python/R performance benchmarks
  4. Ergonomics: Intuitive APIs that lower the barrier to entry
  5. Production Ready: Focus on reliability, monitoring, and deployment

Current State Assessment

🟢 Strengths

  • dataframe (v0.1 launch March 5): Modern, type-safe dataframe library with IHaskell integration
  • Hasktorch: Mature deep learning library with PyTorch backend and GPU support
  • distributed-process: Battle-tested distributed computing framework
  • Strong functional programming foundations
  • Excellent parallelism and concurrency primitives

🟡 Gaps to Address

  • Fragmented visualization ecosystem
  • Limited data I/O format support
  • Incomplete documentation and tutorials
  • Sparse integration examples between major libraries
  • Limited model deployment tooling

🔴 Critical Needs

  • Unified onboarding experience
  • Comprehensive benchmarking against Python/R
  • Production deployment patterns
  • Enterprise adoption case studies

Strategic Pillars

Pillar 1: Core Data Infrastructure

Phase 1 (Q1-Q2 2026) - Foundation

Owner: dataframe team

Goals:

  • ✅ Complete dataframe v0.1 release (March 2026)
  • Establish dataframe as the standard tabular data library
  • Approach performance parity with Pandas/Polars for common operations

Deliverables:

  1. dataframe v0.1.0
    • SQL-like API finalized
    • IHaskell integration complete
    • Type-safe column operations
    • Comprehensive test suite
    • Apache Arrow integration
  2. File Format Support
    • CSV/TSV (existing)
    • Parquet (high priority)
    • Arrow IPC format
    • Excel (xlsx)
    • JSON (nested structures)
    • HDF5 (coordination with scientific computing)
  3. Performance Benchmarks
    • Public benchmark suite comparing to:
      • Pandas
      • Polars
      • dplyr/tidyverse
    • Focus areas: filtering, grouping, joining, aggregations
    • Document optimization strategies
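The focus areas above (filtering, grouping, aggregation) correspond to operations like the following. This is a library-agnostic sketch in plain Haskell over a list of rows, using only base and containers; it illustrates the semantics being benchmarked, not the dataframe API itself, and all names here are invented for illustration.

```haskell
import qualified Data.Map.Strict as Map

-- A toy row type standing in for a dataframe row.
data Sale = Sale { region :: String, amount :: Double }

-- Filtering: keep rows whose amount exceeds a threshold.
bigSales :: [Sale] -> [Sale]
bigSales = filter ((> 100) . amount)

-- Grouping + aggregation: total amount per region.
totalByRegion :: [Sale] -> Map.Map String Double
totalByRegion rows =
  Map.fromListWith (+) [ (region r, amount r) | r <- rows ]
```

A dataframe library performs the same logical operations over columnar storage rather than row lists, which is exactly where the Pandas/Polars comparison matters.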

Phase 2 (Q3-Q4 2026) - Expansion

Owner: dataframe + community

Goals:

  • Advanced data manipulation features
  • Integration with database systems
  • Time series support

Deliverables:

  1. Advanced Operations
    • Window functions
    • Rolling aggregations
    • Pivot/unpivot operations
    • Complex joins (anti, semi)
    • Reshaping operations (melt, cast)
  2. Database Connectivity
    • PostgreSQL integration
    • SQLite support
    • Query pushdown optimization
    • Streaming query results
  3. Time Series Extensions
    • Date/time indexing
    • Resampling operations
    • Time-based rolling windows
    • Timezone handling
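As a concrete reference point for the rolling-window deliverables, here is a minimal count-based rolling mean in plain Haskell; a time-based window would select rows by timestamp range instead of a fixed count, but the aggregation shape is the same.

```haskell
import Data.List (tails)

-- Rolling mean over a fixed window of n consecutive values.
-- One result is produced per complete window; incomplete
-- trailing windows are dropped.
rollingMean :: Int -> [Double] -> [Double]
rollingMean n xs =
  [ sum w / fromIntegral n
  | w <- map (take n) (tails xs)
  , length w == n
  ]
```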

Pillar 2: Statistical Computing & Visualization

Phase 1 (Q2-Q3 2026) - Statistics Core

Owner: Community (needs maintainer)

Goals:

  • Establish comprehensive statistics library
  • Create unified plotting API

Deliverables:

  1. statistics-next (modernize existing library)
    • Descriptive statistics
    • Hypothesis testing (t-test, ANOVA, chi-square)
    • Linear regression
    • Generalized linear models (GLM)
    • Survival analysis basics
    • Integration with dataframe
  2. Plotting & Visualization
    • Option A: Extend hvega (Vega-Lite) with dataframe integration
    • Option B: Create native plotting library with backends
    • Priority features:
      • Scatter plots, line plots, bar charts
      • Histograms and distributions
      • Heatmaps and correlation plots
      • Interactive plots for notebooks
      • Export to PNG, SVG, PDF
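Whichever plotting option is chosen, histogram support needs binning logic that can live outside the renderer, so both Option A and Option B can share it. A minimal equal-width binning sketch:

```haskell
-- Count values into n equal-width bins over [lo, hi].
-- Values outside the range are ignored; the top edge is
-- included in the last bin, matching common histogram
-- conventions.
histogram :: Int -> (Double, Double) -> [Double] -> [Int]
histogram n (lo, hi) xs =
  [ length (filter (inBin i) xs) | i <- [0 .. n - 1] ]
  where
    width = (hi - lo) / fromIntegral n
    inBin i x
      | i == n - 1 = x >= lo + fromIntegral i * width && x <= hi
      | otherwise  = x >= lo + fromIntegral i * width
                     && x <  lo + fromIntegral (i + 1) * width
```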

Phase 2 (Q4 2026 - Q1 2027) - Advanced Analytics

Owner: Community

Deliverables:

  1. Advanced Statistical Methods
    • Mixed effects models
    • Time series analysis (ARIMA, state space models)
    • Bayesian inference (integration with existing libraries)
    • Causal inference methods
    • Spatial statistics
  2. Visualization Expansion
    • Grammar of graphics implementation
    • Geographic/mapping support
    • Network visualization
    • 3D plotting capabilities

Pillar 3: Machine Learning & Deep Learning

Phase 1 (Q1-Q2 2026) - Integration

Owners: Hasktorch + dataframe teams

Goals:

  • Seamless dataframe → tensor pipeline
  • Example-driven documentation

Deliverables:

  1. dataframe ↔ Hasktorch Bridge
    • Zero-copy conversion where possible
    • Automatic type mapping
    • GPU memory management
    • Batch loading utilities
  2. ML Workflow Examples
    • End-to-end classification (Iris, MNIST)
    • Regression examples (California Housing)
    • Time series forecasting
    • NLP pipeline (text classification)
    • Computer vision (image classification)
  3. Data Preprocessing
    • Feature scaling/normalization
    • One-hot encoding
    • Missing value imputation
    • Train/test splitting
    • Cross-validation utilities
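Two of the preprocessing utilities above, sketched in plain Haskell to pin down intended behavior. These are illustrative stand-ins, not the planned API: a real implementation would operate on dataframe columns and shuffle with a seeded RNG before splitting.

```haskell
-- Deterministic train/test split: the first (ratio * length)
-- rows become training data, the rest test data.
trainTestSplit :: Double -> [a] -> ([a], [a])
trainTestSplit ratio xs = splitAt n xs
  where n = round (ratio * fromIntegral (length xs))

-- Min-max scaling of a numeric feature into [0, 1].
-- A constant column scales to all zeros.
minMaxScale :: [Double] -> [Double]
minMaxScale xs
  | hi == lo  = map (const 0) xs
  | otherwise = map (\x -> (x - lo) / (hi - lo)) xs
  where
    lo = minimum xs
    hi = maximum xs
```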

Phase 2 (Q3-Q4 2026) - Classical ML

Owner: Community (coordinate with Hasktorch)

Goals:

  • Fill gap between dataframe and deep learning
  • Provide scikit-learn equivalent

Deliverables:

  1. haskell-ml-toolkit (new library)
    • Decision trees and random forests
    • Gradient boosting (XGBoost integration or native)
    • Support Vector Machines
    • K-means and hierarchical clustering
    • Dimensionality reduction (PCA, t-SNE, UMAP)
    • Model evaluation metrics
    • Hyperparameter optimization
  2. Feature Engineering
    • Automatic feature generation
    • Feature selection methods
    • Polynomial features
    • Text feature extraction
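As an illustration of the polynomial-features item, a degree-2 expansion over a feature vector: the original features plus all pairwise products (including squares), with no bias term. The function name is hypothetical.

```haskell
-- Degree-2 polynomial feature expansion.
-- For [x1, x2] this yields [x1, x2, x1*x1, x1*x2, x2*x2].
polyFeatures2 :: [Double] -> [Double]
polyFeatures2 xs =
  xs ++ [ xi * xj
        | (i, xi) <- zip [0 :: Int ..] xs
        , (j, xj) <- zip [0 ..] xs
        , i <= j
        ]
```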

Phase 3 (Q1-Q2 2027) - Model Management

Owners: Hasktorch + community

Deliverables:

  1. Model Serialization & Versioning
    • Standard model format
    • Version tracking
    • Metadata storage
    • Model registry concept
  2. Model Deployment
    • REST API server templates
    • Batch prediction utilities
    • Model monitoring hooks
    • ONNX export for interoperability
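The standard model format is still to be designed; the sketch below only pins down the minimum a save/load roundtrip must preserve (version, metadata, parameters). It uses derived `Show`/`Read` as a stand-in for a real binary codec, and all names are hypothetical.

```haskell
-- A stand-in "model" record; a real format would use a
-- binary codec and carry tensor payloads, not a Doubles list.
data SavedModel = SavedModel
  { modelVersion :: Int
  , modelName    :: String
  , weights      :: [Double]
  } deriving (Show, Read, Eq)

-- Serialization must be a lossless roundtrip:
-- loadModel . saveModel == id.
saveModel :: SavedModel -> String
saveModel = show

loadModel :: String -> SavedModel
loadModel = read
```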

Pillar 4: Distributed & Parallel Computing

Phase 1 (Q2-Q3 2026) - Core Integration

Owners: distributed-process + dataframe teams

Goals:

  • Enable distributed data processing
  • Provide MapReduce-style operations

Deliverables:

  1. Distributed DataFrame Operations
    • Distributed CSV/Parquet reading
    • Parallel groupby and aggregations
    • Distributed joins
    • Shuffle operations
    • Fault tolerance mechanisms
  2. distributed-ml (new library)
    • Distributed model training
    • Parameter servers
    • Data parallelism primitives
    • Model parallelism support
    • Integration with Hasktorch
  3. Examples & Patterns
    • Multi-node data processing
    • Distributed hyperparameter search
    • Large-scale model training
    • Stream processing patterns
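The MapReduce-style operations named in the goals follow a partition → map → reduce shape. The sketch below reduces chunks sequentially in one process; with distributed-process, each chunk would be handled by a remote `Process` and the partial results sent back to a coordinator, but the algebra is identical.

```haskell
import Data.List (foldl')

-- Partition -> map -> reduce. Each chunk is reduced to a
-- partial result, then the partials are combined. For the
-- result to be chunking-independent, `combine` should be
-- associative and `z` its identity (e.g. (+) and 0).
mapReduce :: (a -> b) -> (b -> b -> b) -> b -> [[a]] -> b
mapReduce f combine z chunks =
  foldl' combine z
    [ foldl' combine z (map f chunk) | chunk <- chunks ]
```

For example, `mapReduce (*2) (+) 0` over chunked data computes the doubled sum regardless of how the input was partitioned, which is the property that makes distributed shuffles and fault-tolerant retries safe.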

Phase 2 (Q4 2026 - Q1 2027) - Production Features

Owner: distributed-process team

Deliverables:

  1. Cluster Management
    • Node discovery and registration
    • Health monitoring
    • Resource allocation
    • Job scheduling
  2. Cloud Integration
    • AWS backend
    • Google Cloud backend
    • Kubernetes deployment patterns
    • Docker containerization templates

Pillar 5: Developer Experience

Phase 1 (Q1-Q2 2026) - Documentation Blitz

Owner: All maintainers + community

Goals:

  • Lower barrier to entry
  • Comprehensive learning path

Deliverables:

  1. DataHaskell Website Revamp
    • Modern design
    • Clear getting started guide
    • Library comparison matrix
    • Migration guides (from Python, R)
    • Success stories
  2. Tutorial Series
    • Installation and setup (all platforms)
    • Your first data analysis
    • DataFrames deep dive
    • Machine learning workflow
    • Distributed computing basics
    • Production deployment
  3. Notebook Gallery
    • 20+ example notebooks covering:
      • Data cleaning and exploration
      • Statistical analysis
      • ML model building
      • Visualization
      • Domain-specific examples (finance, biology, etc.)

Phase 2 (Q3-Q4 2026) - Tooling

Owner: Community

Deliverables:

  1. datahaskell-cli (new tool)
    • Project scaffolding
    • Dependency management presets
    • Environment setup automation
    • Example project templates
  2. IDE Support Improvements
    • VSCode extension enhancements
    • HLS integration guides
    • Debugging workflows
    • IHaskell kernel improvements
  3. Testing & CI Templates
    • Property-based testing examples
    • Benchmark suites
    • GitHub Actions templates
    • Continuous deployment patterns

Pillar 6: Community & Ecosystem

Ongoing Initiatives

Goals:

  • Grow contributor base
  • Foster collaboration
  • Drive adoption

Deliverables:

  1. Community Building
    • Monthly community calls (starting Q1 2026)
    • Discord/Slack workspace
    • Quarterly virtual conferences
    • Mentorship program
  2. Contribution Framework
    • Good first issues across all projects
    • Contribution guidelines
    • Code review standards
    • Recognition program
  3. Outreach
    • Blog post series
    • Conference talks (Haskell Symposium, ZuriHac, etc.)
    • Academic collaborations
    • Industry partnerships
  4. Package Standards
    • Naming conventions
    • API design guidelines
    • Documentation requirements
    • Testing standards
    • Version compatibility matrix

Integration Priority Matrix

Critical Integrations (Start Immediately)

  1. dataframe ↔ Hasktorch: Data → Training pipeline
  2. dataframe ↔ IHaskell: Interactive analysis
  3. dataframe ↔ statistics: Analysis workflow

High Priority (Q2-Q3 2026)

  1. dataframe ↔ distributed-process: Distributed operations
  2. Hasktorch ↔ distributed-process: Distributed training
  3. statistics ↔ visualization: Plot statistical results

Medium Priority (Q4 2026)

  1. All ↔ model deployment: Production pipeline
  2. All ↔ monitoring: Observability

Success Metrics

Q2 2026

  • dataframe v0.1 released with 500+ downloads/month
  • 3 complete end-to-end tutorials published
  • Performance benchmarks showing ≥70% of Pandas speed
  • 5 integration examples between major libraries

Q4 2026

  • 10,000+ total library downloads/month across ecosystem
  • 20+ companies using DataHaskell in production
  • 50+ active contributors
  • Performance parity (≥90%) with Pandas for common operations
  • Complete ML workflow from data to deployment documented

Q2 2027

  • 100+ companies using DataHaskell
  • DataHaskell track at major Haskell conference
  • 3+ published case studies
  • Comprehensive distributed computing examples

Q4 2027

  • Feature completeness with Python's core data science stack
  • 5+ production ML systems case studies
  • Enterprise support offerings available

Resource Requirements

Maintainer Coordination

  • Monthly sync: All pillar leads (1 hour)
  • Quarterly planning: Full maintainer group (2 hours)
  • Annual retreat: Strategic direction (virtual or in-person)

Funding Needs (Optional but Helpful)

  1. Infrastructure
    • Benchmark server (GPU-enabled)
    • CI/CD resources
    • Documentation hosting
  2. Developer Support
    • Part-time technical writer
    • Maintainer stipends (Haskell Foundation)
    • Summer of Haskell projects
  3. Events
    • Quarterly virtual meetups
    • Annual in-person hackathon
    • Conference sponsorships

Risk Mitigation

Technical Risks

| Risk | Mitigation |
|------|------------|
| Performance doesn't match Python | Early benchmarking, profiling, and optimization sprints |
| Integration complexity | Defined interfaces, versioning strategy, compatibility tests |
| Breaking changes in dependencies | Conservative version bounds, testing matrix |

Community Risks

| Risk | Mitigation |
|------|------------|
| Maintainer burnout | Distributed ownership, recognition program, funding support |
| Fragmentation | Regular coordination, shared roadmap, integration testing |
| Slow adoption | Marketing efforts, case studies, migration guides |

Ecosystem Risks

| Risk | Mitigation |
|------|------------|
| GHC changes break libraries | Test against multiple GHC versions, engage with GHC team |
| Competing projects | Focus on collaboration, clear differentiation |
| Limited contributor pool | Mentorship, good documentation, welcoming community |


Decision Framework

When to add new libraries

Criteria:

  1. Fills clear gap in ecosystem
  2. Has committed maintainer
  3. Integrates with existing components
  4. Follows API design guidelines
  5. Includes comprehensive tests and docs

When to deprecate/consolidate

Criteria:

  1. Unmaintained for >6 months
  2. Better alternative exists
  3. Low usage (<100 downloads/month)
  4. Creates confusion in ecosystem

Version Compatibility Policy

  • Support last 2 GHC versions
  • Follow the Haskell Package Versioning Policy (PVP)
  • Deprecation warnings for 2 releases before removal
  • Compatibility matrix published on website
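The two-release deprecation window maps directly onto GHC's `DEPRECATED` pragma, which warns downstream users at compile time while the old name keeps working. The function names below are illustrative.

```haskell
-- New name.
totalColumn :: [Double] -> Double
totalColumn = sum

-- Old name, kept (with a compile-time warning for users)
-- for two releases so downstream code can migrate before
-- the definition is removed.
{-# DEPRECATED sumColumn "Use totalColumn instead" #-}
sumColumn :: [Double] -> Double
sumColumn = totalColumn
```

Importing modules that use `sumColumn` will see the warning message, making migrations visible without breaking builds.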

Communication Plan

Internal (Maintainers)

  • Slack/Discord channel: Daily async communication
  • GitHub Discussions: Technical decisions, RFCs
  • Monthly video call: Roadmap progress, blockers
  • Quarterly planning session: Next phase priorities

External (Community)

  • Blog: Monthly progress updates
  • Twitter/Social: Weekly highlights
  • Haskell Discourse: Major announcements
  • Newsletter: Quarterly ecosystem update
  • Documentation: Always up-to-date

Near-Term Action Items (Next 30 Days)

For dataframe maintainer (mchav)

  1. Finalize v0.1 release checklist
  2. Write Parquet support specification
  3. Create 3 dataframe ↔ Hasktorch examples
  4. Set up benchmark infrastructure

For Hasktorch team

  1. Test dataframe integration patterns
  2. Document tensor conversion APIs
  3. Create example pipeline notebook
  4. Identify distributed training requirements

For distributed-process team

  1. Prototype distributed dataframe operations
  2. Document deployment patterns
  3. Create cluster setup guide
  4. Design fault-tolerance strategy

For community coordinator

  1. Set up monthly call schedule
  2. Create Discord/Slack workspace
  3. Draft website redesign plan
  4. Reach out to potential contributors

For all

  1. Review and comment on this roadmap
  2. Identify personal capacity for next 6 months
  3. Claim ownership of specific deliverables
  4. Share roadmap with broader community

Appendix A: Existing Haskell Projects

  • Frames: Alternative dataframe (potential collaboration/consolidation?)
  • hmatrix: Linear algebra (ensure compatibility)
  • statistics: Statistical computing (modernization candidate)
  • Chart/hvega: Visualization (integration targets)
  • postgresql-simple: Database connectivity
  • accelerate: Array processing with GPU support

External Integration Targets

  • Apache Arrow: Zero-copy data interchange
  • DuckDB: Embedded analytical database
  • ONNX: Model interchange format
  • MLflow: ML lifecycle management

Appendix B: Glossary

Critical Path: dataframe → statistics → ML toolkit → distributed operations
Integration Points: Where libraries share data structures or APIs
Zero-Copy: Data sharing without duplication in memory
Type-Safe: Compile-time guarantees about data structure and operations


Appendix C: Version History

| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | Nov 2026 | Initial comprehensive roadmap | DataHaskell coordinators |

How to Use This Roadmap

This is a living document. We will:

  • Review quarterly and adjust priorities
  • Track progress in GitHub projects
  • Celebrate milestones publicly
  • Adapt based on community feedback

Contributing: See [CONTRIBUTING.md] for how to propose changes to this roadmap.

Questions? Open a discussion on GitHub or join our community calls.


Let's build the future of data science in Haskell together! 🚀