DataHaskell Roadmap 2026-2027
Version: 1.0
Date: November 2026
Coordinators: DataHaskell Community
Key Partners: dataframe, Hasktorch, distributed-process
Executive Summary
This roadmap outlines the strategic direction for building a complete, production-ready Haskell data science ecosystem. With three foundational libraries already in active development (dataframe for data manipulation, Hasktorch for deep learning, and distributed-process for distributed computing), we are positioned to create a cohesive platform that rivals Python, R, and Julia for data science workloads.
Vision
By 2027, DataHaskell will provide a high-performance, end-to-end data science toolkit that enables practitioners to build reliable machine learning systems from data ingestion through model deployment.
Core Principles
- Interoperability: Seamless integration between ecosystem components
- Performance: Match or exceed Python/R performance benchmarks
- Ergonomics: Intuitive APIs that lower the barrier to entry
- Production Ready: Focus on reliability, monitoring, and deployment
- Type Safety: Leverage Haskell’s type system (where possible) to catch errors at compile time
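As a minimal illustration of the last principle (both types are invented for this example), two columns that share a runtime representation but not a meaning cannot be mixed up:

```haskell
-- Same representation, distinct types: the compiler keeps them apart.
newtype Price    = Price Double    deriving Show
newtype Quantity = Quantity Double deriving Show

lineTotal :: Price -> Quantity -> Price
lineTotal (Price p) (Quantity q) = Price (p * q)

-- lineTotal (Price 9.99) (Price 3) is rejected at compile time:
-- the second argument must be a Quantity, not a Price.
```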
Current State Assessment
Strengths
- dataframe: Modern dataframe library with IHaskell integration
- Hasktorch: Mature deep learning library with PyTorch backend and GPU support
- distributed-process: Battle-tested distributed computing framework
- IHaskell: Haskell kernel for Jupyter notebooks
- Strong functional programming foundations
- Excellent parallelism and concurrency primitives
Gaps to Address
- Too few active maintainers and contributors
- Fragmented visualization ecosystem
- Limited data I/O format support
- Incomplete documentation and tutorials
- Sparse integration examples between major libraries
- Limited model deployment tooling
Critical Needs
- Unified onboarding experience
- Comprehensive benchmarking against Python/R
- Production deployment patterns
- Enterprise adoption case studies
Strategic Pillars
Pillar 1: Core Data Infrastructure
Phase 1 (Q1-Q2 2026) - Foundation
Owner: dataframe team
Goals:
- Complete dataframe v1 release (March 2026)
- Establish dataframe as the standard tabular data library
- Performance parity with Pandas/Polars for common operations
Deliverables:
- dataframe v1.0.0
  - SQL-like API finalized (stand-in pipeline sketched after this list)
  - IHaskell integration complete
  - Type-safe column operations
  - Comprehensive test suite
  - Apache Arrow integration
- File Format Support
  - CSV/TSV (existing)
  - Parquet (high priority)
  - Arrow IPC format
  - Excel (xlsx)
  - JSON (nested structures)
  - HDF5 (coordination with scientific computing)
- Performance Benchmarks
  - Public benchmark suite comparing to:
    - Pandas
    - Polars
    - dplyr/tidyverse
  - Focus areas: filtering, grouping, joining, aggregations
  - Document optimization strategies
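The exact SQL-like surface is still being finalized, so the sketch below uses plain lists and containers as a neutral stand-in: it expresses the SELECT/WHERE/GROUP BY/ORDER BY pipeline the v1 API is meant to capture over columns. All names here are illustrative, not the shipped dataframe API.

```haskell
import qualified Data.Map.Strict as M
import Data.List (sortOn)
import Data.Ord (Down (..))

-- A row type standing in for a dataframe row.
data Sale = Sale { region :: String, item :: String, revenue :: Double }

-- SELECT item, SUM(revenue) FROM sales WHERE region = 'EMEA'
-- GROUP BY item ORDER BY SUM(revenue) DESC LIMIT 10
topProducts :: [Sale] -> [(String, Double)]
topProducts =
    take 10
  . sortOn (Down . snd)
  . M.toList
  . M.fromListWith (+)
  . map (\s -> (item s, revenue s))
  . filter ((== "EMEA") . region)
```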
Phase 2 (Q3-Q4 2026) - Expansion
Owner: dataframe + community
Goals:
- Advanced data manipulation features
- Computing on files larger than memory
- Integration with cloud database systems
Deliverables:
- Advanced Operations
  - Window functions
  - Rolling aggregations (semantics sketched after this list)
  - Pivot/unpivot operations
  - Complex joins (anti, semi)
  - Reshaping operations (melt, cast)
- Cloud Database Connectivity
  - Read files from AWS/GCP/Azure
  - PostgreSQL integration
  - SQLite support
  - Query pushdown optimization
  - Streaming query results
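To pin down what the rolling-aggregation deliverable means, here is a minimal sketch of one reasonable semantics (one output per full window) over a plain list; the eventual dataframe version would operate on columns and support more aggregators:

```haskell
import Data.List (tails)

-- Rolling mean over windows of size w; emits one value per full window.
rollingMean :: Int -> [Double] -> [Double]
rollingMean w xs =
  [ sum win / fromIntegral w
  | win <- map (take w) (tails xs)
  , length win == w
  ]

-- rollingMean 3 [1,2,3,4,5] == [2.0,3.0,4.0]
```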
Pillar 2: Statistical Computing & Visualization
Phase 1 (Q2-Q3 2026) - Statistics Core
Owner: Community (needs maintainer)
Goals:
- Create a unified machine learning library on top of Hasktorch and statistics
- Create unified plotting API
Deliverables:
- statistics
  - Extend hypothesis testing (t-test, ANOVA)
  - Simple regression models (linear and logistic; OLS sketch after this list)
  - Generalized linear models (GLM)
  - Survival analysis basics
  - Integration with dataframe
- Plotting & Visualization
  - Option A: Extend hvega (Vega-Lite) with dataframe integration (example after this list)
  - Option B: Create native plotting library with backends
  - Priority features:
    - Scatter plots, line plots, bar charts
    - Histograms and distributions
    - Heatmaps and correlation plots
    - Interactive plots for notebooks
    - Export to PNG, SVG, PDF
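For the simple-regression item: single-predictor ordinary least squares has a closed form, shown here in self-contained Haskell (the statistics package already provides OLS helpers for the multivariate case, so this is only a worked illustration of the formula):

```haskell
-- Fit y = a + b*x by ordinary least squares.
fitLine :: [Double] -> [Double] -> (Double, Double)
fitLine xs ys = (a, b)
  where
    n  = fromIntegral (length xs)
    mx = sum xs / n
    my = sum ys / n
    b  = sum [ (x - mx) * (y - my) | (x, y) <- zip xs ys ]
       / sum [ (x - mx) ^ (2 :: Int) | x <- xs ]
    a  = my - b * mx
```

And for Option A: hvega already renders Vega-Lite from Haskell today, so a dataframe integration would mainly generate the dataColumn calls. A minimal scatter plot with the current hvega API:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Graphics.Vega.VegaLite

-- A scatter plot spec; toHtmlFile writes a self-contained HTML page.
scatter :: VegaLite
scatter =
  let dat = dataFromColumns []
              . dataColumn "x" (Numbers [1, 2, 3, 4])
              . dataColumn "y" (Numbers [2.3, 4.1, 1.7, 3.3])
      enc = encoding
              . position X [PName "x", PmType Quantitative]
              . position Y [PName "y", PmType Quantitative]
   in toVegaLite [dat [], mark Circle [], enc []]

main :: IO ()
main = toHtmlFile "scatter.html" scatter
```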
Phase 2 (Q4 2026 - Q1 2027) - Advanced Analytics
Owner: Community
Deliverables:
- Advanced Statistical Methods
  - Mixed effects models
  - Time series analysis (ARIMA, state space models)
  - Bayesian inference (integration with existing libraries)
  - Causal inference methods
  - Spatial statistics
- Visualization Expansion
  - Grammar of graphics implementation
  - Geographic/mapping support
  - Network visualization
  - 3D plotting capabilities
Pillar 3: Machine Learning & Deep Learning
Phase 1 (Q1-Q2 2026) - Integration
Owners: Hasktorch + dataframe teams
Goals:
- Improve dataframe → tensor pipeline
- Example-driven documentation
Deliverables:
- dataframe ↔ Hasktorch Bridge (conversion helper sketched after this list)
  - Zero-copy conversion where possible
  - Automatic type mapping
  - GPU memory management
  - Batch loading utilities
- ML Workflow Examples with the new unified library
  - End-to-end classification (Iris, MNIST)
  - Regression examples (California Housing)
  - Time series forecasting
  - NLP pipeline (text classification)
  - Computer vision (image classification)
- Data Preprocessing
  - Feature scaling/normalization
  - One-hot encoding
  - Missing value imputation
  - Train/test splitting
  - Cross-validation utilities
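A plausible shape for the bridge and preprocessing utilities, using Hasktorch's asTensor (which accepts nested lists) plus two helpers that are hypothetical glue rather than any published API:

```haskell
import Torch (Tensor, asTensor)

-- Standardize a numeric column to zero mean and unit variance.
standardize :: [Double] -> [Double]
standardize xs = map (\x -> (x - mu) / sigma) xs
  where
    n     = fromIntegral (length xs)
    mu    = sum xs / n
    sigma = sqrt (sum [ (x - mu) ^ (2 :: Int) | x <- xs ] / n)

-- Deterministic split; real utilities would shuffle first.
trainTestSplit :: Double -> [a] -> ([a], [a])
trainTestSplit ratio rows =
  splitAt (floor (ratio * fromIntegral (length rows))) rows

-- Rows of Doubles to a 2-D float tensor.
toTensor :: [[Double]] -> Tensor
toTensor rows = asTensor (map (map realToFrac) rows :: [[Float]])
```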
Phase 2 (Q3-Q4 2026) - Classical ML
Owner: Community (coordinate with Hasktorch)
Goals:
- Fill gap between dataframe and deep learning
- Provide scikit-learn equivalent
Deliverables:
- haskell-ml-toolkit (new library)
  - Decision trees and random forests
  - Gradient boosting (XGBoost integration or native)
  - Support Vector Machines
  - K-means and hierarchical clustering (naive k-means sketched after this list)
  - Dimensionality reduction (PCA, t-SNE, UMAP)
  - Model evaluation metrics
  - Hyperparameter optimization
- Feature Engineering
  - Automatic feature generation
  - Feature selection methods
  - Polynomial features
  - Text feature extraction
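Since haskell-ml-toolkit does not exist yet, a deliberately naive k-means makes the clustering scope concrete; the library would wrap this kind of core with initialization strategies, mini-batching, and a polished API:

```haskell
import Data.List (minimumBy, transpose)
import Data.Ord (comparing)

type Point = [Double]

sqDist :: Point -> Point -> Double
sqDist a b = sum [ (x - y) ^ (2 :: Int) | (x, y) <- zip a b ]

centroid :: [Point] -> Point
centroid ps = map (\xs -> sum xs / fromIntegral (length ps)) (transpose ps)

-- Assign every point to its nearest centroid, then recompute centroids.
step :: [Point] -> [Point] -> [Point]
step ps cs = map centroid (filter (not . null) clusters)
  where
    clusters  = [ [ p | p <- ps, nearest p == c ] | c <- cs ]
    nearest p = minimumBy (comparing (sqDist p)) cs

-- Iterate to a fixed point or until the iteration budget runs out.
kmeans :: Int -> [Point] -> [Point] -> [Point]
kmeans 0 _  cs = cs
kmeans n ps cs
  | cs' == cs = cs
  | otherwise = kmeans (n - 1) ps cs'
  where cs' = step ps cs

-- Example: kmeans 100 points (take 3 points) yields 3 cluster centers.
```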
Phase 3 (Q1-Q2 2027) - Model Management
Owners: Hasktorch + community
Deliverables:
- Model Serialization & Versioning (metadata sketch after this list)
  - Standard model format
  - Version tracking
  - Metadata storage
  - Model registry concept
- Model Deployment
  - REST API server templates
  - Batch prediction utilities
  - Model monitoring hooks
  - ONNX export for interoperability
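No standard format has been chosen yet, so this is only one possible shape for the metadata piece: a hypothetical record (field set invented for illustration) round-tripped through JSON with aeson:

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson (FromJSON, ToJSON, decodeFileStrict, encodeFile)
import GHC.Generics (Generic)

-- Hypothetical metadata stored alongside a serialized model.
data ModelMeta = ModelMeta
  { modelName    :: String
  , modelVersion :: String
  , trainedOn    :: String             -- dataset identifier
  , metrics      :: [(String, Double)] -- e.g. [("accuracy", 0.93)]
  } deriving (Show, Generic)

instance ToJSON ModelMeta
instance FromJSON ModelMeta

saveMeta :: FilePath -> ModelMeta -> IO ()
saveMeta = encodeFile

loadMeta :: FilePath -> IO (Maybe ModelMeta)
loadMeta = decodeFileStrict
```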
Pillar 4: Distributed & Parallel Computing
Phase 1 (Q2-Q3 2026) - Core Integration
Owners: distributed-process + dataframe teams
Goals:
- Enable distributed data processing
- Provide MapReduce-style operations
Deliverables:
- Distributed DataFrame Operations
  - Distributed CSV/Parquet reading
  - Parallel groupby and aggregations (pattern sketched after this list)
  - Distributed joins
  - Shuffle operations
  - Fault tolerance mechanisms
- distributed-ml (new library)
  - Distributed model training
  - Parameter servers
  - Data parallelism primitives
  - Model parallelism support
  - Integration with Hasktorch
- Examples & Patterns
  - Multi-node data processing
  - Distributed hyperparameter search
  - Large-scale model training
  - Stream processing patterns
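Full distributed-process wiring is too long for this document, but the partition-then-merge pattern behind a distributed groupby can be shown locally with the parallel package; a real implementation would evaluate each partition on a remote node rather than in a local spark:

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)
import qualified Data.Map.Strict as M

-- Each inner list is one partition of (key, value) pairs.
-- Aggregate partitions in parallel, then merge the partial maps,
-- mirroring the map and combine phases of a distributed groupby-sum.
groupBySum :: [[(String, Double)]] -> M.Map String Double
groupBySum parts =
  M.unionsWith (+) (parMap rdeepseq (M.fromListWith (+)) parts)
```

Compile with -threaded and run with +RTS -N to use multiple cores.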
Phase 2 (Q4 2026 - Q1 2027) - Production Features
Owner: distributed-process team
Deliverables:
- Cluster Management
  - Node discovery and registration
  - Health monitoring
  - Resource allocation
  - Job scheduling
- Cloud Integration
  - AWS backend
  - Google Cloud backend
  - Kubernetes deployment patterns
  - Docker containerization templates
Pillar 5: Developer Experience
Phase 1 (Q1-Q2 2026) - Documentation Blitz
Owner: All maintainers + community
Goals:
- Lower barrier to entry
- Comprehensive learning path
Deliverables:
- DataHaskell Website Revamp
  - Clear getting started guide
  - Library comparison matrix
  - Migration guides (from Python, R)
  - Success stories
- Tutorial Series
  - Installation and setup (all platforms)
  - Your first data analysis
  - DataFrames deep dive
  - Machine learning workflow
  - Distributed computing basics
  - Production deployment
- Notebook Gallery
  - 20+ example notebooks covering:
    - Data cleaning and exploration
    - Statistical analysis
    - ML model building
    - Visualization
    - Domain-specific examples (finance, biology, etc.)
Phase 2 (Q3-Q4 2026) - Tooling
Owner: Community
Deliverables:
- datahaskell-cli (new tool; command-surface sketch after this list)
  - Project scaffolding
  - Dependency management presets
  - Environment setup automation
  - Example project templates
- IDE Support Improvements
  - VSCode support for IHaskell, with the DataHaskell stack working out of the box
  - HLS integration guides
  - Debugging workflows
  - IHaskell kernel improvements
- Testing & CI Templates
  - Property-based testing examples (sample properties after this list)
  - Benchmark suites
  - GitHub Actions templates
  - Continuous deployment patterns
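The CLI does not exist yet; its command surface might look like the following optparse-applicative skeleton (command names are invented for illustration):

```haskell
import Options.Applicative

-- Hypothetical commands for the proposed datahaskell-cli.
data Command
  = New String  -- scaffold a project with the given name
  | Setup       -- set up the local toolchain
  deriving Show

commandP :: Parser Command
commandP = subparser
  (  command "new"
       (info (New <$> strArgument (metavar "NAME"))
             (progDesc "Scaffold a new DataHaskell project"))
  <> command "setup"
       (info (pure Setup)
             (progDesc "Install and configure the toolchain")) )

main :: IO ()
main = execParser (info (commandP <**> helper) fullDesc) >>= print
```

And property-based tests of dataframe invariants could start from properties like these (lists stand in for columns until the v1 API settles):

```haskell
import Test.QuickCheck

-- Filtering never grows a column.
prop_filterShrinks :: [Int] -> Bool
prop_filterShrinks xs = length (filter even xs) <= length xs

-- Filtering twice with the same predicate is idempotent.
prop_filterIdempotent :: [Int] -> Bool
prop_filterIdempotent xs =
  filter even (filter even xs) == filter even xs

main :: IO ()
main = mapM_ quickCheck [prop_filterShrinks, prop_filterIdempotent]
```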
Pillar 6: Community & Ecosystem
Ongoing Initiatives
Goals:
- Grow contributor base
- Foster collaboration
- Drive adoption
Deliverables:
- Community Building
  - Monthly community calls (starting Q1 2026)
  - Discord/Slack workspace
  - Quarterly virtual conferences
  - Mentorship program
- Contribution Framework
  - Good first issues across all projects
  - Contribution guidelines
  - Code review standards
  - Recognition program
- Outreach
  - Blog post series
  - Conference talks (Haskell Symposium, ZuriHac, etc.)
  - Academic collaborations
  - Industry partnerships
- Package Standards
  - Naming conventions
  - API design guidelines
  - Documentation requirements
  - Testing standards
  - Version compatibility matrix
Success Metrics
Q2 2026
- dataframe v1 released
- 3 complete end-to-end tutorials published
- Performance benchmarks showing ≥70% of Pandas speed
- 5 integration examples between major libraries
Q4 2026
- 10,000+ total library downloads/month across the ecosystem
- 5+ active contributors
- Performance parity (≥90%) with Pandas for common operations
- Complete ML workflow from data to deployment documented
Q2 2027
- 2+ companies using DataHaskell
- DataHaskell track at major Haskell conference
- 3+ published case studies
- Comprehensive distributed computing examples
Q4 2027
- Feature completeness with Python’s core data science stack
- 5+ production ML systems case studies
- Enterprise support offerings available
Resource Requirements
Maintainer Coordination
- Monthly sync: All pillar leads (1 hour)
- Quarterly planning: Full maintainer group (2 hours)
Funding Needs (Optional but Helpful)
- Infrastructure
  - Benchmark server (GPU-enabled)
  - CI/CD resources
  - Documentation hosting
- Developer Support
  - Part-time technical writer
  - Maintainer stipends or grants
  - Summer of Haskell projects
- Events
  - Quarterly virtual meetups
  - Annual in-person hackathon
  - Conference sponsorships
Risk Mitigation
Technical Risks
| Risk | Mitigation |
|------|------------|
| Performance doesn’t match Python | Early benchmarking, profiling, and optimization sprints |
| Integration complexity | Defined interfaces, versioning strategy, compatibility tests |
| Breaking changes in dependencies | Conservative version bounds, testing matrix |
Community Risks
| Risk | Mitigation |
|------|------------|
| Maintainer burnout | Distributed ownership, recognition program, funding support |
| Fragmentation | Regular coordination, shared roadmap, integration testing |
| Slow adoption | Marketing efforts, case studies, migration guides |
Ecosystem Risks
| Risk | Mitigation |
|------|------------|
| GHC changes break libraries | Test against multiple GHC versions, engage with GHC team |
| Competing projects | Focus on collaboration, clear differentiation |
| Limited contributor pool | Mentorship, good documentation, welcoming community |
Decision Framework
When to add new libraries
Criteria:
- Fills clear gap in ecosystem
- Has committed maintainer
- Integrates with existing components
- Follows API design guidelines
- Includes comprehensive tests and docs
When to deprecate/consolidate
Criteria:
- Unmaintained for >6 months
- Better alternative exists
- Creates confusion in ecosystem
Version Compatibility Policy
- Support last 2 major GHC versions
- Versioning per the Haskell Package Versioning Policy (PVP)
- Deprecation warnings for 2 releases before removal
- Compatibility matrix published on website
Communication Plan
Internal (Maintainers)
- Discord channel: Daily async communication
- GitHub Discussions: Technical decisions, RFCs
- Monthly video call: Roadmap progress, blockers
- Quarterly planning session: Next phase priorities
External (Community)
- Blog: Monthly progress updates
- Twitter/Social: Weekly highlights
- Haskell Discourse: Major announcements
- Newsletter: Quarterly ecosystem update
- Documentation: Always up-to-date
How to Use This Roadmap
This is a living document. We will:
- Review quarterly and adjust priorities
- Track progress in GitHub projects
- Celebrate milestones publicly
- Adapt based on community feedback
Questions? Open a discussion on GitHub or join our community calls.
Let’s build the future of data science in Haskell together! 🚀