DataHaskell Roadmap 2026-2027
Version: 1.0
Date: November 2025
Coordinators: DataHaskell Community
Key Partners: dataframe, Hasktorch, distributed-process
Executive Summary
This roadmap outlines the strategic direction for building a complete, production-ready Haskell data science ecosystem. With three major pillars already in active development (dataframe for data manipulation, Hasktorch for deep learning, and distributed-process for distributed computing), we are positioned to create a cohesive platform that rivals Python, R, and Julia for data science workloads.
Vision
By 2027, DataHaskell will provide a high-performance, end-to-end data science toolkit that enables practitioners to build reliable machine learning systems from data ingestion through model deployment.
Core Principles
- Type Safety First: Leverage Haskell's type system to catch errors at compile time
- Interoperability: Seamless integration between ecosystem components
- Performance: Match or exceed Python/R performance benchmarks
- Ergonomics: Intuitive APIs that lower the barrier to entry
- Production Ready: Focus on reliability, monitoring, and deployment
 
Current State Assessment
🟢 Strengths
- dataframe (v0.1 launching March 5, 2026): Modern, type-safe dataframe library with IHaskell integration
- Hasktorch: Mature deep learning library with PyTorch backend and GPU support
- distributed-process: Battle-tested distributed computing framework
- Strong functional programming foundations
- Excellent parallelism and concurrency primitives
 
🟡 Gaps to Address
- Fragmented visualization ecosystem
- Limited data I/O format support
- Incomplete documentation and tutorials
- Sparse integration examples between major libraries
- Limited model deployment tooling
 
🔴 Critical Needs
- Unified onboarding experience
- Comprehensive benchmarking against Python/R
- Production deployment patterns
- Enterprise adoption case studies
 
Strategic Pillars
Pillar 1: Core Data Infrastructure
Phase 1 (Q1-Q2 2026) - Foundation
Owner: dataframe team
Goals:
- ✅ Complete dataframe v0.1 release (March 2026)
- Establish dataframe as the standard tabular data library
- Performance parity with Pandas/Polars for common operations
 
Deliverables:
- dataframe v0.1.0
  - SQL-like API finalized
  - IHaskell integration complete
  - Type-safe column operations
  - Comprehensive test suite
  - Apache Arrow integration
- File Format Support
  - CSV/TSV (existing)
  - Parquet (high priority)
  - Arrow IPC format
  - Excel (xlsx)
  - JSON (nested structures)
  - HDF5 (coordination with scientific computing)
- Performance Benchmarks (see the harness sketch below)
  - Public benchmark suite comparing to Pandas, Polars, and dplyr/tidyverse
  - Focus areas: filtering, grouping, joining, aggregations
  - Document optimization strategies
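To seed the public benchmark suite, a minimal harness using the criterion library could look like the following sketch. The grouped-sum workload is a stand-in written against Data.Map, not dataframe's actual API; real benchmarks would invoke dataframe operations and mirror equivalent Pandas/Polars scripts.

```haskell
-- Benchmark harness sketch using criterion.
-- The workload below is a stand-in for a dataframe groupby-aggregate,
-- not dataframe's actual API.
import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.Map.Strict as Map

-- Stand-in groupby-aggregate: sum values per key.
groupSum :: [(Int, Double)] -> Map.Map Int Double
groupSum = Map.fromListWith (+)

rows :: [(Int, Double)]
rows = [ (i `mod` 100, fromIntegral i) | i <- [1 .. 100000 :: Int] ]

main :: IO ()
main = defaultMain
  [ bgroup "groupby"
      [ bench "sum-per-key/100k-rows" (nf groupSum rows) ]
  ]
```

Running the equivalent script under Pandas and Polars on the same machine yields the comparison numbers this deliverable calls for.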
 
Phase 2 (Q3-Q4 2026) - Expansion
Owner: dataframe + community
Goals:
- Advanced data manipulation features
- Integration with database systems
- Time series support
 
Deliverables:
- Advanced Operations
  - Window functions
  - Rolling aggregations (see the sketch below)
  - Pivot/unpivot operations
  - Complex joins (anti, semi)
  - Reshaping operations (melt, cast)
- Database Connectivity
  - PostgreSQL integration
  - SQLite support
  - Query pushdown optimization
  - Streaming query results
- Time Series Extensions
  - Date/time indexing
  - Resampling operations
  - Time-based rolling windows
  - Timezone handling
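To make the rolling-aggregation deliverable concrete, here is a dependency-free sketch of a fixed-width rolling mean. A production version would operate on columnar vectors and maintain an O(n) sliding sum instead of re-summing each window.

```haskell
-- Dependency-free sketch of a fixed-width rolling mean over a list.
-- A production version would use columnar vectors and a sliding sum.
rollingMean :: Int -> [Double] -> [Double]
rollingMean w xs
  | w <= 0    = []
  | otherwise = go xs
  where
    go ys
      | length ys < w = []
      | otherwise     = sum (take w ys) / fromIntegral w : go (drop 1 ys)

-- ghci> rollingMean 3 [1, 2, 3, 4, 5]
-- [2.0, 3.0, 4.0]
```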
 
 
Pillar 2: Statistical Computing & Visualization
Phase 1 (Q2-Q3 2026) - Statistics Core
Owner: Community (needs maintainer)
Goals:
- Establish comprehensive statistics library
- Create unified plotting API
 
Deliverables:
- statistics-next (modernize existing library)
  - Descriptive statistics
  - Hypothesis testing (t-test, ANOVA, chi-square)
  - Linear regression
  - Generalized linear models (GLM)
  - Survival analysis basics
  - Integration with dataframe
- Plotting & Visualization (see the hvega sketch below)
  - Option A: Extend hvega (Vega-Lite) with dataframe integration
  - Option B: Create native plotting library with backends
  - Priority features:
    - Scatter plots, line plots, bar charts
    - Histograms and distributions
    - Heatmaps and correlation plots
    - Interactive plots for notebooks
    - Export to PNG, SVG, PDF
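For Option A, today's hvega already covers the priority chart types. The sketch below renders a scatter plot to a standalone HTML file; the hand-written columns are exactly the part a dataframe integration would supply automatically.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Scatter plot via hvega (Vega-Lite). The inline columns stand in for
-- data that a dataframe integration would provide.
import Graphics.Vega.VegaLite

scatter :: VegaLite
scatter =
  let dat = dataFromColumns []
              . dataColumn "x" (Numbers [1, 2, 3, 4])
              . dataColumn "y" (Numbers [2.3, 4.1, 1.7, 5.0])
      enc = encoding
              . position X [ PName "x", PmType Quantitative ]
              . position Y [ PName "y", PmType Quantitative ]
  in toVegaLite [ dat [], enc [], mark Circle [] ]

main :: IO ()
main = toHtmlFile "scatter.html" scatter
```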
 
 
 
Phase 2 (Q4 2026 - Q1 2027) - Advanced Analytics
Owner: Community
Deliverables:
- Advanced Statistical Methods
  - Mixed effects models
  - Time series analysis (ARIMA, state space models)
  - Bayesian inference (integration with existing libraries)
  - Causal inference methods
  - Spatial statistics
- Visualization Expansion
  - Grammar of graphics implementation
  - Geographic/mapping support
  - Network visualization
  - 3D plotting capabilities
 
 
Pillar 3: Machine Learning & Deep Learning
Phase 1 (Q1-Q2 2026) - Integration
Owners: Hasktorch + dataframe teams
Goals:
- Seamless dataframe → tensor pipeline
- Example-driven documentation
 
Deliverables:
- dataframe → Hasktorch Bridge (see the conversion sketch below)
  - Zero-copy conversion where possible
  - Automatic type mapping
  - GPU memory management
  - Batch loading utilities
- ML Workflow Examples
  - End-to-end classification (Iris, MNIST)
  - Regression examples (California Housing)
  - Time series forecasting
  - NLP pipeline (text classification)
  - Computer vision (image classification)
- Data Preprocessing
  - Feature scaling/normalization
  - One-hot encoding
  - Missing value imputation
  - Train/test splitting
  - Cross-validation utilities
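A first version of the bridge can piggyback on Hasktorch's existing list conversions. The sketch below uses Hasktorch's asTensor; the step that pulls rows out of a dataframe is hypothetical here, and the zero-copy path over Arrow buffers would eventually replace it.

```haskell
-- Sketch: rows-as-lists to a Hasktorch tensor via asTensor.
-- Extracting `features` from a dataframe is the hypothetical step;
-- a zero-copy Arrow-backed path would avoid this intermediate list.
import Torch (Tensor, asTensor)

features :: [[Float]]   -- e.g. feature rows pulled from a dataframe
features = [ [5.1, 3.5]
           , [4.9, 3.0]
           , [6.2, 3.4] ]

featureTensor :: Tensor -- shape [3, 2]
featureTensor = asTensor features

main :: IO ()
main = print featureTensor
```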
 
 
Phase 2 (Q3-Q4 2026) - Classical ML
Owner: Community (coordinate with Hasktorch)
Goals:
- Fill gap between dataframe and deep learning
- Provide scikit-learn equivalent
 
Deliverables:
- haskell-ml-toolkit (new library)
  - Decision trees and random forests
  - Gradient boosting (XGBoost integration or native)
  - Support Vector Machines
  - K-means and hierarchical clustering
  - Dimensionality reduction (PCA, t-SNE, UMAP)
  - Model evaluation metrics
  - Hyperparameter optimization
- Feature Engineering (see the scaling sketch below)
  - Automatic feature generation
  - Feature selection methods
  - Polynomial features
  - Text feature extraction
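Preprocessing utilities like these can start life as small pure functions. Below is a dependency-free sketch of z-score scaling and a deterministic train/test split; real versions would be column-oriented, guard against zero variance, and shuffle with a fixed seed.

```haskell
-- Dependency-free sketches of two preprocessing utilities.
-- Real versions would operate on dataframe columns, handle the
-- zero-variance case, and shuffle rows with a seeded generator.

-- | Z-score standardization: subtract the mean, divide by the std deviation.
standardize :: [Double] -> [Double]
standardize xs = [ (x - mu) / sigma | x <- xs ]
  where
    n     = fromIntegral (length xs)
    mu    = sum xs / n
    sigma = sqrt (sum [ (x - mu) ^ (2 :: Int) | x <- xs ] / n)

-- | Deterministic split: the first (fraction * n) rows train, the rest test.
trainTestSplit :: Double -> [a] -> ([a], [a])
trainTestSplit fraction rows = splitAt cut rows
  where cut = floor (fraction * fromIntegral (length rows))

-- ghci> trainTestSplit 0.8 [1 .. 10 :: Int]
-- ([1,2,3,4,5,6,7,8],[9,10])
```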
 
 
Phase 3 (Q1-Q2 2027) - Model Management
Owners: Hasktorch + community
Deliverables:
- Model Serialization & Versioning
  - Standard model format
  - Version tracking
  - Metadata storage
  - Model registry concept
- Model Deployment (see the server sketch below)
  - REST API server templates
  - Batch prediction utilities
  - Model monitoring hooks
  - ONNX export for interoperability
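A REST serving template can stay very small in Haskell. The sketch below uses scotty with a placeholder model; a real template would deserialize a trained Hasktorch model instead of the stub predict, and the port and route are arbitrary choices.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch of a REST prediction server using scotty.
-- `predict` is a stub; a real template would load a serialized model.
import Web.Scotty (json, jsonData, post, scotty)

predict :: [Double] -> Double
predict = sum   -- placeholder "model"

main :: IO ()
main = scotty 8080 $
  post "/predict" $ do
    features <- jsonData      -- expects a JSON array of numbers
    json (predict features)
```

POSTing `[1, 2, 3]` to `/predict` would return `6` from the stub.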
 
 
Pillar 4: Distributed & Parallel Computing
Phase 1 (Q2-Q3 2026) - Core Integration
Owners: distributed-process + dataframe teams
Goals:
- Enable distributed data processing
- Provide MapReduce-style operations
 
Deliverables:
- Distributed DataFrame Operations
  - Distributed CSV/Parquet reading
  - Parallel groupby and aggregations
  - Distributed joins
  - Shuffle operations
  - Fault tolerance mechanisms
- distributed-ml (new library)
  - Distributed model training
  - Parameter servers
  - Data parallelism primitives
  - Model parallelism support
  - Integration with Hasktorch
- Examples & Patterns (see the sketch below)
  - Multi-node data processing
  - Distributed hyperparameter search
  - Large-scale model training
  - Stream processing patterns
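The shape of these patterns can be prototyped on a single node with plain parallelism before any cluster work lands. In the sketch below, each chunk is reduced independently and the partial results are merged; a distributed version would ship the same per-chunk function to remote workers as a distributed-process closure.

```haskell
-- Single-node analogue of a distributed partial aggregation.
-- A distributed version would run `partialSum` on remote nodes.
import Control.Parallel.Strategies (parMap, rdeepseq)

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

partialSum :: [Double] -> Double     -- the "map" step, one chunk per worker
partialSum = sum

parallelSum :: [Double] -> Double    -- the "reduce" step merges partials
parallelSum = sum . parMap rdeepseq partialSum . chunksOf 10000

main :: IO ()
main = print (parallelSum [1 .. 1000000])
```

Compile with -threaded and run with +RTS -N to use multiple cores.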
 
 
Phase 2 (Q4 2026 - Q1 2027) - Production Features
Owner: distributed-process team
Deliverables:
- Cluster Management
  - Node discovery and registration
  - Health monitoring
  - Resource allocation
  - Job scheduling
- Cloud Integration
  - AWS backend
  - Google Cloud backend
  - Kubernetes deployment patterns
  - Docker containerization templates
 
 
Pillar 5: Developer Experience
Phase 1 (Q1-Q2 2026) - Documentation Blitz
Owner: All maintainers + community
Goals:
- Lower barrier to entry
- Comprehensive learning path
 
Deliverables:
- DataHaskell Website Revamp
  - Modern design
  - Clear getting started guide
  - Library comparison matrix
  - Migration guides (from Python, R)
  - Success stories
- Tutorial Series
  - Installation and setup (all platforms)
  - Your first data analysis
  - DataFrames deep dive
  - Machine learning workflow
  - Distributed computing basics
  - Production deployment
- Notebook Gallery
  - 20+ example notebooks covering:
    - Data cleaning and exploration
    - Statistical analysis
    - ML model building
    - Visualization
    - Domain-specific examples (finance, biology, etc.)
 
 
Phase 2 (Q3-Q4 2026) - Tooling
Owner: Community
Deliverables:
- datahaskell-cli (new tool)
  - Project scaffolding
  - Dependency management presets
  - Environment setup automation
  - Example project templates
- IDE Support Improvements
  - VSCode extension enhancements
  - HLS integration guides
  - Debugging workflows
  - IHaskell kernel improvements
- Testing & CI Templates (see the property test sketch below)
  - Property-based testing examples
  - Benchmark suites
  - GitHub Actions templates
  - Continuous deployment patterns
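As one template for property-based tests, the QuickCheck property below pins down the output length of the rollingMean sketch from Pillar 1 (repeated here so the file is self-contained): a window of width w over n elements must yield max 0 (n - w + 1) results.

```haskell
-- Property-based test template using QuickCheck.
import Test.QuickCheck (Positive (..), quickCheck)

-- rollingMean as sketched under Pillar 1, repeated for self-containedness.
rollingMean :: Int -> [Double] -> [Double]
rollingMean w xs
  | w <= 0    = []
  | otherwise = go xs
  where
    go ys
      | length ys < w = []
      | otherwise     = sum (take w ys) / fromIntegral w : go (drop 1 ys)

-- A window of width w over n elements yields max 0 (n - w + 1) results.
prop_rollingLength :: Positive Int -> [Double] -> Bool
prop_rollingLength (Positive w) xs =
  length (rollingMean w xs) == max 0 (length xs - w + 1)

main :: IO ()
main = quickCheck prop_rollingLength
```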
 
 
Pillar 6: Community & Ecosystem
Ongoing Initiatives
Goals:
- Grow contributor base
- Foster collaboration
- Drive adoption
 
Deliverables:
- Community Building
  - Monthly community calls (starting Q1 2026)
  - Discord/Slack workspace
  - Quarterly virtual conferences
  - Mentorship program
- Contribution Framework
  - Good first issues across all projects
  - Contribution guidelines
  - Code review standards
  - Recognition program
- Outreach
  - Blog post series
  - Conference talks (Haskell Symposium, ZuriHac, etc.)
  - Academic collaborations
  - Industry partnerships
- Package Standards
  - Naming conventions
  - API design guidelines
  - Documentation requirements
  - Testing standards
  - Version compatibility matrix
 
 
Integration Priority Matrix
Critical Integrations (Start Immediately)
- dataframe ↔ Hasktorch: Data → Training pipeline
- dataframe ↔ IHaskell: Interactive analysis
- dataframe ↔ statistics: Analysis workflow
 
High Priority (Q2-Q3 2026)
- dataframe ↔ distributed-process: Distributed operations
- Hasktorch ↔ distributed-process: Distributed training
- statistics ↔ visualization: Plot statistical results
 
Medium Priority (Q4 2026)
- All ↔ model deployment: Production pipeline
- All ↔ monitoring: Observability
 
Success Metrics
Q2 2026
- dataframe v0.1 released with 500+ downloads/month
- 3 complete end-to-end tutorials published
- Performance benchmarks showing ≥70% of Pandas speed
- 5 integration examples between major libraries
 
Q4 2026
- 10,000+ total library downloads/month across ecosystem
- 20+ companies using DataHaskell in production
- 50+ active contributors
- Performance parity (≥90%) with Pandas for common operations
- Complete ML workflow from data to deployment documented
 
Q2 2027
- 100+ companies using DataHaskell
- DataHaskell track at a major Haskell conference
- 3+ published case studies
- Comprehensive distributed computing examples
 
Q4 2027
- Feature completeness with Python's core data science stack
- 5+ production ML systems case studies
- Enterprise support offerings available
 
Resource Requirements
Maintainer Coordination
- Monthly sync: All pillar leads (1 hour)
- Quarterly planning: Full maintainer group (2 hours)
- Annual retreat: Strategic direction (virtual or in-person)
 
Funding Needs (Optional but Helpful)
- Infrastructure
  - Benchmark server (GPU-enabled)
  - CI/CD resources
  - Documentation hosting
- Developer Support
  - Part-time technical writer
  - Maintainer stipends (Haskell Foundation)
  - Summer of Haskell projects
- Events
  - Quarterly virtual meetups
  - Annual in-person hackathon
  - Conference sponsorships
 
 
Risk Mitigation
Technical Risks
| Risk | Mitigation |
|---|---|
| Performance doesn't match Python | Early benchmarking, profiling, and optimization sprints |
| Integration complexity | Defined interfaces, versioning strategy, compatibility tests |
| Breaking changes in dependencies | Conservative version bounds, testing matrix |
Community Risks
| Risk | Mitigation |
|---|---|
| Maintainer burnout | Distributed ownership, recognition program, funding support |
| Fragmentation | Regular coordination, shared roadmap, integration testing |
| Slow adoption | Marketing efforts, case studies, migration guides |
Ecosystem Risks
| Risk | Mitigation |
|---|---|
| GHC changes break libraries | Test against multiple GHC versions, engage with GHC team |
| Competing projects | Focus on collaboration, clear differentiation |
| Limited contributor pool | Mentorship, good documentation, welcoming community |
Decision Framework
When to add new libraries
Criteria:
- Fills clear gap in ecosystem
- Has committed maintainer
- Integrates with existing components
- Follows API design guidelines
- Includes comprehensive tests and docs
 
When to deprecate/consolidate
Criteria:
- Unmaintained for >6 months
- Better alternative exists
- Low usage (<100 downloads/month)
- Creates confusion in ecosystem
 
Version Compatibility Policy
- Support last 2 GHC versions
- Follow the Haskell PVP (Package Versioning Policy)
- Deprecation warnings for 2 releases before removal
- Compatibility matrix published on website
 
Communication Plan
Internal (Maintainers)
- Slack/Discord channel: Daily async communication
- GitHub Discussions: Technical decisions, RFCs
- Monthly video call: Roadmap progress, blockers
- Quarterly planning session: Next phase priorities
 
External (Community)
- Blog: Monthly progress updates
- Twitter/Social: Weekly highlights
- Haskell Discourse: Major announcements
- Newsletter: Quarterly ecosystem update
- Documentation: Always up-to-date
 
Near-Term Action Items (Next 30 Days)
For dataframe maintainer (mchav)
- Finalize v0.1 release checklist
- Write Parquet support specification
- Create 3 dataframe → Hasktorch examples
- Set up benchmark infrastructure
 
For Hasktorch team
- Test dataframe integration patterns
- Document tensor conversion APIs
- Create example pipeline notebook
- Identify distributed training requirements
 
For distributed-process team
- Prototype distributed dataframe operations
- Document deployment patterns
- Create cluster setup guide
- Design fault-tolerance strategy
 
For community coordinator
- Set up monthly call schedule
- Create Discord/Slack workspace
- Draft website redesign plan
- Reach out to potential contributors
 
For all
- Review and comment on this roadmap
- Identify personal capacity for next 6 months
- Claim ownership of specific deliverables
- Share roadmap with broader community
 
Appendix A: Related Projects to Consider
Existing Haskell Projects
- Frames: Alternative dataframe (potential collaboration/consolidation?)
- hmatrix: Linear algebra (ensure compatibility)
- statistics: Statistical computing (modernization candidate)
- Chart/hvega: Visualization (integration targets)
- postgresql-simple: Database connectivity (see the sketch below)
- accelerate: Array processing with GPU support
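Of these, postgresql-simple gives Pillar 1's PostgreSQL deliverable an immediate base. A minimal query looks like the sketch below; the connection string and the users table are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Minimal postgresql-simple query; the connection string and table are
-- hypothetical. A dataframe integration would wrap this in a typed API.
import Database.PostgreSQL.Simple (connectPostgreSQL, query_)

main :: IO ()
main = do
  conn <- connectPostgreSQL "host=localhost dbname=test"
  rows <- query_ conn "SELECT name, age FROM users" :: IO [(String, Int)]
  mapM_ print rows
```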
 
External Integration Targets
- Apache Arrow: Zero-copy data interchange
- DuckDB: Embedded analytical database
- ONNX: Model interchange format
- MLflow: ML lifecycle management
 
Appendix B: Glossary
- Critical Path: dataframe → statistics → ML toolkit → distributed operations
- Integration Points: Where libraries share data structures or APIs
- Zero-Copy: Data sharing without duplication in memory
- Type-Safe: Compile-time guarantees about data structure and operations
Appendix C: Version History
| Version | Date | Changes | Author | 
|---|---|---|---|
| 1.0 | Nov 2025 | Initial comprehensive roadmap | DataHaskell coordinators |
How to Use This Roadmap
This is a living document. We will:
- Review quarterly and adjust priorities
- Track progress in GitHub projects
- Celebrate milestones publicly
- Adapt based on community feedback
 
Contributing: See [CONTRIBUTING.md] for how to propose changes to this roadmap.
Questions? Open a discussion on GitHub or join our community calls.
Let's build the future of data science in Haskell together! 🚀