Current Environment
A knowledge base of data science and machine learning tools and algorithms written in Haskell that either already exist or we would like to exist.
Note: some libraries are mentioned more than once because they cover several areas; for clarity, links and project descriptions are given only at their first occurrence.
Visualization
 Chart : A library for generating 2D charts and plots, with backends provided by Cairo (http://hackage.haskell.org/package/Chart-cairo) and Diagrams (http://hackage.haskell.org/package/Chart-diagrams). Documentation: https://github.com/timbod7/haskell-chart/wiki.
 plotly-hs : A library for generating JSON values for use with the Plotly.js library. The interface directly reflects the structure of Plotly.js and is therefore quite low-level. Lenses are used throughout to set Maybe fields in records, providing both data and configuration options. The library does not attempt to communicate with the Plotly API in any other way. All generated plots can be hosted on standalone web pages.
 plots : Diagrams-based plotting library.
 hvega : Create Vega-Lite visualizations in Haskell, targeting version 3 of the Vega-Lite specification. This library closely follows the naming conventions of Vega-Lite and offers an extensive tutorial.
Publication
 knit : knit-haskell is a beginning attempt at bringing some of the benefits of R Markdown to Haskell. Various helper functions are provided to simplify common operations, making it especially straightforward to build an HTML document from bits of Markdown, LaTeX and Lucid or Blaze HTML. Support for including hvega visualizations is also provided.
 pandoc-pyplot : A Pandoc filter to include figures generated from Python code blocks. Keep the document and Python code in the same location. Output from Matplotlib is captured and included as a figure.
 inliterate : inliterate is a GHC preprocessor which transforms a Markdown document into a Haskell program which, when run, prints the input document to stdout in HTML format. Code blocks with special annotations can be treated in particular ways: as Haskell code that must be included in the generating program (at the top level or in a do block), or as code that must be evaluated with the results inserted into the HTML document. For an example document, see https://github.com/diffusionkinetics/open/blob/master/plotlyhs/gendoc/GenDocInlit.hs which compiles into https://glutamate.github.io/plotlyhs/.
Data structures
Data frames
 Frames : User-friendly, type-safe, runtime-efficient tooling for working with tabular data deserialized from comma-separated values (CSV) files. The type of each row of data is inferred from the data, which can then be streamed from disk or worked with in memory. Also see the comprehensive tutorial.
 Frames-map-reduce : This library contains some useful functions for using the map-reduce-folds package with Frames (containers of data rows) from the Frames package. Included, in Frames.MapReduce, are helpers for filtering Frames, splitting records into key and data columns, and reattaching key columns after reducing. Also included, in the Frames.Folds module, are some helpful functions for building folds of Frames from folds over each column, specified either individually or via a constraint on all the columns being folded over.
 analyze : pandas-like dataframe operations for tabular data with a CSV interface. Currently maintained within the scope of the DataHaskell dh-core project.
 bookkeeper : A new take on datatypes and records using OverloadedLabels (which is available since GHC 8). It bears some similarities to Nikita Volkov’s record library, but requires no Template Haskell.
 colonnade : The colonnade package provides a way to talk about columnar encodings and decodings of data. This package provides very general types and does not provide a way for the end user to actually apply the columnar encodings they build to data. Most users will also want one of the companion packages, each of which provides (1) a content type and (2) functions for feeding data into a columnar encoding:
 lucid-colonnade for lucid HTML tables
 blaze-colonnade for blaze HTML tables
 reflex-dom-colonnade for reactive reflex-dom tables
 yesod-colonnade for yesod widgets
 siphon for encoding and decoding CSVs
Streaming / Folds
 map-reduce-folds : map-reduce-folds is an attempt to find a good balance between simplicity, performance, and flexibility for simple map/reduce style operations on a foldable container f of some type x (e.g., [x]). The goal of the package is to provide an efficient fold for the computation via the types in the foldl package. Folds can be composed applicatively, which makes it simple to do many such operations on the same data and loop over it only once.
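As an illustration of why applicative composition lets several folds share a single traversal, here is a minimal sketch using only base. The Fold type below is a hypothetical stand-in written for this example; it mirrors the general idea behind the foldl package but is not that package's API, and the names runFold, sumF, lengthF and meanF are assumptions of this sketch.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.Foldable (foldl')

-- A strict left fold packaged as data: step function, initial state, extractor.
-- Hypothetical stand-in for a foldl-style Fold type (illustration only).
data Fold a b = forall s. Fold (s -> a -> s) s (s -> b)

instance Functor (Fold a) where
  fmap f (Fold step begin done) = Fold step begin (f . done)

-- Pairing the two states lets both folds advance on every element,
-- so the combined fold still traverses the input exactly once.
instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (const b)
  Fold stepL beginL doneL <*> Fold stepR beginR doneR =
    Fold step (beginL, beginR) done
    where
      step (l, r) a =
        let l' = stepL l a
            r' = stepR r a
         in l' `seq` r' `seq` (l', r')
      done (l, r) = doneL l (doneR r)

-- Run a fold over any Foldable container in one pass.
runFold :: Foldable f => Fold a b -> f a -> b
runFold (Fold step begin done) xs = done (foldl' step begin xs)

sumF :: Num a => Fold a a
sumF = Fold (+) 0 id

lengthF :: Fold a Int
lengthF = Fold (\n _ -> n + 1) 0 id

-- Mean as the applicative combination of two folds: one loop, two results.
meanF :: Fold Double Double
meanF = (/) <$> sumF <*> fmap fromIntegral lengthF
```

In GHCi, runFold meanF [1, 2, 3, 4] computes the mean while visiting each element once, and runFold ((,) <$> sumF <*> lengthF) returns the sum and the count from the same pass.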
Arrays
 vector : An efficient implementation of Int-indexed arrays (both mutable and immutable), with a powerful loop optimisation framework.
 contiguous : This package provides a typeclass Contiguous that offers a unified interface to working with Array, SmallArray, PrimArray, and UnliftedArray.
Multidimensional arrays
 accelerate : Data.Array.Accelerate defines an embedded array language for high-performance computing in Haskell. Computations on multidimensional, regular arrays are expressed in the form of parameterised collective operations, such as maps, reductions, and permutations. These computations may then be compiled online and executed on a range of architectures.
 repa : Repa provides high performance, regular, multidimensional, shape polymorphic parallel arrays. All numeric data is stored unboxed.
 massiv : Repa-style high-performance multidimensional arrays with nested parallelism and stencil computation capabilities.
 arrayfire : Haskell bindings to the ArrayFire general-purpose GPU library.
Records
 vinyl : Extensible records for Haskell with lenses. It has minimal dependencies, and Frames is based on it.
 labels : Declare and access tuple fields with labels. An approach to anonymous records.
 superrecord : Supercharged anonymous records. Introductory blog post, with case study using ReaderT.
Graphs
 algebraic-graphs : algebraic-graphs (a.k.a. alga) is a library for algebraic construction and manipulation of graphs in Haskell. See this paper for the motivation behind the library, the underlying theory and implementation details. The top-level module Algebra.Graph defines the core data type Graph, which is a deep embedding of four graph construction primitives: empty, vertex, overlay and connect. To represent non-empty graphs, see Algebra.Graph.NonEmpty. More conventional graph representations can be found in Algebra.Graph.AdjacencyMap and Algebra.Graph.Relation. The type classes defined in Algebra.Graph.Class and Algebra.Graph.HigherKinded.Class can be used for polymorphic graph construction and manipulation. Also see Algebra.Graph.Fold, which defines the Boehm-Berarducci encoding of algebraic graphs and provides additional flexibility for polymorphic graph manipulation.
 fgl : An inductive representation of manipulating graph data structures. Original website can be found at http://web.engr.oregonstate.edu/~erwig/fgl/haskell.
 graphite : Represent, analyze and visualize graphs & networks. A beginner-friendly tutorial can be found at https://haskell-graphite.readthedocs.io/en/latest/.
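To make the "deep embedding of four graph construction primitives" mentioned for algebraic-graphs more concrete, the following is a small self-contained sketch using only base. It is written for illustration and is not the library's code: the helper functions vertices and edges and the example path123 are assumptions of this sketch, and the real library offers far more (and more efficient) functionality.

```haskell
import Data.List (nub, sort)

-- One constructor per primitive: empty, vertex, overlay, connect.
data Graph a
  = Empty
  | Vertex a
  | Overlay (Graph a) (Graph a)
  | Connect (Graph a) (Graph a)

-- All vertices mentioned anywhere in the expression, sorted and deduplicated.
vertices :: Ord a => Graph a -> [a]
vertices = sort . nub . go
  where
    go Empty = []
    go (Vertex x) = [x]
    go (Overlay g h) = go g ++ go h
    go (Connect g h) = go g ++ go h

-- Overlay takes the union of both graphs; Connect additionally adds an
-- edge from every vertex of the left graph to every vertex of the right.
edges :: Ord a => Graph a -> [(a, a)]
edges = sort . nub . go
  where
    go Empty = []
    go (Vertex _) = []
    go (Overlay g h) = go g ++ go h
    go (Connect g h) =
      go g ++ go h ++ [(x, y) | x <- vertices g, y <- vertices h]

-- The path 1 -> 2 -> 3 expressed with the primitives.
path123 :: Graph Int
path123 =
  Overlay (Connect (Vertex 1) (Vertex 2)) (Connect (Vertex 2) (Vertex 3))
```

Evaluating edges path123 in GHCi lists the two edges of the path; connecting a vertex to an overlay of two vertices yields a small star.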
Trees
 tree-traversals : The tree-traversals package defines in-order, pre-order, post-order, level-order, and reversed level-order traversals for tree-like types, and it also provides newtype wrappers for the various traversals so they may be used with traverse.
Database interfaces
 beam Homepage : Beam is a highly general library for accessing any kind of database with Haskell. It supports several backends. beam-postgres and beam-sqlite are included in the main beam repository. Others are hosted and maintained independently, such as beam-mysql and beam-firebird. The documentation here shows examples in all known backends. Beam is highly extensible, and other backends can be shipped independently without requiring any changes in the core libraries. For information on creating additional SQL backends, see the manual section.
 selda : Selda is a Haskell library for interacting with SQL-based relational databases, inspired by LINQ and Opaleye.
 relational-record Homepage : Haskell Relational Record (HRR) is a query generator based on typed relational algebra and correspondence between SQL value lists and Haskell record types, which provides programming interfaces to Relational DataBase Management Systems (RDBMS).
Numerical linear algebra
 hmatrix : Bindings to BLAS/LAPACK. Linear solvers, matrix decompositions, and more.
 dense-linear-algebra : A collection of linear-algebra related modules, extracted from the statistics library. Matrices and vectors are internally represented with unboxed vectors, and the algorithms rely on in-place mutation for high efficiency. Currently maintained within the scope of the DataHaskell dh-core project.
 sparse-linear-algebra : Native library for sparse algebraic computation. Linear solvers, matrix decompositions and related tools; functional but not optimized for efficiency yet.
 linearEqSolver : Solve linear systems of equations over integers and rationals, using an SMT solver.
Generation of random data
 mwc-probability : A simple probability distribution type, where distributions are characterized by sampling functions. Simple and idiomatic interface based on Applicative and Monad instances.
 random-fu : Random number generation based on modeling random variables in two complementary ways: first, by the parameters of standard mathematical distributions and, second, by an abstract type (RVar) which can be composed and manipulated monadically and sampled in either monadic or “pure” styles. The primary purpose of this library is to support defining and sampling a wide variety of high quality random variables. Quality is prioritized over speed, but performance is an important goal too. Very flexible, providing both a concrete (‘Distribution’) and an abstract but composable (‘RVar’) view of random variables.
 pcg-random : Haskell bindings to the PCG random number generator http://www.pcg-random.org. The API is very similar to mwc-random, but the PCG generator appears to be slightly faster. There is also a pure interface via the random library.
Statistics
 statistics : This library provides a number of common functions and types useful in statistics. We focus on high performance, numerical robustness, and use of good algorithms. Where possible, we provide references to the statistical literature.
The library’s facilities can be divided into four broad categories:
 Working with widely used discrete and continuous probability distributions. (There are dozens of exotic distributions in use; we focus on the most common.)
 Computing with sample data: quantile estimation, kernel density estimation, histograms, bootstrap methods, significance testing, and regression and autocorrelation analysis.
 Random variate generation under several different distributions.
 Common statistical tests for significant differences between samples.
 sampling : Sampling from arbitrary Foldable collections:
 sample, for sampling without replacement
 resample, for sampling with replacement (i.e., a bootstrap)
Each variation can be prefixed with p to sample from a container of values weighted by probability.
 foldl-statistics : A reimplementation of the Statistics.Sample module using the foldl package. The intention of this package is to allow these algorithms to be used on a much broader set of data input types, including lists and streaming libraries such as conduit and pipes, and any other type which is Foldable. All statistics in this package can be computed with no more than two passes over the data: once to compute the mean and once to compute any statistics which require the mean.
 tdigest : A new data structure for accurate online accumulation of rank-based statistics such as quantiles and trimmed means.
 histogram-fill : A convenient way to create and fill histograms. It supports fixed and variable-size bins, missing data and 2D binning.
 uncertain : Provides tools to manipulate numbers with inherent experimental/measurement uncertainty, and propagates them through functions.
 measurable : Construct measures from samples, mass/density functions, or even sampling functions. Construct image measures by fmapping measurable functions over them, or create new measures from existing ones by measure convolution and friends provided by a simple Num instance enabled by an Applicative instance. Create measures from graphs of other measures using the Monad instance and do-notation. Query measures by integrating measurable functions against them. Extract moments, cumulative density functions, or probabilities. Caveat: while fun to play with, and rewarding to see how measures fit together, measure operations as nested integrals are exponentially complex. Don’t expect them to scale very far!
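The two-pass scheme noted in the foldl-statistics entry (one pass for the mean, a second for anything that depends on it) can be sketched with base alone. This is an illustration under stated assumptions, not code from any of the packages above: the names Acc, meanOf and sampleVariance are hypothetical, and the input is assumed to hold at least two values.

```haskell
import Data.Foldable (foldl')

-- Strict accumulator: element count and a running total.
data Acc = Acc !Int !Double

-- Pass 1: the mean, from one strict left fold.
meanOf :: Foldable f => f Double -> Double
meanOf xs = total / fromIntegral n
  where
    Acc n total = foldl' step (Acc 0 0) xs
    step (Acc k t) x = Acc (k + 1) (t + x)

-- Pass 2: the unbiased sample variance, which needs the mean from pass 1.
-- Assumes at least two elements (divides by n - 1).
sampleVariance :: Foldable f => f Double -> Double
sampleVariance xs = ss / fromIntegral (n - 1)
  where
    m = meanOf xs
    Acc n ss = foldl' step (Acc 0 0) xs
    step (Acc k t) x = let d = x - m in Acc (k + 1) (t + d * d)
```

Because both functions only require Foldable, the same code works on lists and on other containers; subtracting the mean before squaring is also numerically safer than the one-pass sum-of-squares formula.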
Integration
 Markov Chain Monte Carlo
 declarative : A simple combinator language for Markov transition operators that are useful in MCMC.
 flat-mcmc : flat-mcmc uses an ensemble sampler that is invariant to affine transformations of space. It wanders a target probability distribution’s parameter space as if it had been “flattened” or “unstretched” in some sense, allowing many particles to explore it locally and in parallel. In general this sampler is useful when you want decent performance without dealing with any tuning parameters or local proposal distributions.
 Dynamical systems
Differentiation
 Automatic differentiation
 ad : Automatic differentiation to arbitrary order, applicable to data provided in any Traversable container.
 backprop : Automatic heterogeneous backpropagation. Write your functions to compute your result, and the library will automatically generate functions to compute your gradient. Differs from ad by offering full heterogeneity: each intermediate step and the resulting value can have different types. Mostly intended for usage with gradient descent and other numeric optimization techniques. Introductory blog post here.
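For readers new to automatic differentiation, forward mode with dual numbers is one standard technique and fits in a few lines of base Haskell. The sketch below is illustrative only: it is not how ad or backprop are implemented or used, it handles just first derivatives of functions built from Num operations, and the names Dual and diffDual are assumptions of this example.

```haskell
-- A dual number carries a value together with its derivative.
data Dual = Dual Double Double

instance Num Dual where
  Dual a a' + Dual b b' = Dual (a + b) (a' + b')
  Dual a a' - Dual b b' = Dual (a - b) (a' - b')
  -- Product rule: (ab)' = a b' + a' b
  Dual a a' * Dual b b' = Dual (a * b) (a * b' + a' * b)
  negate (Dual a a') = Dual (negate a) (negate a')
  abs (Dual a a') = Dual (abs a) (a' * signum a)
  signum (Dual a _) = Dual (signum a) 0
  -- Constants have derivative zero.
  fromInteger n = Dual (fromInteger n) 0

-- Derivative of f at x: seed the input with derivative 1, read off the result.
diffDual :: (Dual -> Dual) -> Double -> Double
diffDual f x = d
  where
    Dual _ d = f (Dual x 1)
```

For example, diffDual (\x -> x * x * x + 2 * x) 2 differentiates x^3 + 2x at x = 2 without symbolic manipulation or finite differences.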
Optimization
 Linear programming
 Convex optimization
Signal processing
Discrete/Fast Fourier Transform
Machine Learning frameworks
 probably-baysig : This library contains definitions and functions for probabilistic and statistical inference.
 Math.Probably.Sampler defines the sampling function monad
 Math.Probably.PDF defines some common parametric log-probability density functions
 Math.Probably.FoldingStats defines statistics as folds that can be composed and calculated independently of the container of the underlying data.
 Strategy.* implements various transition operators for Markov Chain Monte Carlo, including Metropolis-Hastings, Hamiltonian Monte Carlo, NUTS, and continuous/discrete slice samplers.
 Math.Probably.MCMC implements functions and combinators for running Markov chains and interleaving transition operators.
 mltool : Haskell Machine Learning Toolkit includes various methods of supervised learning (linear regression, logistic regression, SVM, neural networks, etc.) as well as some unsupervised methods: K-Means and PCA.
Bayesian inference
Nested sampling
 NestedSampling : The code here is a fairly straightforward translation of the tutorial nested sampling code from Skilling and Sivia. The original code can be found at http://www.inference.phy.cam.ac.uk/bayesys/sivia/ along with documentation at http://www.inference.phy.cam.ac.uk/bayesys/. An example program called lighthouse.hs is included.
 NestedSampling-hs : This is a Haskell implementation of the classic Nested Sampling algorithm introduced by John Skilling. You can use it for Bayesian inference, statistical mechanics, and optimisation applications, and it comes with a few example programs.
Probabilistic programming languages
 monad-bayes : A library for probabilistic programming in Haskell using probability monads. The emphasis is on composition of inference algorithms implemented in terms of monad transformers. The code is still experimental, but will be released on Hackage as soon as it reaches relative stability. A user’s guide will appear soon. In the meantime see the models folder, which contains several examples.
 hakaru : Hakaru is a simply-typed probabilistic programming language, designed for easy specification of probabilistic models and inference algorithms. Hakaru enables the design of modular probabilistic inference programs by providing:
 A language for representing probabilistic distributions, queries, and inferences
 Methods for transforming probabilistic information, such as conditional probability and probabilistic inference, using computer algebra
 deanie : deanie is an embedded probabilistic programming language. It can be used to denote, sample from, and perform inference on probabilistic programs.
Supervised learning
Time-series filtering
 Kalman filtering
 estimator : The goal of this library is to simplify implementation and use of state-space estimation algorithms, such as Kalman Filters. The interface for constructing models is isolated as much as possible from the specifics of a given algorithm, so swapping out a Kalman Filter for a Bayesian Particle Filter should involve a minimum of effort. This implementation is designed to support symbolic types, such as from sbv or ivory. As a result you can generate code in another language, such as C, from a model written using this package, or run static analyses on your model.
 kalman : Linear, extended and unscented Kalman filters are provided, along with their corresponding smoothers. Furthermore, a particle filter and smoother are provided.
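As a rough idea of what state-space estimation involves, here is a minimal scalar Kalman filter in base Haskell. It is a sketch under strong assumptions, not the interface of estimator or kalman: the state is a single number following a random walk with process-noise variance q, each measurement observes it directly with noise variance r, and the names Belief, predict, update and kalmanFilter are hypothetical.

```haskell
-- Gaussian belief about a scalar state: mean and variance.
data Belief = Belief { beliefMean :: Double, beliefVar :: Double }
  deriving (Show)

-- Prediction for a random-walk model (assumed): the mean is unchanged,
-- the uncertainty grows by the process-noise variance q.
predict :: Double -> Belief -> Belief
predict q (Belief m p) = Belief m (p + q)

-- Measurement update for a direct observation z with noise variance r.
-- k is the Kalman gain: how far to move towards the measurement.
update :: Double -> Double -> Belief -> Belief
update r z (Belief m p) = Belief (m + k * (z - m)) ((1 - k) * p)
  where
    k = p / (p + r)

-- Run predict/update over a list of measurements, returning the
-- filtered belief after each one.
kalmanFilter :: Double -> Double -> Belief -> [Double] -> [Belief]
kalmanFilter q r prior = drop 1 . scanl step prior
  where
    step b z = update r z (predict q b)
```

With q = 0 the filter reduces to recursive averaging of the measurements against the prior; larger q makes it track recent measurements more closely.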
Graphical models
 Hidden Markov models
 HMM
 hmm-hmatrix : Hidden Markov Models implemented using HMatrix data types and operations. http://en.wikipedia.org/wiki/Hidden_Markov_Model
It supports any kind of emission distribution, where discrete and multivariate Gaussian distributions are implemented as examples.
It currently implements:
 generation of samples of emission sequences,
 computation of the likelihood of an observed sequence of emissions,
 construction of most likely state sequence that produces an observed sequence of emissions,
 supervised and unsupervised training of the model by the Baum-Welch algorithm.
 learning-hmm : This library provides functions for the maximum likelihood estimation of discrete hidden Markov models. At present, only Baum-Welch and Viterbi algorithms are implemented for the plain HMM and the input-output HMM.
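The "likelihood of an observed sequence of emissions" listed above is classically computed with the forward algorithm. The sketch below implements it on plain lists with base only; it is not the API of hmm-hmatrix or learning-hmm, it assumes discrete emissions indexed from 0, and the record fields and the example parameters are made up for illustration.

```haskell
-- A discrete HMM with states and observations indexed from 0.
data HMM = HMM
  { initial    :: [Double]    -- initial !! i       = P(first state = i)
  , transition :: [[Double]]  -- transition !! i !! j = P(next = j | current = i)
  , emission   :: [[Double]]  -- emission !! i !! o  = P(observation = o | state = i)
  }

-- Forward algorithm: alpha !! j is the joint probability of the
-- observations seen so far and of currently being in state j.
likelihood :: HMM -> [Int] -> Double
likelihood _ [] = 1
likelihood hmm (o : os) = sum (foldl step alpha0 os)
  where
    emit obs = map (!! obs) (emission hmm)
    alpha0 = zipWith (*) (initial hmm) (emit o)
    step alpha obs =
      zipWith (*) (emit obs)
        [ sum (zipWith (*) alpha (map (!! j) (transition hmm)))
        | j <- [0 .. length alpha - 1]
        ]

-- A two-state, two-symbol example with made-up parameters.
exampleHMM :: HMM
exampleHMM =
  HMM
    { initial = [0.5, 0.5]
    , transition = [[0.9, 0.1], [0.2, 0.8]]
    , emission = [[0.8, 0.2], [0.3, 0.7]]
    }
```

A quick sanity check is that the likelihoods of all observation sequences of a fixed length sum to one. For long sequences a practical implementation would rescale alpha or work in log space to avoid underflow.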
Classification
 Linear discriminant analysis
 Support Vector Machines
 svm-simple : A set of simplified bindings to the libsvm suite of support vector machines. This package provides tools for classification, one-class classification and support vector regression.
 Decision trees
 hinduce-classifier-decisiontree : A very simple decision tree construction algorithm; an implementation of hinduce-classifier’s Classifier class.
 HaskellGBM : Haskell wrapper around LightGBM, the distributed library for gradient-boosted decision tree algorithms. The emphasis is on using Haskell types (in particular the refined library) to help ensure that the hyperparameter settings chosen by the user are coherent and in bounds at all times.
 Gaussian processes
Neural Networks
 neural : The goal of neural is to provide a modular and flexible neural network library written in native Haskell.
Features include
 composability via arrow-like instances and pipes,
 automatic differentiation for automatic gradient descent/backpropagation training (using Edward Kmett’s fabulous ad library).
The idea is to be able to easily define new components and wire them up in flexible, possibly complicated ways (convolutional deep networks etc.). Four examples are included as proof of concept:
 A simple neural network that approximates the sine function on [0,2 pi].
 Another simple neural network that approximates the sqrt function on [0,4].
 A slightly more complicated neural network that solves the famous Iris flower problem.
 A first (still simple) neural network for recognizing handwritten digits from the equally famous MNIST database.
The library is still very much experimental at this point.
 backprop-learn : Combinators and types for easily building trainable neural networks using the ‘backprop’ library.
 grenade : Grenade is a composable, dependently typed, practical, and fast recurrent neural network library for precise specifications and complex deep neural networks in Haskell. Grenade provides an API for composing layers of a neural network into a sequence parallel graph in a type-safe manner; running networks with reverse automatic differentiation to calculate their gradients; and applying gradient descent for learning. Documentation and examples are available on GitHub: https://github.com/HuwCampbell/grenade.
 hasktorch : Hasktorch is a library for tensors and neural networks in Haskell. It is an independent open source community project which leverages the core C libraries shared by Torch and PyTorch. This library leverages cabal new-build and backpack. Note that this project is in early development and should only be used by contributing developers. Expect substantial changes to the library API as it evolves.
 tensorflow : Haskell bindings for TensorFlow.
 Recurrent Neural Networks
 grenade
 Convolutional Neural Networks
 grenade
 LSTM (Long Short-Term Memory)
 grenade
 neural
 Convolutional Neural Networks
 tensorflow

Generative Neural Networks
 References : Neural Networks, Types, and Functional Programming
Naive Bayes
 Gaussian Naive Bayes
 Multinomial Naive Bayes
 Bernoulli Naive Bayes
Boosting
 XGBoost
 AdaBoost
Regression
 Nearest Neighbors
 Linear Regression
 statistics
 Gaussian processes
 HasGP
 Kalman filtering
 kalman
 estimator
Reinforcement learning
 reinforce : reinforce exports an OpenAI-gym-like typeclass, MonadEnv, with both an interface to gym-http-api and Haskell-native environments, which provide a substantial speed-up over the HTTP-server interface.
 gym-http-api : This library provides a REST client to the gym open-source library. gym-http-api itself provides a Python-based REST server to the gym open-source library, allowing development in languages other than Python. Note that openai/gym-http-api is a monorepo of all language clients. This Hackage library tracks stites/gym-http-api, which is the actively maintained Haskell fork.

Policy gradient
 Q-Learning
 Neural Network Q-Learning
Unsupervised Learning
Nearest neighbours/spatial queries
 KdTree : A simple library for k-d trees in Haskell. It enables searching through collections of points in O(log N) average time, using the nearestNeighbor function.
 kdt : This package includes static and dynamic versions of k-d trees, as well as “Map” variants that store data at each point in the k-d tree structure. Supports nearest neighbor, k nearest neighbors, points within a given radius, and points within a given range. To learn to use this package, start with the documentation for the Data.KdTree.Static module.
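To show where the O(log N) average-time claim comes from, here is a compact k-d tree with nearest-neighbour search written against base only. It is an illustrative sketch, not the API of KdTree or kdt: the names Point, build, sqDist and nearest are assumptions of this example, all points are assumed to have the same non-zero dimension as the query, and the tree is static.

```haskell
import Data.List (sortOn)

type Point = [Double]

data KdTree = Leaf | Node KdTree Point KdTree

-- Build a balanced tree by splitting on the median, cycling through the axes.
build :: [Point] -> KdTree
build = go 0
  where
    go :: Int -> [Point] -> KdTree
    go _ [] = Leaf
    go depth ps@(p : _) =
      case splitAt (length ps `div` 2) (sortOn (!! axis) ps) of
        (lo, median : hi) -> Node (go (depth + 1) lo) median (go (depth + 1) hi)
        (_, []) -> Leaf -- unreachable for non-empty input
      where
        axis = depth `mod` length p

-- Squared Euclidean distance.
sqDist :: Point -> Point -> Double
sqDist a b = sum (zipWith (\x y -> (x - y) * (x - y)) a b)

-- Nearest neighbour: descend into the side containing the query first,
-- and visit the other side only if the splitting plane is closer than
-- the best candidate found so far.
nearest :: KdTree -> Point -> Maybe Point
nearest tree q = go 0 tree Nothing
  where
    dim = length q
    better best p = case best of
      Just b | sqDist b q <= sqDist p q -> best
      _ -> Just p
    go :: Int -> KdTree -> Maybe Point -> Maybe Point
    go _ Leaf best = best
    go depth (Node l p r) best =
      let axis = depth `mod` dim
          delta = (q !! axis) - (p !! axis)
          (near, far) = if delta < 0 then (l, r) else (r, l)
          best1 = go (depth + 1) near (better best p)
          needFar = case best1 of
            Just b -> delta * delta < sqDist b q
            Nothing -> True
       in if needFar then go (depth + 1) far best1 else best1
```

The pruning test is what saves work: whole subtrees are skipped whenever the query's distance to the splitting plane already exceeds the distance to the current best point.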
Clustering
 K-Means
 kmeans : A simple implementation of the standard k-means clustering algorithm.
 clustering : Methods included in this library: Agglomerative hierarchical clustering: Complete linkage O(n^2), Single linkage O(n^2), Average linkage O(n^2), Weighted linkage O(n^2), Ward’s linkage O(n^2). K-Means clustering.
 Self-Organising Maps (SOM)
 HyperbolicSOM
 Hierarchical (H)SOMs
 Mean-shift
 Affinity propagation
 Spectral Clustering
 Hierarchical clustering
 clustering
 Birch
Dimensionality reduction
 Principal Component Analysis (PCA)
 Kernel PCA
 Incremental PCA
 Truncated SVD*
 Independent Component Analysis (ICA)
 t-SNE (t-distributed stochastic neighbor embedding)
Machine Learning Misc.
 sibe : A simple, experimental machine learning library. Contains implementations of
 Multiclass Naive Bayes classification
 Word2Vec word embedding
 Principal component analysis (PCA)
 aima-haskell : Algorithms from Artificial Intelligence: A Modern Approach by Russell and Norvig.
Applications
 Natural Language Processing (NLP)
 chatter : chatter is a collection of simple Natural Language Processing algorithms, which also comes with models for POS tagging and Phrasal Chunking that have been trained on the Brown corpus (POS only) and the CoNLL-2000 corpus (POS and Chunking).
Chatter supports:
 Part-of-speech tagging with Averaged Perceptrons. Based on the Python implementation by Matthew Honnibal: (http://honnibal.wordpress.com/2013/09/11/agoodpartofspeechpostaggerinabout200linesofpython/) See NLP.POS for the details of part-of-speech tagging with chatter.
 Phrasal Chunking (also with an Averaged Perceptron) to identify arbitrary chunks based on training data.
 Document similarity: a cosine-based similarity measure and TF-IDF calculations are available in the NLP.Similarity.VectorSim module.
 Information Extraction patterns via Parsec (http://www.haskell.org/haskellwiki/Parsec/)
 snowball : The Snowball FFI binding library is used to compute the stems of words in natural languages. Compared to the older stemmer package, this one:
 Correctly handles unicode without relying on the system locale
 Takes greater care to avoid memory leaks and to be thread safe
 Uses Text rather than String
 Bioinformatics
Datasets
 datasets : Classical machine learning and statistics datasets from the UCI Machine Learning Repository and other sources.
The datasets package defines two different kinds of datasets:
 Small data sets which are directly (or indirectly with file-embed) embedded in the package as pure values and do not require network or IO to download the data set. This includes Iris, Anscombe and OldFaithful.
 Other data sets which need to be fetched over the network and are cached in a local temporary directory.
Currently maintained within the scope of the DataHaskell dh-core project.
 mnist-idx : Read and write data in the IDX format used in e.g. the MNIST database.
Language interop
R
 HaskellR (https://tweag.github.io/HaskellR/)
 inline-r : Seamlessly call R from Haskell and vice versa. No FFI required. Efficiently mix Haskell and R code in the same source file using quasiquotation. R code is designed to be evaluated using an instance of the R interpreter embedded in the binary, with no marshalling costs and hence little to no overhead when communicating values back to Haskell.
 H : An interactive prompt for exploring and graphing data sets. This is a thin wrapper around GHCi, with the full power of an R prompt and the full power of a Haskell prompt: you can enter expressions of either language, providing you with plotting and distributed computing facilities out of the box.
Data science frameworks
Apache Spark bindings
 sparkle : A library for writing resilient analytics applications in Haskell that scale to thousands of nodes, using Spark and the rest of the Apache ecosystem under the hood. See the blog post for details.
 distributed-dataset : A distributed data processing framework in pure Haskell. Inspired by Apache Spark.
 Control.Distributed.Dataset provides a type which lets you express transformations on a distributed multiset. Its API is highly inspired by Apache Spark. It uses pluggable ShuffleStores for storing intermediate computation results. See ‘distributed-dataset-aws’ for an implementation using S3.
 Control.Distributed.Fork contains a fork function which lets you run arbitrary IO actions on remote machines, leveraging the StaticPointers language extension and the distributed-closure library. This module is useful when your task is embarrassingly parallel. It uses pluggable Backends for spawning executors. See ‘distributed-dataset-aws’ for an implementation using AWS Lambda.
 krapsh : Haskell bindings to Apache Spark. The library consists of:
 A specification to describe data pipelines in a language-agnostic manner, and a communication protocol to submit these pipelines to Spark.
 A serving library, called krapsh-server, that implements this specification on top of Spark. It is written in Scala and is loaded as a standard Spark package.
 A client written in Haskell that sends pipelines to Spark for execution. In addition, this client serves as an experimental platform for whole-program optimization and verification, as well as compiler-enforced type checking.
Contribute
If you know a library that has to do with data science, please consider adding it. If the category it belongs to doesn’t exist, suggest a category for it.
Add sections related to data science beyond machine learning, such as data mining, distributed processing, etc.