Copyright | (c) 2010-2011 Simon Meier (c) 2010 Jasper van der Jeugt |
---|---|

License | BSD3-style (see LICENSE) |

Maintainer | Simon Meier <iridcode@gmail.com> |

Portability | GHC |

Safe Haskell | Trustworthy |

Language | Haskell2010 |

This module provides `Builder`

*primitives*, which are lower level building
blocks for constructing `Builder`

s. You don't need to go down to this level but
it can be slightly faster.

Morally, builder primitives are like functions `a -> Builder`

, that is they
take a value and encode it as a sequence of bytes, represented as a `Builder`

.
Of course their implementation is a bit more specialised.

Builder primitives come in two forms: fixed-size and bounded-size.

*Fixed(-size) primitives*are builder primitives that always result in a sequence of bytes of a fixed length. That is, the length is independent of the value that is encoded. An example of a fixed size primitive is the big-endian encoding of a`Word64`

, which always results in exactly 8 bytes.*Bounded(-size) primitives*are builder primitives that always result in a sequence of bytes that is no larger than a predetermined bound. That is, the bound is independent of the value that is encoded but the actual length will depend on the value. An example for a bounded primitive is the UTF-8 encoding of a`Char`

, which can be 1,2,3 or 4 bytes long, so the bound is 4 bytes.

Note that fixed primitives can be considered as a special case of bounded primitives, and we can lift from fixed to bounded.

Because bounded primitives are the more general case, in this documentation we only refer to fixed size primitives where it matters that the resulting sequence of bytes is of a fixed length. Otherwise, we just refer to bounded size primitives.

The purpose of using builder primitives is to improve the performance of
`Builder`

s. These improvements stem from making the two most common steps
performed by a `Builder`

more efficient. We explain these two steps in turn.

The first most common step is the concatenation of two `Builder`

s. Internally,
concatenation corresponds to function composition. (Note that `Builder`

s can
be seen as difference-lists of buffer-filling functions; cf.
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dlist. )
Function composition is a fast *O(1)* operation. However, we can use bounded
primitives to remove some of these function compositions altogether, which is
more efficient.

The second most common step performed by a `Builder`

is to fill a buffer using
a bounded primitives, which works as follows. The `Builder`

checks whether
there is enough space left to execute the bounded primitive. If there is, then
the `Builder`

executes the bounded primitive and calls the next `Builder`

with
the updated buffer. Otherwise, the `Builder`

signals its driver that it
requires a new buffer. This buffer must be at least as large as the bound of
the primitive. We can use bounded primitives to reduce the number of
buffer-free checks by fusing the buffer-free checks of consecutive `Builder`

s.
We can also use bounded primitives to simplify the control flow for signalling
that a buffer is full by ensuring that we check first that there is enough
space left and only then decide on how to encode a given value.

Let us illustrate these improvements on the CSV-table rendering example from
Data.ByteString.Builder. Its "hot code" is the rendering of a table's
cells, which we implement as follows using only the functions from the
`Builder`

API.

import Data.ByteString.Builder as B renderCell :: Cell -> Builder renderCell (StringC cs) = renderString cs renderCell (IntC i) = B.intDec i renderString :: String -> Builder renderString cs = B.charUtf8 '"' <> foldMap escape cs <> B.charUtf8 '"' where escape '\\' = B.charUtf8 '\\' <> B.charUtf8 '\\' escape '\"' = B.charUtf8 '\\' <> B.charUtf8 '\"' escape c = B.charUtf8 c

Efficient encoding of `Int`

s as decimal numbers is performed by `intDec`

.
Optimization potential exists for the escaping of `String`

s. The above
implementation has two optimization opportunities. First, the buffer-free
checks of the `Builder`

s for escaping double quotes and backslashes can be
fused. Second, the concatenations performed by `foldMap`

can be eliminated.
The following implementation exploits these optimizations.

import qualified Data.ByteString.Builder.Prim as P import Data.ByteString.Builder.Prim (`condB`

,`liftFixedToBounded`

, (`>*<`

), (`>$<`

) ) renderString :: String -> Builder renderString cs = B.charUtf8 '"' <>`primMapListBounded`

escape cs <> B.charUtf8 '"' where escape ::`BoundedPrim`

Char escape =`condB`

(== '\\') (fixed2 ('\\', '\\')) $`condB`

(== '\"') (fixed2 ('\\', '\"')) $`charUtf8`

{-# INLINE fixed2 #-} fixed2 x =`liftFixedToBounded`

$ const x`>$<`

`char7`

`>*<`

`char7`

The code should be mostly self-explanatory. The slightly awkward syntax is
because the combinators are written such that the size-bound of the resulting
`BoundedPrim`

can be computed at compile time. We also explicitly inline the
`fixed2`

primitive, which encodes a fixed tuple of characters, to ensure that
the bound computation happens at compile time. When encoding the following list
of `String`

s, the optimized implementation of `renderString`

is two times
faster.

maxiStrings :: [String] maxiStrings = take 1000 $ cycle ["hello", "\"1\"", "λ-wörld"]

Most of the performance gain stems from using `primMapListBounded`

, which
encodes a list of values from left-to-right with a `BoundedPrim`

. It exploits
the `Builder`

internals to avoid unnecessary function compositions (i.e.,
concatenations). In the future, we might expect the compiler to perform the
optimizations implemented in `primMapListBounded`

. However, it seems that the
code is currently to complicated for the compiler to see through. Therefore, we
provide the `BoundedPrim`

escape hatch, which allows data structures to provide
very efficient encoding traversals, like `primMapListBounded`

for lists.

Note that `BoundedPrim`

s are a bit verbose, but quite versatile. Here is an
example of a `BoundedPrim`

for combined HTML escaping and UTF-8 encoding. It
exploits that the escaped character with the maximal Unicode codepoint is '>'.

{-# INLINE charUtf8HtmlEscaped #-} charUtf8HtmlEscaped ::`BoundedPrim`

Char charUtf8HtmlEscaped =`condB`

(> '>' )`charUtf8`

$`condB`

(== '<' ) (fixed4 ('&',('l',('t',';')))) $ -- <`condB`

(== '>' ) (fixed4 ('&',('g',('t',';')))) $ -- >`condB`

(== '&' ) (fixed5 ('&',('a',('m',('p',';'))))) $ -- &`condB`

(== '"' ) (fixed5 ('&',('#',('3',('4',';'))))) $ -- "`condB`

(== '\'') (fixed5 ('&',('#',('3',('9',';'))))) $ -- ' (`liftFixedToBounded`

`char7`

) -- fallback for`Char`

s smaller than '>' where {-# INLINE fixed4 #-} fixed4 x =`liftFixedToBounded`

$ const x`>$<`

char7`>*<`

char7`>*<`

char7`>*<`

char7 {-# INLINE fixed5 #-} fixed5 x =`liftFixedToBounded`

$ const x`>$<`

char7`>*<`

char7`>*<`

char7`>*<`

char7`>*<`

char7

This module currently does not expose functions that require the special
properties of fixed-size primitives. They are useful for prefixing `Builder`

s
with their size or for implementing chunked encodings. We will expose the
corresponding functions in future releases of this library.

## Synopsis

- data BoundedPrim a
- emptyB :: BoundedPrim a
- (>*<) :: Monoidal f => f a -> f b -> f (a, b)
- (>$<) :: Contravariant f => (b -> a) -> f a -> f b
- eitherB :: BoundedPrim a -> BoundedPrim b -> BoundedPrim (Either a b)
- condB :: (a -> Bool) -> BoundedPrim a -> BoundedPrim a -> BoundedPrim a
- primBounded :: BoundedPrim a -> a -> Builder
- primMapListBounded :: BoundedPrim a -> [a] -> Builder
- primUnfoldrBounded :: BoundedPrim b -> (a -> Maybe (b, a)) -> a -> Builder
- primMapByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder
- primMapLazyByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder
- data FixedPrim a
- emptyF :: FixedPrim a
- liftFixedToBounded :: FixedPrim a -> BoundedPrim a
- primFixed :: FixedPrim a -> a -> Builder
- primMapListFixed :: FixedPrim a -> [a] -> Builder
- primUnfoldrFixed :: FixedPrim b -> (a -> Maybe (b, a)) -> a -> Builder
- primMapByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder
- primMapLazyByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder
- int8 :: FixedPrim Int8
- word8 :: FixedPrim Word8
- int16BE :: FixedPrim Int16
- int32BE :: FixedPrim Int32
- int64BE :: FixedPrim Int64
- word16BE :: FixedPrim Word16
- word32BE :: FixedPrim Word32
- word64BE :: FixedPrim Word64
- floatBE :: FixedPrim Float
- doubleBE :: FixedPrim Double
- int16LE :: FixedPrim Int16
- int32LE :: FixedPrim Int32
- int64LE :: FixedPrim Int64
- word16LE :: FixedPrim Word16
- word32LE :: FixedPrim Word32
- word64LE :: FixedPrim Word64
- floatLE :: FixedPrim Float
- doubleLE :: FixedPrim Double
- intHost :: FixedPrim Int
- int16Host :: FixedPrim Int16
- int32Host :: FixedPrim Int32
- int64Host :: FixedPrim Int64
- wordHost :: FixedPrim Word
- word16Host :: FixedPrim Word16
- word32Host :: FixedPrim Word32
- word64Host :: FixedPrim Word64
- floatHost :: FixedPrim Float
- doubleHost :: FixedPrim Double
- char7 :: FixedPrim Char
- int8Dec :: BoundedPrim Int8
- int16Dec :: BoundedPrim Int16
- int32Dec :: BoundedPrim Int32
- int64Dec :: BoundedPrim Int64
- intDec :: BoundedPrim Int
- word8Dec :: BoundedPrim Word8
- word16Dec :: BoundedPrim Word16
- word32Dec :: BoundedPrim Word32
- word64Dec :: BoundedPrim Word64
- wordDec :: BoundedPrim Word
- word8Hex :: BoundedPrim Word8
- word16Hex :: BoundedPrim Word16
- word32Hex :: BoundedPrim Word32
- word64Hex :: BoundedPrim Word64
- wordHex :: BoundedPrim Word
- int8HexFixed :: FixedPrim Int8
- int16HexFixed :: FixedPrim Int16
- int32HexFixed :: FixedPrim Int32
- int64HexFixed :: FixedPrim Int64
- word8HexFixed :: FixedPrim Word8
- word16HexFixed :: FixedPrim Word16
- word32HexFixed :: FixedPrim Word32
- word64HexFixed :: FixedPrim Word64
- floatHexFixed :: FixedPrim Float
- doubleHexFixed :: FixedPrim Double
- char8 :: FixedPrim Char
- charUtf8 :: BoundedPrim Char
- cstring :: Addr# -> Builder
- cstringUtf8 :: Addr# -> Builder

# Bounded-size primitives

data BoundedPrim a Source #

A builder primitive that always results in sequence of bytes that is no longer than a pre-determined bound.

## Combinators

The combinators for `BoundedPrim`

s are implemented such that the
size of the resulting `BoundedPrim`

can be computed at compile time.

emptyB :: BoundedPrim a Source #

The `BoundedPrim`

that always results in the zero-length sequence.

(>*<) :: Monoidal f => f a -> f b -> f (a, b) infixr 5 Source #

A pairing/concatenation operator for builder primitives, both bounded and fixed size.

For example,

toLazyByteString (primFixed (char7 >*< char7) ('x','y')) = "xy"

We can combine multiple primitives using `>*<`

multiple times.

toLazyByteString (primFixed (char7 >*< char7 >*< char7) ('x',('y','z'))) = "xyz"

(>$<) :: Contravariant f => (b -> a) -> f a -> f b infixl 4 Source #

A fmap-like operator for builder primitives, both bounded and fixed size.

Builder primitives are contravariant so it's like the normal fmap, but backwards (look at the type). (If it helps to remember, the operator symbol is like ($) but backwards.)

We can use it for example to prepend and/or append fixed values to an primitive.

import Data.ByteString.Builder.Prim as P showEncoding ((\x -> ('\'', (x, '\''))) >$< fixed3) 'x' = "'x'" where fixed3 = P.char7 >*< P.char7 >*< P.char7

Note that the rather verbose syntax for composition stems from the requirement to be able to compute the size / size bound at compile time.

eitherB :: BoundedPrim a -> BoundedPrim b -> BoundedPrim (Either a b) Source #

Encode an `Either`

value using the first `BoundedPrim`

for `Left`

values and the second `BoundedPrim`

for `Right`

values.

Note that the functions `eitherB`

, `pairB`

, and `contramapB`

(written below
using `>$<`

) suffice to construct `BoundedPrim`

s for all non-recursive
algebraic datatypes. For example,

maybeB :: BoundedPrim () -> BoundedPrim a -> BoundedPrim (Maybe a) maybeB nothing just =`maybe`

(Left ()) Right`>$<`

eitherB nothing just

condB :: (a -> Bool) -> BoundedPrim a -> BoundedPrim a -> BoundedPrim a Source #

Conditionally select a `BoundedPrim`

.
For example, we can implement the ASCII primitive that drops characters with
Unicode codepoints above 127 as follows.

charASCIIDrop =`condB`

(< '\128') (`liftFixedToBounded`

`char7`

)`emptyB`

## Builder construction

primBounded :: BoundedPrim a -> a -> Builder Source #

Create a `Builder`

that encodes values with the given `BoundedPrim`

.

We rewrite consecutive uses of `primBounded`

such that the bound-checks are
fused. For example,

primBounded (word32 c1) `mappend` primBounded (word32 c2)

is rewritten such that the resulting `Builder`

checks only once, if ther are
at 8 free bytes, instead of checking twice, if there are 4 free bytes. This
optimization is not observationally equivalent in a strict sense, as it
influences the boundaries of the generated chunks. However, for a user of
this library it is observationally equivalent, as chunk boundaries of a lazy
`ByteString`

can only be observed through the internal interface.
Moreover, we expect that all primitives write much fewer than 4kb (the
default short buffer size). Hence, it is safe to ignore the additional
memory spilled due to the more aggressive buffer wrapping introduced by this
optimization.

primMapListBounded :: BoundedPrim a -> [a] -> Builder Source #

Create a `Builder`

that encodes a list of values consecutively using a
`BoundedPrim`

for each element. This function is more efficient than

mconcat . map (primBounded w)

or

foldMap (primBounded w)

because it moves several variables out of the inner loop.

primUnfoldrBounded :: BoundedPrim b -> (a -> Maybe (b, a)) -> a -> Builder Source #

Create a `Builder`

that encodes a sequence generated from a seed value
using a `BoundedPrim`

for each sequence element.

primMapByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder Source #

Create a `Builder`

that encodes each `Word8`

of a strict `ByteString`

using a `BoundedPrim`

. For example, we can write a `Builder`

that filters
a strict `ByteString`

as follows.

import qualified Data.ByteString.Builder.Prim as P

filterBS p = P.condB p (P.liftFixedToBounded P.word8) P.emptyB

primMapLazyByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder Source #

Chunk-wise application of `primMapByteStringBounded`

.

# Fixed-size primitives

A builder primitive that always results in a sequence of bytes of a pre-determined, fixed size.

## Combinators

The combinators for `FixedPrim`

s are implemented such that the
`size`

of the resulting `FixedPrim`

is computed at compile time.

The `(>*<)`

and `(>$<)`

pairing and mapping operators can be used
with `FixedPrim`

.

liftFixedToBounded :: FixedPrim a -> BoundedPrim a Source #

Lift a `FixedPrim`

to a `BoundedPrim`

.

## Builder construction

In terms of expressivity, the function `fixedPrim`

would be sufficient
for constructing `Builder`

s from `FixedPrim`

s. The fused variants of
this function are provided because they allow for more efficient
implementations. Our compilers are just not smart enough yet; and for some
of the employed optimizations (see the code of `primMapByteStringFixed`

)
they will very likely never be.

Note that functions marked with "*Heavy inlining.*" are forced to be
inlined because they must be specialized for concrete encodings,
but are rather heavy in terms of code size. We recommend to define a
top-level function for every concrete instantiation of such a function in
order to share its code. A typical example is the function
`byteStringHex`

from Data.ByteString.Builder.ASCII,
which is implemented as follows.

byteStringHex :: S.ByteString -> Builder byteStringHex =`primMapByteStringFixed`

`word8HexFixed`

primMapListFixed :: FixedPrim a -> [a] -> Builder Source #

Encode a list of values from left-to-right with a `FixedPrim`

.

primMapByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder Source #

*Heavy inlining.* Encode all bytes of a strict `ByteString`

from
left-to-right with a `FixedPrim`

. This function is quite versatile. For
example, we can use it to construct a `Builder`

that maps every byte before
copying it to the buffer to be filled.

mapToBuilder :: (Word8 -> Word8) -> S.ByteString -> Builder mapToBuilder f = primMapByteStringFixed (contramapF f word8)

We can also use it to hex-encode a strict `ByteString`

as shown by the
`byteStringHex`

example above.

primMapLazyByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder Source #

*Heavy inlining.* Encode all bytes of a lazy `ByteString`

from
left-to-right with a `FixedPrim`

.

# Standard encodings of Haskell values

intHost :: FixedPrim Int Source #

Encode a single native machine `Int`

. The `Int`

s is encoded in host order,
host endian form, for the machine you are on. On a 64 bit machine the `Int`

is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way
are not portable to different endian or integer sized machines, without
conversion.

wordHost :: FixedPrim Word Source #

Encode a single native machine `Word`

. The `Word`

s is encoded in host order,
host endian form, for the machine you are on. On a 64 bit machine the `Word`

is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way
are not portable to different endian or word sized machines, without
conversion.

floatHost :: FixedPrim Float Source #

Encode a `Float`

in native host order and host endianness. Values written
this way are not portable to different endian machines, without conversion.

## Character encodings

### ISO/IEC 8859-1 (Char8)

The ISO/IEC 8859-1 encoding is an 8-bit encoding often known as Latin-1.
The *Char8* encoding implemented here works by truncating the Unicode
codepoint to 8-bits and encoding them as a single byte. For the codepoints
0-255 this corresponds to the ISO/IEC 8859-1 encoding. Note that the
Char8 encoding is equivalent to the ASCII encoding on the Unicode
codepoints 0-127. Hence, functions such as `intDec`

can also be used for
encoding `Int`

s as a decimal number with Char8 encoded characters.

### UTF-8

The UTF-8 encoding can encode all Unicode codepoints.
It is equivalent to the ASCII encoding on the Unicode codepoints 0-127.
Hence, functions such as `intDec`

can also be used for encoding `Int`

s as
a decimal number with UTF-8 encoded characters.