STXXL
1.4-dev
|
This page gives a short introduction into the stream package. First the main abstractions are discussed and then some examples on how to utilize the existing algorithms are developed.
All example code can be found in examples/stream/stream1.cpp
In Generating Random Graphs using Streams another example is given, where an existing algorithm is "pipelined".
The stream package is built around the abstract notion of an object being able to produce a sequence of output values. Only three simple operations are necessary:
*
operator++
operatorempty()
functionThe most common place object that fits easily into this abstraction is the random generator. Actually, a random generator only requires two operations: it can be queried for its current value and be instructed to calculate/advance to new value. Of course the random sequence should be unbounded, so an empty()
function would always be false. Nevertheless, this common-place example illustrates the purpose of the stream interface pretty well.
All stream objects must support the three operations above, they form the stream algorithm concept. In C++ a class conforms to this concept if it implements the following interface:
A very simple stream object that produces the sequence 1,2,3,4,....,1000 is shown in the following snippet:
After this verbose interface definition, the actual iteration over a stream object can be done as follows:
For those who like to shorten everything into fewer lines, the above can also be expressed as a for loop:
Both loops will print the following output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 [...] 995 996 997 998 999 1000
The stream interface is so very useful for external memory algorithms because it represents the concept of sequential access to a stream of individual values. While the simple example above only works with integers, the value_type
of streams will more often contain complex tuple structs with multiple components.
A stream algorithm can then be constructed from multiple stream objects that pass data from one to another. This notion of "plugging together" stream objects is used in the following example to calculate the square of each value of an integer sequence:
For a beginner in stream object programming, the squaring example contains multiple unexpected, verbose complications.
InputStream
template parameter. As yet, in C++ we cannot syntactically define which concepts the template parameters must fulfill, in this case one would require InputStream
to implement the stream interface.InputStream
object is accepted by the constructor and a reference is saved into m_input_stream
.operator*()
must return a const-reference, so the square must actually be stored in a variable after it is calculated. Now note that the squaring operation in this version is implemented at two places: in the constructor and the operator++()
. operator++()
. A real implementation would probably combine the calculation code into a process()
function and also do additional allocation work in the constructor.An instance of the counter_object
can be plugged into a squaring_object
as done in the following example:
The example outputs:
1 4 9 16 25 36 49 64 81 100 121 144 169 [...] 986049 988036 990025 992016 994009 996004 998001 1000000
The above examples are pure C++ interface manipulations and do not even require STXXL. However, when writing stream algorithms you can take advantage of the utilities provided by the stream package to create complex algorithms. Probably the most useful is the pair of sorting classes, which will be discussed after a few preliminaries.
More complex algorithms will most often use tuples as values passed from one stream to another. These tuples wrap all information fields of a specific piece of data. Simple tuples can be created using std::pair
, tuples with larger number of components can use Boost.Tuple or just plain structs with multiple fields. (In the tuple case, the temporary value inside the stream struct can mostly be avoided.)
The stream package contains utilities to plug stream classes together to form complex algorithms. The following few examples are very basic algorithms:
Very often the input to a sequence of stream classes comes from an array or other container. In this case one requires an input stream object, which iterates through the container and outputs each element once. STXXL provides iterator2stream for this common purpose:
Most important: if the input container is a stxxl::vector, then one should use vector_iterator2stream, because this class will prefetch additional blocks from the vector while processing the stream.
The opposite to iterator2stream is to collect the output of a sequence of stream objects into a container or stxxl::vector. This operation is called materialize
and also comes in the general version and a special version for the STXXL-vector, which uses asynchronous writes.
This example shows how to materialize a stream into a usual STL vector.
And the only modification needed to support larger data sets is to materialize to an STXXL vector:
Maybe the most important set of tools in the stream package is the pairs of sorter classes runs_creator and runs_merger. The general way to sort a sequential input stream is to first consolidate a large number of input items in an internal memory buffer. Then when the buffer is full, it can be sorted in internal memory and subsequently written out to disk. This sorted sequence is then called a run. When the input stream is finished and the sorted output must be produced, theses sorted sequences can efficiently be merged using a tournament tree or similar multi-way comparison structure. (see Parallel Disk Sorting.)
STXXL implements this using two stream classes: runs_creator and runs_merger.
The following examples shows how to sort the integer sequence 1,2,...,1000 first by the right-most decimal digit, then by its absolute value (yes a somewhat constructed example, but it serves its purpose well.) For all sorters a comparator object is required which tells the sorter which of two objects is the smaller one. This is similar to the requirements of the usual STL, however, the STXXL sorters need to additional functions: min_value()
and max_value()
which are used as padding sentinels. These functions return the smallest and highest possible values of the given data type.
All sorters steps require an internal memory buffer. This size can be fixed using a parameter to runs_creator and runs_merger. The following example code instantiates a counter object, plugs this into a runs_creator which is followed by a runs_merger.
The output of the code above is:
10 20 30 40 50 60 70 80 [...] 990 1000 1 11 21 31 41 51 61 [...] 909 919 929 939 949 959 969 979 989 999
Note that in the above example the input of the runs_creator is itself a stream. If however the data is not naturally available as a stream, one can use a variant of runs_creator which accepts input via a push()
function. This is more useful when using an imperative programming style. Note that the runs_merger does not change.
And as the last example in this tutorial we show how to use stxxl::sorter, which combines runs_creator and runs_merger into one object. The sorter has two states: input and output. During input, new elements can be sorted using push()
. Then to switch to output state, the function sort()
is called, after which the sorter can be queried using the usual stream interface.
All three examples have the same output.