STXXL  1.4-dev
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
A Billing System for Phone Calls (stxxl::vector and stxxl::sort)
Author
Roman Dementiev (2006)

The intended audience of this tutorial are developers or researchers who develop applications or implement algorithms processing large data sets which do not fit into the main memory of a computer. They must have basic knowledge in the theory of external memory computing and have working knowledge of C++ and an experience with programming using STL. Familiarity with key concepts of generic programming and C++ template mechanism is assumed.

Let us start with a toy but pretty relevant problem: the phone call billing problem. You are given a sequence of event records. Each record has a time stamp (time when the event had happened), type of event ('call begin' or 'call end'), the callers number, and the destination number. The event sequence is time-ordered. Your task is to generate a bill for each subscriber that includes cost of all her calls. The solution is uncomplicated: sort the records by the callers number. Since the sort brings all records of a subscriber together, we scan the sorted result computing and summing up the costs of all calls of a particular subscriber. The phone companies record up to 300 million transactions per day. AT&T billing system Gecko [33] has to process databases with about 60 billion records, occupying 2.6 terabytes. Certainly this volume can not be sorted in the main memory of a single computer (Except may be in the main memory of an expensive supercomputer.) Therefore we need to sort those huge data sets out-of-memory. Now we show how STXXL can be useful here, since it can handle large volumes I/O efficiently.

STL Code

If you are familiar with STL your the main function of bill generation program will probably look like this:

int main(int argc, char * argv[])
{
if(argc < 4) // check if all parameters are given
{ // in the command line
print_usage(argv[0]);
return 0;
}
// open file with the event log
std::fstream in(argv[1], std::ios::in);
// create a vector of log entries to read in
std::vector<LogEntry> v;
// read the input file and push the records
// into the vector
std::copy(std::istream_iterator<LogEntry>(in),
std::istream_iterator<LogEntry>(),
std::back_inserter(v));
// sort records by callers number
std::sort(v.begin(), v.end(), SortByCaller());
// open bill file for output
std::fstream out(argv[3], std::ios::out);
// scan the vector and output bills
std::for_each(v.begin(), v.end(), ProduceBill(out));
return 0;
}

To complete the code we need to define the log entry data type LogEntry, input operator >> for LogEntry, comparison functor SortByCaller, unary functor ProduceBills used for computing bills, and the print_usage function.

#include <stxxl.h> // STXXL header
#include <algorithm> // for STL std::sort
#include <vector> // for STL std::vector
#include <fstream> // for std::fstream
#include <ctime> // for time_t type
#define CT_PER_MIN 2 // subscribers pay 2 cent per minute
struct LogEntry // the event log data structure
{
long long int from; // callers number (64 bit integer)
long long int to; // destination number (64 bit int)
time_t timestamp; // time of event
int event; // event type 1 - call started
// 2 - call ended
};
// input operator used for reading from the file
std::istream& operator >> (std::istream& i, LogEntry& entry)
{
i >> entry.from;
i >> entry.to;
i >> entry.timestamp;
i >> entry.event;
return i;
}
// output operator used for writing to file
std::ostream& operator << (std::ostream& i, const LogEntry& entry)
{
i << entry.from << " ";
i << entry.to << " ";
i << entry.timestamp << " ";
i << entry.event;
return i;
}
// unary function used for producing the bills
struct ProduceBill
{
std::ostream& out; // stream for outputting
// the bills
unsigned sum; // current subscribers debit
LogEntry last; // the last record
ProduceBill(std::ostream& o_) : out(o_), sum(0) { last.from = -1; }
void operator () (const LogEntry& e)
{
if (last.from == e.from)
{
// either the last event was 'call started' and current event
// is 'call ended' or the last event was 'call ended' and
// current event is 'call started'
assert((last.event == 1 && e.event == 2) ||
(last.event == 2 && e.event == 1));
if (e.event == 2) // call ended
sum += CT_PER_MIN * (unsigned int)(e.timestamp - last.timestamp) / 60;
}
else if (last.from != -1)
{
assert(last.event == 2); // must be 'call ended'
assert(e.event == 1); // must be 'call started'
// output the total sum
out << last.from << "; " << (sum / 100) << " EUR " << (sum % 100) <<
" ct" << std::endl;
sum = 0; // reset the sum
}
last = e;
}
};
struct SortByCaller
{
// comparison function
bool operator () (const LogEntry& a, const LogEntry& b) const
{
return a.from < b.from ||
(a.from == b.from && a.timestamp < b.timestamp) ||
(a.from == b.from && a.timestamp == b.timestamp &&
a.event < b.event);
}
// min sentinel = value which is strictly smaller that any input element
static LogEntry min_value()
{
LogEntry dummy;
dummy.timestamp = std::numeric_limits<time_t>::min();
return dummy;
}
// max sentinel = value which is strictly larger that any input element
static LogEntry max_value()
{
LogEntry dummy;
dummy.timestamp = std::numeric_limits<time_t>::max();
return dummy;
}
};
void print_usage(const char* program)
{
std::cout << "Usage: " << program << " logfile main billfile" << std::endl;
std::cout << " logfile - file name of the input" << std::endl;
std::cout << " main - memory to use (in MiB)" << std::endl;
std::cout << " billfile - file name of the output" << std::endl;
}

Going Large – Use STXXL

In order to make the program I/O efficient we will replace the STL internal memory data structures and algorithms by their STXXL counterparts. The changes are marked with //!

#include <stxxl.h> //! include STXXL headers
// the rest of the code remains the same
int main(int argc, char * argv[])
{
if(argc < 4) // check if all parameters are given
{ // in the command line
print_usage(argv[0]);
return 0;
}
// open file with the event log
std::fstream in(argv[1], std::ios::in);
// create a vector of log entries to read in
stxxl::vector<LogEntry> v; //! use stxxl::vector instead of std::vector
// read the input file and push the records
// into the vector
std::copy(std::istream_iterator<LogEntry>(in),
std::istream_iterator<LogEntry>(),
std::back_inserter(v));
// bound the main memory consumption by M
// during sorting
const unsigned M = atol(argv[2])*1024*1024; //! calculated memory limit M
// sort records by callers number
stxxl::sort(v.begin(), v.end(), SortByCaller(), M); //! use stxxl::sort instead of std::sort
// open bill file for output
std::fstream out(argv[3], std::ios::out);
// scan the vector and output bills
// the last parameter tells how many buffers
// to use for overlapping I/O and computation
stxxl::for_each(v.begin(), v.end(), ProduceBill(out), 2); //! use stxxl::for_each instead of std::for_each
return 0;
}

As you note the changes are minimal. Only the namespaces and some memory specific parameters had to be changed.

See examples/algo/phonebills.cpp for the full source code. The example program is automatically compiled when building STXXL, refer to Compilation and Configuration on how to build programs with STXXL.

The program examples/algo/phonebills_genlog.cpp can be used to generate logs for processing with the phonebills example.

Do not forget to configure you external memory space in file .stxxl. See config.