Package org.apache.jute
Hadoop record I/O contains classes and a record description language
translator for simplifying serialization and deserialization of records in a
language-neutral manner.
Introduction
Software systems of any significant complexity require mechanisms for data interchange with the outside world. These interchanges typically involve the marshaling and unmarshaling of logical units of data to and from data streams (files, network connections, memory buffers etc.). Applications usually have some code for serializing and deserializing the data types that they manipulate embedded in them. The work of serialization has several features that make automatic code generation for it worthwhile. Given a particular output encoding (binary, XML, etc.), serialization of primitive types and simple compositions of primitives (structs, vectors etc.) is a very mechanical task. Manually written serialization code can be susceptible to bugs, especially when records have a large number of fields or a record definition changes between software versions. Lastly, it can be very useful for applications written in different programming languages to be able to share and interchange data. This can be made a lot easier by describing the data records manipulated by these applications in a language-agnostic manner and using the descriptions to derive implementations of serialization in multiple target languages. This document describes Hadoop Record I/O, a mechanism that is aimed at:
- enabling the specification of simple serializable data types (records)
- enabling the generation of code in multiple target languages for marshaling and unmarshaling such types
- providing target language specific support that will enable application programmers to incorporate generated code into their applications
Goals
- Support for commonly used primitive types. Hadoop should include as primitives the commonly used built-in types of the programming languages we intend to support.
- Support for common data compositions (including recursive compositions). Hadoop should support widely used composite types such as structs and vectors.
- Code generation in multiple target languages. Hadoop should be capable of generating serialization code in multiple target languages and should be easily extensible to new target languages. The initial target languages are C++ and Java.
- Support for generated target languages. Hadoop should include support in the form of headers, libraries, or packages for supported target languages that enable easy inclusion and use of generated code in applications.
- Support for multiple output encodings. Candidates include packed binary, comma-separated text, XML etc.
- Support for specifying record types in a backwards/forwards compatible manner. This will probably be in the form of support for optional fields in records. This version of the document does not include a description of the planned mechanism; we intend to include it in the next iteration.
Non-Goals
- Serializing existing arbitrary C++ classes.
- Serializing complex data structures such as trees, linked lists etc.
- Built-in indexing schemes, compression, or check-sums.
- Dynamic construction of objects from an XML schema.
Data Types and Streams
This section describes the primitive and composite types supported by Hadoop. We aim to support a set of types that can be used to simply and efficiently express a wide range of record types in different programming languages.
Primitive Types
For the most part, the primitive types of Hadoop map directly to primitive types in high-level programming languages. Special cases are the ustring (a Unicode string) and buffer types, which we believe find wide use and which are usually implemented in library code and not available as language built-ins. Hadoop also supplies these via library code when a target language built-in is not present and there is no widely adopted "standard" implementation. The complete list of primitive types is:
- byte: An 8-bit unsigned integer.
- boolean: A boolean value.
- int: A 32-bit signed integer.
- long: A 64-bit signed integer.
- float: A single precision floating point number as described by IEEE-754.
- double: A double precision floating point number as described by IEEE-754.
- ustring: A string consisting of Unicode characters.
- buffer: An arbitrary sequence of bytes.
Composite Types
Hadoop supports a small set of composite types that enable the description of simple aggregate types and containers. A composite type is serialized by sequentially serializing its constituent elements. The supported composite types are:
- record: An aggregate type like a C-struct. This is a list of typed fields that are together considered a single unit of data. A record is serialized by sequentially serializing its constituent fields. In addition to serialization, a record has comparison operations (equality and less-than) implemented for it; these are defined as memberwise comparisons.
- vector: A sequence of entries of the same data type, primitive or composite.
- map: An associative container mapping instances of a key type to instances of a value type. The key and value types may themselves be primitive or composite types. An example using both containers follows this list.
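As an illustration (the DDL syntax itself is defined later in this document), a hypothetical record using both containers could be declared as:

module example {
    class Page {
        ustring URL;
        vector<ustring> outLinks;
        map<ustring, int> termCounts;
    };
}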
Streams
Hadoop generates code for serializing and deserializing record types to abstract streams. For each target language Hadoop defines very simple input and output stream interfaces. Application writers can usually develop concrete implementations of these by putting a one-method wrapper around an existing stream implementation.
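For instance, using the C++ InStream interface declared later in this document, such a wrapper around a C stdio stream might look like the following sketch (the FileInStream class is illustrative, not part of the library):

#include <sys/types.h>
#include <cstdio>
#include "recordio.hh"   // declares hadoop::InStream

// Illustrative one-method wrapper exposing a FILE* as a hadoop::InStream.
class FileInStream : public hadoop::InStream {
public:
  explicit FileInStream(FILE* fp) : mFp(fp) {}
  virtual ssize_t read(void* buf, size_t n) {
    size_t got = fread(buf, 1, n, mFp);
    // Follow the blocking-read contract: -1 on error, bytes read otherwise.
    if (got == 0 && ferror(mFp)) return -1;
    return static_cast<ssize_t>(got);
  }
private:
  FILE* mFp;
};

An analogous wrapper implementing hadoop::OutStream would forward to fwrite.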
DDL Syntax and Examples
We now describe the syntax of the Hadoop data description language. This is followed by a few examples of DDL usage.
Hadoop DDL Syntax
recfile = *include module *record
include = "include" path
path = (relative-path / absolute-path)
module = "module" module-name
module-name = name *("." name)
record = "class" name "{" 1*(field) "}"
field = type name ";"
name = ALPHA *(ALPHA / DIGIT / "_")
type = (ptype / ctype)
ptype = ("byte" / "boolean" / "int" /
         "long" / "float" / "double" /
         "ustring" / "buffer")
ctype = ("vector" "<" type ">") /
        ("map" "<" type "," type ">") /
        name
A DDL file describes one or more record types. It begins with zero or
more include declarations, followed by a single mandatory module
declaration, followed by zero or more class declarations. The semantics
of each of these declarations are described below:
- include: An include declaration specifies a DDL file to be referenced when generating code for types in the current DDL file. Record types in the current compilation unit may refer to types in all included files. File inclusion is recursive. An include does not trigger code generation for the referenced file.
- module: Every Hadoop DDL file must have a single module declaration that follows the list of includes and precedes all record declarations. A module declaration identifies a scope within which the names of all types in the current file are visible. Module names are mapped to C++ namespaces, Java packages etc. in generated code.
- class: Record types are specified through class declarations. A class declaration is like a Java class declaration. It specifies a named record type and a list of fields that constitute records of the type. Usage is illustrated in the following examples.
Examples
- A simple DDL file links.jr with just one record declaration:
module links {
    class Link {
        ustring URL;
        boolean isRelative;
        ustring anchorText;
    };
}
- A DDL file outlinks.jr which includes another DDL file:
include "links.jr" module outlinks { class OutLinks { ustring baseURL; vector<links.Link> outLinks; }; }
Code Generation
The Hadoop translator is written in Java. Invocation is done by executing a wrapper shell script named rcc. It takes a list of record description files as mandatory arguments and an optional target-language argument, --language or -l (the default is Java). Thus a typical invocation would look like:
$ rcc -l C++ <filename> ...
Target Language Mappings and Support
For all target languages, the unit of code generation is a record type. For each record type, Hadoop generates code for serialization and deserialization, record comparison and access to record members.
C++
Support for including Hadoop generated C++ code in applications comes in the form of a header file recordio.hh which needs to be included in source that uses Hadoop types and a library librecordio.a which applications need to be linked with. The header declares the Hadoop C++ namespace which defines appropriate types for the various primitives, the basic interfaces for records and streams and enumerates the supported serialization encodings. Declarations of these interfaces and a description of their semantics follow:
namespace hadoop {

  enum RecFormat { kBinary };

  class InStream {
  public:
    virtual ssize_t read(void *buf, size_t n) = 0;
  };

  class OutStream {
  public:
    virtual ssize_t write(const void *buf, size_t n) = 0;
  };

  class IOError : public std::runtime_error {
  public:
    explicit IOError(const std::string& msg);
  };

  class IArchive;
  class OArchive;
  class Record;

  class RecordReader {
  public:
    RecordReader(InStream& in, RecFormat fmt);
    virtual ~RecordReader(void);
    virtual void read(Record& rec);
  };

  class RecordWriter {
  public:
    RecordWriter(OutStream& out, RecFormat fmt);
    virtual ~RecordWriter(void);
    virtual void write(Record& rec);
  };

  class Record {
  public:
    virtual std::string type(void) const = 0;
    virtual std::string signature(void) const = 0;
  protected:
    virtual bool validate(void) const = 0;
    virtual void serialize(OArchive& oa, const std::string& tag) const = 0;
    virtual void deserialize(IArchive& ia, const std::string& tag) = 0;
  };

}
- RecFormat: An enumeration of the serialization encodings supported by this implementation of Hadoop.
- InStream: A simple abstraction for an input stream. This has a single public read method that reads n bytes from the stream into the buffer buf. Has the same semantics as a blocking read system call. Returns the number of bytes read or -1 if an error occurs.
- OutStream: A simple abstraction for an output stream. This has a single write method that writes n bytes to the stream from the buffer buf. Has the same semantics as a blocking write system call. Returns the number of bytes written or -1 if an error occurs.
- RecordReader: A RecordReader reads records one at a time from an underlying stream in a specified record format. The reader is instantiated with a stream and a serialization format. It has a read method that takes an instance of a record and deserializes the record from the stream.
- RecordWriter: A RecordWriter writes records one at a time to an underlying stream in a specified record format. The writer is instantiated with a stream and a serialization format. It has a write method that takes an instance of a record and serializes the record to the stream. A round-trip usage sketch follows this list.
- Record: The base class for all generated record types. This has two public methods type and signature that return the typename and the type signature of the record.
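As an illustration, the following minimal sketch writes a record and reads it back through these interfaces. It assumes FileInStream and FileOutStream wrappers in the style sketched in the Streams section, a links.jr.hh header generated from links.jr above, and the accessor names the generator would plausibly produce for the Link fields:

#include <cstdio>
#include "recordio.hh"
#include "links.jr.hh"   // assumed rcc output for links.jr

void roundTrip(FILE* fin, FILE* fout) {
  links::Link out_rec;
  out_rec.getURL() = "index.html";   // ustring fields expose reference accessors
  out_rec.setIsRelative(true);       // assumed setter for the boolean field

  FileOutStream out(fout);           // assumed OutStream wrapper
  hadoop::RecordWriter writer(out, hadoop::kBinary);
  writer.write(out_rec);             // serialize out_rec to the stream

  FileInStream in(fin);              // assumed InStream wrapper
  hadoop::RecordReader reader(in, hadoop::kBinary);
  links::Link in_rec;
  reader.read(in_rec);               // deserialize into in_rec
}

For each record type in the DDL, the translator generates a class derived from hadoop::Record in a namespace matching the module name. The Link record from links.jr, for example, produces a declaration of the form: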
namespace links {

  class Link : public hadoop::Record {
    // ....
  };

}
Each field within the record will cause the generation of a private member
declaration of the appropriate type in the class declaration, and one or more
accessor methods. The generated class will implement the serialize and
deserialize methods defined in hadoop::Record. It will also
implement the inspection methods type and signature from
hadoop::Record. A default constructor and virtual destructor will also
be generated. Serialization code will read/write records into streams that
implement the hadoop::InStream and the hadoop::OutStream interfaces.
For each member of a record an accessor method is generated that returns
either the member or a reference to the member. For members that are returned
by value, a setter method is also generated. This is true for primitive
data members of the types byte, int, long, boolean, float and
double. For example, for an int field called MyField the following
code is generated.
...
private:
int32_t mMyField;
...
public:
int32_t getMyField(void) const {
return mMyField;
};
void setMyField(int32_t m) {
mMyField = m;
};
...
For a ustring, buffer, or composite field, the generated code contains
only accessors that return a reference to the field. A const
and a non-const accessor are generated. For example:
...
private:
std::string mMyBuf;
...
public:
std::string& getMyBuf() {
return mMyBuf;
};
const std::string& getMyBuf() const {
return mMyBuf;
};
...
Examples
Suppose the inclrec.jr file contains:
module inclrec {
class RI {
int I32;
double D;
ustring S;
};
}
and the testrec.jr file contains:
include "inclrec.jr"
module testrec {
class R {
vector<float> VF;
inclrec.RI Rec;
buffer Buf;
};
}
Then the invocation of rcc such as:
$ rcc -l c++ inclrec.jr testrec.jr
will result in generation of four files:
inclrec.jr.{cc,hh} and testrec.jr.{cc,hh}.
The inclrec.jr.hh file will contain:
#ifndef _INCLREC_JR_HH_
#define _INCLREC_JR_HH_

#include "recordio.hh"

namespace inclrec {

  class RI : public hadoop::Record {
  private:
    int32_t     mI32;
    double      mD;
    std::string mS;

  public:
    RI(void);
    virtual ~RI(void);
    virtual bool operator==(const RI& peer) const;
    virtual bool operator<(const RI& peer) const;
    virtual int32_t getI32(void) const { return mI32; }
    virtual void setI32(int32_t v) { mI32 = v; }
    virtual double getD(void) const { return mD; }
    virtual void setD(double v) { mD = v; }
    virtual std::string& getS(void) { return mS; }
    virtual const std::string& getS(void) const { return mS; }
    virtual std::string type(void) const;
    virtual std::string signature(void) const;

  protected:
    virtual void serialize(hadoop::OArchive& a, const std::string& tag) const;
    virtual void deserialize(hadoop::IArchive& a, const std::string& tag);
    virtual bool validate(void) const;
  };

} // end namespace inclrec

#endif /* _INCLREC_JR_HH_ */
The testrec.jr.hh file will contain:
#ifndef _TESTREC_JR_HH_
#define _TESTREC_JR_HH_

#include "inclrec.jr.hh"

namespace testrec {

  class R : public hadoop::Record {
  private:
    std::vector<float> mVF;
    inclrec::RI        mRec;
    std::string        mBuf;

  public:
    R(void);
    virtual ~R(void);
    virtual bool operator==(const R& peer) const;
    virtual bool operator<(const R& peer) const;
    virtual std::vector<float>& getVF(void);
    virtual const std::vector<float>& getVF(void) const;
    virtual std::string& getBuf(void);
    virtual const std::string& getBuf(void) const;
    virtual inclrec::RI& getRec(void);
    virtual const inclrec::RI& getRec(void) const;
    virtual std::string type(void) const;
    virtual std::string signature(void) const;

  protected:
    virtual void serialize(hadoop::OArchive& a, const std::string& tag) const;
    virtual void deserialize(hadoop::IArchive& a, const std::string& tag);
  };

} // end namespace testrec

#endif /* _TESTREC_JR_HH_ */
Java
Code generation for Java is similar to that for C++. A Java class is generated for each record type, with private members corresponding to the fields. Getters and setters for the fields are also generated. Some differences arise in the way comparison is expressed and in the mapping of modules to packages and classes to files. For equality testing, an equals method is generated for each record type. As per Java requirements, a hashCode method is also generated. For comparison, a compareTo method is generated for each record type. This has the semantics defined by the Java Comparable interface; that is, the method returns a negative integer, zero, or a positive integer as the invoked object is less than, equal to, or greater than the comparison parameter. A .java file is generated per record type, as opposed to per DDL file as in C++. The module declaration translates to a Java package declaration, and the module name maps to an identical Java package name. In addition to this mapping, the DDL compiler creates the appropriate directory hierarchy for the package and places the generated .java files in the correct directories.
Mapping Summary

DDL Type         C++ Type             Java Type
boolean          bool                 boolean
byte             int8_t               byte
int              int32_t              int
long             int64_t              long
float            float                float
double           double               double
ustring          std::string          Text
buffer           std::string          java.io.ByteArrayOutputStream
class type       class type           class type
vector<type>     std::vector<type>    java.util.ArrayList
map<type,type>   std::map<type,type>  java.util.TreeMap
Data Encodings
This section describes the format of the data encodings supported by Hadoop. Currently, one data encoding is supported, namely binary.
Binary Serialization Format
The binary data encoding format is fairly dense. Serialization of composite types is simply defined as a concatenation of the serializations of their constituent elements (lengths are included in vectors and maps). The types are serialized as follows:
- class: Sequence of serialized members.
- vector: The number of elements serialized as an int. Followed by a sequence of serialized elements.
- map: The number of key value pairs serialized as an int. Followed by a sequence of serialized (key,value) pairs.
- byte: Represented by 1 byte, as is.
- boolean: Represented by 1 byte (0 or 1).
- int/long: Integers and longs are serialized zero-compressed. Represented as 1 byte if -120 <= value < 128. Otherwise, serialized as a sequence of 2-5 bytes for ints, 2-9 bytes for longs. The first byte represents the number of trailing bytes, N, as the negative number -(120+N). For example, the number 1024 (0x400) is represented by the byte sequence 0x86 0x04 0x00. This doesn't help much for 4-byte integers but does a reasonably good job with longs without bit twiddling. A sketch of this encoding follows this list.
- float/double: Serialized in IEEE 754 single and double precision format in network byte order. This is the format used by Java.
- ustring: Serialized as 4-byte zero compressed length followed by data encoded as UTF-8. Strings are normalized to UTF-8 regardless of native language representation.
- buffer: Serialized as a 4-byte zero compressed length followed by the raw bytes in the buffer.
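As an illustration, here is a minimal sketch of the zero-compressed integer encoding and the ustring format described above, written against the hadoop::OutStream interface from the C++ section. The function names are illustrative, and values below -120 are not handled because their encoding is not specified here:

#include <cstdint>
#include <string>
#include "recordio.hh"   // declares hadoop::OutStream

// Zero-compressed serialization per the rule above: one byte for
// -120 <= v < 128, otherwise a marker byte -(120+N) followed by the
// N significant bytes of the value, most significant first.
void writeZeroCompressed(hadoop::OutStream& out, int64_t v) {
  if (v >= -120 && v < 128) {
    char b = static_cast<char>(v);
    out.write(&b, 1);
    return;
  }
  uint64_t u = static_cast<uint64_t>(v);
  int n = 1;
  while (n < 8 && (u >> (8 * n)) != 0) n++;   // count significant bytes
  char buf[9];
  buf[0] = static_cast<char>(-120 - n);       // marker byte
  for (int i = 0; i < n; i++)
    buf[1 + i] = static_cast<char>((u >> (8 * (n - 1 - i))) & 0xff);
  out.write(buf, n + 1);
}

// A ustring is its zero-compressed length followed by the UTF-8 data
// (the string is assumed to already hold UTF-8 bytes).
void writeUString(hadoop::OutStream& out, const std::string& s) {
  writeZeroCompressed(out, static_cast<int64_t>(s.size()));
  out.write(s.data(), s.size());
}

Calling writeZeroCompressed(out, 1024) with this sketch emits the bytes 0x86 0x04 0x00, matching the example above.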
Interface Summary

Index            Interface that acts as an iterator for deserializing maps.
InputArchive     Interface that all the deserializers have to implement.
OutputArchive    Interface that all the serializers have to implement.
Record           Interface that is implemented by generated classes.

Class Summary

BinaryInputArchive
BinaryOutputArchive
RecordReader     Front-end interface to deserializers.
RecordWriter     Front-end for serializers.
ToStringOutputArchive
Utils            Various utility functions for Hadoop record I/O runtime.