Generate code to compare equality of a given object (objVar) against key column variables.
Generate code to calculate the hash code for given column variables that correspond to the key columns in this class.
Generate code to look up the map, or insert a new key and value if not found.
Generate code to update the fields of a class object with the given resultVars. If accessors for fields have been generated (using getColumnVars), then those can be passed for faster reads where required.
the variable holding reference to the class object
accessors for object fields, if available
result values to be assigned to object fields
if true, update key fields; otherwise update value fields
if true, a copy of reference values is assigned; otherwise only the reference is copied
if true, this is for initialization of fields after object creation, so some checks can be skipped
code to assign objVar fields to given resultVars
get the generated class name
Get the ExprCode for the key and/or value columns given a class object variable. This also returns initialization code that should be inserted first in the generated code. The last element of the result tuple holds the names of the null mask variables.
Provides helper methods for generated code to use ObjectHashSet with a generated class (having key and value columns as fields of the corresponding Java types). This implementation avoids the entire overhead of UnsafeRow conversion for both the key type (as in BytesToBytesMap) and the value type (as in BytesToBytesMap and VectorizedHashMapGenerator).
It has been carefully optimized to minimize memory reads/writes, with minimal code to fit better in the CPU instruction cache. Unlike the other two maps used by HashAggregateExec, this has no limitations on the key or value column types.
The basic idea is that all key and value columns are individual fields in a generated Java class, each with the corresponding Java type. Storing a column value in the map is a simple assignment of the incoming variable to the corresponding field of the class object, and access is likewise a read of that field. Nullability information is packed into long bit-mask fields, with as many generated as required (avoiding the unnecessary overhead of something like a BitSet).
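As a rough illustration of the layout described above, the following hand-written Java sketch shows what such a class might look like for a hypothetical int key column `id` and double value column `total`. The class name, field names, setter shapes, and bit assignments are assumptions made for this example; the actual generated code will differ.

```java
// Hypothetical hand-written stand-in for a generated entry class.
final class EntrySketch {
    // Assumed layout: one bit per nullable column in a single long mask.
    static final long ID_NULL = 1L << 0;
    static final long TOTAL_NULL = 1L << 1;

    int id;         // key column stored as a plain Java field
    double total;   // value column stored as a plain Java field
    long nullMask;  // nullability bits for all columns

    // Store an incoming key: one field assignment plus one bit update.
    void setId(int value, boolean isNull) {
        if (isNull) {
            nullMask |= ID_NULL;
        } else {
            id = value;
            nullMask &= ~ID_NULL;
        }
    }

    // Store an incoming value in the same way.
    void setTotal(double value, boolean isNull) {
        if (isNull) {
            nullMask |= TOTAL_NULL;
        } else {
            total = value;
            nullMask &= ~TOTAL_NULL;
        }
    }

    boolean isIdNull() { return (nullMask & ID_NULL) != 0; }

    boolean isTotalNull() { return (nullMask & TOTAL_NULL) != 0; }
}
```

With more than 64 nullable columns, a second long mask field would be needed, which is what generating as many bit-mask fields as required amounts to.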
Hashcode and equals methods are generated for the key column fields. Having both key and value fields in the same class object cuts down on generated code, improves cache locality, and saves at least one memory access per row. In testing, this alone has been shown to improve performance by ~25% in simple group by queries. Furthermore, this class also provides inline hashcode and equals methods so that incoming register variables in generated code can be used directly (instead of being stuffed into a lookup key whose fields would then be read again). The class hashcode method is meant to be used only internally by rehashing, and it simply returns a field cached in the class object that is filled in during the initial insert (from the inline hashcode).
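The inline hashcode/equals idea can be sketched with a small hand-written map. This is only an illustration under stated assumptions: the names `AggEntry`, `InlineLookupSketch`, `hashOf`, and `lookupOrInsert` are hypothetical, the single long key and the hash mixing are chosen for brevity, and linear probing is used here for simplicity rather than because ObjectHashSet is known to use it.

```java
// Hypothetical entry: key and value in one object, plus the cached hash.
final class AggEntry {
    final long key;
    final int hash;  // filled in once at insert time, reused when rehashing
    long sum;        // value column

    AggEntry(long key, int hash) {
        this.key = key;
        this.hash = hash;
    }

    @Override
    public int hashCode() { return hash; }  // just returns the cached field
}

final class InlineLookupSketch {
    private AggEntry[] table = new AggEntry[8];  // capacity is a power of two
    private int size;

    // Inline hash computed directly from the incoming key variable.
    static int hashOf(long key) {
        int h = (int) (key ^ (key >>> 32));
        return h ^ (h >>> 16);
    }

    // Return the entry for the key, inserting a new one if it is absent.
    AggEntry lookupOrInsert(long key) {
        int hash = hashOf(key);
        int mask = table.length - 1;
        int pos = hash & mask;
        AggEntry e;
        while ((e = table[pos]) != null) {
            // Inline equality: the local variable is compared with the
            // entry field, without building a separate lookup key object.
            if (e.hash == hash && e.key == key) return e;
            pos = (pos + 1) & mask;  // linear probing (an assumption)
        }
        e = new AggEntry(key, hash);
        table[pos] = e;
        if (++size * 2 > table.length) rehash();  // keep load factor <= 0.5
        return e;
    }

    // Rehashing never recomputes key hashes; it reads the cached field.
    private void rehash() {
        AggEntry[] old = table;
        AggEntry[] next = new AggEntry[old.length * 2];
        int mask = next.length - 1;
        for (AggEntry e : old) {
            if (e == null) continue;
            int pos = e.hashCode() & mask;
            while (next[pos] != null) pos = (pos + 1) & mask;
            next[pos] = e;
        }
        table = next;
    }

    int size() { return size; }
}
```

A group by sum over such a map reduces to `map.lookupOrInsert(key).sum += value;` per row, so the value update touches the same object that the key comparison just read.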
For memory management, this uses a simple approach: start with an estimated size, then improve that estimate for future use during a rehash, which also collects the actual size of the current entries. If the rehash reports that no memory is available, it falls back to dumping the current map into MemoryManager and creating a new one, with the merge done by an external sorter in a manner similar to how UnsafeFixedWidthAggregationMap handles the situation. The caller can instead decide to dump the entire map in that scenario, for example when using it for a HashJoin.
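The estimate-then-correct accounting can be sketched roughly as follows. Everything here is hypothetical: the class `MemoryEstimateSketch`, the `onRehash` method, and the `LongPredicate` standing in for a memory manager are inventions for illustration and do not reflect the actual MemoryManager API or the real accounting logic.

```java
import java.util.function.LongPredicate;

// Hypothetical sketch of size estimation refined at rehash time.
final class MemoryEstimateSketch {
    private long estimatedEntrySize;         // current per-entry estimate in bytes
    private long reservedBytes;              // bytes accounted for so far
    private final LongPredicate tryReserve;  // stand-in for a memory manager

    MemoryEstimateSketch(long initialEstimate, LongPredicate tryReserve) {
        this.estimatedEntrySize = initialEstimate;
        this.tryReserve = tryReserve;
    }

    // Called during a rehash. actualBytes is the measured size of the
    // current entries, collected while they are being moved. Returns false
    // when the extra memory cannot be reserved, in which case the caller
    // would spill or dump the map.
    boolean onRehash(int numEntries, long actualBytes, int newCapacity) {
        if (numEntries > 0) {
            // Replace the initial guess with the observed average.
            estimatedEntrySize = Math.max(1L, actualBytes / numEntries);
        }
        long needed = estimatedEntrySize * newCapacity;
        long extra = needed - reservedBytes;
        if (extra > 0) {
            if (!tryReserve.test(extra)) return false;
            reservedBytes = needed;
        }
        return true;
    }

    long estimatedEntrySize() { return estimatedEntrySize; }
}
```

In this sketch a `false` return is the point at which an aggregation would hand the current map over for an external merge, while a hash join caller might dump the entire map instead.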
Overall, this map is 5-10X faster than UnsafeFixedWidthAggregationMap and 2-4X faster than VectorizedHashMapGenerator. It is generic enough to be used for both group by aggregation and HashJoins.