Class DataSet

    • Constructor Detail

      • DataSet

        public DataSet()
      • DataSet

        public DataSet​(INDArray first,
                       INDArray second)
        Creates a dataset with the specified input matrix and labels
        Parameters:
        first - the feature matrix
        second - the labels (these should be one-hot label matrices: each row has a value of 1 in the column for its label and 0 in every other column)
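        As an illustration of the label layout described above, the following is a minimal plain-Java sketch that builds a one-hot label matrix from class indices. It uses double[][] in place of INDArray, and the names OneHotLabelsSketch and toOneHot are hypothetical helpers, not part of this API.

```java
import java.util.Arrays;

class OneHotLabelsSketch {
    /** Builds a one-hot matrix: row r has 1.0 in column classIndices[r] and 0.0 elsewhere. */
    static double[][] toOneHot(int[] classIndices, int numClasses) {
        double[][] labels = new double[classIndices.length][numClasses];
        for (int row = 0; row < classIndices.length; row++) {
            int c = classIndices[row];
            if (c < 0 || c >= numClasses) {
                throw new IllegalArgumentException("Class index out of range: " + c);
            }
            labels[row][c] = 1.0; // all other entries keep the default 0.0
        }
        return labels;
    }

    public static void main(String[] args) {
        // Three examples with classes 2, 0 and 1, out of 3 classes
        System.out.println(Arrays.deepToString(toOneHot(new int[] {2, 0, 1}, 3)));
    }
}
```

        A matrix shaped like this (one row per example, one column per class) is the kind of array the second argument is expected to hold.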
      • DataSet

        public DataSet​(INDArray features,
                       INDArray labels,
                       INDArray featuresMask,
                       INDArray labelsMask)
        Create a dataset with the specified input INDArray and labels (output) INDArray, plus (optionally) mask arrays for the features and labels
        Parameters:
        features - Features (input)
        labels - Labels (output)
        featuresMask - Mask array for features, may be null
        labelsMask - Mask array for labels, may be null
    • Method Detail

      • getExampleMetaData

        public List<Serializable> getExampleMetaData()
        Description copied from interface: DataSet
        Get the example metadata, or null if no metadata has been set
        Specified by:
        getExampleMetaData in interface DataSet
        Returns:
        List of metadata instances
      • getExampleMetaData

        public <T extends Serializable> List<T> getExampleMetaData(Class<T> metaDataType)
        Description copied from interface: DataSet
        Get the example metadata, or null if no metadata has been set
        Note: this method results in an unchecked cast - care should be taken when using this!
        Specified by:
        getExampleMetaData in interface DataSet
        Type Parameters:
        T - Type of metadata
        Parameters:
        metaDataType - Class of the metadata (used for type information)
        Returns:
        List of metadata objects
      • setExampleMetaData

        public void setExampleMetaData​(List<? extends Serializable> exampleMetaData)
        Description copied from interface: DataSet
        Set the metadata for this DataSet
        By convention: the metadata can be any serializable object, one per example in the DataSet
        Specified by:
        setExampleMetaData in interface DataSet
        Parameters:
        exampleMetaData - Example metadata to set
      • isPreProcessed

        public boolean isPreProcessed()
      • markAsPreProcessed

        public void markAsPreProcessed()
      • empty

        public static DataSet empty()
        Returns an empty dataset (all fields are null)
        Returns:
        an empty dataset (all fields are null)
      • merge

        public static DataSet merge​(List<? extends DataSet> data)
        Merge the list of datasets into a single dataset. All the rows are combined into one dataset
        Parameters:
        data - the data to merge
        Returns:
        a single dataset
      • load

        public void load​(InputStream from)
        Description copied from interface: DataSet
        Load the contents of the DataSet from the specified InputStream. The current contents of the DataSet (if any) will be replaced.
        The InputStream should contain a DataSet that has been serialized with DataSet.save(OutputStream)
        Specified by:
        load in interface DataSet
        Parameters:
        from - InputStream to load the DataSet from
      • load

        public void load​(File from)
        Description copied from interface: DataSet
        Load the contents of the DataSet from the specified File. The current contents of the DataSet (if any) will be replaced.
        The File should contain a DataSet that has been serialized with DataSet.save(File)
        Specified by:
        load in interface DataSet
        Parameters:
        from - File to load the DataSet from
      • save

        public void save​(OutputStream to)
        Description copied from interface: DataSet
        Write the contents of this DataSet to the specified OutputStream
        Specified by:
        save in interface DataSet
        Parameters:
        to - OutputStream to save the DataSet to
      • save

        public void save​(File to)
        Description copied from interface: DataSet
        Save this DataSet to a file. Can be loaded again using
        Specified by:
        save in interface DataSet
        Parameters:
        to - File to save the DataSet to
      • getFeatures

        public INDArray getFeatures()
        Description copied from interface: DataSet
        Returns the features array for the DataSet
        Specified by:
        getFeatures in interface DataSet
        Returns:
        features array
      • setFeatures

        public void setFeatures​(INDArray features)
        Description copied from interface: DataSet
        Set the features array for the DataSet
        Specified by:
        setFeatures in interface DataSet
        Parameters:
        features - Features to set
      • labelCounts

        public Map<Integer,​Double> labelCounts()
        Description copied from interface: DataSet
        Calculate and return a count of each label, by index. Assumes labels are a one-hot INDArray, for classification
        Specified by:
        labelCounts in interface DataSet
        Returns:
        Map of counts
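        To make the counting rule concrete, here is a small plain-Java sketch that tallies labels by index from a one-hot matrix. It works on double[][] rather than INDArray; LabelCountsSketch is a hypothetical name, and leaving labels that never occur out of the map is a choice of this sketch, not a documented property of labelCounts().

```java
import java.util.Map;
import java.util.TreeMap;

class LabelCountsSketch {
    /** Sums each column of a one-hot label matrix, keyed by column (label) index. */
    static Map<Integer, Double> labelCounts(double[][] oneHotLabels) {
        Map<Integer, Double> counts = new TreeMap<>();
        for (double[] row : oneHotLabels) {
            for (int col = 0; col < row.length; col++) {
                if (row[col] != 0.0) {
                    // Add this example's contribution to the running total for the label
                    counts.merge(col, row[col], Double::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        double[][] labels = {{1, 0, 0}, {0, 0, 1}, {1, 0, 0}};
        System.out.println(labelCounts(labels));
    }
}
```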
      • copy

        public DataSet copy()
        Clone the dataset
        Specified by:
        copy in interface DataSet
        Returns:
        a clone of the dataset
      • reshape

        public DataSet reshape​(int rows,
                               int cols)
        Reshapes the input into the given rows and columns
        Specified by:
        reshape in interface DataSet
        Parameters:
        rows - the row size
        cols - the column size
        Returns:
        a copy of this dataset with the input reshaped
      • multiplyBy

        public void multiplyBy​(double num)
        Description copied from interface: DataSet
        Multiply the features by a scalar
        Specified by:
        multiplyBy in interface DataSet
      • divideBy

        public void divideBy​(int num)
        Description copied from interface: DataSet
        Divide the features by a scalar
        Specified by:
        divideBy in interface DataSet
      • shuffle

        public void shuffle()
        Description copied from interface: DataSet
        Shuffle the order of the rows in the DataSet. Note that this generally won't make any difference in practice unless the DataSet is later split.
        Specified by:
        shuffle in interface DataSet
      • shuffle

        public void shuffle​(long seed)
        Shuffles the dataset in place, using the given seed for the random number generator so that the shuffle is reproducible. Note that this modifies the dataset in place.
        Parameters:
        seed - Seed to use for the random number generator
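        The idea of a seeded, reproducible shuffle can be sketched in plain Java as follows. This is not the library's implementation: SeededShuffleSketch and its methods are hypothetical, and the rows are plain double[] arrays. The point it shows is that the same seed yields the same row order, and that features and labels must be permuted with the same order to stay aligned.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SeededShuffleSketch {
    /** Returns a permutation of 0..n-1 that depends only on n and the seed. */
    static List<Integer> shuffledOrder(int n, long seed) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed));
        return order;
    }

    /** Reorders the rows in place so that new row i is old row order.get(i). */
    static void permuteRows(double[][] rows, List<Integer> order) {
        double[][] copy = rows.clone(); // shallow copy keeps the old row references
        for (int i = 0; i < rows.length; i++) {
            rows[i] = copy[order.get(i)];
        }
    }

    public static void main(String[] args) {
        // Two calls with the same seed print the same order
        System.out.println(shuffledOrder(5, 123L));
        System.out.println(shuffledOrder(5, 123L));
    }
}
```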
      • squishToRange

        public void squishToRange​(double min,
                                  double max)
        Squeezes the input data into the range given by a min and a max
        Specified by:
        squishToRange in interface DataSet
        Parameters:
        min - the min value to occur in the dataset
        max - the max value to occur in the dataset
      • scaleMinAndMax

        public void scaleMinAndMax​(double min,
                                   double max)
        Specified by:
        scaleMinAndMax in interface DataSet
      • scale

        public void scale()
        Divides the input data by the max number in each row
        Specified by:
        scale in interface DataSet
      • addFeatureVector

        public void addFeatureVector​(INDArray toAdd)
        Adds a feature for each example on to the current feature vector
        Specified by:
        addFeatureVector in interface DataSet
        Parameters:
        toAdd - the feature vector to add
      • addFeatureVector

        public void addFeatureVector​(INDArray feature,
                                     int example)
        Adds a feature vector to the example at the given row number
        Specified by:
        addFeatureVector in interface DataSet
        Parameters:
        feature - the feature vector to add
        example - the number of the example to append to
      • normalize

        public void normalize()
        Description copied from interface: DataSet
        Normalize this DataSet to mean 0, stdev 1 per input. This calculates statistics based on the values in a single DataSet only. For normalization over multiple DataSet objects, use NormalizerStandardize
        Specified by:
        normalize in interface DataSet
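        As a rough illustration of "mean 0, stdev 1 per input", the sketch below standardizes each column of a plain double[][] feature matrix. It is an assumption-laden sketch rather than the library's code: StandardizeSketch is a hypothetical name, the population standard deviation (dividing by the number of rows) is assumed, and constant columns are only centered to avoid dividing by zero.

```java
import java.util.Arrays;

class StandardizeSketch {
    /** Standardizes each column in place: subtract the column mean, divide by the column stdev. */
    static void standardizeColumns(double[][] features) {
        int rows = features.length;
        if (rows == 0) {
            return;
        }
        int cols = features[0].length;
        for (int c = 0; c < cols; c++) {
            double mean = 0.0;
            for (double[] row : features) {
                mean += row[c];
            }
            mean /= rows;

            double variance = 0.0;
            for (double[] row : features) {
                variance += (row[c] - mean) * (row[c] - mean);
            }
            double std = Math.sqrt(variance / rows); // population stdev (assumption)

            for (double[] row : features) {
                row[c] -= mean;
                if (std > 0.0) {
                    row[c] /= std; // constant columns are left at 0 after centering
                }
            }
        }
    }

    public static void main(String[] args) {
        double[][] features = {{1, 10}, {3, 10}};
        standardizeColumns(features);
        System.out.println(Arrays.deepToString(features));
    }
}
```

        Because the statistics come from one matrix only, applying this separately to several batches would scale each batch differently, which is the situation the note about NormalizerStandardize addresses.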
      • binarize

        public void binarize()
        Same as calling binarize(0)
        Specified by:
        binarize in interface DataSet
      • binarize

        public void binarize​(double cutoff)
        Binarizes the dataset such that any value greater than cutoff becomes 1 and every other value becomes 0
        Specified by:
        binarize in interface DataSet
        Parameters:
        cutoff - the cutoff point
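        The thresholding rule can be written out in a few lines of plain Java. The sketch below applies it to a double[][] rather than an INDArray, and BinarizeSketch is a hypothetical name used only for illustration.

```java
import java.util.Arrays;

class BinarizeSketch {
    /** Replaces each value in place with 1.0 if it is strictly greater than cutoff, else 0.0. */
    static void binarize(double[][] data, double cutoff) {
        for (double[] row : data) {
            for (int i = 0; i < row.length; i++) {
                row[i] = row[i] > cutoff ? 1.0 : 0.0;
            }
        }
    }

    public static void main(String[] args) {
        double[][] data = {{-1.0, 0.0, 0.5, 2.0}};
        binarize(data, 0.0);
        System.out.println(Arrays.deepToString(data));
    }
}
```

        Note that a value exactly equal to the cutoff maps to 0 under the "greater than" wording used above.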
      • numInputs

        public int numInputs()
        The number of inputs in the feature matrix
        Specified by:
        numInputs in interface DataSet
        Returns:
      • validate

        public void validate()
        Specified by:
        validate in interface DataSet
      • outcome

        public int outcome()
        Specified by:
        outcome in interface DataSet
      • setNewNumberOfLabels

        public void setNewNumberOfLabels​(int labels)
        Clears the outcome matrix setting a new number of labels
        Specified by:
        setNewNumberOfLabels in interface DataSet
        Parameters:
        labels - the number of labels/columns in the outcome matrix. Note that this clears the labels for each example
      • setOutcome

        public void setOutcome​(int example,
                               int label)
        Sets the outcome of a particular example
        Specified by:
        setOutcome in interface DataSet
        Parameters:
        example - the example whose outcome is set
        label - the label of the outcome
      • get

        public DataSet get​(int i)
        Gets a copy of example i
        Specified by:
        get in interface DataSet
        Parameters:
        i - the example to get
        Returns:
        the example at i (one example)
      • get

        public DataSet get​(int[] i)
        Gets a copy of the examples at the given indices
        Specified by:
        get in interface DataSet
        Parameters:
        i - the indices of the examples to get
        Returns:
        the examples at the given indices
      • batchBy

        public List<DataSet> batchBy​(int num)
        Partitions the dataset into mini-batches, where each dataset in the returned list contains the specified number of examples
        Specified by:
        batchBy in interface DataSet
        Parameters:
        num - the number to split by
        Returns:
        the partitioned datasets
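        The partitioning described above can be sketched on a plain array of examples. This is only an illustration: BatchBySketch is a hypothetical name, the examples are double[] rows, and letting the final batch be smaller when the examples do not divide evenly is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BatchBySketch {
    /** Splits the rows into consecutive batches of at most num rows each. */
    static List<double[][]> batchBy(double[][] examples, int num) {
        if (num <= 0) {
            throw new IllegalArgumentException("num must be positive");
        }
        List<double[][]> batches = new ArrayList<>();
        for (int start = 0; start < examples.length; start += num) {
            int end = Math.min(start + num, examples.length);
            batches.add(Arrays.copyOfRange(examples, start, end));
        }
        return batches;
    }

    public static void main(String[] args) {
        double[][] examples = {{0}, {1}, {2}, {3}, {4}};
        for (double[][] batch : batchBy(examples, 2)) {
            System.out.println(Arrays.deepToString(batch));
        }
    }
}
```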
      • filterBy

        public DataSet filterBy​(int[] labels)
        Strips the dataset of all but the passed-in labels
        Specified by:
        filterBy in interface DataSet
        Parameters:
        labels - the labels to keep
        Returns:
        the dataset with only the specified labels
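        One way to picture filtering by label is to select the rows whose one-hot label matches any of the requested label indices. The plain-Java sketch below does that on a double[][] label matrix. FilterByLabelSketch and rowsWithLabels are hypothetical names, and returning row indices (instead of a new dataset) is a simplification made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class FilterByLabelSketch {
    /** Returns the indices of rows whose one-hot label is one of the labels in keep. */
    static List<Integer> rowsWithLabels(double[][] oneHotLabels, int[] keep) {
        List<Integer> rows = new ArrayList<>();
        for (int r = 0; r < oneHotLabels.length; r++) {
            for (int label : keep) {
                boolean inRange = label >= 0 && label < oneHotLabels[r].length;
                if (inRange && oneHotLabels[r][label] == 1.0) {
                    rows.add(r);
                    break; // a row is added at most once
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        double[][] labels = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}, {0, 1, 0}};
        System.out.println(rowsWithLabels(labels, new int[] {1}));
    }
}
```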
      • filterAndStrip

        public void filterAndStrip​(int[] labels)
        Strips the dataset down to the specified labels and remaps them
        Specified by:
        filterAndStrip in interface DataSet
        Parameters:
        labels - the labels to strip down to
      • dataSetBatches

        public List<DataSet> dataSetBatches​(int num)
        Partitions the dataset by the specified number.
        Specified by:
        dataSetBatches in interface DataSet
        Parameters:
        num - the number to split by
        Returns:
        the partitioned datasets
      • sortAndBatchByNumLabels

        public List<DataSet> sortAndBatchByNumLabels()
        Sorts the dataset by label: splits the dataset such that examples are sorted by their labels. A ten-label dataset would produce lists with batches like the following: x1 y = 1, x2 y = 2, ..., x10 y = 10
        Specified by:
        sortAndBatchByNumLabels in interface DataSet
        Returns:
        a list of data sets partitioned by outcomes
      • asList

        public List<DataSet> asList()
        Description copied from interface: DataSet
        Extract each example in the DataSet into its own DataSet object, and return all of them as a list
        Specified by:
        asList in interface DataSet
        Returns:
        List of DataSet objects, each with 1 example only
      • splitTestAndTrain

        public SplitTestAndTrain splitTestAndTrain​(int numHoldout,
                                                   Random rng)
        Splits a dataset into test and train randomly. This will modify the dataset in place to shuffle it before splitting into test/train!
        Specified by:
        splitTestAndTrain in interface DataSet
        Parameters:
        numHoldout - the number to hold out for training
        rng - Random Number Generator to use to shuffle the dataset
        Returns:
        the pair of datasets for the train test split
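        The shuffle-then-split behaviour can be sketched on row indices alone. In the plain-Java example below, SplitSketch is a hypothetical name, and numTrain plays the role of numHoldout, read (per the parameter description above) as the number of examples that end up in the training portion. That reading is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SplitSketch {
    /** Shuffles row indices with rng and returns [trainIndices, testIndices]. */
    static List<List<Integer>> split(int numExamples, int numTrain, Random rng) {
        if (numTrain < 0 || numTrain > numExamples) {
            throw new IllegalArgumentException("numTrain must be between 0 and numExamples");
        }
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < numExamples; i++) {
            order.add(i);
        }
        Collections.shuffle(order, rng);
        // The first numTrain shuffled rows form the training set, the rest the test set
        List<Integer> train = new ArrayList<>(order.subList(0, numTrain));
        List<Integer> test = new ArrayList<>(order.subList(numTrain, numExamples));
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(10, 7, new Random(1L));
        System.out.println("train rows: " + parts.get(0));
        System.out.println("test rows:  " + parts.get(1));
    }
}
```

        Passing a Random built from a fixed seed makes the split repeatable across runs.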
      • splitTestAndTrain

        public SplitTestAndTrain splitTestAndTrain​(int numHoldout)
        Splits a dataset into test and train
        Specified by:
        splitTestAndTrain in interface DataSet
        Parameters:
        numHoldout - the number to hold out for training
        Returns:
        the pair of datasets for the train test split
      • getLabels

        public INDArray getLabels()
        Returns the labels for the dataset
        Specified by:
        getLabels in interface DataSet
        Returns:
        the labels for the dataset
      • getLabelName

        public String getLabelName​(int idx)
        Specified by:
        getLabelName in interface DataSet
        Parameters:
        idx - the index at which to look up the string label value in the list, if it exists
        Returns:
        the label name
      • getLabelNames

        public List<String> getLabelNames​(INDArray idxs)
        Specified by:
        getLabelNames in interface DataSet
        Parameters:
        idxs - the indices at which to look up the string label values in the list, if they exist
        Returns:
        the label names
      • sortByLabel

        public void sortByLabel()
        Organizes the dataset to minimize sampling error while still allowing efficient batching.
        Specified by:
        sortByLabel in interface DataSet
      • sample

        public DataSet sample​(int numSamples)
        Sample without replacement, using a random number generator
        Specified by:
        sample in interface DataSet
        Parameters:
        numSamples - the number of samples to get
        Returns:
        a dataset sampled without replacement
      • sample

        public DataSet sample​(int numSamples,
                              Random rng)
        Sample without replacement
        Specified by:
        sample in interface DataSet
        Parameters:
        numSamples - the number of samples to get
        rng - the rng to use
        Returns:
        the sampled dataset without replacement
      • sample

        public DataSet sample​(int numSamples,
                              boolean withReplacement)
        Sample a dataset numSamples times
        Specified by:
        sample in interface DataSet
        Parameters:
        numSamples - the number of samples to get
        withReplacement - whether to sample with replacement
        Returns:
        the sampled dataset
      • sample

        public DataSet sample​(int numSamples,
                              Random rng,
                              boolean withReplacement)
        Sample a dataset
        Specified by:
        sample in interface DataSet
        Parameters:
        numSamples - the number of samples to get
        rng - the rng to use
        withReplacement - whether to allow duplicates (only tracked by example row number)
        Returns:
        the sample dataset
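        The difference between sampling with and without replacement, tracked by example row number as noted above, can be sketched as follows. This is a plain-Java illustration only: SampleSketch and sampleRows are hypothetical names, and the method returns row indices rather than a dataset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SampleSketch {
    /** Draws numSamples row indices; duplicates are allowed only if withReplacement is true. */
    static List<Integer> sampleRows(int numExamples, int numSamples, Random rng, boolean withReplacement) {
        if (numExamples <= 0) {
            throw new IllegalArgumentException("numExamples must be positive");
        }
        if (!withReplacement && numSamples > numExamples) {
            throw new IllegalArgumentException("Cannot draw more distinct rows than exist");
        }
        List<Integer> picked = new ArrayList<>();
        boolean[] used = new boolean[numExamples];
        while (picked.size() < numSamples) {
            int row = rng.nextInt(numExamples);
            if (!withReplacement && used[row]) {
                continue; // this row was already drawn, try again
            }
            used[row] = true;
            picked.add(row);
        }
        return picked;
    }

    public static void main(String[] args) {
        Random rng = new Random(0L);
        System.out.println(sampleRows(5, 5, rng, false));
        System.out.println(sampleRows(3, 6, rng, true));
    }
}
```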
      • roundToTheNearest

        public void roundToTheNearest​(int roundTo)
        Specified by:
        roundToTheNearest in interface DataSet
      • numOutcomes

        public int numOutcomes()
        Description copied from interface: DataSet
        Returns the number of outcomes (size of the labels array for each example)
        Specified by:
        numOutcomes in interface DataSet
      • numExamples

        public int numExamples()
        Description copied from interface: DataSet
        Number of examples in the DataSet
        Specified by:
        numExamples in interface DataSet
      • setLabelNames

        public void setLabelNames​(List<String> labelNames)
        Sets the label names. Throws an exception if the number of label names passed in does not equal the number of outcomes
        Specified by:
        setLabelNames in interface DataSet
        Parameters:
        labelNames - the label names to use
      • getColumnNames

        @Deprecated
        public List<String> getColumnNames()
        Deprecated.
        Optional column names of the dataset; these are mainly used for interpreting what columns are in the dataset
        Specified by:
        getColumnNames in interface DataSet
        Returns:
        the column names of the dataset
      • setColumnNames

        @Deprecated
        public void setColumnNames​(List<String> columnNames)
        Deprecated.
        Sets the column names. Throws an exception if the number of column names does not match the number of columns
        Specified by:
        setColumnNames in interface DataSet
        Parameters:
        columnNames - the column names to set
      • splitTestAndTrain

        public SplitTestAndTrain splitTestAndTrain​(double fractionTrain)
        Description copied from interface: DataSet
        Split the DataSet into two DataSets randomly
        Specified by:
        splitTestAndTrain in interface DataSet
        Parameters:
        fractionTrain - Fraction (in range 0 to 1) of examples to be returned in the training DataSet object
      • getFeaturesMaskArray

        public INDArray getFeaturesMaskArray()
        Description copied from interface: DataSet
        Input mask array: a mask array for input, where each value is in {0,1} in order to specify whether an input is actually present or not. Typically used for situations such as RNNs with variable length inputs
        Specified by:
        getFeaturesMaskArray in interface DataSet
        Returns:
        Input mask array
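        For variable-length sequences, a mask of this kind can be derived from the true length of each sequence. The plain-Java sketch below builds one as a double[][]; FeatureMaskSketch is a hypothetical name, and the [example, timeStep] layout is an assumption made for the illustration.

```java
import java.util.Arrays;

class FeatureMaskSketch {
    /** mask[i][t] is 1.0 if time step t is real data for example i, and 0.0 if it is padding. */
    static double[][] buildMask(int[] sequenceLengths, int maxLength) {
        double[][] mask = new double[sequenceLengths.length][maxLength];
        for (int i = 0; i < sequenceLengths.length; i++) {
            int len = Math.min(sequenceLengths[i], maxLength);
            for (int t = 0; t < len; t++) {
                mask[i][t] = 1.0;
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        // Two sequences of length 3 and 1, padded to 3 time steps
        System.out.println(Arrays.deepToString(buildMask(new int[] {3, 1}, 3)));
    }
}
```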
      • setFeaturesMaskArray

        public void setFeaturesMaskArray​(INDArray featuresMask)
        Description copied from interface: DataSet
        Set the features mask array in this DataSet
        Specified by:
        setFeaturesMaskArray in interface DataSet
      • getLabelsMaskArray

        public INDArray getLabelsMaskArray()
        Description copied from interface: DataSet
        Labels (output) mask array: a mask array for output, where each value is in {0,1} in order to specify whether an output is actually present or not. Typically used for situations such as RNNs with variable length inputs or many-to-one situations.
        Specified by:
        getLabelsMaskArray in interface DataSet
        Returns:
        Labels (output) mask array
      • setLabelsMaskArray

        public void setLabelsMaskArray​(INDArray labelsMask)
        Description copied from interface: DataSet
        Set the labels mask array in this data set
        Specified by:
        setLabelsMaskArray in interface DataSet
      • hasMaskArrays

        public boolean hasMaskArrays()
        Description copied from interface: DataSet
        Whether the labels or input (features) mask arrays are present for this DataSet
        Specified by:
        hasMaskArrays in interface DataSet
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class Object
      • getMemoryFootprint

        public long getMemoryFootprint()
        This method returns memory used by this DataSet
        Specified by:
        getMemoryFootprint in interface DataSet
        Returns:
        the memory used by this DataSet
      • migrate

        public void migrate()
        Description copied from interface: DataSet
        This method migrates this DataSet into the current Workspace (if any)
        Specified by:
        migrate in interface DataSet
      • detach

        public void detach()
        Description copied from interface: DataSet
        This method detaches this DataSet from the current Workspace (if any)
        Specified by:
        detach in interface DataSet
      • isEmpty

        public boolean isEmpty()
        Specified by:
        isEmpty in interface DataSet
        Returns:
        true if the DataSet object is empty (no features, labels, or masks)