com.mattg.pipeline

Step

abstract class Step extends AnyRef

A Step is one piece of a pipeline designed to carry out some experiment workflow. A Step requires certain inputs, and it might know what other steps provide those inputs. A Step also produces some outputs. This class contains logic for hooking together Steps, so that you can tell the final Step in a pipeline that it should run, and it will run whatever prerequisites are necessary, then run its own work.

Steps can optionally take parameters. When a Step takes parameters, it saves them to the filesystem and, when a file in the pipeline already exists, checks that the saved parameters match the current ones. Because of the way inputs() works, this also means that Steps later in the pipeline need to have information in their parameters about _all_ of the Steps prior to them in the pipeline (or you need something like Guice so that later Steps can construct earlier Steps without having those parameters themselves).

Personally, I think this is a feature, not a bug, for experiment pipelines - you have a single file or block of code that specifies an experiment in its entirety, which the final Step (probably something that computes metrics) has access to. This has the nice property that a single, clear parameter file is associated with every output file in your experiments. No more wondering which parameters produced the output you're looking at. (You still have to worry about code versions, though... TODO(matt): put some git hash logging into this.)
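The save-then-check behavior described above can be sketched roughly as follows. This is a minimal illustration using only the JDK, not the library's implementation: `ParamFileSketch`, `saveOrCheck`, the use of an already-rendered JSON string, and the choice to throw `IllegalStateException` on a mismatch are all assumptions.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

// Hypothetical helper; the real Step may store and compare parameters differently.
object ParamFileSketch {
  // If no parameter file exists yet, write one. If one exists, require that
  // its contents equal the parameters of the current run.
  def saveOrCheck(paramFile: Path, renderedParams: String): Unit = {
    if (Files.exists(paramFile)) {
      val saved = new String(Files.readAllBytes(paramFile), UTF_8)
      if (saved != renderedParams) {
        throw new IllegalStateException(
          s"Parameters in $paramFile do not match the current parameters")
      }
    } else {
      Files.write(paramFile, renderedParams.getBytes(UTF_8))
    }
  }
}
```

Comparing raw strings is deliberately naive: two JSON documents that differ only in key order or whitespace would be reported as a mismatch, so a real check would more likely compare parsed values.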

Note that there's a difference between passing None and Some(JNothing) to a step. Passing None means that _there are no configurable parameters_, so none will be checked for, and none will be saved. Passing Some(JNothing) just means (typically) that default values will be used for all configurable parameters.
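Assuming JValue and JNothing here are the json4s types (this page does not say which JSON library is used), the three cases a Step can see look roughly like this. `ParamsSketch` and `describe` are hypothetical names, and the returned strings only restate the semantics described above.

```scala
import org.json4s._

// Hypothetical illustration; requires json4s on the classpath.
object ParamsSketch {
  def describe(params: Option[JValue]): String = params match {
    // The step has no configurable parameters at all.
    case None => "no configurable parameters: nothing checked, nothing saved"
    // The step is configurable, but the caller supplied nothing.
    case Some(JNothing) => "configurable, nothing supplied: defaults used"
    // The caller supplied explicit values.
    case Some(_) => "explicit parameters supplied"
  }
}
```

For example, `ParamsSketch.describe(Some(JObject(List("model" -> JString("lstm")))))` falls into the last case.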

TODO(matt): figure out the right way to do parallel execution of Steps.

Note that these Steps are defined in terms of their inputs and outputs in the filesystem. It's up to the caller to determine whether to use absolute or relative paths in these steps.

Another important point: in the typical use case for this pipeline (i.e., unless you decide to use Guice), _all_ Step objects in the whole pipeline will be constructed when the endpoint is constructed, because of how the inputs() method works. Make sure that your constructors are all very lightweight - do NOT load any data or other resources in the constructor; if you want the data to be a field of the class, make sure it's a lazy val.
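The lazy val advice can be illustrated with plain Scala, independent of this library. In the sketch below, `LazyValSketch`, `loadVocabulary`, and `LightweightStep` are hypothetical stand-ins; the counter exists only to show when the expensive load actually happens.

```scala
object LazyValSketch {
  // Counts how many times the (pretend) expensive load has run.
  var loads = 0

  // Hypothetical stand-in for reading a large resource from disk.
  def loadVocabulary(path: String): Set[String] = {
    loads += 1
    Set("a", "b")
  }

  class LightweightStep(vocabPath: String) {
    // Not evaluated when the object is constructed; the first access
    // triggers the load, and later accesses reuse the cached value.
    lazy val vocabulary: Set[String] = loadVocabulary(vocabPath)

    def vocabularySize(): Int = vocabulary.size
  }
}
```

Constructing `new LazyValSketch.LightweightStep("vocab.txt")` leaves `loads` at 0; the load runs once, on the first call to `vocabularySize()`. With a plain `val`, every Step in the pipeline would pay this cost as soon as the endpoint is constructed.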

Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new Step(params: Option[JValue], fileUtil: FileUtil = new FileUtil)

Abstract Value Members

  1. abstract def _runStep(): Unit

    This is what you override to actually do stuff in this step.

    Attributes
    protected
  2. abstract def inputs(): Set[(String, Option[Step])]

    What file inputs does this step require, and where do you expect to get them from? If you expect them to be inputs from outside this pipeline, you use None instead of Some(Step).

  3. abstract def outputs(): Set[String]

    What are the output files of this step?
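As a rough sketch of how inputs() and outputs() tie two steps together: the stand-in `SketchStep` below mirrors only the two documented signatures, and `Tokenize`, `Metrics`, and the file paths are hypothetical. It does not use the real Step class, whose constructor also takes params and a FileUtil.

```scala
object WiringSketch {
  // Stand-in with the same shape as the documented abstract members.
  abstract class SketchStep {
    def inputs(): Set[(String, Option[SketchStep])]
    def outputs(): Set[String]
  }

  class Tokenize extends SketchStep {
    // The raw corpus comes from outside the pipeline, so its provider is None.
    def inputs(): Set[(String, Option[SketchStep])] =
      Set(("data/corpus.txt", None))
    def outputs(): Set[String] = Set("work/tokens.txt")
  }

  class Metrics extends SketchStep {
    // Constructing Metrics constructs Tokenize too, which is why
    // constructors need to stay lightweight.
    private val tokenize = new Tokenize

    def inputs(): Set[(String, Option[SketchStep])] =
      Set(("work/tokens.txt", Some(tokenize)))

    // A summary step that only prints results has no output files.
    def outputs(): Set[String] = Set.empty
  }
}
```

Each path requested in inputs() with a Some(step) provider should appear in that step's outputs(); the sketch keeps the two in sync by hand.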

Concrete Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  9. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  10. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  11. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  12. def name(): String

    The step name isn't really used anywhere except for logging things. You can override it if you want, or just ignore it.

  13. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  14. final def notify(): Unit

    Definition Classes
    AnyRef
  15. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  16. def paramFile(): String

    If this step takes parameters, we save them to this location. This is for two reasons: (1) you may run an experiment one day, then two months later come back and want to know what parameters were used to produce a particular output. Having a paramFile ensures that you know what parameters were used. (2) You might change some parameters, but forget to change others, or use the same output filenames for a step even though you changed the parameters. This can cause enormous confusion and silent errors. Having a paramFile lets us ensure that you have no such errors in your experiments.

  17. val params: Option[JValue]

  18. def runPipeline(): Unit

    Run the pipeline up to and including this step. If there are required input files that are not already present, we try to compute them using the Steps given by the inputs() method.

    Note that we do NOT check if the files provided by this step already exist. We assume that if you're calling this method on this object, you want to run this step no matter what. Best practice is to have a main method in your code that calls runPipeline on a summary class that just prints some stuff to stdout (or, in general, just has no output files).

  19. def runStep(): Unit

    Once we've determined that all of the required input files are present, run the work defined by this step of the pipeline. In this method, we save a parameter file, then call the (abstract) method that actually does the computation.

    TODO(matt): add checks for parallel execution in here (like, e.g., checking for an "in_progress" file; in that case, we wait for the file to be removed, then skip _runStep()).

  20. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  21. def toString(): String

    Definition Classes
    AnyRef → Any
  22. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  24. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
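The control flow described for runPipeline() and runStep() can be approximated with the sketch below. It is not the library's code: `SketchStep` is a stand-in, an in-memory set replaces the filesystem, parameter saving is omitted, and both the recursive call to the provider's runPipeline() and the exception for a missing external input are assumptions about behavior this page does not spell out.

```scala
import scala.collection.mutable

object RunPipelineSketch {
  // In-memory stand-ins for the filesystem and for logging.
  val existing = mutable.Set.empty[String]
  val log = mutable.ListBuffer.empty[String]

  abstract class SketchStep {
    def name: String
    def inputs(): Set[(String, Option[SketchStep])]
    def outputs(): Set[String]
    protected def doWork(): Unit // plays the role of _runStep()

    def runPipeline(): Unit = {
      // Only missing inputs trigger upstream work; the guard is re-evaluated
      // per element, so a provider is not rerun for files it just produced.
      for ((path, provider) <- inputs() if !existing(path)) {
        provider match {
          case Some(step) => step.runPipeline()
          case None => throw new IllegalStateException(s"missing input: $path")
        }
      }
      // This step's own outputs are never checked: it always runs.
      runStep()
    }

    def runStep(): Unit = {
      log += name
      doWork()
    }
  }

  class Tokenize extends SketchStep {
    val name = "tokenize"
    def inputs(): Set[(String, Option[SketchStep])] = Set(("corpus.txt", None))
    def outputs(): Set[String] = Set("tokens.txt")
    protected def doWork(): Unit = { existing ++= outputs(); () }
  }

  class Metrics extends SketchStep {
    val name = "metrics"
    private val tokenize = new Tokenize
    def inputs(): Set[(String, Option[SketchStep])] =
      Set(("tokens.txt", Some(tokenize)))
    def outputs(): Set[String] = Set.empty
    protected def doWork(): Unit = ()
  }
}
```

With only "corpus.txt" present, calling runPipeline() on a `Metrics` instance runs tokenize and then metrics. A second call reruns only metrics, because "tokens.txt" now exists while the endpoint itself always runs.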
