Step

A Step is one piece of a pipeline that carries out an experiment workflow. A Step requires certain inputs, and it may know which other Steps provide those inputs. A Step also produces outputs. This class contains the logic for hooking Steps together, so that you can tell the final Step in a pipeline to run, and it will first run whatever prerequisites are necessary and then do its own work.

Steps can optionally take parameters. When a Step takes parameters, it saves them to the filesystem, and when a file in the pipeline already exists it checks that the saved parameters match the current ones. Because of the way inputs() works, this also means that Steps later in the pipeline need to carry, in their own parameters, information about _all_ of the Steps before them (or you need something like Guice so that later Steps can construct earlier Steps without having those parameters themselves).

Personally, I think this is a feature, not a bug, for experiment pipelines: you have a single file or block of code that specifies an experiment in its entirety, and the final Step (probably something that computes metrics) has access to it. This has the nice property that a single, clear parameter file is associated with every output file in your experiments. No more wondering which parameters produced the output you're looking at. (You still have to worry about code versions, though... TODO(matt): put some git hash logging into this.)

Note that there's a difference between passing None and Some(JNothing) to a Step. Passing None means that _there are no configurable parameters_, so none will be checked for, and none will be saved. Passing Some(JNothing) just means (typically) that default values will be used for all configurable parameters.

TODO(matt): figure out the right way to do parallel execution of Steps.

Note that these Steps are defined in terms of their inputs and outputs in the filesystem. It's up to the caller to determine whether to use absolute or relative paths in these steps.

Another important point: in the typical use case for this pipeline (i.e., unless you decide to use Guice), _all_ Step objects in the whole pipeline will be constructed when the endpoint is constructed, because of how the inputs() method works. Make sure that your constructors are all very lightweight - do NOT load any data or other resources in the constructor; if you want the data to be a class variable, make sure it's a lazy val.
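The lazy val advice can be illustrated with a small sketch. EmbeddingStep below is a hypothetical stand-in, not a subclass of the real Step, and the load counter exists only to make the timing visible: constructing the object does no loading, and the data is loaded once, on first access.

```scala
object LazyStepDemo {
  // Counts how many times the (pretend) expensive load has run.
  var loads = 0

  // Hypothetical stand-in for a Step subclass; not the real Step class.
  class EmbeddingStep(path: String) {
    // Evaluated on first access, not when the object is constructed.
    lazy val vectors: Vector[String] = {
      loads += 1
      Vector(s"loaded from $path")
    }

    // The step's work is the first place the data is actually needed.
    def run(): Int = vectors.size
  }
}
```

With a plain val in place of the lazy val, the load would happen inside the constructor, and so would happen for every Step in the pipeline as soon as the endpoint is built.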

Linear Supertypes

AnyRef, Any

Instance Constructors

new Step(params: Option[JValue], fileUtil: FileUtil = new FileUtil)

Abstract Value Members

abstract def _runStep(): Unit

This is what you override to actually do stuff in this step.

Attributes
protected
abstract def inputs(): Set[(String, Option[Step])]

What file inputs does this step require, and where do you expect to get them from? If you expect them to be inputs from outside this pipeline, you use None instead of Some(Step).
abstract def outputs(): Set[String]

What are the output files of this step?

Concrete Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def name(): String

The step name isn't really used anywhere except for logging things. You can override it if you want, or just ignore it.
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def paramFile(): String

If this step takes parameters, we save them to this location. This is for two reasons: (1) you may run an experiment one day, then two months later come back and want to know what parameters were used to produce a particular output. Having a paramFile ensures that you know what parameters were used. (2) You might change some parameters, but forget to change others, or use the same output filenames for a step even though you changed the parameters. This can cause enormous confusion and silent errors. Having a paramFile lets us ensure that you have no such errors in your experiments.
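A minimal sketch of the kind of check described here, under stated assumptions: checkOrSaveParams is a hypothetical helper, the parameters are represented as a plain JSON string rather than a JValue, and the comparison is exact string equality. The real Step may compare parameters differently and may report a mismatch in another way.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object ParamFileSketch {
  // Hypothetical helper: saves params on first use; on later runs,
  // refuses to proceed if the saved params differ from the current ones.
  def checkOrSaveParams(paramFile: Path, params: Option[String]): Unit = params match {
    // No configurable parameters: nothing is saved and nothing is checked.
    case None => ()
    case Some(json) =>
      if (Files.exists(paramFile)) {
        val saved = new String(Files.readAllBytes(paramFile), StandardCharsets.UTF_8)
        if (saved != json) {
          throw new IllegalStateException(
            s"Parameters in $paramFile do not match the current parameters")
        }
      } else {
        Files.write(paramFile, json.getBytes(StandardCharsets.UTF_8))
      }
  }
}
```

Failing loudly on a mismatch is what guards against the second problem above: reusing an output filename after changing the parameters that produced it.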
val params: Option[JValue]
def runPipeline(): Unit

Run the pipeline up to and including this step. If there are required input files that are not already present, we try to compute them using the Steps given by the inputs() method.
Note that we do NOT check if the files provided by this step already exist. We assume that if you're calling this method on this object, you want to run this step no matter what. Best practice is to have a main method in your code that calls runPipeline on a summary class that just prints some stuff to stdout (or, in general, just has no output files).
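The resolution logic described above can be modeled roughly as follows. This is a simplified sketch, not the actual implementation: MiniStep is a hypothetical stand-in for Step, an in-memory set replaces the filesystem, parameters are omitted, and throwing on a missing external input is an assumption.

```scala
import scala.collection.mutable

object PipelineSketch {
  // In-memory stand-in for the filesystem, so the sketch has no side effects.
  val files = mutable.Set[String]()
  // Records the order in which steps did their work.
  val log = mutable.ListBuffer[String]()

  // Hypothetical, simplified model of Step (no params, no FileUtil).
  abstract class MiniStep {
    def name: String
    def inputs: Set[(String, Option[MiniStep])]
    def outputs: Set[String]

    // Assumed behaviour: "running" a step just creates its output files.
    def runStep(): Unit = {
      log += name
      files ++= outputs
    }

    def runPipeline(): Unit = {
      // Only inputs that are missing trigger the step that produces them.
      for ((file, producer) <- inputs if !files.contains(file)) {
        producer match {
          case Some(step) => step.runPipeline()
          case None =>
            throw new IllegalStateException(s"Missing external input: $file")
        }
      }
      // This step's own outputs are not checked: it always runs.
      runStep()
    }
  }

  class Tokenize extends MiniStep {
    val name = "tokenize"
    val inputs: Set[(String, Option[MiniStep])] = Set(("raw.txt", None))
    val outputs = Set("tokens.txt")
  }

  class Train(tokenize: MiniStep) extends MiniStep {
    val name = "train"
    val inputs: Set[(String, Option[MiniStep])] = Set(("tokens.txt", Some(tokenize)))
    val outputs = Set("model.bin")
  }

  // A summary step with no output files, as recommended above.
  class Summary(train: MiniStep) extends MiniStep {
    val name = "summary"
    val inputs: Set[(String, Option[MiniStep])] = Set(("model.bin", Some(train)))
    val outputs = Set.empty[String]
  }
}
```

In this model, calling runPipeline() on a Summary built as new Summary(new Train(new Tokenize)) with only raw.txt present runs tokenize, train and summary in that order. A second call runs only summary, because its input already exists.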
def runStep(): Unit

Once we've determined that all of the required input files are present, run the work defined by this step of the pipeline. In this method, we save a parameter file, then call the (abstract) method that actually does the computation.
TODO(matt): add checks for parallel execution in here (like, e.g., checking for an "in_progress" file; in that case, we wait for the file to be removed, then skip _runStep()).
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

abstract class Step extends AnyRef
