Unit test generator. Generates ScalaTest tests in the Prophecy format for a given component from its input and
output DataFrames.
Note: for the generated unit tests to be correct, this code must be executed on gold-standard
datasets.
Example usage:
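// Generator that writes the tests to the given location.
val ut = new UnitTestsGenerator("hdfs:///path/to/generated/tests/")
val dfInput = Input(spark)
val (dfDistribute1, dfDistribute2) = Distribute(spark, dfInput)
// Generate tests for Distribute: one input DataFrame, two output DataFrames.
ut.generateUnitTests("Distribute", Seq(dfInput), Seq(dfDistribute1, dfDistribute2))
val dfSomeJoin = SomeJoin(spark, dfDistribute1, dfDistribute2)
// Generate tests for SomeJoin: two input DataFrames, one output DataFrame.
ut.generateUnitTests("SomeJoin", Seq(dfDistribute1, dfDistribute2), Seq(dfSomeJoin))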
The above sets up a typical Spark graph with additional calls to the UnitTestsGenerator#generateUnitTests
method, which executes the Spark workflow for the given inputs and outputs and writes the unit tests.
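The idea of recording a component's inputs and outputs and turning them into a test can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the actual UnitTestsGenerator: the names
GoldenTestSketch, Table, renderTable and generateTestSource are hypothetical, a plain case class stands in for a
DataFrame so the sketch runs without Spark, and the emitted text only resembles a ScalaTest test, since the real
Prophecy format is not shown here.

```scala
// Hypothetical sketch: none of these names come from the real UnitTestsGenerator.
object GoldenTestSketch {
  // Stand-in for a DataFrame: column names plus rows of string values.
  final case class Table(columns: Seq[String], rows: Seq[Seq[String]])

  // Quotes a value as a Scala string literal, escaping backslashes and quotes.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Renders one table as a Scala expression so it can be embedded in test source.
  def renderTable(t: Table): String = {
    val cols = t.columns.map(quote).mkString("Seq(", ", ", ")")
    val rows = t.rows
      .map(r => r.map(quote).mkString("Seq(", ", ", ")"))
      .mkString("Seq(", ", ", ")")
    s"Table($cols, $rows)"
  }

  // Builds the source text of a test that feeds the recorded inputs to the
  // component and compares the result with the recorded outputs.
  def generateTestSource(component: String, inputs: Seq[Table], outputs: Seq[Table]): String = {
    val ins = inputs.map(renderTable).mkString("Seq(", ", ", ")")
    val outs = outputs.map(renderTable).mkString("Seq(", ", ", ")")
    Seq(
      s"""test("$component reproduces recorded outputs") {""",
      s"  val inputs = $ins",
      s"  val expected = $outs",
      s"  assert($component(inputs) == expected)",
      "}"
    ).mkString("\n")
  }
}
```

Because the recorded outputs become the expected values of the test, any defect in the data used at generation
time is baked into the test, which is why the generator should be run on gold-standard datasets.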
TODO: To improve performance and reduce the number of Spark actions executed, this could be reimplemented as
a new logical plan operator (similar to our org.apache.spark.sql.InterimExec). However,
because the tests run on only a very limited amount of data, the current approach should not
cause significant performance degradation for now.