Building Stages
Last updated
Last updated
In graph theory, stages are better known as vertices or nodes, while pipes are known as edges or lines. These concepts are an integral part in the Actor model. As an actor framework, Pronghorn has stages that can receive messages, make local decisions, process data, and respond to messages with new messages. Messages are passed through pipes, which allow for data to be shared between stages.
Every stage is defined with a fixed number of inputs and/or outputs. The PronghornStage
class is an abstract class that provides a protocol for creating custom stages. Please follow the guidelines below for writing correct Pronghorn code.
Add a new Java class to your existing Pronghorn project (or create a new project). If you prefer, create a new package in your project called “Stages” and place it there. Use the code below to help you get started:
Depending on the functionality of your stage, you may need to change RawDataSchema
to the schema that your stage will be processing. More on this can be found on the Schema page.
The constructor should be responsible for saving references to your pipes and defining notas.
Every Pronghorn stage requires you to set the GraphManager. You are not, however, required to have output or input pipes. If you choose to do so, super()
will have to look similar to this:
where NONE
is a constant declaring an empty pipe.
The static new instance method is Pronghorn convention for instantiating new stages. You are not required to have this, it exists primarily for readability reasons.
The startup()
method gets visited when your stage is requested to start.
Instead of allocating and initializing local objects and variables in the class constructor, do it instead in startup()
. Any other settings except notas should be configured here as well.
The run()
method gets called depending on the current scheduled rate (which can be changed using notas).
Fine-tuning this rate can vastly improve performance and/or its ability to handle larger volume of data. Reading from input pipes, writing to output pipes, and performing the core logic of your stage should be done here.
Shutting down your stage is another important task to do in your run()
method. Once your processing has finished, you need to let Pronghorn know that work is done and that run()
should not be executed again. You do this using the requestShutdown()
method. Do not call shutdown()
.
Important: It is important that the outputPublish
needs to occur before any of the inputs release.
Producers are the start of your graph and are responsible for providing data to the rest of the stages. You can have multiple producers on the same graph for different subsets of your graph.
An example of a producer would be the FileBlobReadStage
. It opens a file and passes the data onto a pipe into the next stage for further processing, e.g. to parse a certain value from the file or to replicate the file onto other stages.
To use and test your stage, it needs to be added to the current GraphManager.