Write your first topology¶
Organize your files¶
This is an example of the directory tree of a simple topology:
my_first_topology
|-- my_first_topology
| |-- __init__.py
| |-- dummy_bolt.py
| |-- dummy_spout.py
|-- pyleus_topology.yaml
|-- requirements.txt
When building your topology, the jar file generated from the pyleus build command will be named my_first_topology.jar, as the directory containing the YAML topology definition file.
File requirements.txt should list all the dependencies of the topology to be included in the jar. In this case, this file is empty.
See also
If you want to specify a different path for your requirements file, please see Topology definition YAML syntax. If you want to install some dependencies for all your topologies, see Configuration instead.
Define the topology layout¶
A simple pyleus_topology.yaml should look like the following:
# This is a very meaningful paragraph
# describing my_first_topology
name: my_first_topology
workers: 2
topology:
- spout:
name: my-first-spout
module: my_first_topology.dummy_spout
- bolt:
name: my-first-bolt
module: my_first_topology.dummy_bolt
groupings:
- shuffle_grouping: my-first-spout
This define a topology where a single bolt subscribe to the output stream of a single spout. As simple as it is.
Note
Components names do NOT need to match modules names. This is because the same module may be reused more than once in the same topology, perhaps with different input streams or options.
Tip
If you do not specify the number of workers for your topology, Storm will span just one worker. This is perfectly fine if you want to run your topology on your local machine, but you may like to change this value when running your topology on a real cluster. You can do that with the workers option as shown in the example above.
Write your first spout¶
This is the code implementing dummy_spout.py:
from pyleus.storm import Spout
class DummySpout(Spout):
OUTPUT_FIELDS = ['sentence', 'name']
def next_tuple(self):
self.emit(("This is a sentence.", "spout",))
if __name__ == '__main__':
DummySpout().run()
Every Spout must inherit from Spout and declare its OUTPUT_FIELDS as a tuple, a list or a namedtuple. The same goes for emit first argument.
Spouts also must define the method next_tuple(), that will be called within the component main loop in order to generate a stream of new tuples.
Note
Forgetting to call the run() method will prevent the topology from running.
See also
If you want to enable tuple tracking and leverage Storm reliability features, please read Guaranteeing message processing.
See also
For complete API documentation, see pyleus.storm.spout.
Write your first bolt¶
Let’s now look at dummy_bolt.py:
from pyleus.storm import SimpleBolt
class DummyBolt(SimpleBolt):
OUTPUT_FIELDS = ['sentence']
def process_tuple(self, tup):
sentence, name = tup.values
new_sentence = "{0} says, \"{1}\"".format(name, sentence)
self.emit((new_sentence,), anchors=[tup])
if __name__ == '__main__':
DummyBolt().run()
Every Bolt must inherit from Bolt or SimpleBolt, which is a bolt automatically acking/failing tuples and offering a nicer API to leverage tick tuples. The process_tuple method will be called whenever a new tuple reaches the bolt.
Note
Please note that SimpleBolt will NOT automatically anchor your tuples. See Guaranteeing message processing for more info on anchoring.
Note
Even if you want to define only one output field, please declare it as an element either of a list or of a tuple, as showed in the above example. Using just a string is not allowed.
See also
For complete API documentation, see pyleus.storm.bolt.
Warning
Do NOT print. I’m gonna say it again: Do. Not. Print. Or, at least, do not print until you want to crash your topology. The mechanism Storm uses to communicate with Python is based on stdin/stdout communication, so you are not allowed to use them. Use logging instead (see Logging).
Run your topology¶
Run your topology on your local machine for debugging:
pyleus build my_first_topology/pyleus_topology.yaml
pyleus local my_first_topology.jar -d
The -debug option will print all tuples flowing through the topology.
When you are done, hit C-C.