Context

bndl.compute.context.ComputeContext is the entry point into a cluster of BNDL workers from the ‘driver’ node. It provides the means to create partitioned, distributed data sets (which can then be transformed and combined), broadcast data, create accumulators to collect data into, and so on.

See Getting started for creating a compute context.

Datasets

See Datasets for more on data sets. ComputeContext is the main handle for creating some data sets (although most functionality is enclosed in the implementations of bndl.compute.dataset.Dataset). Some examples:

>>> r = ctx.range(10)
>>> r
<RangeDataset 38s8n3ym>
>>> r.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> c = ctx.collection('The quick brown fox jumps over the lazy dog')
>>> c
<DistributedCollection 3kuycn22>
>>> for char, count in c.with_value(1).aggregate_by_key(sum).nlargest(4, key=1):
...     print(char, count)
...
  8
o 4
e 3
u 2
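
To make the chain in the example above concrete, here is a minimal single-process sketch of what it computes, written in plain Python rather than with BNDL. It assumes that with_value(1) pairs every element with the value 1, that aggregate_by_key(sum) sums the values per key, and that nlargest(4, key=1) ranks the pairs by their second field. The helper name count_chars is hypothetical.

```python
import heapq
from collections import defaultdict


def count_chars(text):
    """Count occurrences of each character in text (hypothetical helper)."""
    # Stand-in for with_value(1): pair every character with the value 1.
    pairs = [(char, 1) for char in text]
    # Stand-in for aggregate_by_key(sum): group the values per key, then sum them.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


counts = count_chars('The quick brown fox jumps over the lazy dog')

# Stand-in for nlargest(4, key=1): rank the (char, count) pairs by count.
top = heapq.nlargest(4, counts.items(), key=lambda pair: pair[1])
for char, count in top:
    print(char, count)
```

The characters u, h and r each occur twice in this sentence, so the fourth entry depends on how ties are broken and may differ between this sketch and a distributed run.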

>>> f = ctx.files('.') # key value pairs of (filename:str, contents:bytes)
>>> f
<RemoteFilesDataset o4p97hrt>
>>> f.values().map(len).stats()
<Stats count=4495, mean=23657.52413793103, min=0.0, max=5709296.0, var=45209392506.7967, stdev=212625.00442515386, skew=17.166050243279887, kurt=356.5405806659577>
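
As a rough local analogue of the files example above, the sketch below computes a few of the same summary figures over the lengths of some in-memory file contents. The file names and contents are made up for illustration, and only count, mean, min and max are shown; how BNDL computes the variance-based fields is not covered here.

```python
import statistics

# Hypothetical (filename, contents) pairs, mirroring the key-value shape of ctx.files.
files = {
    'a.txt': b'hello',
    'b.bin': b'',
    'c.md': b'# title\n',
}

# Stand-in for values().map(len): the size in bytes of every file.
lengths = [len(contents) for contents in files.values()]

summary = {
    'count': len(lengths),
    'mean': statistics.mean(lengths),
    'min': min(lengths),
    'max': max(lengths),
}
print(summary)
```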

Distributed global variables

On occasion it’s convenient to share (broadcast) some data with all workers (and not have it serialized and sent for every task again). Or the opposite: let every worker (e.g. in a mapper or reducer task) send data ‘out of band’ to (accumulate on) the driver. See Broadcasts and accumulators for more on these topics.
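
The motivation for broadcasting can be illustrated without BNDL. The sketch below uses pickle purely as a stand-in serializer (an assumption, not a statement about how BNDL serializes tasks) and models a single worker: it compares the bytes shipped when a lookup table travels with every task against shipping the table once and sending only the per-task key. The names lookup and keys are hypothetical.

```python
import pickle

# Hypothetical shared data: a lookup table that every task needs.
lookup = {i: str(i) for i in range(1000)}
keys = list(range(10))

# Without broadcasting: the table is part of every task's payload.
per_task_bytes = sum(len(pickle.dumps((lookup, key))) for key in keys)

# With broadcasting: the table is shipped once, tasks carry only their key.
broadcast_bytes = len(pickle.dumps(lookup)) + sum(
    len(pickle.dumps(key)) for key in keys
)

print(per_task_bytes, broadcast_bytes)
```

The gap grows with the number of tasks per worker, which is why sharing the data once pays off for larger jobs.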

Workers / cluster and Profiling

ComputeContext inherits from ExecuteContext and thus exposes functions and properties for, e.g., waiting for workers and profiling. See Execute for more.