Context¶
bndl.compute.context.ComputeContext
is the entry point into a cluster of BNDL workers
from the ‘driver’ node. It provides the means to create partitioned distributed data sets (which can
then be transformed and combined), broadcast data, create accumulators to collect data into, etc.
See Getting started for creating a compute context.
Datasets¶
See Datasets for more on data sets. ComputeContext is the main handle for creating
data sets (although most functionality lives in the implementations of
bndl.compute.dataset.Dataset
). Some examples:
>>> r = ctx.range(10)
>>> r
<RangeDataset 38s8n3ym>
>>> r.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
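Conceptually, ctx.range(10) behaves like Python's built-in range, but split into partitions that workers process independently; collect() concatenates the partition results on the driver. A minimal plain-Python sketch of that idea (the partition_range helper below is illustrative, not BNDL API):

```python
def partition_range(stop, pcount):
    """Split range(stop) into up to pcount contiguous partitions,
    roughly as a distributed range dataset is split across workers."""
    step = -(-stop // pcount)  # ceiling division
    return [list(range(i, min(i + step, stop))) for i in range(0, stop, step)]

parts = partition_range(10, 4)
# Each partition would be processed by a different worker;
# collect() concatenates the per-partition results in order.
collected = [x for part in parts for x in part]
```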
>>> c = ctx.collection('The quick brown fox jumps over the lazy dog')
>>> c
<DistributedCollection 3kuycn22>
>>> for char, count in c.with_value(1).aggregate_by_key(sum).nlargest(4, key=1):
...     print(char, count)  # the blank-looking first line is the space character
...
  8
o 4
e 3
u 2
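The with_value(1).aggregate_by_key(sum) pattern above is the distributed word-count idiom applied to characters. Locally it is equivalent to counting characters and taking the most frequent ones, which can be sketched with the standard library (this illustrates what the dataset operations compute, it is not BNDL code):

```python
from collections import Counter
from heapq import nlargest

text = 'The quick brown fox jumps over the lazy dog'
# with_value(1) pairs every element with 1; aggregate_by_key(sum) sums per key.
counts = Counter(text)
# nlargest(4, key=1) keeps the four keys with the highest counts.
top4 = nlargest(4, counts.items(), key=lambda kv: kv[1])
```

The space character is the most frequent (8 occurrences), followed by 'o' and 'e'; several characters tie at 2, so the fourth entry depends on tie-breaking order.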
>>> f = ctx.files('.') # key value pairs of (filename:str, contents:bytes)
>>> f
<RemoteFilesDataset o4p97hrt>
>>> f.values().map(len).stats()
<Stats count=4495, mean=23657.52413793103, min=0.0, max=5709296.0, var=45209392506.7967, stdev=212625.00442515386, skew=17.166050243279887, kurt=356.5405806659577>
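stats() computes summary statistics (count, mean, min, max, variance, standard deviation, skew, kurtosis) over the dataset in a single distributed pass. The same kind of summary can be sketched locally with the standard library (the sample sizes below are made up for illustration, and whether BNDL uses population or sample variance is an assumption left open here):

```python
import statistics

lengths = [120, 0, 4096, 57]  # made-up sample of file sizes in bytes
summary = {
    'count': len(lengths),
    'mean': statistics.mean(lengths),
    'min': min(lengths),
    'max': max(lengths),
    # population stdev; stats() may use sample variance instead
    'stdev': statistics.pstdev(lengths),
}
```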
Distributed global variables¶
On occasion it’s convenient to share (broadcast) some data with all workers (rather than have it serialized and sent along with every task again). Or the opposite: let every worker (e.g. in a mapper or reducer task) send data ‘out of band’ back to (accumulate on) the driver. See Broadcasts and accumulators for more on these topics.
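In plain Python terms: a broadcast ships one read-only value to every worker once, instead of re-serializing it into every task's closure, while an accumulator flows per-task values back to the driver where they are combined. A hypothetical single-process sketch of the two patterns (the Broadcast and Accumulator classes below are illustrative, not BNDL's API):

```python
class Broadcast:
    """A read-only value shipped to each worker once, then read locally."""
    def __init__(self, value):
        self.value = value  # in a cluster, serialized once per worker

class Accumulator:
    """A driver-side aggregate that tasks add to 'out of band'."""
    def __init__(self, initial, op):
        self.value, self.op = initial, op
    def add(self, contribution):
        # called with each task's partial result
        self.value = self.op(self.value, contribution)

lookup = Broadcast({'a': 1, 'b': 2})
total = Accumulator(0, lambda acc, x: acc + x)
for task_input in ['a', 'b', 'a']:       # pretend each iteration is a task
    total.add(lookup.value[task_input])  # read broadcast, accumulate result
```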
Workers / cluster and profiling¶
ComputeContext
inherits from
ExecuteContext
and thus exposes functions and
properties for e.g. waiting for workers and profiling. See Execute for more.