Spark
Automatically deploy Spark clusters for your projects with a single line of code.
WARNING: This is very experimental and currently just a proof of concept. Proceed at your own risk.
Defining and running a Spark Cluster
This requires you to install pyspark yourself first. Add it to your requirements.txt (or install it in your Dockerfile).
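For example, a minimal requirements.txt could contain just the dependency (left unpinned here; pin a version that matches your Spark setup):

```text
# requirements.txt
pyspark
```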
Spark clusters can be created using the SparkCluster task.
spark_cluster.py

```python
from cowait.tasks import Task
from cowait.tasks.spark import SparkCluster
from pyspark.sql import SparkSession


class YourSparkJob(Task):
    async def run(self, **inputs):
        cluster = SparkCluster(workers=5)
        conf = await cluster.get_config()

        # create spark session
        session = SparkSession.builder \
            .config(conf=conf) \
            .getOrCreate()

        # use your Spark SQL session!

        # you can also scale the cluster at will:
        await cluster.scale(workers=2)

        return "Spark job exited"
```

Run it:

```bash
cowait run spark_cluster
```

SparkCluster RPC Methods
The SparkCluster task will automatically set up a Spark scheduler and a set of workers. It provides a number of RPC methods for controlling your cluster, summarized in the table below; a short usage sketch follows the table.
cowait.tasks.spark.SparkCluster

| RPC Method | Description |
|---|---|
| `get_workers()` | Returns information about all Spark workers |
| `scale(workers: int)` | Scales the cluster up or down |
| `get_config()` | Returns the Spark configuration |
| `teardown()` | Stops the Spark cluster task |
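As a rough sketch of how these RPC methods combine inside a parent task (the class name RpcSketch is made up for illustration; the calls follow the same awaited pattern as the example above):

```python
from cowait.tasks import Task
from cowait.tasks.spark import SparkCluster


class RpcSketch(Task):
    async def run(self, **inputs):
        # start a small cluster
        cluster = SparkCluster(workers=2)

        # inspect the current workers (return format not shown in this guide)
        workers = await cluster.get_workers()
        print(workers)

        # scale up when more capacity is needed
        await cluster.scale(workers=4)

        # fetch the configuration used to build a SparkSession
        conf = await cluster.get_config()

        # stop the cluster once the job is done
        await cluster.teardown()

        return "cluster stopped"
```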
See GitHub for the full reference.