Automatically deploy Spark clusters for your projects with a single line of code.
WARNING: This is very experimental and currently just a proof of concept. Proceed at your own risk.
Defining and running a Spark Cluster
This requires pyspark, which you must first install manually. Add it to your requirements.txt (or install it in your Dockerfile).
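A quick way to confirm the dependency is present in your image is to check the import path before the task starts. This is only a sketch using the standard library; `is_installed` is a hypothetical helper, not part of cowait or pyspark.

```python
from importlib.util import find_spec


def is_installed(package):
    """Return True if a top-level package can be found on the import path."""
    return find_spec(package) is not None


# Hypothetical pre-flight check; the message wording is illustrative only.
if not is_installed("pyspark"):
    print("pyspark not found: add it to requirements.txt or install it in your Dockerfile")
```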
Spark clusters can be created using the SparkCluster task:
    from cowait.tasks import Task
    from cowait.tasks.spark import SparkCluster
    from pyspark.sql import SparkSession

    class YourSparkJob(Task):
        async def run(self, **inputs):
            # start a cluster with 5 workers and fetch its Spark configuration
            cluster = SparkCluster(workers=5)
            conf = await cluster.get_config()

            # create spark session
            session = SparkSession.builder \
                .config(conf=conf) \
                .getOrCreate()

            # use your Spark SQL session!

            # you can also scale the cluster at will:
            await cluster.scale(workers=2)

            return "Spark job exited"
cowait run Spark_cluster
SparkCluster RPC Methods
The SparkCluster task will automatically set up a Spark scheduler and a set of workers. It provides a number of RPC methods for controlling your cluster, summarized below:
|Method|Description|
|---|---|
||Get information about all Spark workers|
|scale|Scale your cluster up or down|
|get_config|Returns the Spark configuration|
||Stop your Spark cluster task from running|
See GitHub for the full reference.
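As a rough sketch of how the scaling call might be wrapped, the example below scales up before a heavy stage and always scales back down afterwards. `StubCluster`, `with_extra_workers`, and `heavy_stage` are hypothetical names invented for illustration. The stub only imitates the `scale(workers=...)` and `get_config()` calls that appear in the task example, so the snippet runs without cowait or a live cluster. With a real SparkCluster, `get_config()` returns a Spark configuration, not a dict.

```python
import asyncio


class StubCluster:
    """Hypothetical stand-in for SparkCluster; not part of cowait."""

    def __init__(self, workers):
        self.workers = workers

    async def scale(self, workers):
        # Assumed to mirror `await cluster.scale(workers=...)` from the task example.
        self.workers = workers

    async def get_config(self):
        # Placeholder: a plain dict instead of a real Spark configuration.
        return {"workers": self.workers}


async def with_extra_workers(cluster, peak, baseline, job):
    """Scale up to `peak`, run `job`, then always scale back to `baseline`."""
    await cluster.scale(workers=peak)
    try:
        return await job(cluster)
    finally:
        # Runs even if the job raises, so the cluster is not left oversized.
        await cluster.scale(workers=baseline)


async def heavy_stage(cluster):
    # Stand-in for real Spark work: report how many workers were available.
    conf = await cluster.get_config()
    return conf["workers"]


async def main():
    cluster = StubCluster(workers=2)
    seen = await with_extra_workers(cluster, peak=5, baseline=2, job=heavy_stage)
    return seen, cluster.workers


result = asyncio.run(main())
```

Putting the scale-down in a `finally` block is a design choice for this sketch: it keeps a failed job from leaving extra workers running.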