With the new Structured Streaming API released in Spark, writing a stream looks as follows. As an example, I am reading from Kafka and writing to HDFS in Avro format.
kafkaStream = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", BROKER)
    .option("subscribe", INPUT_QUEUES)
    .load()
)
Now we write the stream:
query = (
    kafkaStream
    .writeStream
    .format("com.databricks.spark.avro")
    .option("path", PATH)
    .option("checkpointLocation", "/tmp/")
    .start()
)
query.awaitTermination()
This works well if the writeStream format is parquet or console, for example, but when you use the format com.databricks.spark.avro it breaks with the following error:
java.lang.UnsupportedOperationException: Data source com.databricks.spark.avro does not support streamed writing
I posted this as an issue on GitHub as well (it has been more than two months without a single reply); hopefully someone can shed some light on the matter here.
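In case it helps anyone hitting the same error: on newer Spark versions (2.4+), a possible workaround is to use foreachBatch, which hands each micro-batch to the ordinary batch DataFrameWriter, where Avro is supported via the built-in "avro" format. This is only a sketch under those assumptions; the output path below is a placeholder standing in for PATH above.

```python
# Workaround sketch, assuming Spark 2.4+ (built-in "avro" format and
# foreachBatch are available there). Each micro-batch is written with the
# batch writer, which does support Avro, sidestepping the streaming sink.

PATH = "/tmp/avro-out"  # placeholder output path, standing in for PATH above

def write_batch(batch_df, batch_id):
    # Persist this micro-batch using the batch DataFrameWriter.
    batch_df.write.format("avro").save(PATH)

# Wiring it into the stream from the question would then look like:
#
# query = (
#     kafkaStream
#     .writeStream
#     .foreachBatch(write_batch)
#     .option("checkpointLocation", "/tmp/")
#     .start()
# )
# query.awaitTermination()
```

The checkpointLocation option still applies, since foreachBatch runs inside the same streaming query.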