I'm getting an error when I try to register a function with Spark - SQL that I created with the Python API.
First, I train my logistic regression model:
from pyspark.mllib.regression import LinearRegressionWithSGD LRM = LinearRegressionWithSGD() linear_model = LRM.train(transformed_data_scaled, intercept = True)
I know the linear model is trained successfully because I get the following:
print 'Model Coefficients:', linear_model.weights print 'Model Intercept:', linear_model.interceptModel Coefficients: [-82.6194271643,2381.06882039,-65.0826814456,1.70454845719,-75.5970986012,-67.5109145931] Model Intercept: 2081.56222548
Next, I create a function by using the following code:
from pyspark.mllib.linalg import Vectorsdef predict(a,b,c,d,e,f): return linear_model.predict(Vectors.dense([a,b,c,d,e,f]))
I know this function is working correctly because I can test it as follows:
> predict(1,2,3,4,5,6)Out[44]: 5789.599608026406
Lastly, I register the function I created with Spark-SQL as follows:
sqlContext.registerFunction("predict", predict)
When I try to use this command in SQL, I get the following error:
%sql select predict(1,2,3,4,5,6)
Error in SQL statement: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22181.0 failed 4 times, most recent failure: Lost task 0.3 in stage 22181.0 (TID 536610, ip-10-0-162-160.ec2.internal): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
Answer by hamel.husain@gmail.com · May 10, 2015 at 07:05 PM
I found the answer. When you call the registerFunction, you have to provide the third argument, which is the datatype.
Note how the third parameter is the datatype with a default of string. I neglected to pass this parameter so I was getting the error.
Why are Python custom UDFs (registerFunction) showing Arrays with java.lang.Object references? 1 Answer
How do I register a UDF that returns an array of tuples in scala/spark? 7 Answers
How to efficiently concatenate data frames with different column sets in Spark? 0 Answers
Pyspark UDF - AttributeError: 'UserDefinedFunction' object has no attribute '_get_object_id' error., 0 Answers
Databricks Inc.
160 Spear Street, 13th Floor
San Francisco, CA 94105
info@databricks.com
1-866-330-0121