I am trying to run the example code from Chapter 6 (Understanding Wikipedia with Latent Semantic Analysis) of Advanced Analytics with Spark, but am not able to download the required Stanford CoreNLP libraries from Maven Central. There are two Maven coordinates involved.
I am using the "Maven Coordinate" option under "Create Library". The main jar ("stanford-corenlp-3.6.0.jar", corresponding to the first coordinate) downloads fine, but the second coordinate (for "stanford-corenlp-3.6.0-models-english.jar") does not work. In fact, after entering the Maven coordinate, the next page of "Create Library" does not even display this jar.
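For reference, this is the kind of dependency I am trying to resolve, sketched in sbt syntax since I don't have the exact Databricks form handy; the "models-english" classifier is only inferred from the usual artifactId-version-classifier.jar naming of the models jar, not confirmed against Maven Central:

// Sketch only: the classifier name is inferred from the jar's file name.
libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models-english"
)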
What am I missing?
Thanks for any guidance.
Answer by raela · Jun 30, 2016 at 08:27 PM
The ML team at Databricks recently released the first version of spark-corenlp.
Here's an example notebook to show how it works: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/3233914658600709/588180/latest.html
I thought you might be interested in giving this package a try. Do let us know if you have any feedback!
Problem running example.
The English model loaded fine, but when displaying the output I get the following error:

Error while loading a tagger model (probably missing model file);
Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as class path, filename or URL
  at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem

If I change the select statement to
.select('sen, tokenize('sen).as('words))
it runs fine.
Perhaps ner and sentiment are looking under the library location rather than in /tmp/stanford-corenlp-3.6.0-models-english.jar?
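To narrow this down, I would first check whether the model resource is visible to the driver at all. This is just a quick sketch using only the standard ClassLoader API; the model path is the one from the error message above:

// Quick diagnostic sketch: is the tagger model resource visible on the driver's classpath?
val modelPath =
  "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger"
val found = Thread.currentThread.getContextClassLoader.getResource(modelPath)
println(if (found != null) s"Model found at: $found" else "Model NOT visible on the classpath")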
This is clearly a very helpful development, but it also distracts from the real problem, which is getting the CoreNLP model JAR installed.
Installing the CoreNLP model JAR is nontrivial for application developers, and for some reason the hack mentioned elsewhere on this thread has stopped working on Databricks CE.
I do wish Databricks would address the real need here. Without a documented solution for installing the CoreNLP model JAR, the above development (spark-corenlp) is pretty much a tall building without a foundation.
Answer by raela · Jun 03, 2016 at 04:47 PM
Here's a notebook that shows how you can work around this:
Answer by sanjay_dasgupta · Jun 04, 2016 at 01:53 AM
Hi @raela, thank you for this lead.
I tried out the code on Databricks CE with a version 2.0 preview cluster, but ran into a problem. The following line fails:
ucl.addURL(new java.net.URL("file:////tmp/stanford-corenlp-3.6.0-models-english.jar"))
The error message is:
java.lang.NoSuchMethodException: com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader.addURL(java.net.URL)
  at java.lang.Class.getMethod(Class.java:1786)
  at Notebook.reflMethod$Method1(<console>:47)
I did try to print out the method names using the following line:
ucl.getClass.getMethods.map(_.getName).foreach(println)
But the "addURL" method is not listed. None of the other methods had similar names.
What am I missing?
Thanks again for your help.
Answer by sanjay_dasgupta · Jun 04, 2016 at 04:47 AM
Hi @raela, here is an update to my last post. Using the parent classloader (instead of the grandparent) seems to be working for some of the notebook (up to the NERDemo.java cell):
val cl = java.lang.Thread.currentThread.getContextClassLoader.getParent // .getParent
The imports at the top of NERDemo are still failing, though. I am now using version 1.6.1/Hadoop 2 of Databricks CE.
Thanks again for your lead.
Answer by sanjay_dasgupta · Jun 08, 2016 at 04:24 PM
OK, here is what seems to be working (apache/branch-2.0 preview on Databricks CE):
type CL = ClassLoader with AnyRef { def addURL(url: java.net.URL): Unit }

val cl  = java.lang.Thread.currentThread.getContextClassLoader.getParent
val cl2 = java.lang.Thread.currentThread.getContextClassLoader.getParent.getParent.getParent

// Using Scala's structural types to load the file.
val ucl: CL  = cl.asInstanceOf[CL]
val ucl2: CL = cl2.asInstanceOf[CL]

ucl.addURL(new java.net.URL("file:////tmp/stanford-corenlp-3.6.0-models-english.jar"))
ucl2.addURL(new java.net.URL("file:////tmp/stanford-corenlp-3.6.0-models-english.jar"))

sc.addJar("/tmp/stanford-corenlp-3.6.0-models-english.jar")
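(For what it's worth, the structural type here is just a way to call addURL without naming the Databricks-internal classloader classes at compile time. The asInstanceOf cast itself always succeeds; the reflective call then throws the NoSuchMethodException seen earlier in this thread if the chosen loader does not expose a public addURL, which is why picking the right level in the classloader chain matters.)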
Answer by sanjay_dasgupta · Jul 01, 2016 at 04:17 PM
Hi @raela
Thank you for the heads up.
Somehow, the code that wgets the model jar and calls sc.addJar(...) does not work for me. However, when I replace that block of code with the following (derived from your hint above), everything works fine:
import java.io.File

// Copy the models jar from the FileStore into /tmp only if it isn't there already.
val file = new File("/tmp/stanford-corenlp-3.6.0-models-english.jar")
if (file.exists) {
  println("File exists, ...")
} else {
  println("File NOT FOUND, making copy ...")
  val copyStatus = dbutils.fs.cp("/FileStore/tables/stanford-corenlp-3.6.0-models-english.jar", "file:/tmp/")
  println(s"copyStatus: $copyStatus")
}

type CL = ClassLoader with AnyRef { def addURL(url: java.net.URL): Unit }

val cl  = java.lang.Thread.currentThread.getContextClassLoader.getParent
val cl2 = java.lang.Thread.currentThread.getContextClassLoader.getParent.getParent.getParent

// Using Scala's structural types to load the file.
val ucl: CL  = cl.asInstanceOf[CL]
val ucl2: CL = cl2.asInstanceOf[CL]

ucl.addURL(new java.net.URL("file:////tmp/stanford-corenlp-3.6.0-models-english.jar"))
ucl2.addURL(new java.net.URL("file:////tmp/stanford-corenlp-3.6.0-models-english.jar"))

sc.addJar("/tmp/stanford-corenlp-3.6.0-models-english.jar")
(I had stashed the model jar inside a FileStore directory to avoid having to download it every time)
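In case it helps, here is the one-time setup I use to stash the jar, written as a sketch; the FileStore path is my own and the download URL is left as a placeholder:

// One-time setup sketch: download the models jar once (e.g. with wget in a %sh cell),
// then copy it into the FileStore so later clusters only need the cheap dbutils.fs.cp above.
//   %sh wget -q <models-jar-url> -O /tmp/stanford-corenlp-3.6.0-models-english.jar
dbutils.fs.cp(
  "file:/tmp/stanford-corenlp-3.6.0-models-english.jar",
  "/FileStore/tables/stanford-corenlp-3.6.0-models-english.jar")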
It would be great if Databricks could also address the underlying needs: the user having to download the JAR manually, and the fact that a JAR of this size cannot be uploaded directly using the normal library-creation option. Wouldn't it be much simpler to just loosen that JAR size limit a bit? I must be missing something :-)
Thanks for your help.
- Sanjay
Increasing JAR size limits did come up a bunch of times, but the engineering team ultimately decided not to do it because large JARs might cause webapp or cluster manager timeouts. So they're sacrificing a bit of functionality here for increased product stability. Sorry you have to jump through so many hoops just to get the JAR on Databricks!
Answer by sanjay_dasgupta · Aug 14, 2016 at 05:22 AM
Hi @raela, this does not seem to be working for me anymore. I'm using Spark 2.0 (with Scala 2.10), but have also tried falling back to 1.6.2. Code that worked earlier no longer compiles: the addURL() calls and the one sc.addJar() call all appear to succeed, but the imports in later application code still fail:
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, PartOfSpeechAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

<console>:30: error: object ling is not a member of package edu.stanford.nlp
       import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, PartOfSpeechAnnotation, SentencesAnnotation, TokensAnnotation}
                               ^
<console>:31: error: object Annotation is not a member of package edu.stanford.nlp.pipeline
       import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
Can you give me any insight into what I need to check?
Thanks for any hints.
Answer by Tom D · Aug 15, 2016 at 10:44 AM
Hi all,
I was successful in getting this to run in mid-July. Just tried it again (8/15/2016), and it still works (using version 1.6.2, with Hadoop 2). You can see an abridged version at
Here's what I did.
(1) Loaded and attached spark-corenlp-0.1
(2) Loaded stanford-english-corenlp-2016-01-10-models.jar into my AWS account, and created a "mnt" link in Databricks.
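For anyone replicating step (2), the mount looks roughly like the sketch below; every value here is a placeholder for your own bucket and credentials, so treat it as an outline rather than my exact code:

// Rough sketch: mount the S3 bucket holding the models jar under /mnt.
val accessKey = "<AWS_ACCESS_KEY>"
val secretKey = java.net.URLEncoder.encode("<AWS_SECRET_KEY>", "UTF-8")
val bucket    = "<my-bucket-name>"
dbutils.fs.mount(s"s3a://$accessKey:$secretKey@$bucket", "/mnt/corenlp-models")
display(dbutils.fs.ls("/mnt/corenlp-models"))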
Feel free to reply if you have any questions.
Best,
Tom
Answer by sanjay_dasgupta · Aug 16, 2016 at 03:47 AM
Hi Tom,
Thanks for your response.
I see no material difference between your code and mine, and am really baffled by this sudden change. I tweaked and tinkered all day yesterday; this is so disappointing.
I am using the Databricks Community Edition; what about you?
Answer by Tom D · Aug 16, 2016 at 10:11 AM
Hi Sanjay,
Yes, I'm in CE.
My code does NOT run on Spark 2.0 (Scala 2.10). It first throws a warning:
Warning: there were 2 feature warning(s); re-run with -feature for details
defined type alias CL
cl: ClassLoader = org.apache.spark.repl.SparkIMain$$anon$2@56507b9a
cl2: ClassLoader = com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@6eafb10e
ucl: CL = org.apache.spark.repl.SparkIMain$$anon$2@56507b9a
ucl2: CL = com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@6eafb10e
Then it throws an error after this code:
val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

-----------------

error: bad symbolic reference. A signature in functions.class refers to type UserDefinedFunction
in package org.apache.spark.sql which is not available.
It may be completely missing from the current classpath, or the version on the classpath might
be incompatible with the version used when compiling functions.class.
<console>:40: error: org.apache.spark.sql.UserDefinedFunction does not take parameters
       .select(cleanxml('text).as('doc))
I suspect "library" compiles are out of synch with Spark 2.0, but don't have time to track it down. Did you use a different library from the last time it worked in 1.6.2? Sorry can't be more help right now.
Best,
Tom