Issues with UTF-16 files and unicode characters

Tags: csv, utf-16

Question by Dominic Robinson · Dec 11, 2018 at 08:13 PM

Can someone please offer some insight? I've spent days trying to solve this issue.

We have the task of loading hundreds of tab-separated text files encoded in UTF-16 little endian. Our organisation is an international one, so our source data contains lots of Unicode characters. Neither the encoding nor the format of the files can be changed.

The issue I'm seeing quite frequently is that these Unicode characters are not displayed correctly via the Spark interpreter. The problem also causes the tab delimiter to be escaped, ultimately shifting subsequent columns to the left.

A prime example is the euro symbol, U+20AC (€). The symbol displays fine when the file is opened in Notepad++, vi or pretty much any Unicode-capable editor.

However, when displayed in a DataFrame I see "¥" instead. I thought this might be a problem with the way our application encodes files, but it seems to affect any UTF-16LE file created on Windows. I can reproduce it every single time by typing the euro symbol into Windows Notepad, saving the file with UTF-16 encoding, and loading it into Databricks.

This is causing us real problems - can anyone help?

Sample code:

val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\\t")
  .option("endian", "little")
  .option("encoding", "UTF-16")
  .option("charset", "UTF-16")
  .option("timestampFormat", "yyyy-MM-dd hh:mm:ss")
  .option("codec", "gzip")
  .option("sep", "\t")
  .csv("mnt/adls/test/cu100.gz")

display(df)

It seems like it might be a problem with the csv connector, because:

val test = Seq("€")
val t = test.toDF
display(t)

works absolutely fine.

4 Answers

Answer by fish_databricks · Dec 12, 2018 at 12:19 AM

Hi @Dominic Robinson, my colleague tells me that the CSV source should support UTF-16LE and UTF-16BE, but not plain UTF-16. It may be helpful to look at the test suite for the CSV source; it has simple examples of what is and isn't possible. It sounds like your case should be covered by UTF-16LE, so you may want to verify that there isn't a discrepancy caused by creating the file in Windows. If I recall correctly, Windows formats text files slightly differently than Unix/Mac does (for example, CRLF line endings and a byte-order mark).

Side note, you should not use "com.databricks.spark.csv" anymore. Spark has a built-in csv data source as of Spark 2.0 and the Databricks package is no longer updated.
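Putting those two suggestions together, a rough sketch (your example path reused; the options are illustrative, not a confirmed fix):

val df = spark.read                    // built-in csv source, no format("com.databricks.spark.csv")
  .option("header", "true")
  .option("sep", "\t")
  .option("encoding", "UTF-16LE")      // name the byte order explicitly instead of plain UTF-16
  .csv("mnt/adls/test/cu100.gz")       // gzipped text is decompressed transparently
display(df)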


Answer by Dominic Robinson · Dec 12, 2018 at 10:32 AM

It can't read a simple one-column text file with the euro symbol. It doesn't seem to be a Windows encoding issue either, as I've written a file using vi on Fedora:

Here is a very simple example file:

https://codiad.dcrdev.com/workspace/Workbin/test1.txt


Answer by fish_databricks · Dec 12, 2018 at 10:04 PM

Hi @Dominic Robinson, I'm unable to create a simple reproduction of this issue. I was able to write out a file with the euro symbol as the column using dataframe.write.csv(path), and the symbol was fine when I read the file back in using spark.read.csv(path). I think you are correct that the problem is the interaction between the csv source and whatever is producing your files.
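Roughly along these lines (a sketch; the path and column name are illustrative):

import spark.implicits._   // pre-imported in Databricks notebooks

// Write a single-column DataFrame containing the euro symbol with the built-in csv writer...
Seq("€").toDF("value")
  .write.mode("overwrite")
  .option("header", "true")
  .csv("/tmp/euro-roundtrip")

// ...then read it straight back and check the symbol survives.
val roundtrip = spark.read
  .option("header", "true")
  .csv("/tmp/euro-roundtrip")
display(roundtrip)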

Did you try this out with the built-in csv source yet?

If you are continuing to have problems, please raise a support ticket with Databricks. It could be a bug, or it could be that your particular use case is unsupported and support could be added to the csv source by Databricks.


Answer by fish_databricks · Dec 12, 2018 at 10:05 PM

You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.
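For example, a rough sketch of that idea, assuming the files are uncompressed (it uses binaryFiles to get at the raw bytes rather than textFile, and the path and column handling are illustrative):

import java.nio.charset.StandardCharsets
import spark.implicits._   // pre-imported in Databricks notebooks

// Grab the raw bytes and decode them as UTF-16LE ourselves, bypassing the csv
// source's charset handling; strip a leading BOM and split lines and columns manually.
val rows = sc.binaryFiles("/mnt/adls/test/*.txt")
  .flatMap { case (_, stream) =>
    val text = new String(stream.toArray, StandardCharsets.UTF_16LE)
    text.stripPrefix("\uFEFF").split("\r?\n")
  }
  .map(_.split("\t", -1))               // -1 keeps trailing empty columns

val df = rows.toDF("columns")           // one array column; split into named columns as needed
display(df)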

