How do you read a csv file from a mounted S3?
The code below returns a 'file does not exist' error:
df = pd.read_csv('/dbfs/my_bucket_name/filename_and_cluster.csv')
How do I get pandas to read the csv?
FYI, the line below gives no error:
csv = sc.textFile("dbfs://my_bucket_name/filename_and_cluster.csv")
Thanks. I'm very new to Databricks. Any help is appreciated.
Answer by bill · Sep 14, 2016 at 11:54 PM
Good question! You don't access the data via the bucket name but rather via the mount name. This means that once you mount the bucket into DBFS, you access it at the mount path.
For example:
ACCESS_KEY = "YOUR_ACCESS_KEY"
# Encode the secret key, as it can contain "/"
SECRET_KEY = "YOUR_SECRET_KEY".replace("/", "%2F")
AWS_BUCKET_NAME = "MY_BUCKET"
MOUNT_NAME = "MOUNT_NAME"
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
then you would run
rdd = sc.textFile("/mnt/%s/...." % MOUNT_NAME)
to get an RDD. You should also be able to read the file into a pandas DataFrame via pd.read_csv("/dbfs/mnt/%s/" % MOUNT_NAME) etc.
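As a quick sanity check, here is a minimal sketch (MOUNT_NAME and the file name are just the placeholders used in this thread): list the mount from Spark, then read the same file with pandas through the /dbfs FUSE prefix.
# Confirm the mount is visible to DBFS
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
# pandas is a local file API, so it reads through the /dbfs FUSE mount
import pandas as pd
df = pd.read_csv("/dbfs/mnt/%s/filename_and_cluster.csv" % MOUNT_NAME)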
Answer by grahama · Sep 15, 2016 at 12:59 AM
Thanks for responding :)
My S3 mount creds are working; the bucket mounts fine when I run the code below.
Yet this pandas code:
df = pd.read_csv('/s3bucket/filename_and_cluster.csv')
gives me:
IOError: File /s3bucket/filename_and_cluster.csv does not exist
What am I doing incorrectly?
Thanks!
# This works
ACCESS_KEY = "my access key code"
SECRET_KEY = "my secret key code"
ENCODED_SECRET_KEY = urllib.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "my aws bucket name"
MOUNT_NAME = "s3bucket"
dbutils.fs.mount("s3n://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
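For what it's worth, a quick way to check what the mount contains (assuming the mount call above succeeded) would be:
# List the DBFS mount point created above
display(dbutils.fs.ls("/mnt/s3bucket"))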
Answer by ssmaroju · 5 days ago
If you already have a mount point name, you can use the code below:
MOUNT_NAME = "MEANINGFUL-MOUNT-NAME"
import pandas as pd
pandas_df = pd.read_csv("/dbfs/mnt/%s/filename_and_cluster.csv" % MOUNT_NAME)
However, if you have to access an S3 bucket with pandas and don't yet have a mount, I would create a mount point and access it as below:
import urllib
import pandas as pd
ACCESS_KEY = "YOUR-ACCESS-KEY"
SECRET_KEY = "YOUR-SECRET-KEY"
ENCODED_SECRET_KEY = urllib.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "YOUR-AWS-BUCKET-NAME"
MOUNT_NAME = "MEANINGFUL-MOUNT-NAME"
dbutils.fs.mount("s3n://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
pandas_df = pd.read_csv("/dbfs/mnt/%s/filename_and_cluster.csv" % MOUNT_NAME)
The only tricky part of the path when calling pandas read_csv is that you must provide an absolute local path starting with '/dbfs/', and NOT the DBFS URI form ('dbfs:/').
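To make the distinction concrete, here is a small sketch (the mount and file names are the placeholders from above):
# Spark APIs understand the DBFS URI scheme
rdd = sc.textFile("dbfs:/mnt/%s/filename_and_cluster.csv" % MOUNT_NAME)
# Local file APIs such as pandas use the absolute FUSE path under /dbfs
pandas_df = pd.read_csv("/dbfs/mnt/%s/filename_and_cluster.csv" % MOUNT_NAME)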