A while ago I started working with Databricks, which can be accessed from inside Microsoft Azure. This blog post is about importing data from a Blob storage: what can go right, what can go wrong, and how to solve it.
How to import data from a Blob storage
I’ll keep it short this time, because the video below speaks for itself. But for those who’d rather read written instructions: let me do you a favor.
- Launch the Databricks workspace in the Azure Portal.
- In Databricks, go to “Data”.
- Click on the plus sign next to “Tables”.
- Under “Create new table”, select “Spark Data Sources” and check “Azure Blob Storage”.
- Click “Create Table in Notebook”. This launches a ready-to-use notebook for you.
- Fill in the right parameters in the notebook. See the section “Pitfalls” for a more elaborate explanation.
- Run the notebook (or just the cells you need), and voilà, your table is there!
- Once you have written your DataFrame to a table in the Databricks Filestore (there is a cell for this in the notebook), you can find it by going to “Data” -> “Tables”. See the sketch right after this list.
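For that last step, a minimal PySpark sketch of what the notebook does (the table name and format here are placeholders; `df` is the DataFrame the generated notebook creates for you):

```python
# In a Databricks Python notebook, `spark` and `display` are already available.
# `df` is the DataFrame created in the cells earlier in the generated notebook;
# "my_table" is just a placeholder name.

# Writing the DataFrame as a table registers it in the metastore,
# so it shows up under "Data" -> "Tables" in the workspace.
df.write.format("parquet").saveAsTable("my_table")

# From then on you can query it from any notebook attached to the cluster.
display(spark.sql("SELECT * FROM my_table LIMIT 10"))
```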
Pitfalls
1) When importing data from a Blob storage, fill in the right parameters in the ready-to-use Python notebook.
It seems like very straightforward advice, but in reality it isn’t always that easy. The Databricks documentation sometimes assumes knowledge that isn’t there yet, especially when you’re working with Databricks and Blob storage for the first time. To summarize:
- STORAGE_ACCOUNT_NAME: In Azure, go to your Blob storage. What’s its name? Exactly, that’s what you should enter here.
- YOUR_ACCESS_KEY: In Azure, go to your Blob storage, then at the left side you will see a column with properties, overview, activity logging, etc. In this column you will also see a section “Access keys”. There you find the primary and secondary access key; in theory, you should be able to use either one.
- wasbs://example/location: This is the path where your file can be found in the Blob storage. Typically it has the following form: wasbs://blob_container@account_name.blob.core.windows.net/. But what is the blob container? Well, it’s part of the file structure of the Blob storage: you create a container, and your files live inside that container. The account name is the same one described above.
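Put together, the cells in which you fill in these parameters look roughly like this (the account name, container name, key and file path below are placeholders, not your actual values):

```python
# Placeholders: replace these with your own storage account, container and key.
storage_account_name = "mystorageaccount"
container_name = "mycontainer"
access_key = "<YOUR_ACCESS_KEY>"

# Hand the access key to Spark for this specific storage account.
spark.conf.set(
    "fs.azure.account.key.{}.blob.core.windows.net".format(storage_account_name),
    access_key,
)

# The wasbs path: container first, then the account, then the folder/file inside the container.
file_location = "wasbs://{}@{}.blob.core.windows.net/myfolder/myfile.csv".format(
    container_name, storage_account_name
)
```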
2) Don’t try to manipulate data that’s already written to a table.
Work with the DataFrame (df) itself, which is created when you run the Python notebook commands that precede the writing of the table. Otherwise you will face issues: for instance, you can no longer change the datatype of a column, and you know, plotting strings is not very convenient. Sometimes the solution is that simple, but as every coder knows, it can take you ages to find it.
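For example, changing the datatype of a column is a one-liner on the DataFrame, as long as you do it before the table is written (the column name below is made up):

```python
from pyspark.sql.functions import col

# Cast a column to the right datatype on the DataFrame itself,
# *before* it gets written to a table.
df = df.withColumn("amount", col("amount").cast("double"))
df.printSchema()
```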
3) Select inferSchema = True if you want the datatypes in the DataFrame to be detected automatically.
If you don’t, you might end up with strings, which you’ll have to convert later in order to plot and do calculations with the data. It’s more convenient to have this done automatically for you. Sometimes it might convert dates to integers, but at least the majority of the datatypes in the DataFrame will be correct.
Please find the typical line of code in which you set inferSchema to true below:
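Something along these lines, assuming a .csv file with a header and a comma as delimiter (the file location is a placeholder):

```python
# file_location is the wasbs:// path described under pitfall 1.
file_location = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/myfolder/myfile.csv"

df = (
    spark.read.format("csv")
    .option("inferSchema", "true")   # detect the datatypes automatically
    .option("header", "true")        # the first line contains the column names
    .option("delimiter", ",")        # the field separator used in the .csv
    .load(file_location)
)
```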
In this line of code you can also specify options, like whether there is a header, or which delimiter is used in your .csv file. It’s basically the line of code where you create your DataFrame. Please mind that a DataFrame is something different from a table: a table is stored in the Filestore, and it’s harder to change things like datatypes in a table than in a DataFrame.
4) Last but not least: if you want to start working with the data in Python or R inside Databricks, mind that the PySpark and SparkR packages are used.
But what is PySpark? Well, according to DataCamp it’s the following:
Spark is a tool for doing parallel computation with large datasets and it integrates well with Python. PySpark is the Python package that makes the magic happen.
And SparkR? The Spark website gave me the answer:
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 1.6.0, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.
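In practice this means you operate on Spark DataFrames rather than on plain pandas or R data frames. A tiny PySpark sketch of the kind of operations the quote mentions (the column names are made up):

```python
from pyspark.sql import functions as F

# Selection, filtering and aggregation on a Spark DataFrame,
# analogous to what the SparkR quote describes for R.
result = (
    df.select("country", "amount")
    .filter(F.col("amount") > 100)
    .groupBy("country")
    .agg(F.avg("amount").alias("avg_amount"))
)
display(result)
```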
That’s it for today, good luck with your Databricks projects!
This blog was originally posted here.