There are several ways to plot data in Databricks. This blog post focuses on plotting with PySpark. If you’re working in R, you can either switch the language of that particular cell in your notebook to Python, or use the methods available for R. However, I will not go deeper into R in this post.
But before you start:
Are the data types correct for plotting your data?
Plotting strings is not convenient. My previous blog post about Databricks explains how you can automatically detect the data types of your variables while importing the data.
Don’t forget to create your dataframe in Python, just in case your notebook is in R.
Dataframes don’t get transferred when you switch from Python to R and back.
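To check the data types before plotting, you can inspect the schema of the dataframe and cast the columns that were read as strings. The block below is only a minimal sketch: string_columns and cast_columns are hypothetical helper names chosen for this illustration, df is assumed to be a PySpark dataframe that already exists in your notebook, and the column name "price" is made up.

```python
def string_columns(dtypes):
    """Return the names of all columns whose Spark type is 'string'.

    `dtypes` is a list of (column name, type name) pairs, which is the
    shape that a PySpark dataframe exposes through `df.dtypes`.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def cast_columns(df, columns, target_type="double"):
    """Cast the given columns of a PySpark dataframe to `target_type`."""
    for name in columns:
        df = df.withColumn(name, df[name].cast(target_type))
    return df


# Assumed usage in a notebook where `df` is a PySpark dataframe:
# df.printSchema()                      # print column names and types
# suspicious = string_columns(df.dtypes)
# df = cast_columns(df, ["price"])      # "price" is a hypothetical column
```

Values that cannot be converted become null after the cast, so it is worth looking at the result before you plot it.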
The display method
One of the quickest and easiest ways to create your plot in Databricks is the display method. When you create a dataframe df, you can call: display(df). Initially, you’ll see a table with a part of the rows and columns of your dataset. However, below the table, you’ll find some icons.
Click on the first icon, and you’ll get a table.
Click on the second icon, and you’ll get a plot.
Click on the third icon, and you’ll get a download.
Now assume you just clicked on the second icon. A “Plot Options” button then becomes available:
When you click on it, you’ll see the plot settings (I loaded an example dataset):
All you have to do now is drag and drop the fields you want to plot to “Keys”, optionally to “Series groupings”, and definitely to “Values”. A sample plot will then show up.
Don’t forget to choose the aggregation method (sum, avg, …) and the type of plot (line chart, bar plot, histogram, …). Also be sure to click “apply” and to run the cell in your notebook. Only then will you get the full plot rather than a sample.
You can change the size of the plot in the results panel of the cell by dragging the arrow in the bottom right corner.
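If you want to try the display method without your own dataset, a small dataframe is enough. The sketch below is an assumption-laden example: build_example_df is a hypothetical helper, the rows and column names are invented, and spark and display are the objects that a Databricks notebook provides by default.

```python
# Invented example data: one row per month and region.
ROWS = [
    ("2024-01", "north", 120.0),
    ("2024-01", "south", 95.5),
    ("2024-02", "north", 130.0),
    ("2024-02", "south", 101.0),
]
COLUMNS = ["month", "region", "sales"]


def build_example_df(spark):
    """Create a small PySpark dataframe from the invented rows above."""
    return spark.createDataFrame(ROWS, schema=COLUMNS)


# Assumed usage inside a Databricks notebook, where `spark` and `display`
# are predefined:
# df = build_example_df(spark)
# display(df)
```

With this dataframe, you could drag month to “Keys”, region to “Series groupings” and sales to “Values”.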
The matplotlib library
The matplotlib library allows you to customize plots more than the display method does. It’s also easier to use when you want to plot two (numpy) arrays as X and y, because you don’t have to turn them into a dataframe first. The code snippet below shows you how to plot using matplotlib:
So first you import the matplotlib library. Then you call plt.subplots() and assign its two return values to fig and ax. fig contains the full figure, while ax lets you plot the data in the format you like (line, bar, scatter, pie, …). For example, ax.plot draws a line, whereas ax.scatter draws a scatter plot. Using ax you can also set the limits, ticks and tick labels of the x- and y-axes. For the x-axis, for instance, use ax.set_xlim(), ax.set_xticks() and ax.set_xticklabels().
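The steps just described can be sketched as follows. This is a minimal illustration rather than the original snippet: the arrays x and y, the figure size and the tick positions are all made-up example values.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up example data: one period of a sine wave.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# fig is the whole figure, ax is the area the data is drawn on.
fig, ax = plt.subplots(figsize=(8, 4))

# A line through all points, plus a scatter of every tenth point.
ax.plot(x, y)
ax.scatter(x[::10], y[::10])

# Limits, ticks and tick labels of the x-axis, and limits of the y-axis.
ax.set_xlim(0, 2 * np.pi)
ax.set_xticks([0, np.pi, 2 * np.pi])
ax.set_xticklabels(["0", "pi", "2 pi"])
ax.set_ylim(-1.2, 1.2)
```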
As you can see, you can customize the plot quite a lot using matplotlib. There are even more ways to customize a plot, but that would lead us a bit too far. Look for examples here.
Now, when you want to change the axis labels, the title or the legend of the plot, you can call the corresponding function, for instance plt.title(). Calling plt.legend(loc="best") lets matplotlib place the legend where it overlaps least with the rest of the plot.
Finally, you’ll still need to call the display method to show the plot in the results panel.
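Put together, labelling and showing a plot could look like the sketch below. The data, the label texts and the title are invented for this example, and display is assumed to be the function that a Databricks notebook provides, so that call is left as a comment.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up example data.
x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()

# The label of each line is what shows up in the legend.
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")

# Title, axis labels and legend for the current plot.
plt.title("Sine and cosine")
plt.xlabel("x")
plt.ylabel("value")
plt.legend(loc="best")  # matplotlib picks the position of the legend

# Assumed final step inside a Databricks notebook:
# display(fig)
```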
And what about ggplot?
I think matplotlib and the display method are the most straightforward to use. So I’m not a big fan of ggplot in Databricks, and I’d rather not pretend that I know much about it. However, those who want more info about ggplot can take a look at this notebook.
This blog was originally posted here.