Wednesday, August 27, 2014

Visualizing electricity prices with Plotly

We have already mentioned plotly many times (here are other two posts about it) and this time we'll see how to use it in order to build an interactive visualization of the latest data about the domestic electricity prices provided by International Energy Agency (IEA).

In the chart that we are going to make, we will show the prices of the domestic electricity among the countries monitored by IEA in 2013 with a bar chart where each bar shows the electricity price and the fraction of the price represented by the taxes.

First, we import the data (the full data is available here, in this post we'll use only the Table 5.5.1 in cvs format) using pandas:
import pandas as pd
ieaprices = pd.read_csv('iea_prices.csv',
                        na_values=('..','+','-','+/-'))
ieaprices = ieaprices.dropna()
ieaprices.set_index(['Country'],inplace=True)
countries = ieaprices.sort('2013_with_tax').index
Then, we arrange the data in order create a plotly bar chart:
from plotly.graph_objs import Bar,Data,Layout,Figure
from plotly.graph_objs import XAxis,YAxis,Marker,Scatter,Legend

prices_bars = []

# computing the taxes
taxes = ieaprices['2013_with_tax']-ieaprices['2013_no_tax']

# adding the prices to the chart
prices_bars.append(Bar(x=countries.values, 
             y=ieaprices['2013_no_tax'].ix[countries].values,
             marker=Marker(color='#0074D9'),
             name='price without taxes'))

# adding the taxes to the chart
prices_bars.append(Bar(x=countries.values, 
             y=taxes.ix[countries].values,
             marker=Marker(color='#0099D9'),name='taxes'))
And now we are ready to submit the data to the plotly server to render the chart:
import plotly.plotly as py

py.sign_in("SexyUser", "asexykeyforasexyuser")

meadian_line = Scatter(
    x=countries.values,
    y=np.ones(len(countries))*ieaprices['2013_with_tax'].median(),
    marker=Marker(color='rgb(40, 40, 40)'),
    opacity=0.5,
    mode='lines',
    name='Median')

data = Data(prices_bars+[meadian_line])

layout = Layout(
    title='Domestic electricity prices in the IEA in 2013',
    xaxis=XAxis(type='category'),
    yaxis=YAxis(title='Price (Pence per Kwh)'),
    legend=Legend(x=0.0,y=1.0),
    barmode='stack',
    hovermode='closest')

fig = Figure(data=data, layout=layout)

# this line will work only in ipython
# use py.plot() in other environments
plot_url = py.iplot(fig, filename='ieaprices2013') 
The result should look like this:

Looking at the chart we note that, during 2013, the average domestic electricity prices, including taxes, in Denmark and Germany were the highest in the IEA. We also note that in Denmark the fraction of taxes paid is higher than the actual electricity price whereas in Germany the actual electricity price and the taxes are almost the same. Interestingly, USA has the lowest price and the lowest taxation.

This post shows how to create one of the charts commented here, where a more insights about the IEA data are provided.

Wednesday, August 20, 2014

Quick HDF5 with Pandas

HDF5 is a format designed to store large numerical arrays of homogenous type. It cames particularly handy when you need to organize your data models in a hierarchical fashion and you also need a fast way to retrieve the data. Pandas implements a quick and intuitive interface for this format and in this post will shortly introduce how it works.

We can create a HDF5 file using the HDFStore class provided by Pandas:
import numpy as np
from pandas import HDFStore,DataFrame
# create (or open) an hdf5 file and opens in append mode
hdf = HDFStore('storage.h5')
Now we can store a dataset into the file we just created:
df = DataFrame(np.random.rand(5,3), columns=('A','B','C'))
# put the dataset in the storage
hdf.put('d1', df, format='table', data_columns=True)
The structure used to represent the hdf file in Python is a dictionary and we can access to our data using the name of the dataset as key:
print hdf['d1'].shape
(5, 3)
The data in the storage can be manipulated. For example, we can append new data to the dataset we just created:
hdf.append('d1', DataFrame(np.random.rand(5,3), 
           columns=('A','B','C')), 
           format='table', data_columns=True)
hdf.close() # closes the file
There are many ways to open a hdf5 storage, we could use again the constructor of the class HDFStorage, but the function read_hdf makes us also able to query the data:
from pandas import read_hdf
# this query selects the columns A and B
# where the values of A is greather than 0.5
hdf = read_hdf('storage.h5', 'd1',
               where=['A>.5'], columns=['A','B'])
At this point, we have a storage which contains a single dataset. The structure of the storage can be organized using groups. In the following example we add three different datasets to the hdf5 file, two in the same group and another one in a different one:
hdf = HDFStore('storage.h5')
hdf.put('tables/t1', DataFrame(np.random.rand(20,5)))
hdf.put('tables/t2', DataFrame(np.random.rand(10,3)))
hdf.put('new_tables/t1', DataFrame(np.random.rand(15,2)))
Our hdf5 storage now looks like this:
print hdf

File path: storage.h5
/d1             frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[A,B,C])
/new_tables/t1  frame        (shape->[15,2])                                                   
/tables/t1      frame        (shape->[20,5])                                                   
/tables/t2      frame        (shape->[10,3])  
On the left we can see the hierarchy of the groups added to the storage, in the middle we have the type of dataset and on the right there is the list of attributes attached to the dataset. Attributes are pieces of metadata you can stick on objects in the file and the attributes we see here are automatically created by Pandas in order to describe the information required to recover the data from the hdf5 storage system.