A demonstration of simple uses of MultiIndex¶

Pandas Dataframes generally have an "index", one column of a dataset that gives the name for each row. It works like a primary key in a database table. But Pandas also supports a MultiIndex, in which the index for a row is some composite key of several columns. It's quite confusing at first, here's a simple demo of creating a multi-indexed DataFrame and then querying subsets with various syntax.

import pandas, io

Create an unindexed DataFrame from a CSV file¶

data = io.StringIO('''Fruit,Color,Count,Price
Apple,Red,3,$1.29
Apple,Green,9,$0.99
Pear,Red,25,$2.59
Pear,Green,26,$2.79
Lime,Green,99,$0.39
''')
df_unindexed = pandas.read_csv(data)
df_unindexed

Add a multi-index based on two columns¶

Note, set_index() creates a new DataFrame

df = df_unindexed.set_index(['Fruit', 'Color'])
df

Querying the data frame with `.xs()`¶

This works pretty simply, but the resulting DataFrames no longer have the multi-index. Also .xs() is not the most powerful way to subset a DataFrame.

Find all Apples

df.xs('Apple')

Find all red fruits

df.xs('Red', level='Color')

Querying the data frame with `.loc[]` for single values¶

.loc[] is the preferred way to subset a DataFrame based on labels.

Find all rows with the label "Apple". Extract all columns

df.loc['Apple', :]        # the , : means "all columns"; we could name a subset of columns here

Find all red apples, using a tuple

df.loc[('Apple', 'Red'), :]

Count        3
Price    $1.29
Name: (Apple, Red), dtype: object

Find all apples using a tuple. This prints a warning in Pandas 0.18.1, see below

df.loc[('Apple', ), :]

/home/nelson/src/lol-winrate-by-tier/venv/lib/python3.4/site-packages/pandas/core/indexing.py:1294: PerformanceWarning: indexing past lexsort depth may impact performance.
  return self._getitem_tuple(key)

Lexsorting¶

Unfortunately we've been cheating with our MultiIndex all along. Pandas really wants your DataFrame to be sorted if you are doing complicated queries with a MultiIndex. Slice queries require sorting. This is documented as the need for sortedness with MultiIndex. Sometimes this is called "lexsorting", because lexicographic sorting of the index keys is common.

If your DataFrame is not sorted, you will see several possible errors or warnings.

PerformanceWarning: indexing past lexsort depth may impact performance.
KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'

The simple solution is to sort the DataFrame using sortlevel(), which sorts lexicographically. If you don't like that sorting order, more control is available with sort_index(). I believe all that Pandas cares about is that your DataFrame was sorted; it tracks whether that happened with one flag per MultiIndex level. It doesn't care what order they were actually sorted in, although presumably it affects the semantics of label slicing.

Anyway, let's sort our DataFrame.

df.sortlevel(inplace=True)

💥 Ta-Da!💥¶

Note that the dataframe is now reordered, with Lime in the middle and Green before Red.

df

Now querying with a tuple won't print a warning about lexsort

df.loc[('Apple', ), :]

Querying the data frame with .loc[] for ranges of values¶

We can uses slices to find ranges of things in our DataFrame. Note the full MultiIndex is preserved in the result sets, which is nice.

Slicing with index labels is a bit weird. As the docs say, "contrary to usual python slices, both the start and the stop are included!" Also the order implied by slicing is a bit ambiguous (see discussion above about Lexsorting).

Find all Apples with an explicit slice construction

df.loc[slice('Apple', 'Apple'), :]

Find everything between Limes and Pears

df.loc[slice('Lime', 'Pear'), :]

So far we've only been querying by the first level of the index. How do we query by the second, the color?
First, let's make the "all apples" a bit more explicit, using a tuple of two slices.

df.loc[(slice('Apple', 'Apple'), slice(None)), :]

slice(None) is a wildcard of sorts. How about all red fruits?

df.loc[(slice(None), slice('Red', 'Red')), :]

Typing slice() a lot is tedious, so Pandas has a (poorly documented) helper called IndexSlice that allows for some syntactic sugar. Here's all red fruits using that syntax.

idx = pandas.IndexSlice

df.loc[idx[:,['Red']], :]

And finally all green Limes and Pears

df.loc[idx['Lime':'Pear','Green'],:]

Slicing DataFrames can get quite complex. For more examples, see the Pandas docs

	Fruit	Color	Count	Price
0	Apple	Red	3	$1.29
1	Apple	Green	9	$0.99
2	Pear	Red	25	$2.59
3	Pear	Green	26	$2.79
4	Lime	Green	99	$0.39

		Count	Price
Fruit	Color
Apple	Red	3	$1.29
Apple	Green	9	$0.99
Pear	Red	25	$2.59
Pear	Green	26	$2.79
Lime	Green	99	$0.39