Pandas Dataframes generally have an "index", one column of a dataset that gives the name for each row. It works like a primary key in a database table. But Pandas also supports a MultiIndex, in which the index for a row is some composite key of several columns. It's quite confusing at first, here's a simple demo of creating a multi-indexed DataFrame and then querying subsets with various syntax.
import pandas, io
data = io.StringIO('''Fruit,Color,Count,Price
Apple,Red,3,$1.29
Apple,Green,9,$0.99
Pear,Red,25,$2.59
Pear,Green,26,$2.79
Lime,Green,99,$0.39
''')
df_unindexed = pandas.read_csv(data)
df_unindexed
Note, set_index()
creates a new DataFrame
df = df_unindexed.set_index(['Fruit', 'Color'])
df
.xs()
¶This works pretty simply, but the resulting DataFrames no longer have the multi-index. Also .xs()
is not the most powerful way to subset a DataFrame.
Find all Apples
df.xs('Apple')
Find all red fruits
df.xs('Red', level='Color')
.loc[]
for single values¶.loc[]
is the preferred way to subset a DataFrame based on labels.
Find all rows with the label "Apple". Extract all columns
df.loc['Apple', :] # the , : means "all columns"; we could name a subset of columns here
Find all red apples, using a tuple
df.loc[('Apple', 'Red'), :]
Find all apples using a tuple. This prints a warning in Pandas 0.18.1, see below
df.loc[('Apple', ), :]
Unfortunately we've been cheating with our MultiIndex all along. Pandas really wants your DataFrame to be sorted if you are doing complicated queries with a MultiIndex. Slice queries require sorting. This is documented as the need for sortedness with MultiIndex. Sometimes this is called "lexsorting", because lexicographic sorting of the index keys is common.
If your DataFrame is not sorted, you will see several possible errors or warnings.
PerformanceWarning: indexing past lexsort depth may impact performance.
KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'
The simple solution is to sort the DataFrame using sortlevel(), which sorts lexicographically. If you don't like that sorting order, more control is available with sort_index(). I believe all that Pandas cares about is that your DataFrame was sorted; it tracks whether that happened with one flag per MultiIndex level. It doesn't care what order they were actually sorted in, although presumably it affects the semantics of label slicing.
Anyway, let's sort our DataFrame.
df.sortlevel(inplace=True)
Note that the dataframe is now reordered, with Lime in the middle and Green before Red.
df
Now querying with a tuple won't print a warning about lexsort
df.loc[('Apple', ), :]
We can uses slices to find ranges of things in our DataFrame. Note the full MultiIndex is preserved in the result sets, which is nice.
Slicing with index labels is a bit weird. As the docs say, "contrary to usual python slices, both the start and the stop are included!" Also the order implied by slicing is a bit ambiguous (see discussion above about Lexsorting).
Find all Apples with an explicit slice construction
df.loc[slice('Apple', 'Apple'), :]
Find everything between Limes and Pears
df.loc[slice('Lime', 'Pear'), :]
So far we've only been querying by the first level of the index. How do we query by the second, the color?
First, let's make the "all apples" a bit more explicit, using a tuple of two slices.
df.loc[(slice('Apple', 'Apple'), slice(None)), :]
slice(None)
is a wildcard of sorts. How about all red fruits?
df.loc[(slice(None), slice('Red', 'Red')), :]
Typing slice()
a lot is tedious, so Pandas has a (poorly documented) helper called IndexSlice
that allows for some syntactic sugar. Here's all red fruits using that syntax.
idx = pandas.IndexSlice
df.loc[idx[:,['Red']], :]
And finally all green Limes and Pears
df.loc[idx['Lime':'Pear','Green'],:]
Slicing DataFrames can get quite complex. For more examples, see the Pandas docs