Data Manipulation

Taken from Chapter 3 of
Python Data Science Handbook,
by Jake VanderPlas, O'Reilly Media, Inc 2016

Dealing with NaN and None

Finding and removing
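
A minimal sketch of finding and removing missing values, using a small made-up Series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# In a float Series, None is converted to NaN
data = pd.Series([1.0, np.nan, 3.0, None])

missing = data.isnull()   # boolean mask: True where a value is missing
cleaned = data.dropna()   # drop the missing entries
filled = data.fillna(0)   # or replace them with a fill value
```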

SORTED AND UNSORTED INDICES

Some operations require that the indices be sorted
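
As a sketch, slicing a MultiIndex by label is one such operation. The data below is made up, and sorting first with sort_index() keeps the slice safe:

```python
import numpy as np
import pandas as pd

# The outer level is deliberately out of order: 'c' comes before 'b'
index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
data = pd.Series(np.arange(6), index=index)

was_sorted = data.index.is_monotonic_increasing   # False for this index

# Sort the index, then slice by label
data = data.sort_index()
subset = data.loc["a":"b"]
```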

Hierarchical indices and columns

A MultiIndex lets a Series or DataFrame represent higher-dimensional data
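
A small sketch with placeholder labels and values, showing a two-level index on a Series and how unstack() turns it into a conventional 2-D DataFrame:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2]], names=["outer", "inner"]
)
ser = pd.Series([10, 20, 30, 40], index=index)   # placeholder values

only_b = ser.loc["b"]        # partial indexing on the outer level
inner_2 = ser.loc[:, 2]      # select across the inner level
wide = ser.unstack()         # inner level becomes the columns
```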

Data Aggregations on Multi-Indices
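
One way to aggregate over a single level of a MultiIndex is groupby(level=...). The column name and values here are illustrative assumptions:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[2013, 2014], [1, 2]], names=["year", "visit"]
)
data = pd.DataFrame({"HR": [30.0, 40.0, 50.0, 60.0]}, index=index)

# Average over visits within each year
by_year = data.groupby(level="year").mean()
```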

Combining Datasets: Concat and Append

Combining Series is straightforward
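
A minimal sketch with pd.concat. Note that the append method was removed in pandas 2.0, so pd.concat is the call to rely on in current versions:

```python
import pandas as pd

ser1 = pd.Series(["A", "B", "C"], index=[1, 2, 3])
ser2 = pd.Series(["D", "E", "F"], index=[4, 5, 6])

# Stack the two Series end to end, keeping their index labels
combined = pd.concat([ser1, ser2])
```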

Helper functions

Used to keep slides small
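
The helper itself is not shown here. A sketch modeled on the book's make_df function, which builds a small DataFrame whose cells are named after their column and row, might look like this:

```python
import pandas as pd

def make_df(cols, ind):
    """Build a DataFrame whose cells combine the column name and the index label."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)

df = make_df("ABC", range(3))
```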

Merge

1-1
1-many
many-many

One-One
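
A sketch of a one-to-one merge, with small example tables modeled on the book's. Each employee appears exactly once in both tables:

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                    "hire_date": [2004, 2008, 2012, 2014]})

# pd.merge finds the shared 'employee' column and uses it as the key
df3 = pd.merge(df1, df2)
```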

Many-One
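
A sketch of a many-to-one merge, again with illustrative tables. Several employees share a group, and each group has one supervisor, so the supervisor is repeated as needed:

```python
import pandas as pd

employees = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                          "group": ["Accounting", "Engineering", "Engineering", "HR"]})
supervisors = pd.DataFrame({"group": ["Accounting", "Engineering", "HR"],
                            "supervisor": ["Carly", "Guido", "Steve"]})

merged = pd.merge(employees, supervisors)
```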

Many-Many
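
A sketch of a many-to-many merge with illustrative tables. The key column has duplicates on both sides, so every matching pair of rows appears in the result:

```python
import pandas as pd

employees = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                          "group": ["Accounting", "Engineering", "Engineering", "HR"]})
skills = pd.DataFrame({"group": ["Accounting", "Accounting", "Engineering",
                                 "Engineering", "HR", "HR"],
                       "skills": ["math", "spreadsheets", "coding",
                                  "linux", "spreadsheets", "organization"]})

# Each employee is paired with every skill listed for their group
merged = pd.merge(employees, skills)
bob_skills = set(merged.loc[merged["employee"] == "Bob", "skills"])
```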

On

Specify which columns or rows to merge on
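
A short sketch of naming the key column explicitly with the on keyword (example tables are illustrative). Here it gives the same result as letting pd.merge infer the key:

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                    "hire_date": [2004, 2008, 2012, 2014]})

explicit = pd.merge(df1, df2, on="employee")   # key named explicitly
implicit = pd.merge(df1, df2)                  # key inferred from shared columns
```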

How to handle the case when column names are not the same in DataFrames

How to merge group and salary?

Column names do not match
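
When the key columns have different names, left_on and right_on can be used. This sketch assumes one table calls the key "employee" and the other calls it "name" (values are illustrative):

```python
import pandas as pd

groups = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                       "group": ["Accounting", "Engineering", "Engineering", "HR"]})
salaries = pd.DataFrame({"name": ["Bob", "Jake", "Lisa", "Sue"],
                         "salary": [70000, 80000, 120000, 90000]})

merged = pd.merge(groups, salaries, left_on="employee", right_on="name")

# Both key columns are kept, so drop the redundant one
merged = merged.drop("name", axis=1)
```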

What if multiple column names are the same?

How to specify which column to merge on?
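
A sketch of this situation with illustrative data: both tables have "name" and "rank" columns, the merge is on "name" only, and pandas adds suffixes to the conflicting "rank" columns:

```python
import pandas as pd

df8 = pd.DataFrame({"name": ["Bob", "Jake", "Lisa", "Sue"],
                    "rank": [1, 2, 3, 4]})
df9 = pd.DataFrame({"name": ["Bob", "Jake", "Lisa", "Sue"],
                    "rank": [3, 1, 4, 2]})

default = pd.merge(df8, df9, on="name")                         # rank_x, rank_y
custom = pd.merge(df8, df9, on="name", suffixes=("_L", "_R"))   # rank_L, rank_R
```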

Explain This Merge

First some data

THE LEFT_INDEX AND RIGHT_INDEX KEYWORDS

Join on rows
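
A sketch of merging on the index rather than on a column, using illustrative tables indexed by employee. The join() method is shown as a shorthand that merges on indices by default:

```python
import pandas as pd

df1a = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                     "group": ["Accounting", "Engineering", "Engineering", "HR"]}
                    ).set_index("employee")
df2a = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                     "hire_date": [2004, 2008, 2012, 2014]}
                    ).set_index("employee")

merged = pd.merge(df1a, df2a, left_index=True, right_index=True)
joined = df1a.join(df2a)   # joins on the index by default
```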

First some data

Inner, Outer, Left, Right Joins

Inner join - just data in common
Outer join - all the data
Left join - Left table + common data in right table
Right join - Right table + common data in left table

Swapping the order of the DataFrames turns a left join into a right join

Note only Mary is in common

Inner Join - on Mary
Outer Join - all
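
The four join types can be sketched with the how keyword. The two tables below are modeled on the book's example, where only Mary appears in both:

```python
import pandas as pd

df6 = pd.DataFrame({"name": ["Peter", "Paul", "Mary"],
                    "food": ["fish", "beans", "bread"]})
df7 = pd.DataFrame({"name": ["Mary", "Joseph"],
                    "drink": ["wine", "beer"]})

inner = pd.merge(df6, df7, how="inner")   # only Mary
outer = pd.merge(df6, df7, how="outer")   # everyone, gaps filled with NaN
left = pd.merge(df6, df7, how="left")     # everyone in df6
right = pd.merge(df6, df7, how="right")   # everyone in df7

# Swapping the arguments turns the right join into a left join
swapped = pd.merge(df7, df6, how="left")
```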

query() and eval()

Pandas does most of its work in C code

But sometimes it cannot do a whole computation in one call to C code

So it makes multiple calls to C and stores intermediate results

Allocating memory for those intermediate results slows down the computation

The Problem

mask = (x > 0.5) & (y < 0.5)

is equivalent to

tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2

So we allocate space for tmp1 & tmp2

The Solution

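eval evaluates the whole expression in one pass, without allocating full-sized temporary arrays

So in the book's example eval is about 50% faster and uses less memory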
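
A minimal sketch comparing plain arithmetic with pd.eval on random data. The actual speed-up depends on the data size and on whether the optional numexpr package is installed, and it is not measured here:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(1000, 10)) for _ in range(4))

plain = df1 + df2 + df3 + df4                 # builds a temporary at each step
via_eval = pd.eval("df1 + df2 + df3 + df4")   # expression passed as a string
```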

Supported Operations

+, -, *, /, **, %, //

boolean operations: | (or), & (and), and ~ (not)

and, or, and not with the same semantics as the corresponding bitwise operators

Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.
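
As a sketch on random data, the bitwise operators and the word forms and/or give the same result inside pd.eval:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(100, 3)) for _ in range(4))

bitwise = pd.eval("(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)")
words = pd.eval("(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)")
```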

DataFrame.eval() for Column-Wise Operations
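
A short sketch of DataFrame.eval, where column names can be used directly as variables. The column names A, B, C and the new column D are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(5, 3), columns=["A", "B", "C"])

ratio = df.eval("(A + B) / (C - 1)")   # columns referenced by name

# An assignment inside the expression returns a new DataFrame with column D
df = df.eval("D = (A + B) / C")
```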

Local Python Variables in eval

DataFrame.eval can access the values of local Python variables via the @ prefix
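
A sketch of the @ prefix, which marks a name as a Python variable rather than a column. It is supported by DataFrame.eval and DataFrame.query, not by pandas.eval. The variable name column_mean is illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(5, 3), columns=["A", "B", "C"])

column_mean = df.mean(axis=1)           # an ordinary local Python variable
result = df.eval("A + @column_mean")    # @ looks it up in the local namespace
```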

DataFrame.query() Method
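
A sketch showing that query() expresses a boolean-mask selection as a string (random data, illustrative column names):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(100, 3), columns=["A", "B", "C"])

masked = df[(df.A < 0.5) & (df.B < 0.5)]     # explicit boolean mask
queried = df.query("A < 0.5 and B < 0.5")    # the same filter as a string
```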

query Details

DataFrame.index and DataFrame.columns are in the query namespace

pandas.eval is used to evaluate the expression

The result of the evaluation is first passed to DataFrame.loc
If that fails, it is passed to DataFrame.__getitem__()

Performance: When to Use These Functions

Memory

Computational Time

Expanding it out

x = df[(df.A < 0.5) & (df.B < 0.5)]

same as

tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]
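
To judge whether the temporaries matter, their size can be compared with the size of the data. A rough sketch using nbytes, with an arbitrary frame size chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 3), columns=["A", "B", "C"])

tmp1 = df.A < 0.5      # full-length boolean Series
tmp2 = df.B < 0.5      # another one
tmp3 = tmp1 & tmp2     # and a third

frame_bytes = df.values.nbytes                         # 1000 * 3 float64 values
temp_bytes = tmp1.nbytes + tmp2.nbytes + tmp3.nbytes   # three boolean temporaries
```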

Example - US State Data

Goal - Compute population density and sort

Until you do it

Until you do it a lot

Aggregation and Grouping

Aggregation in Pandas
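
A minimal sketch of the built-in aggregations on a Series and a DataFrame, with made-up values:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

total = ser.sum()                     # one number for a Series
col_means = df.mean()                 # one value per column
row_means = df.mean(axis="columns")   # one value per row
summary = df.describe()               # several common aggregates at once
```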

GroupBy: Split, Apply, Combine

Split

Apply

Combine
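
The three steps can be sketched with a tiny DataFrame modeled on the book's example:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data": range(6)})

grouped = df.groupby("key")   # split: a lazy GroupBy object, nothing computed yet
totals = grouped.sum()        # apply sum to each group, combine into one DataFrame
```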

AGGREGATE, FILTER, TRANSFORM, APPLY

Aggregation

Allows us to apply multiple functions at once
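
A sketch of aggregate() with a list of functions, and with a dictionary that maps columns to functions (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

# Several aggregates for every column
stats = df.groupby("key").aggregate(["min", "median", "max"])

# A different aggregate for each column
per_column = df.groupby("key").aggregate({"data1": "min", "data2": "max"})
```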

Filtering

Drop data based on some criteria

Use a function that returns a boolean
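
A sketch of filter(), keeping only groups whose data2 standard deviation exceeds a threshold. The threshold of 4 and the data are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

def filter_func(group):
    # Return True to keep the whole group, False to drop it
    return group["data2"].std() > 4

kept = df.groupby("key").filter(filter_func)
```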

Transformation

Change the data
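
A sketch of transform(), which returns output with the same shape as the input. Here each value is centered on its group mean (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

# Subtract the group-wise mean from every value
centered = df.groupby("key")[["data1", "data2"]].transform(lambda x: x - x.mean())
```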

apply() method

Apply an arbitrary function to the group results

Function

Combine operation used depends on return type
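
A sketch of apply() with two different return types. The function name norm_by_data2 follows the book's example, the data is illustrative, and group_keys=False is an assumption made here to keep the original row labels:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": range(6),
                   "data2": [5, 0, 3, 3, 7, 9]})

def norm_by_data2(group):
    # Returns a DataFrame: data1 divided by the group's data2 total
    return group.assign(data1=group["data1"] / group["data2"].sum())

cols = ["data1", "data2"]
normalized = df.groupby("key", group_keys=False)[cols].apply(norm_by_data2)

# Returning a scalar per group yields a Series indexed by the group key
sums = df.groupby("key")[cols].apply(lambda g: g["data1"].sum())
```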