Lecture Doc 5 Pandas Data Structures

Series

1D labeled array

DataFrames

2D table, like a spreadsheet

Presentation Produced Using

Jupyter Notebook

RISE 5.4.1

Imports for Examples

In [ ]:
import numpy as np
import pandas as pd

Series

Creating

pd.Series(data, index=optionalIndex)

data

  • Python dict
  • ndarray
  • scalar
In [ ]:
ints = pd.Series([1,3,5,6])
ints
In [ ]:
pd.Series(5, index=['a','b','c'])
In [ ]:
pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
In [ ]:
pd.Series(np.random.randn(3), ['a', 'c', 'd'])

Series from Dictionary

In [ ]:
d = {'a': 0., 'b': 1., 'c': 2.}
pd.Series(d)
In [ ]:
pd.Series(d, ['b', 'c', 'd', 'a'])

Accessing Elements

In [ ]:
ints = pd.Series([1,3,5,6])
ints[2]
In [ ]:
ints[2] = 11
ints
In [ ]:
ints.get(2)
In [ ]:
ints.index
In [ ]:
ints.dtype

Out of Range

In [ ]:
ints = pd.Series([1,3,5,6])
x = ints.get(10)
print(x)
In [ ]:
ints[10]   # raises KeyError: 10 is not an index label

Default Value

In [ ]:
ints.get(10, -1)

Explicit Index

In [ ]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
ints
In [ ]:
ints['b']
In [ ]:
ints[['a','c','d']]
In [ ]:
ints.iloc[0]   # by position; plain ints[0] on a labeled index relies on a deprecated fallback
In [ ]:
ints.iloc[[1,2]]

The in Operator

ints = pd.Series([1,3,5,7], index=['a','b','c','d'])

In [ ]:
'b' in ints
In [ ]:
3 in ints
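
The two cells above can be surprising: `in` on a Series tests the index labels, not the stored values. A minimal sketch of the difference follows; the variable names are illustrative only.

```python
import pandas as pd

ints = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

label_found = 'b' in ints         # True: 'b' is an index label
value_as_label = 3 in ints        # False: 3 is a value, not a label
value_found = 3 in ints.values    # True: search the underlying values instead
value_mask = ints.isin([3, 7])    # element-wise membership test on the values

print(label_found, value_as_label, value_found)
print(value_mask)
```

Use `ints.values` or `ints.isin(...)` when the question is about the data rather than the labels.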

Slicing

ints = pd.Series([1,3,5,7], index=['a','b','c','d'])

In [ ]:
ints['a':'c']
In [ ]:
ints['c':]
In [ ]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
In [ ]:
ints[1:3]
In [ ]:
ints['a':'c'] = 0
ints
In [ ]:
ints[['b','d']] = 11   # a list of labels selects several elements
ints
In [ ]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
In [ ]:
ints[0:2] = 42
ints

In case you missed it

In [ ]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
ints['a':'c'] = 0
ints
In [ ]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
ints[0:2] = 0
ints
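
The point of the two cells above is that a label slice includes its end label, while a positional slice excludes its end position. A small sketch that makes the choice explicit with `.loc` and `.iloc` (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

by_label = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])
by_label.loc['a':'c'] = 0       # label slice: 'c' is included

by_position = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])
by_position.iloc[0:2] = 0       # positional slice: position 2 is excluded

print(by_label.tolist())        # [0, 0, 0, 7]
print(by_position.tolist())     # [0, 0, 5, 7]
```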

Now for some Fun

In [ ]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
ints[ints > 3]
In [ ]:
ints
In [ ]:
ints[ints > ints.median()]

How does that Work?

In [ ]:
ints > ints.median()
In [ ]:
ints[[False, False, True,True]]
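
Put together, the comparison builds a boolean Series (a mask) and indexing with that mask keeps only the elements where it is True. The sketch below also combines two conditions with `&`. The parentheses are required because `&` binds more tightly than the comparisons.

```python
import pandas as pd

ints = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

mask = ints > ints.median()               # median is 4.0, so mask is F, F, T, T
selected = ints[mask]                     # keeps 'c' and 'd'
between = ints[(ints > 1) & (ints < 7)]   # keeps 'b' and 'c'

print(mask.tolist())
print(selected.to_dict())
print(between.to_dict())
```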

What Methods does Series Have?

Answer: A lot

API Reference

Autocomplete: press Tab after the dot

In [ ]:
ints.

Operations on Series

Done element-wise

In [ ]:
odd = pd.Series([1,3,5,7],['a','b','c','d'])
In [ ]:
odd + 1
In [ ]:
odd * 2
In [ ]:
odd + odd
In [ ]:
odd * odd
In [ ]:
np.sin(odd)

np.sin is a NumPy function

A pandas Series can be used in place of an ndarray in most NumPy functions

In [ ]:
np.sin(odd) < 0.2
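
As a rough illustration of the point above, a NumPy universal function applied to a Series returns a Series with the same index, and `to_numpy()` gives the plain ndarray when one is needed. The names `roots`, `as_array`, and `total` are illustrative.

```python
import numpy as np
import pandas as pd

odd = pd.Series([1, 3, 5, 7], ['a', 'b', 'c', 'd'])

roots = np.sqrt(odd)          # result is still a Series; the index is kept
as_array = odd.to_numpy()     # explicit conversion to an ndarray
total = np.sum(as_array)      # ordinary NumPy reduction on the array

print(type(roots).__name__, list(roots.index))
print(type(as_array).__name__, total)
```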

Sneak Preview

In [ ]:
gain_loss = pd.Series(np.random.randn(1000),
                       index=pd.date_range('1/1/2019', periods=1000))
gain_loss.head()
In [ ]:
gain_loss.plot()
In [ ]:
cumulative_gain = gain_loss.cumsum()
cumulative_gain.plot()

Mismatched Indexes

In [ ]:
odd = pd.Series([1,3,5,7],['a','b','c','d'])
even = pd.Series([2,4,6],['d','b','e'])

odd + even
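
The addition above aligns the two Series by label, so a label present in only one of them has no partner and produces a missing value. One possible alternative, sketched here, is `Series.add` with `fill_value`, which treats the absent partner as 0. Whether 0 is the right stand-in depends on the data.

```python
import pandas as pd

odd = pd.Series([1, 3, 5, 7], ['a', 'b', 'c', 'd'])
even = pd.Series([2, 4, 6], ['d', 'b', 'e'])

aligned = odd + even                  # only 'b' and 'd' exist in both
filled = odd.add(even, fill_value=0)  # a missing partner counts as 0

print(aligned)
print(filled)
```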

NaN

Not a number

Used to indicate a missing value

np.nan

In [ ]:
sample = pd.Series([1,2,3,np.nan])
sample
In [ ]:
sample[0] = np.nan
sample

NaN and Operations

Any arithmetic operation on NaN results in NaN

In [ ]:
odd = pd.Series([1,3,5],['a','b','c'])
even = pd.Series([2,4,6],['d','b','c'])

result = odd + even
result
In [ ]:
result + 1
In [ ]:
result.mean()

What should the mean of result be?
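
One way to explore the question: by default, pandas reductions such as `mean()` skip NaN, so the mean is taken over the values that are present. Passing `skipna=False` propagates NaN instead. A short sketch:

```python
import numpy as np
import pandas as pd

odd = pd.Series([1, 3, 5], ['a', 'b', 'c'])
even = pd.Series([2, 4, 6], ['d', 'b', 'c'])
result = odd + even                        # 'a' and 'd' become NaN

present = result.count()                   # number of non-missing values
default_mean = result.mean()               # NaN skipped: (7 + 11) / 2
strict_mean = result.mean(skipna=False)    # NaN propagates

print(present, default_mean, strict_mean)
```

Neither answer is automatically correct. The choice depends on what a missing value means in the data.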

Pain of NaN

Missing data becomes NaN

Find all instances

Decide how to handle
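
The two steps above can be sketched as follows: `isna()` finds the missing entries, and `dropna()` or `fillna()` are two common ways of handling them. Filling with 0 here is only an example choice, not a recommendation.

```python
import numpy as np
import pandas as pd

sample = pd.Series([1, 2, 3, np.nan])

missing = sample.isna()            # True where the value is missing
n_missing = int(missing.sum())     # count of missing values
dropped = sample.dropna()          # option 1: discard missing entries
filled = sample.fillna(0)          # option 2: substitute a chosen value

print(n_missing)
print(dropped.tolist())
print(filled.tolist())
```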

More Indexing

In [ ]:
even = pd.Series([2,4,6],['a','a','b'])
even
In [ ]:
odd = pd.Series([1,3,5],['a','a','b'])
odd + even
In [ ]:
odd = pd.Series([1,3,5],['a','a','b'])
even = pd.Series([2,4,6],['a','a','c'])

odd + even
In [ ]:
odd = pd.Series([1,3],['a','a'])
even = pd.Series([2,4],['a','a'])

odd + even

You might want to avoid having duplicate index values
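
A possible way to detect and resolve duplicate labels before doing arithmetic is sketched below. Keeping the first occurrence or summing per label are just two illustrative policies.

```python
import pandas as pd

even = pd.Series([2, 4, 6], ['a', 'a', 'b'])

unique = even.index.is_unique               # False: 'a' appears twice
dup_mask = even.index.duplicated()          # marks repeats after the first
first_only = even[~dup_mask]                # policy 1: keep first occurrence
summed = even.groupby(level=0).sum()        # policy 2: combine per label

print(unique, dup_mask.tolist())
print(first_only.to_dict())
print(summed.to_dict())
```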

Reindexing

In [ ]:
d = {'a': 0., 'b': 1., 'c': 2.}
odd_order = pd.Series(d, ['b', 'c', 'd', 'a'])
odd_order
In [ ]:
better_order = odd_order.reindex(['a','b','c','d','e'])
better_order
In [ ]:
int_order = odd_order.reindex([1,2,3,4])
int_order

Replacing an Index

In [ ]:
odd_order = pd.Series([1.,2.,3.], ['b', 'c', 'd'])
odd_order
In [ ]:
odd_order.index = [1,2,3]
odd_order
In [ ]:
odd_order.index = ['a',2,"cat"]
odd_order
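
Reindexing and replacing the index are easy to confuse. As a sketch of the difference: `reindex` looks up existing values by label (labels with no match give NaN), while assigning to `.index` relabels the values in their current order. The names `realigned` and `relabeled` are illustrative.

```python
import pandas as pd

s = pd.Series([1., 2., 3.], ['b', 'c', 'd'])

realigned = s.reindex(['d', 'c', 'x'])   # values follow their labels; 'x' is new
relabeled = s.copy()
relabeled.index = ['d', 'c', 'x']        # same values, same order, new labels

print(realigned)
print(relabeled)
```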

Expanding while Reindexing

In [ ]:
gaps = pd.Series(['a','b','c'], [1, 4, 6])
gaps
In [ ]:
gaps.reindex(range(7))
In [ ]:
gaps.reindex(range(7), method='ffill')   # forward fill
In [ ]:
gaps.reindex(range(7), method='bfill')   # backward fill
In [ ]:
gaps.reindex(range(7), fill_value="cat") 
In [ ]:
gaps.reindex(range(7), fill_value=0) 

Series not restricted to one data type
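
A small sketch of what mixing types does to the dtype: a Series holding unrelated Python objects gets the generic `object` dtype, while mixing ints and floats is upcast to `float64`. The example values are illustrative.

```python
import numpy as np
import pandas as pd

mixed = pd.Series([1, 'two', 3.0])    # int, str, float stored as Python objects
numeric = pd.Series([1, 2.5])         # int and float are upcast to float64

kinds = [type(v).__name__ for v in mixed]

print(mixed.dtype, kinds)
print(numeric.dtype)
```

Object-dtype Series are flexible but lose the fast vectorized arithmetic of numeric dtypes.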

DataFrame

2D data structure

Rows and columns are labeled

Columns can have different data types

Create from

  • Dictionary of 1D ndarrays, lists, dicts, or Series
  • 2D numpy.ndarray
  • Structured or record ndarray
  • Series
  • Another DataFrame
  • File: CSV, Excel, etc.
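
A few of the sources in the list above, sketched with made-up data: a dictionary of lists, a 2D ndarray, and a single named Series. The labels and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Dictionary of lists: keys become column labels
from_lists = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]},
                          index=['a', 'b', 'c'])

# 2D ndarray: row and column labels supplied separately
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          index=['a', 'b', 'c'], columns=['one', 'two'])

# Series: its name becomes the single column label
from_series = pd.Series([1, 3, 5], name='odd').to_frame()

print(from_lists.shape, from_array.shape, from_series.shape)
```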
In [ ]:
data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

sample_df = pd.DataFrame(data)
sample_df
In [ ]:
data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

sample_df = pd.DataFrame(data, index=['d','c','b'])
sample_df
In [ ]:
data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

sample_df = pd.DataFrame(data, index=['d','c','b'], columns=['two', 'one'])
sample_df

Accessing Elements

In [ ]:
data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
sample_df = pd.DataFrame(data)

sample_df['one']
In [ ]:
sample_df['one']['a']
In [ ]:
sample_df.loc['a', 'one'] = 42   # one-step assignment; chained [][] may write to a copy
sample_df
In [ ]:
sample_df[1]   # raises KeyError: [] on a DataFrame selects columns by label