Lecture Doc 5 Pandas Data Structures¶

Series

1D labeled array

DataFrames

2D labeled table, like a spreadsheet

Presentation Produced Using¶

Jupyter Notebook

RISE 5.6.0

Imports for Examples¶

In [2]:
import numpy as np
import pandas as pd

Series¶

Creating

pd.Series(data, index=optionalIndex)

data

  • Python dict
  • ndarray
  • scalar
In [2]:
ints = pd.Series([1,3,5,6])
ints
Out[2]:
0    1
1    3
2    5
3    6
dtype: int64
In [3]:
pd.Series(5, index=['a','b','c'])
Out[3]:
a    5
b    5
c    5
dtype: int64
In [4]:
pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
Out[4]:
a   -0.033369
b   -0.436952
c    0.030384
dtype: float64
In [3]:
pd.Series(np.random.randn(3), ['a', 'c', 'd'])
Out[3]:
a   -1.123177
c   -0.183864
d   -1.150105
dtype: float64

Series from Dictionary¶

In [4]:
d = {'a': 0., 'b': 1., 'c': 2.}
pd.Series(d)
Out[4]:
a    0.0
b    1.0
c    2.0
dtype: float64
In [5]:
pd.Series(d, ['b', 'c', 'd', 'a'])
Out[5]:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

Accessing Elements¶

In [6]:
ints = pd.Series([1,3,5,6])
ints[2]
Out[6]:
5
In [7]:
ints[2] = 11
ints
Out[7]:
0     1
1     3
2    11
3     6
dtype: int64
In [8]:
ints.get(2)
Out[8]:
11
In [9]:
ints.index
Out[9]:
RangeIndex(start=0, stop=4, step=1)
In [10]:
ints.dtype
Out[10]:
dtype('int64')

Out of Range¶

In [11]:
ints = pd.Series([1,3,5,6])
x = ints.get(10)
print(x)
None
In [12]:
ints[10]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    350                 try:
--> 351                     return self._range.index(new_key)
    352                 except ValueError as err:

ValueError: 10 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-12-249e29dff803> in <module>
----> 1 ints[10]

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
    822 
    823         elif key_is_scalar:
--> 824             return self._get_value(key)
    825 
    826         if is_hashable(key):

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
    930 
    931         # Similar to Index.get_value, but we do not fall back to positional
--> 932         loc = self.index.get_loc(label)
    933         return self.index._get_values_for_loc(self, loc, label)
    934 

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    351                     return self._range.index(new_key)
    352                 except ValueError as err:
--> 353                     raise KeyError(key) from err
    354             raise KeyError(key)
    355         return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: 10

Default Value¶

In [13]:
ints.get(10, -1)
Out[13]:
-1
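
The two lookup styles above can be compared side by side. This is a minimal sketch; the variable names `missing`, `fallback`, and `raised` are illustrative and not from the lecture.

```python
import pandas as pd

ints = pd.Series([1, 3, 5, 6])

# .get() returns None, or the supplied default, when the label is absent
missing = ints.get(10)
fallback = ints.get(10, -1)

# Square brackets raise KeyError when the label is absent
try:
    ints[10]
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)
```

Use get when a missing label is an expected case, and square brackets when a missing label should be treated as an error.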

Explicit Index¶

In [14]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
ints
Out[14]:
a    1
b    3
c    5
d    7
dtype: int64
In [15]:
ints['b']
Out[15]:
3
In [16]:
ints[['a','c','d']]
Out[16]:
a    1
c    5
d    7
dtype: int64
In [17]:
ints[0]
Out[17]:
1
In [18]:
ints[[1,2]]
Out[18]:
b    3
c    5
dtype: int64
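
With a string index, an integer key such as ints[0] is treated as a position. Some newer pandas versions warn about this fallback, so a version-independent sketch of the same lookups would use iloc, which always means position. The names `first` and `middle` are illustrative.

```python
import pandas as pd

ints = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

# iloc always interprets integers as positions, whatever the index holds
first = ints.iloc[0]
middle = ints.iloc[[1, 2]]

print(first)
print(middle)
```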

The in Operator¶

ints = pd.Series([1,3,5,7], index=['a','b','c','d'])

In [19]:
'b' in ints
Out[19]:
True
In [20]:
3 in ints
Out[20]:
False
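
The result above is False because in tests the index labels, not the stored values. A minimal sketch of two ways to test the values instead (the variable names are illustrative):

```python
import pandas as pd

ints = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

label_present = 'b' in ints            # checks the index labels
value_as_label = 3 in ints             # 3 is a value, not a label
value_present = 3 in ints.values       # checks the underlying array
also_present = ints.isin([3]).any()    # same check using Series.isin

print(label_present, value_as_label, value_present, also_present)
```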

Slicing¶

ints = pd.Series([1,3,5,7], index=['a','b','c','d'])

In [21]:
ints['a':'c']
Out[21]:
a    1
b    3
c    5
dtype: int64
In [22]:
ints['c':]
Out[22]:
c    5
d    7
dtype: int64
In [31]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
In [32]:
ints[1:3]
Out[32]:
b    3
c    5
dtype: int64
In [25]:
ints['a':'c'] = 0
ints
Out[25]:
a    0
b    0
c    0
d    7
dtype: int64
In [28]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
ints[['b','d']] = 11
ints
Out[28]:
a     1
b    11
c     5
d    11
dtype: int64
In [33]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
In [34]:
ints[0:2] = 42
ints
Out[34]:
a    42
b    42
c     5
d     7
dtype: int64

In case you missed it¶

In [4]:
ints = pd.Series([1,3,5,7])
ints[1:3] = 0
ints
Out[4]:
0    1
1    0
2    0
3    7
dtype: int64
In [3]:
ints = pd.Series([1,3,5,7], index=['a','b','c','d'])
ints[1:2] = 0
ints
Out[3]:
a    1
b    0
c    5
d    7
dtype: int64
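
The point of the two cells above: a slice by label includes its end, while a slice by position excludes it. A small sketch that makes the difference explicit with loc and iloc (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

ints = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

by_label = ints.loc['b':'c']    # label slice: both 'b' and 'c' are included
by_position = ints.iloc[1:2]    # position slice: position 2 is excluded

print(by_label)
print(by_position)
```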

Now for some Fun¶

In [38]:
ints = pd.Series([1,3,5,7], index=['a','b','c', 'd'])
ints[ints > 3]
Out[38]:
c    5
d    7
dtype: int64
In [39]:
ints
Out[39]:
a    1
b    3
c    5
d    7
dtype: int64
In [40]:
ints[ints > ints.median()]
Out[40]:
c    5
d    7
dtype: int64

How does that Work?¶

In [41]:
import numpy as np
import pandas as pd 
ints > ints.median()
Out[41]:
a    False
b    False
c     True
d     True
dtype: bool
In [44]:
ints[[False, False, True,True]]
Out[44]:
c    5
d    7
dtype: int64
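
The comparison produces a boolean Series, and indexing with it keeps the rows where the mask is True. A minimal sketch, assuming one also wants to combine conditions; the & operator and the parentheses around each condition are standard pandas usage, and the names `mask`, `selected`, and `both` are illustrative.

```python
import pandas as pd

ints = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

mask = ints > ints.median()      # boolean Series with the same index as ints
selected = ints[mask]            # keeps only the rows where mask is True

# Combine conditions with & (and) or | (or); each side needs parentheses
both = ints[(ints > 1) & (ints < 7)]

print(mask)
print(selected)
print(both)
```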

Fun With Indexing

In [45]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
Out[45]:
1    a
3    b
5    c
dtype: object
In [46]:
data[1]   # Two possible meanings - position or index
Out[46]:
'a'
In [47]:
data[1:3] # Two possible meanings
Out[47]:
3    b
5    c
dtype: object

data[1] uses the index label

data[1:3] uses position

loc - use index

In [48]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data.loc[1]
Out[48]:
'a'
In [49]:
data.loc[1:3]
Out[49]:
1    a
3    b
dtype: object

iloc - use position

In [50]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data.iloc[1]
Out[50]:
'b'
In [52]:
data.iloc[1:3]
Out[52]:
3    b
5    c
dtype: object

What Methods does Series Have?¶

Answer: A lot

API Reference

Auto Complete - tab¶

In [ ]:
ints.t

Operations on Series¶

Operations are applied element-wise

In [59]:
odd = pd.Series([1,3,5,7],['a','b','c','d'])
In [60]:
odd + 1
Out[60]:
a    2
b    4
c    6
d    8
dtype: int64
In [61]:
odd * 2
Out[61]:
a     2
b     6
c    10
d    14
dtype: int64
In [62]:
odd + odd
Out[62]:
a     2
b     6
c    10
d    14
dtype: int64
In [63]:
odd * odd
Out[63]:
a     1
b     9
c    25
d    49
dtype: int64
In [64]:
np.sin(odd)
Out[64]:
a    0.841471
b    0.141120
c   -0.958924
d    0.656987
dtype: float64

np.sin is a NumPy function

A pandas Series can be used in place of an ndarray in most NumPy functions

In [65]:
np.sin(odd) < 0.2
Out[65]:
a    False
b     True
c     True
d    False
dtype: bool
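
A short sketch of the same idea with a different NumPy function. np.sqrt is used here only as an illustration; the result is still a Series and keeps the original index.

```python
import numpy as np
import pandas as pd

odd = pd.Series([1, 3, 5, 7], ['a', 'b', 'c', 'd'])

# A NumPy ufunc applied to a Series returns a Series with the same labels
roots = np.sqrt(odd)

print(type(roots).__name__)
print(roots)
```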

Sneak Preview¶

In [2]:
import numpy as np
import pandas as pd

def speed_up(N,p):
    return 1/(1 - p + p/N)

speed_up(5, 0.6)
Out[2]:
1.923076923076923
In [5]:
max_N = 50

N_series = pd.Series(range(1,max_N), index=range(1,max_N))
N_series.head(5)
Out[5]:
1    1
2    2
3    3
4    4
5    5
dtype: int64
In [6]:
speed_up(N_series,0.5).head(3)
Out[6]:
1    1.000000
2    1.333333
3    1.500000
dtype: float64
In [7]:
speed_up(N_series,0.5).plot()
Out[7]:
<AxesSubplot:>
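
The formula in speed_up has the form usually called Amdahl's law; the slides do not name it, so that label is an assumption. Under that reading, the curve flattens toward 1/(1 - p) as N grows. A minimal sketch that checks this numerically; the names `curve` and `limit` are illustrative.

```python
import pandas as pd

def speed_up(N, p):
    # p: fraction of the work that can be parallelized, N: number of processors
    return 1 / (1 - p + p / N)

p = 0.5
N_series = pd.Series(range(1, 50), index=range(1, 50))

curve = speed_up(N_series, p)    # element-wise over the whole Series
limit = 1 / (1 - p)              # value approached as N grows without bound

print(curve.iloc[-1], limit)
```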

Mismatched Indexes¶

In [69]:
odd = pd.Series([1,3,5,7],['a','b','c','d'])
even = pd.Series([2,4,6],['d','b','e'])

odd + even
Out[69]:
a    NaN
b    7.0
c    NaN
d    9.0
e    NaN
dtype: float64
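
Labels present in only one Series produce NaN. If a missing label should instead count as zero, one option is Series.add with fill_value. This is a sketch of that alternative, not something the lecture itself uses.

```python
import pandas as pd

odd = pd.Series([1, 3, 5, 7], ['a', 'b', 'c', 'd'])
even = pd.Series([2, 4, 6], ['d', 'b', 'e'])

# fill_value stands in for a label that is missing from one side
total = odd.add(even, fill_value=0)

print(total)
```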

NaN¶

Not a number

Used to indicate a missing value

np.nan

In [70]:
sample = pd.Series([1,2,3,np.nan])
sample
Out[70]:
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
In [71]:
sample[0] = np.nan
sample
Out[71]:
0    NaN
1    2.0
2    3.0
3    NaN
dtype: float64

NaN and Operations¶

Any arithmetic operation involving NaN produces NaN

In [72]:
odd = pd.Series([1,3,5],['a','b','c'])
even = pd.Series([2,4,6],['d','b','c'])

result = odd + even
result
Out[72]:
a     NaN
b     7.0
c    11.0
d     NaN
dtype: float64
In [73]:
result + 1
Out[73]:
a     NaN
b     8.0
c    12.0
d     NaN
dtype: float64
In [74]:
result.mean()
Out[74]:
9.0

What should the mean of result be?
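
The mean of 9.0 comes from pandas skipping NaN by default and averaging only 7.0 and 11.0. A minimal sketch of both behaviors, with the Series rebuilt directly from the values shown above:

```python
import numpy as np
import pandas as pd

result = pd.Series([np.nan, 7.0, 11.0, np.nan], ['a', 'b', 'c', 'd'])

skipping = result.mean()              # NaN values are ignored by default
strict = result.mean(skipna=False)    # any NaN makes the result NaN

print(skipping, strict)
```

Which answer is right depends on what the missing values mean in the data.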

Pain of NaN¶

Missing data becomes NaN

Find all instances

Decide how to handle
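
A rough sketch of those two steps. isna finds the missing entries; dropna and fillna are two common ways to handle them. Filling with 0.0 is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.Series([np.nan, 2.0, 3.0, np.nan])

# Find all instances
missing_mask = sample.isna()
n_missing = missing_mask.sum()

# Decide how to handle: drop the rows, or replace them with a chosen value
dropped = sample.dropna()
filled = sample.fillna(0.0)

print(n_missing)
print(dropped)
print(filled)
```

Both dropna and fillna return a new Series and leave sample unchanged.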

More Indexing¶

In [2]:
even = pd.Series([2,4,6],['a','a','b'])
even
Out[2]:
a    2
a    4
b    6
dtype: int64
In [3]:
odd = pd.Series([1,3,5],['a','a','b'])
odd + even
Out[3]:
a     3
a     7
b    11
dtype: int64
In [4]:
odd = pd.Series([1,3,5],['a','a','b'])
even = pd.Series([2,4,6],['a','a','c'])

odd + even
Out[4]:
a    3.0
a    5.0
a    5.0
a    7.0
b    NaN
c    NaN
dtype: float64
In [5]:
odd = pd.Series([1,3],['a','a'])
even = pd.Series([2,4],['a','a'])

odd + even
Out[5]:
a    3
a    7
dtype: int64

You might want to avoid having duplicate index values
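
A small sketch of how duplicates could be detected and removed. Keeping the first occurrence of each label is an assumption made for illustration; the right policy depends on the data.

```python
import pandas as pd

even = pd.Series([2, 4, 6], ['a', 'a', 'b'])

unique = even.index.is_unique        # False when any label repeats
dupes = even.index.duplicated()      # True for every repeat after the first

# Keep only the first occurrence of each label
deduped = even[~dupes]

print(unique)
print(deduped)
```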

Reindexing¶

In [6]:
d = {'a': 0., 'b': 1., 'c': 2.}
odd_order = pd.Series(d, ['b', 'c', 'd', 'a'])
odd_order
Out[6]:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
In [7]:
better_order = odd_order.reindex(['a','b','c','d','e'])
better_order
Out[7]:
a    0.0
b    1.0
c    2.0
d    NaN
e    NaN
dtype: float64
In [8]:
int_order = odd_order.reindex([1,2,3,4])
int_order
Out[8]:
1   NaN
2   NaN
3   NaN
4   NaN
dtype: float64

Replacing an Index¶

In [9]:
odd_order = pd.Series([1.,2.,3.], ['b', 'c', 'd'])
odd_order
Out[9]:
b    1.0
c    2.0
d    3.0
dtype: float64
In [10]:
odd_order.index = [1,2,3]
odd_order
Out[10]:
1    1.0
2    2.0
3    3.0
dtype: float64
In [11]:
odd_order.index = ['a',2,"cat"]
odd_order
Out[11]:
a      1.0
2      2.0
cat    3.0
dtype: float64

Expanding while Reindexing¶

In [12]:
gaps = pd.Series(['a','b','c'], [1, 4, 6])
gaps
Out[12]:
1    a
4    b
6    c
dtype: object
In [13]:
gaps.reindex(range(7))
Out[13]:
0    NaN
1      a
2    NaN
3    NaN
4      b
5    NaN
6      c
dtype: object
In [14]:
gaps.reindex(range(7), method='ffill')   # forward fill
Out[14]:
0    NaN
1      a
2      a
3      a
4      b
5      b
6      c
dtype: object
In [15]:
gaps.reindex(range(7), method='bfill')   # backward fill
Out[15]:
0    a
1    a
2    b
3    b
4    b
5    c
6    c
dtype: object
In [16]:
gaps.reindex(range(7), fill_value="cat") 
Out[16]:
0    cat
1      a
2    cat
3    cat
4      b
5    cat
6      c
dtype: object
In [18]:
gaps
Out[18]:
1    a
4    b
6    c
dtype: object
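
As the last cell shows, gaps is unchanged: reindex returns a new Series instead of modifying the original. A minimal sketch of keeping the result, where the name `filled` is illustrative:

```python
import pandas as pd

gaps = pd.Series(['a', 'b', 'c'], [1, 4, 6])

# reindex does not modify gaps; assign the result to keep it
filled = gaps.reindex(range(7), method='ffill')

print(len(gaps), len(filled))
print(filled)
```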