```
import numpy as np
import pandas as pd
from IPython.display import display
```

### 1. weighted average

Let's denote `x = [x_1, ..., x_n]`

. The weight `w`

is denoted as `w = [w_1, ..., w_n]`

. The weighted average of `x`

by `w`

is \(\frac{ \sum_{i=1}^{n} x_i * w_i } { \sum_{i=1}^{n} w_i}\)

`numpy`

provides a function called `np.average()`

to calculate the weighted average. An example of calculate by hand and by the `np.average`

is given below:

```
np.random.seed(9999)
df = pd.DataFrame(np.random.random(20).reshape(10, 2), columns = ['val1', 'wt'])
# if we want to calcuate the weighted average of val1 weighted by wt, that is:
sum(df.val1 * df.wt) / sum(df.wt) # 0.42105022244260254
# np.average() function
np.average(df.val1, weights = df.wt, axis = 0) # 0.42105022244260254
```

```
0.42105022244260254
```

### 2. calculate weighted average within each group

The above example is very simple. A more common situation is there are different groups, and we need to calculate the weighted average within each group. The following is an example:

- there are two groups, called 'id'
- we want to calculate the weighted average for data in group 1(id == 1) and group 2(id == 2)
- calculate the weighted average of var1 and var2 by wt in group 1, and group 2 seperately

so, `0.339688030253 = sum(df1.val1 * df1.wt) / df1.wt.sum()`

```
np.random.seed(9999)
df = pd.DataFrame(np.random.random(20).reshape(10, 2), columns = ['val1', 'val2'])
df['id'] = np.repeat([1, 2], 5)
df['wt'] = [1, 2] * 5
print df
print '\n' + '-'*20 + ' weighted average of val1 and val2 by wt in each group ' + '-'*20
wtavg = lambda x: np.average(x.ix[:, :2], weights = x.wt, axis = 0)
dfwavg = df.groupby('id').apply(wtavg)
display(dfwavg)
```

```
val1 val2 id wt
0 0.823389 0.218603 1 1
1 0.031461 0.545926 1 2
2 0.154456 0.186963 1 1
3 0.422348 0.605552 1 2
4 0.492354 0.644119 1 1
5 0.913049 0.771104 2 2
6 0.288361 0.415441 2 1
7 0.331490 0.366327 2 2
8 0.312195 0.580446 2 1
9 0.112952 0.325875 2 2
-------------------- weighted average of val1 and val2 by wt in each group --------------------
id
1 [0.339688030253, 0.478948919665]
2 [0.41444248036, 0.490312314239]
dtype: object
```

```
## verified by manual calculation
df1 = df.query('id == 1')
sum(df1.val1 * df1.wt) / df1.wt.sum() ## 0.33968803025301908
```

```
0.33968803025301908
```

Difference between `apply`

and `agg`

: `apply`

will apply the funciton on the data frame of each group, while `agg`

will aggregate each column of each group. So the arguments in the apply function is a dataframe.

The following is an example from pandas docs. The arguments in function f0 is a dataframe in each id group.

```
def f0(group):
return pd.DataFrame({'original': group, 'group - mean': group - group.mean()})
df.groupby('id')['val1'].apply(f0)
```

group - mean | original | |
---|---|---|

0 | 0.438588 | 0.823389 |

1 | -0.353340 | 0.031461 |

2 | -0.230346 | 0.154456 |

3 | 0.037546 | 0.422348 |

4 | 0.107553 | 0.492354 |

5 | 0.521440 | 0.913049 |

6 | -0.103248 | 0.288361 |

7 | -0.060119 | 0.331490 |

8 | -0.079414 | 0.312195 |

9 | -0.278658 | 0.112952 |

### multiple functions

#### 1. different function for different column

One condition is you want to apply different function on different columns in the dataframe. For example, you want to apply sum on one column, and stdev on another column.

if you want to apply multiple functions to aggregate, then you need to put them in the **list** or **dict**.

```
np.random.seed(9999)
snapdf = pd.DataFrame(np.random.random(3000).reshape(-1, 3), \
columns = ['wt_var', 'var_to_sum', "var_to_stdev"])
snapdf['stage'] = np.repeat([1, 2], 500)
f = {"var_to_sum": np.sum, "var_to_stdev": np.std}
stage_s1 = snapdf.ix[:, ['stage', 'var_to_sum', 'var_to_stdev']].groupby('stage').agg(f).reset_index()
stage_s1
```

stage | var_to_sum | var_to_stdev | |
---|---|---|---|

0 | 1 | 245.666111 | 0.288775 |

1 | 2 | 247.715539 | 0.284739 |

#### 2. multiple functions on one column

Another condition is you want to apply multiple functions on each column. For example, you want to calcualte both mean and stdev for eachc column. Then your function is a dict of dicts: first key is the column(variable) to be applied by the functions. second key is the functions.

An example is like:

```
np.random.seed(9999)
snapdf = pd.DataFrame(np.random.random(3000).reshape(-1, 3), \
columns = ['wt_var', 'var1', "var2"])
snapdf['stage'] = np.repeat([1, 2], 500)
snapdf.groupby('stage').agg({np.sum, np.mean, np.std})
```

wt_var | var1 | var2 | |||||||
---|---|---|---|---|---|---|---|---|---|

std | mean | sum | std | mean | sum | std | mean | sum | |

stage | |||||||||

1 | 0.27400 | 0.484682 | 242.341116 | 0.287491 | 0.491332 | 245.666111 | 0.288775 | 0.506056 | 253.028070 |

2 | 0.29702 | 0.507317 | 253.658370 | 0.282293 | 0.495431 | 247.715539 | 0.284739 | 0.501093 | 250.546638 |

```
np.random.seed(9999)
snapdf = pd.DataFrame(np.random.random(3000).reshape(-1, 3), \
columns = ['wt_var', 'var_to_wt_avg_1', 'var_to_wt_avg_2'])
snapdf['stage'] = np.repeat([1, 2], 500)
wtavg = lambda x: np.average(x, weights = snapdf.loc[x.index, "wt_var"])
f2 = {'var_to_wt_avg_1': {'mean':'mean', 'wt_avg': wtavg}, 'var_to_wt_avg_2':{'mean':'mean', 'wt_avg':wtavg}}
stage_s2 = snapdf.ix[:, ['stage', 'var_to_wt_avg_1', 'var_to_wt_avg_2']].groupby('stage').agg(f2).reset_index()
# stage_s2.columns = stage_s2.columns.droplevel(0)
stage_s2
```

stage | var_to_wt_avg_2 | var_to_wt_avg_1 | |||
---|---|---|---|---|---|

wt_avg | mean | wt_avg | mean | ||

0 | 1 | 0.498908 | 0.506056 | 0.486139 | 0.491332 |

1 | 2 | 0.517657 | 0.501093 | 0.499796 | 0.495431 |

### 3. weighted from stacked data: unstack data

Data in the previous are stored in horizontial way: variables are in different columns. But sometimes data are in vertical way: in the following example, the 3 ratings(I/S/P) are stacked in vertical line. We want to get the weighted average of I/S/P. So, the first step is to unstak the data. Then calculate the weighted average.

```
np.random.seed(9999)
df = pd.DataFrame(np.random.random(72).reshape(36, 2), columns = ["var1", "var2"])
df["rating"] = np.repeat(['I', 'S', 'P'], 12)
df['qtrs'] = range(1, 13) * 3
df = df.set_index(['rating', 'qtrs']).unstack(level = 0)
df['cnts', 'I'] = 100
df['cnts', 'P'] = 200
df['cnts', "S"] = 300
display(df.head())
df['wt_var_1'] = np.average(df.var1, weights = df.cnts, axis = 1)
df['wt_var_2'] = np.average(df.var2, weights = df.cnts, axis = 1)
df.reset_index().ix[:, ["qtrs", "wt_var_1", "wt_var_2"]]
```

var1 | var2 | cnts | |||||||
---|---|---|---|---|---|---|---|---|---|

rating | I | P | S | I | P | S | I | P | S |

qtrs | |||||||||

1 | 0.823389 | 0.154764 | 0.602518 | 0.218603 | 0.904529 | 0.435665 | 100 | 200 | 300 |

2 | 0.031461 | 0.365481 | 0.472321 | 0.545926 | 0.580417 | 0.866001 | 100 | 200 | 300 |

3 | 0.154456 | 0.494557 | 0.537686 | 0.186963 | 0.824609 | 0.484590 | 100 | 200 | 300 |

4 | 0.422348 | 0.503091 | 0.333305 | 0.605552 | 0.670854 | 0.582543 | 100 | 200 | 300 |

5 | 0.492354 | 0.446044 | 0.679368 | 0.644119 | 0.050341 | 0.544653 | 100 | 200 | 300 |

qtrs | wt_var_1 | wt_var_2 | |
---|---|---|---|

rating | |||

0 | 1 | 0.490078 | 0.555776 |

1 | 2 | 0.363231 | 0.717461 |

2 | 3 | 0.459438 | 0.548325 |

3 | 4 | 0.404741 | 0.615815 |

4 | 5 | 0.570425 | 0.396460 |

5 | 6 | 0.870236 | 0.845366 |

6 | 7 | 0.446611 | 0.123709 |

7 | 8 | 0.735676 | 0.185098 |

8 | 9 | 0.378114 | 0.435725 |

9 | 10 | 0.533662 | 0.248189 |

10 | 11 | 0.529564 | 0.565223 |

11 | 12 | 0.536358 | 0.297243 |