pydata

Keep Looking, Don't Settle

python v.s. SAS

python v.s. SAS

Tables python SAS
frequency
  • x.value_counts(dropna = False).sort_index()
  • pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis = 1)
proc freq;
drop/deep
  • df.drop(['d1', 'd2', 'd3'], axis = 1)
  • df.loc[:, ['k1', 'k2'])
  • data a(drop=d1 d2 d2);
  • data a(keep=k1 k2 k3);
rename
  • df.columns = ['a1', 'a2', 'a3']
  • df.rename(columns={'orig1':'new1', 'orig1':'new2'})
data a(rename=(orig1=new1 orig2=new2));
summarize
  • df.x.describe()
proc summary;
bin
  • pd.cut(x, [min, cut1, …, cutk, max])
  • np.digitize(x, [cut1, cut2, …, cutk])
proc rank;
row select
  • df.loc[(cond1) & (cond2), :]
  • df.iloc[:, [1, 3, 5]]
where cond1 and cond2;
merge
  • pd.merge(df1, df2, on = , how= )
  • pd.concat([df1, df2, df3], axis = 0, ignore_index = True)
  • df1.join(df1, how = )
merge df1 df2; by col1;
sort
  • df.sort(['sort_by1', 'sort_by2'], ascending = [True, False])
proc sort; by sortby1 descending sort_by2;
count missing
  • df.count()
  • df.isnull().sum()
proc means n nmiss min mean median max;
format
  • 1 <-> 1 mapping:     df.column1.replace(zip(old_value, new_value))
  • hgc2012.sic_code.replace(dict(zip(sic_indust.sic, sic_indust.indust)))
  • interval bin/mapping, like from PD to risk ranking:    [ranking[j] for j in np.digitize(x, intervals)]
  • pd.cut(vectorx, [-np.inf, .25, .5, 1, 1.5, 2, 3, 3.6, 4.2, np.inf], labels = range(9))
proc format;
columns start with
  • [x for x in list(df) if x.startswith('score')]
var scores: ;
index
  • df.reset_index(inplace = True)
  • df.set_index(‘Column_name’, inplace = True)
apply / map / applymap
  • pd.Series.map(func, dict, Series)
  • pd.Series.apply()
  • pd.DataFrame.apply(func, axis = 0)
data transformation, fotmat, aggregate

python tips

Tables desc code
1 replace blanks in var name by "_" and to lower case
  • hgcallvar = list(hgc)
  • [x.replace(' ', '_').lower() for x in hgcallvar]
2 string contains method
  • [x for x in list(hgc2012data) if 'sic' in x]
  • [x for x in list(hgc2012data) if 'sic' not in x]
  • [x for x in list(hgc2012data) if x.find('sic') >= 0]
3 search history in ipython
  • %hist -g median
4 history in a list in ipython
  • _ih
5 change string to float and replace '.' by 0
  • map(lambda x: 0 if x == '.' else float(x), data['col_name'])
6 pandas dataframe column or series null value check
  • data['col_name'].isnull()
7 matplotlib inline plot, change figure size
  • plt.style.use('ggplot')
  • %matplotlib inline
  • plt.rcParams['figure.figsize'] = (10.0, 8.0)
8 scatter plot / line plot / bar chart
  • fig = plt.figure()
  • ax = fig.add_subplot(111)
  • plt.scatter(x, y, s = 5, color = u'b', marker = '.', label = 'scatter plt')
  • ax.plot(x, y, linestyle = '', linewidth = 0.25, markeredgecolor='none', marker = 'o', label = r'label')
  • ax.plot(x, f2_pred, linestyle = ':', linewidth = 0.25, markeredgecolor='none', marker = '', label = r'label')
9 sort a dict by key or by value
  • import operator
  • x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
  • sorted_x = sorted(x.items(), key=operator.itemgetter(1), reverse = True) # sort by values, output is list of tuples
  • import operator
  • x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
  • sorted_x = sorted(x.items(), key=operator.itemgetter(0)) # sort by keys, output is list of tuples
10 numpy sort array by a column: easier to do in pandas
  • testarray = np.array(np.random.randn(5, 3))
  • testarray[testarray[:, 1].argsort()]
  • a[a[:,1].argsort()[::-1]] #reverse order
11 numpy sort array by muptiple columns (using lexsort)
  • a = np.random.randint(3, size = (5, 3))
  • ind = a_sort = np.lexsort((a[:, 1], a[: 0]))
  • a[ind]
12