frequency |
x.value_counts(dropna = False).sort_index() pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis = 1)
|
proc freq; |
drop/keep |
-
df.drop(['d1', 'd2', 'd3'], axis = 1) df.loc[:, ['k1', 'k2', 'k3']]
|
data a(drop=d1 d2 d3); data a(keep=k1 k2 k3);
|
rename |
df.columns = ['a1', 'a2', 'a3'] df.rename(columns={'orig1':'new1', 'orig2':'new2'})
|
data a(rename=(orig1=new1 orig2=new2)); |
summarize |
|
proc summary; |
bin |
pd.cut(x, [min, cut1, …, cutk, max]) np.digitize(x, [cut1, cut2, …, cutk])
|
proc rank; |
row select |
df.loc[(cond1) & (cond2), :] df.iloc[[1, 3, 5], :]
|
where cond1 and cond2; |
merge |
pd.merge(df1, df2, on = , how= ) pd.concat([df1, df2, df3], axis = 0, ignore_index = True) -
df1.join(df2, how = )
|
merge df1 df2; by col1; |
sort |
df.sort_values(by = ['sort_by1', 'sort_by2'], ascending = [True, False])
|
proc sort; by sort_by1 descending sort_by2; |
count missing |
df.count() df.isnull().sum()
|
proc means n nmiss min mean median max; |
format |
- 1 <-> 1 mapping:
df.column1.replace(dict(zip(old_value, new_value))) hgc2012.sic_code.replace(dict(zip(sic_indust.sic, sic_indust.indust))) - interval bin/mapping, like from PD to risk ranking:
[ranking[j] for j in np.digitize(x, intervals)] -
pd.cut(vectorx, [-np.inf, .25, .5, 1, 1.5, 2, 3, 3.6, 4.2, np.inf], labels = range(9))
|
proc format; |
columns start with |
[x for x in list(df) if x.startswith('score')]
|
var scores: ; |
index |
df.reset_index(inplace = True) df.set_index('Column_name', inplace = True)
|
|
apply / map / applymap |
pd.Series.map(func, dict, Series) pd.Series.apply() pd.DataFrame.apply(func, axis = 0)
|
data transformation, format, aggregate |