
Sources:

[1] Original English text:
www.oreilly.com/library/vie
[2] apachecn translation project:
github.com/apachecn
[3] @ZhenLeiXu:
github.com/HadXu

The code for this article is provided as Jupyter notebooks.

Available on GitHub:

github.com/fengdu78/Da

and on Baidu Netdisk (extraction code: 8p5d):

pan.baidu.com/s/1uDXt5jWU

A categorical variable, as the name suggests, represents a category or label: which city a listing is in, which season it is, which industry a company belongs to. The number of possible categories can be large, but the values themselves are discrete and carry no intrinsic order.

Large categorical variables are especially common in web applications. Many web services track users with an ID, and fraud- and abuse-detection systems key on attributes such as IP address; both are categorical even though they look numeric. For example, the IP addresses 164.203.x.x and 164.202.x.x are adjacent as numbers, yet that closeness means nothing: they may belong to entirely unrelated networks.

The rest of this section walks through the standard encodings for categorical variables — one-hot, dummy, and effect coding — and then two techniques for very large categorical variables: feature hashing and bin-counting.

One-hot encoding

A categorical variable with k possible values can be represented by a group of k bits, of which exactly one is 1 (the bit for the observed category) and the rest are 0. Scikit-learn implements this as sklearn.preprocessing.OneHotEncoder: k categories become k feature columns.

Table 1-1. One-hot encoding of 3 cities

City            e1  e2  e3
San Francisco    1   0   0
New York         0   1   0
Seattle          0   0   1
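The sklearn.preprocessing.OneHotEncoder mentioned above can produce Table 1-1 directly; a minimal sketch (variable names here are illustrative, not from the original):

```python
from sklearn.preprocessing import OneHotEncoder

# the three cities from Table 1-1, one sample per row
cities = [['San Francisco'], ['New York'], ['Seattle']]

enc = OneHotEncoder()
encoded = enc.fit_transform(cities).toarray()  # densify the sparse result

print(enc.categories_)  # columns are ordered by sorted category name
print(encoded)
```

Note that OneHotEncoder orders its output columns by sorted category value, so the first column here corresponds to New York rather than San Francisco.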

Notice that one-hot encoding spends one degree of freedom too many: the bits always sum to 1 (e1 + e2 + e3 = 1 in Table 1-1), so the features are linearly dependent. Linearly dependent features make linear models harder to interpret, because many different coefficient vectors fit the data equally well.

Dummy coding

Dummy coding removes the redundancy by using only k − 1 features: one category, called the reference category, is represented by the all-zeros vector. Pandas implements dummy coding via pandas.get_dummies.

Table 1-2. Dummy coding of 3 cities

City            e1  e2
San Francisco    1   0
New York         0   1
Seattle          0   0

To see what these encodings do to a model, Table 1-3 gives a toy dataset of apartment rents in three cities.

Table 1-3. Toy rent dataset

id  city     Rent
0   SF       3999
1   SF       4000
2   SF       4001
3   NYC      3499
4   NYC      3500
5   NYC      3501
6   Seattle  2499
7   Seattle  2500
8   Seattle  2501

Example 1-1 fits a linear regressor to this data with both one-hot and dummy encodings. With one-hot encoding, the intercept ends up being the global mean rent and each coefficient the deviation of that city's mean from the global mean; with dummy coding, the reference category's mean is absorbed into the intercept, and the coefficients measure differences from the reference.

Example 1-1. Linear regression on categorical features with one-hot and dummy codes

import pandas as pd
from sklearn import linear_model

df = pd.DataFrame({
    'City': ['SF', 'SF', 'SF', 'NYC', 'NYC', 'NYC',
             'Seattle', 'Seattle', 'Seattle'],
    'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]
})
df['Rent'].mean()
3333.3333333333335
one_hot_df = pd.get_dummies(df, prefix=['city'])
one_hot_df
   Rent  city_NYC  city_SF  city_Seattle
0  3999         0        1             0
1  4000         0        1             0
2  4001         0        1             0
3  3499         1        0             0
4  3500         1        0             0
5  3501         1        0             0
6  2499         0        0             1
7  2500         0        0             1
8  2501         0        0             1
model = linear_model.LinearRegression()
model.fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],
          one_hot_df[['Rent']])
model.coef_
array([[ 166.66666667, 666.66666667, -833.33333333]])
model.intercept_
array([3333.33333333])
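As a sanity check (a sketch added here, not part of the original notebook), the intercept plus each city's coefficient should reproduce that city's mean rent:

```python
import pandas as pd
from sklearn import linear_model

df = pd.DataFrame({
    'City': ['SF'] * 3 + ['NYC'] * 3 + ['Seattle'] * 3,
    'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]
})
one_hot_df = pd.get_dummies(df, prefix=['city'])

model = linear_model.LinearRegression()
model.fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']], one_hot_df['Rent'])

# intercept + coefficient recovers each city's mean rent
for city, coef in zip(['NYC', 'SF', 'Seattle'], model.coef_):
    print(city, model.intercept_ + coef)
```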

Now repeat with dummy coding:

dummy_df = pd.get_dummies(df, prefix=['city'], drop_first=True)
dummy_df
   Rent  city_SF  city_Seattle
0  3999        1             0
1  4000        1             0
2  4001        1             0
3  3499        0             0
4  3500        0             0
5  3501        0             0
6  2499        0             1
7  2500        0             1
8  2501        0             1
model.fit(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
model.coef_
array([ 500., -1000.])
model.intercept_
3500.0

The two encodings give different but equivalent pictures of the data. With one-hot encoding the intercept b is the global mean of the response y, and each coefficient tells how much city i's mean rent differs from that global mean. With dummy coding the intercept is the mean response of the reference category (here NYC), and each coefficient is the difference between city i's mean and the reference mean. Table 1-4 compares the fitted parameters.

Table 1-4. Linear regression coefficients under the two encodings

              x1      x2      x3       b
one-hot       166.67  666.67  -833.33  3333.33
dummy coding  0       500     -1000    3500

Effect coding

Yet another variant for linear models is effect coding. It is identical to dummy coding except that the reference category is represented by the all −1s vector instead of all 0s (Table 1-5).

Table 1-5. Effect coding of 3 cities

City            e1  e2
San Francisco    1   0
New York         0   1
Seattle         -1  -1

With effect coding, linear regression recovers the grand mean as the intercept and each category's deviation from that mean as its coefficient (for background, search for "what is effect coding?"). Example 1-2 repeats the regression with effect coding.

Example 1-2. Linear regression with effect coding

effect_df = dummy_df.copy()
effect_df.loc[3:5, ['city_SF', 'city_Seattle']] = -1.0
effect_df
   Rent  city_SF  city_Seattle
0  3999      1.0           0.0
1  4000      1.0           0.0
2  4001      1.0           0.0
3  3499     -1.0          -1.0
4  3500     -1.0          -1.0
5  3501     -1.0          -1.0
6  2499      0.0           1.0
7  2500      0.0           1.0
8  2501      0.0           1.0
model.fit(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
model.coef_
array([ 666.66666667, -833.33333333])
model.intercept_
3333.3333333333335
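With effect coding the intercept is the grand mean, and the dropped reference category's effect is the negative sum of the listed coefficients. A quick self-contained check (a sketch; not in the original):

```python
import pandas as pd
from sklearn import linear_model

df = pd.DataFrame({
    'City': ['SF'] * 3 + ['NYC'] * 3 + ['Seattle'] * 3,
    'Rent': [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]
})
effect_df = pd.get_dummies(df, prefix=['city'], drop_first=True).astype(float)
effect_df.loc[3:5, ['city_SF', 'city_Seattle']] = -1.0  # NYC rows become -1

model = linear_model.LinearRegression()
model.fit(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])

# the reference category's effect is minus the sum of the other effects
nyc_effect = -(model.coef_[0] + model.coef_[1])
print(model.intercept_)               # grand mean rent
print(model.intercept_ + nyc_effect)  # NYC mean rent
```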

The −1 reference vectors of effect coding make the feature matrix dense, which costs storage and computation; that is one reason Pandas, scikit-learn, and most other ML packages default to dummy or one-hot coding instead.

Dealing with large categorical variables

Automatically generated categorical features — user IDs, IP addresses, ad domains — easily reach millions of distinct values, and the encodings above stop being practical. Existing solutions fall into two camps:

  • a. Feature hashing — popular with linear models;
  • b. Bin-counting — popular with linear models as well as tree-based models.

One option is simply to one-hot encode anyway and train a simple model on the huge sparse vectors, as Graepel et al. [2010] did; the alternatives are to compress the features, either by hashing [Weinberger et al. 2009] or by bin-counting [McMahan et al. 2013; Bilenko 2015].

Feature hashing

A hash function maps a potentially unbounded input — an integer or a string — into a finite range of m bins (Figure 1-2). Because many distinct inputs can land in the same bin, hashing compresses the feature space: instead of one column per category ID, feature hashing uses m columns, each shared by all the IDs that hash to it.

Figure 1-2. A hash function maps keys into a fixed number of bins

Example 1-3 shows a plain feature-hashing function and a signed variant. The signed variant flips the sign of the increment based on a second hash, so that features colliding in a bin tend to cancel rather than accumulate, keeping the hashed features unbiased.

Example 1-3. Feature hashing, plain and signed

def hash_features(word_list, m):
    # hash_fcn stands for any deterministic hash function
    output = [0] * m
    for word in word_list:
        index = hash_fcn(word) % m
        output[index] += 1
    return output

def hash_features(word_list, m):
    # signed variant: sign_hash is a second, independent hash function
    output = [0] * m
    for word in word_list:
        index = hash_fcn(word) % m
        sign_bit = sign_hash(word) % 2
        if (sign_bit == 0):
            output[index] -= 1
        else:
            output[index] += 1
    return output
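The pseudocode above leaves hash_fcn and sign_hash unspecified. A self-contained sketch that stands in hashlib.md5 for both (any deterministic hash would do; md5 is just convenient and reproducible):

```python
import hashlib

def hash_fcn(word):
    # deterministic stand-in for an arbitrary hash function
    return int(hashlib.md5(word.encode('utf-8')).hexdigest(), 16)

def sign_hash(word):
    # a second, salted hash so the sign is independent of the bin index
    return int(hashlib.md5((word + '#sign').encode('utf-8')).hexdigest(), 16)

def signed_hash_features(word_list, m):
    output = [0] * m
    for word in word_list:
        index = hash_fcn(word) % m
        output[index] += 1 if sign_hash(word) % 2 else -1
    return output

features = signed_hash_features(['the', 'quick', 'brown', 'fox', 'the'], m=8)
print(features)
```

Because sign collisions can cancel, the absolute values in the output sum to at most the number of input words.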

With signed hashing, the inner product between hashed feature vectors is an unbiased estimate of the inner product between the original vectors, with error on the order of O(1/(m**0.5)) [Weinberger et al. 2009]. The number of bins m controls the trade-off between compression and fidelity: a larger m preserves inner products better but takes more space. McMahan et al. [2013] experimented with feature hashing on an ad click-prediction task and found that it shrank the model dramatically with little loss of accuracy. The main cost is interpretability: a hashed feature is an aggregate of whatever raw features collided into its bin, so it no longer maps to a meaningful name.

The following example demonstrates feature hashing on Yelp review data, using scikit-learn's

FeatureHasher

import pandas as pd
import json

js = []
with open('data/yelp_academic_dataset_review.json') as f:
    for i in range(10000):
        js.append(json.loads(f.readline()))
review_df = pd.DataFrame(js)
m = len(review_df.business_id.unique())
m
4174
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=m, input_type='string')
f = h.transform(review_df['business_id'])
review_df['business_id'].unique().tolist()[0:5]
['9yKzy9PApeiPPOUJEtnvkg', 'ZRJwVLyzEJq1VAihDhYiow', '6oRAC4uyJCsJl1X0WZpVSA', '_1QQZuf4zZOyFCvXc0o6Vg', '6ozycU1RpktNG2-1BroVtw']
f.toarray()
array([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]])

from sys import getsizeof

print('Our pandas Series, in bytes: ', getsizeof(review_df['business_id']))
print('Our hashed numpy array, in bytes: ', getsizeof(f))

Our pandas Series, in bytes:  790104
Our hashed numpy array, in bytes:  56
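One caveat worth adding to the comparison above: getsizeof on a SciPy sparse matrix only measures the Python wrapper object, not its underlying buffers. A fairer sketch (with a toy matrix standing in for the hashed features) sums the CSR arrays:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sys import getsizeof

f = csr_matrix(np.eye(4))  # toy sparse matrix standing in for the hashed features

# the real payload of a CSR matrix lives in three NumPy arrays
payload = f.data.nbytes + f.indices.nbytes + f.indptr.nbytes
print('getsizeof reports:', getsizeof(f))
print('CSR buffers hold :', payload)
```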

Bin-counting

Bin-counting is one of machine learning's older tricks, rediscovered many times [Yeh and Patt 1991; Lee 1998; Pavlov 2009; Li 2010]; a good overview is Misha Bilenko's "Big Learning Made Easy with Counts".

The idea is simple: rather than feeding the model the identity of a categorical value, feed it statistics of the target conditioned on that value. For click prediction, instead of a one-hot user ID, use that user's historical probability of clicking an ad. This is closely related to Naïve Bayes, which likewise summarizes features by per-class conditional probabilities.

Table 1-6. Example bin-counting features: per-user and per-(query, ad domain) click statistics

User   Number of clicks  Number of non-clicks  Probability of click
Alice  5                 120                   0.0400
bob    20                230                   0.0800
...
joe    2                 3                     0.400

QueryHash, AdDomain  Number of clicks  Number of non-clicks  Probability of click
0x598fd4fe, foo.com  5,000             30,000                0.167
0x50fa3cc0, bar.org  100               900                   0.100
...
0x437a45e1, qux.net  6                 18                    0.250

Bin-counting replaces each categorical value with the statistics of Table 1-6: Alice's rows carry her historical click counts and click probability, and the (QueryHash, AdDomain) pair 0x437a45e1, qux.net carries the historical statistics of that query–domain combination.

Compare the sizes: a one-hot encoding of a user among 10,000 users is a 10,000-dimensional vector containing a single 1; bin-counting replaces it with one real-valued feature between 0 and 1 (plus, optionally, the raw counts).

Bin-counting is not limited to the raw probability; any statistic of the target conditioned on the category works, such as the odds ratio or the log-odds ratio. The odds ratio is normally defined between two binary variables: for user Alice it asks whether a click is more or less likely when the user is Alice than when it is anyone else. Table 1-7 gives the contingency table of Alice against all other users.

Table 1-7. Contingency table for user Alice

            Click  Non-click  Total
Alice           5        120    125
Not Alice     995      18880  19875
Total        1000      19000  20000

The odds ratio is then:

odds ratio = (5/120) / (995/18880) ≈ 0.79

so Alice's odds of clicking are about 79% of everyone else's. Odds ratios can blow up when some of the counts are tiny, so in practice the log of the odds ratio is often used instead.
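The numbers in Table 1-7 make this concrete; a small sketch of the arithmetic:

```python
import math

# contingency counts from Table 1-7
alice_clicks, alice_nonclicks = 5, 120
others_clicks, others_nonclicks = 995, 18880

odds_ratio = (alice_clicks / alice_nonclicks) / (others_clicks / others_nonclicks)
log_odds_ratio = math.log(odds_ratio)

print(round(odds_ratio, 4))      # ~0.79: Alice's click odds are below average
print(round(log_odds_ratio, 4))
```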

In short, bin-counting converts the large, sparse binary representation of a categorical variable into a small, dense real-valued one built from target statistics (Figure 1-3).

Figure 1-3.

What does this look like on real data? We use data from the Kaggle Avazu click-through-rate prediction competition.

Some facts about the Avazu data:

  • It contains 24 variables, including 'click' (a binary click/no-click indicator) and 'device_id', an identifier of the device the ad was served to.
  • The full training set has 40,428,967 rows, with 2,686,408 unique device IDs.

The goal is to bin-count 'device_id' in the Avazu data: replace each device ID with its historical click statistics.

The full file is roughly 6 GB, so the example below uses a subset of the first 10,000 rows, saved as

train_subset.csv

Example 1-6. Bin-counting on Avazu device IDs

import pandas as pd

# read the first 10,000 rows of the Avazu training data
df = pd.read_csv('data/train_subset.csv')

# how many unique device IDs are in the subset?
len(df['device_id'].unique())
1075
df.head()

5 rows × 24 columns

def click_counting(x, bin_column):
    clicks = pd.Series(x[x['click'] > 0][bin_column].value_counts(), name='clicks')
    no_clicks = pd.Series(x[x['click'] < 1][bin_column].value_counts(), name='no_clicks')
    counts = pd.DataFrame([clicks, no_clicks]).T.fillna('0')
    counts['total'] = counts['clicks'].astype('int64') + counts['no_clicks'].astype('int64')
    return counts

def bin_counting(counts):
    counts['N+'] = counts['clicks'].astype('int64').divide(counts['total'].astype('int64'))
    counts['N-'] = counts['no_clicks'].astype('int64').divide(counts['total'].astype('int64'))
    counts['log_N+'] = counts['N+'].divide(counts['N-'])
    # If we wanted to only return bin-counting properties, we would filter here
    bin_counts = counts.filter(items=['N+', 'N-', 'log_N+'])
    return counts, bin_counts
bin_column = 'device_id'
device_clicks = click_counting(df.filter(items=[bin_column, 'click']), bin_column)
device_all, device_bin_counts = bin_counting(device_clicks)
# check to make sure we have all the devices len(device_bin_counts)
1075
device_all.sort_values(by = 'total', ascending=False).head(4)
          clicks  no_clicks  total        N+        N-    log_N+
a99f214a    1561       7163   8724  0.178932  0.821068  0.217925
c357dbff       2         15     17  0.117647  0.882353  0.133333
a167aa83       0          9      9  0.000000  1.000000  0.000000
3c0208dc       0          9      9  0.000000  1.000000  0.000000
# We can see how this can change model evaluation time by comparing raw vs. bin-counting size
from sys import getsizeof

print('Our pandas Series, in bytes: ', getsizeof(df.filter(items=['device_id', 'click'])))
print('Our bin-counting feature, in bytes: ', getsizeof(device_bin_counts))

Our pandas Series, in bytes:  730104
Our bin-counting feature, in bytes:  95699
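To feed these statistics into a model, the per-device table must be joined back onto the rows. Since train_subset.csv isn't bundled here, this sketch uses a toy stand-in for the Avazu subset:

```python
import pandas as pd

# toy stand-in for the Avazu subset: device IDs with click labels
df = pd.DataFrame({'device_id': ['a', 'a', 'b', 'a', 'b'],
                   'click':     [1,   0,   0,   1,   0]})

# per-device bin counts: click-through rate N+
stats = df.groupby('device_id')['click'].agg(['sum', 'count'])
stats['N+'] = stats['sum'] / stats['count']

# replace the raw ID with its historical click-through rate
df['device_ctr'] = df['device_id'].map(stats['N+'])
print(df)
```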

What about rare categories? Like rare words, rare categories have too few observations for their statistics to be reliable. One remedy is a back-off bin: accumulate the counts of all categories whose occurrence falls below some threshold into a single shared bin (Figure 1-4), so rare categories borrow statistical strength from one another.

Figure 1-4. Rare categories fall back to a shared back-off bin

Another approach is the count-min sketch [Cormode and Muthukrishnan 2005], a probabilistic data structure in which all categories, frequent or rare, share a fixed array of counters indexed through multiple hash functions; it bounds memory at the cost of some over-counting.
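A back-off bin is easy to sketch with pandas (the threshold and bin name here are arbitrary choices, not from the original):

```python
import pandas as pd

# a categorical column with a long tail of rare values
s = pd.Series(['a'] * 50 + ['b'] * 40 + ['c'] * 2 + ['d'] * 1 + ['e'] * 1)

threshold = 5  # categories seen fewer than 5 times back off to a shared bin
counts = s.value_counts()
rare = counts[counts < threshold].index
backed_off = s.where(~s.isin(rare), other='__rare__')

print(backed_off.value_counts())
```

Statistics computed on the shared bin pool the evidence from all rare categories, and a category never seen in training can be routed to the same bin at serving time.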

Because bin-counting builds features from the target itself, it invites data leakage: the model may indirectly see the labels it is supposed to predict, a trap that catches many Kaggle competitors. Standard defenses are to compute the counts on a strictly earlier time window than the training data, or to add small random noise to the statistics. Kaggle grandmaster Owen Zhang popularized another: leave-one-out counting, where the statistics for each data point exclude that point's own label.

Counts without bounds

If the statistics keep updating as new data arrives, raw counts grow without bound, so the scale of count features drifts away from whatever the model was trained on. Normalized statistics such as the click probability stay bounded between 0 and 1 and avoid the problem; alternatives are log-transforming the counts, or decaying old counts with a sliding window or exponential decay.
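One way to keep continuously updated counts bounded is exponential decay: each new observation is added to a running count that is first shrunk by a constant factor. A minimal sketch (the decay rate 0.9 is an arbitrary choice):

```python
decay = 0.9   # arbitrary decay rate; closer to 1 means a longer memory
count = 0.0
for clicked in [1, 0, 1, 1, 0, 1]:   # a stream of click events for one category
    count = decay * count + clicked

# the decayed count can never exceed 1 / (1 - decay) = 10
print(round(count, 5))
```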

Summary

Plain one-hot encoding

Pros:
  • Easy to implement.
  • Potentially the most accurate.
  • Feasible for online learning.

Cons:
  • Computationally inefficient.
  • Does not adapt to growing categories.
  • Only feasible for linear models.
  • Requires large-scale distributed optimization on truly large datasets.

Feature hashing

Pros:
  • Easy to implement.
  • Makes model training cheaper.
  • Adapts easily to new and growing categories.
  • Handles rare categories gracefully.
  • Feasible for online learning.

Cons:
  • Only suitable for linear or kernelized models.
  • Hashed features are not interpretable.
  • Mixed reports on accuracy.

Bin-counting

Pros:
  • Smallest computational burden at training time.
  • Enables tree-based models.
  • Relatively easy to adapt to new categories.
  • Handles rare categories with back-off bins or the count-min sketch.
  • Features are interpretable.

Cons:
  • Requires historical data.
  • Needs delayed updates; not completely online.
  • Potential for data leakage.
