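[Kaggle] Adult Census Income Prediction Competition Kernel
Last year I took part in the Adult Census Income Prediction competition hosted by T-Academy and KaKr, and shared an EDA notebook there.
KaKr (Kaggle Korea) is the largest Kaggle community in Korea and is reportedly recognized worldwide for its influence.
There is a Facebook group, so if you are interested, join and share Kaggle-related information.
I'm dusting off the kernel I shared as a Kaggle notebook last year and posting it here on the blog.
Competition details are available at [T-Academy X KaKr] 성인 인구조사 소득 예측 대회 (Adult Census Income Prediction Competition). The dataset can be found under the Data tab.
The kernel I shared on Kaggle is available at 캐하~ EDA + LightGBM + PyCaret. You can use Copy and Edit to modify it and run it right away.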
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/kakr-4th-competition/train.csv
/kaggle/input/kakr-4th-competition/test.csv
/kaggle/input/kakr-4th-competition/sample_submission.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
warnings.filterwarnings('ignore')
SEED = 1234
FIG_SIZE = (10, 7)
DIR = '/kaggle/input/kakr-4th-competition'
train = pd.read_csv(os.path.join(DIR, 'train.csv'))
test = pd.read_csv(os.path.join(DIR, 'test.csv'))
- id
- age : age
- workclass : employment type
- fnlwgt : a weight indicating how representative the person is (short for final weight)
- education : education level
- education_num : education level as a number
- marital_status : marital status
- occupation : occupation
- relationship : family relationship
- race : race
- sex : sex
- capital_gain : capital gains
- capital_loss : capital losses
- hours_per_week : working hours per week
- native_country : country of origin
- income : income (the value to predict)
  - >50K : 1
  - <=50K : 0
print(train.shape, test.shape)
(26049, 16) (6512, 15)
train.head()
 | id | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 40 | Private | 168538 | HS-grad | 9 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
1 | 1 | 17 | Private | 101626 | 9th | 5 | Never-married | Machine-op-inspct | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
2 | 2 | 18 | Private | 353358 | Some-college | 10 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 16 | United-States | <=50K |
3 | 3 | 21 | Private | 151158 | Some-college | 10 | Never-married | Prof-specialty | Own-child | White | Female | 0 | 0 | 25 | United-States | <=50K |
4 | 4 | 24 | Private | 122234 | Some-college | 10 | Never-married | Adm-clerical | Not-in-family | Black | Female | 0 | 0 | 20 | ? | <=50K |
test.head()
 | id | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 28 | Private | 67661 | Some-college | 10 | Never-married | Adm-clerical | Other-relative | White | Female | 0 | 0 | 40 | United-States |
1 | 1 | 40 | Self-emp-inc | 37869 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States |
2 | 2 | 20 | Private | 109952 | Some-college | 10 | Never-married | Handlers-cleaners | Own-child | White | Male | 0 | 0 | 25 | United-States |
3 | 3 | 40 | Private | 114537 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States |
4 | 4 | 37 | Private | 51264 | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 99 | France |
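Missing values
No missing values according to isnull() (nice and clean!). Note that some categorical columns use a literal '?' as a placeholder, as the tables above and the value counts further down show.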
train.isnull().sum()
id                0
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
test.isnull().sum()
id                0
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
dtype: int64
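isnull() only counts real NaN values, so a literal '?' placeholder (visible in native_country in the head() output above) goes unnoticed. A minimal sketch of how such placeholders could be counted per column; the toy frame and the helper name count_placeholder are assumptions for illustration and are not part of the original kernel:

```python
import pandas as pd

def count_placeholder(df, placeholder="?"):
    """Return, per column, how many cells equal `placeholder` (non-zero counts only)."""
    counts = df.eq(placeholder).sum()
    return counts[counts > 0]

# Toy data standing in for train/test (hypothetical values).
toy = pd.DataFrame({
    "age": [40, 17, 24],
    "workclass": ["Private", "?", "Private"],
    "native_country": ["United-States", "?", "?"],
})
placeholder_counts = count_placeholder(toy)
print(placeholder_counts)
```

On the real data the same call would be count_placeholder(train) and count_placeholder(test).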
Check info() for each column
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26049 entries, 0 to 26048
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              26049 non-null  int64
 1   age             26049 non-null  int64
 2   workclass       26049 non-null  object
 3   fnlwgt          26049 non-null  int64
 4   education       26049 non-null  object
 5   education_num   26049 non-null  int64
 6   marital_status  26049 non-null  object
 7   occupation      26049 non-null  object
 8   relationship    26049 non-null  object
 9   race            26049 non-null  object
 10  sex             26049 non-null  object
 11  capital_gain    26049 non-null  int64
 12  capital_loss    26049 non-null  int64
 13  hours_per_week  26049 non-null  int64
 14  native_country  26049 non-null  object
 15  income          26049 non-null  object
dtypes: int64(7), object(9)
memory usage: 3.2+ MB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6512 entries, 0 to 6511
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              6512 non-null   int64
 1   age             6512 non-null   int64
 2   workclass       6512 non-null   object
 3   fnlwgt          6512 non-null   int64
 4   education       6512 non-null   object
 5   education_num   6512 non-null   int64
 6   marital_status  6512 non-null   object
 7   occupation      6512 non-null   object
 8   relationship    6512 non-null   object
 9   race            6512 non-null   object
 10  sex             6512 non-null   object
 11  capital_gain    6512 non-null   int64
 12  capital_loss    6512 non-null   int64
 13  hours_per_week  6512 non-null   int64
 14  native_country  6512 non-null   object
dtypes: int64(7), object(8)
memory usage: 763.2+ KB
Target conversion (income)
train['income'].value_counts()
<=50K    19744
>50K      6305
Name: income, dtype: int64
train['income'] = train['income'].apply(lambda x: 0 if x == '<=50K' else 1)
train['income'].value_counts()
0    19744
1     6305
Name: income, dtype: int64
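Combine train + test into all_data (preprocess both at once)
Strictly speaking, the two sets should be processed separately.
The reason to look at the train and test distributions separately is that Kaggle competitions occasionally set a trap by planting values in test that never appear in train.
I had already checked the distributions separately beforehand, so for convenience I concatenate train + test and preprocess them together.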
all_data = pd.concat([train, test], sort=False)
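The separate check mentioned above is not shown in the kernel. One way it could be done is to list, per categorical column, the values that occur in test but never in train. This is a minimal sketch on toy frames; the helper name unseen_categories and the toy values are assumptions for illustration:

```python
import pandas as pd

def unseen_categories(train_df, test_df, columns):
    """Return {column: sorted values that occur in test_df but not in train_df}."""
    result = {}
    for col in columns:
        extra = set(test_df[col].unique()) - set(train_df[col].unique())
        if extra:
            result[col] = sorted(extra)
    return result

# Toy frames standing in for train/test (hypothetical values).
toy_train = pd.DataFrame({"native_country": ["United-States", "Mexico"],
                          "sex": ["Male", "Female"]})
toy_test = pd.DataFrame({"native_country": ["United-States", "Holand-Netherlands"],
                         "sex": ["Female", "Male"]})
unseen = unseen_categories(toy_train, toy_test, ["native_country", "sex"])
print(unseen)
```

On the real data, the column list would be the object-typed columns reported by info().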
workclass
all_data['workclass'].value_counts()
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64
all_data.groupby('workclass')['income'].mean().sort_values().plot(kind='bar', figsize=FIG_SIZE)
- Income is 0 for every row in the Without-pay and Never-worked categories.
- Merge the Without-pay and Never-worked categories into a single Other category.
workclass_other = ['Without-pay', 'Never-worked']
all_data['workclass'] = all_data['workclass'].apply(lambda x: 'Other' if x in workclass_other else x)
all_data['workclass'].value_counts()
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Other                  21
Name: workclass, dtype: int64
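The two categories above were picked by hand. A count-based rule is one possible alternative for folding rare categories together. The sketch below is not the method used in this kernel; the helper merge_rare and the min_count threshold are assumptions for illustration:

```python
import pandas as pd

def merge_rare(series, min_count, other_label="Other"):
    """Replace categories that occur fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; everything else becomes other_label.
    return series.where(~series.isin(rare), other_label)

# Toy column (hypothetical counts).
toy = pd.Series(["Private"] * 5 + ["Without-pay", "Never-worked"])
merged = merge_rare(toy, min_count=2)
print(merged.value_counts())
```

A threshold rule ignores the target, so it can merge categories whose income rates differ; checking the per-category mean first, as done above, is still worthwhile.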
age
age is a numeric column.
Let's look at the distribution of age by income.
df1 = all_data.loc[all_data['income'] == 0, 'age']
df2 = all_data.loc[all_data['income'] == 1, 'age']
plt.figure(figsize=FIG_SIZE)
sns.distplot(df1, kde=True, rug=True, hist=False, color='blue')
sns.distplot(df2, kde=True, rug=True, hist=False, color='red')
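fnlwgt: a weight indicating how representative the person is
The description says it is a weight representing the person, but I'm not sure what that means. The data doesn't come with much explanation, so I looked at the distribution.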
df1 = all_data.loc[all_data['income'] == 0, 'fnlwgt']
df2 = all_data.loc[all_data['income'] == 1, 'fnlwgt']
plt.figure(figsize=FIG_SIZE)
sns.distplot(df1, kde=True, rug=True, hist=False, color='blue')
sns.distplot(df2, kde=True, rug=True, hist=False, color='red')
Splitting by income group shows that the distribution of the fnlwgt column barely differs between the two groups.
Because of possible interactions with other columns, I'm cautious about removing the feature, but if nothing distinctive turns up later, dropping it is worth considering.
g = sns.FacetGrid(all_data, col="income", height=5)
g.map(sns.distplot, 'fnlwgt')
Take the log. (The feature's variance is needlessly large, and bringing it closer to a normal distribution should help optimization.)
all_data['fnlwgt_log'] = np.log(all_data['fnlwgt'])
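Whether the log actually helps can be checked numerically by comparing skewness before and after the transform. A minimal sketch on synthetic data; the lognormal sample is an assumption standing in for fnlwgt and is not the competition data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, strictly positive values (hypothetical stand-in for fnlwgt).
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000))

skew_raw = toy.skew()          # clearly positive for a right-skewed column
skew_log = np.log(toy).skew()  # close to 0 once the long right tail is compressed
print(round(skew_raw, 2), round(skew_log, 2))
```

np.log needs strictly positive input; for a column that contains zeros, np.log1p would be the usual substitute.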
education / education_num
The education column and the education_num column produce identical value_counts().
So I'll keep only one of the two columns and drop the other.
all_data['education'].value_counts()
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64
all_data['education_num'].value_counts()
9     10501
10     7291
13     5355
14     1723
11     1382
7      1175
12     1067
6       933
4       646
15      576
5       514
8       433
16      413
3       333
2       168
1        51
Name: education_num, dtype: int64
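Matching counts strongly suggest the two columns carry the same information, but they do not prove a one-to-one mapping. A direct check is possible with groupby and nunique. This is a minimal sketch on a toy frame; the helper name is_one_to_one is an assumption for illustration:

```python
import pandas as pd

def is_one_to_one(df, col_a, col_b):
    """True if every value of col_a maps to exactly one value of col_b and vice versa."""
    a_to_b = df.groupby(col_a)[col_b].nunique().eq(1).all()
    b_to_a = df.groupby(col_b)[col_a].nunique().eq(1).all()
    return bool(a_to_b and b_to_a)

# Toy rows (hypothetical sample of the two columns).
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "Masters"],
    "education_num": [9, 9, 13, 14],
})
broken = toy.assign(education_num=[9, 10, 13, 14])  # HS-grad now maps to two numbers
print(is_one_to_one(toy, "education", "education_num"))
print(is_one_to_one(broken, "education", "education_num"))
```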
- Income is 0 for every row whose education is Preschool.
all_data.loc[all_data['education'] == 'Preschool', 'income'].sum()
0.0
education and income look fairly strongly related.
- However, too many levels can lead to overfitting during training, so I'll group the levels.
all_data.groupby(['education'])['income'].agg(['mean', 'count']).sort_values('mean')
education | mean | count |
---|---|---|
Preschool | 0.000000 | 40 |
1st-4th | 0.037313 | 134 |
5th-6th | 0.049057 | 265 |
9th | 0.052632 | 418 |
7th-8th | 0.057426 | 505 |
11th | 0.059653 | 922 |
12th | 0.072423 | 359 |
10th | 0.072503 | 731 |
HS-grad | 0.158544 | 8433 |
Some-college | 0.192586 | 5800 |
Assoc-acdm | 0.255344 | 842 |
Assoc-voc | 0.255474 | 1096 |
Bachelors | 0.415516 | 4344 |
Masters | 0.561684 | 1378 |
Prof-school | 0.733906 | 466 |
Doctorate | 0.734177 | 316 |
education_map = {
    'Preschool': 'level_0',
    '1st-4th': 'level_1',
    '5th-6th': 'level_1',
    '7th-8th': 'level_2',
    '9th': 'level_2',
    '10th': 'level_3',
    '11th': 'level_3',
    '12th': 'level_3',
    'HS-grad': 'level_4',
    'Some-college': 'level_5',
    'Assoc-acdm': 'level_6',
    'Assoc-voc': 'level_6',
    'Bachelors': 'level_7',
    'Masters': 'level_8',
    'Prof-school': 'level_9',
    'Doctorate': 'level_9',
}
all_data['education'] = all_data['education'].map(education_map)
all_data['education'].value_counts()
level_4    10501
level_5     7291
level_7     5355
level_3     2541
level_6     2449
level_8     1723
level_2     1160
level_9      989
level_1      501
level_0       51
Name: education, dtype: int64
I grouped the levels this way for now; later it may be worth experimenting with different groupings.
level_1, level_2, and level_3 show almost no difference.
all_data.pivot_table(index='education', values=['income']).sort_values('income').plot(kind='bar', figsize=FIG_SIZE)
Mean income for Preschool (level_0) = 0
The unused education_num column is dropped.
all_data = all_data.drop('education_num', axis=1)
all_data.columns
Index(['id', 'age', 'workclass', 'fnlwgt', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income', 'fnlwgt_log'], dtype='object')
marital_status
all_data['marital_status'].value_counts()
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital_status, dtype: int64
all_data.pivot_table(index='marital_status', values='income', aggfunc=['mean', 'count'])#.sort_values('income')
marital_status | income (mean) | income (count) |
---|---|---|
Divorced | 0.104921 | 3536 |
Married-AF-spouse | 0.526316 | 19 |
Married-civ-spouse | 0.448789 | 11970 |
Married-spouse-absent | 0.080838 | 334 |
Never-married | 0.046802 | 8568 |
Separated | 0.065375 | 826 |
Widowed | 0.087940 | 796 |
The Married-AF-spouse category has very few rows.
I'll fold it into the similar group Married-civ-spouse.
all_data.loc[all_data['marital_status'] == 'Married-AF-spouse', 'marital_status'] = 'Married-civ-spouse'
occupation
all_data['occupation'].value_counts()
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64
all_data.groupby('occupation')['income'].mean().sort_values().plot(kind='bar', figsize=FIG_SIZE)
- Every row with occupation == 'Armed-Forces' has income 0.
- Armed-Forces also has very few rows, so to guard against overfitting it is merged with Priv-house-serv.
all_data.loc[all_data['occupation'].isin(['Armed-Forces', 'Priv-house-serv']), 'income'].value_counts()
0.0    129
1.0      1
Name: income, dtype: int64
all_data.loc[all_data['occupation'].isin(['Armed-Forces', 'Priv-house-serv']), 'occupation'] = 'Priv-house-serv'
all_data['occupation'].value_counts()
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       158
Name: occupation, dtype: int64
relationship
Nothing notable turned up in the relationship column, so I leave it as is.
all_data['relationship'].value_counts()
Husband           13193
Not-in-family      8305
Own-child          5068
Unmarried          3446
Wife               1568
Other-relative      981
Name: relationship, dtype: int64
all_data.groupby('relationship')['income'].mean().plot(kind='bar', figsize=FIG_SIZE)
race
Nothing notable in the race column either, so I leave it as is.
all_data['race'].value_counts()
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
Check income by race
all_data.groupby('race')['income'].mean().plot(kind='bar', figsize=FIG_SIZE)
sex
Nothing notable in the sex column either, so I leave it as is.
all_data['sex'].value_counts()
Male      21790
Female    10771
Name: sex, dtype: int64
all_data.groupby('sex')['income'].mean().plot(kind='bar', figsize=FIG_SIZE)
capital_gain
My approach here is to look at capital_gain and capital_loss together.
(Hypothesis) - Wouldn't a larger capital_gain go with a higher income level?
plt.figure(figsize=(12, 9))
sns.distplot(all_data.loc[all_data['capital_gain'] > 0, 'capital_gain'])
Here's an interesting finding:
whenever capital_gain > 50000, income is always 1.
g = sns.FacetGrid(all_data.loc[all_data['capital_gain']> 0], col="income", height=7, aspect=1.5)
g.map(sns.distplot, 'capital_gain')
capital_gain and capital_loss both look numerical, but they take few enough distinct values that they can also be treated as categorical.
So I check the value distribution per income group with value_counts().
The specific keys that appear in the income == 1 group and the keys that appear in the income == 0 group turn out to be sharply separated.
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)
df1 = train.loc[(train['income'] == 0) & (train['capital_gain'] > 0), 'capital_gain'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])
df1 = train.loc[(train['income'] == 1) & (train['capital_gain'] > 0), 'capital_gain'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[1])
plt.tight_layout()
plt.show()
capital_loss
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)
df1 = train.loc[(train['income'] == 0) & (train['capital_loss'] > 0), 'capital_loss'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])
df1 = train.loc[(train['income'] == 1) & (train['capital_loss'] > 0), 'capital_loss'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[1])
plt.tight_layout()
plt.show()
capital net
Compute the net value as capital_gain - capital_loss.
all_data['capital_net'] = all_data['capital_gain'] - all_data['capital_loss']
train['capital_net'] = train['capital_gain'] - train['capital_loss']
test['capital_net'] = test['capital_gain'] - test['capital_loss']
plt.figure(figsize=(16, 9))
plt.subplot(1, 2, 1)
sns.distplot(train.loc[ (train['capital_net'] > 0) & (train['income'] == 1), 'capital_net'])
plt.subplot(1, 2, 2)
sns.distplot(train.loc[ (train['capital_net'] > 0) & (train['income'] == 0), 'capital_net'])
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)
df1 = all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])
df2 = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index()
df2.plot(kind='bar', ax=axes[1])
plt.tight_layout()
plt.show()
Extract the capital_net key values that occur with income == 1 or income == 0
pos_key = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
all_key = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
all_key.extend(all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist())
all_key[:5]
[3103, 4386, 4687, 4787, 4934]
A few of them do overlap.
df1 = all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'].isin(pos_key)), 'capital_net'].value_counts().sort_index()
df1.plot(kind='bar')
pos_key = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
neg_key = all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
I want to keep only the keys that don't overlap.
capital_net_pos_key = [key for key in pos_key if key not in neg_key]
capital_net_neg_key = [key for key in neg_key if key not in pos_key]
all_data['capital_net_pos_key'] = all_data['capital_net'].apply(lambda x: x in capital_net_pos_key)
all_data['capital_net_neg_key'] = all_data['capital_net'].apply(lambda x: x in capital_net_neg_key)
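The same exclusive-key logic can be written with set differences and isin, which avoids repeated list lookups inside apply. A minimal sketch on a toy frame; the toy values are assumptions for illustration:

```python
import pandas as pd

# Toy rows (hypothetical capital_net values and labels).
toy = pd.DataFrame({
    "capital_net": [3103, 3103, 4386, 5000, 7000, 7000, 0],
    "income":      [1,    0,    1,    0,    1,    1,    0],
})

positive = toy[toy["capital_net"] > 0]
pos_keys = set(positive.loc[positive["income"] == 1, "capital_net"])
neg_keys = set(positive.loc[positive["income"] == 0, "capital_net"])

# Keys seen with only one of the two labels.
pos_only = pos_keys - neg_keys
neg_only = neg_keys - pos_keys

toy["capital_net_pos_key"] = toy["capital_net"].isin(pos_only)
toy["capital_net_neg_key"] = toy["capital_net"].isin(neg_only)
print(sorted(pos_only), sorted(neg_only))
```

These flags are derived from the labels themselves, so a validation score computed on rows that contributed to the key sets may look better than it would on unseen data.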
hours_per_week
Many people work 40 hours a week.
Among those working 40 hours or more, income == 1 shows up more often.
all_data['hours_per_week'].value_counts()
40    15217
50     2819
45     1824
60     1475
35     1297
      ...
92        1
94        1
87        1
74        1
82        1
Name: hours_per_week, Length: 94, dtype: int64
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)
df1 = all_data.loc[(all_data['income'] == 0), 'hours_per_week'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])
df2 = all_data.loc[(all_data['income'] == 1), 'hours_per_week'].value_counts().sort_index()
df2.plot(kind='bar', ax=axes[1])
plt.tight_layout()
plt.show()
native_country
The country column was a bit of a headache.
It has many distinct values, and some of them have only a handful of rows.
I'll merge them into groups.
train['native_country'].value_counts().shape, test['native_country'].value_counts().shape
((41,), (42,))
all_data['native_country'].value_counts()
United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                           29
Greece                           29
Ecuador                          28
Ireland                          24
Hong                             20
Cambodia                         19
Trinadad&Tobago                  19
Laos                             18
Thailand                         18
Yugoslavia                       16
Outlying-US(Guam-USVI-etc)       14
Honduras                         13
Hungary                          13
Scotland                         12
Holand-Netherlands                1
Name: native_country, dtype: int64
Later I also want to try grouping countries by national income level, using the Wikipedia list below.
List of countries by GNI (nominal) per capita (Wikipedia)
all_data.groupby('native_country')['income'].mean().reset_index()
 | native_country | income |
---|---|---|
0 | ? | 0.234649 |
1 | Cambodia | 0.428571 |
2 | Canada | 0.315217 |
3 | China | 0.228070 |
4 | Columbia | 0.038462 |
5 | Cuba | 0.263158 |
6 | Dominican-Republic | 0.041667 |
7 | Ecuador | 0.166667 |
8 | El-Salvador | 0.088608 |
9 | England | 0.343284 |
10 | France | 0.416667 |
11 | Germany | 0.346535 |
12 | Greece | 0.250000 |
13 | Guatemala | 0.057692 |
14 | Haiti | 0.114286 |
15 | Holand-Netherlands | NaN |
16 | Honduras | 0.000000 |
17 | Hong | 0.285714 |
18 | Hungary | 0.272727 |
19 | India | 0.402597 |
20 | Iran | 0.485714 |
21 | Ireland | 0.222222 |
22 | Italy | 0.380000 |
23 | Jamaica | 0.109375 |
24 | Japan | 0.404255 |
25 | Laos | 0.133333 |
26 | Mexico | 0.048689 |
27 | Nicaragua | 0.071429 |
28 | Outlying-US(Guam-USVI-etc) | 0.000000 |
29 | Peru | 0.076923 |
30 | Philippines | 0.300613 |
31 | Poland | 0.212766 |
32 | Portugal | 0.066667 |
33 | Puerto-Rico | 0.115789 |
34 | Scotland | 0.250000 |
35 | South | 0.222222 |
36 | Taiwan | 0.461538 |
37 | Thailand | 0.153846 |
38 | Trinadad&Tobago | 0.071429 |
39 | United-States | 0.247315 |
40 | Vietnam | 0.080000 |
41 | Yugoslavia | 0.416667 |
income_01 = ['Jamaica', 'Haiti', 'Puerto-Rico', 'Laos', 'Thailand', 'Ecuador']
income_02 = ['Outlying-US(Guam-USVI-etc)', 'Honduras', 'Columbia', 'Dominican-Republic',
             'Mexico', 'Guatemala', 'Portugal', 'Trinadad&Tobago', 'Nicaragua', 'Peru',
             'Vietnam', 'El-Salvador']
income_03 = ['Poland', 'Ireland', 'South', 'China']
income_04 = ['United-States']
income_05 = ['Greece', 'Scotland', 'Cuba', 'Hungary', 'Hong', 'Holand-Netherlands']
income_06 = ['Philippines', 'Canada']
income_07 = ['England', 'Germany']
income_08 = ['Italy', 'India', 'Japan', 'France', 'Yugoslavia', 'Cambodia']
income_09 = ['Taiwan', 'Iran']
income_other = ['?']
def convert_country(x):
    # Map a native_country value to its income-rate group label.
    if x in income_01:
        return 'income_01'
    elif x in income_02:
        return 'income_02'
    elif x in income_03:
        return 'income_03'
    elif x in income_04:
        return 'income_04'
    elif x in income_05:
        return 'income_05'
    elif x in income_06:
        return 'income_06'
    elif x in income_07:
        return 'income_07'
    elif x in income_08:
        return 'income_08'
    elif x in income_09:
        return 'income_09'
    else:
        # '?' and any value not listed above
        return 'income_other'
all_data['country_bin'] = all_data['native_country'].apply(convert_country)
all_data['country_bin'].value_counts()
income_04       29170
income_02        1157
income_other      583
income_06         319
income_01         303
income_08         299
income_03         239
income_07         227
income_05         170
income_09          94
Name: country_bin, dtype: int64
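The if/elif chain can also be expressed as a lookup table, which keeps the grouping in one place and makes regrouping easier later. A minimal sketch, assuming only three of the groups above for brevity; the names country_groups and country_to_bin and the value 'Atlantis' are made up for illustration:

```python
import pandas as pd

# Subset of the groups defined above (the remaining groups would be added the same way).
country_groups = {
    "income_04": ["United-States"],
    "income_06": ["Philippines", "Canada"],
    "income_09": ["Taiwan", "Iran"],
}
# Invert to {country: group label}.
country_to_bin = {country: label
                  for label, countries in country_groups.items()
                  for country in countries}

toy = pd.Series(["United-States", "Canada", "Iran", "?", "Atlantis"])
# Anything not in the table ('?' or an unlisted country) falls back to income_other.
country_bin = toy.map(country_to_bin).fillna("income_other")
print(country_bin.tolist())
```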
Define Features
Let's pick the features worth using.
all_data.columns
Index(['id', 'age', 'workclass', 'fnlwgt', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income', 'fnlwgt_log', 'capital_net', 'capital_net_pos_key', 'capital_net_neg_key', 'country_bin'], dtype='object')
features = [
    # 'id',
    'age',
    'workclass',
    # 'fnlwgt',
    'fnlwgt_log',
    'education',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
    'native_country',
    # 'income',
    # 'capital_net',  # removed: highly correlated with capital_gain
    'capital_net_pos_key',
    'capital_net_neg_key',
    'country_bin',
]
label = [
    'income'
]
all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32561 entries, 0 to 6511
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   id                   32561 non-null  int64
 1   age                  32561 non-null  int64
 2   workclass            32561 non-null  object
 3   fnlwgt               32561 non-null  int64
 4   education            32561 non-null  object
 5   marital_status       32561 non-null  object
 6   occupation           32561 non-null  object
 7   relationship         32561 non-null  object
 8   race                 32561 non-null  object
 9   sex                  32561 non-null  object
 10  capital_gain         32561 non-null  int64
 11  capital_loss         32561 non-null  int64
 12  hours_per_week       32561 non-null  int64
 13  native_country       32561 non-null  object
 14  income               26049 non-null  float64
 15  fnlwgt_log           32561 non-null  float64
 16  capital_net          32561 non-null  int64
 17  capital_net_pos_key  32561 non-null  bool
 18  capital_net_neg_key  32561 non-null  bool
 19  country_bin          32561 non-null  object
dtypes: bool(2), float64(2), int64(7), object(9)
memory usage: 6.0+ MB
plt.figure(figsize=(12, 12))
# restrict to numeric and boolean columns; newer pandas raises on object columns in corr()
sns.heatmap(abs(all_data.select_dtypes(include=['number', 'bool']).corr()), annot=True)
all_data_dummies = pd.get_dummies(all_data[features + label])
all_data_dummies.head()
 | age | fnlwgt_log | capital_gain | capital_loss | hours_per_week | capital_net_pos_key | capital_net_neg_key | income | workclass_? | workclass_Federal-gov | ... | country_bin_income_01 | country_bin_income_02 | country_bin_income_03 | country_bin_income_04 | country_bin_income_05 | country_bin_income_06 | country_bin_income_07 | country_bin_income_08 | country_bin_income_09 | country_bin_income_other |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40 | 12.034917 | 0 | 0 | 60 | False | False | 1.0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 17 | 11.529055 | 0 | 0 | 20 | False | False | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 18 | 12.775237 | 0 | 0 | 16 | False | False | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 21 | 11.926081 | 0 | 0 | 25 | False | False | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 24 | 11.713693 | 0 | 0 | 20 | False | False | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 111 columns
train_features = all_data_dummies.drop('income', axis=1).iloc[:len(train)]
test_features = all_data_dummies.drop('income', axis=1).iloc[len(train):]
train_label = train[label]
train_features.shape, test_features.shape
((26049, 110), (6512, 110))
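Running get_dummies on the combined frame is what guarantees that train_features and test_features end up with the same 110 columns. If the two sets were encoded separately, the columns would have to be aligned by hand. A minimal sketch of that alternative on toy frames; the toy values are assumptions for illustration:

```python
import pandas as pd

# Toy frames: 'State-gov' appears only in test, 'Local-gov' only in train.
toy_train = pd.DataFrame({"workclass": ["Private", "Local-gov"]})
toy_test = pd.DataFrame({"workclass": ["Private", "State-gov"]})

train_enc = pd.get_dummies(toy_train)
# Reindex test to the train columns: missing columns are filled with 0,
# and columns that exist only in test are dropped.
test_enc = pd.get_dummies(toy_test).reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```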
Model (LightGBM)
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import precision_score, recall_score, classification_report, f1_score, confusion_matrix
from sklearn.metrics import log_loss
from tqdm import tqdm_notebook
import lightgbm as lgbm
x_train, x_valid, y_train, y_valid = train_test_split(train_features, train_label, stratify=train_label, test_size=0.2, random_state=SEED)
NUM_BOOST_ROUND = 10000
N_SPLITS = 5
lgbm_param = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'colsample_bytree': 1.0,
    'importance_type': 'split',
    'learning_rate': 0.1,
    'min_child_samples': 20,
    'min_child_weight': 0.001,
    'min_split_gain': 0,
    'n_estimators': 10000,
    'num_leaves': 40,
    'random_state': SEED,
    'early_stopping_rounds': 200,
    'reg_alpha': 0.6,
    'reg_lambda': 0.5,
    'subsample': 1.0,
    'subsample_for_bin': 200000,
    'subsample_freq': 0,
    'n_jobs': -1,
}
dtrain = lgbm.Dataset(x_train, y_train)
dvalid = lgbm.Dataset(x_valid, y_valid)
model = lgbm.train(lgbm_param, dtrain, NUM_BOOST_ROUND,
                   valid_sets=(dtrain, dvalid),
                   valid_names=('train', 'valid'),
                   verbose_eval=100,
                   )
Training until validation scores don't improve for 200 rounds
[100] train's binary_logloss: 0.224887 valid's binary_logloss: 0.284848
[200] train's binary_logloss: 0.19576 valid's binary_logloss: 0.291253
Early stopping, best iteration is:
[65] train's binary_logloss: 0.239304 valid's binary_logloss: 0.282915
Check the F1 score at a given threshold
threshold = 0.5
valid_prediction = model.predict(x_valid)
valid_prediction[valid_prediction > threshold] = 1
valid_prediction[valid_prediction <= threshold] = 0
print(classification_report(y_valid, valid_prediction))
              precision    recall  f1-score   support
           0       0.89      0.94      0.92      3949
           1       0.78      0.64      0.70      1261
    accuracy                           0.87      5210
   macro avg       0.83      0.79      0.81      5210
weighted avg       0.86      0.87      0.86      5210
Check how the F1 score changes with the threshold
f1_threshold = np.linspace(0.4, 0.6, 30)
f1_scores = []
max_score = 0
max_threshold = 0
valid_proba = model.predict(x_valid)  # predict once and reuse the probabilities for every threshold
for t in f1_threshold:
    # binarize into a separate array so valid_prediction (threshold 0.5) is not overwritten
    pred_t = (valid_proba > t).astype(int)
    score_ = f1_score(y_valid, pred_t)
    f1_scores.append(score_)
    if score_ > max_score:
        max_score = score_
        max_threshold = t
plt.figure(figsize=(16, 6))
plt.plot(f1_threshold, f1_scores)
plt.axvline(x=max_threshold, linestyle=':', color='r')
plt.xticks(f1_threshold, rotation=90)
plt.show()
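The sweep only needs the predicted probabilities once; each threshold is just a different cut of the same array. A minimal, dependency-light sketch of the idea on toy values. The helper f1_at_threshold and the toy labels and probabilities are assumptions for illustration; in the kernel itself, sklearn's f1_score plays this role:

```python
import numpy as np

def f1_at_threshold(y_true, proba, threshold):
    """F1 of the positive class when predicting 1 for proba > threshold."""
    pred = proba > threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy labels and predicted probabilities (hypothetical).
y_true = np.array([0, 0, 1, 1, 1, 0])
proba = np.array([0.10, 0.43, 0.48, 0.58, 0.90, 0.52])

thresholds = np.linspace(0.4, 0.6, 5)
scores = [f1_at_threshold(y_true, proba, t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print(best_threshold, max(scores))
```

A threshold chosen on the validation split is tuned to that split, so the gain may shrink on the test set.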
confusion_matrix
plt.figure(figsize=FIG_SIZE)
sns.heatmap(confusion_matrix(y_valid, valid_prediction), annot=True, fmt='g')
Prediction
pred = model.predict(test_features)
Check the distribution of the pred values
plt.figure(figsize=FIG_SIZE)
sns.distplot(pred)
# use 0.5 as the default threshold
THRESHOLD = 0.5
print(len(pred[pred >= THRESHOLD]) / len(pred[pred < THRESHOLD]))
0.25062415978490493
pred[pred >= THRESHOLD] = 1
pred[pred < THRESHOLD] = 0
income_pct = train['income'].value_counts()[1] / train['income'].value_counts()[0]
income_pct
0.3193375202593193
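Both numbers printed above are ratios of positives to negatives (odds), not the share of positives, so they are best compared with each other rather than read as percentages. A small sketch of the conversion, using the train counts shown earlier (19744 and 6305) and the prediction ratio printed above; the variable names are chosen for illustration:

```python
# Counts from train['income'].value_counts() shown earlier.
n_neg, n_pos = 19744, 6305

train_odds = n_pos / n_neg             # what income_pct computes
train_share = n_pos / (n_pos + n_neg)  # fraction of rows with income == 1

# Ratio printed for the test predictions at THRESHOLD = 0.5.
pred_odds = 0.25062415978490493
pred_share = pred_odds / (1 + pred_odds)  # odds -> share

print(round(train_odds, 4), round(train_share, 4), round(pred_share, 4))
```

Either way, the model predicts income == 1 less often on test than the rate observed in train.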
plt.figure(figsize=(10, 6))
plt.subplot(121)
sns.countplot(pred)
plt.subplot(122)
sns.countplot(train['income'])
plt.show()
PyCaret
!pip install pycaret
Collecting pycaret
Downloading pycaret-2.1.2-py3-none-any.whl (252 kB)
...
/opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (19.3.0) Requirement already satisfied: htmlmin>=0.1.12 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (0.1.12) Collecting visions[type_image_path]==0.5.0 Downloading visions-0.5.0-py3-none-any.whl (64 kB) [K |████████████████████████████████| 64 kB 2.2 MB/s [?25hCollecting tangled-up-in-unicode>=0.0.6 Downloading tangled_up_in_unicode-0.0.6-py3-none-any.whl (3.1 MB) [K |████████████████████████████████| 3.1 MB 24.2 MB/s [?25hRequirement already satisfied: phik>=0.9.10 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (0.9.11) Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (2.6.1) Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (0.1.0) Requirement already satisfied: pexpect; sys_platform != "win32" in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (4.8.0) Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (3.0.5) Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (0.15.2) Requirement already satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (4.3.3) Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (0.7.5) Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (4.4.2) Requirement already satisfied: widgetsnbextension~=3.5.0 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret) (3.5.1) Requirement already satisfied: ipykernel>=4.5.1 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret) (5.1.1) Requirement already satisfied: nbformat>=4.2.0 in 
/opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret) (5.0.6) Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (2020.6.20) Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (1.24.3) Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (2.9) Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (3.0.4) Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /opt/conda/lib/python3.7/site-packages (from catalogue<1.1.0,>=0.0.7->spacy->pycaret) (2.0.0) Requirement already satisfied: py>=1.5.0 in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (1.8.1) Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (20.1) Requirement already satisfied: more-itertools>=4.0.0 in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (8.2.0) Requirement already satisfied: pluggy<1.0,>=0.12 in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (0.13.0) Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (0.1.9) Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from jinja2>=2.7.2->pyLDAvis->pycaret) (1.1.1) Requirement already satisfied: tabulate>=0.7.7 in /opt/conda/lib/python3.7/site-packages (from databricks-cli>=0.8.7->mlflow->pycaret) (0.8.7) Collecting tenacity>=6.2.0 Downloading tenacity-6.2.0-py2.py3-none-any.whl (24 kB) Requirement already satisfied: Mako in /opt/conda/lib/python3.7/site-packages (from alembic<=1.4.1->mlflow->pycaret) (1.1.3) Requirement 
already satisfied: python-editor>=0.3 in /opt/conda/lib/python3.7/site-packages (from alembic<=1.4.1->mlflow->pycaret) (1.0.4) Requirement already satisfied: websocket-client>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from docker>=4.0.0->mlflow->pycaret) (0.57.0) Collecting azure-core<2.0.0,>=1.6.0 Downloading azure_core-1.8.2-py2.py3-none-any.whl (122 kB) [K |████████████████████████████████| 122 kB 28.6 MB/s [?25hCollecting msrest>=0.6.10 Downloading msrest-0.6.19-py2.py3-none-any.whl (84 kB) [K |████████████████████████████████| 84 kB 1.9 MB/s [?25hRequirement already satisfied: cryptography>=2.1.4 in /opt/conda/lib/python3.7/site-packages (from azure-storage-blob>=12.0->mlflow->pycaret) (2.8) Requirement already satisfied: gitdb<5,>=4.0.1 in /opt/conda/lib/python3.7/site-packages (from gitpython>=2.1.0->mlflow->pycaret) (4.0.4) Requirement already satisfied: prometheus_client in /opt/conda/lib/python3.7/site-packages (from prometheus-flask-exporter->mlflow->pycaret) (0.7.1) Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask->mlflow->pycaret) (1.0.1) Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask->mlflow->pycaret) (1.1.0) Requirement already satisfied: llvmlite<0.32.0,>=0.31.0dev0 in /opt/conda/lib/python3.7/site-packages (from numba>=0.35->pyod->pycaret) (0.31.0) Requirement already satisfied: patsy>=0.5 in /opt/conda/lib/python3.7/site-packages (from statsmodels->pyod->pycaret) (0.5.1) Requirement already satisfied: boto3 in /opt/conda/lib/python3.7/site-packages (from smart-open>=1.8.1->gensim->pycaret) (1.15.13) Requirement already satisfied: networkx>=2.4 in /opt/conda/lib/python3.7/site-packages (from visions[type_image_path]==0.5.0->pandas-profiling>=2.8.0->pycaret) (2.4) Requirement already satisfied: imagehash; extra == "type_image_path" in /opt/conda/lib/python3.7/site-packages (from 
visions[type_image_path]==0.5.0->pandas-profiling>=2.8.0->pycaret) (4.1.0) Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect; sys_platform != "win32"->IPython->pycaret) (0.6.0) Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->IPython->pycaret) (0.5.2) Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from traitlets>=4.2->IPython->pycaret) (0.2.0) Requirement already satisfied: notebook>=4.4.1 in /opt/conda/lib/python3.7/site-packages (from widgetsnbextension~=3.5.0->ipywidgets->pycaret) (5.5.0) Requirement already satisfied: tornado>=4.2 in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (5.0.2) Requirement already satisfied: jupyter-client in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (6.1.3) Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (4.6.3) Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /opt/conda/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (3.2.0) Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy->pycaret) (3.1.0) Collecting isodate>=0.6.0 Downloading isodate-0.6.0-py2.py3-none-any.whl (45 kB) [K |████████████████████████████████| 45 kB 1.4 MB/s [?25hRequirement already satisfied: requests-oauthlib>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from msrest>=0.6.10->azure-storage-blob>=12.0->mlflow->pycaret) (1.2.0) Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.1.4->azure-storage-blob>=12.0->mlflow->pycaret) (1.14.0) Requirement already satisfied: smmap<4,>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from 
gitdb<5,>=4.0.1->gitpython>=2.1.0->mlflow->pycaret) (3.0.2) Requirement already satisfied: botocore<1.19.0,>=1.18.13 in /opt/conda/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim->pycaret) (1.18.13) Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim->pycaret) (0.10.0) Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim->pycaret) (0.3.3) Requirement already satisfied: PyWavelets in /opt/conda/lib/python3.7/site-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.5.0->pandas-profiling>=2.8.0->pycaret) (1.1.1) Requirement already satisfied: nbconvert in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (5.6.1) Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (19.0.0) Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (1.5.0) Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.8.3) Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->pycaret) (0.16.0) Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib>=0.5.0->msrest>=0.6.10->azure-storage-blob>=12.0->mlflow->pycaret) (3.0.1) Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.1.4->azure-storage-blob>=12.0->mlflow->pycaret) (2.20) Requirement already satisfied: pandocfilters>=1.4.1 in 
/opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (1.4.2) Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (3.1.4) Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.8.4) Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.4.4) Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.6.0) Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.5.1) Building wheels for collected packages: pyod, databricks-cli, alembic, sqlalchemy, querystring-parser, prometheus-flask-exporter, combo, suod Building wheel for pyod (setup.py) ... [?25l- \ | done [?25h Created wheel for pyod: filename=pyod-0.8.3-py3-none-any.whl size=110347 sha256=6858fa6eda242cf3101a17d6095f16ac26c9dc1b497ae42b4ee3cfeb5156be5d Stored in directory: /root/.cache/pip/wheels/fc/fc/77/6e530134c9ee2b45ef0840f0c8046b3be595624881cf533d7a Building wheel for databricks-cli (setup.py) ... [?25l- \ | done [?25h Created wheel for databricks-cli: filename=databricks_cli-0.12.2-py3-none-any.whl size=101163 sha256=be3329799d7581f8e81992ec5ca7ab24167fb3b92d0187c4a62c8e049195a955 Stored in directory: /root/.cache/pip/wheels/9e/bb/9d/78e02afa234019a22759d08d285bae87a88fa881f5db58db25 Building wheel for alembic (setup.py) ... 
[?25l- \ | done [?25h Created wheel for alembic: filename=alembic-1.4.1-py2.py3-none-any.whl size=158154 sha256=3a4b7a763a6ce226a933b9aa155d11719a424ddff92458f00091c0e7c3bd50cf Stored in directory: /root/.cache/pip/wheels/be/5d/0a/9e13f53f4f5dfb67cd8d245bb7cdffe12f135846f491a283e3 Building wheel for sqlalchemy (setup.py) ... [?25l- \ | / - \ | / - \ done [?25h Created wheel for sqlalchemy: filename=SQLAlchemy-1.3.13-cp37-cp37m-linux_x86_64.whl size=1221862 sha256=8a33081e209764349239860912cc6cde907f8613081f067c95fecca823653bd3 Stored in directory: /root/.cache/pip/wheels/b9/ba/77/163f10f14bd489351530603e750c195b0ceceed2f3be2b32f1 Building wheel for querystring-parser (setup.py) ... [?25l- \ done [?25h Created wheel for querystring-parser: filename=querystring_parser-1.2.4-py3-none-any.whl size=7076 sha256=eed4ac8c5058079d17a797b70ae3ed32cbea0c95ed24a361365cd86217a49dca Stored in directory: /root/.cache/pip/wheels/69/38/7a/072b5863ca334d012821a287fd1d066cea33abdcda3ef2f878 Building wheel for prometheus-flask-exporter (setup.py) ... [?25l- \ done [?25h Created wheel for prometheus-flask-exporter: filename=prometheus_flask_exporter-0.18.1-py3-none-any.whl size=17157 sha256=66518d2e9e9f0b4e8e78fa57037102604ebeae2bff773b5d1fcc7350404f267d Stored in directory: /root/.cache/pip/wheels/c4/b6/b5/e76659f3b2a3a226565e27f0a7eb7a3ac93c3f4d68acfbe617 Building wheel for combo (setup.py) ... [?25l- \ done [?25h Created wheel for combo: filename=combo-0.1.1-py3-none-any.whl size=42113 sha256=ab8b32daeae645fb4bc1c7d5de5441eeaad8eefa09bdaf459e8168e23e25d8b5 Stored in directory: /root/.cache/pip/wheels/3e/e1/f8/08f19ba48f75d3dbbb549cec4b86cc0392c14b2b6bb81f4e1f Building wheel for suod (setup.py) ... 
[?25l- \ | / done [?25h Created wheel for suod: filename=suod-0.0.4-py3-none-any.whl size=2167157 sha256=cc1b8461f955dadb8d45e095b62560588668395536e17bb7953d4c329d640dff Stored in directory: /root/.cache/pip/wheels/dc/ae/aa/3b8cc857617f3ba6cb9e6b804c79c69d0ed60a08e022e9a4f3 Successfully built pyod databricks-cli alembic sqlalchemy querystring-parser prometheus-flask-exporter combo suod Installing collected packages: datefinder, tenacity, databricks-cli, sqlalchemy, alembic, azure-core, isodate, msrest, azure-storage-blob, gunicorn, querystring-parser, gorilla, prometheus-flask-exporter, mlflow, combo, suod, pyod, tangled-up-in-unicode, visions, pandas-profiling, pycaret Attempting uninstall: tenacity Found existing installation: tenacity 6.1.0 Uninstalling tenacity-6.1.0: Successfully uninstalled tenacity-6.1.0 Attempting uninstall: sqlalchemy Found existing installation: SQLAlchemy 1.3.16 Uninstalling SQLAlchemy-1.3.16: Successfully uninstalled SQLAlchemy-1.3.16 Attempting uninstall: alembic Found existing installation: alembic 1.4.3 Uninstalling alembic-1.4.3: Successfully uninstalled alembic-1.4.3 Attempting uninstall: tangled-up-in-unicode Found existing installation: tangled-up-in-unicode 0.0.4 Uninstalling tangled-up-in-unicode-0.0.4: Successfully uninstalled tangled-up-in-unicode-0.0.4 Attempting uninstall: visions Found existing installation: visions 0.4.1 Uninstalling visions-0.4.1: Successfully uninstalled visions-0.4.1 Attempting uninstall: pandas-profiling Found existing installation: pandas-profiling 2.6.0 Uninstalling pandas-profiling-2.6.0: Successfully uninstalled pandas-profiling-2.6.0 [31mERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts. We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default. 
pandas-profiling 2.9.0 requires seaborn>=0.10.1, but you'll have seaborn 0.10.0 which is incompatible.[0m Successfully installed alembic-1.4.1 azure-core-1.8.2 azure-storage-blob-12.5.0 combo-0.1.1 databricks-cli-0.12.2 datefinder-0.7.1 gorilla-0.3.0 gunicorn-20.0.4 isodate-0.6.0 mlflow-1.11.0 msrest-0.6.19 pandas-profiling-2.9.0 prometheus-flask-exporter-0.18.1 pycaret-2.1.2 pyod-0.8.3 querystring-parser-1.2.4 sqlalchemy-1.3.13 suod-0.0.4 tangled-up-in-unicode-0.0.6 tenacity-6.2.0 visions-0.5.0 [33mWARNING: You are using pip version 20.2.3; however, version 20.2.4 is available. You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.[0m
from pycaret.classification import *
Check the features and label lists defined above.
features, label
(['age', 'workclass', 'fnlwgt_log', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'capital_net_pos_key', 'capital_net_neg_key', 'country_bin'], ['income'])
# Copy the selected columns so the type casts below modify an independent frame
all_data_caret = all_data[features + label].copy()
all_data_caret.head()
age | workclass | fnlwgt_log | education | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | capital_net_pos_key | capital_net_neg_key | country_bin | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40 | Private | 12.034917 | level_4 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 60 | United-States | False | False | income_04 | 1.0 |
1 | 17 | Private | 11.529055 | level_2 | Never-married | Machine-op-inspct | Own-child | White | Male | 0 | 0 | 20 | United-States | False | False | income_04 | 0.0 |
2 | 18 | Private | 12.775237 | level_5 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 16 | United-States | False | False | income_04 | 0.0 |
3 | 21 | Private | 11.926081 | level_5 | Never-married | Prof-specialty | Own-child | White | Female | 0 | 0 | 25 | United-States | False | False | income_04 | 0.0 |
4 | 24 | Private | 11.713693 | level_5 | Never-married | Adm-clerical | Not-in-family | Black | Female | 0 | 0 | 20 | ? | False | False | income_other | 0.0 |
Without explicit type casting, the PyCaret setup did not work properly, so the numeric columns are cast to float first.
all_data_caret['age'] = all_data_caret['age'].astype('float')
# all_data_caret['capital_net'] = all_data_caret['capital_net'].astype('float')
all_data_caret['hours_per_week'] = all_data_caret['hours_per_week'].astype('float')
all_data_caret['capital_gain'] = all_data_caret['capital_gain'].astype('float')
all_data_caret['capital_loss'] = all_data_caret['capital_loss'].astype('float')
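# Split back into train and test rows by position; copy so the cast below does not touch a view
train_clean = all_data_caret[:len(train)].copy()
test_clean = all_data_caret[len(train):].copy()
# The label is stored as float in the combined frame, so restore it to int for the training rows
train_clean['income'] = train_clean['income'].astype('int')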
train_clean.head()
age | workclass | fnlwgt_log | education | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | capital_net_pos_key | capital_net_neg_key | country_bin | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.0 | Private | 12.034917 | level_4 | Married-civ-spouse | Sales | Husband | White | Male | 0.0 | 0.0 | 60.0 | United-States | False | False | income_04 | 1 |
1 | 17.0 | Private | 11.529055 | level_2 | Never-married | Machine-op-inspct | Own-child | White | Male | 0.0 | 0.0 | 20.0 | United-States | False | False | income_04 | 0 |
2 | 18.0 | Private | 12.775237 | level_5 | Never-married | Other-service | Own-child | White | Male | 0.0 | 0.0 | 16.0 | United-States | False | False | income_04 | 0 |
3 | 21.0 | Private | 11.926081 | level_5 | Never-married | Prof-specialty | Own-child | White | Female | 0.0 | 0.0 | 25.0 | United-States | False | False | income_04 | 0 |
4 | 24.0 | Private | 11.713693 | level_5 | Never-married | Adm-clerical | Not-in-family | Black | Female | 0.0 | 0.0 | 20.0 | ? | False | False | income_other | 0 |
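The casting and splitting steps above can be sketched on a toy frame. This is only an illustration under assumptions: the tiny train and test frames and their values are made up, and the idea that the label is float because the test rows carry no income value is my guess at why the combined frame shows 1.0 and 0.0.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's train and test frames.
train = pd.DataFrame({"age": [40, 17, 18], "hours_per_week": [60, 20, 16], "income": [1, 0, 0]})
test = pd.DataFrame({"age": [21, 24], "hours_per_week": [25, 20]})

# Stack train and test so the same preprocessing applies to both.
# The test rows get NaN for the missing 'income' column, which makes it float.
all_data = pd.concat([train, test], ignore_index=True)

# Cast the integer feature columns to float in one pass.
numeric_cols = ["age", "hours_per_week"]
all_data[numeric_cols] = all_data[numeric_cols].astype("float")

# Split back by position; .copy() gives independent frames to modify.
train_clean = all_data.iloc[:len(train)].copy()
test_clean = all_data.iloc[len(train):].copy()

# Restore the label to int on the training rows only.
train_clean["income"] = train_clean["income"].astype("int")

print(train_clean.dtypes)
```

Casting a list of columns at once keeps the cell short, and copying the slices avoids relying on suppressed pandas warnings.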
setup(data = train_clean, target = 'income', session_id=SEED, silent=True)
Setup Succesfully Completed!
Description | Value | |
---|---|---|
0 | session_id | 1234 |
1 | Target Type | Binary |
2 | Label Encoded | 0: 0, 1: 1 |
3 | Original Data | (26049, 17) |
4 | Missing Values | False |
5 | Numeric Features | 5 |
6 | Categorical Features | 11 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (26049, 17) |
11 | Transformed Train Set | (18234, 111) |
12 | Transformed Test Set | (7815, 111) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | False |
16 | Normalize Method | None |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | False |
29 | Multicollinearity Threshold | None |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
42 | Fix Imbalance | False |
43 | Fix Imbalance Method | SMOTE |
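Two numbers in the summary are worth a quick check: the 16 input features become 111 columns, and the 26049 rows are divided into 18234 and 7815. The sketch below assumes the growth in columns comes from one-hot encoding the categorical features and that the split uses a 70/30 ratio. It uses pandas get_dummies on a made-up frame as a stand-in, so PyCaret's own encoder may differ in details.

```python
import pandas as pd

# Hypothetical toy frame: one numeric and two categorical features.
df = pd.DataFrame({
    "age": [40.0, 17.0, 18.0, 21.0],
    "workclass": ["Private", "Private", "Local-gov", "?"],
    "sex": ["Male", "Male", "Female", "Female"],
})

# One new column per category level; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["workclass", "sex"])
print(encoded.columns.tolist())

# Row counts for a 70/30 split of the 26049 training rows.
n_rows = 26049
n_train = int(n_rows * 0.7)
n_test = n_rows - n_train
print(n_train, n_test)
```

The split sizes reproduce the Transformed Train Set and Transformed Test Set row counts in the table, which is consistent with a 70/30 hold-out.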
Because the return value of setup() is not assigned to a variable, the notebook cell also prints it as a tuple. In order, it contains the full transformed feature matrix ([26049 rows x 111 columns]), the target series (Name: income, Length: 26049, dtype: int64), the transformed train and test feature matrices ([18234 rows x 111 columns] and [7815 rows x 111 columns]), the matching target series (Length: 18234 and Length: 7815), the session seed (1234), the preprocessing Pipeline (a mean numeric imputer and a Dummify step, with most optional steps left as Empty()), and a log list that repeats the 'Classification Setup Config' table above together with the 'X_training Set', 'y_training Set', and 'X_test Set' entries.
3937 0.0 0.0 0.0 23595 0.0 0.0 0.0 25500 0.0 0.0 0.0 22934 0.0 0.0 0.0 18262 0.0 0.0 0.0 [7815 rows x 111 columns]), ('y_test Set', 21893 0 24714 0 20725 0 13981 1 25627 0 .. 3937 1 23595 0 25500 0 22934 0 18262 0 Name: income, Length: 7815, dtype: int64), ('Transformation Pipeline', Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=False, features_todrop=[], ml_usecase='classification', numerical_features=[], target='income', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', numeric_strategy='mean', target_variable=None)), ('new_levels1', New_Catagorical_Le... ('group', Empty()), ('nonliner', Empty()), ('scaling', Empty()), ('P_transform', Empty()), ('pt_target', Empty()), ('binn', Empty()), ('rem_outliers', Empty()), ('cluster_all', Empty()), ('dummy', Dummify(target='income')), ('fix_perfect', Empty()), ('clean_names', Clean_Colum_Names()), ('feature_select', Empty()), ('fix_multi', Empty()), ('dfs', Empty()), ('pca', Empty())], verbose=False))], False, -1, True, [], [], [], 'no_logging', False, False, '87a3', False, None, <_Logger logs (DEBUG)>, age workclass fnlwgt_log education marital_status \ 0 40.0 Private 12.034917 level_4 Married-civ-spouse 1 17.0 Private 11.529055 level_2 Never-married 2 18.0 Private 12.775237 level_5 Never-married 3 21.0 Private 11.926081 level_5 Never-married 4 24.0 Private 11.713693 level_5 Never-married ... ... ... ... ... ... 26044 57.0 Private 12.430020 level_3 Married-civ-spouse 26045 23.0 Private 12.380412 level_7 Never-married 26046 78.0 ? 12.017898 level_8 Widowed 26047 26.0 Self-emp-not-inc 11.929172 level_4 Never-married 26048 20.0 ? 11.511835 level_5 Never-married occupation relationship race sex capital_gain \ 0 Sales Husband White Male 0.0 1 Machine-op-inspct Own-child White Male 0.0 2 Other-service Own-child White Male 0.0 3 Prof-specialty Own-child White Female 0.0 4 Adm-clerical Not-in-family Black Female 0.0 ... ... ... ... ... ... 
26044 Other-service Husband White Male 0.0 26045 Prof-specialty Own-child White Male 0.0 26046 ? Not-in-family White Female 0.0 26047 Prof-specialty Own-child Black Female 0.0 26048 ? Own-child White Female 0.0 capital_loss hours_per_week native_country capital_net_pos_key \ 0 0.0 60.0 United-States False 1 0.0 20.0 United-States False 2 0.0 16.0 United-States False 3 0.0 25.0 United-States False 4 0.0 20.0 ? False ... ... ... ... ... 26044 0.0 52.0 United-States False 26045 0.0 40.0 United-States False 26046 0.0 15.0 United-States False 26047 0.0 40.0 United-States False 26048 0.0 30.0 United-States False capital_net_neg_key country_bin income 0 False income_04 1 1 False income_04 0 2 False income_04 0 3 False income_04 0 4 False income_other 0 ... ... ... ... 26044 False income_04 0 26045 False income_04 0 26046 False income_04 0 26047 False income_04 0 26048 False income_04 0 [26049 rows x 17 columns], 'income', False)
lgbm = create_model('lightgbm')
tuned_lgbm = tune_model(lgbm, optimize='F1')
Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
---|---|---|---|---|---|---|---|
0 | 0.8717 | 0.9263 | 0.6553 | 0.7790 | 0.7118 | 0.6301 | 0.6340 |
1 | 0.8668 | 0.9245 | 0.6584 | 0.7598 | 0.7055 | 0.6199 | 0.6226 |
2 | 0.8624 | 0.9205 | 0.6267 | 0.7631 | 0.6882 | 0.6010 | 0.6058 |
3 | 0.8777 | 0.9430 | 0.6516 | 0.8067 | 0.7209 | 0.6438 | 0.6498 |
4 | 0.8738 | 0.9296 | 0.6576 | 0.7859 | 0.7160 | 0.6358 | 0.6399 |
5 | 0.8700 | 0.9255 | 0.6576 | 0.7713 | 0.7099 | 0.6268 | 0.6301 |
6 | 0.8645 | 0.9224 | 0.6440 | 0.7594 | 0.6969 | 0.6104 | 0.6139 |
7 | 0.8722 | 0.9289 | 0.6689 | 0.7723 | 0.7169 | 0.6349 | 0.6376 |
8 | 0.8733 | 0.9334 | 0.7029 | 0.7561 | 0.7286 | 0.6461 | 0.6468 |
9 | 0.8848 | 0.9418 | 0.7143 | 0.7895 | 0.7500 | 0.6754 | 0.6768 |
Mean | 0.8717 | 0.9296 | 0.6637 | 0.7743 | 0.7145 | 0.6324 | 0.6357 |
SD | 0.0062 | 0.0073 | 0.0249 | 0.0153 | 0.0162 | 0.0196 | 0.0190 |
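As a quick sanity check on how to read the table above, the sketch below recomputes the Mean and SD rows for F1 from the ten fold values and rebuilds fold 0's F1 from its precision and recall. It uses only the standard library and the variable names are mine. That the SD row is a population standard deviation (`pstdev`, dividing by n) is something I inferred from the numbers, not from PyCaret's documentation.

```python
from statistics import mean, pstdev

# F1 per fold, copied from the tune_model table above
f1_folds = [0.7118, 0.7055, 0.6882, 0.7209, 0.7160,
            0.7099, 0.6969, 0.7169, 0.7286, 0.7500]

f1_mean = round(mean(f1_folds), 4)
f1_sd = round(pstdev(f1_folds), 4)  # population SD (divide by n), an inferred assumption

# F1 is the harmonic mean of precision and recall (fold 0 values from the table)
precision, recall = 0.7790, 0.6553
f1_fold0 = round(2 * precision * recall / (precision + recall), 4)

print(f1_mean, f1_sd, f1_fold0)
```

The three printed values line up with the Mean, SD and fold 0 entries of the F1 column (0.7145, 0.0162, 0.7118).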
calibrated_lgbm = calibrate_model(tuned_lgbm)
Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
---|---|---|---|---|---|---|---|
0 | 0.8723 | 0.9264 | 0.6485 | 0.7857 | 0.7106 | 0.6296 | 0.6343 |
1 | 0.8657 | 0.9230 | 0.6448 | 0.7641 | 0.6994 | 0.6137 | 0.6174 |
2 | 0.8569 | 0.9198 | 0.6109 | 0.7521 | 0.6742 | 0.5837 | 0.5889 |
3 | 0.8805 | 0.9452 | 0.6448 | 0.8237 | 0.7234 | 0.6486 | 0.6565 |
4 | 0.8788 | 0.9306 | 0.6644 | 0.8005 | 0.7261 | 0.6492 | 0.6538 |
5 | 0.8727 | 0.9265 | 0.6485 | 0.7879 | 0.7114 | 0.6308 | 0.6357 |
6 | 0.8694 | 0.9232 | 0.6417 | 0.7796 | 0.7040 | 0.6212 | 0.6261 |
7 | 0.8727 | 0.9298 | 0.6599 | 0.7802 | 0.7150 | 0.6338 | 0.6375 |
8 | 0.8793 | 0.9357 | 0.6984 | 0.7797 | 0.7368 | 0.6589 | 0.6605 |
9 | 0.8914 | 0.9440 | 0.7143 | 0.8140 | 0.7609 | 0.6910 | 0.6935 |
Mean | 0.8740 | 0.9304 | 0.6576 | 0.7867 | 0.7162 | 0.6360 | 0.6404 |
SD | 0.0089 | 0.0083 | 0.0280 | 0.0204 | 0.0220 | 0.0272 | 0.0267 |
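To see how much calibration actually moved the metrics, here is a small sketch that subtracts the Mean row of the tuned model from the Mean row of the calibrated model and compares each difference with the fold-to-fold SD. All numbers are copied from the two tables; the "within one SD" comparison is only my rough rule of thumb, not a formal significance test.

```python
metrics = ['Accuracy', 'AUC', 'Recall', 'Prec.', 'F1', 'Kappa', 'MCC']

# Mean rows of the tune_model and calibrate_model tables
tuned_mean = [0.8717, 0.9296, 0.6637, 0.7743, 0.7145, 0.6324, 0.6357]
calibrated_mean = [0.8740, 0.9304, 0.6576, 0.7867, 0.7162, 0.6360, 0.6404]
# SD row of the calibrate_model table
calibrated_sd = [0.0089, 0.0083, 0.0280, 0.0204, 0.0220, 0.0272, 0.0267]

# Change per metric after calibration (calibrated minus tuned)
delta = {m: round(c - t, 4)
         for m, t, c in zip(metrics, tuned_mean, calibrated_mean)}

# Rough check: is each change smaller than the spread across folds?
within_one_sd = {m: abs(delta[m]) < sd
                 for m, sd in zip(metrics, calibrated_sd)}

for m in metrics:
    print(f'{m:8s} {delta[m]:+.4f} within one SD: {within_one_sd[m]}')
```

Precision goes up while recall goes down slightly, and every change is smaller than the corresponding SD, so the two models look practically equivalent on these folds.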
interpret_model(tuned_lgbm, plot = 'reason', observation = 15)
plot_model(tuned_lgbm)
plot_model(tuned_lgbm, 'threshold')
plot_model(lgbm, 'confusion_matrix')
plot_model(lgbm, 'calibration')
tuned_lgbm_pred = predict_model(tuned_lgbm, data = test_clean)
tuned_lgbm_pred
index | age | workclass | fnlwgt_log | education | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | capital_net_pos_key | capital_net_neg_key | country_bin | income | Label | Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 28.0 | Private | 11.122265 | level_5 | Never-married | Adm-clerical | Other-relative | White | Female | 0.0 | 0.0 | 40.0 | United-States | False | False | income_04 | NaN | 0 | 0.0039 |
1 | 40.0 | Self-emp-inc | 10.541888 | level_4 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States | False | False | income_04 | NaN | 0 | 0.4239 |
2 | 20.0 | Private | 11.607799 | level_5 | Never-married | Handlers-cleaners | Own-child | White | Male | 0.0 | 0.0 | 25.0 | United-States | False | False | income_04 | NaN | 0 | 0.0004 |
3 | 40.0 | Private | 11.648653 | level_6 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States | False | False | income_04 | NaN | 1 | 0.8180 |
4 | 37.0 | Private | 10.844744 | level_9 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0.0 | 0.0 | 99.0 | France | False | False | income_08 | NaN | 1 | 0.5937 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6507 | 35.0 | Private | 11.024236 | level_7 | Married-civ-spouse | Sales | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | False | False | income_04 | NaN | 1 | 0.5984 |
6508 | 41.0 | Self-emp-inc | 10.379256 | level_7 | Married-civ-spouse | Tech-support | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | False | False | income_04 | NaN | 1 | 0.5455 |
6509 | 39.0 | Private | 12.921932 | level_1 | Married-civ-spouse | Other-service | Husband | White | Male | 0.0 | 0.0 | 40.0 | Mexico | False | False | income_02 | NaN | 0 | 0.0185 |
6510 | 35.0 | Private | 12.102610 | level_4 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | False | False | income_04 | NaN | 0 | 0.2055 |
6511 | 28.0 | Private | 11.962848 | level_4 | Divorced | Handlers-cleaners | Unmarried | White | Female | 0.0 | 0.0 | 36.0 | United-States | False | False | income_04 | NaN | 0 | 0.0077 |
6512 rows × 19 columns
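In the rows shown above, Label is 1 exactly when Score is at least 0.5, which suggests that Score is the predicted probability of class 1 (>50K) in this run. The sketch below checks that reading against the ten visible rows only. The helper `to_label` is hypothetical, and other PyCaret versions may define Score differently, so treat this as an observation about this output rather than a statement about the API.

```python
# (row index, Score, Label) for the ten rows visible in the table above
shown_rows = [
    (0, 0.0039, 0), (1, 0.4239, 0), (2, 0.0004, 0), (3, 0.8180, 1),
    (4, 0.5937, 1), (6507, 0.5984, 1), (6508, 0.5455, 1),
    (6509, 0.0185, 0), (6510, 0.2055, 0), (6511, 0.0077, 0),
]

THRESHOLD = 0.5

def to_label(score, threshold=THRESHOLD):
    """Hypothetical helper: 1 if the score reaches the threshold, else 0."""
    return int(score >= threshold)

# Rows where thresholding the Score does not reproduce the Label column
mismatches = [idx for idx, score, label in shown_rows
              if to_label(score) != label]
print(mismatches)
```

An empty list means a 0.5 cut-off on Score reproduces the Label column for every visible row.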
PyCaret (Ensemble)
# Keep the three best models by F1 (CatBoost, XGBoost, LightGBM in the table below)
compare_model = compare_models(sort = 'F1', n_select = 3)
index | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
---|---|---|---|---|---|---|---|---|---|
0 | CatBoost Classifier | 0.8761 | 0.9318 | 0.6583 | 0.7948 | 0.7199 | 0.6413 | 0.6462 | 10.5274 |
1 | Extreme Gradient Boosting | 0.8739 | 0.9292 | 0.6658 | 0.7816 | 0.7187 | 0.6381 | 0.6418 | 9.4363 |
2 | Light Gradient Boosting Machine | 0.8734 | 0.9298 | 0.6640 | 0.7804 | 0.7172 | 0.6363 | 0.6400 | 0.4357 |
3 | Ada Boost Classifier | 0.8695 | 0.9253 | 0.6397 | 0.7818 | 0.7034 | 0.6208 | 0.6262 | 1.3897 |
4 | Gradient Boosting Classifier | 0.8718 | 0.9278 | 0.6103 | 0.8138 | 0.6972 | 0.6181 | 0.6285 | 4.1646 |
5 | Naive Bayes | 0.8191 | 0.9007 | 0.7820 | 0.5967 | 0.6767 | 0.5543 | 0.5642 | 0.0331 |
6 | Extra Trees Classifier | 0.8490 | 0.8985 | 0.6374 | 0.7092 | 0.6711 | 0.5736 | 0.5751 | 1.3146 |
7 | Random Forest Classifier | 0.8548 | 0.8910 | 0.5964 | 0.7527 | 0.6651 | 0.5741 | 0.5807 | 0.2215 |
8 | Linear Discriminant Analysis | 0.8592 | 0.9130 | 0.5756 | 0.7857 | 0.6641 | 0.5778 | 0.5891 | 0.2736 |
9 | K Neighbors Classifier | 0.8402 | 0.8684 | 0.6202 | 0.6890 | 0.6524 | 0.5491 | 0.5506 | 0.5477 |
10 | Ridge Classifier | 0.8569 | 0.0000 | 0.5291 | 0.8148 | 0.6412 | 0.5570 | 0.5774 | 0.0510 |
11 | Logistic Regression | 0.8429 | 0.8885 | 0.5722 | 0.7212 | 0.6375 | 0.5390 | 0.5453 | 0.4018 |
12 | Decision Tree Classifier | 0.8208 | 0.7586 | 0.6381 | 0.6279 | 0.6329 | 0.5144 | 0.5145 | 0.2522 |
13 | SVM - Linear Kernel | 0.7638 | 0.0000 | 0.5630 | 0.5595 | 0.5019 | 0.3649 | 0.3886 | 0.3444 |
14 | Quadratic Discriminant Analysis | 0.7544 | 0.6260 | 0.3191 | 0.7977 | 0.3599 | 0.2539 | 0.3499 | 0.1040 |
# Soft-vote the three selected models, evaluated with 5-fold cross-validation
blended_model = blend_models(estimator_list = compare_model, fold = 5, method = 'soft')
Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
---|---|---|---|---|---|---|---|
0 | 0.8687 | 0.9258 | 0.6497 | 0.7712 | 0.7052 | 0.6215 | 0.6253 |
1 | 0.8676 | 0.9323 | 0.6285 | 0.7817 | 0.6968 | 0.6134 | 0.6193 |
2 | 0.8774 | 0.9320 | 0.6682 | 0.7930 | 0.7253 | 0.6471 | 0.6511 |
3 | 0.8700 | 0.9269 | 0.6433 | 0.7813 | 0.7056 | 0.6232 | 0.6280 |
4 | 0.8845 | 0.9387 | 0.7143 | 0.7885 | 0.7496 | 0.6748 | 0.6762 |
Mean | 0.8736 | 0.9311 | 0.6608 | 0.7831 | 0.7165 | 0.6360 | 0.6400 |
SD | 0.0064 | 0.0046 | 0.0296 | 0.0074 | 0.0190 | 0.0224 | 0.0210 |
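For intuition about what `method = 'soft'` does in the blend above, here is a minimal pure-Python sketch of soft voting, assuming equal weights and that each model outputs a class-1 probability per sample. The probabilities are made up for illustration and are not taken from the real CatBoost, XGBoost or LightGBM models; `soft_vote` and `hard_vote` are hypothetical helpers, not PyCaret functions.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average class-1 probabilities across models, then apply the threshold."""
    n_models = len(prob_lists)
    averaged = [sum(probs) / n_models for probs in zip(*prob_lists)]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels

def hard_vote(prob_lists, threshold=0.5):
    """Threshold each model first, then take the majority label."""
    n_models = len(prob_lists)
    votes = [sum(int(p >= threshold) for p in probs) for probs in zip(*prob_lists)]
    return [int(v * 2 > n_models) for v in votes]

# Made-up class-1 probabilities for three samples from three models
model_a = [0.62, 0.10, 0.45]
model_b = [0.55, 0.20, 0.48]
model_c = [0.48, 0.15, 0.95]

averaged, soft_labels = soft_vote([model_a, model_b, model_c])
hard_labels = hard_vote([model_a, model_b, model_c])
print([round(p, 4) for p in averaged], soft_labels, hard_labels)
```

The third sample shows the difference: two models sit just below 0.5 and one is very confident, so the averaged probability crosses the threshold while the majority vote does not.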
final_model = finalize_model(blended_model)
ensemble_prediction = predict_model(final_model, data = test_clean)
THRESHOLD = 0.5
# Score is treated as the predicted probability of class 1 (>50K); binarize it without mutating ensemble_prediction in place
ensemble_pred = (ensemble_prediction['Score'] >= THRESHOLD).astype(int)
# Compare the predicted class counts (left) with the training label counts (right)
plt.figure(figsize=(10, 6))
plt.subplot(121)
sns.countplot(x=ensemble_pred)
plt.subplot(122)
sns.countplot(x=train['income'])
plt.show()
Make Submission
submission = pd.read_csv(os.path.join(DIR, 'sample_submission.csv'))
submission.head()
index | id | prediction |
---|---|---|
0 | 0 | 0 |
1 | 1 | 0 |
2 | 2 | 0 |
3 | 3 | 0 |
4 | 4 | 0 |
submission['prediction'] = ensemble_pred
submission['prediction'] = submission['prediction'].astype('int')
submission['prediction'].value_counts()
0    5225
1    1287
Name: prediction, dtype: int64
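A quick way to read these counts is as a share of predicted positives. The sketch below only does the arithmetic on the two numbers printed above; the variable names are mine.

```python
# Counts copied from submission['prediction'].value_counts() above
counts = {0: 5225, 1: 1287}

total = sum(counts.values())          # should equal the number of test rows
positive_share = counts[1] / total    # fraction predicted as >50K

print(total, round(positive_share, 4))
```

The total matches the 6512 test rows, and a little under 20% of them are predicted as class 1.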
import datetime
timestring = datetime.datetime.now().strftime('%m-%d-%H-%M-%S')
filename = f'kakr-submission-{timestring}.csv'
filename
'kakr-submission-10-20-14-47-37.csv'
submission.to_csv(filename, index=False)
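The filename pattern above can be reproduced with a fixed timestamp, which makes it easier to see what each format code contributes. In this sketch only the month, day and time come from the output shown above; the year 2020 is my assumption, and it does not affect the result because the format string has no year code.

```python
import datetime

# Hypothetical fixed timestamp: the year is an assumption, the rest matches the output above
fixed_now = datetime.datetime(2020, 10, 20, 14, 47, 37)

# %m month, %d day, %H hour (24h), %M minute, %S second
timestring = fixed_now.strftime('%m-%d-%H-%M-%S')
filename = f'kakr-submission-{timestring}.csv'
print(filename)
```

Since the year is not part of the name, files written on the same date and time in different years would collide; adding `%Y` to the format is one way to avoid that.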