
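Last year I took part in the adult census income prediction competition hosted by T-Academy and KaKr, and shared an EDA notebook there.

KaKr (Kaggle Korea) is the largest Kaggle community in Korea and is said to be recognized for its influence worldwide.

There is a Facebook group, so if you are interested, join it and share Kaggle-related information.

Kaggle Korea Facebook group

I am dusting off the kernel I shared as a Kaggle notebook last year and posting it on the blog.

Competition details are available at [T-Academy X KaKr] 성인 인구조사 소득 예측 대회 (adult census income prediction competition). The dataset can be found under its Data tab.

The kernel I shared on Kaggle is available as 캐하~ EDA + LightGBM + PyCaret. You can use Copy and Edit to modify and run it right away.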

import numpy as np 
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/kakr-4th-competition/train.csv
/kaggle/input/kakr-4th-competition/test.csv
/kaggle/input/kakr-4th-competition/sample_submission.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
import os

warnings.filterwarnings('ignore')

SEED = 1234
FIG_SIZE = (10, 7)
DIR = '/kaggle/input/kakr-4th-competition'
train = pd.read_csv(os.path.join(DIR, 'train.csv'))
test = pd.read_csv(os.path.join(DIR, 'test.csv'))
  • id

  • age : age

  • workclass : type of employment

  • fnlwgt : weight indicating how representative a person is (short for final weight)

  • education : education level

  • education_num : education level as a number

  • marital_status : marital status

  • occupation : occupation

  • relationship : family relationship

  • race : race

  • sex : sex

  • capital_gain : capital gains

  • capital_loss : capital losses

  • hours_per_week : working hours per week

  • native_country : country of origin

  • income : income (the value to predict)

    • >50K : 1

    • <=50K : 0
print(train.shape, test.shape)
(26049, 16) (6512, 15)
train.head()
id age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country income
0 0 40 Private 168538 HS-grad 9 Married-civ-spouse Sales Husband White Male 0 0 60 United-States >50K
1 1 17 Private 101626 9th 5 Never-married Machine-op-inspct Own-child White Male 0 0 20 United-States <=50K
2 2 18 Private 353358 Some-college 10 Never-married Other-service Own-child White Male 0 0 16 United-States <=50K
3 3 21 Private 151158 Some-college 10 Never-married Prof-specialty Own-child White Female 0 0 25 United-States <=50K
4 4 24 Private 122234 Some-college 10 Never-married Adm-clerical Not-in-family Black Female 0 0 20 ? <=50K
test.head()
id age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
0 0 28 Private 67661 Some-college 10 Never-married Adm-clerical Other-relative White Female 0 0 40 United-States
1 1 40 Self-emp-inc 37869 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 50 United-States
2 2 20 Private 109952 Some-college 10 Never-married Handlers-cleaners Own-child White Male 0 0 25 United-States
3 3 40 Private 114537 Assoc-voc 11 Married-civ-spouse Exec-managerial Husband White Male 0 0 50 United-States
4 4 37 Private 51264 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 0 0 99 France

Missing values

No missing values (nice and clean!)

train.isnull().sum()
id                0
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
test.isnull().sum()
id                0
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
dtype: int64
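
isnull() only counts NaN, while the sample rows and value counts in this post show '?' entries in columns such as workclass, occupation, and native_country. The following is a minimal sketch of how such placeholder strings could be counted; the toy frame and the cat_cols list are illustrative assumptions, not the competition data.

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Sales"],
    "age": [40, 17, 18, 21],
})

# isnull() reports nothing because '?' is an ordinary string
nan_counts = toy.isnull().sum()

# Count the '?' placeholder in the categorical columns
cat_cols = ["workclass", "occupation"]
placeholder_counts = (toy[cat_cols] == "?").sum()
print(placeholder_counts)
```

Whether '?' should be kept as its own category (as this notebook does) or converted to NaN is a modeling choice.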

Check info() for each column

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26049 entries, 0 to 26048
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              26049 non-null  int64 
 1   age             26049 non-null  int64 
 2   workclass       26049 non-null  object
 3   fnlwgt          26049 non-null  int64 
 4   education       26049 non-null  object
 5   education_num   26049 non-null  int64 
 6   marital_status  26049 non-null  object
 7   occupation      26049 non-null  object
 8   relationship    26049 non-null  object
 9   race            26049 non-null  object
 10  sex             26049 non-null  object
 11  capital_gain    26049 non-null  int64 
 12  capital_loss    26049 non-null  int64 
 13  hours_per_week  26049 non-null  int64 
 14  native_country  26049 non-null  object
 15  income          26049 non-null  object
dtypes: int64(7), object(9)
memory usage: 3.2+ MB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6512 entries, 0 to 6511
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              6512 non-null   int64 
 1   age             6512 non-null   int64 
 2   workclass       6512 non-null   object
 3   fnlwgt          6512 non-null   int64 
 4   education       6512 non-null   object
 5   education_num   6512 non-null   int64 
 6   marital_status  6512 non-null   object
 7   occupation      6512 non-null   object
 8   relationship    6512 non-null   object
 9   race            6512 non-null   object
 10  sex             6512 non-null   object
 11  capital_gain    6512 non-null   int64 
 12  capital_loss    6512 non-null   int64 
 13  hours_per_week  6512 non-null   int64 
 14  native_country  6512 non-null   object
dtypes: int64(7), object(8)
memory usage: 763.2+ KB

Target conversion (income)

train['income'].value_counts()
<=50K    19744
>50K      6305
Name: income, dtype: int64
train['income'] = train['income'].apply(lambda x: 0 if x == '<=50K' else 1)
train['income'].value_counts()
0    19744
1     6305
Name: income, dtype: int64

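Combine train + test into all_data (to preprocess both at once)

Strictly speaking, the two sets should be processed separately.

The train / test distributions are worth inspecting separately because Kaggle competitions occasionally set a trap by planting values in test that never appear in train.

I had already checked the distributions separately beforehand, so for convenience I concatenate train + test and preprocess them together.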

all_data = pd.concat([train, test], sort=False)
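
One way to look for the "values in test that never appear in train" trap mentioned above is to compare the sets of category values per column. This is a minimal sketch on toy frames; unseen_categories is a hypothetical helper, not part of the original kernel.

```python
import pandas as pd

def unseen_categories(train_df, test_df, columns):
    """Return {column: sorted values present in test_df but not in train_df}."""
    report = {}
    for col in columns:
        unseen = set(test_df[col].unique()) - set(train_df[col].unique())
        if unseen:
            report[col] = sorted(unseen)
    return report

# Toy frames standing in for train / test (illustrative values only)
toy_train = pd.DataFrame({
    "native_country": ["United-States", "Mexico", "?"],
    "sex": ["Male", "Female", "Male"],
})
toy_test = pd.DataFrame({
    "native_country": ["United-States", "Holand-Netherlands"],
    "sex": ["Female", "Male"],
})

unseen = unseen_categories(toy_train, toy_test, ["native_country", "sex"])
print(unseen)
```

On the real data this would be called with train, test, and the list of object columns.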

workclass

all_data['workclass'].value_counts()
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64
all_data.groupby('workclass')['income'].mean().sort_values().plot(kind='bar', figsize=FIG_SIZE)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb79df07d90>

  • Confirm that income is 0 for every row in the Without-pay and Never-worked categories.

  • Merge the Without-pay and Never-worked categories into a single Other category.

workclass_other = ['Without-pay', 'Never-worked']
all_data['workclass'] = all_data['workclass'].apply(lambda x: 'Other' if x in workclass_other else x)
all_data['workclass'].value_counts()
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Other                  21
Name: workclass, dtype: int64
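
The same "fold tiny categories into Other" step can be driven by a count threshold instead of a hand-written list. This is a sketch under the assumption that a simple minimum row count is an acceptable rule; merge_rare and the min_count value are hypothetical and not from the original kernel.

```python
import pandas as pd

def merge_rare(series, min_count, other_label="Other"):
    """Replace categories that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep the original value unless it belongs to a rare category
    return series.where(~series.isin(rare), other_label)

# Toy column (illustrative values only)
toy_workclass = pd.Series(
    ["Private"] * 5 + ["Local-gov"] * 3 + ["Without-pay", "Never-worked"]
)
merged = merge_rare(toy_workclass, min_count=2)
print(merged.value_counts())
```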

age

age is a numeric column.

Let's look at the distribution of age by income.

df1 = all_data.loc[all_data['income'] == 0, 'age']
df2 = all_data.loc[all_data['income'] == 1, 'age']

plt.figure(figsize=FIG_SIZE)
sns.distplot(df1, kde=True, rug=True, hist=False, color='blue')
sns.distplot(df2, kde=True, rug=True, hist=False, color='red')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb79ddbcf50>

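fnlwgt: a weight indicating how representative a person is

The description only says it is a weight for a person's representativeness, which did not tell me much, and there is no further explanation of the data, so I looked at its distribution.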

df1 = all_data.loc[all_data['income'] == 0, 'fnlwgt']
df2 = all_data.loc[all_data['income'] == 1, 'fnlwgt']

plt.figure(figsize=FIG_SIZE)
sns.distplot(df1, kde=True, rug=True, hist=False, color='blue')
sns.distplot(df2, kde=True, rug=True, hist=False, color='red')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb79ac40ed0>

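Split by income group, the distribution of the fnlwgt column shows almost no difference.

I am wary of dropping the feature because of possible interactions with other columns, but if nothing distinctive turns up later, removing it is worth considering.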

g = sns.FacetGrid(all_data, col="income", height=5)
g.map(sns.distplot, 'fnlwgt')
<seaborn.axisgrid.FacetGrid at 0x7fb799e46790>

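Take the log. (The feature's variance is needlessly large; the log compresses the scale and brings the right-skewed distribution closer to normal, which I hope helps optimization.)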

all_data['fnlwgt_log'] = np.log(all_data['fnlwgt'])
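
The effect of the log on a right-skewed column can be checked numerically with the sample skewness. This is a small sketch on synthetic lognormal data, not on fnlwgt itself; the distribution parameters are arbitrary assumptions chosen only to produce a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a weight-like column
rng = np.random.default_rng(1234)
toy_weight = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000))

skew_raw = toy_weight.skew()          # clearly positive: long right tail
skew_log = np.log(toy_weight).skew()  # close to 0 after the log

print(f"skew before log: {skew_raw:.3f}")
print(f"skew after log:  {skew_log:.3f}")
```

Tree-based models such as LightGBM split on thresholds, so a monotonic transform like this is expected to matter less for them than for linear models.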

education / education_num

The education column and the education_num column produce identical value_counts().

So I will use only one of the two columns and drop the other.

all_data['education'].value_counts()
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64
all_data['education_num'].value_counts()
9     10501
10     7291
13     5355
14     1723
11     1382
7      1175
12     1067
6       933
4       646
15      576
5       514
8       433
16      413
3       333
2       168
1        51
Name: education_num, dtype: int64
  • Confirm that income is 0 for every Preschool row.
all_data.loc[all_data['education'] == 'Preschool', 'income'].sum()
0.0
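
Matching value_counts() is strong evidence that education and education_num carry the same information, but a one-to-one relationship can also be checked directly. This is a minimal sketch on a toy frame whose label/number pairs are taken from the sample rows shown earlier.

```python
import pandas as pd

# Toy frame (pairs follow the train.head() sample: HS-grad=9, 9th=5, Some-college=10)
toy = pd.DataFrame({
    "education": ["HS-grad", "9th", "Some-college", "Some-college", "HS-grad"],
    "education_num": [9, 5, 10, 10, 9],
})

# Each label should map to exactly one number, and each number to exactly one label
nums_per_label = toy.groupby("education")["education_num"].nunique()
labels_per_num = toy.groupby("education_num")["education"].nunique()
is_one_to_one = bool((nums_per_label == 1).all() and (labels_per_num == 1).all())
print(is_one_to_one)
```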

education and income appear to be fairly strongly related.

  • However, too many levels can lead to overfitting during training, so I will group the levels.
all_data.groupby(['education'])['income'].agg(['mean', 'count']).sort_values('mean')
mean count
education
Preschool 0.000000 40
1st-4th 0.037313 134
5th-6th 0.049057 265
9th 0.052632 418
7th-8th 0.057426 505
11th 0.059653 922
12th 0.072423 359
10th 0.072503 731
HS-grad 0.158544 8433
Some-college 0.192586 5800
Assoc-acdm 0.255344 842
Assoc-voc 0.255474 1096
Bachelors 0.415516 4344
Masters 0.561684 1378
Prof-school 0.733906 466
Doctorate 0.734177 316
education_map = {
    'Preschool': 'level_0', 
    '1st-4th': 'level_1', 
    '5th-6th': 'level_1', 
    '7th-8th': 'level_2', 
    '9th': 'level_2', 
    '10th': 'level_3', 
    '11th': 'level_3', 
    '12th': 'level_3', 
    'HS-grad': 'level_4', 
    'Some-college': 'level_5', 
    'Assoc-acdm': 'level_6', 
    'Assoc-voc': 'level_6', 
    'Bachelors': 'level_7', 
    'Masters': 'level_8', 
    'Prof-school': 'level_9', 
    'Doctorate': 'level_9',
}
all_data['education'] = all_data['education'].map(education_map)
all_data['education'].value_counts()
level_4    10501
level_5     7291
level_7     5355
level_3     2541
level_6     2449
level_8     1723
level_2     1160
level_9      989
level_1      501
level_0       51
Name: education, dtype: int64
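
Series.map returns NaN for any value missing from the dictionary, so a quick coverage check after mapping can catch a forgotten key. This is a sketch with a deliberately incomplete toy mapping; toy_map and toy_education are illustrative only.

```python
import pandas as pd

# Deliberately incomplete mapping: '7th-8th' is missing
toy_map = {"Preschool": "level_0", "1st-4th": "level_1", "5th-6th": "level_1"}
toy_education = pd.Series(["Preschool", "1st-4th", "5th-6th", "7th-8th"])

mapped = toy_education.map(toy_map)

# Original values that the mapping silently turned into NaN
unmapped = sorted(toy_education[mapped.isna()].unique())
print(unmapped)
```

In the notebook above, the value counts after mapping sum to the full row count, which suggests the real education_map covers every level.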

I grouped the levels as above; the grouping boundaries could be varied later.

level_1, level_2, and level_3 barely seem to differ.

all_data.pivot_table(index='education', values=['income']).sort_values('income').plot(kind='bar', figsize=FIG_SIZE)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb799c65ad0>

Mean income for Preschool = 0

Drop education_num, which is no longer used.

all_data = all_data.drop('education_num', axis=1)
all_data.columns
Index(['id', 'age', 'workclass', 'fnlwgt', 'education', 'marital_status',
       'occupation', 'relationship', 'race', 'sex', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income',
       'fnlwgt_log'],
      dtype='object')

marital_status

all_data['marital_status'].value_counts()
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital_status, dtype: int64
all_data.pivot_table(index='marital_status', values='income', aggfunc=['mean', 'count'])#.sort_values('income')
mean count
income income
marital_status
Divorced 0.104921 3536
Married-AF-spouse 0.526316 19
Married-civ-spouse 0.448789 11970
Married-spouse-absent 0.080838 334
Never-married 0.046802 8568
Separated 0.065375 826
Widowed 0.087940 796

The Married-AF-spouse category has very few rows.

I will fold it into the similar Married-civ-spouse group.

all_data.loc[all_data['marital_status'] == 'Married-AF-spouse', 'marital_status'] = 'Married-civ-spouse'

occupation

all_data['occupation'].value_counts()
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64
all_data.groupby('occupation')['income'].mean().sort_values().plot(kind='bar', figsize=FIG_SIZE)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb79997abd0>

  • Confirm that every occupation == 'Armed-Forces' row has income 0.

  • Armed-Forces also has very few rows, so to guard against overfitting, merge it with Priv-house-serv.

all_data.loc[all_data['occupation'].isin(['Armed-Forces', 'Priv-house-serv']), 'income'].value_counts()
0.0    129
1.0      1
Name: income, dtype: int64
all_data.loc[all_data['occupation'].isin(['Armed-Forces', 'Priv-house-serv']), 'occupation'] = 'Priv-house-serv'
all_data['occupation'].value_counts()
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       158
Name: occupation, dtype: int64

relationship

Nothing noteworthy turned up for the relationship column, so I leave it as is.

all_data['relationship'].value_counts()
Husband           13193
Not-in-family      8305
Own-child          5068
Unmarried          3446
Wife               1568
Other-relative      981
Name: relationship, dtype: int64
all_data.groupby('relationship')['income'].mean().plot(kind='bar', figsize=FIG_SIZE)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb799c62e90>

race

Nothing noteworthy turned up for the race column either, so I leave it as is.

all_data['race'].value_counts()
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

Check race against income

all_data.groupby('race')['income'].mean().plot(kind='bar', figsize=FIG_SIZE)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb79989e990>

sex

Nothing noteworthy turned up for the sex column either, so I leave it as is.

all_data['sex'].value_counts()
Male      21790
Female    10771
Name: sex, dtype: int64
all_data.groupby('sex')['income'].mean().plot(kind='bar', figsize=FIG_SIZE)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb7995e0450>

capital_gain

The approach here is to look at capital_gain and capital_loss together.

(Hypothesis) If capital_gain is large, is income likely to be high as well?

plt.figure(figsize=(12, 9))
sns.distplot(all_data.loc[all_data['capital_gain'] > 0, 'capital_gain'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fb79954a090>

I found something interesting.

When capital_gain > 50000, income is always 1.

g = sns.FacetGrid(all_data.loc[all_data['capital_gain']> 0], col="income", height=7, aspect=1.5)
g.map(sns.distplot, 'capital_gain')
<seaborn.axisgrid.FacetGrid at 0x7fb799915a90>

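capital_gain and capital_loss both look numerical, but the number of distinct values is small enough that they can also be treated as categorical.

So I check the distribution of values per income group with value_counts().

The values (keys) that occur in the income == 1 group and those that occur in the income == 0 group turn out to be sharply separated.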

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)

df1 = train.loc[(train['income'] == 0) & (train['capital_gain'] > 0), 'capital_gain'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])

df1 = train.loc[(train['income'] == 1) & (train['capital_gain'] > 0), 'capital_gain'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[1])

plt.tight_layout()
plt.show()

capital_loss

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)

df1 = train.loc[(train['income'] == 0) & (train['capital_loss'] > 0), 'capital_loss'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])

df1 = train.loc[(train['income'] == 1) & (train['capital_loss'] > 0), 'capital_loss'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[1])

plt.tight_layout()
plt.show()

capital net

Compute the net as capital_gain - capital_loss.

all_data['capital_net'] = all_data['capital_gain'] - all_data['capital_loss']
train['capital_net'] = train['capital_gain'] - train['capital_loss']
test['capital_net'] = test['capital_gain'] - test['capital_loss']
plt.figure(figsize=(16, 9))
plt.subplot(1, 2, 1)
sns.distplot(train.loc[ (train['capital_net'] > 0) & (train['income'] == 1), 'capital_net'])

plt.subplot(1, 2, 2)
sns.distplot(train.loc[ (train['capital_net'] > 0) & (train['income'] == 0), 'capital_net'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fb7987effd0>

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)

df1 = all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])

df2 = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index()
df2.plot(kind='bar', ax=axes[1])

plt.tight_layout()
plt.show()

Extract the capital_net key values that occur with income == 1 or income == 0

pos_key = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
all_key = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
all_key.extend(all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist())
all_key[:5]
[3103, 4386, 4687, 4787, 4934]

A few keys do appear in both groups.

df1 = all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'].isin(pos_key)), 'capital_net'].value_counts().sort_index()
df1.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb79815ec50>

pos_key = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
neg_key = all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()

I want to keep only the keys that do not overlap.

capital_net_pos_key = [key for key in pos_key if key not in neg_key]
capital_net_neg_key = [key for key in neg_key if key not in pos_key]
all_data['capital_net_pos_key'] = all_data['capital_net'].apply(lambda x: x in capital_net_pos_key)
all_data['capital_net_neg_key'] = all_data['capital_net'].apply(lambda x: x in capital_net_neg_key)
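
The non-overlapping keys are a set difference, which can be written with Python sets and Series.isin. This is a sketch on a toy frame that mirrors the logic above; the capital_net values are illustrative, not the real key lists.

```python
import pandas as pd

# Toy frame: 3103 occurs with both labels, the other positive values with only one
toy = pd.DataFrame({
    "capital_net": [3103, 3103, 4386, 7688, 2174, 2174, 0],
    "income":      [1,    0,    1,    1,    0,    0,    0],
})

positive = toy[toy["capital_net"] > 0]
pos_keys = set(positive.loc[positive["income"] == 1, "capital_net"].tolist())
neg_keys = set(positive.loc[positive["income"] == 0, "capital_net"].tolist())

only_pos = pos_keys - neg_keys  # seen only with income == 1
only_neg = neg_keys - pos_keys  # seen only with income == 0

toy["capital_net_pos_key"] = toy["capital_net"].isin(only_pos)
toy["capital_net_neg_key"] = toy["capital_net"].isin(only_neg)
print(toy)
```

Because these flags are built from the target itself, a validation score computed on rows that contributed to the key lists may look better than it would on unseen data, so they are worth checking with care.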

hours_per_week

Many people work 40 hours a week.

Those working 40 hours or more appear more often in the income == 1 group.

all_data['hours_per_week'].value_counts()
40    15217
50     2819
45     1824
60     1475
35     1297
      ...  
92        1
94        1
87        1
74        1
82        1
Name: hours_per_week, Length: 94, dtype: int64
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)

df1 = all_data.loc[(all_data['income'] == 0), 'hours_per_week'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])

df2 = all_data.loc[(all_data['income'] == 1), 'hours_per_week'].value_counts().sort_index()
df2.plot(kind='bar', ax=axes[1])

plt.tight_layout()
plt.show()

native_country

The country column was a bit of a headache.

It has many distinct values, and some of them have only a handful of rows.

I will merge them into groups.

train['native_country'].value_counts().shape, test['native_country'].value_counts().shape
((41,), (42,))
all_data['native_country'].value_counts()
United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                           29
Greece                           29
Ecuador                          28
Ireland                          24
Hong                             20
Cambodia                         19
Trinadad&Tobago                  19
Laos                             18
Thailand                         18
Yugoslavia                       16
Outlying-US(Guam-USVI-etc)       14
Honduras                         13
Hungary                          13
Scotland                         12
Holand-Netherlands                1
Name: native_country, dtype: int64

Later I also plan to try grouping countries by income level using the wiki page below.

List of countries by GNI (nominal) per capita (Wikipedia)

all_data.groupby('native_country')['income'].mean().reset_index()
native_country income
0 ? 0.234649
1 Cambodia 0.428571
2 Canada 0.315217
3 China 0.228070
4 Columbia 0.038462
5 Cuba 0.263158
6 Dominican-Republic 0.041667
7 Ecuador 0.166667
8 El-Salvador 0.088608
9 England 0.343284
10 France 0.416667
11 Germany 0.346535
12 Greece 0.250000
13 Guatemala 0.057692
14 Haiti 0.114286
15 Holand-Netherlands NaN
16 Honduras 0.000000
17 Hong 0.285714
18 Hungary 0.272727
19 India 0.402597
20 Iran 0.485714
21 Ireland 0.222222
22 Italy 0.380000
23 Jamaica 0.109375
24 Japan 0.404255
25 Laos 0.133333
26 Mexico 0.048689
27 Nicaragua 0.071429
28 Outlying-US(Guam-USVI-etc) 0.000000
29 Peru 0.076923
30 Philippines 0.300613
31 Poland 0.212766
32 Portugal 0.066667
33 Puerto-Rico 0.115789
34 Scotland 0.250000
35 South 0.222222
36 Taiwan 0.461538
37 Thailand 0.153846
38 Trinadad&Tobago 0.071429
39 United-States 0.247315
40 Vietnam 0.080000
41 Yugoslavia 0.416667
income_01 = ['Jamaica',
 'Haiti',
 'Puerto-Rico',
 'Laos',
 'Thailand',
 'Ecuador',]

income_02 = ['Outlying-US(Guam-USVI-etc)',
 'Honduras',
 'Columbia',
 'Dominican-Republic',
 'Mexico',
 'Guatemala',
 'Portugal',
 'Trinadad&Tobago',
 'Nicaragua',
 'Peru',
 'Vietnam',
 'El-Salvador',]

income_03 = ['Poland',
 'Ireland',
 'South',
 'China',]

income_04 = [
    'United-States',
]
income_05 = [
 'Greece',
 'Scotland',
 'Cuba',
 'Hungary',
 'Hong',
 'Holand-Netherlands',
]
income_06 = [
 'Philippines',
 'Canada',
]
income_07 = [
 'England',
 'Germany',
]

income_08 = [
 'Italy',
 'India',
 'Japan',
 'France',
 'Yugoslavia',
 'Cambodia',
]

income_09 = [
 'Taiwan',
 'Iran',
]

income_other=['?', ]
def convert_country(x):
    if x in income_01:
        return 'income_01'
    elif x in income_02:
        return 'income_02'
    elif x in income_03:
        return 'income_03'
    elif x in income_04:
        return 'income_04'
    elif x in income_05:
        return 'income_05'
    elif x in income_06:
        return 'income_06'
    elif x in income_07:
        return 'income_07'
    elif x in income_08:
        return 'income_08'
    elif x in income_09:
        return 'income_09'
    else:
        return 'income_other'
all_data['country_bin'] = all_data['native_country'].apply(convert_country)
all_data['country_bin'].value_counts()
income_04       29170
income_02        1157
income_other      583
income_06         319
income_01         303
income_08         299
income_03         239
income_07         227
income_05         170
income_09          94
Name: country_bin, dtype: int64
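
The if/elif chain can also be expressed as a single lookup dictionary built from the group lists, with anything unlisted falling back to income_other. This is a sketch using only a subset of the groups above; groups, country_to_bin, and the toy column are illustrative names.

```python
import pandas as pd

# Subset of the groups defined above, keyed by bin name
groups = {
    "income_04": ["United-States"],
    "income_06": ["Philippines", "Canada"],
    "income_09": ["Taiwan", "Iran"],
}

# Invert to a flat {country: bin} lookup
country_to_bin = {
    country: name for name, countries in groups.items() for country in countries
}

toy_country = pd.Series(["United-States", "Canada", "Iran", "?", "Atlantis"])

# Unlisted values (including '?') become NaN in map() and fall back to income_other
country_bin = toy_country.map(country_to_bin).fillna("income_other")
print(country_bin.tolist())
```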

Define Features

Let's pick the features worth using.

all_data.columns
Index(['id', 'age', 'workclass', 'fnlwgt', 'education', 'marital_status',
       'occupation', 'relationship', 'race', 'sex', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income',
       'fnlwgt_log', 'capital_net', 'capital_net_pos_key',
       'capital_net_neg_key', 'country_bin'],
      dtype='object')
features = [
#     'id', 
    'age', 
    'workclass', 
#     'fnlwgt', 
    'fnlwgt_log', 
    'education', 
    'marital_status',
    'occupation',
    'relationship', 
    'race',
    'sex',
    'capital_gain',
    'capital_loss', 
    'hours_per_week',
    'native_country',
#     'income',
#     'capital_net',  # removed: highly correlated with capital_gain
    'capital_net_pos_key',
    'capital_net_neg_key',
    'country_bin',
]
label = [
    'income'
]
all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32561 entries, 0 to 6511
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   32561 non-null  int64  
 1   age                  32561 non-null  int64  
 2   workclass            32561 non-null  object 
 3   fnlwgt               32561 non-null  int64  
 4   education            32561 non-null  object 
 5   marital_status       32561 non-null  object 
 6   occupation           32561 non-null  object 
 7   relationship         32561 non-null  object 
 8   race                 32561 non-null  object 
 9   sex                  32561 non-null  object 
 10  capital_gain         32561 non-null  int64  
 11  capital_loss         32561 non-null  int64  
 12  hours_per_week       32561 non-null  int64  
 13  native_country       32561 non-null  object 
 14  income               26049 non-null  float64
 15  fnlwgt_log           32561 non-null  float64
 16  capital_net          32561 non-null  int64  
 17  capital_net_pos_key  32561 non-null  bool   
 18  capital_net_neg_key  32561 non-null  bool   
 19  country_bin          32561 non-null  object 
dtypes: bool(2), float64(2), int64(7), object(9)
memory usage: 6.0+ MB
plt.figure(figsize=(12, 12))
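# Restrict to numeric/bool columns; pandas 2.x raises on string columns in corr()
sns.heatmap(abs(all_data.select_dtypes(include=['number', 'bool']).corr()), annot=True)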
<matplotlib.axes._subplots.AxesSubplot at 0x7fb7985fa590>

all_data_dummies = pd.get_dummies(all_data[features + label])
all_data_dummies.head()
age fnlwgt_log capital_gain capital_loss hours_per_week capital_net_pos_key capital_net_neg_key income workclass_? workclass_Federal-gov ... country_bin_income_01 country_bin_income_02 country_bin_income_03 country_bin_income_04 country_bin_income_05 country_bin_income_06 country_bin_income_07 country_bin_income_08 country_bin_income_09 country_bin_income_other
0 40 12.034917 0 0 60 False False 1.0 0 0 ... 0 0 0 1 0 0 0 0 0 0
1 17 11.529055 0 0 20 False False 0.0 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 18 12.775237 0 0 16 False False 0.0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 21 11.926081 0 0 25 False False 0.0 0 0 ... 0 0 0 1 0 0 0 0 0 0
4 24 11.713693 0 0 20 False False 0.0 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 111 columns

train_features = all_data_dummies.drop('income', axis=1).iloc[:len(train)]
test_features = all_data_dummies.drop('income', axis=1).iloc[len(train):]
train_label = train[label]
train_features.shape, test_features.shape
((26049, 110), (6512, 110))
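
Running pd.get_dummies on the concatenated frame is what gives train and test the same 110 columns. If the two sets were encoded separately, a category present in only one of them would produce mismatched columns. This is a minimal sketch of that situation and one way to realign, using toy frames that are illustrative and not the competition data.

```python
import pandas as pd

toy_train = pd.DataFrame({"workclass": ["Private", "State-gov"], "age": [40, 17]})
toy_test = pd.DataFrame({"workclass": ["Private", "Local-gov"], "age": [28, 40]})

# Encoded separately, the dummy columns differ between the two frames
train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)
columns_match_before = list(train_dummies.columns) == list(test_dummies.columns)

# Realign test to the train columns: missing columns are filled with 0,
# columns unknown to train are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(columns_match_before)
print(list(test_aligned.columns))
```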

Model (LightGBM)

from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import precision_score, recall_score, classification_report, f1_score, confusion_matrix
from sklearn.metrics import log_loss
from tqdm import tqdm_notebook
import lightgbm as lgbm
x_train, x_valid, y_train, y_valid = train_test_split(train_features, train_label, stratify=train_label, test_size=0.2, random_state=SEED)
NUM_BOOST_ROUND = 10000
N_SPLITS = 5

lgbm_param = {
    'objective': 'binary',
    'boosting_type':'gbdt',
    'colsample_bytree':1.0,
    'importance_type':'split',
    'learning_rate':0.1,
    'min_child_samples':20,
    'min_child_weight':0.001,
    'min_split_gain':0,
    'n_estimators':10000,
    'num_leaves':40,
    'random_state':SEED,
    'early_stopping_rounds': 200,
    'reg_alpha':0.6,
    'reg_lambda':0.5,
    'subsample':1.0,
    'subsample_for_bin':200000,
    'subsample_freq':0, 
    'n_jobs':-1, 
}
dtrain = lgbm.Dataset(x_train, y_train)
dvalid = lgbm.Dataset(x_valid, y_valid)
model = lgbm.train(lgbm_param, dtrain, NUM_BOOST_ROUND, 
                   valid_sets=(dtrain, dvalid), 
                   valid_names=('train', 'valid'), 
                   verbose_eval=100,
                  )
Training until validation scores don't improve for 200 rounds
[100]	train's binary_logloss: 0.224887	valid's binary_logloss: 0.284848
[200]	train's binary_logloss: 0.19576	valid's binary_logloss: 0.291253
Early stopping, best iteration is:
[65]	train's binary_logloss: 0.239304	valid's binary_logloss: 0.282915

Check the F1 score at a fixed threshold

threshold = 0.5
valid_prediction = model.predict(x_valid)
valid_prediction[valid_prediction > threshold] = 1
valid_prediction[valid_prediction <= threshold] = 0
print(classification_report(y_valid, valid_prediction))
              precision    recall  f1-score   support

           0       0.89      0.94      0.92      3949
           1       0.78      0.64      0.70      1261

    accuracy                           0.87      5210
   macro avg       0.83      0.79      0.81      5210
weighted avg       0.86      0.87      0.86      5210

Check how the F1 score changes with the threshold

f1_threshold = np.linspace(0.4, 0.6, 30)
f1_scores = []
max_score = 0
max_threshold = 0

for t in f1_threshold:
    valid_prediction = model.predict(x_valid)
    valid_prediction[valid_prediction > t] = 1
    valid_prediction[valid_prediction <= t] = 0
    score_ = f1_score(y_valid, valid_prediction)
    f1_scores.append(score_)
    if score_ > max_score:
        max_score = score_
        max_threshold = t
        
plt.figure(figsize=(16, 6))
plt.plot(f1_threshold, f1_scores)
plt.axvline(x=max_threshold, linestyle=':', color='r')
plt.xticks(f1_threshold, rotation=90)
plt.show()
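
The sweep above calls model.predict inside the loop and overwrites the probabilities in place. A variant that scores fixed probabilities against each threshold without mutating them could look like the sketch below. It uses plain NumPy and toy values; f1_at_threshold is a hypothetical helper, and proba stands in for a single model.predict(x_valid) call.

```python
import numpy as np

def f1_at_threshold(y_true, proba, threshold):
    """F1 for the positive class when predicting 1 where proba > threshold."""
    pred = proba > threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Toy labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
proba = np.array([0.10, 0.47, 0.62, 0.48, 0.57, 0.90, 0.30, 0.42])

thresholds = np.linspace(0.4, 0.6, 5)
scores = [f1_at_threshold(y_true, proba, t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]

print(best_threshold, max(scores))
```

A threshold picked this way is tuned to the validation split, so it may not carry over exactly to the test set.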

confusion_matrix

plt.figure(figsize=FIG_SIZE)
sns.heatmap(confusion_matrix(y_valid, valid_prediction), annot=True, fmt='g')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb798b4b750>

Prediction

pred = model.predict(test_features)

Check the distribution of pred values

plt.figure(figsize=FIG_SIZE)
sns.distplot(pred)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb798b578d0>

# Default threshold of 0.5
THRESHOLD = 0.5

print(len(pred[pred >= THRESHOLD]) / len(pred[pred < THRESHOLD]))
0.25062415978490493
pred[pred >= THRESHOLD] = 1
pred[pred < THRESHOLD] = 0
income_pct = train['income'].value_counts()[1] / train['income'].value_counts()[0]
income_pct
0.3193375202593193
plt.figure(figsize=(10, 6))
plt.subplot(121)
sns.countplot(x=pred)

plt.subplot(122)
sns.countplot(x=train['income'])
plt.show()
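
Both numbers printed above (0.2506 for the predictions and 0.3193 for train) are positive-to-negative ratios rather than the share of positives, despite the name income_pct. This short sketch makes the distinction explicit, using the train class counts shown earlier in this post.

```python
# Class counts from train['income'].value_counts() earlier in this post
n_neg, n_pos = 19744, 6305

pos_to_neg_ratio = n_pos / n_neg      # what income_pct computes
pos_share = n_pos / (n_pos + n_neg)   # fraction of rows with income == 1

print(f"ratio: {pos_to_neg_ratio:.4f}, share: {pos_share:.4f}")
```

Either quantity works for comparing predictions with train, as long as the same one is used on both sides.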

PyCaret

!pip install pycaret
Collecting pycaret
  Downloading pycaret-2.1.2-py3-none-any.whl (252 kB)
Requirement already satisfied: imbalanced-learn>=0.6.2 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.7.0)
Requirement already satisfied: joblib in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.14.1)
Requirement already satisfied: spacy in /opt/conda/lib/python3.7/site-packages (from pycaret) (2.3.2)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.7/site-packages (from pycaret) (3.2.1)
Requirement already satisfied: mlxtend in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.17.3)
Requirement already satisfied: xgboost>=0.90 in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.2.0)
Collecting datefinder>=0.7.0
  Downloading datefinder-0.7.1-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.18.5)
Requirement already satisfied: yellowbrick>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.1)
Requirement already satisfied: pyLDAvis in /opt/conda/lib/python3.7/site-packages (from pycaret) (2.1.2)
Requirement already satisfied: cufflinks>=0.17.0 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.17.3)
Collecting mlflow
  Downloading mlflow-1.11.0-py3-none-any.whl (13.9 MB)
Collecting pyod
  Downloading pyod-0.8.3.tar.gz (96 kB)
     |████████████████████████████████| 96 kB 3.1 MB/s 
[?25hRequirement already satisfied: textblob in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.15.3)
Requirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.1.3)
Requirement already satisfied: umap-learn in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.4.6)
Requirement already satisfied: kmodes>=0.10.1 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.10.2)
Requirement already satisfied: lightgbm>=2.3.1 in /opt/conda/lib/python3.7/site-packages (from pycaret) (2.3.1)
Requirement already satisfied: gensim in /opt/conda/lib/python3.7/site-packages (from pycaret) (3.8.3)
Requirement already satisfied: plotly>=4.4.1 in /opt/conda/lib/python3.7/site-packages (from pycaret) (4.11.0)
Requirement already satisfied: wordcloud in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.8.0)
Requirement already satisfied: catboost>=0.23.2 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.24.1)
Requirement already satisfied: seaborn in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.10.0)
Requirement already satisfied: scikit-learn>=0.23 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.23.2)
Collecting pandas-profiling>=2.8.0
  Downloading pandas_profiling-2.9.0-py2.py3-none-any.whl (258 kB)
     |████████████████████████████████| 258 kB 13.6 MB/s 
[?25hRequirement already satisfied: nltk in /opt/conda/lib/python3.7/site-packages (from pycaret) (3.2.4)
Requirement already satisfied: IPython in /opt/conda/lib/python3.7/site-packages (from pycaret) (7.13.0)
Requirement already satisfied: ipywidgets in /opt/conda/lib/python3.7/site-packages (from pycaret) (7.5.1)
Requirement already satisfied: scipy>=0.19.1 in /opt/conda/lib/python3.7/site-packages (from imbalanced-learn>=0.6.2->pycaret) (1.4.1)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (0.8.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (1.0.2)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (2.0.3)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (3.0.2)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (0.4.1)
Requirement already satisfied: thinc==7.4.1 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (7.4.1)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (2.23.0)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (1.0.0)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (1.1.3)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (4.45.0)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (46.1.3.post20200325)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (1.0.2)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret) (1.2.0)
Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret) (2.8.1)
Requirement already satisfied: pytz in /opt/conda/lib/python3.7/site-packages (from datefinder>=0.7.0->pycaret) (2019.3)
Requirement already satisfied: regex>=2017.02.08 in /opt/conda/lib/python3.7/site-packages (from datefinder>=0.7.0->pycaret) (2020.4.4)
Requirement already satisfied: wheel>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (0.34.2)
Requirement already satisfied: pytest in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (5.4.1)
Requirement already satisfied: funcy in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (1.15)
Requirement already satisfied: jinja2>=2.7.2 in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (2.11.2)
Requirement already satisfied: numexpr in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (2.7.1)
Requirement already satisfied: future in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (0.18.2)
Requirement already satisfied: colorlover>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from cufflinks>=0.17.0->pycaret) (0.3.0)
Requirement already satisfied: six>=1.9.0 in /opt/conda/lib/python3.7/site-packages (from cufflinks>=0.17.0->pycaret) (1.14.0)
Collecting databricks-cli>=0.8.7
  Downloading databricks-cli-0.12.2.tar.gz (55 kB)
     |████████████████████████████████| 55 kB 1.8 MB/s 
[?25hCollecting alembic<=1.4.1
  Downloading alembic-1.4.1.tar.gz (1.1 MB)
     |████████████████████████████████| 1.1 MB 14.2 MB/s 
[?25hRequirement already satisfied: cloudpickle in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (1.3.0)
Collecting sqlalchemy<=1.3.13
  Downloading SQLAlchemy-1.3.13.tar.gz (6.0 MB)
     |████████████████████████████████| 6.0 MB 15.5 MB/s 
[?25hRequirement already satisfied: pyyaml in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (5.3.1)
Requirement already satisfied: protobuf>=3.6.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (3.13.0)
Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (7.1.1)
Requirement already satisfied: docker>=4.0.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (4.2.0)
Collecting azure-storage-blob>=12.0
  Downloading azure_storage_blob-12.5.0-py2.py3-none-any.whl (326 kB)
     |████████████████████████████████| 326 kB 19.1 MB/s 
[?25hRequirement already satisfied: entrypoints in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (0.3)
Requirement already satisfied: gitpython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (3.1.1)
Collecting gunicorn; platform_system != "Windows"
  Downloading gunicorn-20.0.4-py2.py3-none-any.whl (77 kB)
     |████████████████████████████████| 77 kB 3.9 MB/s 
[?25hCollecting querystring-parser
  Downloading querystring_parser-1.2.4.tar.gz (5.5 kB)
Collecting gorilla
  Downloading gorilla-0.3.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: sqlparse in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (0.3.1)
Collecting prometheus-flask-exporter
  Downloading prometheus_flask_exporter-0.18.1.tar.gz (21 kB)
Requirement already satisfied: Flask in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (1.1.2)
Collecting combo
  Downloading combo-0.1.1.tar.gz (37 kB)
Requirement already satisfied: numba>=0.35 in /opt/conda/lib/python3.7/site-packages (from pyod->pycaret) (0.48.0)
Requirement already satisfied: statsmodels in /opt/conda/lib/python3.7/site-packages (from pyod->pycaret) (0.11.1)
Collecting suod
  Downloading suod-0.0.4.tar.gz (2.1 MB)
     |████████████████████████████████| 2.1 MB 19.0 MB/s 
[?25hRequirement already satisfied: smart-open>=1.8.1 in /opt/conda/lib/python3.7/site-packages (from gensim->pycaret) (2.2.1)
Requirement already satisfied: retrying>=1.3.3 in /opt/conda/lib/python3.7/site-packages (from plotly>=4.4.1->pycaret) (1.3.3)
Requirement already satisfied: pillow in /opt/conda/lib/python3.7/site-packages (from wordcloud->pycaret) (7.2.0)
Requirement already satisfied: graphviz in /opt/conda/lib/python3.7/site-packages (from catboost>=0.23.2->pycaret) (0.8.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.23->pycaret) (2.1.0)
Requirement already satisfied: missingno>=0.4.2 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (0.4.2)
Requirement already satisfied: confuse>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (1.1.0)
Requirement already satisfied: attrs>=19.3.0 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (19.3.0)
Requirement already satisfied: htmlmin>=0.1.12 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (0.1.12)
Collecting visions[type_image_path]==0.5.0
  Downloading visions-0.5.0-py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 2.2 MB/s 
[?25hCollecting tangled-up-in-unicode>=0.0.6
  Downloading tangled_up_in_unicode-0.0.6-py3-none-any.whl (3.1 MB)
     |████████████████████████████████| 3.1 MB 24.2 MB/s 
[?25hRequirement already satisfied: phik>=0.9.10 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (0.9.11)
Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (2.6.1)
Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (0.1.0)
Requirement already satisfied: pexpect; sys_platform != "win32" in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (4.8.0)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (3.0.5)
Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (0.15.2)
Requirement already satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (4.3.3)
Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (0.7.5)
Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (4.4.2)
Requirement already satisfied: widgetsnbextension~=3.5.0 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret) (3.5.1)
Requirement already satisfied: ipykernel>=4.5.1 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret) (5.1.1)
Requirement already satisfied: nbformat>=4.2.0 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret) (5.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (2020.6.20)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (2.9)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (3.0.4)
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /opt/conda/lib/python3.7/site-packages (from catalogue<1.1.0,>=0.0.7->spacy->pycaret) (2.0.0)
Requirement already satisfied: py>=1.5.0 in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (1.8.1)
Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (20.1)
Requirement already satisfied: more-itertools>=4.0.0 in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (8.2.0)
Requirement already satisfied: pluggy<1.0,>=0.12 in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (0.13.0)
Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (0.1.9)
Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from jinja2>=2.7.2->pyLDAvis->pycaret) (1.1.1)
Requirement already satisfied: tabulate>=0.7.7 in /opt/conda/lib/python3.7/site-packages (from databricks-cli>=0.8.7->mlflow->pycaret) (0.8.7)
Collecting tenacity>=6.2.0
  Downloading tenacity-6.2.0-py2.py3-none-any.whl (24 kB)
Requirement already satisfied: Mako in /opt/conda/lib/python3.7/site-packages (from alembic<=1.4.1->mlflow->pycaret) (1.1.3)
Requirement already satisfied: python-editor>=0.3 in /opt/conda/lib/python3.7/site-packages (from alembic<=1.4.1->mlflow->pycaret) (1.0.4)
Requirement already satisfied: websocket-client>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from docker>=4.0.0->mlflow->pycaret) (0.57.0)
Collecting azure-core<2.0.0,>=1.6.0
  Downloading azure_core-1.8.2-py2.py3-none-any.whl (122 kB)
     |████████████████████████████████| 122 kB 28.6 MB/s 
[?25hCollecting msrest>=0.6.10
  Downloading msrest-0.6.19-py2.py3-none-any.whl (84 kB)
     |████████████████████████████████| 84 kB 1.9 MB/s 
[?25hRequirement already satisfied: cryptography>=2.1.4 in /opt/conda/lib/python3.7/site-packages (from azure-storage-blob>=12.0->mlflow->pycaret) (2.8)
Requirement already satisfied: gitdb<5,>=4.0.1 in /opt/conda/lib/python3.7/site-packages (from gitpython>=2.1.0->mlflow->pycaret) (4.0.4)
Requirement already satisfied: prometheus_client in /opt/conda/lib/python3.7/site-packages (from prometheus-flask-exporter->mlflow->pycaret) (0.7.1)
Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask->mlflow->pycaret) (1.0.1)
Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask->mlflow->pycaret) (1.1.0)
Requirement already satisfied: llvmlite<0.32.0,>=0.31.0dev0 in /opt/conda/lib/python3.7/site-packages (from numba>=0.35->pyod->pycaret) (0.31.0)
Requirement already satisfied: patsy>=0.5 in /opt/conda/lib/python3.7/site-packages (from statsmodels->pyod->pycaret) (0.5.1)
Requirement already satisfied: boto3 in /opt/conda/lib/python3.7/site-packages (from smart-open>=1.8.1->gensim->pycaret) (1.15.13)
Requirement already satisfied: networkx>=2.4 in /opt/conda/lib/python3.7/site-packages (from visions[type_image_path]==0.5.0->pandas-profiling>=2.8.0->pycaret) (2.4)
Requirement already satisfied: imagehash; extra == "type_image_path" in /opt/conda/lib/python3.7/site-packages (from visions[type_image_path]==0.5.0->pandas-profiling>=2.8.0->pycaret) (4.1.0)
Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect; sys_platform != "win32"->IPython->pycaret) (0.6.0)
Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->IPython->pycaret) (0.5.2)
Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from traitlets>=4.2->IPython->pycaret) (0.2.0)
Requirement already satisfied: notebook>=4.4.1 in /opt/conda/lib/python3.7/site-packages (from widgetsnbextension~=3.5.0->ipywidgets->pycaret) (5.5.0)
Requirement already satisfied: tornado>=4.2 in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (5.0.2)
Requirement already satisfied: jupyter-client in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (6.1.3)
Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (4.6.3)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /opt/conda/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (3.2.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy->pycaret) (3.1.0)
Collecting isodate>=0.6.0
  Downloading isodate-0.6.0-py2.py3-none-any.whl (45 kB)
     |████████████████████████████████| 45 kB 1.4 MB/s 
[?25hRequirement already satisfied: requests-oauthlib>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from msrest>=0.6.10->azure-storage-blob>=12.0->mlflow->pycaret) (1.2.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.1.4->azure-storage-blob>=12.0->mlflow->pycaret) (1.14.0)
Requirement already satisfied: smmap<4,>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from gitdb<5,>=4.0.1->gitpython>=2.1.0->mlflow->pycaret) (3.0.2)
Requirement already satisfied: botocore<1.19.0,>=1.18.13 in /opt/conda/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim->pycaret) (1.18.13)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim->pycaret) (0.10.0)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim->pycaret) (0.3.3)
Requirement already satisfied: PyWavelets in /opt/conda/lib/python3.7/site-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.5.0->pandas-profiling>=2.8.0->pycaret) (1.1.1)
Requirement already satisfied: nbconvert in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (5.6.1)
Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (19.0.0)
Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (1.5.0)
Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.8.3)
Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->pycaret) (0.16.0)
Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib>=0.5.0->msrest>=0.6.10->azure-storage-blob>=12.0->mlflow->pycaret) (3.0.1)
Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.1.4->azure-storage-blob>=12.0->mlflow->pycaret) (2.20)
Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (1.4.2)
Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (3.1.4)
Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.8.4)
Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.4.4)
Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.6.0)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.5.1)
Building wheels for collected packages: pyod, databricks-cli, alembic, sqlalchemy, querystring-parser, prometheus-flask-exporter, combo, suod
  Building wheel for pyod (setup.py) ... [?25l- \ | done
[?25h  Created wheel for pyod: filename=pyod-0.8.3-py3-none-any.whl size=110347 sha256=6858fa6eda242cf3101a17d6095f16ac26c9dc1b497ae42b4ee3cfeb5156be5d
  Stored in directory: /root/.cache/pip/wheels/fc/fc/77/6e530134c9ee2b45ef0840f0c8046b3be595624881cf533d7a
  Building wheel for databricks-cli (setup.py) ... [?25l- \ | done
[?25h  Created wheel for databricks-cli: filename=databricks_cli-0.12.2-py3-none-any.whl size=101163 sha256=be3329799d7581f8e81992ec5ca7ab24167fb3b92d0187c4a62c8e049195a955
  Stored in directory: /root/.cache/pip/wheels/9e/bb/9d/78e02afa234019a22759d08d285bae87a88fa881f5db58db25
  Building wheel for alembic (setup.py) ... [?25l- \ | done
[?25h  Created wheel for alembic: filename=alembic-1.4.1-py2.py3-none-any.whl size=158154 sha256=3a4b7a763a6ce226a933b9aa155d11719a424ddff92458f00091c0e7c3bd50cf
  Stored in directory: /root/.cache/pip/wheels/be/5d/0a/9e13f53f4f5dfb67cd8d245bb7cdffe12f135846f491a283e3
  Building wheel for sqlalchemy (setup.py) ... [?25l- \ | / - \ | / - \ done
[?25h  Created wheel for sqlalchemy: filename=SQLAlchemy-1.3.13-cp37-cp37m-linux_x86_64.whl size=1221862 sha256=8a33081e209764349239860912cc6cde907f8613081f067c95fecca823653bd3
  Stored in directory: /root/.cache/pip/wheels/b9/ba/77/163f10f14bd489351530603e750c195b0ceceed2f3be2b32f1
  Building wheel for querystring-parser (setup.py) ... [?25l- \ done
[?25h  Created wheel for querystring-parser: filename=querystring_parser-1.2.4-py3-none-any.whl size=7076 sha256=eed4ac8c5058079d17a797b70ae3ed32cbea0c95ed24a361365cd86217a49dca
  Stored in directory: /root/.cache/pip/wheels/69/38/7a/072b5863ca334d012821a287fd1d066cea33abdcda3ef2f878
  Building wheel for prometheus-flask-exporter (setup.py) ... [?25l- \ done
[?25h  Created wheel for prometheus-flask-exporter: filename=prometheus_flask_exporter-0.18.1-py3-none-any.whl size=17157 sha256=66518d2e9e9f0b4e8e78fa57037102604ebeae2bff773b5d1fcc7350404f267d
  Stored in directory: /root/.cache/pip/wheels/c4/b6/b5/e76659f3b2a3a226565e27f0a7eb7a3ac93c3f4d68acfbe617
  Building wheel for combo (setup.py) ... [?25l- \ done
[?25h  Created wheel for combo: filename=combo-0.1.1-py3-none-any.whl size=42113 sha256=ab8b32daeae645fb4bc1c7d5de5441eeaad8eefa09bdaf459e8168e23e25d8b5
  Stored in directory: /root/.cache/pip/wheels/3e/e1/f8/08f19ba48f75d3dbbb549cec4b86cc0392c14b2b6bb81f4e1f
  Building wheel for suod (setup.py) ... [?25l- \ | / done
[?25h  Created wheel for suod: filename=suod-0.0.4-py3-none-any.whl size=2167157 sha256=cc1b8461f955dadb8d45e095b62560588668395536e17bb7953d4c329d640dff
  Stored in directory: /root/.cache/pip/wheels/dc/ae/aa/3b8cc857617f3ba6cb9e6b804c79c69d0ed60a08e022e9a4f3
Successfully built pyod databricks-cli alembic sqlalchemy querystring-parser prometheus-flask-exporter combo suod
Installing collected packages: datefinder, tenacity, databricks-cli, sqlalchemy, alembic, azure-core, isodate, msrest, azure-storage-blob, gunicorn, querystring-parser, gorilla, prometheus-flask-exporter, mlflow, combo, suod, pyod, tangled-up-in-unicode, visions, pandas-profiling, pycaret
  Attempting uninstall: tenacity
    Found existing installation: tenacity 6.1.0
    Uninstalling tenacity-6.1.0:
      Successfully uninstalled tenacity-6.1.0
  Attempting uninstall: sqlalchemy
    Found existing installation: SQLAlchemy 1.3.16
    Uninstalling SQLAlchemy-1.3.16:
      Successfully uninstalled SQLAlchemy-1.3.16
  Attempting uninstall: alembic
    Found existing installation: alembic 1.4.3
    Uninstalling alembic-1.4.3:
      Successfully uninstalled alembic-1.4.3
  Attempting uninstall: tangled-up-in-unicode
    Found existing installation: tangled-up-in-unicode 0.0.4
    Uninstalling tangled-up-in-unicode-0.0.4:
      Successfully uninstalled tangled-up-in-unicode-0.0.4
  Attempting uninstall: visions
    Found existing installation: visions 0.4.1
    Uninstalling visions-0.4.1:
      Successfully uninstalled visions-0.4.1
  Attempting uninstall: pandas-profiling
    Found existing installation: pandas-profiling 2.6.0
    Uninstalling pandas-profiling-2.6.0:
      Successfully uninstalled pandas-profiling-2.6.0
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

pandas-profiling 2.9.0 requires seaborn>=0.10.1, but you'll have seaborn 0.10.0 which is incompatible.
Successfully installed alembic-1.4.1 azure-core-1.8.2 azure-storage-blob-12.5.0 combo-0.1.1 databricks-cli-0.12.2 datefinder-0.7.1 gorilla-0.3.0 gunicorn-20.0.4 isodate-0.6.0 mlflow-1.11.0 msrest-0.6.19 pandas-profiling-2.9.0 prometheus-flask-exporter-0.18.1 pycaret-2.1.2 pyod-0.8.3 querystring-parser-1.2.4 sqlalchemy-1.3.13 suod-0.0.4 tangled-up-in-unicode-0.0.6 tenacity-6.2.0 visions-0.5.0
from pycaret.classification import *

Check the features & label defined above

features, label
(['age',
  'workclass',
  'fnlwgt_log',
  'education',
  'marital_status',
  'occupation',
  'relationship',
  'race',
  'sex',
  'capital_gain',
  'capital_loss',
  'hours_per_week',
  'native_country',
  'capital_net_pos_key',
  'capital_net_neg_key',
  'country_bin'],
 ['income'])
# .copy() so the casts below modify an independent frame, not a view of all_data
all_data_caret = all_data[features + label].copy()
all_data_caret.head()
age workclass fnlwgt_log education marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country capital_net_pos_key capital_net_neg_key country_bin income
0 40 Private 12.034917 level_4 Married-civ-spouse Sales Husband White Male 0 0 60 United-States False False income_04 1.0
1 17 Private 11.529055 level_2 Never-married Machine-op-inspct Own-child White Male 0 0 20 United-States False False income_04 0.0
2 18 Private 12.775237 level_5 Never-married Other-service Own-child White Male 0 0 16 United-States False False income_04 0.0
3 21 Private 11.926081 level_5 Never-married Prof-specialty Own-child White Female 0 0 25 United-States False False income_04 0.0
4 24 Private 11.713693 level_5 Never-married Adm-clerical Not-in-family Black Female 0 0 20 ? False False income_other 0.0

Without explicit type casting, PyCaret did not set up the column types properly, so the numeric columns are cast to float first.

all_data_caret['age'] = all_data_caret['age'].astype('float')
# all_data_caret['capital_net'] = all_data_caret['capital_net'].astype('float')
all_data_caret['hours_per_week'] = all_data_caret['hours_per_week'].astype('float')
all_data_caret['capital_gain'] = all_data_caret['capital_gain'].astype('float')
all_data_caret['capital_loss'] = all_data_caret['capital_loss'].astype('float')
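# Split back into train/test rows; copy so the income cast does not write into a slice
train_clean = all_data_caret[:len(train)].copy()
test_clean = all_data_caret[len(train):].copy()
train_clean['income'] = train_clean['income'].astype('int')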
train_clean.head()
age workclass fnlwgt_log education marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country capital_net_pos_key capital_net_neg_key country_bin income
0 40.0 Private 12.034917 level_4 Married-civ-spouse Sales Husband White Male 0.0 0.0 60.0 United-States False False income_04 1
1 17.0 Private 11.529055 level_2 Never-married Machine-op-inspct Own-child White Male 0.0 0.0 20.0 United-States False False income_04 0
2 18.0 Private 12.775237 level_5 Never-married Other-service Own-child White Male 0.0 0.0 16.0 United-States False False income_04 0
3 21.0 Private 11.926081 level_5 Never-married Prof-specialty Own-child White Female 0.0 0.0 25.0 United-States False False income_04 0
4 24.0 Private 11.713693 level_5 Never-married Adm-clerical Not-in-family Black Female 0.0 0.0 20.0 ? False False income_other 0
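
The casts above can also be written in a single call by passing a column-to-dtype mapping to `astype`. This is a minimal sketch on a toy frame. The column names follow the kernel, but the values, `demo`, `demo_cast`, and `numeric_cols` are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the kernel
demo = pd.DataFrame({
    "age": [40, 17],
    "hours_per_week": [60, 20],
    "capital_gain": [0, 0],
    "capital_loss": [0, 0],
    "income": [1.0, 0.0],
})

numeric_cols = ["age", "hours_per_week", "capital_gain", "capital_loss"]

# Work on an independent copy, then cast all numeric columns at once
demo_cast = demo.copy()
demo_cast = demo_cast.astype({col: "float64" for col in numeric_cols})

# The label is cast separately to an integer type
demo_cast["income"] = demo_cast["income"].astype("int64")

print(demo_cast.dtypes)
```

The original `demo` frame keeps its integer columns, since `astype` returns a new frame rather than casting in place.
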
setup(data = train_clean, target = 'income', session_id=SEED, silent=True)
Setup Succesfully Completed!
Description Value
0 session_id 1234
1 Target Type Binary
2 Label Encoded 0: 0, 1: 1
3 Original Data (26049, 17)
4 Missing Values False
5 Numeric Features 5
6 Categorical Features 11
7 Ordinal Features False
8 High Cardinality Features False
9 High Cardinality Method None
10 Sampled Data (26049, 17)
11 Transformed Train Set (18234, 111)
12 Transformed Test Set (7815, 111)
13 Numeric Imputer mean
14 Categorical Imputer constant
15 Normalize False
16 Normalize Method None
17 Transformation False
18 Transformation Method None
19 PCA False
20 PCA Method None
21 PCA Components None
22 Ignore Low Variance False
23 Combine Rare Levels False
24 Rare Level Threshold None
25 Numeric Binning False
26 Remove Outliers False
27 Outliers Threshold None
28 Remove Multicollinearity False
29 Multicollinearity Threshold None
30 Clustering False
31 Clustering Iteration None
32 Polynomial Features False
33 Polynomial Degree None
34 Trignometry Features False
35 Polynomial Threshold None
36 Group Features False
37 Feature Selection False
38 Features Selection Threshold None
39 Feature Interaction False
40 Feature Ratio False
41 Interaction Threshold None
42 Fix Imbalance False
43 Fix Imbalance Method SMOTE
(        age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
 0      40.0   12.034917           0.0           0.0            60.0   
 1      17.0   11.529055           0.0           0.0            20.0   
 2      18.0   12.775237           0.0           0.0            16.0   
 3      21.0   11.926081           0.0           0.0            25.0   
 4      24.0   11.713693           0.0           0.0            20.0   
 ...     ...         ...           ...           ...             ...   
 26044  57.0   12.430020           0.0           0.0            52.0   
 26045  23.0   12.380412           0.0           0.0            40.0   
 26046  78.0   12.017898           0.0           0.0            15.0   
 26047  26.0   11.929172           0.0           0.0            40.0   
 26048  20.0   11.511835           0.0           0.0            30.0   
 
        workclass_?  workclass_Federal-gov  workclass_Local-gov  \
 0              0.0                    0.0                  0.0   
 1              0.0                    0.0                  0.0   
 2              0.0                    0.0                  0.0   
 3              0.0                    0.0                  0.0   
 4              0.0                    0.0                  0.0   
 ...            ...                    ...                  ...   
 26044          0.0                    0.0                  0.0   
 26045          0.0                    0.0                  0.0   
 26046          1.0                    0.0                  0.0   
 26047          0.0                    0.0                  0.0   
 26048          1.0                    0.0                  0.0   
 
        workclass_Other  workclass_Private  ...  country_bin_income_01  \
 0                  0.0                1.0  ...                    0.0   
 1                  0.0                1.0  ...                    0.0   
 2                  0.0                1.0  ...                    0.0   
 3                  0.0                1.0  ...                    0.0   
 4                  0.0                1.0  ...                    0.0   
 ...                ...                ...  ...                    ...   
 26044              0.0                1.0  ...                    0.0   
 26045              0.0                1.0  ...                    0.0   
 26046              0.0                0.0  ...                    0.0   
 26047              0.0                0.0  ...                    0.0   
 26048              0.0                0.0  ...                    0.0   
 
        country_bin_income_02  country_bin_income_03  country_bin_income_04  \
 0                        0.0                    0.0                    1.0   
 1                        0.0                    0.0                    1.0   
 2                        0.0                    0.0                    1.0   
 3                        0.0                    0.0                    1.0   
 4                        0.0                    0.0                    0.0   
 ...                      ...                    ...                    ...   
 26044                    0.0                    0.0                    1.0   
 26045                    0.0                    0.0                    1.0   
 26046                    0.0                    0.0                    1.0   
 26047                    0.0                    0.0                    1.0   
 26048                    0.0                    0.0                    1.0   
 
        country_bin_income_05  country_bin_income_06  country_bin_income_07  \
 0                        0.0                    0.0                    0.0   
 1                        0.0                    0.0                    0.0   
 2                        0.0                    0.0                    0.0   
 3                        0.0                    0.0                    0.0   
 4                        0.0                    0.0                    0.0   
 ...                      ...                    ...                    ...   
 26044                    0.0                    0.0                    0.0   
 26045                    0.0                    0.0                    0.0   
 26046                    0.0                    0.0                    0.0   
 26047                    0.0                    0.0                    0.0   
 26048                    0.0                    0.0                    0.0   
 
        country_bin_income_08  country_bin_income_09  country_bin_income_other  
 0                        0.0                    0.0                       0.0  
 1                        0.0                    0.0                       0.0  
 2                        0.0                    0.0                       0.0  
 3                        0.0                    0.0                       0.0  
 4                        0.0                    0.0                       1.0  
 ...                      ...                    ...                       ...  
 26044                    0.0                    0.0                       0.0  
 26045                    0.0                    0.0                       0.0  
 26046                    0.0                    0.0                       0.0  
 26047                    0.0                    0.0                       0.0  
 26048                    0.0                    0.0                       0.0  
 
 [26049 rows x 111 columns],
 0        1
 1        0
 2        0
 3        0
 4        0
         ..
 26044    0
 26045    0
 26046    0
 26047    0
 26048    0
 Name: income, Length: 26049, dtype: int64,
         age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
 14079  56.0   10.366655           0.0           0.0            40.0   
 2026   69.0   12.086016        1848.0           0.0            12.0   
 10955  36.0   12.419400        5178.0           0.0            60.0   
 1385   52.0   11.593906           0.0        1902.0            50.0   
 7067   32.0   12.870491           0.0           0.0            16.0   
 ...     ...         ...           ...           ...             ...   
 25430  29.0   11.693980           0.0           0.0            40.0   
 14899  39.0   11.546902           0.0           0.0            45.0   
 9236   30.0   12.591117           0.0           0.0            50.0   
 23705  59.0   12.834812           0.0           0.0            41.0   
 18592  41.0   12.687850           0.0           0.0            55.0   
 
        workclass_?  workclass_Federal-gov  workclass_Local-gov  \
 14079          0.0                    0.0                  0.0   
 2026           0.0                    0.0                  0.0   
 10955          0.0                    0.0                  0.0   
 1385           0.0                    0.0                  0.0   
 7067           0.0                    0.0                  0.0   
 ...            ...                    ...                  ...   
 25430          0.0                    1.0                  0.0   
 14899          0.0                    0.0                  0.0   
 9236           0.0                    0.0                  0.0   
 23705          1.0                    0.0                  0.0   
 18592          0.0                    0.0                  0.0   
 
        workclass_Other  workclass_Private  ...  country_bin_income_01  \
 14079              0.0                1.0  ...                    0.0   
 2026               0.0                1.0  ...                    0.0   
 10955              0.0                1.0  ...                    0.0   
 1385               0.0                1.0  ...                    0.0   
 7067               0.0                1.0  ...                    0.0   
 ...                ...                ...  ...                    ...   
 25430              0.0                0.0  ...                    0.0   
 14899              0.0                1.0  ...                    0.0   
 9236               0.0                1.0  ...                    0.0   
 23705              0.0                0.0  ...                    0.0   
 18592              0.0                1.0  ...                    0.0   
 
        country_bin_income_02  country_bin_income_03  country_bin_income_04  \
 14079                    0.0                    0.0                    1.0   
 2026                     0.0                    0.0                    1.0   
 10955                    0.0                    0.0                    0.0   
 1385                     0.0                    0.0                    0.0   
 7067                     0.0                    0.0                    1.0   
 ...                      ...                    ...                    ...   
 25430                    0.0                    0.0                    1.0   
 14899                    0.0                    0.0                    1.0   
 9236                     0.0                    0.0                    0.0   
 23705                    0.0                    0.0                    1.0   
 18592                    0.0                    0.0                    1.0   
 
        country_bin_income_05  country_bin_income_06  country_bin_income_07  \
 14079                    0.0                    0.0                    0.0   
 2026                     0.0                    0.0                    0.0   
 10955                    0.0                    0.0                    0.0   
 1385                     1.0                    0.0                    0.0   
 7067                     0.0                    0.0                    0.0   
 ...                      ...                    ...                    ...   
 25430                    0.0                    0.0                    0.0   
 14899                    0.0                    0.0                    0.0   
 9236                     0.0                    0.0                    0.0   
 23705                    0.0                    0.0                    0.0   
 18592                    0.0                    0.0                    0.0   
 
        country_bin_income_08  country_bin_income_09  country_bin_income_other  
 14079                    0.0                    0.0                       0.0  
 2026                     0.0                    0.0                       0.0  
 10955                    0.0                    0.0                       1.0  
 1385                     0.0                    0.0                       0.0  
 7067                     0.0                    0.0                       0.0  
 ...                      ...                    ...                       ...  
 25430                    0.0                    0.0                       0.0  
 14899                    0.0                    0.0                       0.0  
 9236                     0.0                    0.0                       1.0  
 23705                    0.0                    0.0                       0.0  
 18592                    0.0                    0.0                       0.0  
 
 [18234 rows x 111 columns],
         age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
 21893  49.0   11.173178           0.0           0.0            60.0   
 24714  41.0   12.399248           0.0           0.0            80.0   
 20725  49.0   12.172340           0.0           0.0            40.0   
 13981  49.0   11.314145           0.0           0.0            40.0   
 25627  31.0   11.674253           0.0           0.0            45.0   
 ...     ...         ...           ...           ...             ...   
 3937   73.0   10.175345           0.0           0.0            30.0   
 23595  20.0   12.354411           0.0           0.0            32.0   
 25500  55.0   11.870810           0.0           0.0            40.0   
 22934  24.0   11.093508           0.0           0.0            30.0   
 18262  28.0   12.160489           0.0        1741.0            52.0   
 
        workclass_?  workclass_Federal-gov  workclass_Local-gov  \
 21893          0.0                    0.0                  0.0   
 24714          0.0                    0.0                  0.0   
 20725          0.0                    0.0                  0.0   
 13981          0.0                    0.0                  0.0   
 25627          0.0                    0.0                  0.0   
 ...            ...                    ...                  ...   
 3937           0.0                    0.0                  0.0   
 23595          0.0                    0.0                  0.0   
 25500          0.0                    0.0                  0.0   
 22934          0.0                    0.0                  0.0   
 18262          0.0                    0.0                  0.0   
 
        workclass_Other  workclass_Private  ...  country_bin_income_01  \
 21893              0.0                1.0  ...                    0.0   
 24714              0.0                1.0  ...                    0.0   
 20725              0.0                1.0  ...                    0.0   
 13981              0.0                1.0  ...                    0.0   
 25627              0.0                1.0  ...                    0.0   
 ...                ...                ...  ...                    ...   
 3937               0.0                1.0  ...                    0.0   
 23595              0.0                1.0  ...                    0.0   
 25500              0.0                1.0  ...                    0.0   
 22934              0.0                1.0  ...                    0.0   
 18262              0.0                1.0  ...                    0.0   
 
        country_bin_income_02  country_bin_income_03  country_bin_income_04  \
 21893                    0.0                    0.0                    1.0   
 24714                    0.0                    0.0                    1.0   
 20725                    0.0                    0.0                    1.0   
 13981                    0.0                    0.0                    1.0   
 25627                    0.0                    0.0                    1.0   
 ...                      ...                    ...                    ...   
 3937                     0.0                    0.0                    1.0   
 23595                    0.0                    0.0                    1.0   
 25500                    0.0                    0.0                    1.0   
 22934                    0.0                    0.0                    1.0   
 18262                    0.0                    0.0                    1.0   
 
        country_bin_income_05  country_bin_income_06  country_bin_income_07  \
 21893                    0.0                    0.0                    0.0   
 24714                    0.0                    0.0                    0.0   
 20725                    0.0                    0.0                    0.0   
 13981                    0.0                    0.0                    0.0   
 25627                    0.0                    0.0                    0.0   
 ...                      ...                    ...                    ...   
 3937                     0.0                    0.0                    0.0   
 23595                    0.0                    0.0                    0.0   
 25500                    0.0                    0.0                    0.0   
 22934                    0.0                    0.0                    0.0   
 18262                    0.0                    0.0                    0.0   
 
        country_bin_income_08  country_bin_income_09  country_bin_income_other  
 21893                    0.0                    0.0                       0.0  
 24714                    0.0                    0.0                       0.0  
 20725                    0.0                    0.0                       0.0  
 13981                    0.0                    0.0                       0.0  
 25627                    0.0                    0.0                       0.0  
 ...                      ...                    ...                       ...  
 3937                     0.0                    0.0                       0.0  
 23595                    0.0                    0.0                       0.0  
 25500                    0.0                    0.0                       0.0  
 22934                    0.0                    0.0                       0.0  
 18262                    0.0                    0.0                       0.0  
 
 [7815 rows x 111 columns],
 14079    0
 2026     0
 10955    1
 1385     1
 7067     0
         ..
 25430    0
 14899    1
 9236     0
 23705    1
 18592    0
 Name: income, Length: 18234, dtype: int64,
 21893    0
 24714    0
 20725    0
 13981    1
 25627    0
         ..
 3937     1
 23595    0
 25500    0
 22934    0
 18262    0
 Name: income, Length: 7815, dtype: int64,
 1234,
 Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=False, features_todrop=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='income',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 numeric_strategy='mean',
                                 target_variable=None)),
                 ('new_levels1',
                  New_Catagorical_Le...
                 ('group', Empty()), ('nonliner', Empty()), ('scaling', Empty()),
                 ('P_transform', Empty()), ('pt_target', Empty()),
                 ('binn', Empty()), ('rem_outliers', Empty()),
                 ('cluster_all', Empty()), ('dummy', Dummify(target='income')),
                 ('fix_perfect', Empty()), ('clean_names', Clean_Colum_Names()),
                 ('feature_select', Empty()), ('fix_multi', Empty()),
                 ('dfs', Empty()), ('pca', Empty())],
          verbose=False),
 [('Classification Setup Config',
                         Description         Value
   0                      session_id          1234
   1                     Target Type        Binary
   2                   Label Encoded    0: 0, 1: 1
   3                   Original Data   (26049, 17)
   4                 Missing Values          False
   5               Numeric Features              5
   6           Categorical Features             11
   7               Ordinal Features          False
   8      High Cardinality Features          False
   9        High Cardinality Method           None
   10                   Sampled Data   (26049, 17)
   11          Transformed Train Set  (18234, 111)
   12           Transformed Test Set   (7815, 111)
   13               Numeric Imputer           mean
   14           Categorical Imputer       constant
   15                     Normalize          False
   16              Normalize Method           None
   17                Transformation          False
   18         Transformation Method           None
   19                           PCA          False
   20                    PCA Method           None
   21                PCA Components           None
   22           Ignore Low Variance          False
   23           Combine Rare Levels          False
   24          Rare Level Threshold           None
   25               Numeric Binning          False
   26               Remove Outliers          False
   27            Outliers Threshold           None
   28      Remove Multicollinearity          False
   29   Multicollinearity Threshold           None
   30                    Clustering          False
   31          Clustering Iteration           None
   32           Polynomial Features          False
   33             Polynomial Degree           None
   34          Trignometry Features          False
   35          Polynomial Threshold           None
   36                Group Features          False
   37             Feature Selection          False
   38  Features Selection Threshold           None
   39           Feature Interaction          False
   40                 Feature Ratio          False
   41         Interaction Threshold           None
   42                  Fix Imbalance         False
   43           Fix Imbalance Method         SMOTE),
  ('X_training Set',
           age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
   14079  56.0   10.366655           0.0           0.0            40.0   
   2026   69.0   12.086016        1848.0           0.0            12.0   
   10955  36.0   12.419400        5178.0           0.0            60.0   
   1385   52.0   11.593906           0.0        1902.0            50.0   
   7067   32.0   12.870491           0.0           0.0            16.0   
   ...     ...         ...           ...           ...             ...   
   25430  29.0   11.693980           0.0           0.0            40.0   
   14899  39.0   11.546902           0.0           0.0            45.0   
   9236   30.0   12.591117           0.0           0.0            50.0   
   23705  59.0   12.834812           0.0           0.0            41.0   
   18592  41.0   12.687850           0.0           0.0            55.0   
   
          workclass_?  workclass_Federal-gov  workclass_Local-gov  \
   14079          0.0                    0.0                  0.0   
   2026           0.0                    0.0                  0.0   
   10955          0.0                    0.0                  0.0   
   1385           0.0                    0.0                  0.0   
   7067           0.0                    0.0                  0.0   
   ...            ...                    ...                  ...   
   25430          0.0                    1.0                  0.0   
   14899          0.0                    0.0                  0.0   
   9236           0.0                    0.0                  0.0   
   23705          1.0                    0.0                  0.0   
   18592          0.0                    0.0                  0.0   
   
          workclass_Other  workclass_Private  ...  country_bin_income_01  \
   14079              0.0                1.0  ...                    0.0   
   2026               0.0                1.0  ...                    0.0   
   10955              0.0                1.0  ...                    0.0   
   1385               0.0                1.0  ...                    0.0   
   7067               0.0                1.0  ...                    0.0   
   ...                ...                ...  ...                    ...   
   25430              0.0                0.0  ...                    0.0   
   14899              0.0                1.0  ...                    0.0   
   9236               0.0                1.0  ...                    0.0   
   23705              0.0                0.0  ...                    0.0   
   18592              0.0                1.0  ...                    0.0   
   
          country_bin_income_02  country_bin_income_03  country_bin_income_04  \
   14079                    0.0                    0.0                    1.0   
   2026                     0.0                    0.0                    1.0   
   10955                    0.0                    0.0                    0.0   
   1385                     0.0                    0.0                    0.0   
   7067                     0.0                    0.0                    1.0   
   ...                      ...                    ...                    ...   
   25430                    0.0                    0.0                    1.0   
   14899                    0.0                    0.0                    1.0   
   9236                     0.0                    0.0                    0.0   
   23705                    0.0                    0.0                    1.0   
   18592                    0.0                    0.0                    1.0   
   
          country_bin_income_05  country_bin_income_06  country_bin_income_07  \
   14079                    0.0                    0.0                    0.0   
   2026                     0.0                    0.0                    0.0   
   10955                    0.0                    0.0                    0.0   
   1385                     1.0                    0.0                    0.0   
   7067                     0.0                    0.0                    0.0   
   ...                      ...                    ...                    ...   
   25430                    0.0                    0.0                    0.0   
   14899                    0.0                    0.0                    0.0   
   9236                     0.0                    0.0                    0.0   
   23705                    0.0                    0.0                    0.0   
   18592                    0.0                    0.0                    0.0   
   
          country_bin_income_08  country_bin_income_09  country_bin_income_other  
   14079                    0.0                    0.0                       0.0  
   2026                     0.0                    0.0                       0.0  
   10955                    0.0                    0.0                       1.0  
   1385                     0.0                    0.0                       0.0  
   7067                     0.0                    0.0                       0.0  
   ...                      ...                    ...                       ...  
   25430                    0.0                    0.0                       0.0  
   14899                    0.0                    0.0                       0.0  
   9236                     0.0                    0.0                       1.0  
   23705                    0.0                    0.0                       0.0  
   18592                    0.0                    0.0                       0.0  
   
   [18234 rows x 111 columns]),
  ('y_training Set',
   14079    0
   2026     0
   10955    1
   1385     1
   7067     0
           ..
   25430    0
   14899    1
   9236     0
   23705    1
   18592    0
   Name: income, Length: 18234, dtype: int64),
  ('X_test Set',
           age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
   21893  49.0   11.173178           0.0           0.0            60.0   
   24714  41.0   12.399248           0.0           0.0            80.0   
   20725  49.0   12.172340           0.0           0.0            40.0   
   13981  49.0   11.314145           0.0           0.0            40.0   
   25627  31.0   11.674253           0.0           0.0            45.0   
   ...     ...         ...           ...           ...             ...   
   3937   73.0   10.175345           0.0           0.0            30.0   
   23595  20.0   12.354411           0.0           0.0            32.0   
   25500  55.0   11.870810           0.0           0.0            40.0   
   22934  24.0   11.093508           0.0           0.0            30.0   
   18262  28.0   12.160489           0.0        1741.0            52.0   
   
          workclass_?  workclass_Federal-gov  workclass_Local-gov  \
   21893          0.0                    0.0                  0.0   
   24714          0.0                    0.0                  0.0   
   20725          0.0                    0.0                  0.0   
   13981          0.0                    0.0                  0.0   
   25627          0.0                    0.0                  0.0   
   ...            ...                    ...                  ...   
   3937           0.0                    0.0                  0.0   
   23595          0.0                    0.0                  0.0   
   25500          0.0                    0.0                  0.0   
   22934          0.0                    0.0                  0.0   
   18262          0.0                    0.0                  0.0   
   
          workclass_Other  workclass_Private  ...  country_bin_income_01  \
   21893              0.0                1.0  ...                    0.0   
   24714              0.0                1.0  ...                    0.0   
   20725              0.0                1.0  ...                    0.0   
   13981              0.0                1.0  ...                    0.0   
   25627              0.0                1.0  ...                    0.0   
   ...                ...                ...  ...                    ...   
   3937               0.0                1.0  ...                    0.0   
   23595              0.0                1.0  ...                    0.0   
   25500              0.0                1.0  ...                    0.0   
   22934              0.0                1.0  ...                    0.0   
   18262              0.0                1.0  ...                    0.0   
   
          country_bin_income_02  country_bin_income_03  country_bin_income_04  \
   21893                    0.0                    0.0                    1.0   
   24714                    0.0                    0.0                    1.0   
   20725                    0.0                    0.0                    1.0   
   13981                    0.0                    0.0                    1.0   
   25627                    0.0                    0.0                    1.0   
   ...                      ...                    ...                    ...   
   3937                     0.0                    0.0                    1.0   
   23595                    0.0                    0.0                    1.0   
   25500                    0.0                    0.0                    1.0   
   22934                    0.0                    0.0                    1.0   
   18262                    0.0                    0.0                    1.0   
   
          country_bin_income_05  country_bin_income_06  country_bin_income_07  \
   21893                    0.0                    0.0                    0.0   
   24714                    0.0                    0.0                    0.0   
   20725                    0.0                    0.0                    0.0   
   13981                    0.0                    0.0                    0.0   
   25627                    0.0                    0.0                    0.0   
   ...                      ...                    ...                    ...   
   3937                     0.0                    0.0                    0.0   
   23595                    0.0                    0.0                    0.0   
   25500                    0.0                    0.0                    0.0   
   22934                    0.0                    0.0                    0.0   
   18262                    0.0                    0.0                    0.0   
   
          country_bin_income_08  country_bin_income_09  country_bin_income_other  
   21893                    0.0                    0.0                       0.0  
   24714                    0.0                    0.0                       0.0  
   20725                    0.0                    0.0                       0.0  
   13981                    0.0                    0.0                       0.0  
   25627                    0.0                    0.0                       0.0  
   ...                      ...                    ...                       ...  
   3937                     0.0                    0.0                       0.0  
   23595                    0.0                    0.0                       0.0  
   25500                    0.0                    0.0                       0.0  
   22934                    0.0                    0.0                       0.0  
   18262                    0.0                    0.0                       0.0  
   
   [7815 rows x 111 columns]),
  ('y_test Set',
   21893    0
   24714    0
   20725    0
   13981    1
   25627    0
           ..
   3937     1
   23595    0
   25500    0
   22934    0
   18262    0
   Name: income, Length: 7815, dtype: int64),
  ('Transformation Pipeline',
   Pipeline(memory=None,
            steps=[('dtypes',
                    DataTypes_Auto_infer(categorical_features=[],
                                         display_types=False, features_todrop=[],
                                         ml_usecase='classification',
                                         numerical_features=[], target='income',
                                         time_features=[])),
                   ('imputer',
                    Simple_Imputer(categorical_strategy='not_available',
                                   numeric_strategy='mean',
                                   target_variable=None)),
                   ('new_levels1',
                    New_Catagorical_Le...
                   ('group', Empty()), ('nonliner', Empty()), ('scaling', Empty()),
                   ('P_transform', Empty()), ('pt_target', Empty()),
                   ('binn', Empty()), ('rem_outliers', Empty()),
                   ('cluster_all', Empty()), ('dummy', Dummify(target='income')),
                   ('fix_perfect', Empty()), ('clean_names', Clean_Colum_Names()),
                   ('feature_select', Empty()), ('fix_multi', Empty()),
                   ('dfs', Empty()), ('pca', Empty())],
            verbose=False))],
 False,
 -1,
 True,
 [],
 [],
 [],
 'no_logging',
 False,
 False,
 '87a3',
 False,
 None,
 <_Logger logs (DEBUG)>,
         age         workclass  fnlwgt_log education      marital_status  \
 0      40.0           Private   12.034917   level_4  Married-civ-spouse   
 1      17.0           Private   11.529055   level_2       Never-married   
 2      18.0           Private   12.775237   level_5       Never-married   
 3      21.0           Private   11.926081   level_5       Never-married   
 4      24.0           Private   11.713693   level_5       Never-married   
 ...     ...               ...         ...       ...                 ...   
 26044  57.0           Private   12.430020   level_3  Married-civ-spouse   
 26045  23.0           Private   12.380412   level_7       Never-married   
 26046  78.0                 ?   12.017898   level_8             Widowed   
 26047  26.0  Self-emp-not-inc   11.929172   level_4       Never-married   
 26048  20.0                 ?   11.511835   level_5       Never-married   
 
               occupation   relationship   race     sex  capital_gain  \
 0                  Sales        Husband  White    Male           0.0   
 1      Machine-op-inspct      Own-child  White    Male           0.0   
 2          Other-service      Own-child  White    Male           0.0   
 3         Prof-specialty      Own-child  White  Female           0.0   
 4           Adm-clerical  Not-in-family  Black  Female           0.0   
 ...                  ...            ...    ...     ...           ...   
 26044      Other-service        Husband  White    Male           0.0   
 26045     Prof-specialty      Own-child  White    Male           0.0   
 26046                  ?  Not-in-family  White  Female           0.0   
 26047     Prof-specialty      Own-child  Black  Female           0.0   
 26048                  ?      Own-child  White  Female           0.0   
 
        capital_loss  hours_per_week native_country  capital_net_pos_key  \
 0               0.0            60.0  United-States                False   
 1               0.0            20.0  United-States                False   
 2               0.0            16.0  United-States                False   
 3               0.0            25.0  United-States                False   
 4               0.0            20.0              ?                False   
 ...             ...             ...            ...                  ...   
 26044           0.0            52.0  United-States                False   
 26045           0.0            40.0  United-States                False   
 26046           0.0            15.0  United-States                False   
 26047           0.0            40.0  United-States                False   
 26048           0.0            30.0  United-States                False   
 
        capital_net_neg_key   country_bin  income  
 0                    False     income_04       1  
 1                    False     income_04       0  
 2                    False     income_04       0  
 3                    False     income_04       0  
 4                    False  income_other       0  
 ...                    ...           ...     ...  
 26044                False     income_04       0  
 26045                False     income_04       0  
 26046                False     income_04       0  
 26047                False     income_04       0  
 26048                False     income_04       0  
 
 [26049 rows x 17 columns],
 'income',
 False)
lgbm = create_model('lightgbm')
tuned_lgbm = tune_model(lgbm, optimize='F1')
Accuracy AUC Recall Prec. F1 Kappa MCC
0 0.8717 0.9263 0.6553 0.7790 0.7118 0.6301 0.6340
1 0.8668 0.9245 0.6584 0.7598 0.7055 0.6199 0.6226
2 0.8624 0.9205 0.6267 0.7631 0.6882 0.6010 0.6058
3 0.8777 0.9430 0.6516 0.8067 0.7209 0.6438 0.6498
4 0.8738 0.9296 0.6576 0.7859 0.7160 0.6358 0.6399
5 0.8700 0.9255 0.6576 0.7713 0.7099 0.6268 0.6301
6 0.8645 0.9224 0.6440 0.7594 0.6969 0.6104 0.6139
7 0.8722 0.9289 0.6689 0.7723 0.7169 0.6349 0.6376
8 0.8733 0.9334 0.7029 0.7561 0.7286 0.6461 0.6468
9 0.8848 0.9418 0.7143 0.7895 0.7500 0.6754 0.6768
Mean 0.8717 0.9296 0.6637 0.7743 0.7145 0.6324 0.6357
SD 0.0062 0.0073 0.0249 0.0153 0.0162 0.0196 0.0190
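Since `tune_model` is optimizing F1 here, it may help to recall how that column relates to the Recall and Prec. columns next to it. The sketch below recomputes F1 as the harmonic mean of precision and recall, using the fold 0 values from the table above. The helper name `f1_from_precision_recall` is made up for illustration and is not part of PyCaret.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Fold 0 of the tuned LightGBM table above: Prec. = 0.7790, Recall = 0.6553
fold0_f1 = f1_from_precision_recall(0.7790, 0.6553)
print(round(fold0_f1, 4))  # 0.7118, which matches the F1 column for fold 0
```

Note that the Mean row averages the per-fold F1 values, so it is not necessarily identical to the F1 computed from the mean precision and mean recall.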
calibrated_lgbm = calibrate_model(tuned_lgbm)
Accuracy AUC Recall Prec. F1 Kappa MCC
0 0.8723 0.9264 0.6485 0.7857 0.7106 0.6296 0.6343
1 0.8657 0.9230 0.6448 0.7641 0.6994 0.6137 0.6174
2 0.8569 0.9198 0.6109 0.7521 0.6742 0.5837 0.5889
3 0.8805 0.9452 0.6448 0.8237 0.7234 0.6486 0.6565
4 0.8788 0.9306 0.6644 0.8005 0.7261 0.6492 0.6538
5 0.8727 0.9265 0.6485 0.7879 0.7114 0.6308 0.6357
6 0.8694 0.9232 0.6417 0.7796 0.7040 0.6212 0.6261
7 0.8727 0.9298 0.6599 0.7802 0.7150 0.6338 0.6375
8 0.8793 0.9357 0.6984 0.7797 0.7368 0.6589 0.6605
9 0.8914 0.9440 0.7143 0.8140 0.7609 0.6910 0.6935
Mean 0.8740 0.9304 0.6576 0.7867 0.7162 0.6360 0.6404
SD 0.0089 0.0083 0.0280 0.0204 0.0220 0.0272 0.0267
interpret_model(tuned_lgbm, plot = 'reason', observation = 15)
plot_model(tuned_lgbm)

plot_model(tuned_lgbm, 'threshold')

plot_model(lgbm, 'confusion_matrix')

plot_model(lgbm, 'calibration')

tuned_lgbm_pred = predict_model(tuned_lgbm, data = test_clean)
tuned_lgbm_pred
age workclass fnlwgt_log education marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country capital_net_pos_key capital_net_neg_key country_bin income Label Score
0 28.0 Private 11.122265 level_5 Never-married Adm-clerical Other-relative White Female 0.0 0.0 40.0 United-States False False income_04 NaN 0 0.0039
1 40.0 Self-emp-inc 10.541888 level_4 Married-civ-spouse Exec-managerial Husband White Male 0.0 0.0 50.0 United-States False False income_04 NaN 0 0.4239
2 20.0 Private 11.607799 level_5 Never-married Handlers-cleaners Own-child White Male 0.0 0.0 25.0 United-States False False income_04 NaN 0 0.0004
3 40.0 Private 11.648653 level_6 Married-civ-spouse Exec-managerial Husband White Male 0.0 0.0 50.0 United-States False False income_04 NaN 1 0.8180
4 37.0 Private 10.844744 level_9 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 99.0 France False False income_08 NaN 1 0.5937
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6507 35.0 Private 11.024236 level_7 Married-civ-spouse Sales Husband White Male 0.0 0.0 40.0 United-States False False income_04 NaN 1 0.5984
6508 41.0 Self-emp-inc 10.379256 level_7 Married-civ-spouse Tech-support Husband White Male 0.0 0.0 40.0 United-States False False income_04 NaN 1 0.5455
6509 39.0 Private 12.921932 level_1 Married-civ-spouse Other-service Husband White Male 0.0 0.0 40.0 Mexico False False income_02 NaN 0 0.0185
6510 35.0 Private 12.102610 level_4 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 40.0 United-States False False income_04 NaN 0 0.2055
6511 28.0 Private 11.962848 level_4 Divorced Handlers-cleaners Unmarried White Female 0.0 0.0 36.0 United-States False False income_04 NaN 0 0.0077

6512 rows × 19 columns

PyCaret (Ensemble)

campare_model = compare_models(sort = 'F1', n_select = 3)
Model Accuracy AUC Recall Prec. F1 Kappa MCC TT (Sec)
0 CatBoost Classifier 0.8761 0.9318 0.6583 0.7948 0.7199 0.6413 0.6462 10.5274
1 Extreme Gradient Boosting 0.8739 0.9292 0.6658 0.7816 0.7187 0.6381 0.6418 9.4363
2 Light Gradient Boosting Machine 0.8734 0.9298 0.6640 0.7804 0.7172 0.6363 0.6400 0.4357
3 Ada Boost Classifier 0.8695 0.9253 0.6397 0.7818 0.7034 0.6208 0.6262 1.3897
4 Gradient Boosting Classifier 0.8718 0.9278 0.6103 0.8138 0.6972 0.6181 0.6285 4.1646
5 Naive Bayes 0.8191 0.9007 0.7820 0.5967 0.6767 0.5543 0.5642 0.0331
6 Extra Trees Classifier 0.8490 0.8985 0.6374 0.7092 0.6711 0.5736 0.5751 1.3146
7 Random Forest Classifier 0.8548 0.8910 0.5964 0.7527 0.6651 0.5741 0.5807 0.2215
8 Linear Discriminant Analysis 0.8592 0.9130 0.5756 0.7857 0.6641 0.5778 0.5891 0.2736
9 K Neighbors Classifier 0.8402 0.8684 0.6202 0.6890 0.6524 0.5491 0.5506 0.5477
10 Ridge Classifier 0.8569 0.0000 0.5291 0.8148 0.6412 0.5570 0.5774 0.0510
11 Logistic Regression 0.8429 0.8885 0.5722 0.7212 0.6375 0.5390 0.5453 0.4018
12 Decision Tree Classifier 0.8208 0.7586 0.6381 0.6279 0.6329 0.5144 0.5145 0.2522
13 SVM - Linear Kernel 0.7638 0.0000 0.5630 0.5595 0.5019 0.3649 0.3886 0.3444
14 Quadratic Discriminant Analysis 0.7544 0.6260 0.3191 0.7977 0.3599 0.2539 0.3499 0.1040
blended_model = blend_models(estimator_list = campare_model, fold = 5, method = 'soft')
Accuracy AUC Recall Prec. F1 Kappa MCC
0 0.8687 0.9258 0.6497 0.7712 0.7052 0.6215 0.6253
1 0.8676 0.9323 0.6285 0.7817 0.6968 0.6134 0.6193
2 0.8774 0.9320 0.6682 0.7930 0.7253 0.6471 0.6511
3 0.8700 0.9269 0.6433 0.7813 0.7056 0.6232 0.6280
4 0.8845 0.9387 0.7143 0.7885 0.7496 0.6748 0.6762
Mean 0.8736 0.9311 0.6608 0.7831 0.7165 0.6360 0.6400
SD 0.0064 0.0046 0.0296 0.0074 0.0190 0.0224 0.0210
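For readers unfamiliar with `method = 'soft'`, the idea of soft voting is roughly to average the class probabilities predicted by each model and pick the class from that average, rather than counting each model's hard vote. The sketch below illustrates only the idea in plain Python. It is not PyCaret's internal implementation, and the function name `soft_vote` and all probabilities are hypothetical values chosen for illustration.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model positive-class probabilities row by row, then threshold."""
    n_models = len(prob_lists)
    averaged = [sum(row) / n_models for row in zip(*prob_lists)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical positive-class probabilities from three models for four rows
model_a = [0.9, 0.4, 0.2, 0.9]
model_b = [0.8, 0.6, 0.1, 0.45]
model_c = [0.7, 0.2, 0.3, 0.45]

averaged, labels = soft_vote([model_a, model_b, model_c])
print(labels)  # [1, 0, 0, 1]
```

The last row shows where soft and hard voting can disagree: only one of the three models is above 0.5, so a majority vote would give 0, but that model is confident enough to pull the average up to about 0.6.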
final_model = finalize_model(blended_model)
ensemble_prediction = predict_model(final_model, data = test_clean)
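# In the prediction output shown earlier, Label is 1 exactly when Score >= 0.5,
# so Score is treated here as the probability of the positive class (>50K).
# Newer PyCaret releases may report the score differently, so check this before thresholding.
THRESHOLD = 0.5
# Binarize in one step instead of overwriting the probabilities in place
ensemble_pred = (ensemble_prediction['Score'] >= THRESHOLD).astype(int)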
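To get a feel for what the `THRESHOLD` value does, here is a small sketch that applies two different cutoffs to the first five `Score` values displayed in the `tuned_lgbm_pred` output above (those are LightGBM scores, not the ensemble's). The helper `to_labels` is a hypothetical name used only for this illustration.

```python
def to_labels(scores, threshold):
    """Map each score to 1 if it reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


# First five Score values from the tuned_lgbm_pred output above
scores = [0.0039, 0.4239, 0.0004, 0.8180, 0.5937]

print(to_labels(scores, 0.5))  # [0, 0, 0, 1, 1], same as the Label column
print(to_labels(scores, 0.4))  # [0, 1, 0, 1, 1], row 1 flips to positive
```

Lowering the threshold trades precision for recall, which is worth keeping in mind when the competition metric is F1.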
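plt.figure(figsize=(10, 6))
# Left: predicted class counts on the test set
plt.subplot(121)
sns.countplot(x=ensemble_pred)

# Right: class counts in the training set, for comparison
plt.subplot(122)
sns.countplot(x=train['income'])
plt.show()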

Make Submission

submission = pd.read_csv(os.path.join(DIR, 'sample_submission.csv'))
submission.head()
id prediction
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
submission['prediction'] = ensemble_pred
submission['prediction'] = submission['prediction'].astype('int')
submission['prediction'].value_counts()
0    5225
1    1287
Name: prediction, dtype: int64
import datetime
timestring = datetime.datetime.now().strftime('%m-%d-%H-%M-%S')
filename = f'kakr-submission-{timestring}.csv'
filename
'kakr-submission-10-20-14-47-37.csv'
submission.to_csv(filename, index=False)
