
[Kaggle] Adult Census Income Prediction Competition Kernel

November 05, 2021, 43 min read

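Last year I took part in the adult census income prediction competition hosted by T-Academy and KaKr, and shared an EDA notebook there.

KaKr (Kaggle Korea) is the largest Kaggle community in Korea, and its influence is reportedly recognized worldwide.

There is a Facebook group, so if you are interested, join it and share Kaggle-related information.

Kaggle Korea Facebook group

I am digging up the kernel I shared as a Kaggle notebook last year and posting it on the blog.

Competition details are available at [T-Academy X KaKr] Adult Census Income Prediction Competition. The dataset can be found under the Data tab.

The kernel I shared on Kaggle is available at 캐하~ EDA + LightGBM + PyCaret. You can use Copy and Edit to modify it and run it right away.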

import numpy as np 
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/kakr-4th-competition/train.csv
/kaggle/input/kakr-4th-competition/test.csv
/kaggle/input/kakr-4th-competition/sample_submission.csv

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
import os

warnings.filterwarnings('ignore')

SEED = 1234
FIG_SIZE = (10, 7)

DIR = '/kaggle/input/kakr-4th-competition'

train = pd.read_csv(os.path.join(DIR, 'train.csv'))
test = pd.read_csv(os.path.join(DIR, 'test.csv'))

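id
age : age
workclass : type of employment
fnlwgt : weight indicating how representative the person is (short for final weight)
education : education level
education_num : education level as a number
marital_status: marital status
occupation : occupation
relationship : family relationship
race : race
sex : sex
capital_gain : capital gains
capital_loss : capital losses
hours_per_week : working hours per week
native_country : country of origin
income : income (the value to predict)
- >50K : 1
- <=50K : 0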

print(train.shape, test.shape)

(26049, 16) (6512, 15)

train.head()

	id	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	hours_per_week	native_country	income
0	0	40	Private	168538	HS-grad	9	Married-civ-spouse	Sales	Husband	White	Male	60	United-States	>50K
1	1	17	Private	101626	9th	5	Never-married	Machine-op-inspct	Own-child	White	Male	20	United-States	<=50K
2	2	18	Private	353358	Some-college	10	Never-married	Other-service	Own-child	White	Male	16	United-States	<=50K
3	3	21	Private	151158	Some-college	10	Never-married	Prof-specialty	Own-child	White	Female	25	United-States	<=50K
4	4	24	Private	122234	Some-college	10	Never-married	Adm-clerical	Not-in-family	Black	Female	20	?	<=50K

test.head()

	id	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	hours_per_week	native_country
0	0	28	Private	67661	Some-college	10	Never-married	Adm-clerical	Other-relative	White	Female	40	United-States
1	1	40	Self-emp-inc	37869	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	50	United-States
2	2	20	Private	109952	Some-college	10	Never-married	Handlers-cleaners	Own-child	White	Male	25	United-States
3	3	40	Private	114537	Assoc-voc	11	Married-civ-spouse	Exec-managerial	Husband	White	Male	50	United-States
4	4	37	Private	51264	Doctorate	16	Married-civ-spouse	Prof-specialty	Husband	White	Male	99	France

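Missing values

No null values (nice and clean!), although some categorical columns use '?' as a placeholder, as the value counts further down show.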

train.isnull().sum()

id                0
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

test.isnull().sum()

id                0
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
dtype: int64
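isnull() only counts NaN, while the head() output above already shows a '?' in native_country. A minimal sketch of counting that placeholder per column follows. It runs on a toy frame rather than the real data, and `count_placeholder` is a hypothetical helper name, not something from the original kernel.

```python
import pandas as pd

def count_placeholder(df, placeholder='?'):
    """Count cells equal to `placeholder` in each non-numeric column."""
    non_numeric = df.select_dtypes(exclude='number')
    return (non_numeric == placeholder).sum()

# Toy frame standing in for train/test (values are illustrative only)
toy = pd.DataFrame({
    'workclass': ['Private', '?', 'Private'],
    'native_country': ['United-States', '?', '?'],
    'age': [40, 17, 18],
})
placeholder_counts = count_placeholder(toy)
print(placeholder_counts)
```

In the notebook, the same call could be made as count_placeholder(train) and count_placeholder(test).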

Check info() for each column

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26049 entries, 0 to 26048
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              26049 non-null  int64 
 1   age             26049 non-null  int64 
 2   workclass       26049 non-null  object
 3   fnlwgt          26049 non-null  int64 
 4   education       26049 non-null  object
 5   education_num   26049 non-null  int64 
 6   marital_status  26049 non-null  object
 7   occupation      26049 non-null  object
 8   relationship    26049 non-null  object
 9   race            26049 non-null  object
 10  sex             26049 non-null  object
 11  capital_gain    26049 non-null  int64 
 12  capital_loss    26049 non-null  int64 
 13  hours_per_week  26049 non-null  int64 
 14  native_country  26049 non-null  object
 15  income          26049 non-null  object
dtypes: int64(7), object(9)
memory usage: 3.2+ MB

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6512 entries, 0 to 6511
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              6512 non-null   int64 
 1   age             6512 non-null   int64 
 2   workclass       6512 non-null   object
 3   fnlwgt          6512 non-null   int64 
 4   education       6512 non-null   object
 5   education_num   6512 non-null   int64 
 6   marital_status  6512 non-null   object
 7   occupation      6512 non-null   object
 8   relationship    6512 non-null   object
 9   race            6512 non-null   object
 10  sex             6512 non-null   object
 11  capital_gain    6512 non-null   int64 
 12  capital_loss    6512 non-null   int64 
 13  hours_per_week  6512 non-null   int64 
 14  native_country  6512 non-null   object
dtypes: int64(7), object(8)
memory usage: 763.2+ KB

Target conversion (income)

train['income'].value_counts()

<=50K    19744
>50K      6305
Name: income, dtype: int64

train['income'] = train['income'].apply(lambda x: 0 if x == '<=50K' else 1)

train['income'].value_counts()

0    19744
1     6305
Name: income, dtype: int64

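Combine the train + test sets into all_data (preprocess both at once)

Strictly speaking, train and test should be processed separately.

The reason to look at the train and test distributions separately is that Kaggle competitions sometimes set a trap by planting values in test that never appear in train.

Since I had already checked the distributions separately beforehand, I combine train + test here for convenience and preprocess them together.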
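One way to run the separate train/test check mentioned above is to list the category values that occur only in test. The sketch below assumes toy frames; `unseen_categories` is a hypothetical helper name and the toy values are illustrative, not taken from the competition files.

```python
import pandas as pd

def unseen_categories(train_df, test_df, columns):
    """Return {column: sorted values that occur in test_df but not in train_df}."""
    result = {}
    for col in columns:
        extra = set(test_df[col].unique()) - set(train_df[col].unique())
        if extra:
            result[col] = sorted(extra)
    return result

# Toy frames standing in for train and test
toy_train = pd.DataFrame({
    'native_country': ['United-States', 'Mexico'],
    'sex': ['Male', 'Female'],
})
toy_test = pd.DataFrame({
    'native_country': ['United-States', 'Holand-Netherlands'],
    'sex': ['Male', 'Male'],
})
unseen = unseen_categories(toy_train, toy_test, ['native_country', 'sex'])
print(unseen)
```

Columns with no test-only values are left out of the result, so an empty dict means the categorical values of test are all covered by train.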

all_data = pd.concat([train, test], sort=False)

workclass

all_data['workclass'].value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

all_data.groupby('workclass')['income'].mean().sort_values().plot(kind='bar', figsize=FIG_SIZE)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb79df07d90>

Confirm that income is 0 for every row in the Without-pay and Never-worked categories.
Merge the Without-pay and Never-worked categories into a single Other category.

workclass_other = ['Without-pay', 'Never-worked']
all_data['workclass'] = all_data['workclass'].apply(lambda x: 'Other' if x in workclass_other else x)

all_data['workclass'].value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Other                  21
Name: workclass, dtype: int64
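The merge above uses a hand-picked list. As a variant, rare categories can also be collapsed by a count threshold. This is a different selection rule from the one used in this kernel, shown only as a sketch; `collapse_rare` and `min_count` are hypothetical names.

```python
import pandas as pd

def collapse_rare(series, min_count, other_label='Other'):
    """Replace categories that occur fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rest with the shared label
    return series.where(~series.isin(rare), other_label)

# Toy series standing in for all_data['workclass']
toy = pd.Series(['Private'] * 5 + ['Without-pay', 'Never-worked'])
collapsed = collapse_rare(toy, min_count=2)
print(collapsed.value_counts())
```

A threshold is easier to reuse across columns, but it ignores whether the merged categories behave alike with respect to income, which is what the manual list above takes into account.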

age

age is a numeric column.

Let's check the distribution of age for each income group.

df1 = all_data.loc[all_data['income'] == 0, 'age']
df2 = all_data.loc[all_data['income'] == 1, 'age']

plt.figure(figsize=FIG_SIZE)
sns.distplot(df1, kde=True, rug=True, hist=False, color='blue')
sns.distplot(df2, kde=True, rug=True, hist=False, color='red')

<matplotlib.axes._subplots.AxesSubplot at 0x7fb79ddbcf50>

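fnlwgt: a weight indicating how representative the person is

It is described as a weight indicating how representative a person is, but I am not sure what that means. There is no real explanation of the data, so I looked at the distribution.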

df1 = all_data.loc[all_data['income'] == 0, 'fnlwgt']
df2 = all_data.loc[all_data['income'] == 1, 'fnlwgt']

plt.figure(figsize=FIG_SIZE)
sns.distplot(df1, kde=True, rug=True, hist=False, color='blue')
sns.distplot(df2, kde=True, rug=True, hist=False, color='red')

<matplotlib.axes._subplots.AxesSubplot at 0x7fb79ac40ed0>

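When split by income group, the distribution of the fnlwgt column shows almost no difference.

Because of possible interactions with other columns I am cautious about removing the feature, but if nothing distinctive turns up later, dropping it is worth considering.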

g = sns.FacetGrid(all_data, col="income", height=5)
g.map(sns.distplot, 'fnlwgt')

<seaborn.axisgrid.FacetGrid at 0x7fb799e46790>

Take the log. (The feature's variance is needlessly large, and pulling it closer to a normal distribution should help optimization.)

all_data['fnlwgt_log'] = np.log(all_data['fnlwgt'])
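To see what the log transform does to a right-skewed column, the skewness before and after can be compared. The sketch below uses synthetic log-normal values as a stand-in for fnlwgt (an assumption, not the real data), so the printed numbers say nothing about the competition dataset itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)
# Synthetic right-skewed positive values standing in for fnlwgt (not the real data)
raw = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000))
logged = np.log(raw)

skew_raw = raw.skew()
skew_log = logged.skew()
print(f'skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}')
```

Because the log is monotonic, it does not change the ordering of the values. A tree-based model such as LightGBM splits on ordering, so the benefit there is probably small; the transform tends to matter more for linear or distance-based models.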

education / education_num

The education column and the education_num column produce identical value_counts().

So I will use only one of the two columns and drop the other.

all_data['education'].value_counts()

HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64

all_data['education_num'].value_counts()

9     10501
10     7291
13     5355
14     1723
11     1382
7      1175
12     1067
6       933
4       646
15      576
5       514
8       433
16      413
3       333
2       168
1        51
Name: education_num, dtype: int64
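Matching value_counts() strongly suggests that the two columns carry the same information, but identical counts alone do not prove a one-to-one mapping. A minimal sketch of a direct check follows; `is_one_to_one` is a hypothetical helper, and the toy frame only mimics a few of the pairs shown above.

```python
import pandas as pd

def is_one_to_one(df, col_a, col_b):
    """True if every value of col_a pairs with exactly one value of col_b and vice versa."""
    a_to_b = df.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = df.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)

# Toy frame standing in for all_data[['education', 'education_num']]
toy = pd.DataFrame({
    'education': ['HS-grad', 'Some-college', 'HS-grad', 'Bachelors'],
    'education_num': [9, 10, 9, 13],
})
print(is_one_to_one(toy, 'education', 'education_num'))
```

In the notebook this would be called as is_one_to_one(all_data, 'education', 'education_num') before dropping one of the columns.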

Confirm that income is 0 for every Preschool row.

all_data.loc[all_data['education'] == 'Preschool', 'income'].sum()

0.0

education and income appear to be fairly strongly related.

However, too many levels can lead to overfitting during training, so I will bin the levels together.

all_data.groupby(['education'])['income'].agg(['mean', 'count']).sort_values('mean')

	mean	count
education
Preschool	0.000000	40
1st-4th	0.037313	134
5th-6th	0.049057	265
9th	0.052632	418
7th-8th	0.057426	505
11th	0.059653	922
12th	0.072423	359
10th	0.072503	731
HS-grad	0.158544	8433
Some-college	0.192586	5800
Assoc-acdm	0.255344	842
Assoc-voc	0.255474	1096
Bachelors	0.415516	4344
Masters	0.561684	1378
Prof-school	0.733906	466
Doctorate	0.734177	316

education_map = {
    'Preschool': 'level_0', 
    '1st-4th': 'level_1', 
    '5th-6th': 'level_1', 
    '7th-8th': 'level_2', 
    '9th': 'level_2', 
    '10th': 'level_3', 
    '11th': 'level_3', 
    '12th': 'level_3', 
    'HS-grad': 'level_4', 
    'Some-college': 'level_5', 
    'Assoc-acdm': 'level_6', 
    'Assoc-voc': 'level_6', 
    'Bachelors': 'level_7', 
    'Masters': 'level_8', 
    'Prof-school': 'level_9', 
    'Doctorate': 'level_9',
}

all_data['education'] = all_data['education'].map(education_map)

all_data['education'].value_counts()

level_4    10501
level_5     7291
level_7     5355
level_3     2541
level_6     2449
level_8     1723
level_2     1160
level_9      989
level_1      501
level_0       51
Name: education, dtype: int64

The levels are now grouped; it may be worth trying different groupings later.

level_1, level_2, and level_3 show almost no difference.

all_data.pivot_table(index='education', values=['income']).sort_values('income').plot(kind='bar', figsize=FIG_SIZE)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb799c65ad0>

Mean income for Preschool = 0

Drop the unused education_num column.

all_data = all_data.drop('education_num', axis=1)  # pass axis by keyword; the positional form is rejected by newer pandas

all_data.columns

Index(['id', 'age', 'workclass', 'fnlwgt', 'education', 'marital_status',
       'occupation', 'relationship', 'race', 'sex', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income',
       'fnlwgt_log'],
      dtype='object')

marital_status

all_data['marital_status'].value_counts()

Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital_status, dtype: int64

all_data.pivot_table(index='marital_status', values='income', aggfunc=['mean', 'count'])#.sort_values('income')

	mean	count
	income	income
marital_status
Divorced	0.104921	3536
Married-AF-spouse	0.526316	19
Married-civ-spouse	0.448789	11970
Married-spouse-absent	0.080838	334
Never-married	0.046802	8568
Separated	0.065375	826
Widowed	0.087940	796

The Married-AF-spouse category has very few rows.

I will fold it into the similar group, Married-civ-spouse.

all_data.loc[all_data['marital_status'] == 'Married-AF-spouse', 'marital_status'] = 'Married-civ-spouse'

occupation

all_data['occupation'].value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

all_data.groupby('occupation')['income'].mean().sort_values().plot(kind='bar', figsize=FIG_SIZE)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb79997abd0>

Confirm that every row with occupation == 'Armed-Forces' has income 0.
Armed-Forces also has very few rows, so to guard against overfitting I merge it into Priv-house-serv.

all_data.loc[all_data['occupation'].isin(['Armed-Forces', 'Priv-house-serv']), 'income'].value_counts()

0.0    129
1.0      1
Name: income, dtype: int64

all_data.loc[all_data['occupation'].isin(['Armed-Forces', 'Priv-house-serv']), 'occupation'] = 'Priv-house-serv'

all_data['occupation'].value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       158
Name: occupation, dtype: int64

relationship

I found nothing notable in the relationship column, so I leave it as is.

all_data['relationship'].value_counts()

Husband           13193
Not-in-family      8305
Own-child          5068
Unmarried          3446
Wife               1568
Other-relative      981
Name: relationship, dtype: int64

all_data.groupby('relationship')['income'].mean().plot(kind='bar', figsize=FIG_SIZE)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb799c62e90>

race

I found nothing notable in the race column either, so I leave it as is.

all_data['race'].value_counts()

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64

Check income by race

all_data.groupby('race')['income'].mean().plot(kind='bar', figsize=FIG_SIZE)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb79989e990>

sex

I found nothing notable in the sex column either, so I leave it as is.

all_data['sex'].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

all_data.groupby('sex')['income'].mean().plot(kind='bar', figsize=FIG_SIZE)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb7995e0450>

capital_gain

My approach here is that capital_gain and capital_loss should be looked at together.

(Hypothesis) Wouldn't a large capital_gain go with a higher income level?

plt.figure(figsize=(12, 9))
sns.distplot(all_data.loc[all_data['capital_gain'] > 0, 'capital_gain'])  # build the mask from all_data itself, not from train

<matplotlib.axes._subplots.AxesSubplot at 0x7fb79954a090>

I found something interesting…

Whenever capital_gain > 50000, income is 1.

g = sns.FacetGrid(all_data.loc[all_data['capital_gain']> 0], col="income", height=7, aspect=1.5)
g.map(sns.distplot, 'capital_gain')

<seaborn.axisgrid.FacetGrid at 0x7fb799915a90>

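capital_gain and capital_loss both look numerical, but they take few enough distinct values that they can also be treated as categorical.

So I check the distribution of values for each income group with value_counts().

The specific keys that appear in the income == 1 group and those that appear in the income == 0 group turn out to be sharply separated.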

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)

df1 = train.loc[(train['income'] == 0) & (train['capital_gain'] > 0), 'capital_gain'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])

df1 = train.loc[(train['income'] == 1) & (train['capital_gain'] > 0), 'capital_gain'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[1])

plt.tight_layout()
plt.show()

capital_loss

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)

df1 = train.loc[(train['income'] == 0) & (train['capital_loss'] > 0), 'capital_loss'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])

df1 = train.loc[(train['income'] == 1) & (train['capital_loss'] > 0), 'capital_loss'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[1])

plt.tight_layout()
plt.show()

capital net

Compute the net value as capital_gain - capital_loss.

all_data['capital_net'] = all_data['capital_gain'] - all_data['capital_loss']
train['capital_net'] = train['capital_gain'] - train['capital_loss']
test['capital_net'] = test['capital_gain'] - test['capital_loss']

plt.figure(figsize=(16, 9))
plt.subplot(1, 2, 1)
sns.distplot(train.loc[ (train['capital_net'] > 0) & (train['income'] == 1), 'capital_net'])

plt.subplot(1, 2, 2)
sns.distplot(train.loc[ (train['capital_net'] > 0) & (train['income'] == 0), 'capital_net'])

<matplotlib.axes._subplots.AxesSubplot at 0x7fb7987effd0>

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)

df1 = all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])

df2 = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index()
df2.plot(kind='bar', ax=axes[1])

plt.tight_layout()
plt.show()

Extract the capital_net key values that occur with income == 1 or with income == 0

pos_key = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
all_key = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
all_key.extend(all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist())
all_key[:5]

[3103, 4386, 4687, 4787, 4934]

A few of them do overlap.

df1 = all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'].isin(pos_key)), 'capital_net'].value_counts().sort_index()
df1.plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7fb79815ec50>

pos_key = all_data.loc[(all_data['income'] == 1) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()
neg_key = all_data.loc[(all_data['income'] == 0) & (all_data['capital_net'] > 0), 'capital_net'].value_counts().sort_index().keys().tolist()

I want to keep only the keys that do not overlap.

capital_net_pos_key = [key for key in pos_key if key not in neg_key]
capital_net_neg_key = [key for key in neg_key if key not in pos_key]

all_data['capital_net_pos_key'] = all_data['capital_net'].apply(lambda x: x in capital_net_pos_key)
all_data['capital_net_neg_key'] = all_data['capital_net'].apply(lambda x: x in capital_net_neg_key)
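The non-overlapping keys can also be computed with set differences and applied with isin, which avoids scanning a Python list for every row. The sketch below runs on a toy frame; `exclusive_keys` is a hypothetical helper and the toy values are illustrative only.

```python
import pandas as pd

def exclusive_keys(df, value_col, target_col):
    """Return (pos_only, neg_only): positive values of value_col seen with only one target class."""
    positive = df[df[value_col] > 0]
    pos = set(positive.loc[positive[target_col] == 1, value_col])
    neg = set(positive.loc[positive[target_col] == 0, value_col])
    return pos - neg, neg - pos

# Toy frame standing in for all_data
toy = pd.DataFrame({
    'capital_net': [3103, 3103, 4386, 2174, 0, 4386],
    'income': [1, 0, 1, 0, 0, None],  # None mimics an unlabelled test row
})
pos_only, neg_only = exclusive_keys(toy, 'capital_net', 'income')
toy['capital_net_pos_key'] = toy['capital_net'].isin(pos_only)
toy['capital_net_neg_key'] = toy['capital_net'].isin(neg_only)
print(sorted(pos_only), sorted(neg_only))
```

These flags are derived from the label itself, so a validation score measured on rows that were used to build the keys may be optimistic. Building the keys from the training fold only would be one way to check for that.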

hours_per_week

There are a lot of people working 40 hours a week.

Among those working 40 hours or more, income == 1 shows up more often.

all_data['hours_per_week'].value_counts()

40    15217
50     2819
45     1824
60     1475
35     1297
      ...  
92        1
94        1
87        1
74        1
82        1
Name: hours_per_week, Length: 94, dtype: int64

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(20, 8)

df1 = all_data.loc[(all_data['income'] == 0), 'hours_per_week'].value_counts().sort_index()
df1.plot(kind='bar', ax=axes[0])

df2 = all_data.loc[(all_data['income'] == 1), 'hours_per_week'].value_counts().sort_index()
df2.plot(kind='bar', ax=axes[1])

plt.tight_layout()
plt.show()

native_country

The country column was a bit of a headache.

For one thing, it has many distinct values, and some of those categories have only a handful of rows.

I will merge them.

train['native_country'].value_counts().shape, test['native_country'].value_counts().shape

((41,), (42,))

all_data['native_country'].value_counts()

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                           29
Greece                           29
Ecuador                          28
Ireland                          24
Hong                             20
Cambodia                         19
Trinadad&Tobago                  19
Laos                             18
Thailand                         18
Yugoslavia                       16
Outlying-US(Guam-USVI-etc)       14
Honduras                         13
Hungary                          13
Scotland                         12
Holand-Netherlands                1
Name: native_country, dtype: int64

Later I also want to try grouping the countries by national income level, using the wiki page below.

List of countries by GNI (nominal) per capita (Wikipedia)

all_data.groupby('native_country')['income'].mean().reset_index()

	native_country	income
0	?	0.234649
1	Cambodia	0.428571
2	Canada	0.315217
3	China	0.228070
4	Columbia	0.038462
5	Cuba	0.263158
6	Dominican-Republic	0.041667
7	Ecuador	0.166667
8	El-Salvador	0.088608
9	England	0.343284
10	France	0.416667
11	Germany	0.346535
12	Greece	0.250000
13	Guatemala	0.057692
14	Haiti	0.114286
15	Holand-Netherlands	NaN
16	Honduras	0.000000
17	Hong	0.285714
18	Hungary	0.272727
19	India	0.402597
20	Iran	0.485714
21	Ireland	0.222222
22	Italy	0.380000
23	Jamaica	0.109375
24	Japan	0.404255
25	Laos	0.133333
26	Mexico	0.048689
27	Nicaragua	0.071429
28	Outlying-US(Guam-USVI-etc)	0.000000
29	Peru	0.076923
30	Philippines	0.300613
31	Poland	0.212766
32	Portugal	0.066667
33	Puerto-Rico	0.115789
34	Scotland	0.250000
35	South	0.222222
36	Taiwan	0.461538
37	Thailand	0.153846
38	Trinadad&Tobago	0.071429
39	United-States	0.247315
40	Vietnam	0.080000
41	Yugoslavia	0.416667

income_01 = ['Jamaica',
 'Haiti',
 'Puerto-Rico',
 'Laos',
 'Thailand',
 'Ecuador',]

income_02 = ['Outlying-US(Guam-USVI-etc)',
 'Honduras',
 'Columbia',
 'Dominican-Republic',
 'Mexico',
 'Guatemala',
 'Portugal',
 'Trinadad&Tobago',
 'Nicaragua',
 'Peru',
 'Vietnam',
 'El-Salvador',]

income_03 = ['Poland',
 'Ireland',
 'South',
 'China',]

income_04 = [
    'United-States',
]
income_05 = [
 'Greece',
 'Scotland',
 'Cuba',
 'Hungary',
 'Hong',
 'Holand-Netherlands',
]
income_06 = [
 'Philippines',
 'Canada',
]
income_07 = [
 'England',
 'Germany',
]

income_08 = [
 'Italy',
 'India',
 'Japan',
 'France',
 'Yugoslavia',
 'Cambodia',
]

income_09 = [
 'Taiwan',
 'Iran',
]

income_other=['?', ]

def convert_country(x):
    if x in income_01:
        return 'income_01'
    elif x in income_02:
        return 'income_02'
    elif x in income_03:
        return 'income_03'
    elif x in income_04:
        return 'income_04'
    elif x in income_05:
        return 'income_05'
    elif x in income_06:
        return 'income_06'
    elif x in income_07:
        return 'income_07'
    elif x in income_08:
        return 'income_08'
    elif x in income_09:
        return 'income_09'
    else:
        return 'income_other'

all_data['country_bin'] = all_data['native_country'].apply(convert_country)

all_data['country_bin'].value_counts()

income_04       29170
income_02        1157
income_other      583
income_06         319
income_01         303
income_08         299
income_03         239
income_07         227
income_05         170
income_09          94
Name: country_bin, dtype: int64
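The if/elif chain in convert_country can also be written as a dictionary lookup. The sketch below uses shortened, hypothetical group lists to stay brief (the full lists are the income_01 to income_09 lists defined above); `groups` and `country_to_bin` are illustrative names.

```python
import pandas as pd

# Shortened group lists for illustration; the real ones are income_01..income_09 above
groups = {
    'income_04': ['United-States'],
    'income_06': ['Philippines', 'Canada'],
    'income_09': ['Taiwan', 'Iran'],
}
# Invert to a flat {country: bin} lookup
country_to_bin = {country: name for name, members in groups.items() for country in members}

countries = pd.Series(['United-States', 'Canada', 'Iran', '?'])
# Anything not listed (including '?') falls back to 'income_other', like the final else branch
country_bin = countries.map(country_to_bin).fillna('income_other')
print(country_bin.tolist())
```

With the lookup table, adding or moving a country between bins only touches the `groups` dict rather than the branching logic.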

Define Features

Let's pick out the useful features.

all_data.columns

Index(['id', 'age', 'workclass', 'fnlwgt', 'education', 'marital_status',
       'occupation', 'relationship', 'race', 'sex', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income',
       'fnlwgt_log', 'capital_net', 'capital_net_pos_key',
       'capital_net_neg_key', 'country_bin'],
      dtype='object')

features = [
#     'id', 
    'age', 
    'workclass', 
#     'fnlwgt', 
    'fnlwgt_log', 
    'education', 
    'marital_status',
    'occupation',
    'relationship', 
    'race',
    'sex',
    'capital_gain',
    'capital_loss', 
    'hours_per_week',
    'native_country',
#     'income',
#     'capital_net', removed because it is highly correlated with capital_gain
    'capital_net_pos_key',
    'capital_net_neg_key',
    'country_bin',
]

label = [
    'income'
]

all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32561 entries, 0 to 6511
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   32561 non-null  int64  
 1   age                  32561 non-null  int64  
 2   workclass            32561 non-null  object 
 3   fnlwgt               32561 non-null  int64  
 4   education            32561 non-null  object 
 5   marital_status       32561 non-null  object 
 6   occupation           32561 non-null  object 
 7   relationship         32561 non-null  object 
 8   race                 32561 non-null  object 
 9   sex                  32561 non-null  object 
 10  capital_gain         32561 non-null  int64  
 11  capital_loss         32561 non-null  int64  
 12  hours_per_week       32561 non-null  int64  
 13  native_country       32561 non-null  object 
 14  income               26049 non-null  float64
 15  fnlwgt_log           32561 non-null  float64
 16  capital_net          32561 non-null  int64  
 17  capital_net_pos_key  32561 non-null  bool   
 18  capital_net_neg_key  32561 non-null  bool   
 19  country_bin          32561 non-null  object 
dtypes: bool(2), float64(2), int64(7), object(9)
memory usage: 6.0+ MB

plt.figure(figsize=(12, 12))
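# Restrict to numeric/bool columns explicitly; newer pandas raises on object columns in corr()
sns.heatmap(abs(all_data.select_dtypes(include=['number', 'bool']).corr()), annot=True)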

<matplotlib.axes._subplots.AxesSubplot at 0x7fb7985fa590>

all_data_dummies = pd.get_dummies(all_data[features + label])
all_data_dummies.head()

	age	fnlwgt_log	hours_per_week	capital_net_pos_key	capital_net_neg_key	income	...	country_bin_income_04	country_bin_income_other
0	40	12.034917	60	False	False	1.0	...	1	0
1	17	11.529055	20	False	False	0.0	...	1	0
2	18	12.775237	16	False	False	0.0	...	1	0
3	21	11.926081	25	False	False	0.0	...	1	0
4	24	11.713693	20	False	False	0.0	...	0	1

5 rows × 111 columns

train_features = all_data_dummies.drop('income', axis=1).iloc[:len(train)]
test_features = all_data_dummies.drop('income', axis=1).iloc[len(train):]

train_label = train[label]

train_features.shape, test_features.shape

((26049, 110), (6512, 110))

Model (LightGBM)

from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import precision_score, recall_score, classification_report, f1_score, confusion_matrix
from sklearn.metrics import log_loss
from tqdm import tqdm_notebook
import lightgbm as lgbm

x_train, x_valid, y_train, y_valid = train_test_split(train_features, train_label, stratify=train_label, test_size=0.2, random_state=SEED)

NUM_BOOST_ROUND = 10000
N_SPLITS = 5

lgbm_param = {
    'objective': 'binary',
    'boosting_type':'gbdt',
    'colsample_bytree':1.0,
    'importance_type':'split',
    'learning_rate':0.1,
    'min_child_samples':20,
    'min_child_weight':0.001,
    'min_split_gain':0,
    'n_estimators':10000,
    'num_leaves':40,
    'random_state':SEED,
    'early_stopping_rounds': 200,
    'reg_alpha':0.6,
    'reg_lambda':0.5,
    'subsample':1.0,
    'subsample_for_bin':200000,
    'subsample_freq':0, 
    'n_jobs':-1, 
}

dtrain = lgbm.Dataset(x_train, y_train)
dvalid = lgbm.Dataset(x_valid, y_valid)

model = lgbm.train(lgbm_param, dtrain, NUM_BOOST_ROUND, 
                   valid_sets=(dtrain, dvalid), 
                   valid_names=('train', 'valid'), 
                   verbose_eval=100,
                  )

Training until validation scores don't improve for 200 rounds
[100]	train's binary_logloss: 0.224887	valid's binary_logloss: 0.284848
[200]	train's binary_logloss: 0.19576	valid's binary_logloss: 0.291253
Early stopping, best iteration is:
[65]	train's binary_logloss: 0.239304	valid's binary_logloss: 0.282915

Check F1 score by threshold

threshold = 0.5
valid_prediction = model.predict(x_valid)
valid_prediction[valid_prediction > threshold] = 1
valid_prediction[valid_prediction <= threshold] = 0
print(classification_report(y_valid, valid_prediction))

              precision    recall  f1-score   support

           0       0.89      0.94      0.92      3949
           1       0.78      0.64      0.70      1261

    accuracy                           0.87      5210
   macro avg       0.83      0.79      0.81      5210
weighted avg       0.86      0.87      0.86      5210

Check how F1_Score changes with the threshold

f1_threshold = np.linspace(0.4, 0.6, 30)
f1_scores = []
max_score = 0
max_threshold = 0

for t in f1_threshold:
    valid_prediction = model.predict(x_valid)
    valid_prediction[valid_prediction > t] = 1
    valid_prediction[valid_prediction <= t] = 0
    score_ = f1_score(y_valid, valid_prediction)
    f1_scores.append(score_)
    if score_ > max_score:
        max_score = score_
        max_threshold = t
        
plt.figure(figsize=(16, 6))
plt.plot(f1_threshold, f1_scores)
plt.axvline(x=max_threshold, linestyle=':', color='r')
plt.xticks(f1_threshold, rotation=90)
plt.show()
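In the loop above, valid_prediction is overwritten in place on every pass, so after the loop it holds the labels for the last threshold tried (0.6) rather than the best one, and that is what any later cell using valid_prediction will see. A minimal sketch of a sweep that leaves the probabilities untouched follows. It uses synthetic labels and scores as stand-ins for y_valid and model.predict(x_valid), and `best_f1_threshold` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, proba, thresholds):
    """Return (best_threshold, best_f1, all_scores) without modifying `proba`."""
    scores = [f1_score(y_true, (proba > t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best], scores

# Synthetic labels and scores standing in for y_valid and model.predict(x_valid)
rng = np.random.default_rng(1234)
y_true = rng.integers(0, 2, size=1000)
proba = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=1000), 0, 1)

thresholds = np.linspace(0.4, 0.6, 30)
best_t, best_score, scores = best_f1_threshold(y_true, proba, thresholds)
# Labels at the selected threshold, kept separate from the raw probabilities
final_prediction = (proba > best_t).astype(int)
print(best_t, best_score)
```

Keeping `proba` intact also means the model only needs to predict once, and `final_prediction` can be passed to confusion_matrix for the chosen threshold.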

confusion_matrix

plt.figure(figsize=FIG_SIZE)
sns.heatmap(confusion_matrix(y_valid, valid_prediction), annot=True, fmt='g')

<matplotlib.axes._subplots.AxesSubplot at 0x7fb798b4b750>

Prediction

pred = model.predict(test_features)

Check the distribution of the pred values

plt.figure(figsize=FIG_SIZE)
sns.distplot(pred)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb798b578d0>

# Use the default of 0.5
THRESHOLD = 0.5

print(len(pred[pred >= THRESHOLD]) / len(pred[pred < THRESHOLD]))

0.25062415978490493

pred[pred >= THRESHOLD] = 1
pred[pred < THRESHOLD] = 0

income_pct = train['income'].value_counts()[1] / train['income'].value_counts()[0]
income_pct

0.3193375202593193

plt.figure(figsize=(10, 6))
plt.subplot(121)
sns.countplot(x=pred)  # pass the data by keyword; newer seaborn no longer accepts it positionally as x

plt.subplot(122)
sns.countplot(train['income'])
plt.show()

PyCaret

!pip install pycaret

Collecting pycaret
  Downloading pycaret-2.1.2-py3-none-any.whl (252 kB)
Requirement already satisfied: imbalanced-learn>=0.6.2 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.7.0)
Requirement already satisfied: joblib in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.14.1)
Requirement already satisfied: spacy in /opt/conda/lib/python3.7/site-packages (from pycaret) (2.3.2)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.7/site-packages (from pycaret) (3.2.1)
Requirement already satisfied: mlxtend in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.17.3)
Requirement already satisfied: xgboost>=0.90 in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.2.0)
Collecting datefinder>=0.7.0
  Downloading datefinder-0.7.1-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.18.5)
Requirement already satisfied: yellowbrick>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.1)
Requirement already satisfied: pyLDAvis in /opt/conda/lib/python3.7/site-packages (from pycaret) (2.1.2)
Requirement already satisfied: cufflinks>=0.17.0 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.17.3)
Collecting mlflow
  Downloading mlflow-1.11.0-py3-none-any.whl (13.9 MB)
Collecting pyod
  Downloading pyod-0.8.3.tar.gz (96 kB)
Requirement already satisfied: textblob in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.15.3)
Requirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.1.3)
Requirement already satisfied: umap-learn in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.4.6)
Requirement already satisfied: kmodes>=0.10.1 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.10.2)
Requirement already satisfied: lightgbm>=2.3.1 in /opt/conda/lib/python3.7/site-packages (from pycaret) (2.3.1)
Requirement already satisfied: gensim in /opt/conda/lib/python3.7/site-packages (from pycaret) (3.8.3)
Requirement already satisfied: plotly>=4.4.1 in /opt/conda/lib/python3.7/site-packages (from pycaret) (4.11.0)
Requirement already satisfied: wordcloud in /opt/conda/lib/python3.7/site-packages (from pycaret) (1.8.0)
Requirement already satisfied: catboost>=0.23.2 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.24.1)
Requirement already satisfied: seaborn in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.10.0)
Requirement already satisfied: scikit-learn>=0.23 in /opt/conda/lib/python3.7/site-packages (from pycaret) (0.23.2)
Collecting pandas-profiling>=2.8.0
  Downloading pandas_profiling-2.9.0-py2.py3-none-any.whl (258 kB)
[K     |████████████████████████████████| 258 kB 13.6 MB/s 
[?25hRequirement already satisfied: nltk in /opt/conda/lib/python3.7/site-packages (from pycaret) (3.2.4)
Requirement already satisfied: IPython in /opt/conda/lib/python3.7/site-packages (from pycaret) (7.13.0)
Requirement already satisfied: ipywidgets in /opt/conda/lib/python3.7/site-packages (from pycaret) (7.5.1)
Requirement already satisfied: scipy>=0.19.1 in /opt/conda/lib/python3.7/site-packages (from imbalanced-learn>=0.6.2->pycaret) (1.4.1)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (0.8.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (1.0.2)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (2.0.3)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (3.0.2)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (0.4.1)
Requirement already satisfied: thinc==7.4.1 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (7.4.1)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (2.23.0)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (1.0.0)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (1.1.3)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (4.45.0)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (46.1.3.post20200325)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /opt/conda/lib/python3.7/site-packages (from spacy->pycaret) (1.0.2)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret) (1.2.0)
Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pycaret) (2.8.1)
Requirement already satisfied: pytz in /opt/conda/lib/python3.7/site-packages (from datefinder>=0.7.0->pycaret) (2019.3)
Requirement already satisfied: regex>=2017.02.08 in /opt/conda/lib/python3.7/site-packages (from datefinder>=0.7.0->pycaret) (2020.4.4)
Requirement already satisfied: wheel>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (0.34.2)
Requirement already satisfied: pytest in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (5.4.1)
Requirement already satisfied: funcy in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (1.15)
Requirement already satisfied: jinja2>=2.7.2 in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (2.11.2)
Requirement already satisfied: numexpr in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (2.7.1)
Requirement already satisfied: future in /opt/conda/lib/python3.7/site-packages (from pyLDAvis->pycaret) (0.18.2)
Requirement already satisfied: colorlover>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from cufflinks>=0.17.0->pycaret) (0.3.0)
Requirement already satisfied: six>=1.9.0 in /opt/conda/lib/python3.7/site-packages (from cufflinks>=0.17.0->pycaret) (1.14.0)
Collecting databricks-cli>=0.8.7
  Downloading databricks-cli-0.12.2.tar.gz (55 kB)
[K     |████████████████████████████████| 55 kB 1.8 MB/s 
[?25hCollecting alembic<=1.4.1
  Downloading alembic-1.4.1.tar.gz (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 14.2 MB/s 
[?25hRequirement already satisfied: cloudpickle in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (1.3.0)
Collecting sqlalchemy<=1.3.13
  Downloading SQLAlchemy-1.3.13.tar.gz (6.0 MB)
[K     |████████████████████████████████| 6.0 MB 15.5 MB/s 
[?25hRequirement already satisfied: pyyaml in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (5.3.1)
Requirement already satisfied: protobuf>=3.6.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (3.13.0)
Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (7.1.1)
Requirement already satisfied: docker>=4.0.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (4.2.0)
Collecting azure-storage-blob>=12.0
  Downloading azure_storage_blob-12.5.0-py2.py3-none-any.whl (326 kB)
[K     |████████████████████████████████| 326 kB 19.1 MB/s 
[?25hRequirement already satisfied: entrypoints in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (0.3)
Requirement already satisfied: gitpython>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (3.1.1)
Collecting gunicorn; platform_system != "Windows"
  Downloading gunicorn-20.0.4-py2.py3-none-any.whl (77 kB)
[K     |████████████████████████████████| 77 kB 3.9 MB/s 
[?25hCollecting querystring-parser
  Downloading querystring_parser-1.2.4.tar.gz (5.5 kB)
Collecting gorilla
  Downloading gorilla-0.3.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: sqlparse in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (0.3.1)
Collecting prometheus-flask-exporter
  Downloading prometheus_flask_exporter-0.18.1.tar.gz (21 kB)
Requirement already satisfied: Flask in /opt/conda/lib/python3.7/site-packages (from mlflow->pycaret) (1.1.2)
Collecting combo
  Downloading combo-0.1.1.tar.gz (37 kB)
Requirement already satisfied: numba>=0.35 in /opt/conda/lib/python3.7/site-packages (from pyod->pycaret) (0.48.0)
Requirement already satisfied: statsmodels in /opt/conda/lib/python3.7/site-packages (from pyod->pycaret) (0.11.1)
Collecting suod
  Downloading suod-0.0.4.tar.gz (2.1 MB)
[K     |████████████████████████████████| 2.1 MB 19.0 MB/s 
[?25hRequirement already satisfied: smart-open>=1.8.1 in /opt/conda/lib/python3.7/site-packages (from gensim->pycaret) (2.2.1)
Requirement already satisfied: retrying>=1.3.3 in /opt/conda/lib/python3.7/site-packages (from plotly>=4.4.1->pycaret) (1.3.3)
Requirement already satisfied: pillow in /opt/conda/lib/python3.7/site-packages (from wordcloud->pycaret) (7.2.0)
Requirement already satisfied: graphviz in /opt/conda/lib/python3.7/site-packages (from catboost>=0.23.2->pycaret) (0.8.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=0.23->pycaret) (2.1.0)
Requirement already satisfied: missingno>=0.4.2 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (0.4.2)
Requirement already satisfied: confuse>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (1.1.0)
Requirement already satisfied: attrs>=19.3.0 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (19.3.0)
Requirement already satisfied: htmlmin>=0.1.12 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (0.1.12)
Collecting visions[type_image_path]==0.5.0
  Downloading visions-0.5.0-py3-none-any.whl (64 kB)
[K     |████████████████████████████████| 64 kB 2.2 MB/s 
[?25hCollecting tangled-up-in-unicode>=0.0.6
  Downloading tangled_up_in_unicode-0.0.6-py3-none-any.whl (3.1 MB)
[K     |████████████████████████████████| 3.1 MB 24.2 MB/s 
[?25hRequirement already satisfied: phik>=0.9.10 in /opt/conda/lib/python3.7/site-packages (from pandas-profiling>=2.8.0->pycaret) (0.9.11)
Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (2.6.1)
Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (0.1.0)
Requirement already satisfied: pexpect; sys_platform != "win32" in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (4.8.0)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (3.0.5)
Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (0.15.2)
Requirement already satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (4.3.3)
Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (0.7.5)
Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from IPython->pycaret) (4.4.2)
Requirement already satisfied: widgetsnbextension~=3.5.0 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret) (3.5.1)
Requirement already satisfied: ipykernel>=4.5.1 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret) (5.1.1)
Requirement already satisfied: nbformat>=4.2.0 in /opt/conda/lib/python3.7/site-packages (from ipywidgets->pycaret) (5.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (2020.6.20)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (2.9)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pycaret) (3.0.4)
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /opt/conda/lib/python3.7/site-packages (from catalogue<1.1.0,>=0.0.7->spacy->pycaret) (2.0.0)
Requirement already satisfied: py>=1.5.0 in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (1.8.1)
Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (20.1)
Requirement already satisfied: more-itertools>=4.0.0 in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (8.2.0)
Requirement already satisfied: pluggy<1.0,>=0.12 in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (0.13.0)
Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from pytest->pyLDAvis->pycaret) (0.1.9)
Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from jinja2>=2.7.2->pyLDAvis->pycaret) (1.1.1)
Requirement already satisfied: tabulate>=0.7.7 in /opt/conda/lib/python3.7/site-packages (from databricks-cli>=0.8.7->mlflow->pycaret) (0.8.7)
Collecting tenacity>=6.2.0
  Downloading tenacity-6.2.0-py2.py3-none-any.whl (24 kB)
Requirement already satisfied: Mako in /opt/conda/lib/python3.7/site-packages (from alembic<=1.4.1->mlflow->pycaret) (1.1.3)
Requirement already satisfied: python-editor>=0.3 in /opt/conda/lib/python3.7/site-packages (from alembic<=1.4.1->mlflow->pycaret) (1.0.4)
Requirement already satisfied: websocket-client>=0.32.0 in /opt/conda/lib/python3.7/site-packages (from docker>=4.0.0->mlflow->pycaret) (0.57.0)
Collecting azure-core<2.0.0,>=1.6.0
  Downloading azure_core-1.8.2-py2.py3-none-any.whl (122 kB)
[K     |████████████████████████████████| 122 kB 28.6 MB/s 
[?25hCollecting msrest>=0.6.10
  Downloading msrest-0.6.19-py2.py3-none-any.whl (84 kB)
[K     |████████████████████████████████| 84 kB 1.9 MB/s 
[?25hRequirement already satisfied: cryptography>=2.1.4 in /opt/conda/lib/python3.7/site-packages (from azure-storage-blob>=12.0->mlflow->pycaret) (2.8)
Requirement already satisfied: gitdb<5,>=4.0.1 in /opt/conda/lib/python3.7/site-packages (from gitpython>=2.1.0->mlflow->pycaret) (4.0.4)
Requirement already satisfied: prometheus_client in /opt/conda/lib/python3.7/site-packages (from prometheus-flask-exporter->mlflow->pycaret) (0.7.1)
Requirement already satisfied: Werkzeug>=0.15 in /opt/conda/lib/python3.7/site-packages (from Flask->mlflow->pycaret) (1.0.1)
Requirement already satisfied: itsdangerous>=0.24 in /opt/conda/lib/python3.7/site-packages (from Flask->mlflow->pycaret) (1.1.0)
Requirement already satisfied: llvmlite<0.32.0,>=0.31.0dev0 in /opt/conda/lib/python3.7/site-packages (from numba>=0.35->pyod->pycaret) (0.31.0)
Requirement already satisfied: patsy>=0.5 in /opt/conda/lib/python3.7/site-packages (from statsmodels->pyod->pycaret) (0.5.1)
Requirement already satisfied: boto3 in /opt/conda/lib/python3.7/site-packages (from smart-open>=1.8.1->gensim->pycaret) (1.15.13)
Requirement already satisfied: networkx>=2.4 in /opt/conda/lib/python3.7/site-packages (from visions[type_image_path]==0.5.0->pandas-profiling>=2.8.0->pycaret) (2.4)
Requirement already satisfied: imagehash; extra == "type_image_path" in /opt/conda/lib/python3.7/site-packages (from visions[type_image_path]==0.5.0->pandas-profiling>=2.8.0->pycaret) (4.1.0)
Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect; sys_platform != "win32"->IPython->pycaret) (0.6.0)
Requirement already satisfied: parso>=0.5.2 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.10->IPython->pycaret) (0.5.2)
Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.7/site-packages (from traitlets>=4.2->IPython->pycaret) (0.2.0)
Requirement already satisfied: notebook>=4.4.1 in /opt/conda/lib/python3.7/site-packages (from widgetsnbextension~=3.5.0->ipywidgets->pycaret) (5.5.0)
Requirement already satisfied: tornado>=4.2 in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (5.0.2)
Requirement already satisfied: jupyter-client in /opt/conda/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->pycaret) (6.1.3)
Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (4.6.3)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /opt/conda/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->pycaret) (3.2.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy->pycaret) (3.1.0)
Collecting isodate>=0.6.0
  Downloading isodate-0.6.0-py2.py3-none-any.whl (45 kB)
[K     |████████████████████████████████| 45 kB 1.4 MB/s 
[?25hRequirement already satisfied: requests-oauthlib>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from msrest>=0.6.10->azure-storage-blob>=12.0->mlflow->pycaret) (1.2.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in /opt/conda/lib/python3.7/site-packages (from cryptography>=2.1.4->azure-storage-blob>=12.0->mlflow->pycaret) (1.14.0)
Requirement already satisfied: smmap<4,>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from gitdb<5,>=4.0.1->gitpython>=2.1.0->mlflow->pycaret) (3.0.2)
Requirement already satisfied: botocore<1.19.0,>=1.18.13 in /opt/conda/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim->pycaret) (1.18.13)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim->pycaret) (0.10.0)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim->pycaret) (0.3.3)
Requirement already satisfied: PyWavelets in /opt/conda/lib/python3.7/site-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.5.0->pandas-profiling>=2.8.0->pycaret) (1.1.1)
Requirement already satisfied: nbconvert in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (5.6.1)
Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (19.0.0)
Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (1.5.0)
Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.8.3)
Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.7/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->pycaret) (0.16.0)
Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib>=0.5.0->msrest>=0.6.10->azure-storage-blob>=12.0->mlflow->pycaret) (3.0.1)
Requirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.1.4->azure-storage-blob>=12.0->mlflow->pycaret) (2.20)
Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (1.4.2)
Requirement already satisfied: bleach in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (3.1.4)
Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.8.4)
Requirement already satisfied: testpath in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.4.4)
Requirement already satisfied: defusedxml in /opt/conda/lib/python3.7/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.6.0)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.7/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->pycaret) (0.5.1)
Building wheels for collected packages: pyod, databricks-cli, alembic, sqlalchemy, querystring-parser, prometheus-flask-exporter, combo, suod
  Building wheel for pyod (setup.py) ... [?25l- \ | done
[?25h  Created wheel for pyod: filename=pyod-0.8.3-py3-none-any.whl size=110347 sha256=6858fa6eda242cf3101a17d6095f16ac26c9dc1b497ae42b4ee3cfeb5156be5d
  Stored in directory: /root/.cache/pip/wheels/fc/fc/77/6e530134c9ee2b45ef0840f0c8046b3be595624881cf533d7a
  Building wheel for databricks-cli (setup.py) ... [?25l- \ | done
[?25h  Created wheel for databricks-cli: filename=databricks_cli-0.12.2-py3-none-any.whl size=101163 sha256=be3329799d7581f8e81992ec5ca7ab24167fb3b92d0187c4a62c8e049195a955
  Stored in directory: /root/.cache/pip/wheels/9e/bb/9d/78e02afa234019a22759d08d285bae87a88fa881f5db58db25
  Building wheel for alembic (setup.py) ... [?25l- \ | done
[?25h  Created wheel for alembic: filename=alembic-1.4.1-py2.py3-none-any.whl size=158154 sha256=3a4b7a763a6ce226a933b9aa155d11719a424ddff92458f00091c0e7c3bd50cf
  Stored in directory: /root/.cache/pip/wheels/be/5d/0a/9e13f53f4f5dfb67cd8d245bb7cdffe12f135846f491a283e3
  Building wheel for sqlalchemy (setup.py) ... [?25l- \ | / - \ | / - \ done
[?25h  Created wheel for sqlalchemy: filename=SQLAlchemy-1.3.13-cp37-cp37m-linux_x86_64.whl size=1221862 sha256=8a33081e209764349239860912cc6cde907f8613081f067c95fecca823653bd3
  Stored in directory: /root/.cache/pip/wheels/b9/ba/77/163f10f14bd489351530603e750c195b0ceceed2f3be2b32f1
  Building wheel for querystring-parser (setup.py) ... [?25l- \ done
[?25h  Created wheel for querystring-parser: filename=querystring_parser-1.2.4-py3-none-any.whl size=7076 sha256=eed4ac8c5058079d17a797b70ae3ed32cbea0c95ed24a361365cd86217a49dca
  Stored in directory: /root/.cache/pip/wheels/69/38/7a/072b5863ca334d012821a287fd1d066cea33abdcda3ef2f878
  Building wheel for prometheus-flask-exporter (setup.py) ... [?25l- \ done
[?25h  Created wheel for prometheus-flask-exporter: filename=prometheus_flask_exporter-0.18.1-py3-none-any.whl size=17157 sha256=66518d2e9e9f0b4e8e78fa57037102604ebeae2bff773b5d1fcc7350404f267d
  Stored in directory: /root/.cache/pip/wheels/c4/b6/b5/e76659f3b2a3a226565e27f0a7eb7a3ac93c3f4d68acfbe617
  Building wheel for combo (setup.py) ... [?25l- \ done
[?25h  Created wheel for combo: filename=combo-0.1.1-py3-none-any.whl size=42113 sha256=ab8b32daeae645fb4bc1c7d5de5441eeaad8eefa09bdaf459e8168e23e25d8b5
  Stored in directory: /root/.cache/pip/wheels/3e/e1/f8/08f19ba48f75d3dbbb549cec4b86cc0392c14b2b6bb81f4e1f
  Building wheel for suod (setup.py) ... [?25l- \ | / done
[?25h  Created wheel for suod: filename=suod-0.0.4-py3-none-any.whl size=2167157 sha256=cc1b8461f955dadb8d45e095b62560588668395536e17bb7953d4c329d640dff
  Stored in directory: /root/.cache/pip/wheels/dc/ae/aa/3b8cc857617f3ba6cb9e6b804c79c69d0ed60a08e022e9a4f3
Successfully built pyod databricks-cli alembic sqlalchemy querystring-parser prometheus-flask-exporter combo suod
Installing collected packages: datefinder, tenacity, databricks-cli, sqlalchemy, alembic, azure-core, isodate, msrest, azure-storage-blob, gunicorn, querystring-parser, gorilla, prometheus-flask-exporter, mlflow, combo, suod, pyod, tangled-up-in-unicode, visions, pandas-profiling, pycaret
  Attempting uninstall: tenacity
    Found existing installation: tenacity 6.1.0
    Uninstalling tenacity-6.1.0:
      Successfully uninstalled tenacity-6.1.0
  Attempting uninstall: sqlalchemy
    Found existing installation: SQLAlchemy 1.3.16
    Uninstalling SQLAlchemy-1.3.16:
      Successfully uninstalled SQLAlchemy-1.3.16
  Attempting uninstall: alembic
    Found existing installation: alembic 1.4.3
    Uninstalling alembic-1.4.3:
      Successfully uninstalled alembic-1.4.3
  Attempting uninstall: tangled-up-in-unicode
    Found existing installation: tangled-up-in-unicode 0.0.4
    Uninstalling tangled-up-in-unicode-0.0.4:
      Successfully uninstalled tangled-up-in-unicode-0.0.4
  Attempting uninstall: visions
    Found existing installation: visions 0.4.1
    Uninstalling visions-0.4.1:
      Successfully uninstalled visions-0.4.1
  Attempting uninstall: pandas-profiling
    Found existing installation: pandas-profiling 2.6.0
    Uninstalling pandas-profiling-2.6.0:
      Successfully uninstalled pandas-profiling-2.6.0
[31mERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

pandas-profiling 2.9.0 requires seaborn>=0.10.1, but you'll have seaborn 0.10.0 which is incompatible.[0m
Successfully installed alembic-1.4.1 azure-core-1.8.2 azure-storage-blob-12.5.0 combo-0.1.1 databricks-cli-0.12.2 datefinder-0.7.1 gorilla-0.3.0 gunicorn-20.0.4 isodate-0.6.0 mlflow-1.11.0 msrest-0.6.19 pandas-profiling-2.9.0 prometheus-flask-exporter-0.18.1 pycaret-2.1.2 pyod-0.8.3 querystring-parser-1.2.4 sqlalchemy-1.3.13 suod-0.0.4 tangled-up-in-unicode-0.0.6 tenacity-6.2.0 visions-0.5.0
[33mWARNING: You are using pip version 20.2.3; however, version 20.2.4 is available.
You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.[0m

from pycaret.classification import *

Check the features & label defined above

features, label

(['age',
  'workclass',
  'fnlwgt_log',
  'education',
  'marital_status',
  'occupation',
  'relationship',
  'race',
  'sex',
  'capital_gain',
  'capital_loss',
  'hours_per_week',
  'native_country',
  'capital_net_pos_key',
  'capital_net_neg_key',
  'country_bin'],
 ['income'])

# Take an explicit copy so the type casts below modify this frame, not a view of all_data
all_data_caret = all_data[features + label].copy()

all_data_caret.head()

	age	workclass	fnlwgt_log	education	marital_status	occupation	relationship	race	sex	hours_per_week	native_country	capital_net_pos_key	capital_net_neg_key	country_bin	income
0	40	Private	12.034917	level_4	Married-civ-spouse	Sales	Husband	White	Male	60	United-States	False	False	income_04	1.0
1	17	Private	11.529055	level_2	Never-married	Machine-op-inspct	Own-child	White	Male	20	United-States	False	False	income_04	0.0
2	18	Private	12.775237	level_5	Never-married	Other-service	Own-child	White	Male	16	United-States	False	False	income_04	0.0
3	21	Private	11.926081	level_5	Never-married	Prof-specialty	Own-child	White	Female	25	United-States	False	False	income_04	0.0
4	24	Private	11.713693	level_5	Never-married	Adm-clerical	Not-in-family	Black	Female	20	?	False	False	income_other	0.0

Without explicit type casting, setup did not pick up the column types properly..ㅠ

all_data_caret['age'] = all_data_caret['age'].astype('float')
# all_data_caret['capital_net'] = all_data_caret['capital_net'].astype('float')
all_data_caret['hours_per_week'] = all_data_caret['hours_per_week'].astype('float')
all_data_caret['capital_gain'] = all_data_caret['capital_gain'].astype('float')
all_data_caret['capital_loss'] = all_data_caret['capital_loss'].astype('float')
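
The four casts above can also be written as one `astype` call with a column-to-dtype mapping. Here is a minimal sketch on a toy frame. The `toy` DataFrame and its values are made up for illustration and are not the competition data.

```python
import pandas as pd

# Toy stand-in for all_data_caret; values are illustrative only.
toy = pd.DataFrame({
    'age': [40, 17],
    'hours_per_week': [60, 20],
    'capital_gain': [0, 0],
    'capital_loss': [0, 0],
    'workclass': ['Private', 'Private'],
})

# Columns that should be treated as numeric features.
numeric_cols = ['age', 'hours_per_week', 'capital_gain', 'capital_loss']

# Cast all of them to float in one call; the other columns keep their dtype.
toy = toy.astype({col: 'float' for col in numeric_cols})
print(toy.dtypes)
```

Another option that may be worth trying is passing the column names to the `numeric_features` argument of `setup`, although this kernel did not test that route.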

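# The first len(train) rows are the training rows; the rest are the test rows
train_clean = all_data_caret[:len(train)].copy()
test_clean = all_data_caret[len(train):].copy()

# Cast the label from float back to int
train_clean['income'] = train_clean['income'].astype('int')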

train_clean.head()

	age	workclass	fnlwgt_log	education	marital_status	occupation	relationship	race	sex	hours_per_week	native_country	capital_net_pos_key	capital_net_neg_key	country_bin	income
0	40.0	Private	12.034917	level_4	Married-civ-spouse	Sales	Husband	White	Male	60.0	United-States	False	False	income_04	1
1	17.0	Private	11.529055	level_2	Never-married	Machine-op-inspct	Own-child	White	Male	20.0	United-States	False	False	income_04	0
2	18.0	Private	12.775237	level_5	Never-married	Other-service	Own-child	White	Male	16.0	United-States	False	False	income_04	0
3	21.0	Private	11.926081	level_5	Never-married	Prof-specialty	Own-child	White	Female	25.0	United-States	False	False	income_04	0
4	24.0	Private	11.713693	level_5	Never-married	Adm-clerical	Not-in-family	Black	Female	20.0	?	False	False	income_other	0
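
The slicing by `len(train)` works only if `all_data` was built by stacking train on top of test, which is assumed here. A toy sketch of that round trip follows; the frames and values are made up. It also shows why `income` appears as 1.0 / 0.0 before the `int` cast: the test rows carry no label, so the column becomes float.

```python
import pandas as pd

# Toy frames; the real train/test come from the competition CSVs.
train_toy = pd.DataFrame({'age': [40, 17, 18], 'income': [1, 0, 0]})
test_toy = pd.DataFrame({'age': [21, 24]})

# Stack train on top of test; test rows get NaN for the missing label.
all_toy = pd.concat([train_toy, test_toy], ignore_index=True)

# Positional split: the first len(train_toy) rows are train, the rest are test.
train_back = all_toy.iloc[:len(train_toy)].copy()
test_back = all_toy.iloc[len(train_toy):].copy()

# The label turned into float because of the NaN rows; cast it back to int.
train_back['income'] = train_back['income'].astype('int')
print(train_back.shape, test_back.shape)
```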

setup(data = train_clean, target = 'income', session_id=SEED, silent=True)

Setup Succesfully Completed!

	Description	Value
0	session_id	1234
1	Target Type	Binary
2	Label Encoded	0: 0, 1: 1
3	Original Data	(26049, 17)
4	Missing Values	False
5	Numeric Features	5
6	Categorical Features	11
7	Ordinal Features	False
8	High Cardinality Features	False
9	High Cardinality Method	None
10	Sampled Data	(26049, 17)
11	Transformed Train Set	(18234, 111)
12	Transformed Test Set	(7815, 111)
13	Numeric Imputer	mean
14	Categorical Imputer	constant
15	Normalize	False
16	Normalize Method	None
17	Transformation	False
18	Transformation Method	None
19	PCA	False
20	PCA Method	None
21	PCA Components	None
22	Ignore Low Variance	False
23	Combine Rare Levels	False
24	Rare Level Threshold	None
25	Numeric Binning	False
26	Remove Outliers	False
27	Outliers Threshold	None
28	Remove Multicollinearity	False
29	Multicollinearity Threshold	None
30	Clustering	False
31	Clustering Iteration	None
32	Polynomial Features	False
33	Polynomial Degree	None
34	Trignometry Features	False
35	Polynomial Threshold	None
36	Group Features	False
37	Feature Selection	False
38	Features Selection Threshold	None
39	Feature Interaction	False
40	Feature Ratio	False
41	Interaction Threshold	None
42	Fix Imbalance	False
43	Fix Imbalance Method	SMOTE
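
The Transformed Train Set / Test Set row counts in the table are consistent with a 70/30 hold-out split. Below is a small arithmetic sketch, assuming PyCaret's default `train_size` of 0.7 and the rounding rule used by scikit-learn's `train_test_split` (test count rounded up, train gets the remainder). Both are assumptions, not something this kernel sets explicitly.

```python
import math

n_samples = 26049   # rows in train_clean, from the setup output
train_size = 0.7    # assumed: PyCaret's default train_size

# Assumed rounding rule: the test count is rounded up and
# the training set receives the remaining rows.
n_test = math.ceil((1 - train_size) * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

So the models compared later are trained on the hold-out training part only, and the held-out rows are used for evaluation inside PyCaret.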

(        age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
 0      40.0   12.034917           0.0           0.0            60.0   
 1      17.0   11.529055           0.0           0.0            20.0   
 2      18.0   12.775237           0.0           0.0            16.0   
 3      21.0   11.926081           0.0           0.0            25.0   
 4      24.0   11.713693           0.0           0.0            20.0   
 ...     ...         ...           ...           ...             ...   
 26044  57.0   12.430020           0.0           0.0            52.0   
 26045  23.0   12.380412           0.0           0.0            40.0   
 26046  78.0   12.017898           0.0           0.0            15.0   
 26047  26.0   11.929172           0.0           0.0            40.0   
 26048  20.0   11.511835           0.0           0.0            30.0   
 
        workclass_?  workclass_Federal-gov  workclass_Local-gov  \
 0              0.0                    0.0                  0.0   
 1              0.0                    0.0                  0.0   
 2              0.0                    0.0                  0.0   
 3              0.0                    0.0                  0.0   
 4              0.0                    0.0                  0.0   
 ...            ...                    ...                  ...   
 26044          0.0                    0.0                  0.0   
 26045          0.0                    0.0                  0.0   
 26046          1.0                    0.0                  0.0   
 26047          0.0                    0.0                  0.0   
 26048          1.0                    0.0                  0.0   
 
        workclass_Other  workclass_Private  ...  country_bin_income_01  \
 0                  0.0                1.0  ...                    0.0   
 1                  0.0                1.0  ...                    0.0   
 2                  0.0                1.0  ...                    0.0   
 3                  0.0                1.0  ...                    0.0   
 4                  0.0                1.0  ...                    0.0   
 ...                ...                ...  ...                    ...   
 26044              0.0                1.0  ...                    0.0   
 26045              0.0                1.0  ...                    0.0   
 26046              0.0                0.0  ...                    0.0   
 26047              0.0                0.0  ...                    0.0   
 26048              0.0                0.0  ...                    0.0   
 
        country_bin_income_02  country_bin_income_03  country_bin_income_04  \
 0                        0.0                    0.0                    1.0   
 1                        0.0                    0.0                    1.0   
 2                        0.0                    0.0                    1.0   
 3                        0.0                    0.0                    1.0   
 4                        0.0                    0.0                    0.0   
 ...                      ...                    ...                    ...   
 26044                    0.0                    0.0                    1.0   
 26045                    0.0                    0.0                    1.0   
 26046                    0.0                    0.0                    1.0   
 26047                    0.0                    0.0                    1.0   
 26048                    0.0                    0.0                    1.0   
 
        country_bin_income_05  country_bin_income_06  country_bin_income_07  \
 0                        0.0                    0.0                    0.0   
 1                        0.0                    0.0                    0.0   
 2                        0.0                    0.0                    0.0   
 3                        0.0                    0.0                    0.0   
 4                        0.0                    0.0                    0.0   
 ...                      ...                    ...                    ...   
 26044                    0.0                    0.0                    0.0   
 26045                    0.0                    0.0                    0.0   
 26046                    0.0                    0.0                    0.0   
 26047                    0.0                    0.0                    0.0   
 26048                    0.0                    0.0                    0.0   
 
        country_bin_income_08  country_bin_income_09  country_bin_income_other  
 0                        0.0                    0.0                       0.0  
 1                        0.0                    0.0                       0.0  
 2                        0.0                    0.0                       0.0  
 3                        0.0                    0.0                       0.0  
 4                        0.0                    0.0                       1.0  
 ...                      ...                    ...                       ...  
 26044                    0.0                    0.0                       0.0  
 26045                    0.0                    0.0                       0.0  
 26046                    0.0                    0.0                       0.0  
 26047                    0.0                    0.0                       0.0  
 26048                    0.0                    0.0                       0.0  
 
 [26049 rows x 111 columns],
 0        1
 1        0
 2        0
 3        0
 4        0
         ..
 26044    0
 26045    0
 26046    0
 26047    0
 26048    0
 Name: income, Length: 26049, dtype: int64,
         age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
 14079  56.0   10.366655           0.0           0.0            40.0   
 2026   69.0   12.086016        1848.0           0.0            12.0   
 10955  36.0   12.419400        5178.0           0.0            60.0   
 1385   52.0   11.593906           0.0        1902.0            50.0   
 7067   32.0   12.870491           0.0           0.0            16.0   
 ...     ...         ...           ...           ...             ...   
 25430  29.0   11.693980           0.0           0.0            40.0   
 14899  39.0   11.546902           0.0           0.0            45.0   
 9236   30.0   12.591117           0.0           0.0            50.0   
 23705  59.0   12.834812           0.0           0.0            41.0   
 18592  41.0   12.687850           0.0           0.0            55.0   
 
        workclass_?  workclass_Federal-gov  workclass_Local-gov  \
 14079          0.0                    0.0                  0.0   
 2026           0.0                    0.0                  0.0   
 10955          0.0                    0.0                  0.0   
 1385           0.0                    0.0                  0.0   
 7067           0.0                    0.0                  0.0   
 ...            ...                    ...                  ...   
 25430          0.0                    1.0                  0.0   
 14899          0.0                    0.0                  0.0   
 9236           0.0                    0.0                  0.0   
 23705          1.0                    0.0                  0.0   
 18592          0.0                    0.0                  0.0   
 
        workclass_Other  workclass_Private  ...  country_bin_income_01  \
 14079              0.0                1.0  ...                    0.0   
 2026               0.0                1.0  ...                    0.0   
 10955              0.0                1.0  ...                    0.0   
 1385               0.0                1.0  ...                    0.0   
 7067               0.0                1.0  ...                    0.0   
 ...                ...                ...  ...                    ...   
 25430              0.0                0.0  ...                    0.0   
 14899              0.0                1.0  ...                    0.0   
 9236               0.0                1.0  ...                    0.0   
 23705              0.0                0.0  ...                    0.0   
 18592              0.0                1.0  ...                    0.0   
 
        country_bin_income_02  country_bin_income_03  country_bin_income_04  \
 14079                    0.0                    0.0                    1.0   
 2026                     0.0                    0.0                    1.0   
 10955                    0.0                    0.0                    0.0   
 1385                     0.0                    0.0                    0.0   
 7067                     0.0                    0.0                    1.0   
 ...                      ...                    ...                    ...   
 25430                    0.0                    0.0                    1.0   
 14899                    0.0                    0.0                    1.0   
 9236                     0.0                    0.0                    0.0   
 23705                    0.0                    0.0                    1.0   
 18592                    0.0                    0.0                    1.0   
 
        country_bin_income_05  country_bin_income_06  country_bin_income_07  \
 14079                    0.0                    0.0                    0.0   
 2026                     0.0                    0.0                    0.0   
 10955                    0.0                    0.0                    0.0   
 1385                     1.0                    0.0                    0.0   
 7067                     0.0                    0.0                    0.0   
 ...                      ...                    ...                    ...   
 25430                    0.0                    0.0                    0.0   
 14899                    0.0                    0.0                    0.0   
 9236                     0.0                    0.0                    0.0   
 23705                    0.0                    0.0                    0.0   
 18592                    0.0                    0.0                    0.0   
 
        country_bin_income_08  country_bin_income_09  country_bin_income_other  
 14079                    0.0                    0.0                       0.0  
 2026                     0.0                    0.0                       0.0  
 10955                    0.0                    0.0                       1.0  
 1385                     0.0                    0.0                       0.0  
 7067                     0.0                    0.0                       0.0  
 ...                      ...                    ...                       ...  
 25430                    0.0                    0.0                       0.0  
 14899                    0.0                    0.0                       0.0  
 9236                     0.0                    0.0                       1.0  
 23705                    0.0                    0.0                       0.0  
 18592                    0.0                    0.0                       0.0  
 
 [18234 rows x 111 columns],
         age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
 21893  49.0   11.173178           0.0           0.0            60.0   
 24714  41.0   12.399248           0.0           0.0            80.0   
 20725  49.0   12.172340           0.0           0.0            40.0   
 13981  49.0   11.314145           0.0           0.0            40.0   
 25627  31.0   11.674253           0.0           0.0            45.0   
 ...     ...         ...           ...           ...             ...   
 3937   73.0   10.175345           0.0           0.0            30.0   
 23595  20.0   12.354411           0.0           0.0            32.0   
 25500  55.0   11.870810           0.0           0.0            40.0   
 22934  24.0   11.093508           0.0           0.0            30.0   
 18262  28.0   12.160489           0.0        1741.0            52.0   
 
        workclass_?  workclass_Federal-gov  workclass_Local-gov  \
 21893          0.0                    0.0                  0.0   
 24714          0.0                    0.0                  0.0   
 20725          0.0                    0.0                  0.0   
 13981          0.0                    0.0                  0.0   
 25627          0.0                    0.0                  0.0   
 ...            ...                    ...                  ...   
 3937           0.0                    0.0                  0.0   
 23595          0.0                    0.0                  0.0   
 25500          0.0                    0.0                  0.0   
 22934          0.0                    0.0                  0.0   
 18262          0.0                    0.0                  0.0   
 
        workclass_Other  workclass_Private  ...  country_bin_income_01  \
 21893              0.0                1.0  ...                    0.0   
 24714              0.0                1.0  ...                    0.0   
 20725              0.0                1.0  ...                    0.0   
 13981              0.0                1.0  ...                    0.0   
 25627              0.0                1.0  ...                    0.0   
 ...                ...                ...  ...                    ...   
 3937               0.0                1.0  ...                    0.0   
 23595              0.0                1.0  ...                    0.0   
 25500              0.0                1.0  ...                    0.0   
 22934              0.0                1.0  ...                    0.0   
 18262              0.0                1.0  ...                    0.0   
 
        country_bin_income_02  country_bin_income_03  country_bin_income_04  \
 21893                    0.0                    0.0                    1.0   
 24714                    0.0                    0.0                    1.0   
 20725                    0.0                    0.0                    1.0   
 13981                    0.0                    0.0                    1.0   
 25627                    0.0                    0.0                    1.0   
 ...                      ...                    ...                    ...   
 3937                     0.0                    0.0                    1.0   
 23595                    0.0                    0.0                    1.0   
 25500                    0.0                    0.0                    1.0   
 22934                    0.0                    0.0                    1.0   
 18262                    0.0                    0.0                    1.0   
 
        country_bin_income_05  country_bin_income_06  country_bin_income_07  \
 21893                    0.0                    0.0                    0.0   
 24714                    0.0                    0.0                    0.0   
 20725                    0.0                    0.0                    0.0   
 13981                    0.0                    0.0                    0.0   
 25627                    0.0                    0.0                    0.0   
 ...                      ...                    ...                    ...   
 3937                     0.0                    0.0                    0.0   
 23595                    0.0                    0.0                    0.0   
 25500                    0.0                    0.0                    0.0   
 22934                    0.0                    0.0                    0.0   
 18262                    0.0                    0.0                    0.0   
 
        country_bin_income_08  country_bin_income_09  country_bin_income_other  
 21893                    0.0                    0.0                       0.0  
 24714                    0.0                    0.0                       0.0  
 20725                    0.0                    0.0                       0.0  
 13981                    0.0                    0.0                       0.0  
 25627                    0.0                    0.0                       0.0  
 ...                      ...                    ...                       ...  
 3937                     0.0                    0.0                       0.0  
 23595                    0.0                    0.0                       0.0  
 25500                    0.0                    0.0                       0.0  
 22934                    0.0                    0.0                       0.0  
 18262                    0.0                    0.0                       0.0  
 
 [7815 rows x 111 columns],
 14079    0
 2026     0
 10955    1
 1385     1
 7067     0
         ..
 25430    0
 14899    1
 9236     0
 23705    1
 18592    0
 Name: income, Length: 18234, dtype: int64,
 21893    0
 24714    0
 20725    0
 13981    1
 25627    0
         ..
 3937     1
 23595    0
 25500    0
 22934    0
 18262    0
 Name: income, Length: 7815, dtype: int64,
 1234,
 Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=False, features_todrop=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='income',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 numeric_strategy='mean',
                                 target_variable=None)),
                 ('new_levels1',
                  New_Catagorical_Le...
                 ('group', Empty()), ('nonliner', Empty()), ('scaling', Empty()),
                 ('P_transform', Empty()), ('pt_target', Empty()),
                 ('binn', Empty()), ('rem_outliers', Empty()),
                 ('cluster_all', Empty()), ('dummy', Dummify(target='income')),
                 ('fix_perfect', Empty()), ('clean_names', Clean_Colum_Names()),
                 ('feature_select', Empty()), ('fix_multi', Empty()),
                 ('dfs', Empty()), ('pca', Empty())],
          verbose=False),
 [('Classification Setup Config',
                         Description         Value
   0                      session_id          1234
   1                     Target Type        Binary
   2                   Label Encoded    0: 0, 1: 1
   3                   Original Data   (26049, 17)
   4                 Missing Values          False
   5               Numeric Features              5
   6           Categorical Features             11
   7               Ordinal Features          False
   8      High Cardinality Features          False
   9        High Cardinality Method           None
   10                   Sampled Data   (26049, 17)
   11          Transformed Train Set  (18234, 111)
   12           Transformed Test Set   (7815, 111)
   13               Numeric Imputer           mean
   14           Categorical Imputer       constant
   15                     Normalize          False
   16              Normalize Method           None
   17                Transformation          False
   18         Transformation Method           None
   19                           PCA          False
   20                    PCA Method           None
   21                PCA Components           None
   22           Ignore Low Variance          False
   23           Combine Rare Levels          False
   24          Rare Level Threshold           None
   25               Numeric Binning          False
   26               Remove Outliers          False
   27            Outliers Threshold           None
   28      Remove Multicollinearity          False
   29   Multicollinearity Threshold           None
   30                    Clustering          False
   31          Clustering Iteration           None
   32           Polynomial Features          False
   33             Polynomial Degree           None
   34          Trignometry Features          False
   35          Polynomial Threshold           None
   36                Group Features          False
   37             Feature Selection          False
   38  Features Selection Threshold           None
   39           Feature Interaction          False
   40                 Feature Ratio          False
   41         Interaction Threshold           None
   42                  Fix Imbalance         False
   43           Fix Imbalance Method         SMOTE),
  ('X_training Set',
           age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
   14079  56.0   10.366655           0.0           0.0            40.0   
   2026   69.0   12.086016        1848.0           0.0            12.0   
   10955  36.0   12.419400        5178.0           0.0            60.0   
   1385   52.0   11.593906           0.0        1902.0            50.0   
   7067   32.0   12.870491           0.0           0.0            16.0   
   ...     ...         ...           ...           ...             ...   
   25430  29.0   11.693980           0.0           0.0            40.0   
   14899  39.0   11.546902           0.0           0.0            45.0   
   9236   30.0   12.591117           0.0           0.0            50.0   
   23705  59.0   12.834812           0.0           0.0            41.0   
   18592  41.0   12.687850           0.0           0.0            55.0   
   
          workclass_?  workclass_Federal-gov  workclass_Local-gov  \
   14079          0.0                    0.0                  0.0   
   2026           0.0                    0.0                  0.0   
   10955          0.0                    0.0                  0.0   
   1385           0.0                    0.0                  0.0   
   7067           0.0                    0.0                  0.0   
   ...            ...                    ...                  ...   
   25430          0.0                    1.0                  0.0   
   14899          0.0                    0.0                  0.0   
   9236           0.0                    0.0                  0.0   
   23705          1.0                    0.0                  0.0   
   18592          0.0                    0.0                  0.0   
   
          workclass_Other  workclass_Private  ...  country_bin_income_01  \
   14079              0.0                1.0  ...                    0.0   
   2026               0.0                1.0  ...                    0.0   
   10955              0.0                1.0  ...                    0.0   
   1385               0.0                1.0  ...                    0.0   
   7067               0.0                1.0  ...                    0.0   
   ...                ...                ...  ...                    ...   
   25430              0.0                0.0  ...                    0.0   
   14899              0.0                1.0  ...                    0.0   
   9236               0.0                1.0  ...                    0.0   
   23705              0.0                0.0  ...                    0.0   
   18592              0.0                1.0  ...                    0.0   
   
          country_bin_income_02  country_bin_income_03  country_bin_income_04  \
   14079                    0.0                    0.0                    1.0   
   2026                     0.0                    0.0                    1.0   
   10955                    0.0                    0.0                    0.0   
   1385                     0.0                    0.0                    0.0   
   7067                     0.0                    0.0                    1.0   
   ...                      ...                    ...                    ...   
   25430                    0.0                    0.0                    1.0   
   14899                    0.0                    0.0                    1.0   
   9236                     0.0                    0.0                    0.0   
   23705                    0.0                    0.0                    1.0   
   18592                    0.0                    0.0                    1.0   
   
          country_bin_income_05  country_bin_income_06  country_bin_income_07  \
   14079                    0.0                    0.0                    0.0   
   2026                     0.0                    0.0                    0.0   
   10955                    0.0                    0.0                    0.0   
   1385                     1.0                    0.0                    0.0   
   7067                     0.0                    0.0                    0.0   
   ...                      ...                    ...                    ...   
   25430                    0.0                    0.0                    0.0   
   14899                    0.0                    0.0                    0.0   
   9236                     0.0                    0.0                    0.0   
   23705                    0.0                    0.0                    0.0   
   18592                    0.0                    0.0                    0.0   
   
          country_bin_income_08  country_bin_income_09  country_bin_income_other  
   14079                    0.0                    0.0                       0.0  
   2026                     0.0                    0.0                       0.0  
   10955                    0.0                    0.0                       1.0  
   1385                     0.0                    0.0                       0.0  
   7067                     0.0                    0.0                       0.0  
   ...                      ...                    ...                       ...  
   25430                    0.0                    0.0                       0.0  
   14899                    0.0                    0.0                       0.0  
   9236                     0.0                    0.0                       1.0  
   23705                    0.0                    0.0                       0.0  
   18592                    0.0                    0.0                       0.0  
   
   [18234 rows x 111 columns]),
  ('y_training Set',
   14079    0
   2026     0
   10955    1
   1385     1
   7067     0
           ..
   25430    0
   14899    1
   9236     0
   23705    1
   18592    0
   Name: income, Length: 18234, dtype: int64),
  ('X_test Set',
           age  fnlwgt_log  capital_gain  capital_loss  hours_per_week  \
   21893  49.0   11.173178           0.0           0.0            60.0   
   24714  41.0   12.399248           0.0           0.0            80.0   
   20725  49.0   12.172340           0.0           0.0            40.0   
   13981  49.0   11.314145           0.0           0.0            40.0   
   25627  31.0   11.674253           0.0           0.0            45.0   
   ...     ...         ...           ...           ...             ...   
   3937   73.0   10.175345           0.0           0.0            30.0   
   23595  20.0   12.354411           0.0           0.0            32.0   
   25500  55.0   11.870810           0.0           0.0            40.0   
   22934  24.0   11.093508           0.0           0.0            30.0   
   18262  28.0   12.160489           0.0        1741.0            52.0   
   
          workclass_?  workclass_Federal-gov  workclass_Local-gov  \
   21893          0.0                    0.0                  0.0   
   24714          0.0                    0.0                  0.0   
   20725          0.0                    0.0                  0.0   
   13981          0.0                    0.0                  0.0   
   25627          0.0                    0.0                  0.0   
   ...            ...                    ...                  ...   
   3937           0.0                    0.0                  0.0   
   23595          0.0                    0.0                  0.0   
   25500          0.0                    0.0                  0.0   
   22934          0.0                    0.0                  0.0   
   18262          0.0                    0.0                  0.0   
   
          workclass_Other  workclass_Private  ...  country_bin_income_01  \
   21893              0.0                1.0  ...                    0.0   
   24714              0.0                1.0  ...                    0.0   
   20725              0.0                1.0  ...                    0.0   
   13981              0.0                1.0  ...                    0.0   
   25627              0.0                1.0  ...                    0.0   
   ...                ...                ...  ...                    ...   
   3937               0.0                1.0  ...                    0.0   
   23595              0.0                1.0  ...                    0.0   
   25500              0.0                1.0  ...                    0.0   
   22934              0.0                1.0  ...                    0.0   
   18262              0.0                1.0  ...                    0.0   
   
          country_bin_income_02  country_bin_income_03  country_bin_income_04  \
   21893                    0.0                    0.0                    1.0   
   24714                    0.0                    0.0                    1.0   
   20725                    0.0                    0.0                    1.0   
   13981                    0.0                    0.0                    1.0   
   25627                    0.0                    0.0                    1.0   
   ...                      ...                    ...                    ...   
   3937                     0.0                    0.0                    1.0   
   23595                    0.0                    0.0                    1.0   
   25500                    0.0                    0.0                    1.0   
   22934                    0.0                    0.0                    1.0   
   18262                    0.0                    0.0                    1.0   
   
          country_bin_income_05  country_bin_income_06  country_bin_income_07  \
   21893                    0.0                    0.0                    0.0   
   24714                    0.0                    0.0                    0.0   
   20725                    0.0                    0.0                    0.0   
   13981                    0.0                    0.0                    0.0   
   25627                    0.0                    0.0                    0.0   
   ...                      ...                    ...                    ...   
   3937                     0.0                    0.0                    0.0   
   23595                    0.0                    0.0                    0.0   
   25500                    0.0                    0.0                    0.0   
   22934                    0.0                    0.0                    0.0   
   18262                    0.0                    0.0                    0.0   
   
          country_bin_income_08  country_bin_income_09  country_bin_income_other  
   21893                    0.0                    0.0                       0.0  
   24714                    0.0                    0.0                       0.0  
   20725                    0.0                    0.0                       0.0  
   13981                    0.0                    0.0                       0.0  
   25627                    0.0                    0.0                       0.0  
   ...                      ...                    ...                       ...  
   3937                     0.0                    0.0                       0.0  
   23595                    0.0                    0.0                       0.0  
   25500                    0.0                    0.0                       0.0  
   22934                    0.0                    0.0                       0.0  
   18262                    0.0                    0.0                       0.0  
   
   [7815 rows x 111 columns]),
  ('y_test Set',
   21893    0
   24714    0
   20725    0
   13981    1
   25627    0
           ..
   3937     1
   23595    0
   25500    0
   22934    0
   18262    0
   Name: income, Length: 7815, dtype: int64),
  ('Transformation Pipeline',
   Pipeline(memory=None,
            steps=[('dtypes',
                    DataTypes_Auto_infer(categorical_features=[],
                                         display_types=False, features_todrop=[],
                                         ml_usecase='classification',
                                         numerical_features=[], target='income',
                                         time_features=[])),
                   ('imputer',
                    Simple_Imputer(categorical_strategy='not_available',
                                   numeric_strategy='mean',
                                   target_variable=None)),
                   ('new_levels1',
                    New_Catagorical_Le...
                   ('group', Empty()), ('nonliner', Empty()), ('scaling', Empty()),
                   ('P_transform', Empty()), ('pt_target', Empty()),
                   ('binn', Empty()), ('rem_outliers', Empty()),
                   ('cluster_all', Empty()), ('dummy', Dummify(target='income')),
                   ('fix_perfect', Empty()), ('clean_names', Clean_Colum_Names()),
                   ('feature_select', Empty()), ('fix_multi', Empty()),
                   ('dfs', Empty()), ('pca', Empty())],
            verbose=False))],
 False,
 -1,
 True,
 [],
 [],
 [],
 'no_logging',
 False,
 False,
 '87a3',
 False,
 None,
 <_Logger logs (DEBUG)>,
         age         workclass  fnlwgt_log education      marital_status  \
 0      40.0           Private   12.034917   level_4  Married-civ-spouse   
 1      17.0           Private   11.529055   level_2       Never-married   
 2      18.0           Private   12.775237   level_5       Never-married   
 3      21.0           Private   11.926081   level_5       Never-married   
 4      24.0           Private   11.713693   level_5       Never-married   
 ...     ...               ...         ...       ...                 ...   
 26044  57.0           Private   12.430020   level_3  Married-civ-spouse   
 26045  23.0           Private   12.380412   level_7       Never-married   
 26046  78.0                 ?   12.017898   level_8             Widowed   
 26047  26.0  Self-emp-not-inc   11.929172   level_4       Never-married   
 26048  20.0                 ?   11.511835   level_5       Never-married   
 
               occupation   relationship   race     sex  capital_gain  \
 0                  Sales        Husband  White    Male           0.0   
 1      Machine-op-inspct      Own-child  White    Male           0.0   
 2          Other-service      Own-child  White    Male           0.0   
 3         Prof-specialty      Own-child  White  Female           0.0   
 4           Adm-clerical  Not-in-family  Black  Female           0.0   
 ...                  ...            ...    ...     ...           ...   
 26044      Other-service        Husband  White    Male           0.0   
 26045     Prof-specialty      Own-child  White    Male           0.0   
 26046                  ?  Not-in-family  White  Female           0.0   
 26047     Prof-specialty      Own-child  Black  Female           0.0   
 26048                  ?      Own-child  White  Female           0.0   
 
        capital_loss  hours_per_week native_country  capital_net_pos_key  \
 0               0.0            60.0  United-States                False   
 1               0.0            20.0  United-States                False   
 2               0.0            16.0  United-States                False   
 3               0.0            25.0  United-States                False   
 4               0.0            20.0              ?                False   
 ...             ...             ...            ...                  ...   
 26044           0.0            52.0  United-States                False   
 26045           0.0            40.0  United-States                False   
 26046           0.0            15.0  United-States                False   
 26047           0.0            40.0  United-States                False   
 26048           0.0            30.0  United-States                False   
 
        capital_net_neg_key   country_bin  income  
 0                    False     income_04       1  
 1                    False     income_04       0  
 2                    False     income_04       0  
 3                    False     income_04       0  
 4                    False  income_other       0  
 ...                    ...           ...     ...  
 26044                False     income_04       0  
 26045                False     income_04       0  
 26046                False     income_04       0  
 26047                False     income_04       0  
 26048                False     income_04       0  
 
 [26049 rows x 17 columns],
 'income',
 False)

lgbm = create_model('lightgbm')
tuned_lgbm = tune_model(lgbm, optimize='F1')

	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	0.8717	0.9263	0.6553	0.7790	0.7118	0.6301	0.6340
1	0.8668	0.9245	0.6584	0.7598	0.7055	0.6199	0.6226
2	0.8624	0.9205	0.6267	0.7631	0.6882	0.6010	0.6058
3	0.8777	0.9430	0.6516	0.8067	0.7209	0.6438	0.6498
4	0.8738	0.9296	0.6576	0.7859	0.7160	0.6358	0.6399
5	0.8700	0.9255	0.6576	0.7713	0.7099	0.6268	0.6301
6	0.8645	0.9224	0.6440	0.7594	0.6969	0.6104	0.6139
7	0.8722	0.9289	0.6689	0.7723	0.7169	0.6349	0.6376
8	0.8733	0.9334	0.7029	0.7561	0.7286	0.6461	0.6468
9	0.8848	0.9418	0.7143	0.7895	0.7500	0.6754	0.6768
Mean	0.8717	0.9296	0.6637	0.7743	0.7145	0.6324	0.6357
SD	0.0062	0.0073	0.0249	0.0153	0.0162	0.0196	0.0190
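
Because `tune_model` is asked to optimize F1 here, it may help to see how the F1 column relates to the Prec. and Recall columns. The snippet below is a minimal sketch (the helper name `f1_from_precision_recall` is made up for illustration) that recomputes the F1 of fold 0 from the rounded values printed in the table above.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Fold 0 of the tuned LightGBM table: Prec. = 0.7790, Recall = 0.6553
fold0_f1 = f1_from_precision_recall(0.7790, 0.6553)
print(round(fold0_f1, 4))  # agrees with the reported 0.7118 up to rounding
```

Note that the Mean row averages the per-fold F1 scores, so it will generally not equal the F1 computed from the mean precision and mean recall.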

calibrated_lgbm = calibrate_model(tuned_lgbm)

	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	0.8723	0.9264	0.6485	0.7857	0.7106	0.6296	0.6343
1	0.8657	0.9230	0.6448	0.7641	0.6994	0.6137	0.6174
2	0.8569	0.9198	0.6109	0.7521	0.6742	0.5837	0.5889
3	0.8805	0.9452	0.6448	0.8237	0.7234	0.6486	0.6565
4	0.8788	0.9306	0.6644	0.8005	0.7261	0.6492	0.6538
5	0.8727	0.9265	0.6485	0.7879	0.7114	0.6308	0.6357
6	0.8694	0.9232	0.6417	0.7796	0.7040	0.6212	0.6261
7	0.8727	0.9298	0.6599	0.7802	0.7150	0.6338	0.6375
8	0.8793	0.9357	0.6984	0.7797	0.7368	0.6589	0.6605
9	0.8914	0.9440	0.7143	0.8140	0.7609	0.6910	0.6935
Mean	0.8740	0.9304	0.6576	0.7867	0.7162	0.6360	0.6404
SD	0.0089	0.0083	0.0280	0.0204	0.0220	0.0272	0.0267

interpret_model(tuned_lgbm, plot = 'reason', observation = 15)

plot_model(tuned_lgbm)

plot_model(tuned_lgbm, 'threshold')

plot_model(lgbm, 'confusion_matrix')

plot_model(lgbm, 'calibration')

tuned_lgbm_pred = predict_model(tuned_lgbm, data = test_clean)

tuned_lgbm_pred

	age	workclass	fnlwgt_log	education	marital_status	occupation	relationship	race	sex	capital_gain	capital_loss	hours_per_week	native_country	capital_net_pos_key	capital_net_neg_key	country_bin	income	Label	Score
0	28.0	Private	11.122265	level_5	Never-married	Adm-clerical	Other-relative	White	Female	0.0	0.0	40.0	United-States	False	False	income_04	NaN	0	0.0039
1	40.0	Self-emp-inc	10.541888	level_4	Married-civ-spouse	Exec-managerial	Husband	White	Male	0.0	0.0	50.0	United-States	False	False	income_04	NaN	0	0.4239
2	20.0	Private	11.607799	level_5	Never-married	Handlers-cleaners	Own-child	White	Male	0.0	0.0	25.0	United-States	False	False	income_04	NaN	0	0.0004
3	40.0	Private	11.648653	level_6	Married-civ-spouse	Exec-managerial	Husband	White	Male	0.0	0.0	50.0	United-States	False	False	income_04	NaN	1	0.8180
4	37.0	Private	10.844744	level_9	Married-civ-spouse	Prof-specialty	Husband	White	Male	0.0	0.0	99.0	France	False	False	income_08	NaN	1	0.5937
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6507	35.0	Private	11.024236	level_7	Married-civ-spouse	Sales	Husband	White	Male	0.0	0.0	40.0	United-States	False	False	income_04	NaN	1	0.5984
6508	41.0	Self-emp-inc	10.379256	level_7	Married-civ-spouse	Tech-support	Husband	White	Male	0.0	0.0	40.0	United-States	False	False	income_04	NaN	1	0.5455
6509	39.0	Private	12.921932	level_1	Married-civ-spouse	Other-service	Husband	White	Male	0.0	0.0	40.0	Mexico	False	False	income_02	NaN	0	0.0185
6510	35.0	Private	12.102610	level_4	Married-civ-spouse	Craft-repair	Husband	White	Male	0.0	0.0	40.0	United-States	False	False	income_04	NaN	0	0.2055
6511	28.0	Private	11.962848	level_4	Divorced	Handlers-cleaners	Unmarried	White	Female	0.0	0.0	36.0	United-States	False	False	income_04	NaN	0	0.0077

6512 rows × 19 columns

PyCaret (Ensemble)

# Keep the three best models by F1 so they can be blended below
top3_models = compare_models(sort = 'F1', n_select = 3)

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
0	CatBoost Classifier	0.8761	0.9318	0.6583	0.7948	0.7199	0.6413	0.6462	10.5274
1	Extreme Gradient Boosting	0.8739	0.9292	0.6658	0.7816	0.7187	0.6381	0.6418	9.4363
2	Light Gradient Boosting Machine	0.8734	0.9298	0.6640	0.7804	0.7172	0.6363	0.6400	0.4357
3	Ada Boost Classifier	0.8695	0.9253	0.6397	0.7818	0.7034	0.6208	0.6262	1.3897
4	Gradient Boosting Classifier	0.8718	0.9278	0.6103	0.8138	0.6972	0.6181	0.6285	4.1646
5	Naive Bayes	0.8191	0.9007	0.7820	0.5967	0.6767	0.5543	0.5642	0.0331
6	Extra Trees Classifier	0.8490	0.8985	0.6374	0.7092	0.6711	0.5736	0.5751	1.3146
7	Random Forest Classifier	0.8548	0.8910	0.5964	0.7527	0.6651	0.5741	0.5807	0.2215
8	Linear Discriminant Analysis	0.8592	0.9130	0.5756	0.7857	0.6641	0.5778	0.5891	0.2736
9	K Neighbors Classifier	0.8402	0.8684	0.6202	0.6890	0.6524	0.5491	0.5506	0.5477
10	Ridge Classifier	0.8569	0.0000	0.5291	0.8148	0.6412	0.5570	0.5774	0.0510
11	Logistic Regression	0.8429	0.8885	0.5722	0.7212	0.6375	0.5390	0.5453	0.4018
12	Decision Tree Classifier	0.8208	0.7586	0.6381	0.6279	0.6329	0.5144	0.5145	0.2522
13	SVM - Linear Kernel	0.7638	0.0000	0.5630	0.5595	0.5019	0.3649	0.3886	0.3444
14	Quadratic Discriminant Analysis	0.7544	0.6260	0.3191	0.7977	0.3599	0.2539	0.3499	0.1040

# Soft voting: blend the three selected models by averaging their predicted probabilities
blended_model = blend_models(estimator_list = top3_models, fold = 5, method = 'soft')

	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	0.8687	0.9258	0.6497	0.7712	0.7052	0.6215	0.6253
1	0.8676	0.9323	0.6285	0.7817	0.6968	0.6134	0.6193
2	0.8774	0.9320	0.6682	0.7930	0.7253	0.6471	0.6511
3	0.8700	0.9269	0.6433	0.7813	0.7056	0.6232	0.6280
4	0.8845	0.9387	0.7143	0.7885	0.7496	0.6748	0.6762
Mean	0.8736	0.9311	0.6608	0.7831	0.7165	0.6360	0.6400
SD	0.0064	0.0046	0.0296	0.0074	0.0190	0.0224	0.0210
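
To make `method = 'soft'` more concrete: soft voting averages the positive-class probabilities of the individual models and thresholds the average, while hard voting takes a majority over each model's own 0/1 decision. The sketch below uses plain Python and made-up probabilities for three hypothetical models. It illustrates the idea only and is not PyCaret's internal implementation.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model positive-class probabilities, then threshold the mean."""
    n_models = len(prob_lists)
    averaged = [sum(row) / n_models for row in zip(*prob_lists)]
    return [int(p >= threshold) for p in averaged]


def hard_vote(prob_lists, threshold=0.5):
    """Threshold each model first, then take the majority label per row."""
    n_models = len(prob_lists)
    votes = [[int(p >= threshold) for p in probs] for probs in prob_lists]
    return [int(sum(row) * 2 > n_models) for row in zip(*votes)]


# Hypothetical probabilities for four rows (not taken from the competition data)
model_a = [0.90, 0.55, 0.20, 0.60]
model_b = [0.80, 0.55, 0.10, 0.40]
model_c = [0.70, 0.10, 0.30, 0.80]

soft_labels = soft_vote([model_a, model_b, model_c])
hard_labels = hard_vote([model_a, model_b, model_c])
print(soft_labels)  # [1, 0, 0, 1]
print(hard_labels)  # [1, 1, 0, 1]
```

The second row shows the difference: two models are barely above 0.5 and one is very confident in the negative class, so the averaged probability (0.4) falls below the threshold even though the majority votes positive.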

final_model = finalize_model(blended_model)

ensemble_prediction = predict_model(final_model, data = test_clean)

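# Judging by the prediction table above, 'Score' is the predicted probability of the positive class (>50K)
THRESHOLD = 0.5

# Convert probabilities to 0/1 labels in one step, without mutating ensemble_prediction in place
ensemble_pred = (ensemble_prediction['Score'] >= THRESHOLD).astype(int)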
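
The cutoff of 0.5 is a default rather than a tuned value. Since the models were ranked by F1, one option is to compare F1 at a few candidate thresholds on held-out data with known labels (never on the test set). The following is a minimal sketch in plain Python. The labels and scores are invented for illustration, and `f1_at_threshold` is a hypothetical helper, not a PyCaret function.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 = 2*TP / (2*TP + FP + FN) after labelling scores >= threshold as 1."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Invented validation labels and positive-class scores
y_valid = [1, 0, 1, 1, 0, 0, 1, 0]
valid_scores = [0.82, 0.42, 0.45, 0.60, 0.10, 0.55, 0.35, 0.02]

candidates = (0.3, 0.4, 0.5, 0.6)
f1_by_threshold = {t: f1_at_threshold(y_valid, valid_scores, t) for t in candidates}
best_threshold = max(candidates, key=f1_by_threshold.get)

for t in candidates:
    print(t, round(f1_by_threshold[t], 3))
print('best:', best_threshold)
```

On this toy data the lowest candidate wins because every positive is recovered at the cost of two false positives. On real data the curve from `plot_model(tuned_lgbm, 'threshold')` gives the same kind of picture.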

plt.figure(figsize=(10, 6))
plt.subplot(121)
sns.countplot(x=ensemble_pred)  # predicted label distribution on the test set

plt.subplot(122)
sns.countplot(x=train['income'])  # label distribution in the training set
plt.show()

Make Submission

submission = pd.read_csv(os.path.join(DIR, 'sample_submission.csv'))

submission.head()

	id	prediction
0	0	0
1	1	0
2	2	0
3	3	0
4	4	0

submission['prediction'] = ensemble_pred
submission['prediction'] = submission['prediction'].astype('int')
submission['prediction'].value_counts()

0    5225
1    1287
Name: prediction, dtype: int64

import datetime

timestring = datetime.datetime.now().strftime('%m-%d-%H-%M-%S')
filename = f'kakr-submission-{timestring}.csv'
filename

'kakr-submission-10-20-14-47-37.csv'

submission.to_csv(filename, index=False)
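
Before uploading, a quick sanity check of the written file can catch format mistakes. The sketch below is one possible check, using only the standard library. The function name `validate_submission` is hypothetical, and the expected layout (columns `id` and `prediction`, values 0 or 1) is taken from the `sample_submission.csv` preview above.

```python
import csv
import io


def validate_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty list = looks fine)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ['id', 'prediction']:
        return [f'unexpected columns: {reader.fieldnames}']

    rows = list(reader)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(rows)}')

    bad_ids = [row['id'] for row in rows if row['prediction'] not in ('0', '1')]
    if bad_ids:
        problems.append(f'non-binary predictions for ids: {bad_ids}')
    return problems


# Tiny in-memory examples; for the real file, read its text and pass expected_rows=6512
good = 'id,prediction\n0,0\n1,1\n2,0\n'
bad = 'id,prediction\n0,0.7\n'
print(validate_submission(good, expected_rows=3))
print(validate_submission(bad, expected_rows=1))
```

For the actual submission, something like `validate_submission(open(filename).read(), expected_rows=6512)` would apply the same checks to the file written above.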
