Importing modules

from IPython.display import Image
import numpy as np
import pandas as pd
import seaborn as sns

Loading the dataset

df = sns.load_dataset('titanic')
df.head()

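Column descriptions

survived: survival (1: survived, 0: died)
pclass: ticket class (1st, 2nd, 3rd)
sex: sex
age: age
sibsp: number of siblings and spouses aboard
parch: number of parents and children aboard
fare: ticket fare
embarked: port of embarkation (S, C, Q)
class: same as pclass, as a text label (First, Second, Third)
who: man, woman, or child
adult_male: whether the passenger is an adult male
deck: deck letter
embark_town: name of the port of embarkation
alive: survival (yes, no)
alone: whether the passenger travelled alone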

Adding a new column

df1 = df.copy()

df1.head()

Assigning a value to a new column name adds that column. A single value is copied to every row.

df1['VIP'] = True

df1.head()
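As a minimal sketch of the same idea on a hand-built frame (the `toy` frame and the `cabin_class` column are made up for illustration), a scalar is broadcast to every row, while a list has to supply exactly one value per row:

```python
import pandas as pd

# Hypothetical three-row frame, used only for this illustration.
toy = pd.DataFrame({'fare': [7.25, 71.28, 7.92]})

# A scalar is broadcast to every row.
toy['VIP'] = True

# A list (or array) must hold exactly one value per row.
toy['cabin_class'] = ['third', 'first', 'third']

print(toy)
```

Assigning a list whose length differs from the number of rows raises a ValueError.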

Deleting

Deletion comes in two forms: deleting rows and deleting columns.

Deleting rows

To delete a row, pass its index label to drop().

df1.drop(1)

A range of rows can be deleted by slicing the index.

df1.drop(df1.index[0:10])

Fancy indexing also works for picking the rows to delete.

df1.drop(df1.index[[1, 3, 5, 7, 9]])
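One detail worth spelling out is that drop() works on index labels, not positions, and returns a new object. The sketch below uses a made-up frame with a non-default index (`toy`, labels 10 to 40) so the difference is visible:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
toy = pd.DataFrame({'age': [22, 38, 26, 35]}, index=[10, 20, 30, 40])

# drop() takes index labels, so this removes the row labelled 20.
by_label = toy.drop(20)

# To drop by position, look the labels up through .index first.
by_position = toy.drop(toy.index[[0, 1]])

# The original frame is unchanged because drop() returns a new object.
print(len(toy), len(by_label), len(by_position))  # 4 3 2
```

With the default RangeIndex used in this notebook, labels and positions happen to coincide, which is why df1.drop(1) removes the second row.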

Deleting columns

df1.head()

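To delete a column, tell drop() to work on columns: either pass axis=1 or use the columns= keyword. Passing the axis positionally, as in df1.drop('class', 1), was deprecated in pandas 1.3 and no longer works in pandas 2.0 and later.

df1.drop('class', axis=1).head()

df1.drop(columns='class').head()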

Several columns can be deleted at once.

df1.drop(['who', 'deck', 'alive'], axis=1)

drop() returns a new DataFrame by default. To apply the deletion to the DataFrame itself, specify inplace=True.

df1.drop(['who', 'deck', 'alive'], axis=1, inplace=True)

df1.head()
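An alternative to inplace=True is to assign the result back to the same name. The sketch below shows that pattern on a made-up frame (`toy` and its values are hypothetical), together with the errors='ignore' option of drop() for labels that may already be gone:

```python
import pandas as pd

# Hypothetical frame with a few of the Titanic column names.
toy = pd.DataFrame({
    'who': ['man', 'woman'],
    'deck': ['C', None],
    'fare': [7.25, 71.28],
})

# columns= is equivalent to passing the labels with axis=1.
trimmed = toy.drop(columns=['who', 'deck'])

# Reassignment has the same effect as inplace=True.
toy = toy.drop(columns=['who', 'deck'])

# errors='ignore' skips labels that no longer exist instead of raising KeyError.
toy = toy.drop(columns=['who'], errors='ignore')

print(list(toy.columns))  # ['fare']
```

Reassignment keeps method chains possible, which is one reason some codebases prefer it over inplace=True.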

Operations between columns

Arithmetic between one column and another is easy to apply.

df1 = df.copy()

The total family size is the sum of the sibsp and parch columns.

df1['family'] = df1['sibsp'] + df1['parch']

df1.head()

String concatenation also works.

df1['gender'] = df1['who'] + '-' + df1['sex']

df1.head()

round() sets the number of decimal places of the result.

round(number, number of decimal places)

df1['round'] = round(df1['fare'] / df1['age'], 2)

df1.head()

If even one of the columns involved holds NaN in a row, the result for that row is NaN.

df1.loc[df1['age'].isnull(), 'deck':].head()
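The NaN propagation can be reproduced on a tiny made-up frame. The sketch below also shows one possible workaround, filling the gaps before the calculation; using the median here is an arbitrary choice for illustration, not a recommendation from the original text:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second passenger has no recorded age.
toy = pd.DataFrame({'fare': [10.0, 20.0, 30.0], 'age': [20.0, np.nan, 60.0]})

# Plain operators propagate NaN, so the second ratio is NaN.
ratio = (toy['fare'] / toy['age']).round(2)

# One possible workaround: fill the missing ages first (median chosen arbitrarily).
filled_age = toy['age'].fillna(toy['age'].median())
filled = (toy['fare'] / filled_age).round(2)

print(ratio.tolist())
print(filled.tolist())
```

Whether filling is appropriate depends on the analysis; leaving the NaN in place is often the more honest result.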

Type conversion (astype)

df1 = df.copy()

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB

Converting to int32

df1['pclass'].astype('int32').head()

0    3
1    1
2    3
3    1
4    3
Name: pclass, dtype: int32

Converting to float32

df1['pclass'].astype('float32').head()

0    3.0
1    1.0
2    3.0
3    1.0
4    3.0
Name: pclass, dtype: float32

Converting to object (str)

df1['pclass'].astype('str').head()

0    3
1    1
2    3
3    1
4    3
Name: pclass, dtype: object
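pclass has no missing values, so every conversion above succeeds. A column with gaps, such as age, behaves differently when cast to an integer type. The sketch below uses a made-up `ages` Series and assumes a pandas version that provides the nullable 'Int64' dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
ages = pd.Series([22.0, np.nan, 35.0], name='age')

try:
    ages.astype('int64')      # NaN has no int64 representation
    converted = True
except ValueError:
    converted = False

# The nullable 'Int64' dtype (capital I) keeps the gap as <NA>.
nullable = ages.astype('Int64')

print(converted)
print(nullable)
```

The nullable dtype only accepts floats that are whole numbers, so fractional ages would need rounding first.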

Converting to category

After converting to category, the output also lists the Categories.

df1['who'].value_counts()

man      537
woman    271
child     83
Name: who, dtype: int64

df1['who'].dtype

dtype('O')

df1['who'].astype('category').head()

0      man
1    woman
2    woman
3    woman
4      man
Name: who, dtype: category
Categories (3, object): [child, man, woman]

Once a column has been converted to category, the .cat accessor exposes the attributes that the category type provides.

df1['who'] = df1['who'].astype('category')

df1['who'].dtype

CategoricalDtype(categories=['child', 'man', 'woman'], ordered=False)

df1['who'].cat.codes

0      1
1      2
2      2
3      2
4      1
      ..
886    1
887    2
888    2
889    1
890    1
Length: 891, dtype: int8
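The codes are positions in the category list, which is sorted when the categories are inferred from strings. A small sketch on a made-up Series (`who` here is a four-element stand-in, not the Titanic column) shows the round trip from codes back to labels:

```python
import pandas as pd

# Hypothetical stand-in for the 'who' column.
who = pd.Series(['man', 'woman', 'child', 'man']).astype('category')

# Inferred categories are sorted, so the codes follow alphabetical order.
print(list(who.cat.categories))   # ['child', 'man', 'woman']
print(who.cat.codes.tolist())     # [1, 2, 0, 1]

# Indexing the categories with the codes recovers the original labels.
restored = who.cat.categories[who.cat.codes.to_numpy()]
print(list(restored))
```

This matches the output above, where man maps to 1 and woman to 2.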

Renaming categories

["Group (%s)" % g for g in df1['who'].cat.categories]

['Group (child)', 'Group (man)', 'Group (woman)']

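# Assigning to .cat.categories no longer works in pandas 2.0+; rename_categories() returns the renamed column.
df1['who'] = df1['who'].cat.rename_categories(["Group (%s)" % g for g in df1['who'].cat.categories])
df1['who'].value_counts()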

Group (man)      537
Group (woman)    271
Group (child)     83
Name: who, dtype: int64
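Categories can also be renamed selectively. As a sketch on a made-up Series (the new labels are invented for illustration), rename_categories() accepts a dict and leaves unmapped categories untouched:

```python
import pandas as pd

# Hypothetical stand-in for the 'who' column.
who = pd.Series(['man', 'woman', 'child', 'man'], dtype='category')

# Only the keys in the dict are renamed; 'child' passes through unchanged.
renamed = who.cat.rename_categories({'man': 'adult male', 'woman': 'adult female'})

print(renamed.tolist())
print(list(renamed.cat.categories))
```

Only the labels change; the underlying codes stay the same.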

datetime - dates and times

date_range

Main options

start: start date
end: end date
periods: number of entries to generate
freq: frequency

dates = pd.date_range('20210101', periods=df.shape[0], freq='15H')
dates.shape

(891,)
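The four options can be combined in different ways. The sketch below, with dates picked only for illustration, shows three common combinations of pd.date_range():

```python
import pandas as pd

# start + periods + freq: seven consecutive days.
daily = pd.date_range(start='2021-01-01', periods=7, freq='D')

# start + end + freq: the number of entries follows from the range.
weekly = pd.date_range(start='2021-01-01', end='2021-01-31', freq='7D')

# start + end + periods: evenly spaced timestamps, no freq needed.
spaced = pd.date_range(start='2021-01-01', end='2021-01-02', periods=5)

print(len(daily), len(weekly), len(spaced))  # 7 5 5
print(spaced[1])                             # 2021-01-01 06:00:00
```

Three of the four options fully determine the result.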

df1 = df.copy()
df1.head()

Create a date column and assign the generated dates to it.

df1['date'] = dates
df1.head()

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
date           891 non-null datetime64[ns]
dtypes: bool(2), category(2), datetime64[ns](1), float64(2), int64(4), object(5)
memory usage: 87.6+ KB

The date column is listed with the datetime64 dtype.

The datetime type

On a datetime column, the dt accessor gives easy access to date attributes.

The date-related attributes of the pandas dt (datetime) accessor, as named in the documentation:

pandas.Series.dt.year: year
pandas.Series.dt.month: month
pandas.Series.dt.day: day
pandas.Series.dt.hour: hour
pandas.Series.dt.minute: minute
pandas.Series.dt.second: second
pandas.Series.dt.microsecond: microsecond
pandas.Series.dt.nanosecond: nanosecond
pandas.Series.dt.week: week (removed in pandas 2.0; use dt.isocalendar().week)
pandas.Series.dt.weekofyear: week of the year (removed in pandas 2.0; use dt.isocalendar().week)
pandas.Series.dt.dayofweek: day of the week
pandas.Series.dt.weekday: day of the week (same as dayofweek)
pandas.Series.dt.dayofyear: day of the year
pandas.Series.dt.quarter: quarter

# year
df1['date'].dt.year.head()

0    2021
1    2021
2    2021
3    2021
4    2021
Name: date, dtype: int64

# month
df1['date'].dt.month.head()

0    1
1    1
2    1
3    1
4    1
Name: date, dtype: int64

# day
df1['date'].dt.day.head()

0    1
1    1
2    2
3    2
4    3
Name: date, dtype: int64

dayofweek encodes the day of the week as a number.

Monday: 0, Sunday: 6

df1['date'].dt.dayofweek.head(10)

0    4
1    4
2    5
3    5
4    6
5    0
6    0
7    1
8    2
9    2
Name: date, dtype: int64
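The numeric codes are handy for filtering, and the same accessor can also return day names and ISO week numbers. The following is a minimal sketch on three hand-picked dates, assuming pandas 1.1 or later for dt.isocalendar():

```python
import pandas as pd

# Three hand-picked dates: a Friday, a Monday and a Sunday.
dates = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-04', '2021-01-10']))

day_names = dates.dt.day_name()          # English names by default
is_weekend = dates.dt.dayofweek >= 5     # Saturday=5, Sunday=6
iso_week = dates.dt.isocalendar().week   # ISO 8601 week number

print(day_names.tolist())    # ['Friday', 'Monday', 'Sunday']
print(is_weekend.tolist())   # [False, False, True]
print(iso_week.tolist())     # [53, 1, 1]
```

2021-01-01 falls in ISO week 53 because ISO weeks start on Monday and that Friday still belongs to the last week of 2020.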

to_datetime

# show floats with two decimal places instead of e-notation
pd.options.display.float_format = '{:.2f}'.format

Load a sample of the Seoul public bicycle data.

df2 = pd.read_csv('http://bit.ly/seoul_bicycle')
df2.head()

df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327231 entries, 0 to 327230
Data columns (total 11 columns):
대여일자      327231 non-null object
대여소번호     327231 non-null int64
대여소명      327231 non-null object
대여구분코드    327231 non-null object
성별        272841 non-null object
연령대코드     327231 non-null object
이용건수      327231 non-null int64
운동량       327231 non-null object
탄소량       327231 non-null object
이동거리      327231 non-null float64
이용시간      327231 non-null int64
dtypes: float64(1), int64(3), object(7)
memory usage: 27.5+ MB

The 대여일자 (rental date) column looks like a date column, but info() reports it as object.

It has to be converted to a datetime type before the .dt accessor can be used.

pd.to_datetime(): converts to a datetime type.

pd.to_datetime(df2['대여일자'])

0        2020-01-20
1        2020-01-20
2        2020-01-20
3        2020-01-20
4        2020-01-20
            ...    
327226   2020-05-20
327227   2020-05-20
327228   2020-05-20
327229   2020-05-20
327230   2020-05-20
Name: 대여일자, Length: 327231, dtype: datetime64[ns]

Assign the result back to the column to apply it.

df2['대여일자'] = pd.to_datetime(df2['대여일자'])

df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327231 entries, 0 to 327230
Data columns (total 11 columns):
대여일자      327231 non-null datetime64[ns]
대여소번호     327231 non-null int64
대여소명      327231 non-null object
대여구분코드    327231 non-null object
성별        272841 non-null object
연령대코드     327231 non-null object
이용건수      327231 non-null int64
운동량       327231 non-null object
탄소량       327231 non-null object
이동거리      327231 non-null float64
이용시간      327231 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(3), object(6)
memory usage: 27.5+ MB
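When the layout of the date strings is known, it can be passed explicitly, and unparseable entries can be turned into NaT instead of raising. The sketch below uses a made-up Series; the '%Y-%m-%d' format is an assumption that matches the dates printed above, not something confirmed from the raw CSV:

```python
import pandas as pd

# Hypothetical raw values, including one entry that is not a date.
raw = pd.Series(['2020-01-20', '2020-05-20', 'not a date'])

# An explicit format avoids guessing; errors='coerce' turns bad entries into NaT.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().tolist())   # [False, False, True]
```

Checking isna() after a coerced conversion shows how many rows failed to parse.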

After the conversion, the .dt accessor gives access to the datetime attributes.

df2['대여일자'].dt.dayofweek

0         0
1         0
2         0
3         0
4         0
         ..
327226    2
327227    2
327228    2
327229    2
327230    2
Name: 대여일자, Length: 327231, dtype: int64

df2['대여일자'].dt.weekday

0         0
1         0
2         0
3         0
4         0
         ..
327226    2
327227    2
327228    2
327229    2
327230    2
Name: 대여일자, Length: 327231, dtype: int64

pd.to_numeric() - numeric conversion

Use it to convert an object column, or any other non-numeric column, to a numeric column.

df2.head()

df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327231 entries, 0 to 327230
Data columns (total 11 columns):
대여일자      327231 non-null datetime64[ns]
대여소번호     327231 non-null int64
대여소명      327231 non-null object
대여구분코드    327231 non-null object
성별        272841 non-null object
연령대코드     327231 non-null object
이용건수      327231 non-null int64
운동량       327231 non-null object
탄소량       327231 non-null object
이동거리      327231 non-null float64
이용시간      327231 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(3), object(6)
memory usage: 27.5+ MB

The 운동량 (exercise amount) column looks numeric, but its dtype is object.

This happens fairly often, and there is always a reason behind it.

To find the cause, start by attempting the conversion with pd.to_numeric().

pd.to_numeric(df2['운동량'])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "\N"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-172-39aff29f75a7> in <module>
----> 1 pd.to_numeric(df2['운동량'])

/opt/conda/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    149             coerce_numeric = errors not in ("ignore", "raise")
    150             values = lib.maybe_convert_numeric(
--> 151                 values, set(), coerce_numeric=coerce_numeric
    152             )
    153 

pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "\N" at position 2344

The error points at position 2344.

df2.loc[2344]

대여일자      2020-01-20 00:00:00
대여소번호                     165
대여소명              165. 중앙근린공원
대여구분코드                 일일(회원)
성별                         \N
연령대코드                 AGE_003
이용건수                        1
운동량                        \N
탄소량                        \N
이동거리                     0.00
이용시간                       40
Name: 2344, dtype: object

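The error occurs because 운동량 holds the literal string \N instead of a number. This is a backslash followed by the letter N, not a newline, and it looks like a missing-value marker left over from the data export.

The conversion fails whenever the column contains a string that cannot be parsed as a number.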
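Position 2344 is only the first failure. To list every offending value at once, one option is to compare the raw column with a coerced conversion. The sketch below runs on a made-up Series rather than on df2:

```python
import pandas as pd

# Hypothetical raw column: numbers as text, one '\N' marker, one real gap.
raw = pd.Series(['61.82', '39.62', '\\N', '1.79', None])

converted = pd.to_numeric(raw, errors='coerce')

# Entries that were present in the raw data but failed to parse.
bad_mask = converted.isna() & raw.notna()

print(raw[bad_mask].unique())
print(int(bad_mask.sum()))
```

Values that were already missing are excluded from the mask, so only genuinely unparseable strings are reported.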

Changing the errors= option resolves this.

errors : {'ignore', 'raise', 'coerce'}, default 'raise'

- If 'raise', then invalid parsing will raise an exception
- If 'coerce', then invalid parsing will be set as NaN
- If 'ignore', then invalid parsing will return the input

With errors='coerce', strings that cannot be parsed are replaced with NaN and the rest is converted.

The result shows that the conversion went through.

pd.to_numeric(df2['운동량'], errors='coerce')

0          61.82
1          39.62
2         430.85
3           1.79
4        4501.96
           ...  
327226    689.57
327227      0.00
327228     19.96
327229     43.77
327230   4735.63
Name: 운동량, Length: 327231, dtype: float64

pd.to_numeric(df2['운동량'], errors='coerce').loc[2344]

nan

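With errors='ignore', a string that cannot be converted is left as is, so the input is returned unchanged and the dtype of the whole column stays object. Note that errors='ignore' is deprecated as of pandas 2.2.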

pd.to_numeric(df2['운동량'], errors='ignore')

0           61.82
1           39.62
2          430.85
3            1.79
4         4501.96
           ...   
327226     689.57
327227          0
327228      19.96
327229      43.77
327230    4735.63
Name: 운동량, Length: 327231, dtype: object

pd.to_numeric(df2['운동량'], errors='ignore').loc[2344]

'\\N'

The change only takes effect in the DataFrame once the result is assigned back.

df2['운동량'] = pd.to_numeric(df2['운동량'], errors='coerce')

df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327231 entries, 0 to 327230
Data columns (total 11 columns):
대여일자      327231 non-null datetime64[ns]
대여소번호     327231 non-null int64
대여소명      327231 non-null object
대여구분코드    327231 non-null object
성별        272841 non-null object
연령대코드     327231 non-null object
이용건수      327231 non-null int64
운동량       326830 non-null float64
탄소량       327231 non-null object
이동거리      327231 non-null float64
이용시간      327231 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(3), object(5)
memory usage: 27.5+ MB
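If the marker is known before loading, it can also be handled at read time. This is a sketch with an in-memory CSV; the column names `station` and `calories` are invented, and applying the same na_values argument to the bicycle file is an untested assumption:

```python
import io
import pandas as pd

# Hypothetical CSV text in which '\N' marks a missing value.
csv_text = "station,calories\n101,61.82\n102,\\N\n103,430.85\n"

# na_values tells read_csv to treat the marker as NaN while parsing.
frame = pd.read_csv(io.StringIO(csv_text), na_values=['\\N'])

print(frame.dtypes)
print(frame)
```

With the marker declared up front, the column is parsed as float64 directly and no later pd.to_numeric() call is needed.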

pd.cut() - binning

Use it to split continuous values into intervals and turn them into categories.

df2.head()

df2.describe()

운동량 has a very wide range: the minimum is 0, while the maximum is 163936052.30.

Suppose the goal is to split the data into 10 groups based on 운동량.

pd.cut() makes it easy to form the groups.

df2.head()

Set the bins option to the number of intervals to create.

df2['운동량_cut'] = pd.cut(df2['운동량'], bins=10)

df2['운동량_cut'].value_counts()

(-163936.052, 16393605.23]      326816
(98361631.38, 114755236.61]          9
(32787210.46, 49180815.69]           2
(147542447.07, 163936052.3]          1
(114755236.61, 131148841.84]         1
(16393605.23, 32787210.46]           1
(131148841.84, 147542447.07]         0
(81968026.15, 98361631.38]           0
(65574420.92, 81968026.15]           0
(49180815.69, 65574420.92]           0
Name: 운동량_cut, dtype: int64

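The distribution shows that almost all of the data falls into the first interval, so this is clearly not a useful binning.

pd.cut() splits the range from the minimum to the maximum into the given number of equal-width bins, which is why this can happen.

That is fine for evenly distributed data, but it gives poor results when there are outliers.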
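Besides a bin count, pd.cut() also accepts explicit bin edges, which is one way to keep outliers from stretching every bin. In the sketch below the edges and labels are arbitrary choices for illustration, not values taken from the original analysis:

```python
import pandas as pd

# Hypothetical values with one extreme outlier.
calories = pd.Series([0.0, 12.5, 250.0, 4800.0, 163936052.3])

# Arbitrary, hand-picked edges and labels (assumptions for this example).
edges = [0, 100, 1000, 10000, float('inf')]
labels = ['low', 'medium', 'high', 'extreme']

# include_lowest=True keeps the value 0 inside the first bin.
binned = pd.cut(calories, bins=edges, labels=labels, include_lowest=True)

print(binned.tolist())   # ['low', 'low', 'medium', 'high', 'extreme']
```

Choosing the edges by hand requires some knowledge of the data, which is the trade-off compared with passing a plain bin count.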

pd.qcut() - bins with equal counts

Similar to pd.cut(), but it splits on quantiles, so each bin holds as close to the same number of rows as possible.

df2['운동량_qcut'] = pd.qcut(df2['운동량'], q=10)

df2['운동량_qcut'].value_counts()

(93.414, 192.02]           32690
(6805.188, 163936052.3]    32683
(3328.186, 6805.188]       32683
(1889.606, 3328.186]       32683
(1079.744, 1889.606]       32683
(601.705, 1079.744]        32683
(24.737, 93.414]           32683
(-0.001, 24.737]           32683
(344.45, 601.705]          32680
(192.02, 344.45]           32679
Name: 운동량_qcut, dtype: int64

The rows are now split evenly across the bins. The bin widths, however, are not uniform.
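The difference between the two functions shows up even on ten numbers. The following sketch uses a made-up Series with a single outlier and compares the counts per bin:

```python
import pandas as pd

# Hypothetical data: nine small values and one outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Equal-width bins: the outlier stretches the range, so one bin takes almost everything.
equal_width = pd.cut(values, bins=2).value_counts().sort_index()

# Quantile bins: the split point is the median, so both bins get five values.
equal_count = pd.qcut(values, q=2).value_counts().sort_index()

print(equal_width.tolist())   # [9, 1]
print(equal_count.tolist())   # [5, 5]
```

The quantile bins are balanced in count, but the second one spans from 5.5 to 1000, which mirrors the uneven widths seen in the output above.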

Output of df2.describe():

        대여소번호       이용건수           운동량         이동거리       이용시간
count  327231.00  327231.00     326830.00    327231.00  327231.00
mean     1288.41      23.62       6921.37    106881.09     752.81
std      1012.65      59.92     656482.34    463495.54    2647.38
min         3.00       1.00          0.00         0.00       0.00
25%       562.00       2.00        138.05      5290.00      66.00
50%      1204.00       6.00        601.71     22900.00     207.00
75%      1933.00      22.00       2481.17     93460.00     670.00
max     99999.00    7451.00  163936052.30  56709052.94  458960.00

#06-Pandas: data preprocessing, adding, deleting, and data type conversion