Module imports

from IPython.display import Image
import numpy as np
import pandas as pd
import seaborn as sns

Dataset used in this exercise

Titanic: passenger death and survival data (from the seaborn datasets)

Image('https://static1.squarespace.com/static/5006453fe4b09ef2252ba068/t/5090b249e4b047ba54dfd258/1351660113175/TItanic-Survival-Infographic.jpg')

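The largest passenger ship in the world when it was built, this ill-fated liner struck an iceberg and sank on its first and last voyage in 1912. It is probably the most famous passenger ship in the world, and more than 100 years after it sank it is still the best-known shipwreck.

Its death toll is not the highest on record, but it became the most famous shipwreck for several reasons: a world-famous film, the shock it dealt to a society with high expectations of then cutting-edge technology, and the fact that it was a catastrophe early in the modern era that claimed several well-known figures. A number of safety treaties were also introduced in its wake, which adds to its fame.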

df = sns.load_dataset("titanic")
df.head()

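Column descriptions

survived: survival status (1: survived, 0: died)
pclass: ticket class (1st, 2nd, 3rd)
sex: sex
age: age
sibsp: number of siblings and spouses aboard
parch: number of parents and children aboard
fare: ticket fare
embarked: port of embarkation (S, C, Q)
class: same as pclass, as a label (First, Second, Third)
who: man, woman, or child
adult_male: whether the passenger is an adult male
deck: deck letter (A to G)
embark_town: name of the port of embarkation
alive: survival status (yes, no)
alone: whether the passenger traveled alone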

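Data analysis!

Main goals

Use Pandas to analyze the survivor and casualty data from the Titanic.
Based on the data, determine which passengers had high survival rates and which had low ones.

head() first rows / tail() last rows

By default, 5 rows are returned.
Pass a number in the parentheses to specify explicitly how many rows to return.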

df.head()

df.tail()

df.head(10)

df.tail(10)

info()

Shows per-column information.
Use it to check the number of non-null values and the data type (dtype) of each column.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB

You can simply think of the object type as a string.

There is also a category type. A category column holds values drawn from a limited set of labels, such as 'male' / 'female'. It is covered separately later.
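As a rough illustration of the difference, the sketch below converts a small string column to category and inspects what is stored. The toy Series is a hypothetical stand-in for a column such as class, not the actual Titanic data.

```python
import pandas as pd

# Hypothetical toy column standing in for a label column such as 'class'.
s_obj = pd.Series(["First", "Third", "Third", "Second", "Third"])
s_cat = s_obj.astype("category")

print(s_obj.dtype)                  # string-like dtype (object on many pandas versions)
print(s_cat.dtype)                  # category
print(list(s_cat.cat.categories))   # the distinct labels
print(list(s_cat.cat.codes))        # the integer code stored for each row
```

A category column stores each distinct label once and keeps a small integer code per row, which is why it suits columns with few distinct values.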

describe()

Provides summary statistics for each column.
By default, it shows statistics for numerical columns.

df.describe()

describe() can also be applied to string (object) columns.

As shown below, passing include='object' gives summary statistics for those columns.

df.describe(include='object')

value_counts()

Use it to check the distribution of values in a column.

To check the distribution of men, women, and children, run the following.

df['who'].value_counts()

man      537
woman    271
child     83
Name: who, dtype: int64
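If proportions are more useful than raw counts, value_counts also accepts normalize=True. The sketch below uses a small hypothetical Series rather than the Titanic data, so the numbers are only illustrative.

```python
import pandas as pd

# Hypothetical toy data, not the real 'who' column.
who = pd.Series(["man", "woman", "man", "child", "man", "woman"], name="who")

counts = who.value_counts()                 # absolute counts, most frequent first
ratios = who.value_counts(normalize=True)   # same order, as fractions summing to 1

print(counts)
print(ratios)
```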

Exercise

embark_town is the column for each passenger's port of embarkation. Check the distribution of passengers by port of embarkation.

# Enter your code here
df['embark_town'].value_counts()

Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64

Attributes

Attributes are accessed without call parentheses.

Commonly used DataFrame attributes are:

ndim
shape
index
columns
values
T

ndim gives the number of dimensions. For a DataFrame it is 2.

df.ndim

2

shape is returned in (rows, columns) order.

df.shape

(891, 15)

index shows the default RangeIndex.

df.index

RangeIndex(start=0, stop=891, step=1)

columns shows the column labels.

df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

values returns all the values as a numpy array.

df.values

array([[0, 3, 'male', ..., 'Southampton', 'no', False],
       [1, 1, 'female', ..., 'Cherbourg', 'yes', False],
       [1, 3, 'female', ..., 'Southampton', 'yes', True],
       ...,
       [0, 3, 'female', ..., 'Southampton', 'no', False],
       [1, 1, 'male', ..., 'Cherbourg', 'yes', True],
       [0, 3, 'male', ..., 'Queenstown', 'no', True]], dtype=object)

T: the transpose swaps the index and column axes.

df.T
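The attributes above can be tried on a small frame without loading the dataset. This is a minimal sketch; the toy DataFrame is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical three-row frame used only to show the attributes.
toy = pd.DataFrame({"name": ["kim", "lee", "park"], "age": [24, 27, 34]})

print(toy.ndim)           # number of dimensions
print(toy.shape)          # (rows, columns)
print(list(toy.columns))  # column labels
print(toy.T.shape)        # the transpose swaps rows and columns
print(list(toy.T.index))  # the old column labels become the index
```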

Type conversion (astype)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB

Convert to int32

df['pclass'].astype('int32').head()

0    3
1    1
2    3
3    1
4    3
Name: pclass, dtype: int32

Convert to float32

df['pclass'].astype('float32').head()

0    3.0
1    1.0
2    3.0
3    1.0
4    3.0
Name: pclass, dtype: float32

Convert to object (string)

df['pclass'].astype('str').head()

0    3
1    1
2    3
3    1
4    3
Name: pclass, dtype: object

Convert to category.

When converting to category, the Categories are printed as well.

df['pclass'].astype('category').head()

0    3
1    1
2    3
3    1
4    3
Name: pclass, dtype: category
Categories (3, int64): [1, 2, 3]
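One caveat worth sketching: a column with missing values, like age in the info() output above, cannot be cast directly to a plain integer type. The example below uses a hypothetical toy Series and the nullable 'Int64' dtype as one possible workaround; it is an illustration, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value, similar in spirit to 'age'.
age = pd.Series([22.0, 38.0, np.nan, 35.0], name="age")

try:
    age.astype("int32")          # NaN has no int32 representation
    converted = True
except ValueError as exc:
    converted = False
    print("conversion failed:", exc)

# The nullable integer dtype (capital 'I') keeps the missing value as <NA>.
age_nullable = age.astype("Int64")
print(age_nullable)
```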

Sorting (sort)

sort_index: sorting by index

Sorts by index. (Ascending order is the default.)
To sort in descending order, set the option ascending=False.

df.sort_index().head(5)

df.sort_index(ascending=False).head(5)

sort_values: sorting by value

Sorts rows by value.
Set by to the column to sort on.
You can pass two or more columns to by.
Ascending/descending order can be specified per column.

df.sort_values(by='age').head()

Descending order: ascending=False

df.sort_values(by='age', ascending=False).head()

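String columns can also be sorted in ascending/descending order, and they are sorted alphabetically. (class, used below, is a category column, so it is sorted by category order: First, Second, Third.)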

df.sort_values(by='class', ascending=False).head()

You can sort by two or more columns.

df.sort_values(by=['fare', 'age']).head()

Ascending/descending order can also be set for each column individually.

df.sort_values(by=['fare', 'age'], ascending=[False, True]).head()
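Because age contains missing values, it helps to know where they land after sorting. In this sketch (toy data, assumed for illustration) NaN rows go last in both ascending and descending order unless na_position='first' is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.25, 8.05],
    "age": [22.0, np.nan, 35.0, 4.0],
})

by_age = toy.sort_values(by="age")                          # NaN row placed last
by_age_desc = toy.sort_values(by="age", ascending=False)    # NaN row still last
nan_first = toy.sort_values(by="age", na_position="first")  # NaN row placed first

print(by_age)
print(by_age_desc)
print(nan_first)
```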

Indexing, slicing, and condition filtering

df.head()

loc - indexing / slicing

loc supports both indexing and slicing.
Note that slicing follows the [start (inclusive): end (inclusive)] rule. Both ends are included.

indexing example

df.loc[5, 'class']

'Third'

fancy indexing example

df.loc[2:5, ['age', 'fare', 'who']]

slicing example

df.loc[2:5, 'class':'deck'].head()

df.loc[:6, 'class':'deck']
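The inclusive end is easy to miss because ordinary Python slicing, and iloc (covered later in this post), exclude the end. A minimal sketch with a hypothetical frame whose index labels are deliberately not 0-based:

```python
import pandas as pd

# Hypothetical frame; labels 10..14 make label-based and position-based slicing distinguishable.
toy = pd.DataFrame({"age": [22, 38, 26, 35, 54]}, index=[10, 11, 12, 13, 14])

by_label = toy.loc[11:13]      # label slice: 11, 12 and 13 are all included
by_position = toy.iloc[1:3]    # position slice: positions 1 and 2 only

letters = ["a", "b", "c", "d", "e"]
plain_slice = letters[1:3]     # plain Python slice also excludes the end

print(by_label)
print(by_position)
print(plain_slice)
```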

loc - condition filter

You can build a boolean index to extract only the rows that match a condition.

condition = df['who'] == 'man'
condition

0       True
1      False
2      False
3      False
4       True
       ...  
886     True
887    False
888    False
889     True
890     True
Name: who, Length: 891, dtype: bool

There are two ways to extract only the rows that match the condition.

The results are the same.

Case 1: df[condition]

df[condition].head()

Case 2: df.loc[condition]

df.loc[condition].head()

However, using loc is recommended, because assigning a value through df[condition][...] causes a problem.

df[condition]['age']

0      22.0
4      35.0
5       NaN
6      54.0
12     20.0
       ... 
883    28.0
884    25.0
886    27.0
889    26.0
890    32.0
Name: age, Length: 537, dtype: float64

df[condition]['age'] = 2

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

The warning shown above appears.

Even though a value was assigned, the values are unchanged.

df[condition]['age']

0      22.0
4      35.0
5       NaN
6      54.0
12     20.0
       ... 
883    28.0
884    25.0
886    27.0
889    26.0
890    32.0
Name: age, Length: 537, dtype: float64

Using loc avoids this problem and helps reduce mistakes.

df.loc[condition, 'age'] = 10

df[condition].head()
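The same pattern can be checked on a small frame. The sketch below (toy data, assumed for illustration) shows the one-step loc assignment, plus an explicit .copy() for the case where an independent subset is what you actually want to modify.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({"who": ["man", "woman", "man"], "age": [22.0, 38.0, 35.0]})
condition = toy["who"] == "man"

# Row selection and column assignment in a single loc call modify toy itself.
toy.loc[condition, "age"] = 10

# To edit a filtered subset without touching toy, take an explicit copy first.
men = toy.loc[condition].copy()
men["age"] = 99

print(toy)
print(men)
```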

loc - multiple conditions

For multiple conditions, define each condition first, then combine them with the & and | operators.

# Define condition 1
condition1 = (df['fare'] > 30)

# Define condition 2
condition2 = (df['who'] == 'woman')

df.loc[condition1 & condition2]

df.loc[condition1 | condition2]
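When conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because & and | bind more tightly than > or == in Python. The sketch below also shows ~ for negation; the toy frame is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.9, 51.9],
    "who": ["man", "woman", "woman", "man"],
})

both = toy.loc[(toy["fare"] > 30) & (toy["who"] == "woman")]    # AND
either = toy.loc[(toy["fare"] > 30) | (toy["who"] == "woman")]  # OR
not_woman = toy.loc[~(toy["who"] == "woman")]                   # NOT

print(both)
print(either)
print(not_woman)
```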

Exercises

Reload the data.

df = sns.load_dataset("titanic")
df.head()

1) Write code that satisfies the following conditions.

Filter for male passengers aged 30 or older
Sort in descending order of fare
Print the top 10

# Enter your code here
condition1 = (df['age'] >= 30)
condition2 = (df['who'] == 'man')
df.loc[condition1 & condition2].sort_values(by='fare', ascending=False).head(10)

2) Write code that satisfies the following conditions.

Passengers aged 20 or older and under 40
Passengers whose pclass is 1st or 2nd class
Show only the columns survived, pclass, age, fare
Print only 10 rows

# Enter your code here
condition1 = (df['age'] >= 20) & (df['age'] < 40)
condition2 = (df['pclass'] < 3)
df.loc[condition1 & condition2, ['survived', 'pclass', 'age', 'fare']].head(10)

iloc

Similar to loc, but accepts only integer positions.
Like loc, it supports both indexing and slicing. Unlike loc, slicing excludes the end position.

df.head()

indexing

df.iloc[1, 3]

38.0

fancy indexing

df.iloc[[0, 3, 4], [0, 1, 5, 6]]

slicing

df.iloc[:3, :5]

at

Retrieves a single value by label. It is faster than loc, but its practical benefit is small: loc gives the same result.

%timeit df.loc[0, 'fare']

7.4 µs ± 27.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.at[0, 'fare']

4.56 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

iat

Retrieves a single value by integer position. It is fast, but can look up only one value at a time. iloc can be used in its place.

%timeit df.iloc[0, 5]

8.66 µs ± 9.52 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.iat[0, 5]

5.53 µs ± 29.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
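Both accessors can also set a single cell. A minimal sketch on a hypothetical two-row frame, showing that at and iat address the same cells as loc and iloc:

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({"fare": [7.25, 71.28], "age": [22.0, 38.0]})

same_by_label = toy.at[0, "fare"] == toy.loc[0, "fare"]   # label-based lookup
same_by_position = toy.iat[1, 1] == toy.iloc[1, 1]        # position-based lookup

toy.at[1, "age"] = 40.0   # set one cell by row label and column name
toy.iat[0, 0] = 8.0       # set one cell by row and column position

print(toy)
```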

where

Documentation

DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)

Pandas' where behaves differently from Numpy's where.

cond: an expression that evaluates to True/False
other: the value assigned to elements that do not satisfy cond

df.tail(5)

When applied to a column

df['fare'].where(df['fare'] < 20, 0).tail(10)

881     7.8958
882    10.5167
883    10.5000
884     7.0500
885     0.0000
886    13.0000
887     0.0000
888     0.0000
889     0.0000
890     7.7500
Name: fare, dtype: float64
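To make the difference from Numpy concrete: Series.where keeps the original value where the condition is True and substitutes other elsewhere, while np.where picks between two supplied values and returns a plain array. The sketch below uses hypothetical toy fares and also shows mask, which is the inverse of where.

```python
import numpy as np
import pandas as pd

# Hypothetical toy fares.
fare = pd.Series([7.25, 71.28, 13.0, 30.0], name="fare")

kept_cheap = fare.where(fare < 20, 0)                 # keep values under 20, replace the rest with 0
flagged = np.where(fare < 20, "cheap", "expensive")   # choose one of two values per element (ndarray)
masked = fare.mask(fare < 20, 0)                      # inverse: replace values under 20 with 0

print(kept_cheap)
print(flagged)
print(masked)
```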

When applied to entire rows (this is not the normal, recommended usage)

df.where(df['fare'] < 20, 0).tail(10)

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: Implicitly converting categorical to object-dtype ndarray. One or more of the values in 'other' are not present in this categorical's categories. A future version of pandas will raise a ValueError when 'other' contains different categories.

To preserve the current behavior, add the new categories to the categorical before calling 'where', or convert the categorical to a different dtype.
  """Entry point for launching an IPython kernel.

isin()

Use isin to check whether each value is among a given set of values. (Python's in keyword cannot be used for this element-wise check.)

sample = pd.DataFrame({'name': ['kim', 'lee', 'park', 'choi'], 
                        'age': [24, 27, 34, 19]
                      })
sample

sample['name'].isin(['kim', 'lee'])

0     True
1     True
2    False
3    False
Name: name, dtype: bool

sample.isin(['kim', 'lee'])

It also works very well with condition filtering using loc.

condition = sample['name'].isin(['kim', 'lee'])

sample.loc[condition]
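A short sketch of why the in keyword does not do the job here: on a Series, in tests the index labels rather than the values. It also shows ~ for selecting the rows that are not in the list. The frame repeats the sample data from above.

```python
import pandas as pd

sample = pd.DataFrame({"name": ["kim", "lee", "park", "choi"],
                       "age": [24, 27, 34, 19]})

in_index = "kim" in sample["name"]            # checks index labels 0..3, not the values
in_values = "kim" in sample["name"].tolist()  # checks the values, but gives only one bool

# Element-wise membership, negated with ~ to keep everyone except kim and lee.
others = sample.loc[~sample["name"].isin(["kim", "lee"])]

print(in_index, in_values)
print(others)
```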

Output of df.describe():
	survived	pclass	age	sibsp	parch	fare
count	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

#03 - Pandas DataFrame: viewing, sorting (sort), and condition filtering (loc, iloc)


Output of df.head():
	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

Output of df.tail():
	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
886	0	2	male	27.0	0	0	13.00	S	Second	man	True	NaN	Southampton	no	True
887	1	1	female	19.0	0	0	30.00	S	First	woman	False	B	Southampton	yes	True
888	0	3	female	NaN	1	2	23.45	S	Third	woman	False	NaN	Southampton	no	False
889	1	1	male	26.0	0	0	30.00	C	First	man	True	C	Cherbourg	yes	True
890	0	3	male	32.0	0	0	7.75	Q	Third	man	True	NaN	Queenstown	no	True

Output of df.tail(10):
	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
881	0	3	male	33.0	0	0	7.8958	S	Third	man	True	NaN	Southampton	no	True
882	0	3	female	22.0	0	0	10.5167	S	Third	woman	False	NaN	Southampton	no	True
883	0	2	male	28.0	0	0	10.5000	S	Second	man	True	NaN	Southampton	no	True
884	0	3	male	25.0	0	0	7.0500	S	Third	man	True	NaN	Southampton	no	True
885	0	3	female	39.0	0	5	29.1250	Q	Third	woman	False	NaN	Queenstown	no	False
886	0	2	male	27.0	0	0	13.0000	S	Second	man	True	NaN	Southampton	no	True
887	1	1	female	19.0	0	0	30.0000	S	First	woman	False	B	Southampton	yes	True
888	0	3	female	NaN	1	2	23.4500	S	Third	woman	False	NaN	Southampton	no	False
889	1	1	male	26.0	0	0	30.0000	C	First	man	True	C	Cherbourg	yes	True
890	0	3	male	32.0	0	0	7.7500	Q	Third	man	True	NaN	Queenstown	no	True

Output of df.describe(include='object'):
	sex	embarked	who	embark_town	alive
count	891	889	891	889	891
unique	2	3	3	3	2
top	male	S	man	Southampton	no
freq	577	644	537	644	549

Output of df.sort_values(by='age').head():
	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
803	1	3	male	0.42	0	1	8.5167	C	Third	child	False	NaN	Cherbourg	yes	False
755	1	2	male	0.67	1	1	14.5000	S	Second	child	False	NaN	Southampton	yes	False
644	1	3	female	0.75	2	1	19.2583	C	Third	child	False	NaN	Cherbourg	yes	False
469	1	3	female	0.75	2	1	19.2583	C	Third	child	False	NaN	Cherbourg	yes	False
78	1	2	male	0.83	0	2	29.0000	S	Second	child	False	NaN	Southampton	yes	False

Output of df.sort_values(by='age', ascending=False).head():
	survived	pclass	sex	age	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
630	1	1	male	80.0	30.0000	S	First	man	True	A	Southampton	yes	True
851	0	3	male	74.0	7.7750	S	Third	man	True	NaN	Southampton	no	True
493	0	1	male	71.0	49.5042	C	First	man	True	NaN	Cherbourg	no	True
96	0	1	male	71.0	34.6542	C	First	man	True	A	Cherbourg	no	True
116	0	3	male	70.5	7.7500	Q	Third	man	True	NaN	Queenstown	no	True

Output of df.sort_values(by=['fare', 'age'], ascending=[False, True]).head():
	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
258	1	1	female	35.0	0	0	512.3292	C	First	woman	False	NaN	Cherbourg	yes	True
737	1	1	male	35.0	0	0	512.3292	C	First	man	True	B	Cherbourg	yes	True
679	1	1	male	36.0	0	1	512.3292	C	First	man	True	B	Cherbourg	yes	False
27	0	1	male	19.0	3	2	263.0000	S	First	man	True	C	Southampton	no	False
88	1	1	female	23.0	3	2	263.0000	S	First	woman	False	C	Southampton	yes	False

Output of df[condition].head() after the loc assignment:
	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	3	male	10.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
4	3	male	10.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
5	3	male	10.0	0	8.4583	Q	Third	man	True	NaN	Queenstown	no	True
6	1	male	10.0	0	51.8625	S	First	man	True	E	Southampton	no	True
12	3	male	10.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

Output of sample:
	name	age
0	kim	24
1	lee	27
2	park	34
3	choi	19

Output of sample.loc[condition]:
	name	age
0	kim	24
1	lee	27