Working with Datasets

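The sklearn.datasets module ships with built-in datasets. They are intended for tutorials, so they are small (they are also called toy datasets).

In contrast, mldata.org offered somewhat larger datasets that could be downloaded dynamically online. Note that mldata.org has since gone offline and scikit-learn removed fetch_mldata in version 0.22; fetch_openml, which downloads from openml.org, fills the same role.

This tutorial covers how to use the built-in datasets.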

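Using the Built-in Datasets

Types of Datasets

load_boston: Boston house price data (deprecated in scikit-learn 1.0 and removed in 1.2)
load_iris: iris flower data
load_diabetes: diabetes patient data
load_digits: handwritten digit data
load_linnerud: data for multi-output regression
load_wine: wine data
load_breast_cancer: Wisconsin breast cancer patient data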

Inspecting a Dataset

The built-in datasets use a data structure called sklearn.utils.Bunch.

It is organized as key-value pairs and is similar in structure to a dictionary (dict).

The common keys are as follows.

data: the sample data, as a NumPy array
target: the label data, as a NumPy array
feature_names: the names of the features
target_names: the names of the labels
DESCR: a description of the dataset
filename: the location of the dataset file (csv)
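The key-based access described above can be sketched as follows. This is a minimal example, assuming scikit-learn is installed; the variable name `bunch` is only illustrative and does not come from the original. A Bunch also exposes each key as an attribute, so `bunch.data` and `bunch['data']` refer to the same array.

```python
from sklearn.datasets import load_iris

# Load the iris data as a Bunch (a dict-like container)
bunch = load_iris()

# List the keys that are available on this dataset
print(list(bunch.keys()))

# Dictionary-style and attribute-style access return the same object
print(bunch['data'].shape)           # (150, 4)
print(bunch.data is bunch['data'])   # True

# DESCR holds a plain-text description of the dataset
print(bunch.DESCR[:200])
```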

A short hands-on exercise shows how to use a built-in dataset.

Loading the Iris Data

from IPython.display import Image

Image(url='https://user-images.githubusercontent.com/15958325/56006707-f69f3680-5d10-11e9-8609-25ba5034607e.png')

# Load the iris data
from sklearn.datasets import load_iris

Load the dataset with load_iris and store it in a temporary variable.

iris = load_iris()

Printing the variable shows the loaded dataset, which is made up of key-value pairs.

iris

Feature Data (X)

Looking Up Feature Values

The feature data can be accessed through the data key.

features = iris['data']

Printing only the first five rows gives the following.

features[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

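The names of the features are available through feature_names.

This confirms that the iris data has four features in total.

[Note]

sepal: the leaf-like outer part of the flower beneath the petals
petal: the colored leaf of the flower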

feature_names = iris['feature_names']
feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Label Data (Y)

The label data can be accessed through the target key.

labels = iris['target']
labels

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

As with the feature data, the class names for the target can be checked through the target_names key.
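The target_names lookup is mentioned without code, so here is a minimal sketch. The variable names `target_names` and `label_names` are illustrative choices, not taken from the original.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Class names, indexed by the integer label (0, 1, 2)
target_names = iris['target_names']
print(target_names)  # ['setosa' 'versicolor' 'virginica']

# Map every integer label to its class name with NumPy integer indexing
label_names = target_names[iris['target']]
print(label_names[:3])   # the first three samples are all 'setosa'
print(label_names[-1])   # the last sample is 'virginica'
```

Because target_names is a NumPy array, indexing it with the integer label array converts all 150 labels in one step.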

Converting the Dataset to a DataFrame

import pandas as pd

First, build a DataFrame from the data retrieved with the data and feature_names keys.

df = pd.DataFrame(features, columns=feature_names)
df.head()

Building it directly from the dataset gives the same result.

df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df.head()

Then add the target data as a new column. The name of this column can be chosen freely.

df['target'] = iris['target']

df.head()
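As an alternative sketch, scikit-learn 0.23 and later accept as_frame=True in load_iris, which returns the data as pandas objects directly. This is not the approach the tutorial uses; the names `iris_frame` and `df_alt` are illustrative, and pandas must be installed.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris_frame = load_iris(as_frame=True)

# .frame holds the four feature columns plus a 'target' column
df_alt = iris_frame.frame
print(df_alt.shape)             # (150, 5)
print(df_alt.columns.tolist())
```

The manual construction shown above remains useful when you want to control column names or work with an older scikit-learn version.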

Visualizing the Loaded DataFrame

import matplotlib.pyplot as plt
import seaborn as sns

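Let's see how the flower species varies with sepal length and width.

plt.figure(figsize=(10, 7))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=df.iloc[:, 0], y=df.iloc[:, 1], hue=df['target'], palette='muted')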
plt.title('Sepal', fontsize=17)
plt.show()

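This time, let's see how the flower species varies with petal length and width.

plt.figure(figsize=(10, 7))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=df.iloc[:, 2], y=df.iloc[:, 3], hue=df['target'], palette='muted')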
plt.title('Petal', fontsize=17)
plt.show()

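Splitting the Data (train_test_split)

In machine learning, splitting the data is an important preprocessing step.

As its name suggests, train_test_split from sklearn.model_selection is a function that splits data into training and validation (or test) sets.

The data is divided into a training set and a validation (or test) set, and the validation set can be used to monitor for overfitting.

The validation set can also be used to evaluate the model's performance.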
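To make the overfitting check concrete, here is a minimal sketch that compares accuracy on the training set with accuracy on the held-out set. The choice of DecisionTreeClassifier is an assumption made purely for illustration; the original does not train any model, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load features and labels as NumPy arrays
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=30
)

# Fit on the training set only
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# A large gap between the two scores is a sign of overfitting
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

If the training accuracy is much higher than the held-out accuracy, the model has likely memorized the training data rather than learned patterns that generalize.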

Checking the Sample Data

df.head()

Separate the data into features (x) and labels (y).

from sklearn.model_selection import train_test_split

x = df.iloc[:, :4]
x.head()

y = df['target']
y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

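Key parameters

test_size: proportion of the data to assign to the validation set (20% -> 0.2), default 0.25
stratify: keeps the class proportions the same in each split
random_state: random seed value
shuffle: whether to shuffle the data before splitting, default True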

x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.2, random_state=30)

Shape of the original x

x.shape

(150, 4)

Shape of the split x

x_train.shape, x_test.shape

((120, 4), (30, 4))

Shape of the original y

y.shape

(150,)

Shape of the split y

y_train.shape, y_test.shape

((120,), (30,))
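The effect of stratify and random_state can be checked with a short sketch like the one below. It rebuilds x and y from load_iris so that it runs on its own; the name `x_test_again` is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x = pd.DataFrame(iris['data'], columns=iris['feature_names'])
y = pd.Series(iris['target'], name='target')

x_train, x_test, y_train, y_test = train_test_split(
    x, y, stratify=y, test_size=0.2, random_state=30
)

# Each class has 50 samples, so a stratified 80/20 split keeps 40 and 10 per class
print(y_train.value_counts().sort_index().tolist())  # [40, 40, 40]
print(y_test.value_counts().sort_index().tolist())   # [10, 10, 10]

# Reusing the same random_state reproduces the same split
_, x_test_again, _, _ = train_test_split(
    x, y, stratify=y, test_size=0.2, random_state=30
)
print(x_test.index.equals(x_test_again.index))  # True
```

Without stratify, the per-class counts in the test set can drift away from 10 each, which matters more on small or imbalanced datasets.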


For reference, the df.head() output before the target column is added:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2
