Machine Learning Concept (Scikit-learn)

You are not logged in.
If you are a current student, please Log In for full access to this page.

1) เบื้องต้น
2) Regression
3) Classification
4) สรุปภาพรวม

1) เบื้องต้น

machine learning เป็นอีกเทคนิคหนึ่งที่ใช้ในการประมวลผลข้อมูลจำนวนมาก
library สำหรับทำ machine learning เบื้องต้นใน python ชื่อว่า sklearn
machine learning เป็นเนื้อหาที่กว้างมาก แลบนี้เพียงแค่นำเนื้อหาเสี้ยวเล็กๆ มาแนะนำเท่านั้น
แผนภาพด้านล่างเป็นสิ่งที่ sklearn สามารถทำได้ที่เกี่ยวกับการทำ machine learning ซึ่งประกอบด้วย
1. classification
2. regression
3. clustering
4. dimensionality reduction

ในแลบนี้จะแสดงตัวอย่างของการทำ machine learning ประเภท
- regression
- classification แบบเบื้องต้นมากๆ

2) Regression

!pip install scikit-learn eli5

import pandas as pd

เริ่มต้นด้วยการนำเข้าข้อมูลด้วย pandas ตามปกติ

df = pd.read_json('https://raw.githubusercontent.com/vega/vega-datasets/master/data/gapminder.json')

df.head()

	year	country	pop	life_expect	fertility
0	1955	Afghanistan	8891209	30.332	7.7
1	1960	Afghanistan	9829450	31.997	7.7
2	1965	Afghanistan	10997885	34.020	7.7
3	1970	Afghanistan	12430623	36.088	7.7
4	1975	Afghanistan	14132019	38.438	7.7

ข้อมูลเป็นชุดข้อมูลเดียวกันแลบ Data Visualization คือ

fertility: อัตราการเกิด
life_expect: อายุขัย
pop: จำนวนประชากร

สำหรับแต่ละประเทศและปีที่เก็บข้อมูล และมีคอลัมน์

cluster: กลุ่มอะไรบางอย่าง

ในการทำ regression ให้เราระบุว่าจะใช้คอลัมน์ไหนเป็น input/output ได้ดังนี้

regression_features = ["pop","fertility"]
regression_X = df[regression_features]
regression_y = df["life_expect"]

โดย X (input) ประกอบด้วย

pop
fertility

และ y (output) เป็น life_expect

สังเกตได้ว่า life_expect เป็นปริมาณ จึงใช้กระบวนการ regression

จากนั้น เราสามารถเริ่มสร้างโมเดล machine learning จาก sklearn

from sklearn.linear_model import LinearRegression
reg = LinearRegression()

โดยโมเดลแรกที่เราจะใช้คือ LinearRegression ที่เราเรียกมาจาก sklearn.linear_model

และ reg = LinearRegression() เป็นการสร้างโมเดลนั้นขึ้นมา

แต่จะยังไม่มีการเรียนรู้เกิดขึ้นในกระบวนการนี้ ต้องใช้คำสั่งถัดไปคือ

reg.fit(regression_X, regression_y)

reg.fit(input, output) เป็นการสั่งให้โมเดล reg เรียนรู้จากข้อมูลที่เรากำหนด

หลังจากที่โมเดลเรียนรู้จากข้อมูลแล้ว เราสามารถดูสิ่งที่โมเดลได้เรียนรู้ได้ด้วยแพคเกจ eli5 (เป็นเนื้อหาเสริม สามารถไปค้นหาเพิ่มเติมเองได้)

ผลลัพธ์ของโมเดลจะเรียกว่า weight สามารถลองดูได้ตามด้านล่าง

import eli5

eli5.show_weights(reg, feature_names = regression_features)

y top features

Weight^?	Feature
+84.422	<BIAS>
-0.000	pop
-4.968	fertility

โดยโมเดล LinearRegression จะเป็นสมการเชิงเส้น มีลักษณะเป็น

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2

ซึ่งจากตาราง โมเดลนี้จะได้ค่าออกมาเป็น

life\_expect = 84.422 - 4.968 \times fertility

ถ้า fertility เพิ่มขึ้น 0.1 แล้ว ค่า life_expect จะมากกว่าเดิมเท่าไร (ปัดทศนิยมสองตำแหน่ง)

ตารางด้านบนแสดงน้ำหนักของแต่ละคอลัมน์ใน input ที่ส่งผลต่อ output

ต่อไปเป็นโมเดลประเภท regression เช่นกันแต่เป็นโมเดล RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=10)
rf_reg.fit(regression_X, regression_y)

n_estimators เป็นรายละเอียดย่อยเฉพาะของโมเดล RandomForestRegressor ซึ่งเป็นจำนวน Decision Tree ย่อยที่โมเดล Random Forest จะสร้างขึ้นมา

บทความรายละเอียดเพิ่มเติม

eli5.show_weights(rf_reg, feature_names = regression_features)

Weight	Feature
0.8759 ± 0.0303	fertility
0.1241 ± 0.0303	pop

weight ของ RandomForestRegressor จะแสดงถึง feature importance (ความสำคัญของตัวแปร) แต่ไม่ได้แสดงถึงค่าโดยตรงเหมือน LinearRegression

top

Checkoff 1:

(คำถามอธิบาย ไม่ต้องโค้ด) ให้อธิบายขั้นตอนการใช้งาน machine learning model ด้วยแพคเกจ sklearn

Hint: ในตัวอย่างมีสามบรรทัด แต่ละบรรทัดระบุถึงขั้นตอนอะไร

3) Classification

top

ต่อไปนี้จะเป็นตัวอย่างของ machine learning ประเภท classification

โดยเราจะกำหนด output เป็น cluster แทน life_expect ซึ่งเป็นข้อมูลเชิงคุณภาพ จากเดิมเป็นเชิงปริมาณ

df["cluster"].describe()

count    693.000000
mean       2.476190
std        1.446971
min        0.000000
25%        1.000000
50%        3.000000
75%        3.000000
max        5.000000
Name: cluster, dtype: float64

classification_features = ["pop", "life_expect", "fertility"]
classification_X = df[classification_features]
classification_y = df["cluster"]

โมเดล svm เป็นโมเดล classification รูปแบบหนึ่ง

from sklearn import svm
sv = svm.LinearSVC()
sv.fit(classification_X, classification_y)

eli5.show_weights(sv, feature_names = classification_features)

y=0 top features

y=1 top features

y=2 top features

y=3 top features

y=4 top features

y=5 top features

Weight^?	Feature
-0.000	<BIAS>
-0.000	fertility
-0.000	pop
-0.000	life_expect

Weight^?	Feature
+0.000	pop
-0.000	<BIAS>
-0.000	fertility
-0.000	life_expect

Weight^?	Feature
+0.000	pop
-0.000	<BIAS>
-0.000	fertility
-0.000	life_expect

Weight^?	Feature
+0.000	life_expect
+0.000	fertility
+0.000	<BIAS>
+0.000	pop

Weight^?	Feature
-0.000	<BIAS>
-0.000	fertility
-0.000	pop
-0.000	life_expect

Weight^?	Feature
-0.000	<BIAS>
-0.000	fertility
-0.000	pop
-0.000	life_expect

อีกหนึ่งโมเดลตัวอย่างคือ RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf.fit(classification_X, classification_y)

eli5.show_weights(clf, feature_names = classification_features)

Weight	Feature
0.3475 ± 0.0711	pop
0.3346 ± 0.0998	life_expect
0.3179 ± 0.0782	fertility

ฟีเจอร์ใดส่งผลต่อ cluster มากที่สุด

ต่อไปหากจะทำการลองทดสอบข้อมูลที่สมมุติขึ้นมา โดยดูรูปแบบข้อมูลที่ใช้กับโมเดลจาก training data ก่อน

classification_X.iloc[0]

pop            8891209.000
life_expect         30.332
fertility            7.700
Name: 0, dtype: float64

จะเห็นว่าจะต้องใส่สามค่า ซึ่งสามารถใช้ np.array สร้างขึ้นมาได้

import numpy as np

เช่น สมมุติว่า

pop เป็น 4000000
life_expect เป็น 30
fertility เป็น 7

to_predict = np.array([4000000, 30, 7])

แล้วสามารถใช้ค่านี้ไป predict ได้

eli5.show_prediction(clf, to_predict, feature_names = classification_features, 
                show_feature_values=True)

y=0 (probability 0.100) top features

y=1 (probability 0.000) top features

y=2 (probability 0.700) top features

y=3 (probability 0.200) top features

y=4 (probability 0.000) top features

y=5 (probability 0.000) top features

Contribution^?	Feature	Value
+0.220	life_expect	30.000
+0.057	<BIAS>	1.000
+0.005	fertility	7.000
-0.182	pop	4000000.000

Contribution^?	Feature	Value
+0.304	<BIAS>	1.000
+0.011	pop	4000000.000
-0.147	fertility	7.000
-0.169	life_expect	30.000

Contribution^?	Feature	Value
+0.466	life_expect	30.000
+0.142	fertility	7.000
+0.066	<BIAS>	1.000
+0.026	pop	4000000.000

Contribution^?	Feature	Value
+0.350	<BIAS>	1.000
+0.202	pop	4000000.000
-0.058	fertility	7.000
-0.294	life_expect	30.000

Contribution^?	Feature	Value
+0.131	<BIAS>	1.000
-0.038	life_expect	30.000
-0.044	fertility	7.000
-0.049	pop	4000000.000

Contribution^?	Feature	Value
+0.101	fertility	7.000
+0.092	<BIAS>	1.000
-0.008	pop	4000000.000
-0.185	life_expect	30.000

ตัวโมเดลคิดว่าข้อมูลด้านบนควรอยู่ใน cluster ใด

Checkoff 2:

จงหาค่า pop, life_expect, fertility ที่ทำให้โมเดลทำนายออกมาเป็น cluster ที่ อ. กำหนด

4) สรุปภาพรวม

สรุปภาพรวม lab15

Machine Learning Concept (Scikit-learn)

Table of Contents

1) เบื้องต้น

2) Regression

3) Classification

4) สรุปภาพรวม