Implementing Logistic Regression with numpy for Binary Classification on the IRIS Dataset

Implement logistic regression with numpy to perform binary classification on the IRIS dataset, using the log-likelihood loss, and plot how the loss changes during training.

Prerequisites:

  • Logistic Regression
  • Log-likelihood loss
  • An introduction to the IRIS dataset
  • Using np.concatenate

Prerequisites

Logistic Regression

$$y = \frac{1}{1+e^{-(wx+b)}}$$

Despite the name, logistic regression is generally used for classification, especially binary classification: spam detection, recommender systems, medical diagnosis, and so on. Because its logic and implementation are simple, it is widely used in industry.
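As a quick illustration of the formula, a minimal sketch showing how the sigmoid maps raw scores wx+b to probabilities in (0, 1); the score values below are arbitrary:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# large negative scores map near 0, zero maps to 0.5, large positive near 1
for score in [-4.0, 0.0, 4.0]:
    print(score, '->', sigmoid(score))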

Advantages

  • Simple to implement and computationally cheap; easy to understand, and widely applied to industrial problems;
  • Classification at inference time requires very little computation, is fast, and uses little memory;

Disadvantages

  • Prone to underfitting; when the feature space is very large, logistic regression does not perform well;
  • Cannot handle a large number of multi-class features or variables well;

Log-likelihood Loss

Log loss, i.e. log-likelihood loss, also called logistic loss or cross-entropy loss, is defined on probability estimates. It is commonly used in (multinomial) logistic regression and neural networks, as well as in some variants of expectation-maximization algorithms, and can be used to evaluate the probabilistic output of a classifier. For more detail, see the article "对数损失函数(Logarithmic Loss Function)的原理和 Python 实现".

Loss function:

$$L = \frac{1}{m}\sum_i^m -y_i \log(f(x_i)) - (1-y_i)\log(1-f(x_i))$$
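To make the formula concrete, a small worked sketch computing this loss on a toy batch; the probabilities below are arbitrary illustrative values:

import numpy as np

y = np.array([1, 0, 1])           # true labels
pred = np.array([0.9, 0.2, 0.6])  # predicted probabilities f(x_i)

# mean of -y*log(p) - (1-y)*log(1-p); confident correct predictions cost little
loss = np.mean(-y * np.log(pred) - (1 - y) * np.log(1 - pred))
print(loss)  # about 0.28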

Gradient:

$$\frac{\partial L}{\partial w} = \frac{1}{m} X^T (f(x)-y)$$

Weight update:

$$w = w - \alpha \frac{\partial L}{\partial w}$$
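The analytic gradient can be sanity-checked against a finite-difference approximation of the loss. A minimal sketch on random toy data (the helper name loss and the rng seed are ours, chosen for illustration):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(w, X, y):
    p = sigmoid(X.dot(w))
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size = (8, 3))
y = rng.integers(0, 2, size = (8, 1)).astype(float)
w = rng.normal(size = (3, 1))

# analytic gradient: X^T (f(x) - y) / m
grad = X.T.dot(sigmoid(X.dot(w)) - y) / len(y)

# finite-difference check on the first weight component
eps = 1e-6
w_plus = w.copy();  w_plus[0] += eps
w_minus = w.copy(); w_minus[0] -= eps
numeric = (loss(w_plus, X, y) - loss(w_minus, X, y)) / (2 * eps)
print(grad[0, 0], numeric)  # the two values should agree closely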

The IRIS Dataset

The dataset contains 4 feature variables and 1 class variable. Each iris sample has 4 features: sepal length, sepal width, petal length, and petal width, plus 1 class label. See "Loading the Data" below for details.

Using np.concatenate

a = np.array([[1, 2],[3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis = 0)
array([[1, 2],
       [3, 4],
       [5, 6]])
np.concatenate((a, b.T), axis = 1)
array([[1, 2, 5],
       [3, 4, 6]])

Loading the Data

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
%matplotlib inline
dataset = load_iris()
inputs = dataset['data']
target = dataset['target']
print('inputs.shape:', inputs.shape)
print('target.shape:', target.shape)
# three classes
print('labels:', set(target))
inputs.shape: (150, 4)
target.shape: (150,)
labels: {0, 1, 2}
target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
# plot the class distribution as a pie chart (the three classes are balanced)
values = [np.sum(target == 0), np.sum(target == 1), np.sum(target == 2)]
plt.pie(values, labels = [0, 1, 2], autopct = '%.1f%%')
plt.show()

On the random_state parameter of train_test_split:

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
random_state is the random seed; it ensures that every run of the program produces the same train/test split. Otherwise, the same model would be trained and evaluated on different splits each time, giving inconsistent results.
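A minimal sketch of the effect: two calls with the same random_state produce identical splits:

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)
a1, b1 = train_test_split(data, test_size = 0.3, random_state = 0)
a2, b2 = train_test_split(data, test_size = 0.3, random_state = 0)
print(np.array_equal(a1, a2), np.array_equal(b1, b2))  # True True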

from sklearn.model_selection import train_test_split
# keep only the first two classes for binary classification
two_class_input = inputs[:100]
two_class_target = target[:100]
x_train, x_test, y_train, y_test = train_test_split(
    two_class_input, two_class_target,
    test_size = 0.3,
    random_state = 0)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
(70, 4) (30, 4) (70, 1) (30, 1)
# append a column of ones to x so the bias b is absorbed into the weight vector w
x_train = np.concatenate([x_train, np.ones((x_train.shape[0], 1))], axis = 1)
x_test = np.concatenate([x_test, np.ones((x_test.shape[0], 1))], axis = 1)
print(x_train.shape, x_test.shape)
(70, 5) (30, 5)

Defining the Model

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.arange(-10, 10, step = 0.1)
fig, ax = plt.subplots(figsize = (8, 4))
ax.plot(x, sigmoid(x), c = 'green')
(plot: the sigmoid curve, rising from 0 to 1 over [-10, 10])
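One caveat: for large-magnitude negative inputs, np.exp(-x) can overflow and trigger a RuntimeWarning. The naive version is fine for this dataset, but a numerically stable variant looks like this (a sketch; stable_sigmoid is a name introduced here, and it expects an ndarray input):

def stable_sigmoid(x):
    # split by sign so exp is only ever evaluated on non-positive values
    out = np.empty_like(x, dtype = float)
    pos = x >= 0
    out[pos] = 1 / (1 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1 + ex)
    return out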

compute_loss = lambda pred_y, y: np.mean(-y * np.log(pred_y) - (1 - y) * np.log(1 - pred_y))
# weight init (the last entry acts as the bias, thanks to the ones column)
w = np.random.randn(5, 1)
losses = []
# previous loss, initialized large so the loop runs at least once
last_loss = 10000
pred_y = sigmoid(np.dot(x_train, w))
# current loss
now_loss = compute_loss(pred_y, y_train)
i = 0
# iterate until the loss changes by less than 1e-4
while abs(now_loss - last_loss) > 1e-4:
    last_loss = now_loss
    i = i + 1
    # compute the gradient
    grad = x_train.T.dot((pred_y - y_train)) / len(y_train)
    # update the weights
    w = w - 0.001 * grad

    # forward pass with the updated weights
    pred_y = sigmoid(np.dot(x_train, w))
    now_loss = compute_loss(pred_y, y_train)
    losses.append(now_loss)
fig, ax = plt.subplots(figsize = (10, 4))
ax.plot(np.arange(len(losses)), losses, c = 'r')
(plot: training loss curve, decreasing toward convergence)

Testing

# evaluate on the test set: threshold the predicted probability at 0.5
test_pred = sigmoid(np.dot(x_test, w))
pre_test_y = np.array(test_pred > 0.5, dtype = np.float32)
acc = np.sum(pre_test_y == y_test) / len(y_test)
print("the accuracy of the model is {}".format(acc * 100))
the accuracy of the model is 100.0
print(pre_test_y.reshape(1,-1))
print(y_test.reshape(1,-1))

[[0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 1.]]
[[0 1 0 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1]]
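To classify a new measurement with the trained weights, the same preprocessing applies: append the bias column of ones before the dot product. A minimal sketch, with a made-up sample (x_new is hypothetical, not taken from the dataset):

# hypothetical sample: sepal length, sepal width, petal length, petal width
x_new = np.array([[5.0, 3.4, 1.5, 0.2]])
# append the bias term, matching the training-time preprocessing
x_new = np.concatenate([x_new, np.ones((x_new.shape[0], 1))], axis = 1)
prob = sigmoid(np.dot(x_new, w))
print('P(class 1):', prob[0, 0])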