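Highly practical, with no theoretical digressions. It does not cover Python basics or common topics such as NumPy and pandas, which are already well documented in Chinese.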


from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.plotly as py
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import scipy
import json
import sys
import csv
import os


print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))
matplotlib: 3.0.2
sklearn: 0.20.0
scipy: 1.1.0
seaborn: 0.9.0
pandas: 0.23.4
numpy: 1.15.4
Python: 3.7.0 (default, Aug 25 2018, 09:46:22) 
[Clang 10.0.0 (clang-1000.10.43.1)]


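Seaborn complements Matplotlib and is aimed specifically at statistical data visualization. It goes a step further, though: Seaborn extends Matplotlib, which is why it can address the two biggest frustrations of working with Matplotlib. Or, as Michael Waskom puts it in "An introduction to seaborn": "If matplotlib 'tries to make easy things easy and hard things possible', seaborn tries to make a well-defined set of hard things easy too."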

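One of these frustrations concerns Matplotlib's default parameters. Seaborn uses different defaults, which will appeal to users who do not like the default look of Matplotlib plots.

Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and is tightly integrated with pandas data structures.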


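  • A dataset-oriented API for examining relationships between multiple variables
  • Specialized support for using categorical variables to show observations or aggregate statistics
  • Options for visualizing univariate or bivariate distributions and for comparing them between subsets of data
  • Automatic estimation and plotting of linear regression models for different kinds of dependent variables
  • Convenient views onto the overall structure of complex datasets
  • High-level abstractions for structuring multi-plot grids that let you easily build complex visualizations
  • Concise control over matplotlib figure styling with several built-in themes
  • Tools for choosing color palettes that faithfully reveal patterns in your data

Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets, and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.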






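Important features of Seaborn

Seaborn is built on top of Matplotlib, Python's core visualization library. It is meant to complement Matplotlib, not replace it. However, Seaborn comes with some very important features. Let's look at a few of them here. These features help with:

  • Built-in themes for styling matplotlib graphics
  • Visualizing univariate and bivariate data
  • Fitting and visualizing linear regression models
  • Plotting statistical time series data
  • Working well with NumPy and pandas data structures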


def sinplot(flip = 1):
   x = np.linspace(0, 14, 100)
   for i in range(1, 5): 
      plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)
v1 = pd.Series(np.random.normal(0,10,1000), name='v1')
v2 = pd.Series(2*v1 + np.random.normal(60,15,1000), name='v2')
plt.hist(v1, alpha=0.7, bins=np.arange(-50,150,5), label='v1');
plt.hist(v2, alpha=0.7, bins=np.arange(-50,150,5), label='v2');
# we can pass keyword arguments for each individual component of the plot
sns.distplot(v2, hist_kws={'color': 'Teal'}, kde_kws={'color': 'Navy'});
/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning:

Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

sns.jointplot(v1, v2, alpha=0.4);

grid = sns.jointplot(v1, v2, alpha=0.4);

sns.jointplot(v1, v2, kind='hex');

# set the seaborn style for all the following plots

sns.jointplot(v1, v2, kind='kde', space=0);
train = pd.read_csv('../input/train.csv')
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S


  • The open-source Python ecosystem provides a standalone, versatile and powerful scientific working environment, including [NumPy], [SciPy], [IPython], [Matplotlib], [Pandas]


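  • Scikit-Learn builds on NumPy and SciPy and complements this scientific environment with machine learning algorithms;
  • By design, Scikit-Learn is non-intrusive, easy to use and easy to combine with other libraries;
  • Core algorithms are implemented in low-level languages.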



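  • Linear models (Ridge, Lasso, Elastic Net, ...)
  • Support Vector Machines
  • Tree-based methods (Random Forests, Bagging, GBRT, ...)
  • Nearest neighbors
  • Neural networks (basics)
  • Gaussian Processes
  • Feature selection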


  • Clustering (KMeans, Ward, ...)
  • Matrix decomposition (PCA, ICA, ...)
  • Density estimation
  • Outlier detection


  • Cross-validation
  • Grid search
  • Lots of metrics



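Data comes as a finite learning set ${\cal L} = (X, y)$, where input samples are given as an array $X$ of shape n_samples$\times$n_features, taking their values in ${\cal X}$, and output values are given as an array $y$, taking _symbolic_ values in ${\cal Y}$.

The goal of supervised classification is to build an estimator $\varphi: {\cal X} \mapsto {\cal Y}$ that minimizes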

$$ Err(\varphi) = \mathbb{E}_{X,Y}\{ \ell(Y, \varphi(X)) \} $$

where $\ell$ is a loss function, e.g., the zero-one loss for classification, $\ell_{01}(Y,\hat{Y}) = 1(Y \neq \hat{Y})$.
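As a rough illustration of the expected error above: on a finite sample, the zero-one loss is usually estimated as the fraction of predictions that differ from the true labels. The sketch below is plain Python; the function name `zero_one_error` and the toy labels are made up for this example and are not part of the notebook.

```python
def zero_one_error(y_true, y_pred):
    """Empirical estimate of Err(phi) under the zero-one loss."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Each mismatch contributes a loss of 1, each match a loss of 0.
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)

# Toy labels (hypothetical, not taken from the Iris data).
y_true = [0, 1, 2, 1, 0]
y_hat = [0, 2, 2, 1, 1]
error = zero_one_error(y_true, y_hat)
print(error)  # 2 mistakes out of 5 -> 0.4
```

Accuracy, used later in this section, is simply one minus this quantity.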


  • Classifying signal from background events;
  • Diagnosing disease from symptoms;
  • Recognizing cats in pictures;
  • Identifying body parts with Kinect cameras;
  • ...


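  • Input data = NumPy arrays or SciPy sparse matrices;
  • Algorithms are expressed using high-level operations defined on matrices or vectors (similar to MATLAB):
      - exploit efficient low-level implementations;
      - keep code short and readable.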
from sklearn import datasets
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target


print (X_iris.shape, y_iris.shape)
print ('Feature names:{0}'.format(iris.feature_names))
print ('Target classes:{0}'.format(iris.target_names))
print ('First instance features:{0}'.format(X_iris[0]))
(150, 4) (150,)
Feature names:['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target classes:['setosa' 'versicolor' 'virginica']
First instance features:[5.1 3.5 1.4 0.2]


colormarkers = [ ['red','s'], ['greenyellow','o'], ['blue','x']]
for i in range(len(colormarkers)):
    px = X_iris[:, 0][y_iris == i]
    py = X_iris[:, 1][y_iris == i]
    plt.scatter(px, py, c=colormarkers[i][0], marker=colormarkers[i][1])

plt.title('Iris Dataset: Sepal width vs sepal length')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

for i in range(len(colormarkers)):
    px = X_iris[:, 2][y_iris == i]
    py = X_iris[:, 3][y_iris == i]
    plt.scatter(px, py, c=colormarkers[i][0], marker=colormarkers[i][1])

plt.title('Iris Dataset: petal width vs petal length')
plt.xlabel('Petal length')
plt.ylabel('Petal width')





  • Select your features,
  • Build a model from the available data, and
  • Evaluate how your model performs on previously unseen data.



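Our first step is to split the dataset into separate sets, using 75% of the instances to train our classifier and the remaining 25% to evaluate it (in this case taking only two features, sepal width and length). We will also perform feature scaling: for each feature, compute the mean, subtract it from the feature value, and divide the result by the standard deviation. After scaling, each feature has zero mean and a standard deviation of one. This standardization of values (which does not change the shape of their distribution, as you can verify by plotting the X values before and after scaling) is a common requirement of machine learning methods, to keep features with large values from weighing too heavily on the final result.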
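The scaling rule described here (subtract the mean, divide by the standard deviation) can be sketched by hand for a single feature. This is a minimal standard-library illustration, not the notebook's code: the helper `standardize` and the sample values are hypothetical, and the population standard deviation is used, which is also what `StandardScaler` computes.

```python
import statistics

def standardize(values):
    """Return (value - mean) / std for each value, using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical sepal lengths in cm.
sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]
scaled = standardize(sepal_length)

# After scaling, the mean is (numerically) zero and the standard deviation is one.
print(round(statistics.fmean(scaled), 6), round(statistics.pstdev(scaled), 6))
```

Note that the statistics are computed on the training data only and then reused for the test data, which is why the test set does not end up with exactly zero mean and unit deviation.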

from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Create dataset with only the first two attributes
X, y = X_iris[:, [0,1]], y_iris
# Test set will be the 25% taken randomly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
# Standardize the features
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)


print ('Training set mean:{:.2f} and standard deviation:{:.2f}'.format(np.average(X_train),np.std(X_train)))
print ('Testing set mean:{:.2f} and standard deviation:{:.2f}'.format(np.average(X_test),np.std(X_test)))
Training set mean:0.00 and standard deviation:1.00
Testing set mean:0.13 and standard deviation:0.71


colormarkers = [ ['red','s'], ['greenyellow','o'], ['blue','x']]
plt.figure('Training Data')
for i in range(len(colormarkers)):
    xs = X_train[:, 0][y_train == i]
    ys = X_train[:, 1][y_train == i]
    plt.scatter(xs, ys, c=colormarkers[i][0], marker=colormarkers[i][1])

plt.title('Training instances, after scaling')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')



import copy 
y_train_setosa = copy.copy(y_train) 
# Classes 1 and 2 in the training set are merged into a single class 1
y_train_setosa[y_train_setosa > 0]=1
y_test_setosa = copy.copy(y_test)
y_test_setosa[y_test_setosa > 0]=1

print ('New training target classes:\n{0}'.format(y_train_setosa))
New training target classes:
[1 0 1 1 1 0 0 1 0 1 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 0
 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 0 1 0 1 0 1 1 1 1 1 0 1 0 1 1
 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1


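Linear classification models have been studied for many years, and there are many different methods for building the separating hyperplane. We will use the SGDClassifier from scikit-learn to implement a linear model, including regularization. The classifier (actually a family of classifiers, as we will see) gets its name from Stochastic Gradient Descent, a very efficient numerical procedure for finding a local minimum of a function.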

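Gradient Descent was introduced by Louis Augustin Cauchy in 1847 to solve systems of equations. The idea rests on the observation that a multivariable function decreases fastest in the direction of its negative gradient (you can think of the gradient as a generalization of the derivative to several dimensions). If we want to find its minimum (at least a local one), we can move in the direction of the negative gradient. This is exactly what gradient descent does.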
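To make the idea concrete, here is a minimal sketch of plain (non-stochastic) gradient descent on a one-dimensional function. The function name `gradient_descent`, the step size and the number of steps are illustrative choices, not what SGDClassifier does internally.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        # Move a small step in the direction of the negative gradient.
        x = x - learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))
```

The stochastic variant follows the same update rule but estimates the gradient from one training instance (or a small batch) at a time, which is what makes it cheap on large datasets.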

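Every classifier in scikit-learn is created the same way: call the classifier's constructor with its configurable hyperparameters to create an instance. In this case we will use linear_model.SGDClassifier, telling scikit-learn to use the log loss function.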

from sklearn import linear_model 
clf = linear_model.SGDClassifier(loss='log', random_state=42)
print (clf)
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=42, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)


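Now we just call the fit method to train the classifier (that is, to build the model we will use later from the available training data). In our case, that is the setosa training set.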

clf.fit(X_train, y_train_setosa)

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/linear_model/stochastic_gradient.py:144: FutureWarning:

max_iter and tol parameters have been added in SGDClassifier in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=42, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)


print (clf.coef_,clf.intercept_)
[[ 31.0790909  -17.78632765]] [17.31337552]


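Now the really useful part: when we have a new flower, we just take its sepal length and width and call the classifier's predict method on the new instance. This works the same way regardless of which classifier we use or which method we used to build it.

print ('If the flower has 4.7 sepal length and 3.1 sepal width it is a {}'.format(
        iris.target_names[clf.predict(scaler.transform([[4.7, 3.1]]))]))
If the flower has 4.7 sepal length and 3.1 sepal width it is a ['setosa']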




clf2 = linear_model.SGDClassifier(loss='log', random_state=33)
clf2.fit(X_train, y_train) 
print (len(clf2.coef_))

Let us evaluate on the previous instance to find the three-class prediction. Scikit-learn tries the three classifiers.

scaler.transform([[4.7, 3.1]])
print(clf2.decision_function(scaler.transform([[4.7, 3.1]])))
clf2.predict(scaler.transform([[4.7, 3.1]]))
[[ 13.56767487   1.74380721 -37.36375592]]



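An estimator's performance is a measure of its effectiveness. The most obvious performance measure is accuracy: given a classifier and a set of instances, it simply measures the proportion of instances the classifier classifies correctly. For example, we can take the instances of the training set and compute the classifier's accuracy when predicting their target classes. Scikit-learn includes a metrics module that implements this (and many other) performance measures.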

from sklearn import metrics
y_train_pred = clf2.predict(X_train)
print ('Accuracy on the training set:{:.2f}'.format(metrics.accuracy_score(y_train, y_train_pred)))
Accuracy on the training set:0.82



y_pred = clf2.predict(X_test)
print ('Accuracy on the test set:{:.2f}'.format(metrics.accuracy_score(y_test, y_pred)))
Accuracy on the test set:0.68




print (metrics.confusion_matrix(y_test, y_pred))
[[ 8  0  0]
 [ 0  3  8]
 [ 0  4 15]]




print (metrics.classification_report(y_test, y_pred, target_names=iris.target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8
  versicolor       0.43      0.27      0.33        11
   virginica       0.65      0.79      0.71        19

   micro avg       0.68      0.68      0.68        38
   macro avg       0.69      0.69      0.68        38
weighted avg       0.66      0.68      0.66        38

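  • Precision computes the proportion of instances predicted as positive that are actually positive (it measures how right our classifier is when it says an instance is positive).
  • Recall computes the proportion of positive instances that were correctly identified (it measures how right our classifier is when faced with a positive instance).
  • The F1-score is the harmonic mean of precision and recall, and tries to combine both into a single number.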
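These three definitions can be checked by hand against the confusion matrix printed earlier, treating each class in turn as the positive one. The helper `per_class_scores` below is a hypothetical sketch in plain Python; it assumes the scikit-learn convention that rows are true classes and columns are predicted classes, and it does not guard against classes that are never predicted.

```python
def per_class_scores(cm, k):
    """Precision, recall and F1 for class k from a confusion matrix.

    cm[i][j] counts instances of true class i predicted as class j.
    """
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrix from the two-feature classifier above.
cm = [[8, 0, 0],
      [0, 3, 8],
      [0, 4, 15]]
for k, name in enumerate(["setosa", "versicolor", "virginica"]):
    p, r, f = per_class_scores(cm, k)
    print("{:<10} {:.2f} {:.2f} {:.2f}".format(name, p, r, f))
```

For versicolor this gives 3/7 precision and 3/11 recall, which matches the 0.43 and 0.27 in the report.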



# Test set will be the 25% taken randomly
X_train4, X_test4, y_train4, y_test4 = train_test_split(X_iris, y_iris, test_size=0.25, random_state=33)
# Standardize the features
scaler = preprocessing.StandardScaler().fit(X_train4)
X_train4 = scaler.transform(X_train4)
X_test4 = scaler.transform(X_test4)

# Build the classifier
clf3 = linear_model.SGDClassifier(loss='log', random_state=33)
clf3.fit(X_train4, y_train4) 

# Evaluate the classifier on the evaluation set
y_pred4 = clf3.predict(X_test4)
print (metrics.classification_report(y_test4, y_pred4, target_names=iris.target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8
  versicolor       0.78      0.64      0.70        11
   virginica       0.81      0.89      0.85        19

   micro avg       0.84      0.84      0.84        38
   macro avg       0.86      0.84      0.85        38
weighted avg       0.84      0.84      0.84        38




from sklearn import cluster
clf_sepal = cluster.KMeans(init='k-means++', n_clusters=3, random_state=33)
# Fit on the two sepal features only (columns 0 and 1)
clf_sepal.fit(X_train4[:, 0:2])
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=33, tol=0.0001, verbose=0)


print (clf_sepal.labels_)
[1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 2 1 2 1 0 0 1 1 0 0 2 0 1 2 2 1 1 0 0 2 1 0
 1 1 2 1 0 2 0 1 0 2 2 0 2 1 0 0 1 0 0 0 2 1 0 1 0 1 0 1 2 1 1 1 0 1 0 2 1
 0 0 0 0 2 2 0 1 1 2 1 0 0 1 1 1 0 1 1 0 2 1 2 1 2 0 2 0 0 0 1 1 2 1 1 1 2


print (y_train4[clf_sepal.labels_==0])
[0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0]
print (y_train4[clf_sepal.labels_==1])
[1 1 1 1 1 1 2 1 0 2 1 2 2 1 1 2 2 1 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
 2 1 2 1 1 2 1]
print (y_train4[clf_sepal.labels_==2])
[2 2 1 2 2 2 2 1 1 2 2 1 2 2 1 1 2 2 2 2 2 2 1 2 2]


colormarkers = [ ['red','s'], ['greenyellow','o'], ['blue','x']]
step = .01 
margin = .1   
sl_min, sl_max = X_train4[:, 0].min()-margin, X_train4[:, 0].max() + margin
sw_min, sw_max = X_train4[:, 1].min()-margin, X_train4[:, 1].max() + margin
sl, sw = np.meshgrid(
    np.arange(sl_min, sl_max, step),
    np.arange(sw_min, sw_max, step))
Zs = clf_sepal.predict(np.c_[sl.ravel(), sw.ravel()]).reshape(sl.shape)
centroids_s = clf_sepal.cluster_centers_


plt.imshow(Zs, interpolation='nearest', extent=(sl.min(), sl.max(), sw.min(), sw.max()), cmap= plt.cm.Pastel1, aspect='auto', origin='lower')
for j in [0,1,2]:
    px = X_train4[:, 0][y_train4 == j]
    py = X_train4[:, 1][y_train4 == j]
    plt.scatter(px, py, c=colormarkers[j][0], marker= colormarkers[j][1])
plt.scatter(centroids_s[:, 0], centroids_s[:, 1],marker='*',linewidths=3, color='black', zorder=10)
plt.title('K-means clustering on the Iris dataset using Sepal dimensions\nCentroids are marked with stars')
plt.xlim(sl_min, sl_max)
plt.ylim(sw_min, sw_max)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")


clf_petal = cluster.KMeans(init='k-means++', n_clusters=3, random_state=33)
# Fit on the two petal features only (columns 2 and 3)
clf_petal.fit(X_train4[:, 2:4])
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=33, tol=0.0001, verbose=0)
print (y_train4[clf_petal.labels_==0])
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0]
print (y_train4[clf_petal.labels_==1])
[1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
print (y_train4[clf_petal.labels_==2])
[2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2]


colormarkers = [ ['red','s'], ['greenyellow','o'], ['blue','x']]
step = .01 
margin = .1
sl_min, sl_max = X_train4[:, 2].min()-margin, X_train4[:, 2].max() + margin
sw_min, sw_max = X_train4[:, 3].min()-margin, X_train4[:, 3].max() + margin
sl, sw = np.meshgrid(
    np.arange(sl_min, sl_max, step),
    np.arange(sw_min, sw_max, step))
Zs = clf_petal.predict(np.c_[sl.ravel(), sw.ravel()]).reshape(sl.shape)
centroids_s = clf_petal.cluster_centers_
plt.imshow(Zs, interpolation='nearest', extent=(sl.min(), sl.max(), sw.min(), sw.max()), cmap= plt.cm.Pastel1, aspect='auto', origin='lower')
for j in [0,1,2]:
    px = X_train4[:, 2][y_train4 == j]
    py = X_train4[:, 3][y_train4 == j]
    plt.scatter(px, py, c=colormarkers[j][0], marker= colormarkers[j][1])
plt.scatter(centroids_s[:, 0], centroids_s[:, 1],marker='*',linewidths=3, color='black', zorder=10)
plt.title('K-means clustering on the Iris dataset using Petal dimensions\nCentroids are marked with stars')
plt.xlim(sl_min, sl_max)
plt.ylim(sw_min, sw_max)
plt.xlabel("Petal length")
plt.ylabel("Petal width")


clf = cluster.KMeans(init='k-means++', n_clusters=3, random_state=33)
# Fit on all four features (assumed; the fit call is missing from the export)
clf.fit(X_train4)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=33, tol=0.0001, verbose=0)
print (y_train[clf.labels_==0])
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0]
print (y_train[clf.labels_==1])
[1 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1]
print (y_train[clf.labels_==2])
[2 2 1 2 2 1 2 1 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2]


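# Assumed reconstruction: cluster assignments of the test set from the four-feature model
y_pred = clf.predict(X_test4)
print (metrics.classification_report(y_test, y_pred, target_names=['setosa','versicolor','virginica']))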
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8
  versicolor       0.58      0.64      0.61        11
   virginica       0.78      0.74      0.76        19

   micro avg       0.76      0.76      0.76        38
   macro avg       0.79      0.79      0.79        38
weighted avg       0.77      0.76      0.77        38

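# Assumed reconstruction: cluster assignments of the test set from the petal-only model
y_pred_petal = clf_petal.predict(X_test4[:, 2:4])
print (metrics.classification_report(y_test, y_pred_petal, target_names=['setosa','versicolor','virginica']))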
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8
  versicolor       0.85      1.00      0.92        11
   virginica       1.00      0.89      0.94        19

   micro avg       0.95      0.95      0.95        38
   macro avg       0.95      0.96      0.95        38
weighted avg       0.96      0.95      0.95        38





from sklearn.datasets import load_boston
boston = load_boston()
print ('Boston dataset shape:{}'.format(boston.data.shape))
Boston dataset shape:(506, 13)
print (boston.feature_names)
 'B' 'LSTAT']



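Cross-validation usually consists of the following steps:
1. Partition the dataset into k different subsets.
2. Create k different models by training on k-1 subsets and testing on the remaining subset.
3. Measure the performance of each of the k models and use the average as the performance value.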
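The three steps can be sketched in plain Python as follows. Everything here is illustrative: `k_fold_indices`, `cross_validate`, the majority-class "model" and the toy labels are made up for this example, and the folds are contiguous and unshuffled for readability (in practice the data is usually shuffled first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / k

# Toy model (hypothetical): always predict the most frequent training label.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)

def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)

xs = list(range(10))
ys = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
mean_accuracy = cross_validate(xs, ys, k=5, fit=fit_majority, score=accuracy)
print(mean_accuracy)
```

Because every instance is used for testing exactly once, the averaged score is less sensitive to one lucky or unlucky split than a single train/test partition.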