A histogram is mainly used to examine how data is distributed.

In [55]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

In [56]:

df = pd.read_csv('/Users/spark/Downloads/nyc_fare.csv')

In [4]:

df.describe()

Out[4]:

	fare_amount	surcharge	mta_tax	tip_amount	tolls_amount	total_amount
count	846945.000000	846945.000000	846945.000000	846945.00000	846945.000000	846945.000000
mean	12.190578	0.320303	0.499305	1.34466	0.232142	14.587073
std	9.514150	0.772642	0.057844	2.09149	1.109164	11.380950
min	-648.420000	-1.000000	-0.500000	0.00000	0.000000	-52.500000
25%	6.500000	0.000000	0.500000	0.00000	0.000000	8.000000
50%	9.500000	0.000000	0.500000	1.00000	0.000000	11.000000
75%	14.000000	0.500000	0.500000	2.00000	0.000000	16.500000
max	620.010000	628.840000	41.490000	200.00000	100.660000	620.010000

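Here we can see that although the maximum of fare_amount is about 620, the 75th percentile is only 14, so most values should be fairly small. Below we use 50 as the upper bound when looking at its distribution.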
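One way to sanity-check a cutoff like 50 is to compute what share of the values it actually covers. The sketch below uses a small made-up Series in place of df.fare_amount; the sample values and the names fares, cutoff, and share_in_range are assumptions for illustration only, not data from nyc_fare.csv.

```python
import pandas as pd

# Hypothetical sample standing in for df.fare_amount (not real data)
fares = pd.Series([6.5, 9.5, 14.0, 52.0, 620.01, -3.0, 8.0, 11.0, 25.0, 49.9])

cutoff = 50.0

# Fraction of values that fall inside [0, cutoff], both ends inclusive
share_in_range = fares.between(0.0, cutoff).mean()

# Upper quantiles give a rough idea of how far the tail extends
upper_quantiles = fares.quantile([0.75, 0.95, 0.99])

print(share_in_range)
print(upper_quantiles)
```

On the real data, replacing fares with df.fare_amount should show whether 50 leaves out more than a small tail.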

In [57]:

bin_array = np.linspace(start=0., stop=50., num=100)

In [58]:

df.fare_amount.hist(bins=bin_array)

Out[58]:

<matplotlib.axes._subplots.AxesSubplot at 0x116bdff60>
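Two details of these bin edges are easy to miss: 100 edges define 99 bins, and values outside the edges are left out of the histogram. The sketch below illustrates both with np.histogram, which matplotlib uses for binning; the sample array is hypothetical and only stands in for the fare column.

```python
import numpy as np

# Same bin edges as in the cell above
bin_array = np.linspace(start=0., stop=50., num=100)

# 100 edges define 99 bins, each 50/99 wide (about 0.505)
n_bins = len(bin_array) - 1
bin_width = bin_array[1] - bin_array[0]

# Hypothetical values; -3.0 and 620.01 lie outside the edges
sample = np.array([-3.0, 6.5, 9.5, 14.0, 50.0, 620.01])
counts, edges = np.histogram(sample, bins=bin_array)

# Out-of-range values are ignored; the last bin includes its right edge,
# so 50.0 is still counted
n_counted = counts.sum()

print(n_bins, bin_width, n_counted)
```

If bins of exactly 0.5 are preferred, np.linspace(0., 50., num=101) would give 101 edges and 100 bins.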
