Histograms are mainly used to examine how data is distributed.

Load the data

In [55]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
In [56]:
df = pd.read_csv('/Users/spark/Downloads/nyc_fare.csv')

Inspect the data

In [4]:
df.describe()
Out[4]:
         fare_amount      surcharge        mta_tax    tip_amount   tolls_amount   total_amount
count  846945.000000  846945.000000  846945.000000  846945.00000  846945.000000  846945.000000
mean       12.190578       0.320303       0.499305       1.34466       0.232142      14.587073
std         9.514150       0.772642       0.057844       2.09149       1.109164      11.380950
min      -648.420000      -1.000000      -0.500000       0.00000       0.000000     -52.500000
25%         6.500000       0.000000       0.500000       0.00000       0.000000       8.000000
50%         9.500000       0.000000       0.500000       1.00000       0.000000      11.000000
75%        14.000000       0.500000       0.500000       2.00000       0.000000      16.500000
max       620.010000     628.840000      41.490000     200.00000     100.660000     620.010000

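Although the maximum of fare_amount is about 620, its 75th percentile is only 14, so most fares are fairly small. Below we use 50 as the upper bound when looking at the distribution.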
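
One way to sanity-check a cutoff like 50 is to measure what share of the values actually falls inside the range. The sketch below is only illustrative: `fares` is a small made-up Series standing in for `df.fare_amount`, and its values are not taken from nyc_fare.csv.

```python
import pandas as pd

# Hypothetical stand-in for df.fare_amount; values are made up for illustration.
fares = pd.Series([6.5, 9.5, 14.0, 52.0, 620.01, -3.0, 8.0, 11.0, 25.0, 7.5])

upper = 50.0  # cutoff chosen from the describe() output

# Boolean mask of fares inside the range [0, upper] (both ends inclusive).
in_range = fares.between(0.0, upper)

# The mean of a boolean mask is the fraction of True values.
share = in_range.mean()

# High quantiles show how far the tail extends beyond the cutoff.
tail = fares.quantile([0.75, 0.95, 0.99])

print(f"share within [0, {upper}]: {share:.2f}")
print(tail)
```

On the real data, replacing `fares` with `df.fare_amount` should show whether the range 0 to 50 leaves out more than a small tail.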

Visualize the data

In [57]:
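# 100 evenly spaced bin edges between 0 and 50
bin_array = np.linspace(start=0., stop=50., num=100)
In [58]:
# Histogram of fares using the edges above; fares outside 0-50 are not counted
df.fare_amount.hist(bins=bin_array)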
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x116bdff60>
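
A note on the bin edges: `np.linspace` with `num=100` returns 100 edges, which define 99 bins rather than 100. The sketch below shows this, along with a possible alternative using 101 edges for bins exactly 0.5 wide. That alternative is an assumption about what might be wanted, not what the notebook does, and `fares` is again made-up data.

```python
import numpy as np

# Same edges as in the notebook cell: 100 edges define 99 bins.
bin_array = np.linspace(start=0., stop=50., num=100)
n_bins = len(bin_array) - 1
bin_width = bin_array[1] - bin_array[0]

# Alternative (an assumption about intent, not the original code):
# 101 edges give 100 bins, each 0.5 wide.
even_edges = np.linspace(start=0., stop=50., num=101)

# Hypothetical fares, made up for illustration.
fares = np.array([6.5, 9.5, 14.0, 52.0, 620.01, -3.0, 8.0, 11.0, 25.0, 7.5])

# With explicit edges, np.histogram skips values outside the outermost
# edges, so negative fares and fares above 50 are not counted.
counts, edges = np.histogram(fares, bins=even_edges)

print("bins from notebook edges:", n_bins, "width:", bin_width)
print("values counted with even edges:", counts.sum())
```

Either choice of edges gives a similar picture at this scale; the even edges simply make the bin boundaries land on round fare values.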
