Udacity Data Analysis: Exploring US Bikeshare Data


Published: 2018-11-19 10:12

Author: mage.han. Source: CSDN


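Overview

Use Python to explore data from the bike share systems of three major US cities: Chicago, New York City, and Washington, DC. Write code to import the data and answer interesting questions about it by computing descriptive statistics. Then write a script that takes raw input and creates an interactive experience in the terminal to present these statistics.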

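Bike Share Data

Over the past decade, bicycle-sharing systems have grown in number and popularity in cities around the world. They let users rent a bicycle for a short time for a fee. A user can borrow a bike at point A and return it at point B, or return it to the same location after a ride. Each bike can serve several users per day.

Thanks to advances in information technology, users can easily access a dock in the system to unlock or return a bicycle. These technologies also produce a wealth of data that can be used to explore how bike share systems are used.

In this project, you will use data provided by Motivate, a bike share system provider for many major cities in the United States, to uncover usage patterns. You will compare system usage across three cities: Chicago, New York City, and Washington, DC.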

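The Datasets

Data for the first six months of 2017 is provided for all three cities. All three data files contain the same six core columns:

Start Time (e.g., 2017-01-01 00:07:57)

End Time (e.g., 2017-01-01 00:20:53)

Trip Duration (in seconds, e.g., 776)

Start Station (e.g., Broadway & Barry Ave)

End Station (e.g., Sedgwick St & North Ave)

User Type (Subscriber/Registered or Customer/Casual)

The Chicago and New York City files also contain the following two columns (the original post showed the data format in an image):

Gender

Birth Year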
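To make the column layout concrete, here is a minimal sketch that loads a tiny in-memory sample and checks which columns are present. The two rows are hypothetical, built from the example values listed above, and the names `SAMPLE_CSV`, `CORE_COLUMNS`, `OPTIONAL_COLUMNS`, and `describe_columns` are illustrative and not part of the project.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the column layout described above.
SAMPLE_CSV = """Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender,Birth Year
2017-01-01 00:07:57,2017-01-01 00:20:53,776,Broadway & Barry Ave,Sedgwick St & North Ave,Subscriber,Male,1985
2017-01-02 09:15:00,2017-01-02 09:25:00,600,Broadway & Barry Ave,Sedgwick St & North Ave,Customer,,
"""

CORE_COLUMNS = ['Start Time', 'End Time', 'Trip Duration',
                'Start Station', 'End Station', 'User Type']
OPTIONAL_COLUMNS = ['Gender', 'Birth Year']


def describe_columns(df):
    """Return (missing core columns, optional columns that are present)."""
    missing = [c for c in CORE_COLUMNS if c not in df.columns]
    present = [c for c in OPTIONAL_COLUMNS if c in df.columns]
    return missing, present


df = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=['Start Time', 'End Time'])
missing, present = describe_columns(df)
print('Missing core columns:', missing)
print('Optional columns present:', present)

# Trip Duration should match End Time minus Start Time, in seconds.
elapsed = (df['End Time'] - df['Start Time']).dt.total_seconds()
print(elapsed.tolist())
```

A Washington-style file would simply report no optional columns, which is why the user statistics later check for `Gender` and `Birth Year` before using them.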

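Questions

1. Which month is most common in the start times (the Start Time column)?

2. Which day of the week (e.g., Monday, Tuesday) is most common in the start times?

3. Which hour of the day is most common in the start times?

4. What is the total trip duration (Trip Duration), and what is the average trip duration?

5. Which start station (Start Station) is most popular, and which end station (End Station) is most popular?

6. Which trip is most popular (that is, which combination of start station and end station)?

7. How many users are there of each user type?

8. How many users are there of each gender?

9. What are the earliest, most recent, and most common years of birth?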

Project Code

Import the libraries and define the datasets

import time

import pandas as pd
import numpy as np

# map each city name (lower case) to its CSV file
CITY_DATA = {
    'chicago': 'chicago.csv',
    'new york city': 'new_york_city.csv',
    'washington': 'washington.csv',
}

Input function

def input_mod(input_print, enterable_list):
    """
    Prompt until the user enters one of the allowed values.
    Shared by the city and month prompts to avoid duplicated code.

    Args:
        (str) input_print - the question to ask
        (list) enterable_list - allowed answers (cities or months), in title case
    Returns:
        (str) ret - the user's choice, in lower case
    """
    while True:
        # title() lets the user type the answer in any letter case
        ret = input(input_print).strip().title()
        if ret in enterable_list:
            return ret.lower()
        print('Sorry, please enter one of: {}.'.format(', '.join(enterable_list)))
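Because the prompt loop reads straight from the terminal, it is awkward to exercise without typing. One possible variant is sketched below; the function name `prompt_choice` and its `reader` and `writer` parameters are assumptions made for illustration and do not appear in the project. Passing the input and output functions in lets scripted replies stand in for a user.

```python
def prompt_choice(prompt, choices, reader=input, writer=print):
    """Keep asking until the reply matches one of `choices`, ignoring case.

    Returns the matched choice in lower case.
    """
    allowed = {choice.lower() for choice in choices}
    while True:
        reply = reader(prompt).strip().lower()
        if reply in allowed:
            return reply
        writer('Sorry, please enter one of: {}.'.format(', '.join(choices)))


# Scripted replies stand in for a user typing at the terminal:
# the first is rejected, the second is accepted.
replies = iter(['Boston', '  new york CITY '])
messages = []
city = prompt_choice(
    'Which city? ',
    ['Chicago', 'New York City', 'Washington'],
    reader=lambda _prompt: next(replies),
    writer=messages.append,
)
print(city)
print(messages)
```

With the defaults (`input` and `print`) the function behaves like an ordinary interactive prompt.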

Choosing the data

def see_datas(data):
    """
    Ask the user for one filter value.

    Args:
        (str) data - which value to ask for: 'cities', 'months' or 'days'
    Returns:
        (str) the chosen city or month in lower case, or the chosen day name (e.g. 'Sunday')
    """
    # build the lists and dictionary (cities, months and days) the user can choose from
    cities = ['Chicago', 'New York City', 'Washington']
    months = ['January', 'February', 'March', 'April', 'May', 'June']
    days = {'1': 'Sunday', '2': 'Monday', '3': 'Tuesday', '4': 'Wednesday',
            '5': 'Thursday', '6': 'Friday', '7': 'Saturday'}

    # get user input about cities
    if data == 'cities':
        return input_mod('Would you like to see data for Chicago, New York City or Washington: \n', cities)
    # get user input about months
    if data == 'months':
        return input_mod('Which month? January, February, March, April, May or June?\n', months)
    # get user input about weekdays
    if data == 'days':
        while True:
            day = input('Which day? Please type an integer (e.g., 1=Sunday): \n').strip()
            if day in days:
                return days[day]
            print('Sorry, please enter a valid integer (e.g., 1=Sunday).')
    # any other key is a programming error, so fail loudly instead of looping forever
    raise ValueError("data must be 'cities', 'months' or 'days'")

Get the city, month, and day to analyze from user input

def get_filters():
    """
    Asks user to specify a city, month, and day to analyze.

    Returns:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    """
    print('Hello! Let\'s explore some US bikeshare data!')

    # get user input for city (chicago, new york city, washington)
    city = see_datas('cities')

    # get user input for the time filter; loop until the answer is valid
    while True:
        enter = input('Would you like to filter the data by month, day, both, or not at all? Type "none" for no time filter.\n').strip().lower()
        if enter == 'none':
            month = 'all'
            day = 'all'
            break
        elif enter == 'both':
            month = see_datas('months')
            day = see_datas('days')
            break
        elif enter == 'month':
            month = see_datas('months')
            day = 'all'
            break
        elif enter == 'day':
            month = 'all'
            day = see_datas('days')
            break
        else:
            print('Sorry, please enter month, day, both, or none.')

    print('-'*40)
    return city, month, day

Load the data for the chosen city, month, and day

def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.

    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    Returns:
        df - Pandas DataFrame containing city data filtered by month and day
    """
    # load data file into a dataframe
    df = pd.read_csv(CITY_DATA[city])

    # convert the Start Time column to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])

    # extract month and day of week from Start Time to create new columns
    df['month'] = df['Start Time'].dt.month
    # dt.day_name() replaces dt.weekday_name, which was removed in pandas 1.0
    df['day_of_week'] = df['Start Time'].dt.day_name()

    # filter by month if applicable
    if month != 'all':
        # use the index of the months list to get the corresponding int
        months = ['january', 'february', 'march', 'april', 'may', 'june']
        month = months.index(month) + 1

        # filter by month to create the new dataframe
        df = df[df['month'] == month]

    # filter by day of week if applicable
    if day != 'all':
        # filter by day of week to create the new dataframe
        df = df[df['day_of_week'] == day.title()]

    return df
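The main function calls time_stats(df), but the listing in this article does not include that function. The following is a minimal sketch of what it might look like, answering questions 1 to 3 with the same print-and-time pattern as the other statistics functions. It is an assumption, not the author's code: it reads the Start Time column directly, and returning the three values is a choice made here so the result can be checked. The three sample rows are hypothetical.

```python
import calendar
import time

import pandas as pd


def time_stats(df):
    """Display the most frequent month, day of week, and start hour."""
    print('\nCalculating The Most Frequent Times of Travel...\n')
    start_time = time.time()

    # Parse again so the function also works on a frame with string timestamps.
    start = pd.to_datetime(df['Start Time'])

    # mode() returns the most frequent value(s); take the first one.
    common_month = int(start.dt.month.mode()[0])
    common_day = start.dt.day_name().mode()[0]
    common_hour = int(start.dt.hour.mode()[0])

    print('Most common month: {}.'.format(calendar.month_name[common_month]))
    print('Most common day of week: {}.'.format(common_day))
    print('Most common start hour: {}.'.format(common_hour))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
    return common_month, common_day, common_hour


# Hypothetical sample: two Sunday trips in January and one Friday trip in February.
sample = pd.DataFrame({'Start Time': ['2017-01-01 00:07:57',
                                      '2017-01-08 00:30:00',
                                      '2017-02-03 17:45:00']})
result = time_stats(sample)
```

If the data has already been filtered to a single month or day, the corresponding answer is simply that month or day.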

Compute and display the most popular stations and trip

def station_stats(df):
    """Displays statistics on the most popular stations and trip."""
    print('\nCalculating The Most Popular Stations and Trip...\n')
    start_time = time.time()

    # display most commonly used start station
    common_start = df['Start Station'].value_counts().index[0]
    print('Most commonly used start station: {}.'.format(common_start))

    # display most commonly used end station
    common_end = df['End Station'].value_counts().index[0]
    print('Most commonly used end station: {}.'.format(common_end))

    # display most frequent combination of start station and end station;
    # a local Series is used so the caller's DataFrame gets no extra column
    combination = df['Start Station'] + ' / ' + df['End Station']
    common_combine = combination.value_counts().index[0]
    print('Most frequent combination of start and end station trip: {}.'.format(common_combine))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
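Joining the two station names into one string works, but the pair can no longer be split back reliably if a station name happens to contain the separator. An alternative, sketched here under the assumption that the column names match the dataset description, is to group by both columns and keep the pair as a tuple. The helper name `most_common_trip` and the placeholder station names are hypothetical.

```python
import pandas as pd


def most_common_trip(df):
    """Return ((start station, end station), count) for the most frequent pair."""
    # size() counts the rows in each (start, end) group.
    counts = df.groupby(['Start Station', 'End Station']).size()
    # idxmax() on a two-level index returns the (start, end) tuple.
    return counts.idxmax(), int(counts.max())


# Placeholder station names, for illustration only.
trips = pd.DataFrame({
    'Start Station': ['A', 'A', 'B', 'A'],
    'End Station':   ['B', 'B', 'A', 'C'],
})
pair, count = most_common_trip(trips)
print('Most frequent trip: {} -> {} ({} trips)'.format(pair[0], pair[1], count))
```

Note that A to B and B to A count as different trips here, the same as in the string-based version.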

Compute and display the total and average trip duration

def trip_duration_stats(df):
    """Displays statistics on the total and average trip duration."""
    print('\nCalculating Trip Duration...\n')
    start_time = time.time()

    # display total travel time
    total_time = df['Trip Duration'].sum()
    print('Total travel time: {} seconds.'.format(total_time))

    # display mean travel time
    mean_time = df['Trip Duration'].mean()
    print('Mean travel time: {} seconds.'.format(mean_time))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
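Totals over several months of trips come out as very large numbers of seconds, which are hard to read. A small helper such as the following could make the output friendlier; `format_duration` is a hypothetical name, not part of the project, and the output layout is just one possible choice.

```python
def format_duration(seconds):
    """Format a number of seconds as days, hours, minutes, and seconds."""
    total = int(round(seconds))
    days, rem = divmod(total, 86400)    # 86400 seconds per day
    hours, rem = divmod(rem, 3600)      # 3600 seconds per hour
    minutes, secs = divmod(rem, 60)
    return '{}d {:02d}h {:02d}m {:02d}s'.format(days, hours, minutes, secs)


# 776 seconds is the example trip duration from the dataset description.
print(format_duration(776))
print(format_duration(90061))
```

It could be applied to both the total and the mean before printing, for example `format_duration(df['Trip Duration'].sum())`.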

Compute and display statistics on bike share users

def user_stats(df):
    """Displays statistics on bikeshare users."""
    print('\nCalculating User Stats...\n')
    start_time = time.time()

    # display counts of user types; loop so any number of types is handled
    user_type = df['User Type'].value_counts()
    print('User type')
    for name, count in user_type.items():
        print('{}: {}'.format(name, count))

    # display counts of gender (the column exists only for some cities)
    if 'Gender' in df.columns:
        user_gender = df['Gender'].value_counts()
        # get() avoids a KeyError when a filter leaves no rows for one gender
        print('Male: {0}\nFemale: {1}'.format(user_gender.get('Male', 0), user_gender.get('Female', 0)))
    else:
        print("Sorry, this city doesn't have gender data.")

    # display earliest, most recent, and most common year of birth
    if 'Birth Year' in df.columns:
        earliest_birth = df['Birth Year'].min()
        recent_birth = df['Birth Year'].max()
        common_birth = df['Birth Year'].value_counts().index[0]
        print('Earliest user year of birth: %i.' % earliest_birth)
        print('Most recent user year of birth: %i.' % recent_birth)
        print('Most common user year of birth: %i.' % common_birth)
    else:
        print("Sorry, this city doesn't have birth year data.")

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

Main function

def main():
    while True:
        city, month, day = get_filters()
        df = load_data(city, month, day)

        # NOTE: time_stats() is called here, but its listing is missing from this article
        time_stats(df)
        station_stats(df)
        trip_duration_stats(df)
        user_stats(df)

        restart = input('\nWould you like to restart? Enter yes or no.\n')
        if restart.strip().lower() != 'yes':
            break


if __name__ == "__main__":
    main()

Link: https://pan.baidu.com/s/1sSgbXBaSy1IxIfJqoMil2w Password: m55o
