Udacity Data Analysis: Exploring US Bikeshare Data

Published: 2018-11-19 10:12


 Author: mage.han    Source: CSDN

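  Overview
  Use Python to explore data from the bike-share systems of three major US cities: Chicago, New York City, and Washington, DC. Write code to import the data and answer interesting questions by computing descriptive statistics. Then write a script that takes raw input and creates an interactive experience in the terminal to present these statistics.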
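  Bike-share data
  Over the past decade, bike-share systems have grown in number and in popularity in cities around the world. They let users rent a bicycle for a short time for a fee. A user can pick up a bike at point A and return it at point B, or, if they just want to go for a ride, return it to the same location. Each bike can serve several users a day.
  Thanks to rapid advances in information technology, users can easily access a dock in the system to unlock or return a bike. These technologies also generate a wealth of data that lets us explore how these bike-share systems are used.
  In this project, you will use data provided by Motivate, a bike-share system provider operating in many large US cities, to explore bike-share usage patterns. You will compare system usage across three cities: Chicago, New York City, and Washington, DC.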
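  Dataset
  Data for the first half of 2017 is provided for all three cities. All three data files contain the same six core columns:
  Start Time (e.g. 2017-01-01 00:07:57)
  End Time (e.g. 2017-01-01 00:20:53)
  Trip Duration (in seconds, e.g. 776)
  Start Station (e.g. Broadway & Barry Ave)
  End Station (e.g. Sedgwick St & North Ave)
  User Type (Subscriber/Registered or Customer/Casual)
  The Chicago and New York City files also contain the following two columns (the original post illustrates the data format with an image):
  Gender
  Birth Year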
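  Because the Washington file lacks Gender and Birth Year, code that touches those columns should check for them first. A minimal sketch, using a tiny in-memory CSV with made-up rows as a stand-in for a real city file; the helper name optional_columns is hypothetical:

```python
import io
import pandas as pd

# Made-up sample rows standing in for a real file such as washington.csv.
SAMPLE_CSV = """Start Time,End Time,Trip Duration,Start Station,End Station,User Type
2017-01-01 00:07:57,2017-01-01 00:20:53,776,Station A,Station B,Subscriber
2017-01-02 09:15:00,2017-01-02 09:25:00,600,Station B,Station A,Customer
"""

# Columns that only some of the city files provide.
OPTIONAL_COLUMNS = ["Gender", "Birth Year"]


def optional_columns(df):
    """Return the optional columns that are actually present in df."""
    return [col for col in OPTIONAL_COLUMNS if col in df.columns]


# Parse both timestamp columns as datetimes while reading.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["Start Time", "End Time"])
print(list(df.columns))
print(optional_columns(df))  # empty list: this sample has neither optional column
```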
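  Questions
  1. Which month is the most common in the start times (the Start Time column)?
  2. Which day of the week (e.g. Monday, Tuesday) is the most common in the start times?
  3. Which hour of the day is the most common in the start times?
  4. What is the total trip duration (Trip Duration), and what is the average trip duration?
  5. Which start station (Start Station) is the most popular, and which end station (End Station) is the most popular?
  6. Which trip is the most popular (that is, which combination of start station and end station)?
  7. How many users are there of each user type?
  8. How many users are there of each gender?
  9. What are the earliest, the most recent, and the most common birth years?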
  Project code
  Import the libraries and define the data files
    import time
    import pandas as pd
    import numpy as np

    # map each city name to its CSV data file
    CITY_DATA = { 'chicago': 'chicago.csv',
                  'new york city': 'new_york_city.csv',
                  'washington': 'washington.csv' }
  Input function
    def input_mod(input_print, enterable_list):
        """
        Prompt repeatedly until the user enters one of the allowed values.
        Args:
            (str) input_print - the question shown to the user
            (list) enterable_list - allowed values in title case (cities or months)
        Returns:
            (str) ret - the user's choice in lower case
        """
        while True:
            ret = input(input_print).strip().title()
            if ret in enterable_list:
                return ret.lower()
            print('Sorry, please enter one of: {}.'.format(', '.join(enterable_list)))
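  The validation step inside input_mod can be separated from the call to input(), which makes it easy to check without typing anything. A minimal sketch; the helper name normalize_choice is hypothetical and not part of the original script:

```python
def normalize_choice(raw, allowed):
    """Return the lower-case choice if `raw` matches an entry of `allowed`, else None.

    `allowed` is expected to hold title-case strings, e.g. 'New York City'.
    """
    candidate = raw.strip().title()
    return candidate.lower() if candidate in allowed else None


cities = ['Chicago', 'New York City', 'Washington']
print(normalize_choice('  new york city ', cities))
print(normalize_choice('boston', cities))
```

  A prompt loop would then call input() and keep asking for as long as normalize_choice returns None.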
  Choosing the data
    def see_datas(data):
        """
        Ask the user to choose a city, a month, or a day of the week.
        Args:
            (str) data - which choice to ask for: 'cities', 'months' or 'days'
        Returns:
            (str) the chosen city or month in lower case, or the day name in title case
        """
        # build the lists and dictionary (cities, months and days) of valid choices
        cities = ['Chicago', 'New York City', 'Washington']
        months = ['January', 'February', 'March', 'April', 'May', 'June']
        days = {'1': 'Sunday', '2': 'Monday', '3': 'Tuesday', '4': 'Wednesday',
                '5': 'Thursday', '6': 'Friday', '7': 'Saturday'}
        # get user input about cities
        if data == 'cities':
            return input_mod('Would you like to see data for Chicago, New York City or Washington: \n', cities)
        # get user input about months
        elif data == 'months':
            return input_mod('Which month? January, February, March, April, May or June?\n', months)
        # get user input about weekdays
        elif data == 'days':
            while True:
                day = input('Which day? Please type an integer (e.g., 1=Sunday): \n').strip()
                if day in days:
                    return days[day]
                print('Sorry, please enter an integer from 1 to 7 (e.g., 1=Sunday).')
        # any other argument is a programming error; fail instead of looping forever
        raise ValueError('Unknown data type: {}'.format(data))
  Get the "city, month, day" to analyze from user input
    def get_filters():
        """
        Asks user to specify a city, month, and day to analyze.
        Returns:
            (str) city - name of the city to analyze
            (str) month - name of the month to filter by, or "all" to apply no month filter
            (str) day - name of the day of week to filter by, or "all" to apply no day filter
        """
        print('Hello! Let\'s explore some US bikeshare data!')
        # get user input for city (chicago, new york city, washington)
        city = see_datas('cities')
        # ask which time filter to apply, then get the month and/or day of week
        while True:
            enter = input('Would you like to filter the data by month, day, both, or not at all? Type "none" for no time filter.\n').strip().lower()
            if enter == 'none':
                month = 'all'
                day = 'all'
                break
            elif enter == 'both':
                month = see_datas('months')
                day = see_datas('days')
                break
            elif enter == 'month':
                month = see_datas('months')
                day = 'all'
                break
            elif enter == 'day':
                month = 'all'
                day = see_datas('days')
                break
            else:
                print('Sorry, please enter month, day, both, or none.')
        print('-'*40)
        return city, month, day

  Load the data for the selected "city, month, day"
    def load_data(city, month, day):
        """
        Loads data for the specified city and filters by month and day if applicable.
        Args:
            (str) city - name of the city to analyze
            (str) month - name of the month to filter by, or "all" to apply no month filter
            (str) day - name of the day of week to filter by, or "all" to apply no day filter
        Returns:
            df - Pandas DataFrame containing city data filtered by month and day
        """
        # load data file into a dataframe
        df = pd.read_csv(CITY_DATA[city])
        # convert the Start Time column to datetime
        df['Start Time'] = pd.to_datetime(df['Start Time'])
        # extract month and day of week from Start Time to create new columns
        df['month'] = df['Start Time'].dt.month
        # dt.weekday_name was removed in pandas 1.0; dt.day_name() replaces it
        df['day_of_week'] = df['Start Time'].dt.day_name()
        # filter by month if applicable
        if month != 'all':
            # use the index of the months list to get the corresponding int
            months = ['january', 'february', 'march', 'april', 'may', 'june']
            month = months.index(month) + 1
            # filter by month to create the new dataframe
            df = df[df['month'] == month]
        # filter by day of week if applicable
        if day != 'all':
            # filter by day of week to create the new dataframe
            df = df[df['day_of_week'] == day.title()]
        return df
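  The main() function at the end of this post calls time_stats(df), but the post does not list that function. A minimal sketch of what it might look like, assuming it answers questions 1 to 3 by taking the mode of each component of Start Time; the helper most_common_times and the sample rows are hypothetical, not the author's code:

```python
import calendar
import time

import pandas as pd


def most_common_times(df):
    """Return (month name, weekday name, hour) occurring most often in 'Start Time'."""
    start = pd.to_datetime(df['Start Time'])
    month = calendar.month_name[int(start.dt.month.mode()[0])]
    day = start.dt.day_name().mode()[0]
    hour = int(start.dt.hour.mode()[0])
    return month, day, hour


def time_stats(df):
    """Displays statistics on the most frequent times of travel."""
    print('\nCalculating The Most Frequent Times of Travel...\n')
    start_time = time.time()
    month, day, hour = most_common_times(df)
    print('Most common month: {}.'.format(month))
    print('Most common day of week: {}.'.format(day))
    print('Most common start hour: {}.'.format(hour))
    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)


# Made-up start times, only to exercise the function.
sample = pd.DataFrame({'Start Time': [
    '2017-01-01 00:07:57',  # Sunday
    '2017-03-06 08:00:00',  # Monday
    '2017-03-13 08:30:00',  # Monday
    '2017-03-14 17:00:00',  # Tuesday
]})
time_stats(sample)
```

  When several values tie, mode() returns all of them in sorted order and this sketch reports only the first.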
  Compute and display the most popular stations and trip
    def station_stats(df):
        """Displays statistics on the most popular stations and trip."""
        print('\nCalculating The Most Popular Stations and Trip...\n')
        start_time = time.time()
        # display most commonly used start station
        common_start = df['Start Station'].value_counts().index[0]
        print('Most commonly used start station: {}.'.format(common_start))
        # display most commonly used end station
        common_end = df['End Station'].value_counts().index[0]
        print('Most commonly used end station: {}.'.format(common_end))
        # display most frequent combination of start station and end station trip
        # (a local Series is used so the filtered DataFrame is not modified)
        combination = df['Start Station'] + ' / ' + df['End Station']
        common_combine = combination.value_counts().index[0]
        print('Most frequent combination of start and end station trip: {}.'.format(common_combine))
        print("\nThis took %s seconds." % (time.time() - start_time))
        print('-'*40)
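  Joining the two station names into one string is one way to count trips. Another is to group by both columns, which keeps the start and end stations separate and also yields the trip count. A minimal sketch with made-up station names; most_common_trip is a hypothetical helper, not part of the original script:

```python
import pandas as pd


def most_common_trip(df):
    """Return ((start station, end station), count) for the most frequent pair."""
    counts = df.groupby(['Start Station', 'End Station']).size()
    return counts.idxmax(), int(counts.max())


# Made-up trips, only to exercise the function.
trips = pd.DataFrame({
    'Start Station': ['A', 'A', 'B', 'A'],
    'End Station':   ['B', 'B', 'A', 'C'],
})
pair, count = most_common_trip(trips)
print('Most frequent trip: {} -> {} ({} trips)'.format(pair[0], pair[1], count))
```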
  Compute and display the total and average trip duration
    def trip_duration_stats(df):
        """Displays statistics on the total and average trip duration."""
        print('\nCalculating Trip Duration...\n')
        start_time = time.time()
        # display total travel time
        total_time = df['Trip Duration'].sum()
        print('Total travel time: {} seconds.'.format(total_time))
        # display mean travel time
        mean_time = df['Trip Duration'].mean()
        print('Mean travel time: {} seconds.'.format(mean_time))
        print("\nThis took %s seconds." % (time.time() - start_time))
        print('-'*40)
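  The totals above are printed in raw seconds, which is hard to read for half a year of trips. A minimal sketch of a formatter that could wrap those values; format_duration is a hypothetical helper, not part of the original script:

```python
def format_duration(seconds):
    """Format a duration in seconds as days, hours, minutes and seconds."""
    total = int(round(seconds))
    days, rem = divmod(total, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)     # 3600 seconds per hour
    minutes, secs = divmod(rem, 60)
    return '{}d {:02d}h {:02d}m {:02d}s'.format(days, hours, minutes, secs)


print(format_duration(776))    # the example trip duration from the dataset section
print(format_duration(90061))
```

  For example, the 776-second trip from the dataset description comes out as 0d 00h 12m 56s.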
  Compute and display statistics on bike-share users
    def user_stats(df):
        """Displays statistics on bikeshare users."""
        print('\nCalculating User Stats...\n')
        start_time = time.time()
        # display counts of user types (works for any number of types in the filtered data)
        print('User type')
        for user_type, count in df['User Type'].value_counts().items():
            print('{}: {}'.format(user_type, count))
        # display counts of gender (only some city files have this column)
        cities_columns = df.columns
        if 'Gender' in cities_columns:
            for gender, count in df['Gender'].value_counts().items():
                print('{}: {}'.format(gender, count))
        else:
            print("Sorry, this city doesn't have gender data.")
        # display earliest, most recent, and most common year of birth
        if 'Birth Year' in cities_columns:
            earliest_birth = df['Birth Year'].min()
            recent_birth = df['Birth Year'].max()
            common_birth = df['Birth Year'].value_counts().index[0]
            print('Earliest user year of birth: %i.' % (earliest_birth))
            print('Most recent user year of birth: %i.' % (recent_birth))
            print('Most common user year of birth: %i.' % (common_birth))
        else:
            print("Sorry, this city doesn't have birth year data.")
        print("\nThis took %s seconds." % (time.time() - start_time))
        print('-'*40)
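  The birth-year statistics assume at least one non-missing value remains after filtering. A minimal sketch of a guard for the case where the column is absent or entirely empty; birth_year_summary and the sample values are hypothetical, not part of the original script:

```python
import pandas as pd


def birth_year_summary(df):
    """Return (earliest, most recent, most common) birth year, or None if unavailable."""
    if 'Birth Year' not in df.columns:
        return None
    years = df['Birth Year'].dropna()  # ignore riders with no recorded birth year
    if years.empty:
        return None
    return int(years.min()), int(years.max()), int(years.mode()[0])


# Made-up values, only to exercise the function.
riders = pd.DataFrame({'Birth Year': [1985.0, 1990.0, 1990.0, None, 1962.0]})
print(birth_year_summary(riders))
print(birth_year_summary(pd.DataFrame({'User Type': ['Customer']})))
```

  A caller can print the three values when a tuple comes back and fall back to a "no data" message when the result is None.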
  Main function
    def main():
        while True:
            city, month, day = get_filters()
            df = load_data(city, month, day)
            # time_stats(df) reports the most common month, weekday and hour;
            # the author's version is not listed in this article
            time_stats(df)
            station_stats(df)
            trip_duration_stats(df)
            user_stats(df)
            restart = input('\nWould you like to restart? Enter yes or no.\n')
            if restart.lower() != 'yes':
                break

    if __name__ == "__main__":
        main()
  Link: https://pan.baidu.com/s/1sSgbXBaSy1IxIfJqoMil2w  Password: m55o
   