Study notes: 《用Python玩转数据》 (Playing with Data in Python)

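Teacher Dazhuang's opening remarks: Welcome to 《用Python玩转数据》. The course explains, in terms that learners without a computer science background can follow, how to use Python, a simple and easy-to-learn programming language, to acquire, represent, analyze, and present data quickly and conveniently. Through a series of case studies you will learn to work with data from many fields. I believe this is a programming course that will take the fear out of data processing. It starts at 10:00 on 8 September 2016, and Teacher Dazhuang looks forward to seeing you there ^_^.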

MODULE 01: Python Basics, Week 2: A Broad Look at Python

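Problem:

Find the n-th Mersenne prime (默尼森数). If P is prime and M = 2^P - 1 is also prime, then M is called a Mersenne prime. For example, with P = 5, M = 2^P - 1 = 31; both 5 and 31 are prime, so 31 is a Mersenne prime.

Input format: read the value with input(). Note that in Python 3 this function returns a str, so the value must be converted to int.
Output format: int

Sample input: 4
Sample output: 127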

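Python implementation

from math import sqrt

def prime(num):
    # Trial division up to sqrt(num); numbers below 2 are not prime
    if num <= 1:
        return False
    N = int(sqrt(num))
    for i in range(2, N + 1):
        if num % i == 0:
            return False
    return True

def monisen(no):
    # Return the no-th Mersenne prime, or 0 when no < 1
    if no < 1:
        return 0
    k = 0
    num = 2
    while True:
        M = 2 ** num - 1
        # prime(M) is only evaluated when the exponent itself is prime
        if prime(num) and prime(M):
            k += 1
            if k == no:
                return M
        num += 1

# int() is required in Python 3, where input() returns a str
print(monisen(int(input())))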
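Trial division on 2^P - 1 slows down quickly as P grows. As a possible alternative (not the method used in these notes), the sketch below swaps in the Lucas-Lehmer test for the second primality check. It is written for Python 3, and the names is_prime, lucas_lehmer, and nth_mersenne_prime are my own.

```python
def is_prime(n):
    """Trial division, enough for the small exponents tested here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime; p itself must be prime."""
    if p == 2:
        return True  # 2**2 - 1 = 3, handled separately
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


def nth_mersenne_prime(n):
    """Return the n-th Mersenne prime for n >= 1."""
    count = 0
    p = 1
    while True:
        p += 1
        if is_prime(p) and lucas_lehmer(p):
            count += 1
            if count == n:
                return (1 << p) - 1


print(nth_mersenne_prime(4))  # sample input 4 gives 127
```

The result for the sample input matches the expected 127, and larger indices stay fast because each candidate needs only about P squarings.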

MODULE 02: Data Acquisition and Representation, Week 3: Data Acquisition and Representation

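Problem

Define a function countchar() that counts how many times each letter occurs in a string (uppercase input is allowed, and counting is case-insensitive). The program should have this shape:

def countchar(str):
      ... ...
     return a list
if __name__ == "__main__":
     str = raw_input()
     ... ...
     print countchar(str)    # print(countchar(str)) in Python 3

Input format:
a string

Output format:
a list of 26 counts, one for each letter from a to z

Sample input:
Hello, World!

Sample output:

[0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 3, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]

Python implementation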

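#coding=utf-8
def countchar(str):
    # The parameter is called str to match the template, although it shadows the built-in
    charmap = {}
    charnum = []
    # Initialize the dictionary with one zero counter per letter
    for i in range(26):
        charmap[chr(i + 97)] = 0
        charnum.append(0)
    # Fold everything to lowercase
    str = str.lower()
    # Count how often each letter occurs; other characters are skipped
    for c in str:
        if "a" <= c <= "z":
            charmap[c] += 1
    # Copy the counts into a list ordered from a to z
    for i in range(26):
        charnum[i] = charmap[chr(i + 97)]
    return charnum

if __name__ == "__main__":
    str = raw_input()
    print countchar(str)    # print(countchar(str)) in Python 3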
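For comparison, here is a shorter Python 3 sketch of the same count built on collections.Counter. The function name count_letters is my own choice and not part of the assignment template.

```python
from collections import Counter
from string import ascii_lowercase


def count_letters(text):
    """Return 26 counts, one per letter a to z, ignoring case."""
    tally = Counter(text.lower())
    # A Counter returns 0 for keys it has never seen
    return [tally[letter] for letter in ascii_lowercase]


print(count_letters("Hello, World!"))
```

For the sample input this prints the same list as the sample output above.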

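Problem:

Fetch the ten Baidu Tieba pages from "http://tieba.baidu.com/p/1000000000" to "http://tieba.baidu.com/p/1000000009" and save them to the local disk as 1000000000.html through 1000000009.html (hint: open the files in wb mode). When evaluating a submission, check whether it was written for Python 2 or Python 3, which use the print statement and the print() function respectively.

Python implementation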

# -*- coding:utf-8 -*-
import urllib2

def getPage(baseURL, pageNum=1):
    # Append the last digit of the thread id to the common prefix
    url = baseURL + str(pageNum)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    return response

def getContent(page):
    # Raw bytes of the page
    return page.read()

def writeData(file_name, contents):
    # "wb" writes the bytes unchanged; the with block closes the file
    with open(file_name, "wb") as f:
        f.write(contents)

# Nine-digit prefix shared by the ten thread ids
prefix = '100000000'
baseURL = 'http://tieba.baidu.com/p/' + prefix
for p in range(10):
    # Name of the output file, e.g. 1000000003.html
    file_name = prefix + str(p) + ".html"
    print "Processing Tieba thread: " + file_name + " ..."
    # Fetch the page
    page = getPage(baseURL, pageNum=p)
    # Read the page content
    contents = getContent(page)
    # Write the content to the file
    writeData(file_name, contents)
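In Python 3, urllib2 became urllib.request. The sketch below shows roughly how the same task might look there. The helper names thread_targets and download_all are my own, and download_all is only defined here, not called, because it needs network access.

```python
from urllib.request import urlopen

BASE_URL = "http://tieba.baidu.com/p/"


def thread_targets(first_id=1000000000, count=10):
    """Return (url, file_name) pairs for count consecutive thread ids."""
    return [(BASE_URL + str(tid), str(tid) + ".html")
            for tid in range(first_id, first_id + count)]


def download_all(targets, timeout=10):
    """Fetch each URL and write the raw bytes to its file."""
    for url, file_name in targets:
        with urlopen(url, timeout=timeout) as response:
            data = response.read()
        # "wb" stores the bytes exactly as received
        with open(file_name, "wb") as out:
            out.write(data)


for url, file_name in thread_targets()[:2]:
    print(url, "->", file_name)
```

Calling download_all(thread_targets()) would then save the ten pages. Deriving the URL and the file name from the same id keeps the two from drifting apart.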

Extension: fetch the posts on the first 10 pages of the python bar (python吧)

#-*- coding:utf-8 -*-
import urllib2
import re

def replace(x):
    # Remove img tags and runs of seven spaces
    removeImg = re.compile('<img.*?>| {7}')
    # Remove hyperlink tags
    removeAddr = re.compile('<a.*?>|</a>')
    # Turn tags that end a line into \n
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # Turn table cells <td> into \t
    replaceTD = re.compile('<td>')
    # Turn the start of a paragraph into \n plus an indent
    replacePara = re.compile('<p.*?>')
    # Turn a single or double <br> into \n
    replaceBR = re.compile('<br><br>|<br>')
    # Strip all remaining tags
    removeExtraTag = re.compile('<.*?>')
    x = re.sub(removeImg, "", x)
    x = re.sub(removeAddr, "", x)
    x = re.sub(replaceLine, "\n", x)
    x = re.sub(replaceTD, "\t", x)
    x = re.sub(replacePara, "\n    ", x)
    x = re.sub(replaceBR, "\n", x)
    x = re.sub(removeExtraTag, "", x)
    # strip() removes leading and trailing whitespace
    return x.strip()

def getPage(baseURL, seeLZ=0, pageNum=1):
    # Build the query string from the see_lz flag and the page number pn
    seeLZ = '?see_lz=' + str(seeLZ)
    url = baseURL + seeLZ + '&pn=' + str(pageNum)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    return response

def getContent(page):
    # Capture the body of every post; re.S lets . match newlines
    pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
    items = re.findall(pattern, page)
    return items

def writeData(file_name, contents):
    # The with block closes the file when writing is done
    with open(file_name, "wb") as f:
        floor = 1
        for item in contents:
            # 楼 ("floor") is the label for a post's position in the thread
            floorLine = "\n" + str(floor) + "楼:  "
            f.write(floorLine)
            f.write(replace(item))
            floor += 1

baseURL = 'http://tieba.baidu.com/p/4622665804'
for p in range(1, 11):
    print "Processing page " + str(p) + " ..."
    # Name of the output file
    file_name = str(p) + ".txt"
    # Fetch the page
    page = getPage(baseURL, pageNum=p)
    # Extract the post contents
    contents = getContent(page.read())
    # Write the post contents to the file
    writeData(file_name, contents)
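To see what the extraction and cleanup steps do without touching the network, the Python 3 sketch below runs the same regular expressions over a small HTML fragment. The fragment is made up for illustration and the names RULES, clean_post, and extract_posts are my own.

```python
import re

# (pattern, replacement) pairs applied in order, mirroring replace() above
RULES = [
    (re.compile(r'<img.*?>| {7}'), ''),
    (re.compile(r'<a.*?>|</a>'), ''),
    (re.compile(r'<tr>|<div>|</div>|</p>'), '\n'),
    (re.compile(r'<td>'), '\t'),
    (re.compile(r'<p.*?>'), '\n    '),
    (re.compile(r'<br><br>|<br>'), '\n'),
    (re.compile(r'<.*?>'), ''),
]
POST = re.compile(r'<div id="post_content_.*?>(.*?)</div>', re.S)


def clean_post(html):
    """Apply every rule in turn, then trim surrounding whitespace."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html.strip()


def extract_posts(page):
    """Return the cleaned text of every post body found in page."""
    return [clean_post(body) for body in POST.findall(page)]


# Made-up fragment, not real Tieba markup
sample = ('<div id="post_content_1" class="d_post_content">  Hello '
          '<a href="#">world</a><br>second line<img src="x.png"></div>'
          '<div id="post_content_2">Reply</div>')
print(extract_posts(sample))
```

Keeping the rules in one ordered list makes the order of the substitutions explicit, which matters because the catch-all tag pattern has to run last.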

MODULE 03: Week 4: Powerful Data Structures and Python Extension Libraries

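Problem:

Five big names in some field, xiaoyun, xiaohong, xiaoteng, xiaoyi, and xiaoyang, have the QQ numbers 88888, 5555555, 11111, 1234321, and 1212121 respectively. Organize this data in a dictionary and implement the following two features:
(1) After the user enters a name, print that person's QQ number; if the name is not in the dictionary, show a message and let the user try again;
(2) Find everyone who has a "nice" QQ number (five digits or fewer) and print their names.

In Python 2, use the following form for the input prompt and the output heading:
name = raw_input("Please input the name:")
print "Who has the nice QQ number?"

In Python 3, use the following form:
name = input("Please input the name:")
print("Who has the nice QQ number?")

Python implementation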

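dict1 = {'xiaoyun': 88888, 'xiaohong': 5555555, 'xiaoteng': 11111,
         'xiaoyi': 1234321, 'xiaoyang': 1212121}

# (1) Keep asking until the name is found in the dictionary
while True:
    name = raw_input("Please input the name:")
    if name in dict1:
        print dict1[name]
        break
    print "Name not found, please try again."

# (2) A nice number has at most five digits, so it is below 100000
print "Who has the nice QQ number?"
for name, qq in dict1.items():
    if qq < 100000:
        print name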
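The two features can also be split into small functions, which makes the retry logic easy to try out without typing at a prompt. This is a Python 3 sketch; lookup, nice_names, and ask are names I picked, and the "not found" message is only a placeholder.

```python
QQ_NUMBERS = {'xiaoyun': 88888, 'xiaohong': 5555555, 'xiaoteng': 11111,
              'xiaoyi': 1234321, 'xiaoyang': 1212121}


def lookup(name, book=QQ_NUMBERS):
    """Return the QQ number for name, or None if the name is unknown."""
    return book.get(name)


def nice_names(book=QQ_NUMBERS, max_digits=5):
    """Sorted names whose number has at most max_digits digits."""
    return sorted(name for name, qq in book.items()
                  if len(str(qq)) <= max_digits)


def ask(book=QQ_NUMBERS, read=input):
    """Prompt until a known name is entered, then return its number."""
    while True:
        number = lookup(read("Please input the name:"), book)
        if number is not None:
            return number
        print("Name not found, please try again.")


print("Who has the nice QQ number?")
for name in nice_names():
    print(name)
```

Passing the reader function as a parameter means ask() can be exercised with canned answers instead of real keyboard input.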

MODULE 04: Data Statistics and Visualization in Python

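Background:

The matplotlib.finance submodule provided an API for fetching Yahoo stock data: quotes_historical_yahoo_ochl. (The submodule was later deprecated and removed from Matplotlib.)

datetime.date: a class representing a date. Common attributes: year, month, day;
datetime.time: a class representing a time of day. Common attributes: hour, minute, second, microsecond;
datetime.datetime: a class representing a date and time.
datetime.timedelta: a duration, that is, the length between two points in time.
datetime.tzinfo: time zone information.

The date class defines some commonly used attributes and methods:
date.max, date.min: the latest and earliest dates a date object can represent;
date.resolution: the smallest difference between two date objects, which is one day.
date.today(): returns a date object for the current local date;
date.strftime(fmt): formats the date according to a format string;
date.fromtimestamp(timestamp): returns a date object for the given timestamp;
date.fromordinal(ordinal): converts a proleptic Gregorian ordinal (day 1 is 1 January of year 1) to a date object;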
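A few of these attributes and methods in action, as a small Python 3 sketch. The dates are arbitrary examples.

```python
from datetime import date, datetime, timedelta

d = date(2015, 1, 30)

# Format a date as a two-digit-year label
label = d.strftime('%y/%m/%d')
print(label)

# toordinal() and fromordinal() are inverses of each other
roundtrip = date.fromordinal(d.toordinal())

# Subtracting two dates yields a timedelta
quarter_length = date(2015, 3, 31) - date(2015, 1, 1)
print(quarter_length.days)

# date.resolution is one day, so adding it moves to the next day
next_day = d + date.resolution

# A datetime carries a time of day; date() drops it again
stamp = datetime(2015, 1, 30, 9, 30)
print(stamp.date() == d)
```

The ordinal round trip is the conversion the stock exercises later in these notes rely on to turn numeric dates into index labels.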

Displaying data

Index: quotesdf.index
Column names: quotesdf.columns
Values: quotesdf.values
Summary statistics: quotesdf.describe()
First 5 rows: quotesdf.head(5)
Last 5 rows: quotesdf.tail(5)

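Selecting data

Two styles are available: slicing and indexing.
loc selects by row and column label, and the end of a label slice is included; iloc selects by row and column position, and the end of a position slice is excluded. The integer labels below assume a default integer index.
Rows: quotesdf[u'2016-09-01':u'2016-09-10'], quotesdf.loc[1:5,], quotesdf.iloc[1:6,]
Columns: quotesdf['low'], quotesdf.loc[:,['open','high']], quotesdf.iloc[:,[0,2]]
Region: quotesdf.loc[1:5,['open','high']], quotesdf.iloc[1:6,[0,2]]
Single value: quotesdf.loc[5,'open'], quotesdf.at[5,'open'], quotesdf.iat[5,0]
Conditional filtering: index with a boolean mask built from a column such as 'open':
quotesdf[(quotesdf.open>=60)]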
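The difference between label-based and position-based selection is easiest to see on a tiny frame. In this sketch the prices are made up and only the column names follow the notes.

```python
import pandas as pd

# Made-up prices, indexed by date strings as in the notes
quotesdf = pd.DataFrame(
    {"open": [58.0, 61.5, 59.0, 62.0],
     "close": [60.0, 61.0, 60.5, 63.0]},
    index=["2016-09-01", "2016-09-02", "2016-09-06", "2016-09-07"],
)

# Label slice: both end labels are included
by_label = quotesdf.loc["2016-09-02":"2016-09-06", ["open", "close"]]

# Position slice: the end position 3 is excluded
by_position = quotesdf.iloc[1:3, [0, 1]]

# A single cell by label
single = quotesdf.at["2016-09-06", "open"]

# Boolean mask: keep only rows whose open is at least 60
high_open = quotesdf[quotesdf.open >= 60]

print(by_label.equals(by_position))
print(high_open)
```

Both slices pick the same two rows, which is why the notes pair loc[1:5] with iloc[1:6].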

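Simple statistics and processing

Mean: quotesdf['open'].mean()
Sorting: quotesdf.sort_values(by='close') (the pandas releases of the time wrote this as quotesdf.sort(columns='close'))
Grouping: quotesdf.groupby('month')
Combining: merge; appending: append; concatenating and joining: concat, join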
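Here is a compact sketch of these operations on made-up numbers, using the current pandas spelling sort_values. The month column is an assumption that matches how the later exercises build it.

```python
import pandas as pd

# Made-up quotes with a month column
quotesdf = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "open": [10.0, 12.0, 11.0, 13.0],
    "close": [11.0, 11.5, 12.5, 12.0],
})

# Mean of one column
mean_open = quotesdf["open"].mean()

# Sort by closing price, highest first
by_close = quotesdf.sort_values(by="close", ascending=False)

# Mean closing price per month
monthly = quotesdf.groupby("month")["close"].mean()

# Stack the best and the worst row into one frame
extremes = pd.concat([by_close[:1], by_close[-1:]])

print(mean_open)
print(monthly)
print(extremes)
```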

Problem:

Use the financial data API to fetch Yahoo stock data for the company AXP.

Python implementation:

from matplotlib.finance import quotes_historical_yahoo_ochl
# Note: quotes_historical_yahoo_ochl was the current name of this function when these notes were written
from datetime import date
import pandas as pd
today = date.today()
# Start one year before today, given as a (year, month, day) tuple
start = (today.year-1, today.month, today.day)
quotes = quotes_historical_yahoo_ochl('AXP', start, today)
fields = ['date','open','close','high','low','volume']
# Convert each date ordinal into a 'YYYY-MM-DD' label for the index
list1 = []
for i in range(0,len(quotes)):
    x = date.fromordinal(int(quotes[i][0]))
    y = x.strftime('%Y-%m-%d')
    list1.append(y)
quotesdf = pd.DataFrame(quotes, index = list1, columns = fields)
# The date now lives in the index, so the column can be dropped
quotesdf = quotesdf.drop(['date'], axis = 1)
print(quotesdf)
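The index-building step can be tried without the Yahoo API. The sketch below feeds two synthetic rows, laid out in the same (date ordinal, open, close, high, low, volume) order, through the same conversion. The numbers are invented for illustration.

```python
from datetime import date

import pandas as pd

# Synthetic rows in the layout the code above expects
quotes = [
    (date(2016, 9, 1).toordinal(), 65.0, 65.5, 66.0, 64.5, 1000.0),
    (date(2016, 9, 2).toordinal(), 65.5, 64.8, 65.9, 64.6, 1200.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

# Ordinal -> date -> 'YYYY-MM-DD' label
labels = [date.fromordinal(int(row[0])).strftime('%Y-%m-%d') for row in quotes]

quotesdf = pd.DataFrame(quotes, index=labels, columns=fields)
quotesdf = quotesdf.drop(['date'], axis=1)
print(quotesdf)
```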

Problem:

Compute the average closing price of Microsoft (MSFT) stock in the first quarter of 2015.

Python implementation:

from matplotlib.finance import quotes_historical_yahoo_ochl
from datetime import date
import pandas as pd
today = date.today()
# Two years of history up to today
start = (today.year-2, today.month, today.day)
quotesMS = quotes_historical_yahoo_ochl('MSFT', start, today)
attributes=['date','open','close','high','low','volume']
quotesdfMS = pd.DataFrame(quotesMS, columns= attributes)
# Use 'yy/mm/dd' strings as the index
labels = []
for i in range(0, len(quotesMS)):
    x = date.fromordinal(int(quotesMS[i][0]))
    labels.append(x.strftime('%y/%m/%d'))
quotesdfMS.index = labels
quotesdfMS = quotesdfMS.drop(['date'], axis = 1)
# Rows for the first quarter of 2015; copy() so a column can be added safely
quotesdfMS15 = quotesdfMS['15/01/01':'15/03/31'].copy()
# Characters 3 and 4 of a label such as '15/02/27' hold the month
quotesdfMS15['month'] = [int(label[3:5]) for label in quotesdfMS15.index]
# Mean closing price for each month of the quarter
print(quotesdfMS15.groupby('month').mean().close)
# Mean closing price over the whole quarter
print(quotesdfMS15.close.mean())
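An alternative to string labels is a real DatetimeIndex, which lets pandas work out the month on its own. This sketch uses made-up closing prices and is not the approach taken in the notes.

```python
import pandas as pd

# Made-up closing prices around the first quarter of 2015
closes = pd.Series(
    [46.0, 47.0, 43.0, 44.0, 41.0, 48.0],
    index=pd.to_datetime(["2014-12-31", "2015-01-02", "2015-02-02",
                          "2015-03-02", "2015-03-31", "2015-04-01"]),
)

# Label slice over the quarter, both ends included
q1 = closes.loc["2015-01-01":"2015-03-31"]

# Average over the whole quarter
q1_mean = q1.mean()

# Average per calendar month
monthly = q1.groupby(q1.index.month).mean()

print(q1_mean)
print(monthly)
```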

from matplotlib.finance import quotes_historical_yahoo_ochl
from datetime import date
import pandas as pd
# Fetch Microsoft's stock history from two years ago today up to now
today = date.today()
start = (today.year-2, today.month, today.day)
quotes = quotes_historical_yahoo_ochl('MSFT', start, today)
# Name the fields of the Microsoft quotes data
attributes = ['date','open','close','high','low','volume']
quotesdf = pd.DataFrame(quotes, columns = attributes)
# Use the date as the index and drop the original date column; 30 January 2015 is shown as '15/01/30' (mind the separators)
list1 = []
for i in range(0, len(quotes)):
    x = date.fromordinal(int(quotes[i][0]))
    y = x.strftime('%y/%m/%d')
    list1.append(y)
quotesdf.index = list1
quotesdf = quotesdf.drop(['date'], axis = 1 )
# Opening and closing prices from 30 October to 10 December 2014, the period of Microsoft's CEO change
print(quotesdf.loc['14/10/30':'14/12/10',['open', 'close']])
# Records from the second half of 2014 (taken here as 1 June to 31 December) with a closing price above 45 dollars
half = quotesdf['14/06/01':'14/12/31']
print(half[half.close > 45])
# The 5 days of 2014 (1 January to 31 December) with the highest closing price
print(quotesdf['14/01/01':'14/12/31'].sort_values('close', ascending=False)[:5])
# Data for the first half of 2014, sorted by volume in ascending order
print(quotesdf['14/01/01':'14/06/30'].sort_values('volume'))
# For each month of 2014, count the days on which the stock closed above its open
tmpdf = quotesdf['14/01/01':'14/12/31'].copy()
tmpdf['month'] = [int(label[3:5]) for label in tmpdf.index]
print(tmpdf[tmpdf.close > tmpdf.open]['month'].value_counts())
# Total volume for each month of 2014
print(tmpdf.groupby('month')['volume'].sum())
# Combine the 5 days with the highest and the 5 days with the lowest closing price in 2014
by_close = tmpdf.sort_values('close', ascending=False)
print(pd.concat([by_close[:5], by_close[-5:]]))
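The two monthly summaries can be checked by hand on a few rows. In this sketch the prices and volumes are made up, and the index uses the same 'yy/mm/dd' labels as the code above.

```python
import pandas as pd

# Made-up rows labelled in the 'yy/mm/dd' style used above
df = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0, 11.0, 10.0],
     "close": [11.0, 10.5, 12.5, 11.5, 9.0],
     "volume": [100, 200, 150, 50, 300]},
    index=["14/01/02", "14/01/03", "14/02/03", "14/02/04", "14/03/03"],
)

# Characters 3 and 4 of each label hold the month
df["month"] = [int(label[3:5]) for label in df.index]

# Days per month on which the close was above the open
up_days = df[df.close > df.open]["month"].value_counts().sort_index()

# Total volume per month
monthly_volume = df.groupby("month")["volume"].sum()

print(up_days)
print(monthly_volume)
```

Months with no rising day, such as March here, do not appear in up_days at all, which is worth remembering when reading the real output.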

Problem:

Compute the standard deviation of the ratings given by male and by female users in the MovieLens 100k dataset and print both (keep 6 decimal places, with a single space between the two values).
Dataset download: http://files.grouplens.org/datasets/movielens/ml-100k.zip
In the dataset,
u.data holds the 100k rating records; its columns are:
user id | item id | rating | timestamp

u.user holds the user information; its columns are:
user id | age | gender | occupation | zip code

Functions that may be useful:
pandas.read_table(filepath_or_buffer, sep='\t', names=None)
pandas.pivot_table(data, values=None, columns=None, aggfunc='mean')
pandas.merge(left, right, how='inner')
For the full API documentation see http://pandas.pydata.org/pandas-docs/stable/ .

import pandas as pd
import numpy as np
unames = ['user_id', 'age', 'gender', 'occupation', 'zip']
users = pd.read_table('ml-100k/u.user', sep='|', header=None, names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-100k/u.data', sep='\t', header=None, names=rnames)
# Merge the two DataFrames on their shared user_id column
data = pd.merge(ratings, users)
# Pivot by gender and movie: each value is the mean rating of one movie among users of one gender
data_series = data.pivot_table(index = ['gender','movie_id'], values = 'rating', aggfunc='mean')
# Make sure the result is a DataFrame
data_frame = pd.DataFrame(data_series)
# Filter the DataFrame into one frame for female and one for male users
Female_df = data_frame.query("gender == ['F']")
Male_df = data_frame.query("gender == ['M']")
# Standard deviation of those means for each gender (np.std uses ddof=0)
Female_std = np.std(Female_df['rating'].values)
Male_std = np.std(Male_df['rating'].values)
# Print the result
print('Gender\nF\t%.6f\nM\t%.6f' % (Female_std, Male_std))


Gender
F	0.845311 
M	0.797378
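The merge, pivot, and standard deviation steps can be followed by hand on a toy dataset. Everything in this sketch is synthetic: four invented users and two invented movies stand in for u.user and u.data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for u.user and u.data
users = pd.DataFrame({"user_id": [1, 2, 3, 4],
                      "gender": ["F", "F", "M", "M"]})
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3, 4, 4],
    "movie_id": [10, 20, 10, 20, 10, 20, 10, 20],
    "rating":   [5, 3, 3, 1, 4, 4, 2, 4],
})

# Join on the shared user_id column
data = pd.merge(ratings, users)

# Mean rating of each movie within each gender
per_movie = data.pivot_table(index=["gender", "movie_id"],
                             values="rating", aggfunc="mean")

# Population standard deviation (ddof=0) of those means per gender
female_std = np.std(per_movie.xs("F", level="gender")["rating"].values)
male_std = np.std(per_movie.xs("M", level="gender")["rating"].values)

print("%.6f %.6f" % (female_std, male_std))
```

With these numbers the per-movie means are 4 and 2 for F and 3 and 4 for M, so the two deviations come out as 1 and 0.5.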


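This work is licensed under the Creative Commons Attribution 2.5 China Mainland license. Reposting is welcome, but please credit Sunshine as the source and keep the reposted article complete. The author reserves all copyright-related rights.

Permalink: http://gaobb.github.io/2016/09/30/学习笔记_Learning_Python/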
