# Python Web Scraping

[toc]
## The requests Library

```python
import requests

r = requests.get("https://www.icourse163.org")
print(r.status_code)        # 200 on success
print(type(r))              # <class 'requests.models.Response'>
print(r.encoding)           # encoding guessed from the HTTP headers
print(r.apparent_encoding)  # encoding inferred from the content
r.encoding = 'utf-8'
print(r.text)               # response body as a string
print(r.headers)            # response headers
print(r.content)            # response body as bytes
```
### Response object attributes

| Attribute | Description |
| --- | --- |
| `r.status_code` | HTTP status of the request: 200 means the connection succeeded, 404 means failure |
| `r.text` | The response body as a string, i.e. the page content at the URL |
| `r.encoding` | The response encoding guessed from the HTTP headers |
| `r.apparent_encoding` | The encoding inferred from the content itself (a fallback) |
| `r.content` | The response body in binary form |
### requests methods

| Method | Description |
| --- | --- |
| `requests.request()` | Constructs a request; the base method that underlies all of the methods below |
| `requests.get()` | The main method for fetching an HTML page; corresponds to HTTP GET |
| `requests.head()` | Fetches an HTML page's header information; corresponds to HTTP HEAD |
| `requests.post()` | Submits a POST request to an HTML page; corresponds to HTTP POST |
| `requests.put()` | Submits a PUT request to an HTML page; corresponds to HTTP PUT |
| `requests.patch()` | Submits a partial-modification request; corresponds to HTTP PATCH |
| `requests.delete()` | Submits a delete request to an HTML page; corresponds to HTTP DELETE |
### A generic code framework

```python
import requests
import time

def getHTMLText(url):
    """
    Fetch the HTML content of the given URL.

    Parameters:
        url (str): the URL to fetch

    Returns:
        str: the HTML content on success, or "request failed" on failure
    """
    try:
        r = requests.get(url, timeout=1)
        r.raise_for_status()              # raise HTTPError for non-2xx codes
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        return "request failed"

if __name__ == "__main__":
    url1 = "https://www.baidu.com/"
    print(getHTMLText(url1))
    url2 = "https://chat.openai.com/"
    print(getHTMLText(url2))

    # rough timing of N repeated requests
    first_time = time.time()
    N = 100
    for i in range(N):
        getHTMLText(url1)
    last_time = time.time()
    print(f'Time for {N} requests: {last_time - first_time}')
```
### The Robots protocol

A site can place a robots.txt file in its root directory to state which crawler behavior it disallows. This is only a public notice, though; it cannot actually stop a crawler.
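If you want your crawler to honor the notice anyway, the standard library's `urllib.robotparser` can fetch and query a robots.txt. A minimal sketch (baidu.com chosen here purely as an example site):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt, then ask whether a given
# user agent is allowed to crawl a given URL.
rp = RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=python"))
```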
### Examples

1. When a site restricts which client devices may visit, we can modify the user-agent field to get through:

```python
kv = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, timeout=1, headers=kv)
```
2. Some sites provide an API-style search URL:

Baidu: http://www.baidu.com/s?wd
360: http://www.so.com/s?q

I didn't manage to get Baidu's to work, so here is the crawler for 360:

```python
import requests

url = 'http://www.so.com/s'
keyword = 'Python'
kv = {'q': keyword}  # the keyword goes in the 'q' query parameter
try:
    r = requests.get(url, params=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except Exception:
    print('scrape failed')
```
### Storage

```python
import requests
import os

url = 'https://img-blog.csdnimg.cn/f35b24a8c05744dca042a64ef2154c12.png'
root = "H:/pachong/"
path = root + url.split('/')[-1]  # name the file after the last URL segment
try:
    if not os.path.exists(root):
        os.makedirs(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:  # binary mode; the with-block closes the file
            f.write(r.content)
        print("file saved")
    else:
        print("file already exists")
except Exception:
    print('download failed')
```
### IP geolocation lookup

I use the ip38 lookup service here.

```python
import requests

url = 'https://ip38.com/ip.php'
keyword = 'ip-address'  # replace with the IP address to look up
kv = {'ip': keyword}
try:
    r = requests.get(url, params=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.request.url)  # the final URL, with the query string attached
    print(r.text)
except Exception:
    print('lookup failed')
```
## Beautiful Soup
### Basic usage

```python
from bs4 import BeautifulSoup
import requests

url = 'https://baike.baidu.com/item/%E5%88%9D%E9%9F%B3%E6%9C%AA%E6%9D%A5/8231955'
try:
    r = requests.get(url)
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.prettify())  # pretty-print the parsed document
except Exception:
    print("error")
```
### Basic elements

| Element | Description |
| --- | --- |
| `Tag` | A tag, the most basic unit of information; opened with `<>` and closed with `</>` |
| `Name` | The tag's name; the name of `<p>…</p>` is `'p'`; accessed as `<tag>.name` |
| `Attributes` | The tag's attributes, organized as a dictionary; accessed as `<tag>.attrs` |
| `NavigableString` | The non-attribute string inside a tag, i.e. the string between `<>` and `</>`; accessed as `<tag>.string` |
| `Comment` | The comment portion of a string inside a tag; a special `Comment` type |
```python
from bs4 import BeautifulSoup
import requests

url = 'https://python123.io/ws/demo.html'
try:
    r = requests.get(url)
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.title)
    tag = soup.a                  # the first <a> tag
    print(tag)
    print(tag.name)               # 'a'
    print(tag.parent.name)
    print(tag.parent.parent.name)
    print(tag.attrs)              # attributes as a dict
    print(tag.attrs['href'])
    print(type(tag))
    print(soup.p.string)          # NavigableString inside the first <p>
    print(type(soup.p.string))
    print(soup.prettify())
    # .string yields a Comment for comments and a NavigableString otherwise;
    # check the type to tell them apart
    newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
    print(newsoup.b.string)
    print(type(newsoup.b.string))  # <class 'bs4.element.Comment'>
    print(newsoup.p.string)
    print(type(newsoup.p.string))  # <class 'bs4.element.NavigableString'>
except Exception:
    print("error")
```
### Traversing HTML content

Traversal comes in three kinds: downward, upward, and sideways. The code below is the easiest way to understand them.
```python
from bs4 import BeautifulSoup
import requests

url = 'https://python123.io/ws/demo.html'
try:
    r = requests.get(url)
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.head.contents)
    print(soup.body.contents)

    print("---------------- downward traversal ----------------")
    for child in soup.body.children:           # iterate over direct children
        print(child)
    print(len(soup.body.contents))
    for i in range(len(soup.body.contents)):   # .contents also supports indexing
        print(soup.body.contents[i])

    print("---------------- upward traversal ------------------")
    for parent in soup.a.parents:              # walk up to the document root
        if parent is None:
            print(parent)
        else:
            print(parent.name)

    print("---------------- sideways traversal ----------------")
    print(soup.p)
    for sibling in soup.p.next_siblings:       # siblings after this tag
        print(sibling)
    print("-----------------------------------------------------")
    for sibling in soup.p.previous_siblings:   # siblings before this tag
        print(sibling)
except Exception as e:
    print("error:", e)
```
## Information organization and extraction

### Three ways of marking up information

**XML (eXtensible Markup Language)**

The earliest general-purpose information markup language: very extensible, but verbose. Typical use: information exchange and transfer on the Internet.
```xml
<person>
    <firstName>Tian</firstName>
    <lastName>Song</lastName>
    <address>
        <streetAddr>中关村南大街5号</streetAddr>
        <city>北京市</city>
        <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof>
    <prof>Security</prof>
</person>
```
**JSON (JavaScript Object Notation)**

Values are typed, which suits processing by programs (JavaScript); more concise than XML. Typical use: communication between mobile apps and the cloud; no comments allowed.
```json
{
    "firstName": "Tian",
    "lastName": "Song",
    "address": {
        "streetAddr": "中关村南大街5号",
        "city": "北京市",
        "zipcode": "100081"
    },
    "prof": ["Computer System", "Security"]
}
```
**YAML (YAML Ain't Markup Language)**

Values are untyped, the text is almost all payload, and it is highly readable. Typical use: configuration files for all kinds of systems; comments make it easy to read.
```yaml
firstName: Tian
lastName: Song
address:
    streetAddr: 中关村南大街5号
    city: 北京市
    zipcode: 100081
prof:
    - Computer System
    - Security
```
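As a quick check that the three snippets describe the same record, the JSON version can be loaded directly with Python's standard `json` module; a minimal sketch:

```python
import json

text = '''
{
    "firstName": "Tian",
    "lastName": "Song",
    "address": {
        "streetAddr": "中关村南大街5号",
        "city": "北京市",
        "zipcode": "100081"
    },
    "prof": ["Computer System", "Security"]
}
'''
record = json.loads(text)         # parse JSON text into a dict
print(record["address"]["city"])  # -> 北京市
print(record["prof"])             # -> ['Computer System', 'Security']
```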
### Information extraction

```python
from bs4 import BeautifulSoup
import requests
import re

url = 'https://python123.io/ws/demo.html'
try:
    r = requests.get(url)
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")

    # hrefs of all <a> tags
    for link in soup.find_all('a'):
        print(link.get('href'))

    # names of all tags
    for tag in soup.find_all(True):
        print(tag.name)

    # tags whose names match a regex (here: contain 'b', e.g. <b>, <body>)
    for tag in soup.find_all(re.compile('b')):
        print(tag.name)

    # search by attribute value, exact and by regex
    for tag in soup.find_all(id='link1'):
        print(tag)
    for tag in soup.find_all(id=re.compile('link')):
        print(tag)

    # recursive search (the default) vs. direct children only
    print(soup.find_all('a'))
    print(soup.find_all('a', recursive=False))

    # search by string content, exact and by regex
    print(soup.find_all(string="Basic Python"))
    print(soup.find_all(string=re.compile('Python')))
except Exception as e:
    print("error:", e)
```
### Extension methods

| Method | Description |
| --- | --- |
| `<>.find()` | Searches and returns only the first result; same parameters as `.find_all()` |
| `<>.find_parents()` | Searches the ancestor nodes; returns a list; same parameters as `.find_all()` |
| `<>.find_parent()` | Returns a single result from the ancestor nodes; same parameters as `.find()` |
| `<>.find_next_siblings()` | Searches the following siblings; returns a list; same parameters as `.find_all()` |
| `<>.find_next_sibling()` | Returns a single result from the following siblings; same parameters as `.find()` |
| `<>.find_previous_siblings()` | Searches the preceding siblings; returns a list; same parameters as `.find_all()` |
| `<>.find_previous_sibling()` | Returns a single result from the preceding siblings; same parameters as `.find()` |
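A minimal sketch of a few of these methods against the same demo page used earlier:

```python
from bs4 import BeautifulSoup
import requests

r = requests.get('https://python123.io/ws/demo.html')
soup = BeautifulSoup(r.text, "html.parser")

first_a = soup.find('a')            # first <a> only, unlike find_all()
print(first_a)
print(first_a.find_parent('p'))     # nearest enclosing <p>
print(first_a.find_next_sibling())  # the sibling tag right after it, if any
```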
## A focused crawler for the Chinese university rankings

This uses the 2023 edition of the rankings. Eight years on from the original course, the page structure has changed, but with minor modifications the crawler still works: 【软科排名】2023年最新软科中国大学排名|中国最好大学排名 (shanghairanking.cn)
```python
from bs4 import BeautifulSoup
import bs4
import requests

def getHTMLText(url):
    """
    Fetch the HTML text of the given URL.

    Parameters:
        url (str): the target page's URL

    Returns:
        str: the page's HTML content, or an empty string if the request fails
    """
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        return ""

def fillUnivList(ulist, html):
    """
    Fill the list with university records parsed from the HTML text.

    Parameters:
        ulist (list): the list collecting university records
        html (str): the page's HTML content
    """
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip NavigableStrings between rows
            tds = tr('td')                   # shorthand for tr.find_all('td')
            a = tds[0].string.strip()
            b = tds[1].find('a', class_='name-cn').string.strip()
            c = tds[1].find('p').string.strip()
            d = tds[2].text.strip()
            e = tds[3].text.strip()
            f = tds[4].string.strip()
            ulist.append([a, b, c, d, e, f])

def printUnivList(ulist, num):
    """
    Print the university list.

    Parameters:
        ulist (list): the list of university records
        num (int): how many universities to print
    """
    tplt = "{:^20}\t{:^20}\t{:^20}\t{:^20}\t{:^10}\t{:^10}"
    print(tplt.format("Rank", "Name", "Tier", "Region", "Type", "Score"))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], u[3], u[4], u[5]))

def main():
    """Run the crawler."""
    uinfo = []
    url = 'https://www.shanghairanking.cn/rankings/bcur/202311'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 30)

main()
```