Published: 2025-12-09 20:31:25 · Views: 4
A crawler is a program you write to simulate a browser going online and then grab data from the internet.
Crawler categories
The spear and shield of crawling:
Anti-crawling mechanisms: portal sites can adopt policies or technical measures to stop crawler programs from scraping the site's data.
Anti-anti-crawling strategies: a crawler program can adopt its own policies or technical measures to break through a portal site's anti-crawling mechanisms and still obtain the site's data.
The robots.txt protocol:
A gentleman's agreement that specifies which of a site's data may be crawled and which may not.
# View Taobao's robots.txt: www.taobao.com/robots.txt

HTTP & HTTPS protocols
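Whether a URL is allowed by a site's robots.txt can be checked programmatically with the standard library's `urllib.robotparser`. A small sketch — the ruleset below is made up for illustration, not Taobao's real one:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally you would point it at the live file:
#   rp.set_url('https://www.taobao.com/robots.txt'); rp.read()
# Here we parse a made-up ruleset offline:
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])
print(rp.can_fetch('*', 'https://example.com/index.html'))  # True
print(rp.can_fetch('*', 'https://example.com/private/x'))   # False
```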
When data is transferred, the current URL follows the HTTP protocol.
Protocol: a gentleman's agreement set up so that two computers can communicate smoothly. Common protocols include TCP/IP, SOAP, HTTP, SMTP, etc.
HTTP: short for Hyper Text Transfer Protocol, the protocol used by web servers to transfer hypertext to browsers.
HTTPS: the secure hypertext transfer protocol.
Put plainly: the data exchange between browser and server follows HTTP.
HTTP divides a message — whether request or response — into three blocks.
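The three blocks can be seen by assembling a raw request by hand (a sketch; the host and body here are only illustrative):

```python
# A raw HTTP request, built by hand to show the three blocks:
request = (
    'POST /sug HTTP/1.1\r\n'        # 1. start line (method, path, version)
    'Host: fanyi.baidu.com\r\n'     # 2. headers
    'Content-Type: application/x-www-form-urlencoded\r\n'
    '\r\n'                          # a blank line separates headers from body
    'kw=spider'                     # 3. body
)
head, _, body = request.partition('\r\n\r\n')
print(head.splitlines()[0])  # POST /sug HTTP/1.1
print(body)                  # kw=spider
```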
Anti-crawling countermeasures
Purpose: simulate a browser when sending requests (via request headers such as User-Agent).
Persisting the first page of Douban movie data
```python
import urllib.request
import urllib.parse

# Fetch the first page of Douban movie data and save it
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
# Build the request object
request = urllib.request.Request(url=url, headers=headers)
# Get the response
response = urllib.request.urlopen(request)
# Read the data and persist it
content = response.read().decode('utf-8')
with open('../data/douban.json', mode='w', encoding='utf-8') as f:
    f.write(content)
```

AJAX GET request: persisting the first ten pages of Douban movie data
```python
import urllib.request
import urllib.parse
import os

# Fetch the first ten pages of Douban movie data and save them

def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action='
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    data = urllib.parse.urlencode(data)
    url = base_url + data
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
    }
    # Build the request object
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request, timeout=2)
    content = response.read().decode('utf-8')
    return content

def download(page, content):
    with open('../data/douban/douban' + str(page) + '.json', 'w', encoding='utf-8') as f:
        print(f'正在下载第{str(page)}页,请耐心等待.....')
        f.write(content)

if __name__ == '__main__':
    # Create the directory the data is saved into
    if not os.path.exists('../data/douban'):
        os.mkdir('../data/douban')
    start_page = int(input('请输入起始页码:'))
    end_page = int(input('请输入结束的页码:'))
    for page in range(start_page, end_page + 1):
        # Each page gets its own request object
        request = create_request(page)
        # Get the response data
        content = get_content(request)
        # Save it
        download(page, content)
    print(f'\n第{start_page}页到{end_page}下载完毕,请注意查看....')
```

Kuaidaili (a proxy provider)
1. Common uses of proxies:
   1. Break through your own IP's access limits and visit foreign sites.
   2. Access internal resources of some organizations. For example, with an address inside the education network you can use free proxy servers open to that network for FTP downloads/uploads and other shared resources (provided the proxy address is within the resource's allowed range).
   3. Improve access speed. A proxy server usually keeps a large disk buffer: information passing through is saved in the buffer, and when another user requests the same information it is served straight from the buffer, speeding up access.
   4. Hide your real IP. Users can hide their IP this way to avoid attacks.
2. Configuring a proxy in code: create a Request object, create a ProxyHandler object, build an opener from the handler, and send the request with opener.open.

```python
import urllib.request

url = 'https://www.baidu.com/s?ie=UTF-8&wd=ip'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
proxies = {'http': '118.24.219.151:16817'}
# Route the simulated browser request through the proxy
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
# Read the response
content = response.read().decode('utf-8')
with open('../data/代理.html', 'w', encoding='utf-8') as f:
    f.write(content)
```

Proxy pool
```python
import random
import urllib.request

proxies_pool = [
    {'http': '118.24.219.151:16817'},
    {'http': '112.14.47.6:52024'},
    {'http': '222.74.73.202:42055'},
    {'http': '114.233.70.231:9000'},
    {'http': '116.9.163.205:58080'},
    {'http': '27.42.168.46:55481'},
    {'http': '121.13.252.61:41564'},
    {'http': '61.216.156.222:60808'},
]
# Pick a proxy at random from the pool
proxies = random.choice(proxies_pool)
url = 'https://www.baidu.com/s?ie=UTF-8&wd=ip'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode('utf-8')
with open('../data/代理.html', 'w', encoding='utf-8') as f:
    f.write(content)
```

Requirement: crawl the Sogou homepage data
```python
# Imports
import requests

# Specify the URL
url = 'https://www.sogou.com/'
# Send the request; get() returns a response object
response = requests.get(url=url)
# Get the response data: the text attribute holds the response as a string (the page source)
page_text = response.text
print(page_text)
# Persist it
with open("sougou.html", 'w', encoding='utf-8') as f:
    f.write(page_text)
print('爬取结束')
```

Case 1: crawl Sogou results for a specified keyword (a simple web collector)
```python
import requests

get_url = 'https://www.sogou.com/web?'
# Put the URL's query parameters into a dict
kw = input("enter a message:")
param = {"query": kw}
# UA spoofing: disguise the crawler's request carrier as a certain browser.
# UA = User-Agent. A portal site's server inspects the identity of the request
# carrier: if it looks like a browser the request is treated as normal, but if
# it does not, the request is flagged as a crawl and the server refuses it.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
response = requests.get(url=get_url, params=param, headers=headers)
page_text = response.text
# print(page_text)
file_name = kw + '.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(page_text)
print(kw + ' save over!!')
```

Case 2: Baidu Translate
```python
# Get the translation of the current word.
# The page refreshes partially via AJAX.
import json
import requests

post_url = 'https://fanyi.baidu.com/sug'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
word = input('请输入要翻译的单词:enter a word!!\n')
# POST request parameters
data = {"kw": word}
response = requests.post(url=post_url, data=data, headers=headers)
# print(response.text)
# json() returns a dict; only call it once you've confirmed the response is JSON
dic_obj = response.json()
with open(word + '.json', 'w', encoding='utf-8') as f:
    json.dump(dic_obj, fp=f, ensure_ascii=False)
print('over')
```

Case 3: Douban movies
```python
# The page refreshes partially via an AJAX request
import json
import requests

get_url = 'https://movie.douban.com/j/chart/top_list'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
param = {
    'type': '24',
    'interval_id': '100:90',
    'action': '',
    'start': '0',   # which movie in the library to start from
    'limit': '20',  # how many to fetch at a time
}
response = requests.get(url=get_url, params=param, headers=headers)
list_data = response.json()
fp = open('./douban.json', 'w', encoding='utf-8')
json.dump(list_data, fp, ensure_ascii=False)
fp.close()
print('over')
```

Case 4: KFC store locator — crawl several pages of store info
Dynamically loaded, partial refresh.

```python
import requests

post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
for i in range(1, 11):
    data = {
        'cname': '',
        'pid': '',
        'keyword': '北京',
        'pageIndex': i,
        'pageSize': '10',
    }
    response = requests.post(url=post_url, headers=headers, data=data)
    dic_data = response.text
    with open('KFC_order', 'a', encoding='utf-8') as f:
        f.write(dic_data)
    print('\n')
```

```python
# Crawl the first page of KFC stores
import os
import requests

post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
}
mes = input('请输入一个城市:')
data = {
    'cname': '',
    'pid': '',
    'keyword': mes,
    'pageIndex': '1',
    'pageSize': '10',
}
# Send the request
response = requests.post(url=post_url, headers=headers, data=data)
# Read the response
result = response.text
if not os.path.exists('./网络爬虫/sucai/KFC/'):
    os.makedirs('./网络爬虫/sucai/KFC/')
# Persist it
with open('./网络爬虫/sucai/KFC/' + mes + 'KFC', 'w', encoding='utf-8') as f:
    f.write(result)
```

Review: the steps of a requests-based crawler:
Actually, before persistence there is one more step: data parsing. A focused crawler is needed to scrape just part of a page's data rather than the whole page. Three data-parsing approaches are covered below. The crawler workflow can thus be revised to:
Data parsing approaches:
Data parsing principle, in outline:
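The idea — locate a tag, then capture the data stored inside it — can be sketched with the regex approach used in the case below (the HTML snippet here is made up):

```python
import re

html = '<li><span class="title">肖申克的救赎</span></li>'
# Locate the tag, then capture the data stored inside it with a named group
pattern = re.compile(r'<span class="title">(?P<name>.*?)</span>', re.S)
match = pattern.search(html)
print(match.group('name'))  # 肖申克的救赎
```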
Case 1: the Douban Top 250 chart
```python
import csv
import re
import requests

url = "https://movie.douban.com/top250"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
# Get the response
res = requests.get(url=url, headers=header)
content_page = res.text
# Parse the data
obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?'
                 r'<p class="">.*?<br>(?P<time>.*?) .*?<p class="star">.*?'
                 r'<span class="rating_num".*?>(?P<grade>.*?)</span>.*?'
                 r'<span>(?P<number>.*?)</span>', re.S)
# Run the match
result = obj.finditer(content_page)
f = open("data.csv", mode="w", encoding='utf-8')
cswriter = csv.writer(f)
for item in result:
    dic = item.groupdict()
    dic['time'] = dic['time'].strip()
    cswriter.writerow(dic.values())
f.close()
print("over")
```

Data parsing principle
1. Instantiate a BeautifulSoup object and load the page source into it.
2. Locate tags by calling the BeautifulSoup object's properties and methods.
3. Extract the data stored in the tags and tag attributes.

Environment setup:
```shell
pip install bs4
pip install lxml
```

How to instantiate a BeautifulSoup object
Object instantiation:
Case 1: crawl every chapter title and chapter body of Romance of the Three Kingdoms
Romance of the Three Kingdoms (三国演义)
```python
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

if __name__ == '__main__':
    headers = {"User-Agent": UserAgent().chrome}
    get_url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    # Send the request, get the response; re-encode to fix the site's charset
    page_text = requests.get(url=get_url, headers=headers).text.encode('ISO-8859-1')
    # Parse the chapter titles and chapter contents from the home page
    # 1. Instantiate a BeautifulSoup object and load the HTML into it
    soup = BeautifulSoup(page_text, 'lxml')
    # 2. Parse out the chapter titles and the detail-page URLs
    list_data = soup.select('.book-mulu > ul > li')
    fp = open('./sanguo.text', 'w', encoding='utf-8')
    for i in list_data:
        title = i.a.text
        detail_url = 'https://www.shicimingju.com/' + i.a['href']
        # Request the detail page
        detail_text = requests.get(url=detail_url, headers=headers).text.encode('ISO-8859-1')
        detail_soup = BeautifulSoup(detail_text, 'lxml')
        # Get the chapter content
        content = detail_soup.find('div', class_='chapter_content').text
        # Persist it
        fp.write(title + ":" + content + "\n")
        print(title, '下载完成')
```

**Case 2:** Starbucks menu item names
```python
from bs4 import BeautifulSoup
import urllib.request

url = 'https://www.starbucks.com.cn/menu/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
soup = BeautifulSoup(content, 'lxml')
# XPath equivalent: //ul[@class="grid padded-3 product"]//strong/text()
name_list = soup.select('ul[class="grid padded-3 product"] strong')
for name in name_list:
    print(name.get_text())
```

```python
import csv
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}

def get(url):
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        content = response.text
        parse(content)

def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    # XPath equivalent: //ul[@class="grid padded-3 product"]/li/a/strong/text()
    name_list = soup.select('ul[class="grid padded-3 product"] strong')
    lists = []
    for name in name_list:
        lists.append(name.get_text())
    item_pipeline(lists)

def item_pipeline(names):
    for i in names:
        with open('星巴克.csv', 'a', encoding='utf-8') as fp:
            writer = csv.writer(fp)
            writer.writerow([i])

if __name__ == '__main__':
    url = 'https://www.starbucks.com.cn/menu/'
    get(url)
```

**Case 0:** get the text of Baidu's search button (百度一下)
```python
from lxml import etree
import urllib.request

# 1. Fetch the page source
# 2. Parse the server's response with etree.HTML
# 3. Print the result
url = 'https://www.baidu.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
# Build the request object
request = urllib.request.Request(url=url, headers=headers)
# Simulate a browser and get the response from the server
response = urllib.request.urlopen(request)
# Read the page source
content = response.read().decode('utf-8')
# Parse the page source and extract the data
tree = etree.HTML(content)
# xpath() returns a list
res = tree.xpath('//input[@id="su"]/@value')[0]
print(res)
```

Case 1: pic.netbian.com (4K anime wallpapers)
```python
import os
import requests
from lxml import etree
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().chrome}
url = 'https://pic.netbian.com/4kdongman/'
# Send the request and get the response
response = requests.get(url=url, headers=headers)
page_text = response.text
# Parse out the src and alt attributes
tree = etree.HTML(page_text)
list_data = tree.xpath('//div[@class="slist"]/ul/li')
# Create the output folder
if not os.path.exists('./PicLibs'):
    os.mkdir('./PicLibs')
for li in list_data:
    img_src = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
    img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
    # A general fix for mojibake in Chinese text
    img_name = img_name.encode("iso-8859-1").decode('gbk')
    # Request the image and persist it
    img_data = requests.get(url=img_src, headers=headers).content
    img_path = 'PicLibs/' + img_name
    with open(img_path, 'wb') as fp:
        fp.write(img_data)
    print(img_name, '下载成功!!!!')
```

Case 2: fabiaoqing.com (memes)
```python
import os
import requests
from lxml import etree
from fake_useragent import UserAgent

def crawl(url):
    headers = {"User-Agent": UserAgent().chrome}
    # Get the page
    page_text = requests.get(url, headers=headers).text
    # Parse out the detail-page URLs of the memes
    tree = etree.HTML(page_text)
    list_data = tree.xpath('//div[@class="ui segment imghover"]/div/a')
    if not os.path.exists('表情包'):
        os.mkdir('表情包')
    for i in list_data:
        detail_url = 'https://www.fabiaoqing.com' + i.xpath('./@href')[0]
        # Request the detail page and get the response
        detail_page_text = requests.get(detail_url, headers=headers).text
        tree = etree.HTML(detail_page_text)
        # Get the image's address, request it, and persist the bytes
        detail_list_data = tree.xpath('//div[@class="swiper-wrapper"]/div/img/@src')[0]
        file_name = detail_list_data.split('/')[-1]
        with open('表情包/' + file_name, 'wb') as fp:
            fp.write(requests.get(detail_list_data).content)
        print(file_name, '下载完了!!!')

crawl('https://www.fabiaoqing.com/biaoqing/lists/page/1.html')
```

Case 3: jdlingyu.com
```python
import os
import requests
from fake_useragent import UserAgent
from lxml import etree

def crawl():
    url = 'https://www.jdlingyu.com/tuji'
    headers = {"User-Agent": UserAgent().chrome}
    # Get the page source
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    list_url = tree.xpath('//li[@class="post-list-item item-post-style-1"]/div/div/a')
    for i in list_url:
        detail_link = i.xpath('./@href')[0]
        # Request the detail page and parse its images
        detail_data = requests.get(detail_link, headers=headers).text
        tree = etree.HTML(detail_data)
        # Album name
        list_name = tree.xpath('//article[@class="single-article b2-radius box"]//h1/text()')[0]
        # Create a folder per album
        if not os.path.exists('绝对领域\\' + list_name):
            os.makedirs('绝对领域\\' + list_name)
        # Image links
        list_link = tree.xpath('//div[@class="entry-content"]/p')
        for j in list_link:
            img_src = j.xpath('./img/@src')[0]
            # Image name
            pic_name = img_src.split('/')[-1]
            with open(f'绝对领域\\{list_name}\\{pic_name}', 'wb') as fp:
                fp.write(requests.get(img_src).content)
        print(list_name, '下载完成')

crawl()
```

Case 4: crawl the free résumé templates on sc.chinaz.com
```python
# Crawl the free résumé templates from 站长素材; site: https://sc.chinaz.com/
import os
import time
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://sc.chinaz.com/jianli/free.html'
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
    }
    # Fetch the listing page of free templates
    page_text = requests.get(url=url, headers=headers).text
    # Parse out each template's detail-page URL
    tree = etree.HTML(page_text)
    list_data = tree.xpath('//div[@id="main"]/div/div')
    if not os.path.exists('./网络爬虫/sucai/站长素材'):
        os.makedirs('./网络爬虫/sucai/站长素材')
    for i in list_data:
        # Parse out the wanted detail-page link
        detail_url = 'https:' + i.xpath('./a/@href')[0]
        # Get the template name; re-encode to fix the Chinese mojibake
        detail_name = i.xpath('./p/a/text()')[0]
        detail_name = detail_name.encode('iso-8859-1').decode('utf-8')
        # Request the detail page to download from it
        detail_data_text = requests.get(url=detail_url, headers=headers).text
        # Parse the data to get the download address
        tree = etree.HTML(detail_data_text)
        down_link = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li[1]/a/@href')[0]
        # Finally request the download address for the binary data
        final_data = requests.get(url=down_link, headers=headers).content
        file_path = './网络爬虫/sucai/站长素材/' + detail_name
        with open(file_path, 'wb') as fp:
            fp.write(final_data)
        time.sleep(1)  # 1-second delay to avoid crawling too fast
        print(detail_name + '下载成功!!!')
```

**Case 5:** sc.chinaz.com HD scenery images
```python
import os
import urllib.request
from lxml import etree

def create_request(page):
    if page == 1:
        url = 'https://sc.chinaz.com/tupian/fengjingtupian.html'
    else:
        url = f'https://sc.chinaz.com/tupian/fengjingtupian_{str(page)}.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def download(content):
    # Download the images into D:\图片\风景
    tree = etree.HTML(content)
    src_list = tree.xpath('//body/div[3]/div[2]/div/img/@data-original')
    name_list = tree.xpath('//body/div[3]/div[2]/div/img/@alt')
    for i in range(len(name_list)):
        name = name_list[i]
        url = 'https:' + src_list[i]
        print('正在下载 %s [%s]' % (name, url))
        urllib.request.urlretrieve(url=url, filename=fr'D:\图片\风景\{name}.jpg')

if __name__ == '__main__':
    if not os.path.exists(r'D:\图片\风景'):
        os.mkdir(r'D:\图片\风景')
    start_page = int(input('请输入起始页码:'))
    end_page = int(input('请输入结束页码:'))
    for page in range(start_page, end_page + 1):
        # 1. Build the request object
        request = create_request(page)
        # 2. Get the page source
        content = get_content(request)
        # 3. Download
        download(content)
```

Blog tutorial
JSON-format data
```json
{
  "store": {
    "book": [
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "author": "老王",
      "color": "red",
      "price": 19.95
    }
  }
}
```

Extracting data
```python
import json
import jsonpath

obj = json.load(open('../../../data/jsonpath.json', 'r', encoding='utf-8'))
# Authors of all books in the store
# author_list = jsonpath.jsonpath(obj, '$.store.book[*].author')
# All authors
# author_list = jsonpath.jsonpath(obj, '$..author')
# All elements of store: all the books and the bicycle
# tag_list = jsonpath.jsonpath(obj, '$.store.*')
# The price of everything in the store
# tag_list = jsonpath.jsonpath(obj, '$.store..price')
# The third book
# book = jsonpath.jsonpath(obj, '$..book[2]')
# The last book
# book = jsonpath.jsonpath(obj, '$..book[(@.length-1)]')
# The first two books
# book = jsonpath.jsonpath(obj, '$..book[0,1]')
# book = jsonpath.jsonpath(obj, '$..book[:2]')
# Filter all books that have an isbn
# book = jsonpath.jsonpath(obj, '$..book[?(@.isbn)]')
# Filter the books priced above 10
book = jsonpath.jsonpath(obj, '$..book[?(@.price>10)]')
print(book)
```

Captchas are an anti-crawling mechanism used by portal sites
Anti-crawling mechanism: captchas. Recognizing the data in the captcha image makes simulated login possible.
A property of the HTTP/HTTPS protocols: statelessness
Why the page data isn't returned: when the second request — for the personal home page — is made, the server doesn't know that request comes from a logged-in state.
cookie:
Used to let the server record information about the client.
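What the server records and the client sends back in the `Cookie` header can be inspected with the standard library's `http.cookies` — a sketch with invented values:

```python
from http.cookies import SimpleCookie

# Parse a Cookie header the way a server-side framework would:
jar = SimpleCookie()
jar.load('GUID=868e19f9; accessToken=abc123')
print(jar['GUID'].value)                  # 868e19f9
# Each entry (morsel) can be serialized back out:
print(jar['accessToken'].OutputString())  # accessToken=abc123
```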
Another way to carry the cookie
Put the cookie directly in the headers (not recommended)
```python
# Another way to carry the cookie when making the request
import requests

res = requests.get(
    'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919',
    headers={
        "Cookie": "GUID=868e19f9-5bb3-4a1e-b416-94e1d2713f04; BAIDU_SSP_lcr=https://www.baidu.com/link?url=r1vJtpZZQR2eRMiyq3NsP6WYUA45n6RSDk9IQMZ-lDT2fAmv28pizBTds9tE2dGm&wd=&eqid=f689d9020000de600000000462ce7cd7; Hm_lvt_9793f42b498361373512340937deb2a0=1657699549; sajssdk_2015_cross_new_user=1; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F08%252F08%252F86%252F97328608.jpg-88x88%253Fv%253D1657699654000%26id%3D97328608%26nickname%3D%25E9%2585%25B8%25E8%25BE%25A3%25E9%25B8%25A1%25E5%259D%2597%26e%3D1673251686%26s%3Dba90dcad84b8da40; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2297328608%22%2C%22%24device_id%22%3A%22181f697be441a0-087c370fff0f44-521e311e-1764000-181f697be458c0%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%2C%22first_id%22%3A%22868e19f9-5bb3-4a1e-b416-94e1d2713f04%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1657700275"
    }
)
print(res.json())
```

Cookie login: gushiwen.cn (古诗文网)
```python
import requests
from bs4 import BeautifulSoup

# https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
response = requests.get(url=login_url, headers=headers)
content = response.text
# Parse the page source to get __VIEWSTATE and __VIEWSTATEGENERATOR
soup = BeautifulSoup(content, 'lxml')
viewstate = soup.select('#__VIEWSTATE')[0].attrs.get('value')
viewstategenerator = soup.select('#__VIEWSTATEGENERATOR')[0].attrs.get('value')
# Get the captcha image and save it
img = soup.select('#imgCode')[0].attrs.get('src')
code_url = 'https://so.gushiwen.cn/' + img
# requests has a session() method; using the returned session object keeps the
# requests in one session (so the captcha cookie stays valid across requests)
# urllib.request.urlretrieve(code_url, '../../../data/code.jpg')
session = requests.session()
response_code = session.get(code_url)
# Images must be downloaded as binary data
content_code = response_code.content
with open('../../../data/code.jpg', 'wb') as f:
    f.write(content_code)
code_name = input('请输入验证码:')
# The login endpoint
url_post = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewstategenerator,
    'from': 'http://so.gushiwen.cn/user/collect.aspx',
    'email': 'xxxxxxxxx',
    'pwd': '123123',
    'code': code_name,
    'denglu': '登录',
}
response_post = session.post(url=url_post, headers=headers, data=data)
content_post = response_post.text
with open('../../../data/古诗文.html', 'w', encoding='utf-8') as f:
    f.write(content_post)
```

Defeating IP-ban anti-crawling mechanisms
Cases:
requests
```python
import requests

proxies = {
    # "http": "",
    "https": "222.110.147.50:3218"  # the proxy's IP address
}
res = requests.get("https://www.baidu.com/s?tn=87135040_1_oem_dg&ie=utf-8&wd=ip", proxies=proxies)
res.encoding = 'utf-8'
with open('ip.html', 'w') as fp:
    fp.write(res.text)
```

urllib.request
```python
import urllib.request

url = 'https://www.baidu.com/s?ie=UTF-8&wd=ip'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
proxies = {'http': '118.24.219.151:16817'}
# Route the simulated browser request through the proxy
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
# Read the response
content = response.read().decode('utf-8')
with open('../data/代理.html', 'w', encoding='utf-8') as f:
    f.write(content)
```

When certain links are accessed, the site traces the request's origin.
The core of this anti-crawling measure: when certain addresses on the site are accessed, the server traces back to your previous link.
So the request needs certain "environmental factors" — for example, carrying the right headers (such as Referer) when making the request.
Case: Pear Video (pearvideo.com)
```python
# Analysis:
# - The address the request returns differs from the address of the playable video.
# - The returned data splices in the returned systemTime value.
# - The real, watchable video address contains the ID from the browser's address bar.
# - Fix: replace the systemTime value in the returned URL with 'cont-' + the ID.
# url = 'https://www.pearvideo.com/video_1767372'
import os
import requests

if not os.path.exists('./网络爬虫/sucai/梨video'):
    os.makedirs('./网络爬虫/sucai/梨video')
# Crawl a video from Pear Video
url = 'https://www.pearvideo.com/video_1767372'
contId = url.split('_')[1]
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
    # The Referer lets the request pass the site's origin tracing
    "Referer": url,
}
videoStatusUrl = f'https://www.pearvideo.com/videoStatus.jsp?contId={contId}&mrd=0.6004271686556242'
res = requests.get(videoStatusUrl, headers=headers)
# print(res.json())
dic = res.json()
srcUrl = dic['videoInfo']['videos']['srcUrl']
systemTime = dic['systemTime']
srcUrl = srcUrl.replace(systemTime, f'cont-{contId}')
# Download the video and store it
video_data = requests.get(url=srcUrl, headers=headers).content
with open('./网络爬虫/sucai/梨video/video.mp4', 'wb') as fp:
    fp.write(video_data)
```

Goal: use asynchrony in the crawler to achieve high-performance data fetching
Asynchronous crawling approaches:
Multithreading / multiprocessing (not recommended)
Process pools / thread pools (use in moderation)
Open a batch of threads up front; the user simply submits tasks to the pool, and scheduling of the thread tasks is left to the pool.
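A minimal sketch of the submit-to-pool pattern with `concurrent.futures` — here `download` is a stand-in for a real page request:

```python
from concurrent.futures import ThreadPoolExecutor

def download(page):
    # stand-in for requests.get(...): pretend we fetched this page
    return f'page-{page} done'

# The pool schedules the worker threads; we only hand it tasks
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(download, range(1, 6)))
print(results)  # ['page-1 done', 'page-2 done', 'page-3 done', 'page-4 done', 'page-5 done']
```

`pool.map` returns results in submission order even though the workers run concurrently; `pool.submit` returns a Future per task when you need finer control.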
Requirement analysis
```python
import uuid
from multiprocessing import Queue, Process
from threading import Thread

import pymysql
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}


class DownloadThread(Thread):
    def __init__(self, url):
        super().__init__()
        self.url = url

    def run(self):
        print('开始下载', self.url)
        resp = requests.get(url=self.url, headers=headers)
        if resp.status_code == 200:
            resp.encoding = 'utf-8'
            self.content = resp.text
            print(self.url, '下载完成')

    def get_content(self):
        return self.content


class DownloadProcess(Process):
    """Download process"""

    def __init__(self, url_q, html_q):
        self.url_q: Queue = url_q
        self.html_q = html_q
        super().__init__()

    def run(self):
        while True:
            try:
                url = self.url_q.get(timeout=30)
                # Hand the download task to a child thread
                t = DownloadThread(url)
                t.start()
                t.join()
                # Collect the downloaded data
                html = t.get_content()
                # Push it onto the parse queue
                self.html_q.put((url, html))
            except Exception:
                break
        print('---下载进程over---')


class ParseThread(Thread):
    def __init__(self, html, url_q):
        self.html = html
        self.url_q = url_q
        super().__init__()

    def run(self):
        tree = etree.HTML(self.html)
        imgs = tree.xpath('//div[contains(@class,"com-img-txt-list")]//img')
        for img in imgs:
            item = {}
            item['id'] = uuid.uuid4().hex
            item['src'] = img.xpath('./@data-original')[0]
            item['name'] = img.xpath('./@alt')[0]
            # Write the item into the database
            conn = pymysql.connect(user='root', password='123.com',
                                   host='127.0.0.1', port=3306,
                                   database='dog', charset='utf8')
            cursor = conn.cursor()
            sql = 'insert into labuladuo(name,src) values("{}","{}")'.format(item['name'], item['src'])
            cursor.execute(sql)
            conn.commit()
            cursor.close()
            conn.close()
            print(item)
        # Get the link to the next page
        next_page = tree.xpath('//a[@class="nextpage"]/@href')
        if next_page:
            next_url = 'https://sc.chinaz.com/tupian/' + next_page[0]
            self.url_q.put(next_url)  # queue the new download task


class ParseProcess(Process):
    """Parse process"""

    def __init__(self, url_q, html_q):
        super().__init__()
        self.url_q = url_q
        self.html_q = html_q

    def run(self):
        while True:
            try:
                # Fetch a parse task
                url, html = self.html_q.get(timeout=60)
                # Start a parse thread
                print('开始解析', url)
                ParseThread(html, self.url_q).start()
            except Exception:
                break
        print('---解析进程over---')


if __name__ == '__main__':
    task1 = Queue()  # download task queue
    task2 = Queue()  # parse task queue
    # Seed crawl task
    task1.put('https://sc.chinaz.com/tupian/labuladuo.html')
    p1 = DownloadProcess(task1, task2)
    p2 = ParseProcess(task1, task2)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print('Over !!!!')
```

Case: a fruit trading site (guo68.com)
```python
# https://www.guo68.com/sell?page=1 — a Chinese fruit trading market.
# First work out how to crawl a single page,
# then use a thread pool to crawl many pages.
import csv
import os
import requests
from concurrent.futures import ThreadPoolExecutor
from fake_useragent import UserAgent
from lxml import etree

if not os.path.exists('./网络爬虫/sucai/中国水果交易'):
    os.makedirs('./网络爬虫/sucai/中国水果交易')
f = './网络爬虫/sucai/中国水果交易/fruit.csv'
fp = open(f, 'w', encoding='utf-8')
csvwrite = csv.writer(fp)

def download_one_page(url):
    # Fetch the page source
    response = requests.get(url=url, headers={"user-agent": UserAgent().chrome}).text
    # Parse the data
    tree = etree.HTML(response)
    list_data = tree.xpath('//li[@class="fruit"]/a')
    all_list_data = []
    for i in list_data:
        list_price = i.xpath('./p[2]/span[1]/text()')[0]
        list_sort = i.xpath('./p[1]/text()')[0]
        list_address = i.xpath('./p[2]/text()')[0]
        all_list_data.append([list_price, list_sort, list_address])
    # Persist the rows
    csvwrite.writerows(all_list_data)
    print(url, '下载完毕')

if __name__ == '__main__':
    # Use a thread pool with 50 threads in it
    with ThreadPoolExecutor(50) as pool:
        for i in range(1, 60):
            pool.submit(download_one_page, f'https://www.guo68.com/sell?page={i}')
    print("全部下载完毕")
```

Coroutines do not physically exist in the computer (a computer only has processes and threads); they are an abstraction created by programmers.
A coroutine, also called a microthread, is a user-mode context-switching technique. In short, it lets a single thread switch back and forth between code blocks as it executes.
Ways to implement coroutines:
greenlet, an early module
the yield keyword
the asyncio module
the async / await keywords [recommended]
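The yield-keyword approach listed above can be sketched with a toy round-robin scheduler — everything here is illustrative, but it shows how one thread alternates between two code blocks:

```python
log = []

def task1():
    for i in range(2):
        log.append(f't1-{i}')
        yield  # hand control back to the scheduler

def task2():
    for i in range(2):
        log.append(f't2-{i}')
        yield

# A toy scheduler: run each generator one step at a time, round-robin
tasks = [task1(), task2()]
while tasks:
    t = tasks.pop(0)
    try:
        next(t)          # resume the coroutine until its next yield
        tasks.append(t)  # not finished: queue it again
    except StopIteration:
        pass             # finished: drop it

print(log)  # ['t1-0', 't2-0', 't1-1', 't2-1']
```

The interleaved output shows the two "coroutines" taking turns on one thread, which is exactly what asyncio's event loop automates.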
Why coroutines matter
If a thread hits an IO wait, it doesn't sit idle waiting; it uses the free time to do something else.
Usable in Python 3.4 and later
```python
import asyncio

async def fun1():
    print(1)
    await asyncio.sleep(2)  # on hitting IO, switch automatically to another task
    print(2)

async def fun2():
    print(3)
    await asyncio.sleep(3)  # on hitting IO, switch automatically to another task
    print(4)

task = [
    asyncio.ensure_future(fun1()),
    asyncio.ensure_future(fun2()),
]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(task))  # switches automatically on IO blocks
```

Think of the event loop as an endless loop that checks for and runs certain code.
```python
import asyncio

# Create or fetch an event loop
loop = asyncio.get_event_loop()
# Put the task onto the 'task list'
loop.run_until_complete(task)
```

Coroutine function: a function defined as `async def name`.
Coroutine object: calling a coroutine function — `func()` — returns a coroutine object.
Calling the coroutine function creates the coroutine object; the code inside the function does not run.
To run the code inside a coroutine function, the coroutine object must be handed to an event loop.
await + an awaitable (a coroutine object, Task object, or Future object — i.e. an IO wait)
Example 1:
```python
import asyncio

async def fun():
    print("12374")
    await asyncio.sleep(2)
    print("结束")

asyncio.run(fun())
```

Example 2:
```python
import asyncio

async def others():
    print('start')
    await asyncio.sleep(2)
    print('end')
    return '返回值'

async def fun():
    print("执行协程函数内部代码")
    # On IO, the current coroutine (task) is suspended until the IO finishes;
    # while it is suspended, the event loop can run other coroutines (tasks)
    response = await others()
    print('IO请求结束,结果为:', response)

asyncio.run(fun())
```

Example 3:
```python
import asyncio

async def others():
    print('start')
    await asyncio.sleep(2)
    print('end')
    return '返回值'

async def fun():
    print("执行协程函数内部代码")
    # On IO, the current coroutine (task) is suspended until the IO finishes;
    # while it is suspended, the event loop can run other coroutines (tasks)
    response1 = await others()
    print('IO请求结束,结果为:', response1)
    response2 = await others()
    print('IO请求结束,结果为:', response2)

asyncio.run(fun())
```

await waits until the awaitable produces its result before continuing.
Plainly: add multiple tasks to the event loop.
Tasks are used to schedule coroutines concurrently. Creating a Task via asyncio.create_task(coroutine object) adds the coroutine to the event loop, where it waits to be scheduled. Besides asyncio.create_task(), the lower-level loop.create_task() or ensure_future() functions also work. Manually instantiating Task objects is discouraged.
Example 1:
```python
import asyncio

async def func():
    print(1)
    await asyncio.sleep(2)
    print(2)
    return '返回值'

async def main():
    print("main开始")
    # Create a Task, adding the func() coroutine to the event loop
    task1 = asyncio.create_task(func())
    # Create another Task, adding a second func() coroutine to the event loop
    task2 = asyncio.create_task(func())
    print("main结束")
    # When a running coroutine hits IO, execution switches to other tasks.
    # These awaits wait until the corresponding coroutines have fully finished
    # and fetch their results.
    res1 = await task1
    res2 = await task2
    print(res1, res2)

asyncio.run(main())
```

Example 2:
```python
import asyncio

async def func():
    print(1)
    await asyncio.sleep(2)
    print(2)
    return '返回值'

async def main():
    print("main开始")
    task_list = [
        asyncio.create_task(func(), name="n1"),
        asyncio.create_task(func(), name="n2"),
    ]
    print("main结束")
    # The return values end up in done
    done, pending = await asyncio.wait(task_list, timeout=None)
    print(done)

asyncio.run(main())
```

Example 3:
```python
import asyncio

async def func():
    print(1)
    await asyncio.sleep(2)
    print(2)
    return '返回值'

task_list = [
    func(),
    func(),
]
done, pending = asyncio.run(asyncio.wait(task_list))
print(done)
```

Task inherits from the Future object; the handling of a Task's await result is built on Future.
Example 1:
Example 2:
```python
import asyncio

async def set_after(fut):
    await asyncio.sleep(2)
    fut.set_result("666")

async def main():
    # Get the running event loop
    loop = asyncio.get_running_loop()
    # Create a task (a Future object) bound to no behaviour; on its own,
    # this task would never know when it is finished
    fut = loop.create_future()
    # Create a task (a Task object) bound to set_after(); after 2s the function
    # manually sets fut's final result, which lets fut finish
    await loop.create_task(set_after(fut))
    # Wait for the Future's final result (otherwise wait forever)
    data = await fut
    print(data)

asyncio.run(main())
```

Objects used when thread pools or process pools implement asynchronous operations
```python
import time
from concurrent.futures import Future
from concurrent.futures.process import ProcessPoolExecutor
from concurrent.futures.thread import ThreadPoolExecutor

def fun(value):
    time.sleep(1)
    print(value)

pool = ThreadPoolExecutor(max_workers=5)
# or: pool = ProcessPoolExecutor(max_workers=5)
for i in range(10):
    fut = pool.submit(fun, i)
    print(fut)
```

Case: asyncio + a module that does not support async
```python
import asyncio
import requests

async def down_load(url):
    # Send the request and download the image; on hitting the download IO,
    # control switches automatically to other tasks
    print('开始下载', url)
    loop = asyncio.get_event_loop()
    # requests does not support async by default, so pair it with a thread pool
    future = loop.run_in_executor(None, requests.get, url)
    response = await future
    print("下载完成")
    # Save the image locally
    file_name = url.rsplit('/')[-1]
    with open(file_name, 'wb') as fp:
        fp.write(response.content)

if __name__ == "__main__":
    urls = [
        '',  # image URLs elided in the original
        '',
    ]
    tasks = [down_load(url) for url in urls]
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
```

selenium is a module based on browser automation
Usage workflow
The chromedriver driver: download it and put it in the current directory, and that's it.
Check the driver-to-browser version mapping
chromedriver official documentation
Liulanmi (浏览迷): an archive of browser versions
Install selenium
```shell
pip install selenium
```

Instantiate a browser object:
```python
from selenium import webdriver

# Instantiate a browser object (pass in the browser driver)
bro = webdriver.Chrome(executable_path='./chromedriver')
```

Methods:
Methods for getting text and tag attributes
```python
# driver: the "variable name" given earlier to the opened browser
# .text: gets the text at that tag's position
# .get_attribute(value): gets a tag attribute
# value: the name of the attribute field
```

Usage: get the chapter titles of Dream of the Red Chamber
```python
from selenium import webdriver
from lxml import etree
from time import sleep

# Instantiate the browser driver object
drive = webdriver.Chrome()
# Have the browser request the given URL
drive.get("https://www.shicimingju.com/book/hongloumeng.html")
# page_source is the browser's current page source: the HTML after data loading
# and JS execution. The source you see with "view source" differs from what
# "inspect" shows under Elements — dynamically loaded parts don't appear in the
# raw page source.
page_text = drive.page_source
# Parse the chapter titles of Dream of the Red Chamber
tree = etree.HTML(page_text)
list_data = tree.xpath('//div[@class="book-mulu"]/ul/li')
for i in list_data:
    article_title = i.xpath('./a/text()')[0]
    print(article_title)
sleep(5)
# Close the browser
drive.quit()
```

Newer versions of selenium can no longer import PhantomJS.
Just knowing it exists is enough.
Handles the data flow of the whole system and triggers events (the framework's core).
Accepts requests from the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses, or links, of pages to crawl): it decides which URL to crawl next and removes duplicate URLs.
Downloads page content and returns it to the spiders (Scrapy's downloader is built on twisted, an efficient asynchronous model).
The spider does the main work: it extracts the needed information — the items — from specific pages. Users can also extract links from it so Scrapy continues crawling the next page.
Handles the items the spider extracts from pages. Its main jobs are persistence, validating that items are valid, and clearing out unneeded information. After a page is parsed by the spider, the items are sent to the item pipeline and processed through several specific steps in order.
selector()
css(): style selector; returns an iterable of selector objects
xpath(): XPath path expression
Common selector methods
To use pipelines, they must be enabled in `settings.py`:
```python
ITEM_PIPELINES = {
    # There can be many pipelines. Priority ranges from 1 to 1000;
    # the smaller the value, the higher the priority.
    'scrapt_dangdang_03.pipelines.ScraptDangdang03Pipeline': 300,
}
```

1.1 Writing the spider file
```python
import scrapy
from scrapt_dangdang_03.items import ScraptDangdang03Item

class DangSpider(scrapy.Spider):
    name = 'dang'
    allowed_domains = ['category.dangdang.com']
    start_urls = ['http://category.dangdang.com/cp01.01.02.00.00.00.html']

    def parse(self, response):
        # src = '//ul[@id="component_59"]/li//img/@src'
        # name = '//ul[@id="component_59"]/li//img/@alt'
        # price = '//ul[@id="component_59"]/li//p[3]/span[1]/text()'
        # Every selector object can call xpath again
        li_list = response.xpath('//ul[@id="component_59"]/li')
        for li in li_list:
            # The first image's attribute differs from the others':
            # the first uses src, the rest use data-original
            src = li.xpath('.//img/@data-original').extract_first()
            if not src:
                src = li.xpath('.//img/@src').extract_first()
            name = li.xpath('.//img/@alt').extract_first()
            price = li.xpath('.//p[3]/span[1]/text()').extract_first()
            print(src, name, price)
            item = ScraptDangdang03Item(src=src, name=name, price=price)
            # Hand each item to the pipeline as soon as it is built
            yield item
```

1.2 Writing the item file
```python
import scrapy


class ScraptDangdang03Item(scrapy.Item):
    # In plain terms: the fields of the data we want to download
    # image
    src = scrapy.Field()
    # name
    name = scrapy.Field()
    # price
    price = scrapy.Field()
```

1.3 Writing the pipeline file
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


# A pipeline must be enabled in settings before it can be used
class ScraptDangdang03Pipeline:
    fp = None

    def open_spider(self, spider):
        # Runs once, before the spider starts
        print('开始爬虫'.center(20, '-'))
        self.fp = open('book.json', 'w', encoding='utf-8')

    # item is the object yielded by the spider
    def process_item(self, item, spider):
        # # Opening a file for every item would touch the file too often:
        # with open('book.json', 'a', encoding='utf-8') as f:
        #     # write() requires a string
        #     # 'w' would overwrite the existing file content
        #     f.write(str(item))
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        print('结束爬虫'.center(20, '-'))
        self.fp.close()
```

Case 1
1.1 Writing the spider file
(Same spider code as shown in 1.1 above.)

1.2 Writing the items file
(Same items code as shown in 1.2 above.)

1.3 Writing the pipeline class
The file-writing pipeline is the same ScraptDangdang03Pipeline shown in 1.3 above; a second pipeline class is added for downloading the cover images.

```python
import urllib.request


# With multiple pipelines enabled, define an extra pipeline class
class DangDangDownloadPipeline:
    def process_item(self, item, spider):
        url = 'http:' + item.get('src')
        filename = './books/' + item.get('name') + '.jpg'
        print('正在下载{}'.format(item.get("name")).center(20, '-'))
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
```

Case 2
2.1 Writing the spider file
```python
import scrapy
from douban.items import DoubanItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    # allowed_domains = ['www.xxx.con']
    start_urls = ['https://www.qidian.com/all/']

    def parse(self, response):
        # Parse each novel's author and title
        list_data = response.xpath('//p[@class="book-img-text"]/ul/li')
        for i in list_data:
            title = i.xpath('./p[2]/h2/a/text()').extract_first()
            author = i.xpath('./p[2]/p/a/text()').extract_first()
            item = DoubanItem()
            item['title'] = title
            item['author'] = author
            # Submit the item to the pipeline
            yield item
```

2.2 Writing the items file
```python
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
```

2.3 Overriding the pipeline class
```python
from itemadapter import ItemAdapter
import pymysql


class DoubanPipeline:
    fp = None

    # Override the parent method; runs only once, when the spider starts
    def open_spider(self, spider):
        print('开始爬虫')
        self.fp = open('./qidian.txt', 'w', encoding='utf-8')

    # Handles item objects exclusively.
    # Receives the items submitted by the spider file;
    # called once for each item received.
    def process_item(self, item, spider):
        author = item['author']
        title = item['title']
        self.fp.write(title + ':' + author + '\n')
        # The returned item is passed to the next pipeline class in line
        return item

    def close_spider(self, spider):
        print('结束爬虫')
        self.fp.close()


# Each pipeline class stores one set of data on one platform or medium
class mysqlPipeline:
    conn = None
    cursor = None

    def open_spider(self, spider):
        # Fill in your own server IP, port and password
        self.conn = pymysql.Connect(host='服务器IP地址', port=3306, user='root',
                                    password='自己的密码', db='qidian', charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute('insert into qidian values(0,"%s","%s")'
                                % (item["title"], item["author"]))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

# Which pipeline class does the item object yielded by the spider
# ultimately end up in?
```

Case 1:
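The mysqlPipeline above interpolates values directly into the SQL string; parameterized queries are the safer pattern. A minimal sketch with stdlib sqlite3 (pymysql's `cursor.execute` accepts the same `(sql, params)` form, using `%s` placeholders instead of `?`):

```python
import sqlite3

# In-memory database for illustration only
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table qidian (id integer primary key, title text, author text)')

# Values are passed separately from the SQL, so quoting is handled safely
cur.execute('insert into qidian (title, author) values (?, ?)',
            ('大道朝天', '猫腻'))
conn.commit()

cur.execute('select title, author from qidian')
rows = cur.fetchall()
print(rows)  # [('大道朝天', '猫腻')]
```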
```python
import scrapy


class QidianSpider(scrapy.Spider):
    name = 'qidian'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qidian.com/all/']
    # A reusable URL template
    url = 'https://www.qidian.com/all/page%d/'
    page_num = 2

    def parse(self, response):
        list_data = response.xpath('//p[@id="book-img-text"]/ul/li')
        for i in list_data:
            title = i.xpath('./p[2]/h2/a/text()').extract_first()
            author = i.xpath('./p[2]/p/a[1]/text()').extract_first()
            style = i.xpath('./p[2]/p/a[2]/text()').extract_first()
            style1 = i.xpath('./p[2]/p/a[3]/text()').extract_first()
            type = style + '.' + style1
            # print(title, author, type)

        if self.page_num <= 3:
            new_url = format(self.url % self.page_num)
            self.page_num += 1
            # Send the request manually; the callback is the function
            # dedicated to parsing the response
            yield scrapy.Request(url=new_url, callback=self.parse)
```

Case 2: downloading book cover images from Dangdang's youth-literature category
```python
import scrapy
from scrapt_dangdang_03.items import ScraptDangdang03Item


class DangSpider(scrapy.Spider):
    name = 'dang'
    allowed_domains = ['category.dangdang.com']
    start_urls = ['http://category.dangdang.com/cp01.01.02.00.00.00.html']
    base_url = 'http://category.dangdang.com/pg'
    page = 1

    def parse(self, response):
        # Every Selector object can call xpath() again
        li_list = response.xpath('//ul[@id="component_59"]/li')
        for li in li_list:
            # The first image uses src; the others use data-original
            src = li.xpath('.//img/@data-original').extract_first()
            if not src:
                src = li.xpath('.//img/@src').extract_first()
            name = li.xpath('.//img/@alt').extract_first()
            price = li.xpath('.//p[3]/span[1]/text()').extract_first()
            print(src, name, price)
            item = ScraptDangdang03Item(src=src, name=name, price=price)
            # Hand each item to the pipeline as soon as it is built
            yield item

        if self.page < 100:
            self.page += 1
            url = self.base_url + str(self.page) + '-cp01.01.02.00.00.00.html'
            # url is the request address
            # callback is the function to run; do not add parentheses
            yield scrapy.Request(url=url, callback=self.parse)
```

Case: Qidian novels site
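The Dangdang spider above builds each next-page URL from `base_url`, the page number, and a fixed suffix; the same construction in isolation (URL pattern taken from that spider):

```python
base_url = 'http://category.dangdang.com/pg'
suffix = '-cp01.01.02.00.00.00.html'

def page_url(page):
    """Build the category-listing URL for a given page number."""
    return base_url + str(page) + suffix

print(page_url(2))  # http://category.dangdang.com/pg2-cp01.01.02.00.00.00.html
```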
Three core methods:
Case 1
* Difference between scraping string data and image data with scrapy:
  - Strings: just parse with XPath and submit to the pipeline for persistence.
  - Images: XPath can only extract the image's src attribute; a separate request to the image URL is needed to get the binary image data.
* Based on ImagesPipeline:
  - Only the img src value needs to be parsed and submitted to the pipeline; the pipeline then requests the image src itself, fetches the binary image data, and also handles persistence for us.
* Goal: scrape the high-resolution images from 站长素材
* Workflow:
  1. Parse the data (image URLs)
  2. Submit the item holding the image URL to the designated pipeline class
  3. In the pipeline file, define a custom pipeline class based on ImagesPipeline:
     - get_media_requests()
     - file_path()
     - item_completed()
  4. In the settings file:
     - Set the image storage directory: IMAGES_STORE = './imgs_站长素材'
     - Enable the custom pipeline

1. Writing the spider file to parse the data
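Of the three methods above, file_path() typically derives the saved file name from the request URL, keeping only the last path segment; that logic in isolation (the URL here is a made-up example):

```python
def file_name(url):
    """Keep the last URL path segment as the image file name."""
    return url.split('/')[-1]

print(file_name('https://sc.chinaz.com/some/path/picture1.jpg'))  # picture1.jpg
```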
```python
# 站长素材: parsing the high-resolution image pages
import scrapy
from imgPro.items import ImgproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        p_list = response.xpath('//p[@id="container"]/p')
        for i in p_list:
            img_address = 'https:' + i.xpath('./p/a/img/@src').extract_first()
            # print(img_address)
            # Instantiate an item object
            item = ImgproItem()
            # Store the address in the src field
            item['src'] = img_address
            # Submit the item to the pipeline
            yield item
```

2. Submit the image URL to the pipeline
```python
# In items.py, declare one field for the image URL
src = scrapy.Field()
```

3. Wrap the parsed data in the item
```python
from imgPro.items import ImgproItem

# Instantiate an item object
item = ImgproItem()
# Store the parsed address in the src field (the key must match the
# field name declared in items.py)
item['src'] = img_address
# Submit the item to the pipeline
yield item
```

4. Overriding the pipeline class in pipelines.py
```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class ImagsPipeline(ImagesPipeline):
    # Request the image data from the image URL stored in the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Specify the path for persisting the image
    def file_path(self, request, response=None, info=None):
        img_name = request.url.split('/')[-1]
        return img_name

    def item_completed(self, results, item, info):
        # The returned item is passed to the next pipeline class in line
        return item
```

5. Image-pipeline parameters in settings
```python
# Image storage directory
IMAGES_STORE = './imgs_站长素材'
# Thumbnails
IMAGES_THUMBS = {
    'small': (60, 32),
    'big': (120, 80),
}
```

6. settings configuration
```python
# Enable the pipeline; replace the class name with your custom one
ITEM_PIPELINES = {
    'imgPro.pipelines.ImagsPipeline': 300,
}
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Only show error-level log messages
LOG_LEVEL = "ERROR"
```

Case 2
Case 1: storing dushu.com data in a database
1. Create the project: `scrapy startproject dushuproject`
2. Change to the spiders path: `cd dushuproject\dushuproject\spiders`
3. Create the crawl spider: `scrapy genspider -t crawl read www.dushu.com`
4. items
5. spiders
6. settings
7. pipelines — save the data locally, then save it to a MySQL database

Storing the data in the database:
(1) settings parameters:

```python
DB_HOST = '127.0.0.1'
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = '1234'
DB_NAME = 'readbood'
DB_CHARSET = 'utf8'
```

(2) pipeline configuration:

```python
# Load the settings file
from scrapy.utils.project import get_project_settings
import pymysql


class MysqlPipeline(object):
    # __init__ and open_spider serve the same purpose here:
    # read the connection parameters from settings on startup
    def open_spider(self, spider):
        settings = get_project_settings()
        self.conn = pymysql.Connect(
            host=settings['DB_HOST'], port=settings['DB_PORT'],
            user=settings['DB_USER'], password=settings['DB_PASSWORD'],
            db=settings['DB_NAME'], charset=settings['DB_CHARSET'])
```

Implementation:
Case 2: downloading images with the default ImagesPipeline
Install and run the Redis server
```shell
docker pull redis
docker run -dit --name redis-server -p 6378:6379 redis
# then check the running container with: docker ps
```

Create the spider project
```shell
scrapy startproject dushu_redis
cd dushu_redis\dushu_redis\spiders
scrapy genspider guoxue dushu.com
```

Write the spider
```python
# Inherit from RedisSpider
from scrapy_redis.spiders import RedisSpider
from scrapy import Request


class GuoxueSpider(RedisSpider):
    name = 'guoxue'
    allowed_domains = ['dushu.com']
    # Start URLs are pushed to this Redis key instead of start_urls
    redis_key = 'gx_start_urls'

    def parse(self, response):
        for url in response.css('.sub-catalog a::attr("href")').extract():
            yield Request(url='https://www.dushu.com' + url, callback=self.parse_item)

    def parse_item(self, response):
        p_list = response.css('.book-info')
        for i in p_list:
            item = {}
            item['name'] = i.xpath('./p//img/@alt').get()
            item['cover'] = i.xpath('./p//img/@src').get()
            item['detail_url'] = i.xpath('./p/a/@href').get()
            yield item
        # Next page (guard against a missing link on the last page)
        next_url = response.css('.pages ').xpath('./a[last()]/@href').get()
        if next_url:
            yield Request(url='https://www.dushu.com' + next_url, callback=self.parse_item)
```

settings configuration
```python
# Scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
SCHEDULER_PERSIST = True
# De-duplication
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Redis message-queue server
REDIS_URL = 'redis://192.168.163.128:6378/0'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
ROBOTSTXT_OBEY = False
```

After configuring, run the spider
```shell
scrapy crawl guoxue
```

Connect to the Redis database
```shell
docker exec -it redis-server bash
root@1472cc4b0e69:/data# redis-cli
127.0.0.1:6379> select 0
OK
127.0.0.1:6379> keys *
(empty array)
127.0.0.1:6379> lpush gx_start_urls https://www.dushu.com/guoxue/
(integer) 1
127.0.0.1:6379>
```