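I recently became interested in web crawlers, so I spent a few days of the Spring Festival holiday learning the basics.
This post is a brief summary of what I learned.
A crawler's work can be roughly divided into three steps:

1. Analyze the structure of the site you want to crawl. If the content you are after cannot be found in the DevTools "Elements" panel, look for it in the "Network" panel.
2. Use requests or urllib to fetch the page's resources, which may be JSON data, HTML, images, videos, and so on.
3. Use BeautifulSoup or lxml to parse the data fetched in step 2 and extract the information you want.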

Fetching URLs

requests

Scraping the comments of a Bilibili video

import json
import re
import time

import pandas as pd
import requests

def convertTime(ctime):
    timeArray = time.localtime(ctime)  # convert a Unix timestamp (seconds since the epoch) to a local struct_time
    # For example, in UTC+8 the value 0 becomes
    # time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=8, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)
    myTime = time.strftime("%Y.%m.%d", timeArray)  # format the time as a "YYYY.MM.DD" string
    return myTime


# Get the aid of a video (used as the oid parameter of the comment API) from its BV id
def get_oid(bvid):
    video_url = 'https://www.bilibili.com/video/' + bvid
    page = requests.get(video_url).text
    aid = re.search(r'"aid":([0-9]+)', page).group(1)
    return aid

ID = "BV1244y1p7kt"
oid = get_oid(ID)  # look the oid up once instead of on every loop iteration
page = 0
authorMap = []
while True:
    time.sleep(0.05)  # pause between requests so the server does not treat this as a DoS attack

    r = requests.get("https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid={}&mode=3&plat=1&_=1644128621860".format(page, oid))

    data = json.loads(r.text)

    # Read the fields we need from the JSON: user name, comment text, date
    if data['data']['replies']:
        for i in data['data']['replies']:
            authorMap.append([i['member']['uname'],i['content']['message'],convertTime(i['ctime'])])
            if i['replies'] is not None:
                for j in i['replies']:
                    authorMap.append([j['member']['uname'],j['content']['message'],convertTime(j['ctime'])])
    else:
        # No regular replies on this page: fall back to the pinned comment
        i = data['data']['top']['upper']
        authorMap.append([i['member']['uname'],i['content']['message'],convertTime(i['ctime'])])
        if i['replies'] is not None:
            for j in i['replies']:
                authorMap.append([j['member']['uname'],j['content']['message'],convertTime(j['ctime'])])

    # Keep looping until every page of comments has been read
    if data['data']['cursor']['is_end']:
        break
    else:
        page += 1
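
The nested loops above can be pulled into a small helper, which also makes the paging logic testable without touching the network. The following is only a sketch: `format_date`, `flatten_replies`, `collect_comments`, `make_comment` and `fake_fetch` are hypothetical names, the sample pages are made up, and the payload is assumed to contain the same fields the code above reads (`replies`, `member.uname`, `content.message`, `ctime`, `cursor.is_end`). Dates are formatted in UTC here so the result does not depend on the local time zone.

```python
import time


def format_date(ctime):
    """Format a Unix timestamp as YYYY.MM.DD (in UTC, to keep the result deterministic)."""
    return time.strftime("%Y.%m.%d", time.gmtime(ctime))


def flatten_replies(replies):
    """Turn a list of comments, each with optional nested replies, into flat rows."""
    rows = []
    for comment in replies or []:
        rows.append([comment["member"]["uname"],
                     comment["content"]["message"],
                     format_date(comment["ctime"])])
        # Nested replies share the same structure; the value may be None instead of a list
        rows.extend(flatten_replies(comment.get("replies")))
    return rows


def collect_comments(fetch_page, delay=0.0):
    """Call fetch_page(0), fetch_page(1), ... until the cursor reports the end."""
    rows, page = [], 0
    while True:
        data = fetch_page(page)["data"]
        rows.extend(flatten_replies(data.get("replies")))
        if data["cursor"]["is_end"]:
            return rows
        page += 1
        time.sleep(delay)


def make_comment(name, text, ctime, replies=None):
    """Build a made-up comment with only the fields used above."""
    return {"member": {"uname": name}, "content": {"message": text},
            "ctime": ctime, "replies": replies}


# Made-up pages standing in for real API responses
FAKE_PAGES = [
    {"data": {"replies": [make_comment("alice", "first", 0,
                                       [make_comment("bob", "reply", 86400)])],
              "cursor": {"is_end": False}}},
    {"data": {"replies": None, "cursor": {"is_end": True}}},
]


def fake_fetch(page):
    return FAKE_PAGES[page]


comments = collect_comments(fake_fetch)
print(comments)
```

In the real crawler, `fetch_page` would wrap the `requests.get(...)` call and return the decoded JSON.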

Scraping a MOOC video

import requests
import json
from ffmpy import FFmpeg
import os

m3u8url = f"https://mooc2vod.stu.126.net/nos/hls/2020/11/03/693/15bc7544-1305-4aee-bffd-9b33a4c43316_8.m3u8?ak=7909bff134372bffca53cdc2c17adc27a4c38c6336120510aea1ae1790819de8dd311591b24bac29eff49e0d7a7721ff70e0fc008011ee997b6e824dffac03753059f726dc7bb86b92adbc3d5b34b132ef8f2d0c2972470a66a7ee77174b2162b8ae77e29788836745b7125f174b3914"
# The m3u8 file lists the addresses of all the ts segments that make up the video,
# for example:   15bc7544-1305-4aee-bffd-9b33a4c43316_8_01.ts

# Read the ts addresses out of the m3u8 file and return them as a list
def Fromm3u8GetTs(m3u8url):
    lasts = m3u8url.rfind('/')
    prefix = m3u8url[:lasts+1]   # yields https://mooc2vod.stu.126.net/nos/hls/2020/11/03/693/

    m3u8 = requests.get(m3u8url)
    str_m3u8 = m3u8.text
    strLine = str_m3u8.splitlines()
    tsList=[]
    for i in strLine:
        if i.endswith('.ts'):    # a line ending in .ts is the address of a ts segment
            tsList.append(prefix+i)
    return tsList


tsList = Fromm3u8GetTs(m3u8url)



## Download the ts files
if not os.path.exists("temp"):
    os.mkdir("temp")
filename=""
for ts in tsList:
    tsName = ts.rsplit('/', 1)[-1]    # file name of the segment, taken from the end of its URL
    response = requests.get(ts)
    with open("temp/{}".format(tsName),mode='wb') as f:
        f.write(response.content)    # response.content holds the raw bytes of the response
    filename+=("file '"+os.path.join(os.path.abspath("temp"),tsName)+"'\n")
with open("temp/filename.txt",mode="w") as fl:
    fl.write(filename)    # segment list in the format read by ffmpeg's concat demuxer

## Merge the ts files with FFmpeg
# Paths are built from the working directory instead of a hard-coded location
ff = FFmpeg(
    global_options="-f concat -safe 0",
    inputs={os.path.abspath("temp/filename.txt"):None},
    outputs={os.path.abspath("temp/output.mp4"):'-c copy'}
)
# Equivalent command: ffmpeg -f concat -safe 0 -i filename.txt -c copy output.mp4
ff.run()
# Once the merged file exists, delete the individual ts segments
if os.path.exists("temp/output.mp4"):
    for i in os.listdir("temp"):
        j = os.path.join(os.getcwd(),"temp",i)
        if j.endswith(".ts"):
            os.remove(j)
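
The string slicing used to build segment addresses works for this particular URL; a somewhat more general way to resolve them is `urllib.parse.urljoin`. The sketch below parses a playlist that is already in memory, so it runs offline. The function names, the playlist text and the example.com URL are invented for illustration, and real playlists can contain tags that are not handled here.

```python
import os
from urllib.parse import urljoin, urlsplit


def ts_urls_from_playlist(playlist_text, playlist_url):
    """Return absolute URLs for every segment line in an m3u8 playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; the rest are segment URIs
        if line and not line.startswith("#"):
            urls.append(urljoin(playlist_url, line))
    return urls


def concat_list(ts_urls, directory):
    """Build the text of an ffmpeg concat list for segments saved in `directory`."""
    lines = []
    for url in ts_urls:
        name = os.path.basename(urlsplit(url).path)  # drop any query string
        lines.append("file '{}'".format(os.path.join(directory, name)))
    return "\n".join(lines) + "\n"


SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
video_8_01.ts
#EXTINF:10.0,
video_8_02.ts?token=abc
#EXT-X-ENDLIST
"""

urls = ts_urls_from_playlist(SAMPLE_PLAYLIST,
                             "https://example.com/hls/video_8.m3u8?ak=123")
print(urls)
print(concat_list(urls, "temp"))
```

Treating every non-comment line as a segment also picks up segment names that carry a query string, which a plain `endswith('.ts')` check would skip.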

requests offers many more options.

For example, get/post calls also accept parameters such as headers:

myheader = {
    "Host": "dxx.scyol.com",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.43",
    "Accept": "text/css,*/*;q=0.1",
    "Referer": "http://dxx.scyol.com/dxxBackend/",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6"
}

d = {'key1': 'value1', 'key2': 'value2'}

r = requests.post("http://dxx.scyol.com/backend/study/student/list",headers = myheader,data=d)

For other usage, see: Requests: HTTP for Humans (Requests 2.18.1 documentation, python-requests.org)

urllib

For a comparison of the two libraries, see the CSDN blog post "Differences between Python's urllib and requests".
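
To make the comparison concrete, here is roughly how a POST with custom headers and form data, like the one in the requests section, could be assembled with urllib alone. This is a sketch: the example.com URL and the header values are placeholders, and the request is only built and inspected, not sent.

```python
import urllib.parse
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (placeholder)",
    "Referer": "http://example.com/",
}
form = {"key1": "value1", "key2": "value2"}

# urllib expects the request body as bytes, so the form is URL-encoded by hand;
# requests performs this step itself when given data=<dict>
body = urllib.parse.urlencode(form).encode("utf-8")

req = urllib.request.Request("http://example.com/list", data=body, headers=headers)

print(req.get_method())    # a request that carries a body defaults to POST
print(req.data)
print(req.header_items())

# Sending it would look like this (commented out so the sketch runs offline):
# with urllib.request.urlopen(req) as resp:
#     text = resp.read().decode("utf-8")
```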

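urllib.request.urlretrieve downloads the file at a URL to a given local path. Its reporthook parameter accepts a callback function, which can be used to display download progress.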

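import os
import urllib.request

def show_progress(count, block_size, total_size):
    # reporthook callback: count is the number of blocks transferred so far,
    # block_size is the size of a block in bytes, total_size is the file size in bytes
    if total_size <= 0:  # the server did not report a size
        print('downloading ...')
        return
    num = min(100, round(count * block_size * 100.0 / total_size))
    print('downloading ... %d%%' % num)

source_url = "https://pubs.usgs.gov/fs/2020/3062/fs20203062.pdf"
target_file = "build/newss.pdf"
os.makedirs(os.path.dirname(target_file), exist_ok=True)  # urlretrieve does not create missing directories
print('downloading ... ')
urllib.request.urlretrieve(source_url, filename=target_file, reporthook=show_progress)
print('downloading ... done')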
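
`urlretrieve` also accepts `file://` URLs, which gives a way to try out a `reporthook` without a network connection. In the sketch below a temporary file stands in for the remote resource; `percent_done`, `record_progress` and the file names are made up for this example.

```python
import pathlib
import tempfile
import urllib.request

progress = []  # percentages reported by the hook, in call order


def percent_done(count, block_size, total_size):
    """Percentage transferred after `count` blocks, capped at 100."""
    if total_size <= 0:  # size unknown
        return 0
    return min(100, count * block_size * 100 // total_size)


def record_progress(count, block_size, total_size):
    """reporthook callback that stores the percentage instead of printing it."""
    progress.append(percent_done(count, block_size, total_size))


with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / "source.bin"
    target = pathlib.Path(tmp) / "copy.bin"
    source.write_bytes(b"x" * 20000)

    # as_uri() turns the absolute path into a file:// URL
    urllib.request.urlretrieve(source.as_uri(), filename=str(target),
                               reporthook=record_progress)
    copied = target.read_bytes()

print(progress)
```

The hook is first called with a block count of 0 before any data is read, which is why the percentage is computed from `count * block_size` rather than by adding `block_size` on every call.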

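See also the article "Downloading network files to local disk with urllib's urlretrieve()", which additionally uses XPath to parse the HTML and batch-download the images on a page.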

Parsing page data

BeautifulSoup

import requests
from bs4 import BeautifulSoup

res=requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html=res.text
#print(html)
soup=BeautifulSoup(res.text,'html.parser')
items=soup.find_all(class_='books')
for item in items:
    name=item.find('h2')
    title = item.find(class_='title')  # within each element of the list, find the tag whose class is 'title'
    brief = item.find(class_='info')  # within each element of the list, find the tag whose class is 'info'
    print(name.text, '\n', title.text, '\n', brief.text)  # print the extracted data


nav_id = soup.find_all(id="nav")[0]
catlog0 = nav_id.find_all(class_ = "catlog")[0]
catlog0.text

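At its core, this module is a tool for quickly extracting data from HTML and similar text.

Use find and find_all to locate elements, and the text attribute to extract their text.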
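
As a small offline illustration of `find`, `find_all` and `text`, the sketch below parses an inline HTML string instead of a live page. The markup, class names and book titles are invented for this example.

```python
from bs4 import BeautifulSoup

html = """
<div class="books">
  <h2>Science fiction</h2>
  <a class="title" href="/books/1">Book One</a>
  <p class="info">A short blurb.</p>
</div>
<div class="books">
  <h2>History</h2>
  <a class="title" href="/books/2">Book Two</a>
  <p class="info">Another blurb.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every match as a list; find returns only the first match
records = []
for item in soup.find_all(class_="books"):
    records.append((item.find("h2").text,
                    item.find(class_="title").text,
                    item.find(class_="info").text))

# Attributes are read with dictionary-style access
first_href = soup.find("a", class_="title")["href"]

# find returns None when nothing matches, so check before using .text
missing = soup.find(class_="no-such-class")

print(records)
print(first_href, missing)
```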

For more detailed usage, see the Beautiful Soup 4.4.0 documentation (Beautiful Soup 4.2.0 Chinese documentation).

lxml

A simple example shows how to parse a page with lxml:

import requests
from lxml import etree

url='https://bj.58.com/ershoufang/'
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36 Edg/93.0.961.47'
}
res_text=requests.get(url=url,headers=headers).text

## The basic workflow is shown below
my_etree=etree.HTML(res_text)
ii=my_etree.xpath('/html/body/div[1]//text()')
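
The same flow can be tried on an inline HTML string, which makes the effect of each XPath expression easier to see. The markup below is made up; `etree.HTML` and `xpath` are the same calls used above.

```python
from lxml import etree

html = """
<html><body>
  <div class="list">
    <ul>
      <li><a href="/item/1"><img src="/img/1.jpg" alt="first"></a> Item one</li>
      <li><a href="/item/2"><img src="/img/2.jpg" alt="second"></a> Item two</li>
    </ul>
  </div>
</body></html>
"""

tree = etree.HTML(html)

# //text() collects every text node under the matched element,
# including whitespace-only nodes, which are filtered out here
texts = [t.strip() for t in tree.xpath('//div[@class="list"]//text()') if t.strip()]

# xpath() always returns a list; @attr selects attribute values
items = tree.xpath('//div[@class="list"]/ul/li')
srcs = [li.xpath('./a/img/@src')[0] for li in items]
alts = [li.xpath('./a/img/@alt')[0] for li in items]

print(texts)
print(srcs, alts)
```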

Besides text, we can also collect images. The following example shows how to use lxml to extract images:

import os

url='https://pic.netbian.com/4kmeinv/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36 Edg/93.0.961.47'
}
response=requests.get(url=url, headers=headers)
#response.encoding='gbk'   # this also works, and makes the re-encoding step below unnecessary
res_text = response.text

my_etree=etree.HTML(res_text)
##li_list=my_etree.xpath('/html/body/div[@class="wrap clearfix"]/div/div[@class="slist"]/ul/li')
li_list = my_etree.xpath('//div[@class="slist"]/ul/li')

if not os.path.exists('./PicDic'):
    os.mkdir('./PicDic')


for li in li_list:
    img_src='https://pic.netbian.com'+li.xpath('./a/img/@src')[0]
    img_name=li.xpath('./a/img/@alt')[0]+'.jpg'

    # The page is GBK-encoded but was decoded as ISO-8859-1; re-encode and decode to repair the name
    img_name=img_name.encode('iso-8859-1').decode('gbk')


    print(img_name+'+'+img_src+'\n')
    # Save the image to disk
    img_data=requests.get(img_src,headers=headers).content  # raw bytes of the image
    img_path='PicDic/'+img_name
    with open(img_path,'wb') as fp:
        fp.write(img_data)
        print(img_name+' downloaded successfully!\n')
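
The `encode('iso-8859-1').decode('gbk')` step in the loop repairs text that was decoded with the wrong codec: the page is GBK-encoded, but when the response does not declare a charset, requests may fall back to ISO-8859-1. The effect can be reproduced with plain strings; the sample text here is arbitrary.

```python
original = "风景图片"

# What the server sends: the text as GBK bytes
raw = original.encode("gbk")

# What you get if those bytes are decoded as ISO-8859-1 by mistake
garbled = raw.decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so encoding the garbled
# string back gives the original bytes, which can then be decoded correctly
repaired = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))
print(repaired)
```

Setting `response.encoding = 'gbk'` before reading `response.text`, as in the commented-out line, avoids the problem at the source.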

XPath Syntax | 菜鸟教程 (runoob.com)

Other crawler-related topics

selenium

This is a browser-automation module; among other things, it can be used to simulate logging in.

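from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

chromeOptions = webdriver.ChromeOptions()
prefs = {"download.default_directory": "D:\\myTemp\\python\\爬虫\\learn01\\zidonghua\\files"}
chromeOptions.add_experimental_option("prefs", prefs)  # set the default download directory
# Selenium 4 style; older versions took the driver path as the first positional argument
chrome_driver=webdriver.Chrome(service=Service("./chromedriver.exe"), options=chromeOptions)


# Open the page
chrome_driver.get('https://www.taobao.com/')

# Locate an element by id
search_keys=chrome_driver.find_element(By.ID, 'q')
# Interact with the element: type the search term ("socks")
search_keys.send_keys('袜子')

# Locate by class name
search_button=chrome_driver.find_element(By.CLASS_NAME, 'btn-search')
search_button.click()

# Locate by XPath (the path is specific to the page being automated)
btn_banji = chrome_driver.find_element(By.XPATH,
    '/html/body/div[1]/div/div[2]/section/div/div[3]/div/div[3]/table/tbody/tr/td[7]/div/button[4]')

btn_banji.click()
sleep(2.5)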

tarfile

This module reads and writes tar archives; here it is used to extract one.

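import os
import tarfile
tar = tarfile.open(target_file, "r:gz")
tar.extractall()
#TarFile.extractall(path='.', members=None, *, numeric_owner=False)
# Extracts all members of the archive to the current working directory or to the directory given by path.
# If the optional members argument is given, it must be a subset of the list returned by getmembers().
# Directory information such as owner, modification time and permissions is set after all members have been extracted.
# This works around two problems: a directory's modification time is reset every time a file is created in it,
# and extracting files into a directory fails if that directory's permissions do not allow writing.
tar.close()
os.remove(target_file)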

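Mode	Action
`'r'` or `'r:*'`	Open for reading with transparent compression (recommended).
`'r:'`	Open for reading without compression.
`'r:gz'`	Open for reading with gzip compression.
`'r:bz2'`	Open for reading with bzip2 compression.
`'r:xz'`	Open for reading with lzma compression.
`'x'` or `'x:'`	Create a tarfile without compression. Raises `FileExistsError` if the file already exists.
`'x:gz'`	Create a tarfile with gzip compression. Raises `FileExistsError` if the file already exists.
`'x:bz2'`	Create a tarfile with bzip2 compression. Raises `FileExistsError` if the file already exists.
`'x:xz'`	Create a tarfile with lzma compression. Raises `FileExistsError` if the file already exists.
`'a'` or `'a:'`	Open for appending without compression. The file is created if it does not exist.
`'w'` or `'w:'`	Open for uncompressed writing.
`'w:gz'`	Open for gzip-compressed writing.
`'w:bz2'`	Open for bzip2-compressed writing.
`'w:xz'`	Open for lzma-compressed writing.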
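
A short round trip shows a few of these modes in action. This is a sketch that works entirely inside a temporary directory; the archive and member names are arbitrary.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    member_path = os.path.join(tmp, "hello.txt")
    archive_path = os.path.join(tmp, "demo.tar.gz")
    extract_dir = os.path.join(tmp, "out")
    os.makedirs(extract_dir)

    with open(member_path, "w", encoding="utf-8") as f:
        f.write("hello tar")

    # 'w:gz' creates (or overwrites) a gzip-compressed archive
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(member_path, arcname="hello.txt")

    # 'r:*' detects the compression automatically
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
        tar.extractall(path=extract_dir)

    with open(os.path.join(extract_dir, "hello.txt"), encoding="utf-8") as f:
        extracted_text = f.read()

    # 'x:gz' refuses to overwrite a file that already exists
    try:
        tarfile.open(archive_path, "x:gz")
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

print(names, extracted_text, exclusive_failed)
```

Using `with` closes the archive automatically, so the explicit `tar.close()` call is not needed in this style.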

Other crawling issues

Handling a page that cannot be refreshed once DevTools is open

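When analyzing a site, we normally open DevTools, press the refresh button, and watch what loads in the "Network" panel. With some sites, such as the MOOC site, the page can no longer be refreshed once DevTools is open, and the pop-up shown in the figure below appears.

[Figure: the pop-up that appears, with a play button]

In this case, click "Deactivate breakpoints" (shown in the figure below), then click the play button in the pop-up above; refreshing then works normally again.

[Figure: the "Deactivate breakpoints" button in DevTools]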

Installing ffmpeg and ffmpy in a Python environment

# Install ffmpeg
conda install ffmpeg

# Install ffmpy
pip install ffmpy
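
ffmpy only builds and runs the command line, so the ffmpeg executable itself has to be reachable. One quick way to check this from Python is `shutil.which`. This is a minimal sketch, and `ffmpeg_available` is a name made up here.

```python
import shutil


def ffmpeg_available(executable="ffmpeg"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if ffmpeg_available():
    print("ffmpeg found at", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found; install it or add it to PATH before calling ff.run()")
```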

ffmpy documentation