Tools
First, a quick introduction to the tools used: requests and bs4.
We mainly use requests.get to send a GET request to the server, then use BeautifulSoup from bs4 to parse the returned HTML and extract the content we actually need; see each library's documentation for the details.
Scraping a Page
First we use requests to fetch the HTML of a novel page; here I picked the book 《心魔》 on Biqukan (笔趣看).
Open the first chapter and note down its URL.
Write the following code in PyCharm to fetch that page's HTML:
```python
import requests

if __name__ == '__main__':
    target = 'https://www.bqkan.com/11_11154/4135502.html'
    req = requests.get(url=target)
    print(req.text)
```
The result is the page's raw HTML, as shown in the figure.
This is where BeautifulSoup comes in.
The content we need lives inside showtxt, and there is only one showtxt on the page, so we just search all the divs for it:
```python
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    target = 'https://www.bqkan.com/11_11154/4135502.html'
    req = requests.get(url=target)
    html = req.text
    bf = BeautifulSoup(html, features="html.parser")  # specify a parser to avoid the bs4 warning
    texts = bf.find_all('div', class_='showtxt')
    print(texts)
```
find_all('div', class_='showtxt') means "find every div whose class is showtxt"; the trailing underscore is needed because class by itself is a reserved keyword in Python.
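As a side note (my addition, not from the original post), the same query can be written with the attrs dictionary, which sidesteps the trailing underscore entirely:

```python
# Equivalent lookup using the attrs dict instead of the class_ keyword argument
texts = bf.find_all('div', attrs={'class': 'showtxt'})
```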
We successfully got the showtxt content, but it is mixed with a lot of junk we don't want, so it needs cleaning so that only the text remains.
Looking closer, besides the <br> tags there is a run of eight non-breaking spaces (\xa0) between every two lines, so we replace that run with newlines and take only the text, ignoring all the tags:
```python
print(texts[0].text.replace('\xa0' * 8, '\n\n'))
```
This gives us the text of the whole chapter.
Scraping the Table of Contents
Now we want to scrape the entire novel, which means we first need the URL of every chapter and then fetch the text from each URL.
```python
import requests

if __name__ == '__main__':
    target = 'https://www.bqkan.com/11_11154/'
    req = requests.get(url=target)
    print(req.text)
```
The chapter list should be right here, but unfortunately the Chinese comes out as mojibake.
```python
if __name__ == '__main__':
    target = 'https://www.bqkan.com/11_11154/'
    req = requests.get(url=target).content.decode('GBK')
    print(req)
```
Decoding with GBK fixes it (I'm not sure why this works; I found it by trial and error).
Now we get the table-of-contents page perfectly (although even without readable Chinese it wouldn't really matter, as long as we can see the chapter URLs).
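The likely reason, for what it's worth: the page itself is GBK-encoded, but requests guesses the encoding from the HTTP response headers, and when they don't declare a charset it typically falls back to ISO-8859-1, so req.text comes out garbled; decoding the raw bytes as GBK matches the page's actual encoding. A sketch of an alternative (my addition, not in the original) is to let requests detect the encoding from the body:

```python
import requests

if __name__ == '__main__':
    target = 'https://www.bqkan.com/11_11154/'
    req = requests.get(url=target)
    req.encoding = req.apparent_encoding  # detected from the bytes, e.g. 'GBK'/'GB2312'
    print(req.text)
```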
Then we inspect the element and find that the table-of-contents div is named listmain, so we extract listmain:
```python
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    target = 'https://www.bqkan.com/11_11154/'
    req = requests.get(url=target).content.decode('GBK')
    bf = BeautifulSoup(req, features="html.parser")
    listmain = bf.find_all('div', class_='listmain')
    print(listmain)
```
Then we pull each <a> tag out and concatenate its href with the server address to get the full chapter URL:
```python
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    server = 'http://www.biqukan.com'
    target = 'https://www.bqkan.com/11_11154/'
    req = requests.get(url=target).content.decode('GBK')
    bf = BeautifulSoup(req, features="html.parser")
    listmain = bf.find_all('div', class_='listmain')
    a_bf = BeautifulSoup(str(listmain[0]), features="html.parser")
    a = a_bf.find_all('a')
    for each in a:
        print(each.string, server + each.get('href'))
```
This gives us chapter names plus their URLs.
The first 12 entries are links we don't want, so we slice them off the list during processing (the final code below also drops the last two entries).
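As an aside (my addition, not from the original post): if the hrefs were ever a mix of absolute and relative links, urllib.parse.urljoin would handle the joining more robustly than plain string concatenation:

```python
from urllib.parse import urljoin

# urljoin copes with both absolute and relative hrefs
full_url = urljoin('http://www.biqukan.com', '/11_11154/4135502.html')
print(full_url)  # http://www.biqukan.com/11_11154/4135502.html
```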
Now let's pull the earlier pieces together and turn them into methods of a class:
```python
import requests
from bs4 import BeautifulSoup


class downloader(object):
    def __init__(self):
        self.server = 'http://www.biqukan.com'
        self.target = 'https://www.bqkan.com/11_11154/'
        self.names = []
        self.urls = []
        self.nums = 0

    def get_download_url(self):
        req = requests.get(url=self.target).content.decode('GBK')
        bf = BeautifulSoup(req, features="html.parser")
        listmain = bf.find_all('div', class_='listmain')
        a_bf = BeautifulSoup(str(listmain[0]), features="html.parser")
        a = a_bf.find_all('a')
        self.nums = len(a[12:-2])
        for each in a[12:-2]:
            self.names.append(each.string)
            self.urls.append(self.server + each.get('href'))

    def get_contents(self, target):
        req = requests.get(url=target)
        html = req.text
        bf = BeautifulSoup(html, features="html.parser")
        texts = bf.find_all('div', class_='showtxt')
        texts = texts[0].text.replace('\xa0' * 8, '\n\n')
        return texts


if __name__ == "__main__":
    dl = downloader()
    dl.get_download_url()
```
Download and Save
Finally we need a piece that writes the scraped content to disk:
```python
def writer(self, name, path, text):
    write_flag = True
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.writelines(text)
        f.write('\n\n')
```
The main function:
```python
if __name__ == "__main__":
    dl = downloader()
    dl.get_download_url()
    print('-----------开始下载----------')
    for i in range(dl.nums):
        dl.writer(dl.names[i], '心魔.txt', dl.get_contents(dl.urls[i]))
        print(dl.names[i] + ' 下载完成')
    print('---------全书下载完成---------')
```
The complete code is as follows:
```python
import requests
from bs4 import BeautifulSoup


class downloader(object):
    def __init__(self):
        self.server = 'http://www.biqukan.com'           # site root, used to build full chapter URLs
        self.target = 'https://www.bqkan.com/11_11154/'  # table-of-contents page of the novel
        self.names = []   # chapter titles
        self.urls = []    # chapter URLs
        self.nums = 0     # number of chapters

    def get_download_url(self):
        # Fetch the (GBK-encoded) table of contents and collect chapter names and URLs
        req = requests.get(url=self.target).content.decode('GBK')
        bf = BeautifulSoup(req, features="html.parser")
        listmain = bf.find_all('div', class_='listmain')
        a_bf = BeautifulSoup(str(listmain[0]), features="html.parser")
        a = a_bf.find_all('a')
        self.nums = len(a[12:-2])  # skip the first 12 and last 2 non-chapter links
        for each in a[12:-2]:
            self.names.append(each.string)
            self.urls.append(self.server + each.get('href'))

    def get_contents(self, target):
        # Fetch one chapter page and return its cleaned-up text
        req = requests.get(url=target)
        html = req.text
        bf = BeautifulSoup(html, features="html.parser")
        texts = bf.find_all('div', class_='showtxt')
        texts = texts[0].text.replace('\xa0' * 8, '\n\n')
        return texts

    def writer(self, name, path, text):
        # Append one chapter (title + text) to the output file
        write_flag = True
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.writelines(text)
            f.write('\n\n\n')


if __name__ == "__main__":
    dl = downloader()
    dl.get_download_url()
    print('-----------开始下载----------')
    for i in range(dl.nums):
        dl.writer(dl.names[i], '心魔.txt', dl.get_contents(dl.urls[i]))
        print(dl.names[i] + ' 下载完成')
    print('---------全书下载完成---------')
```
Then you can happily let it download away, though it is fairly slow (multithreading could speed it up, but I haven't learned that yet).
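For what it's worth, here is a minimal sketch (my addition, not the original author's code) of how the download loop could be parallelized with concurrent.futures; executor.map returns results in submission order, so chapters are still written to the file in order:

```python
# Hypothetical speed-up sketch, assuming the downloader class above is defined.
from concurrent.futures import ThreadPoolExecutor

if __name__ == "__main__":
    dl = downloader()
    dl.get_download_url()
    print('-----------开始下载----------')
    with ThreadPoolExecutor(max_workers=8) as executor:
        # Fetch chapter texts concurrently; results come back in the original chapter order
        for name, text in zip(dl.names, executor.map(dl.get_contents, dl.urls)):
            dl.writer(name, '心魔.txt', text)
            print(name + ' 下载完成')
    print('---------全书下载完成---------')
```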
Summary
To sum up, the scraping workflow boils down to three steps:
- Send an HTTP request and fetch the data
- Parse the data and extract the parts you want
- Download and save the data
A web spider really is a little spider on the internet: you set the rules that decide which threads it follows from one web to the next and what information it grabs on each one.
Finally, 《心魔》 is genuinely a great read. Highly recommended.