
Notes on things a code monkey picked up at work

nixiaole, 2024-11-17 14:27:51, category: Knowledge Analysis

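1. Basic libraries for Python 3 (on Windows)

1.1 requests: a blocking (synchronous) HTTP request library

1.2 selenium: a browser automation and testing tool

Install: pip install selenium

1.2.1 selenium needs a browser driver: ChromeDriver (Chrome) or GeckoDriver (Firefox)

Install: download the driver from its official site, then put chromedriver.exe or geckodriver.exe in the Scripts directory of the Python installation.

1.3 PhantomJS: a headless browser

Put the downloaded executable in the Scripts directory of the Python installation, or add its location to the PATH environment variable. PhantomJS is no longer maintained and recent Selenium releases no longer support it, so headless Chrome or Firefox is the usual replacement.

1.4 aiohttp: an asynchronous HTTP request library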
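
To make the difference between a blocking library such as requests and an asynchronous one such as aiohttp concrete, here is a minimal sketch using only the standard library asyncio module. It does not call aiohttp: the hypothetical fake_fetch coroutine uses asyncio.sleep as a stand-in for waiting on a network response, and the names and delays are invented for illustration.

```python
import asyncio


async def fake_fetch(name, delay):
    # asyncio.sleep stands in for waiting on an HTTP response (assumption:
    # a real crawler would await an aiohttp request here instead).
    await asyncio.sleep(delay)
    return f"{name} done"


async def fetch_all():
    # gather() runs the coroutines concurrently and returns their results
    # in the order they were passed in, not in completion order.
    return await asyncio.gather(
        fake_fetch("a", 0.03),
        fake_fetch("b", 0.01),
        fake_fetch("c", 0.02),
    )


results = asyncio.run(fetch_all())
print(results)
```

The three simulated requests wait at the same time, so the total wait is roughly the longest single delay rather than the sum. A blocking library would wait for each request in turn.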

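1.5 Installing the parsing libraries

lxml: pip install lxml (some Python distributions already bundle it)
Beautiful Soup: pip install beautifulsoup4
Note: the package is imported as bs4, so write from bs4 import BeautifulSoup
pyquery: an HTML parser with jQuery-like syntax and CSS selector support. Install: pip install pyquery

1.6 tesserocr: an OCR library used to recognize CAPTCHAs

Install: download the matching Tesseract installer (.exe) from https://digi.bib.uni-mannheim.de/tesseract/ and run it, then run pip install tesserocr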

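2. Python crawler basics

2.1 Using XPath

Syntax
nodename  selects the child elements of the current node that have this name
/   selects direct children of the current node (at the start of an expression it starts from the document root)
//  selects descendants of the current node at any depth
.   selects the current node
..  selects the parent of the current node
@   selects an attribute
Example: //title[@lang='eng'] selects every title element whose lang attribute is eng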
Matching an attribute that holds several values: use contains() when an attribute of a node may hold more than one value, for example:
from lxml import etree
text = "<li class='li active'><a href='http://www.baidu.com'>百度一下</a></li>"
html = etree.HTML(text)
# res = html.xpath('//li[@class="active"]/a/@href')  # [] because the full attribute value is "li active"
res = html.xpath('//li[contains(@class,"li")]/a/@href')  # matches
print(res)  # ['http://www.baidu.com']
Matching several attributes at once: combine the conditions with and
from lxml import etree
text = "<li class='li active' name='123'><a href='http://www.baidu.com'>百度一下</a></li>"
html = etree.HTML(text)
res = html.xpath('//li[contains(@class,"li") and @name="123"]/a/@href')
print(res)  # ['http://www.baidu.com']

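Loading a document
from lxml import etree
text = "<p>some HTML markup</p>"
# Option 1: parse directly from a string; returns an element that supports xpath()
html = etree.HTML(text)
# Option 2: parse from a file
html2 = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html2)  # serialize the tree back to bytes
print(result.decode('utf-8'))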

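2.2 XPath operators

The operators match the ones used in everyday programming: or, and, mod (modulo), | (union of two node sets), +, -, *, div (division), =, !=, <, >, <=, >=
Selecting by position: indexes start at 1, and last() selects the last node.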
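
Positional selection is easy to get wrong because XPath counts from 1, not 0. The sketch below shows it with the standard library xml.etree.ElementTree module, which is a swap-in chosen so the example runs without third-party packages. It supports only a subset of XPath, but these position predicates are standard XPath and can be passed to lxml's xpath() in the same form. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<ul>"
    "<li>first</li>"
    "<li>second</li>"
    "<li>third</li>"
    "</ul>"
)

# Positions start at 1, so [1] is the first <li>.
first = doc.find("./li[1]").text
# last() is the final position; last()-1 is the one before it.
last = doc.find("./li[last()]").text
second_last = doc.find("./li[last()-1]").text

print(first, last, second_last)
```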

3. Using Beautiful Soup

Basic usage
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>hello</p>', 'lxml')
print(soup.p.string)  # hello
# To load a page from disk, open the file first
from bs4 import BeautifulSoup
with open('1.html', 'rt', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
# print(soup.p.string)
print(soup.prettify())
Common methods and attributes
1. prettify(): formats the markup with indentation
soup.prettify()
2. name: the tag name of a node
soup.title.name  # title
3. attrs: the attributes as a dict; look up values by key
soup.p.attrs
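Method selectors
find_all(name, attrs, recursive, text, **kwargs) returns every matching element
name: match by tag name
soup.find_all(name='li')
attrs: match by attributes, passed as a dict
soup.find_all(attrs={'id': 'list-1'})
text: match the text of a node; accepts a string or a compiled regular expression (newer Beautiful Soup versions call this argument string)
find() takes the same arguments but returns only the first matching element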
CSS selectors
Just call select():
1. Selecting nodes
res = soup.select('.movie-item-info p a')
for item in res:
    print(item.string)
# A fuller example:
res = soup.select('.board-item-content')
for item in res:
    title = item.select('.name a')[0].string
    url = item.select('.name a')[0].attrs['href']
    author = item.select('.star')[0].string
    time = item.select('.releasetime')[0].string
    score_int = item.select('.score .integer')[0].string
    score_float = item.select('.score .fraction')[0].string
    score = score_int + score_float
    print(title, author, url, time, score)
2. Getting attributes: read them from attrs, or index the tag with square brackets
3. Getting text: use string, or call get_text()
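
The two ways of reading attributes are interchangeable, but string and get_text() are not. The sketch below illustrates the difference on an invented snippet. It uses the built-in html.parser backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Invented sample markup: the <a> tag contains text plus a nested <b>.
html = '<p class="name"><a href="/films/1">Title <b>One</b></a></p>'
soup = BeautifulSoup(html, "html.parser")
link = soup.select(".name a")[0]

# Two equivalent ways to read an attribute.
href_from_attrs = link.attrs["href"]
href_from_brackets = link["href"]

# .string is None here because <a> has more than one child node;
# get_text() concatenates the text of all descendants instead.
string_value = link.string
text_value = link.get_text()

print(href_from_attrs, href_from_brackets, string_value, text_value)
```

This is why get_text(strip=True) is the safer choice when a tag may contain nested markup or surrounding whitespace.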

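4. pyquery: a good fit if you are comfortable with CSS selectors; note that it returns PyQuery objects, not lists

Basic usage
from pyquery import PyQuery
with open('1.html', 'rt', encoding='utf-8') as f:
    doc = PyQuery(f.read())
print(doc('li'))
# CSS selectors (reuse doc; the file has already been read)
res = doc('.board-item-content')
print(res.find('.name a').text())
Three ways to load a document
1. Pass the markup as a string
2. Pass a URL: PyQuery(url='https://www.baidu.com')
3. Fetch the page yourself and pass the response text
PyQuery(requests.get('https://www.baidu.com').text)  # requires import requests
Finding nodes
1. Descendants and children: find() takes a CSS selector and searches all descendants; to search only direct children, use children()
res = doc('.board-item-content')
print(res.find('.name a').text())
# direct children only
print(res.children())

2. Parent: parent() returns the direct parent of a node
print(res.parent())
3. Ancestors: parents() returns all ancestors of a node
print(res.parents())

4. Siblings: siblings() returns the sibling nodes of a node
Iterating
items() returns a generator over the matched nodes, each wrapped as a PyQuery object
for item in res.items():
    print(item.find('.star').text())  # text of the .star node
Getting data
1. Attributes: use attr()
print(res.items())  # prints a generator object, not the nodes themselves
for item in res.items():
    print(item.find('.star').attr('class'))
2. Text: text() returns the plain text; html() returns the inner HTML of the node
Node manipulation
addClass(), removeClass(), attr(), text(), html() and remove() work much as they do in jQuery, so they are not covered here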

5. Saving the data

Saving to a text file
from bs4 import BeautifulSoup
with open('1.html', 'rt', encoding='utf-8') as f, open('data.txt', 'at', encoding='utf-8') as f2:
    soup = BeautifulSoup(f.read(), 'lxml')
    res = soup.select('.board-item-content')
    for item in res:
        # get_text(strip=True) also works when the tag has nested children, where .string would be None
        title = item.select('.name a')[0].get_text(strip=True)
        url = item.select('.name a')[0].attrs['href']
        author = item.select('.star')[0].get_text(strip=True)
        time = item.select('.releasetime')[0].string
        score_int = item.select('.score .integer')[0].string
        score_float = item.select('.score .fraction')[0].string
        score = score_int + score_float
        print(title, author, url, time, score)
        # write one ' # '-separated line per item
        f2.write(' # '.join([title, author, url, time, score]) + '\n')
If the output is garbled, pass encoding='utf-8' to open()
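open() mode reference
r: read only; the file pointer starts at the beginning of the file
rb: binary read only; the file pointer starts at the beginning of the file
r+: read and write; the file pointer starts at the beginning of the file
rb+: binary read and write; the file pointer starts at the beginning of the file
w: write only; an existing file is truncated and overwritten, a missing file is created
wb: binary write only; an existing file is truncated and overwritten, a missing file is created
wb+: binary read and write; an existing file is truncated and overwritten, a missing file is created
a: append; the file pointer starts at the end, and a missing file is created
a+: read and append; the file pointer starts at the end, and a missing file is created
ab: binary append; the file pointer starts at the end, and a missing file is created
ab+: binary read and append; the file pointer starts at the end, and a missing file is created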
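
The practical difference between w and a is whether existing content survives. This short sketch shows it on a scratch file in a temporary directory; the file name modes_demo.txt is made up for the example.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w", encoding="utf-8") as f:  # 'w' creates the file
    f.write("first\n")

with open(path, "a", encoding="utf-8") as f:  # 'a' adds to the end
    f.write("second\n")

with open(path, "r", encoding="utf-8") as f:
    after_append = f.read()

with open(path, "w", encoding="utf-8") as f:  # 'w' again discards old content
    f.write("third\n")

with open(path, "r", encoding="utf-8") as f:
    after_rewrite = f.read()

print(repr(after_append), repr(after_rewrite))
```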
Saving as JSON
from bs4 import BeautifulSoup
import json  # standard-library JSON module
# Open the output with 'wt': appending a second JSON document to the same file would make it unparseable
with open('1.html', 'rt', encoding='utf-8') as f, open('data.txt', 'wt', encoding='utf-8') as f2:
    soup = BeautifulSoup(f.read(), 'lxml')
    res = soup.select('.board-item-content')
    l3 = []
    for item in res:
        title = item.select('.name a')[0].get_text(strip=True)
        url = item.select('.name a')[0].attrs['href']
        author = item.select('.star')[0].get_text(strip=True)
        time = item.select('.releasetime')[0].string
        score_int = item.select('.score .integer')[0].string
        score_float = item.select('.score .fraction')[0].string
        score = score_int + score_float
        # Build a new dict per item; reusing one dict would fill the list with copies of the last item
        l3.append({'title': title, 'url': url, 'author': author, 'time': time, 'score': score})
    res2 = json.dumps(l3, ensure_ascii=False)  # serialize to a JSON string, keeping non-ASCII text readable
    print(res2, type(res2))
    f2.write(res2)
# Reading it back: open the file, then parse its contents
with open('data.txt', 'r', encoding='utf-8') as f:
    json_data = json.loads(f.read())
print(json_data)
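
Two details in the JSON step are easy to miss: each record needs its own dict, and ensure_ascii=False keeps Chinese text readable in the output. The sketch below shows both with invented sample rows. The titles and scores are placeholders, not scraped data.

```python
import json

# Made-up sample rows standing in for scraped (title, score) pairs.
rows = [("电影甲", "9.5"), ("电影乙", "9.4")]

# Pitfall: one dict reused across iterations is appended by reference,
# so every list entry ends up showing the last row.
shared = {}
wrong = []
for title, score in rows:
    shared["title"] = title
    shared["score"] = score
    wrong.append(shared)

# Correct: create a fresh dict for every row.
right = [{"title": title, "score": score} for title, score in rows]

encoded = json.dumps(right, ensure_ascii=False)  # keeps non-ASCII characters as-is
escaped = json.dumps(right)                      # default escapes them as \uXXXX
decoded = json.loads(encoded)

print(wrong)
print(encoded)
print(escaped)
```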
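Saving as a CSV file
1. Writing rows
import csv
# newline='' stops the csv module from emitting blank lines between rows on Windows
with open('data.csv', 'w', encoding='utf-8', newline='') as f:
    myWriter = csv.writer(f, delimiter=' ')  # delimiter sets the separator between values
    myWriter.writerow(['id', 'name', 'age'])
    myWriter.writerow([1, 'ln', '18'])
    myWriter.writerow([2, 'ln2', '19'])
    myWriter.writerow([3, 'ln3', '20'])
2. writerows() writes several rows at once; its argument is a list of rows (inside the same with block)
    myWriter.writerows([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
3. Writing a dict (inside the same with block)
    fieldname = ['name', 'id', 'age']
    myWriter = csv.DictWriter(f, fieldnames=fieldname)
    myWriter.writerow({'name': '123', 'id': 2, 'age': 30})
4. Writing several dicts: pass writerows() a list of dicts
    fieldname = ['name', 'id', 'age']
    myWriter = csv.DictWriter(f, fieldnames=fieldname)
    myWriter.writerows([{'name': '123', 'id': 2, 'age': 30}])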
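
The fragments above only write. As a complete round trip, the sketch below writes a header plus a list of dicts with csv.DictWriter and reads them back with csv.DictReader. The temporary path and the sample rows are made up for the example.

```python
import csv
import os
import tempfile

# Hypothetical output file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
rows = [
    {"id": 1, "name": "ln", "age": 18},
    {"id": 2, "name": "ln2", "age": 19},
]

# newline='' lets the csv module control line endings itself.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "age"])
    writer.writeheader()    # first line: id,name,age
    writer.writerows(rows)  # a list of dicts, one per row

with open(path, "r", encoding="utf-8", newline="") as f:
    loaded = list(csv.DictReader(f))

print(loaded)
```

Every value comes back as a string, so numeric columns need converting after reading.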
Saving to a database (MySQL as the example)
PS: this assumes MySQL is already installed. In Python, statements are executed through a cursor.
1. Install the pymysql module if it is missing
pip install pymysql
2. The code:
import pymysql
import time
db = pymysql.connect(host='127.0.0.1', user='root', password='password', database='p_test')
cursor = db.cursor()
# Query
cursor.execute('select * from user')
# res = cursor.fetchall()
# Insert, approach 1: the columns are hard-coded, so it is inflexible
user_code = '12345'
user_name = 'test123'
mobile_phone = '13716103539'
state = '1'
head_img = ''
add_time = time.localtime()
entry_time = time.localtime()
sql = "insert into user(user_code,user_name,mobile_phone,state,head_img,add_time,entry_time) values(%s,%s,%s,%s,%s,%s,%s)"
try:
    # pass the values in the same order as the columns
    res = cursor.execute(sql, (user_code, user_name, mobile_phone, state, head_img, add_time, entry_time))
    db.commit()
    print(res)  # number of affected rows
except Exception as e:
    db.rollback()
    print(e)
# print(cursor.fetchall())
PS: an insert only reaches the database after commit()
# Insert, approach 2: build the statement from a dict
table = 'user'
dict_data = {
    'user_code': '1111',
    'user_name': 'test123',
    'mobile_phone': '13716103539',
    'state': '0',
    'head_img': '',
    'add_time': time.localtime(),
    'entry_time': time.localtime(),
}
keys = ', '.join(dict_data.keys())
values = ', '.join(['%s'] * len(dict_data))
sql = "insert into {table}({keys}) values({values})".format(table=table, keys=keys, values=values)
try:
    res = cursor.execute(sql, tuple(dict_data.values()))
    db.commit()
    print(res)
except Exception as e:
    db.rollback()
    print(e)
# Update
sql = "update user set user_name=%s where id=%s"
try:
    cursor.execute(sql, ('test111', 541))
    db.commit()
except Exception as e:
    db.rollback()
    print(e)
db.close()
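
The dict-based insert can be tried without a MySQL server. The sketch below applies the same pattern to the standard library sqlite3 module with an in-memory database, which is a swap-in and not pymysql: sqlite3 uses ? as its placeholder where pymysql uses %s. The table name demo_user and the row values are invented for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL connection.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table demo_user (user_code text, user_name text, state text)")

# Hypothetical table name and row; the table and column names are formatted
# into the SQL text, so they must come from trusted code, never user input.
table = "demo_user"
row = {"user_code": "1111", "user_name": "test123", "state": "0"}

columns = ", ".join(row.keys())
placeholders = ", ".join(["?"] * len(row))  # one placeholder per value
sql = "insert into {}({}) values({})".format(table, columns, placeholders)

try:
    # The values travel separately from the SQL text, so the driver escapes them.
    cur.execute(sql, tuple(row.values()))
    conn.commit()
except sqlite3.Error:
    conn.rollback()
    raise

cur.execute("select user_code, user_name, state from demo_user")
stored = cur.fetchall()
print(sql)
print(stored)
conn.close()
```

Adding or removing a column then only means changing the dict; the statement is rebuilt to match.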
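Saving to a non-relational database
Redis, an in-memory database: the StrictRedis class is recommended (in redis-py 3.0 and later, StrictRedis is an alias of Redis)
MongoDB, a document database
PS: see the documentation of each for detailed steps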
