Extracting content between two keywords with sed / extracting specified content from text with Python

Published: 2025-12-09 11:44:36

Example:

<table>
  <thead>
    <tr> <th>ID</th> <th>Name</th> <th>Phone</th> <th>Description</th> <th>Type</th> <th>Location</th> </tr>
  </thead>
  <tbody>
    <tr> <td>1</td> <td>11</td> <td>111111</td> <td>1111111</td> <td>11111111</td> <td>111111111</td> </tr>
    <tr> <td>2</td> <td>22</td> <td></td> <td></td> <td>22222222</td> <td>222222222</td> </tr>
  </tbody>
</table>

Suppose the code above is the part of a list page we want to scrape. For every list page, we need the data from each tr under the tbody tag, taking only the four td cells other than the third and fourth (those two may or may not contain data). How can this be done?

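If we fetch the data like this:

from lxml import etree  # assumption: html is an lxml tree, as the Element output further down suggests
html = etree.HTML(page_source)  # page_source (hypothetical name): the page's HTML as a string
res = html.xpath('//tbody/tr/td/text()')
print(res)

the result is: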

['1', '11', '111111', '1111111', '11111111', '111111111', '2', '22', '22222222', '222222222', ...]

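This makes it hard to clean out the unwanted data: an empty td produces no text node, so the second row contributes only four values, and a value's position in the list no longer tells you which column it came from.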
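The mismatch is easy to confirm by counting. The snippet below is a minimal sketch, assuming the page is parsed with lxml's `etree.HTML`; the `fragment` string is a hypothetical stand-in for a downloaded list page, modeled on the example table.

```python
from lxml import etree

# Hypothetical two-row fragment modeled on the example table;
# the second row has two empty cells.
fragment = (
    "<table><tbody>"
    "<tr><td>1</td><td>11</td><td>111111</td><td>1111111</td>"
    "<td>11111111</td><td>111111111</td></tr>"
    "<tr><td>2</td><td>22</td><td></td><td></td>"
    "<td>22222222</td><td>222222222</td></tr>"
    "</tbody></table>"
)
html = etree.HTML(fragment)

texts = html.xpath('//tbody/tr/td/text()')  # text nodes only; empty cells contribute nothing
nodes = html.xpath('//tbody/tr/td')         # one element per cell, empty or not

print(len(texts))
print(len(nodes))
```

The two counts differ by the number of empty cells, which is why working with the td nodes themselves keeps the column positions intact.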

The data can be obtained in three steps.

Step 1: get all the td nodes

res = html.xpath('//tbody/tr/td')
print(res)

The result is:

[<Element td at 0x93cd9c8>, <Element td at 0x93cdbc8>, <Element td at 0x93cdd48>, <Element td at 0x93cd708>, <Element td at 0x93cddc8>, <Element td at 0x93d74c8>, <Element td at 0x93d7d08>, <Element td at 0x93d7048>, <Element td at 0x93d7288>, <Element td at 0x93d7548>, <Element td at 0x93d7888>, <Element td at 0x93d7388>]

Step 2: split the big list into smaller lists, each containing 6 td nodes

res2 = [res[s:s + 6] for s in range(0, len(res), 6)]  # split the big list into small lists of 6 td nodes each
print(res2)

The result is:

[[<Element td at 0x93cdb48>, <Element td at 0x93cd788>, <Element td at 0x93cd848>, <Element td at 0x93cdd08>, <Element td at 0x93cdf88>, <Element td at 0x93d7e48>], [<Element td at 0x93d7e08>, <Element td at 0x93d7388>, <Element td at 0x93d7888>, <Element td at 0x93d7548>, <Element td at 0x93d7808>, <Element td at 0x93d7288>]]
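The slicing in step 2 relies on every row having exactly six td cells. As a small sketch, the helper below (the name `chunk` is hypothetical, not part of the original code) does the same split but raises an error when the total is not a multiple of the group size, so an irregular row is noticed instead of silently shifting the columns. Plain strings stand in for the td nodes here.

```python
def chunk(items, size):
    """Split `items` into consecutive lists of `size` elements each."""
    if len(items) % size != 0:
        raise ValueError(
            f"{len(items)} items cannot be split evenly into groups of {size}"
        )
    return [items[s:s + size] for s in range(0, len(items), size)]


# Strings standing in for the 12 td nodes of the example table.
cells = ['1', '11', '111111', '1111111', '11111111', '111111111',
         '2', '22', '', '', '22222222', '222222222']
groups = chunk(cells, 6)
print(groups)
```

Note that the check only catches totals that are not a multiple of six; two irregular rows whose lengths happen to add up would still pass.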

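Step 3: loop over each small list, read the text of every td node, and drop the unwanted data

for x in res2:
    res3 = []
    for y in x:
        res4 = y.xpath('text()')  # list of text nodes; empty for an empty td
        res3.append(''.join(res4).strip())
    res3 = res3[:2] + res3[4:]  # keep only the 4 td values other than the 3rd and 4th
    print(res3)

The result is: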

['1', '11', '11111111', '111111111']
['2', '22', '22222222', '222222222']

This gives the desired result.
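One possible variant, offered as a sketch rather than a definitive answer, is to select the tr rows first and then query the td cells relative to each row. This removes the need to regroup a flat list in blocks of six. It assumes lxml's `etree.HTML`; the names `SAMPLE`, `extract_rows` and `skip` are hypothetical and used only for illustration.

```python
from lxml import etree

# Hypothetical stand-in for one downloaded list page.
SAMPLE = """
<table>
  <thead>
    <tr><th>ID</th><th>Name</th><th>Phone</th><th>Description</th><th>Type</th><th>Location</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>11</td><td>111111</td><td>1111111</td><td>11111111</td><td>111111111</td></tr>
    <tr><td>2</td><td>22</td><td></td><td></td><td>22222222</td><td>222222222</td></tr>
  </tbody>
</table>
"""


def extract_rows(page_source, skip=(2, 3)):
    """Return one list per tbody row, omitting cells at the 0-based indexes in `skip`."""
    html = etree.HTML(page_source)
    rows = []
    for tr in html.xpath('//tbody/tr'):
        # './td' is evaluated relative to the current row, so cells never
        # leak across rows even when some of them are empty.
        cells = [(td.text or '').strip() for td in tr.xpath('./td')]
        rows.append([c for i, c in enumerate(cells) if i not in skip])
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Here `td.text` reads only the text directly inside the cell; if the cells contained nested tags, the text would have to be collected differently, for example with `''.join(td.itertext())`.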

If you know a better way, please leave a comment and let me know. Thanks!
