
Learn to Scrape Website Data with Python in 5 Minutes

Published October 7, 2025

Hi everyone, I'm He San (何三), a veteran programmer born in the 1980s and an independent developer.

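A lot of data on the web is tedious to copy by hand. A short Python script can fetch it for you, and whenever you need fresh data you simply run the script again. In the next 5 minutes I'll walk you through a very practical skill: writing a web crawler in Python. No prior experience is required, and you can try it yourself as soon as you finish reading.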

I. What Is a Crawler? Not as Mysterious as It Sounds

Don't let the name "web crawler" intimidate you. At its core it is very simple: it imitates what a person does when browsing a web page.

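Picture this:
1. You open a browser and enter a URL.
2. The web server sends the page's code (HTML) back to your browser.
3. The browser renders that code into the page you see.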

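A web crawler skips step 3: it takes the page's HTML directly and automatically extracts the information we want from it (article titles, links, summaries, and so on).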
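
To make the idea concrete before installing anything, here is a minimal sketch that uses only Python's standard library. It assumes a small hard-coded HTML string (SAMPLE_HTML, with made-up article paths) in place of a real server response, and a hypothetical helper class TitleLinkParser that pulls out the links found inside <h2> headings. It is an illustration of "extract data from the code", not the approach used in the rest of this article.

```python
from html.parser import HTMLParser

# Hypothetical sample page standing in for what a server would send back.
SAMPLE_HTML = """
<html><body>
  <h2><a href="/article/1/">First post</a></h2>
  <h2><a href="/article/2/">Second post</a></h2>
</body></html>
"""


class TitleLinkParser(HTMLParser):
    """Collects (title, href) pairs for links that sit inside <h2> tags."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.current_href = None   # href of the link being read, if any
        self.current_text = []     # text pieces collected for that link
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            # Links without an href are ignored in this sketch.
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.results.append((title, self.current_href))
            self.current_href = None
        elif tag == "h2":
            self.in_h2 = False


parser = TitleLinkParser()
parser.feed(SAMPLE_HTML)
for title, href in parser.results:
    print(title, "->", href)
```

Writing a parser by hand like this gets tedious quickly, which is why a dedicated parsing library is the more comfortable choice.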

Today we'll use Python to build a crawler that automatically collects articles from a tech blog.

II. Preparation: Setting Up the Environment

Good work starts with good tools. We only need to install two powerful Python libraries.

Open your command line (CMD or PowerShell on Windows, Terminal on macOS), type the following command, and press Enter:

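pip install requests beautifulsoup4

requests: sends the HTTP request and retrieves the page's HTML.
beautifulsoup4 (imported in code as bs4): parses the HTML so we can easily locate the information we want.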

Once the installation succeeds, we can start writing code.

III. Four Steps of Code That Cover the Essentials

We'll use my tech blog, https://www.h3blog.com, as the target and scrape the information for every article listed on its home page.

Step 1: Import the libraries

Open your code editor (such as VS Code or PyCharm), create a new Python file, and start by importing the two libraries we just installed.

import requests
from bs4 import BeautifulSoup  # note: BeautifulSoup is imported from the bs4 package

Step 2: Fetch the page

Let the requests library fetch the page's full source code for us.

url = 'https://www.h3blog.com'            # target URL
response = requests.get(url, timeout=10)  # send a GET request; give up after 10 seconds

The page's full HTML is now available in response.text.

Step 3: Parse the page and find the "article boxes"

At this point the source is one big blob of messy text. BeautifulSoup turns it into a structured tree that is easy to search.

soup = BeautifulSoup(response.text, 'html.parser')
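
To show what "structured" means in practice, here is a small sketch that parses a hard-coded HTML string instead of response.text. The HTML snippet and the variable names are made up for illustration; the calls used (soup.title, find, find_all, get_text, get) are standard BeautifulSoup methods.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used in place of response.text.
html = """
<html>
  <head><title>Demo blog</title></head>
  <body>
    <article>
      <h2><a href="/article/1/">Hello, crawler</a></h2>
      <p>A short summary.</p>
    </article>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Once parsed, elements can be reached by tag name instead of string searching.
page_title = soup.title.get_text()
first_link = soup.find("a")                    # first <a> tag in the document
link_text = first_link.get_text(strip=True)
link_href = first_link.get("href")             # None if the attribute is missing
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

print(page_title)
print(link_text, "->", link_href)
print(paragraphs)
```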

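Now for the key part: using the browser's "Inspect" tool, we can see that each article is usually wrapped in a specific HTML tag. Common article containers include <article>, <div class="post">, and <div class="article">.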

Let's first try to find all of the "article boxes".

# Try a few common article-container selectors
articles = soup.find_all('article')  # first, try <article> tags
if not articles:
    articles = soup.find_all('div', class_='post')  # otherwise, divs that have the class "post"
if not articles:
    articles = soup.find_all('div', class_='article')  # then divs that have the class "article"
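
One detail worth knowing about the class_ argument: it matches a whole class name, not a substring. The sketch below uses made-up HTML to compare find_all(class_='post') with a CSS attribute selector that does match substrings. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three differently named containers.
html = """
<div class="post">A</div>
<div class="post featured">B</div>
<div class="blog-post">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_='post' matches elements that have "post" as one of their classes.
whole_class = [div.get_text() for div in soup.find_all("div", class_="post")]

# The CSS selector [class*="post"] matches any class attribute containing "post".
substring = [div.get_text() for div in soup.select('div[class*="post"]')]

print(whole_class)  # ['A', 'B']
print(substring)    # ['A', 'B', 'C']
```

If the site you are scraping uses a class such as blog-post, either name it explicitly or use a substring selector like the one above.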

Step 4: Take what we want out of each "box"

Now we loop over every "article box" and extract the title and link from each one.

for article in articles:
    # Find the title, usually inside an h1, h2, or h3 tag (fall back to the first link)
    title_element = (article.find('h1') or
                     article.find('h2') or
                     article.find('h3') or
                     article.find('a'))

    if title_element:
        # Get the title text
        title = title_element.get_text().strip()

        # Get the link: the title itself if it is an <a> tag, otherwise a link inside it
        if title_element.name == 'a':
            link_element = title_element
        else:
            link_element = title_element.find('a')
        link = link_element.get('href') if link_element else None

        # Turn a site-relative link into an absolute one
        if link and link.startswith('/'):
            link = 'https://www.h3blog.com' + link

        # Print the result
        print(f"Title: {title}")
        if link:
            print(f"Link: {link}")
        print("---" * 20)
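
Prefixing the domain by hand works for links that start with a single slash, but hrefs come in other shapes too. A more general option is urljoin from the standard library, sketched below. The helper name absolute_url and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.h3blog.com"


def absolute_url(href, base=BASE_URL):
    """Resolve an href found on a page against the page's URL."""
    return urljoin(base, href)


# Hypothetical hrefs covering the common cases.
samples = [
    "/article/654/",            # site-relative
    "article/654/",             # relative without a leading slash
    "https://example.com/x",    # already absolute, left unchanged
    "//cdn.example.com/a.png",  # protocol-relative, inherits https
]
resolved = [absolute_url(href) for href in samples]
for href, full in zip(samples, resolved):
    print(href, "->", full)
```

Note the last case: a check based on startswith('/') would glue the domain in front of a protocol-relative link and produce a broken URL, while urljoin resolves it correctly.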

IV. Complete Code and How to Run It

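Putting the pieces above together, and adding a User-Agent header, a timeout, and error handling, gives the script below. Save it as blog_scraper.py: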

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_h3blog():
    # Target URL
    url = 'https://www.h3blog.com'

    try:
        # Send the request with a browser-like User-Agent and a timeout
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raises an exception for 4xx/5xx responses

        # Parse the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Try several selectors to find the article containers
        articles = soup.find_all('article')
        if not articles:
            articles = soup.find_all('div', class_='post')
        if not articles:
            articles = soup.find_all('div', class_='article')
        if not articles:
            articles = soup.find_all('div', class_='blog-post')

        print(f"Found {len(articles)} articles at {url}")
        print("=" * 50)

        # Extract the details of each article
        for i, article in enumerate(articles, 1):
            print(f"Article {i}:")

            # Look for the title
            title_element = (article.find('h1') or
                             article.find('h2') or
                             article.find('h3') or
                             article.find('a'))

            if title_element:
                title = title_element.get_text().strip()

                # Look for the link: the title itself, a link inside it,
                # or the first link anywhere in the article
                if title_element.name == 'a':
                    link_element = title_element
                else:
                    link_element = title_element.find('a') or article.find('a')
                link = link_element.get('href') if link_element else None

                # Resolve relative links against the page URL
                if link:
                    link = urljoin(url, link)

                print(f"Title: {title}")
                if link:
                    print(f"Link: {link}")

                # Try to grab a summary, truncated to 100 characters
                summary = (article.find('p') or
                           article.find('div', class_='summary') or
                           article.find('div', class_='excerpt'))
                if summary:
                    summary_text = summary.get_text().strip()
                    if len(summary_text) > 100:
                        summary_text = summary_text[:100] + "..."
                    print(f"Summary: {summary_text}")

            else:
                print("No title found")

            print("-" * 40)

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

if __name__ == "__main__":
    scrape_h3blog()

Then run it from the terminal:

python blog_scraper.py

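The console will print the information for every article the script finds on the site. If it reports 0 articles, the page's markup does not match any of the selectors, and you will need to inspect the page and adjust them.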

Congratulations! You have written a working web crawler that can collect blog data for you automatically.

V. What's Next?

You can make this simple program more powerful:

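Save to a file: replace the print calls with code that writes to a file, so the data is stored locally as a txt or CSV file.
Crawl multiple pages: if the blog is paginated, write a loop so the program walks through every page.
Fetch full content: after collecting the article list, follow each article's link and scrape the complete text.
Run on a schedule: set up a scheduled task to fetch the latest articles automatically every day.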
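
As a starting point for the first idea, here is a minimal sketch of saving results to CSV with the standard library. The function names save_articles and load_articles, the column names, and the sample rows (with made-up article URLs) are all assumptions for illustration. In the real script you would collect one dictionary per article inside the loop instead of printing.

```python
import csv
import os
import tempfile


def save_articles(rows, path):
    """Write a list of {'title': ..., 'link': ...} dicts to a CSV file."""
    # utf-8-sig adds a BOM so spreadsheet programs detect the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "link"])
        writer.writeheader()
        writer.writerows(rows)


def load_articles(path):
    """Read the CSV back into a list of dicts."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


# Hypothetical rows standing in for scraped data.
sample_rows = [
    {"title": "First post", "link": "https://www.h3blog.com/article/1/"},
    {"title": "Second post", "link": "https://www.h3blog.com/article/2/"},
]

# Write to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "articles.csv")
    save_articles(sample_rows, csv_path)
    loaded = load_articles(csv_path)

print(loaded)
```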

Important Reminder: Be a "Gentleman" Crawler

With great power comes great responsibility. When you roam the web, be sure to follow the rules:

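Check robots.txt: visit https://www.h3blog.com/robots.txt to see which content the site allows and forbids crawlers to fetch.
Be gentle: add a delay in your code (for example time.sleep(1)) and don't fire hundreds or thousands of requests per second, so you don't put pressure on the site's server.
Respect the law and copyright: honor the site's rights to its data, and don't use scraped data for illegal or commercial purposes.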
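
The first two rules can be sketched in code with the standard library's urllib.robotparser and time.sleep. Everything specific here is an assumption for illustration: the robots.txt content is made up (it is not the real file from h3blog.com), and build_robot_rules, crawl_politely, and fake_fetch are hypothetical helpers. The fetch function is passed in as a parameter so the demo runs without touching the network.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, NOT the real file from any site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def crawl_politely(urls, rules, fetch, delay_seconds=1.0, user_agent="*"):
    """Fetch each allowed URL, pausing between requests."""
    pages = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            print(f"Skipping (disallowed by robots.txt): {url}")
            continue
        pages[url] = fetch(url)
        time.sleep(delay_seconds)  # give the server a break
    return pages


def fake_fetch(url):
    """Stand-in for a real download such as requests.get(url, timeout=10).text."""
    return f"<html>content of {url}</html>"


rules = build_robot_rules(SAMPLE_ROBOTS_TXT)
urls = [
    "https://www.h3blog.com/article/654/",
    "https://www.h3blog.com/admin/settings",
]
# delay_seconds=0 only so this demo finishes instantly; keep a real delay when crawling.
pages = crawl_politely(urls, rules, fake_fetch, delay_seconds=0)
print(list(pages))
```

In a real crawler you would download the site's actual robots.txt first (RobotFileParser also offers set_url and read for that) and pass a fetch function that performs the HTTP request.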

Time to Get Hands-On!

The complete code is above. Open your editor, copy it in, and run it to see for yourself how code can fetch data automatically.

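If you run into problems along the way, or want to learn more advanced crawling techniques (such as handling logins or scraping dynamically loaded content), feel free to leave a comment.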

Copyright notice: unless otherwise stated, all articles are original works of 何三笔记. Please credit the source when reposting.

Permalink: https://www.h3blog.com/article/654/
