A handy tool for turning websites into LLM training data: FireCrawl, an automated crawler that picked up 4K stars in two weeks!

jc_ipec, published 2024-05-23 from Hubei

https://mp.weixin.qq.com/s/KS93gpz73X20AD8L-3zz2Q

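🔥 Turn entire websites into Markdown or structured data suitable for LLM training. Scrape, crawl, search, and extract with a single API.

Hello everyone, I'm Aitrainee. Today I'd like to introduce Firecrawl, a practical web crawling tool.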

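What is Firecrawl?

Firecrawl works like a smart robot: starting from a page you give it, it automatically finds and visits the other pages on the same website. It extracts the main content of each page, strips out ads and other unwanted material, and organizes the result so it is easy to use. It does not need a sitemap file from the website to find these pages.

Starting from the page you specify, Firecrawl automatically visits every sub-page of the site that it can reach. It is like clicking a link and then clicking every link on each page you land on, until every page has been visited. As long as a page is not blocked by the site's settings (for example, disallowed in robots.txt), Firecrawl can crawl it.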
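
To make the "keep clicking every link" idea concrete, here is a minimal sketch of a breadth-first crawl that stays on one site and skips anything robots.txt disallows. It is only an illustration of the general technique, not Firecrawl's actual implementation. The `crawl` function, the `get_links` callback, and the in-memory `SITE` graph on the placeholder domain example.com are all hypothetical stand-ins so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def crawl(start_url, get_links, robots_txt=""):
    """Breadth-first walk over same-site links, honoring robots.txt rules.

    get_links(url) is a hypothetical callback returning the hrefs found on a page.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # disallowed by robots.txt: do not fetch, do not follow
        visited.append(url)
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for real HTTP fetches.
SITE = {
    "https://example.com/": ["/a", "/private/x", "https://other.org/"],
    "https://example.com/a": ["/b", "/"],
    "https://example.com/b": [],
    "https://example.com/private/x": ["/secret"],
}
ROBOTS = "User-agent: *\nDisallow: /private/"

pages = crawl("https://example.com/", lambda u: SITE.get(u, []), ROBOTS)
print(pages)
```

Note that no sitemap is involved: pages are discovered purely by following links, which is the behavior described above.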

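In addition, Firecrawl extracts the useful information from each page, drops unimportant content (such as ads and navigation bars), and organizes the data into an easy-to-use format such as Markdown.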
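
As a rough illustration of what "keep the main content, drop the navigation" can mean, the sketch below uses Python's standard `html.parser` to skip a few boilerplate tags and turn headings and paragraphs into Markdown. This is a toy under my own assumptions (the tag lists and the `html_to_markdown` helper are hypothetical); Firecrawl's real extraction logic is not described in this article and is certainly more sophisticated.

```python
from html.parser import HTMLParser

SKIP = {"nav", "header", "footer", "aside", "script", "style"}  # treated as boilerplate
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCKS = set(HEADINGS) | {"p"}


class MainContentToMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate element
        self.prefix = None    # Markdown prefix of the block being read, else None
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in BLOCKS and self.skip_depth == 0:
            self.prefix = HEADINGS.get(tag, "")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCKS and self.prefix is not None:
            text = " ".join("".join(self.buffer).split())  # collapse whitespace
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html):
    parser = MainContentToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


sample_html = (
    "<html><body><nav><p>Home | About</p></nav>"
    "<h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script><footer><p>Ad</p></footer></body></html>"
)
print(html_to_markdown(sample_html))
```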

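What is a sitemap?

A sitemap is a file provided by a website that lists the pages on that site. It helps search engines and crawlers find and visit those pages faster. A sitemap is usually an XML file containing links to the site's pages.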
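
For readers who have never opened one, here is a small sketch of what a sitemap looks like and how a crawler could read it with Python's standard library. The sample XML and the `sitemap_urls` helper are made up for illustration (example.com is a placeholder); the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A tiny, made-up sitemap for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc><lastmod>2024-05-01</lastmod></url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return the page URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


print(sitemap_urls(SAMPLE_SITEMAP))
```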

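To summarize:

1. Firecrawl starts from the page you give it, follows the links on the site, and crawls every page it can reach.
2. It strips out the clutter, extracts the useful data, and organizes it.
3. It can find and crawl the pages even without a sitemap.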

Demo video

YouTube creator: 开发者文稿 / subtitles translated by Aitrainee. The link is here:

https://www./watch?v=fDSM7chMo5E

Below are the official documentation overview, related resources, and deployment instructions, so you can put the tool to use right away.

🔥 Firecrawl

We offer an easy-to-use hosted version of the API. You can find the playground and documentation here. You can also self-host the backend.

· API
· Python SDK
· Node SDK
· Langchain integration 🦜🔗
· Llama Index integration 🦙
· Langchain JS integration 🦜🔗
· Want another SDK or integration? Let us know by opening an issue.

To run it locally, refer to the guide.

API key

To use the API, you need to sign up on Firecrawl and get an API key.

Crawl

Crawls a URL and all of its accessible sub-pages. This submits a crawl job and returns a job ID that you can use to check the crawl's status.

curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://"
    }'

Returns a job ID:

{ "jobId": "1234-5678-9101" }

Check a crawl job

Checks the status of a crawl job and retrieves its results.

curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'

{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www./"
      }
    }
  ]
}
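
Because a crawl runs as a background job, client code usually has to poll the status endpoint until the job finishes. Below is a minimal sketch of such a loop. `wait_for_crawl` and its `fetch_status` callback are hypothetical helpers of mine, not part of any Firecrawl SDK. The only status value taken from the response above is `completed`; the `active` value in the fake responses is a placeholder assumption for "still running".

```python
import time


def wait_for_crawl(job_id, fetch_status, interval=2.0, max_polls=30, sleep=time.sleep):
    """Poll fetch_status(job_id) until it reports 'completed', then return its 'data'.

    fetch_status is a hypothetical callable that performs the GET request to the
    status endpoint and returns the parsed JSON as a dict.
    """
    for attempt in range(max_polls):
        status = fetch_status(job_id)
        if status.get("status") == "completed":
            return status.get("data", [])
        if attempt < max_polls - 1:
            sleep(interval)  # wait before asking again
    raise TimeoutError(f"crawl job {job_id} not completed after {max_polls} polls")


# Stand-in for the HTTP call: reports progress twice, then completion.
_responses = iter([
    {"status": "active", "current": 5, "total": 22},
    {"status": "active", "current": 15, "total": 22},
    {"status": "completed", "current": 22, "total": 22,
     "data": [{"markdown": "# Markdown Content"}]},
])
crawled = wait_for_crawl("1234-5678-9101", lambda job_id: next(_responses),
                         sleep=lambda seconds: None)
print(crawled[0]["markdown"])
```

In real use, `fetch_status` would wrap the curl request shown above, and `sleep` would be left at its default so the loop actually pauses between polls.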

Scrape

Scrapes a single URL and returns its content.

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://"
    }'

Response:

{
  "success": true,
  "data": {
    "content": "Raw Content",
    "markdown": "# Markdown Content",
    "provider": "web-scraper",
    "metadata": {
      "title": "Mendable | AI for CX and Sales",
      "description": "AI for CX and Sales",
      "language": null,
      "sourceURL": "https://www./"
    }
  }
}
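
If you would rather call the endpoint from Python without the SDK, the same request can be assembled with the standard library. The sketch below only builds the request object and does not send it; `build_scrape_request` is a hypothetical helper name, and https://example.com/ is a placeholder target. The endpoint URL and headers mirror the curl command above.

```python
import json
import urllib.request


def build_scrape_request(url, api_key):
    """Build (but do not send) a POST request equivalent to the curl command above."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        "https://api.firecrawl.dev/v0/scrape",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_scrape_request("https://example.com/", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

To actually send it, pass `request` to `urllib.request.urlopen` and decode the JSON body of the response; the Markdown should then be under `data` and `markdown`, as in the sample response.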

Search (beta)

Searches the web, gets the most relevant results, scrapes each page, and returns the data as Markdown.

# Set fetchPageContent to false to quickly get just the search engine results page
curl -X POST https://api.firecrawl.dev/v0/search \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "query": "firecrawl",
      "pageOptions": {
        "fetchPageContent": true
      }
    }'

{
  "success": true,
  "data": [
    {
      "url": "https://",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www./"
      }
    }
  ]
}

Intelligent extraction (beta)

Extracts structured data from scraped pages.

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://www./",
      "extractorOptions": {
        "mode": "llm-extraction",
        "extractionPrompt": "Based on the information on the page, extract the information from the schema.",
        "extractionSchema": {
          "type": "object",
          "properties": {
            "company_mission": {
              "type": "string"
            },
            "supports_sso": {
              "type": "boolean"
            },
            "is_open_source": {
              "type": "boolean"
            },
            "is_in_yc": {
              "type": "boolean"
            }
          },
          "required": [
            "company_mission",
            "supports_sso",
            "is_open_source",
            "is_in_yc"
          ]
        }
      }
    }'

{
    "success": true,
    "data": {
      "content": "Raw Content",
      "metadata": {
        "title": "Mendable",
        "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "robots": "follow, index",
        "ogTitle": "Mendable",
        "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "ogUrl": "https:///",
        "ogImage": "https:///mendable_new_og1.png",
        "ogLocaleAlternate": [],
        "ogSiteName": "Mendable",
        "sourceURL": "https:///"
      },
      "llm_extraction": {
        "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
        "supports_sso": true,
        "is_open_source": false,
        "is_in_yc": true
      }
    }
}
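
Since the extracted fields come from an LLM, it can be worth double-checking the result against the schema you sent before using it. Here is a minimal sketch of such a check. `check_extraction` is a hypothetical helper, not part of Firecrawl, and it only covers required fields and a few primitive JSON Schema types; a full validator library would be the better choice in production.

```python
# Primitive JSON Schema types covered by this sketch.
JSON_TYPES = {"string": str, "boolean": bool, "integer": int, "number": (int, float)}


def check_extraction(result, schema):
    """Return a list of problems found in an llm_extraction result (empty if none)."""
    problems = []
    for name in schema.get("required", []):
        if name not in result:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in result:
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # type not covered by this sketch
        value = result[name]
        # bool is a subclass of int in Python, so reject it for non-boolean fields
        bool_mismatch = isinstance(value, bool) and spec["type"] != "boolean"
        if bool_mismatch or not isinstance(value, expected):
            problems.append(f"{name}: expected {spec['type']}, got {type(value).__name__}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
        "is_open_source": {"type": "boolean"},
        "is_in_yc": {"type": "boolean"},
    },
    "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"],
}
good = {"company_mission": "Train a secure AI", "supports_sso": True,
        "is_open_source": False, "is_in_yc": True}
bad = {"company_mission": "Train a secure AI", "supports_sso": "yes",
       "is_open_source": False}

print(check_extraction(good, schema))
print(check_extraction(bad, schema))
```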

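Using the Python SDK

Install the Python SDK

pip install firecrawl-py

Crawl a website

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='YOUR_API_KEY')

# The first argument is the start URL; 'excludes' skips paths matching the pattern
crawl_result = app.crawl_url('', {'crawlerOptions': {'excludes': ['blog/*']}})

# Print the Markdown content of each crawled page
for result in crawl_result:
    print(result['markdown'])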
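
Since the whole point is training data, a natural next step is to save the crawled Markdown in a format that data pipelines can read. The sketch below writes one JSON object per line (JSONL). `write_jsonl` and the `text` and `source` field names are my own hypothetical choices, and I am assuming each crawl result is a dict with `markdown` and `metadata.sourceURL` keys, as in the REST response shown earlier.

```python
import json
import os
import tempfile


def write_jsonl(results, path):
    """Write non-empty crawl results to a JSONL file; return the number of records."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for page in results:
            text = (page.get("markdown") or "").strip()
            if not text:
                continue  # skip pages with no extracted content
            record = {
                "text": text,
                "source": (page.get("metadata") or {}).get("sourceURL"),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Made-up results standing in for the output of app.crawl_url(...).
sample_results = [
    {"markdown": "# Markdown Content", "metadata": {"sourceURL": "https://example.com/"}},
    {"markdown": "", "metadata": {"sourceURL": "https://example.com/empty"}},
]
out_path = os.path.join(tempfile.mkdtemp(), "pages.jsonl")
written = write_jsonl(sample_results, out_path)
print(written, "record(s) written to", out_path)
```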

Scrape a URL

To scrape a single URL, use the scrape_url method. It takes the URL as a parameter and returns the scraped data as a dictionary.

url = 'https://'
scraped_data = app.scrape_url(url)

Extract structured data from a URL

With LLM extraction, you can easily extract structured data from any URL. Pydantic schemas are supported to make this easier. Here is how to use it:

from typing import List

from pydantic import BaseModel, Field

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_items=5, description='Top 5 stories')

# `app` is the FirecrawlApp instance created earlier
data = app.scrape_url('https://news.', {
    'extractorOptions': {
        'extractionSchema': TopArticlesSchema.model_json_schema(),
        'mode': 'llm-extraction'
    },
    'pageOptions': {
        'onlyMainContent': True
    }
})
print(data['llm_extraction'])

Search for a query

Performs a web search, takes the top results, extracts the data from each page, and returns their content as Markdown.

query = 'What is Mendable?'
search_result = app.search(query)

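Using the Node SDK

Installation

To install the Firecrawl Node SDK, you can use npm:

npm install @mendable/firecrawl-js

Usage

1. Get an API key from firecrawl.dev.
2. Set the API key as the environment variable FIRECRAWL_API_KEY, or pass it as a parameter to the FirecrawlApp class.

Scrape a URL

To scrape a single URL with error handling, use the scrapeUrl method. It takes the URL as a parameter and returns the scraped data as an object.

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'YOUR_API_KEY' });

try {
  const url = 'https://';
  const scrapedData = await app.scrapeUrl(url);
  console.log(scrapedData);
} catch (error) {
  console.error(
    'Error occurred while scraping:',
    error.message
  );
}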

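Crawl a website

To crawl a website, use the crawlUrl method. It takes the start URL and optional parameters as arguments. The params argument lets you specify additional options for the crawl job, such as the maximum number of pages to crawl, the allowed domains, and the output format.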

const crawlUrl = 'https://';
const params = {
  crawlerOptions: {
    excludes: ['blog/'],
    includes: [], // leave empty to include all pages
    limit: 1000,
  },
  pageOptions: {
    onlyMainContent: true
  }
};
const waitUntilDone = true;
const timeout = 5;
const crawlResult = await app.crawlUrl(
  crawlUrl,
  params,
  waitUntilDone,
  timeout
);

Check crawl status

To check the status of a crawl job, use the checkCrawlStatus method. It takes the job ID as a parameter and returns the current status of the crawl job.

const status = await app.checkCrawlStatus(jobId);
console.log(status);

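Extract structured data from a URL

With LLM extraction, you can easily extract structured data from any URL. zod schemas are supported to make this easier. Here is how to use it: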

import FirecrawlApp from '@mendable/firecrawl-js';
import { z } from 'zod';

const app = new FirecrawlApp({
  apiKey: 'fc-YOUR_API_KEY',
});

// Define the schema of the content to extract
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
});

const scrapeResult = await app.scrapeUrl('https://news.', {
  extractorOptions: { extractionSchema: schema },
});

console.log(scrapeResult.data['llm_extraction']);

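Search for a query

With the search method, you can run a query against a search engine and get the top results along with the page content of each result. The method takes the query as a parameter and returns the search results.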

const query = 'what is mendable?';
const searchResults = await app.search(query, {
  pageOptions: {
    fetchPageContent: true // fetch the page content of each search result
  }
});