ScrapeGraph

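This notebook provides a quick overview for getting started with the ScrapeGraph tools. For detailed documentation of all ScrapeGraph features and configurations, head to the API reference.

For more information about ScrapeGraph AI, see scrapegraphai.com.

Overview

Integration details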

Class	Package	Serializable	JS support
SmartScraperTool	langchain-scrapegraph	✅	❌
MarkdownifyTool	langchain-scrapegraph	✅	❌
LocalScraperTool	langchain-scrapegraph	✅	❌
GetCreditsTool	langchain-scrapegraph	✅	❌

Tool features

Tool	Purpose	Input	Output
SmartScraperTool	Extract structured data from websites	URL + prompt	JSON
MarkdownifyTool	Convert webpages to markdown	URL	Markdown text
LocalScraperTool	Extract data from HTML content	HTML + prompt	JSON
GetCreditsTool	Check API credits	None	Credit info
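
The inputs in this table correspond to the argument names used in the invocation examples later in this notebook (`user_prompt`, `website_url`, `website_html`). As a rough sketch, a small lookup can check a payload before a call is made. The `REQUIRED_ARGS` table and the `missing_args` helper are hypothetical and not part of `langchain-scrapegraph`; the argument names are taken from this notebook's examples rather than from the library's schema.

```python
# Hypothetical helper: argument names come from the invocation examples
# in this notebook, not from the library's own schema.
REQUIRED_ARGS = {
    "SmartScraperTool": ("user_prompt", "website_url"),
    "MarkdownifyTool": ("website_url",),
    "LocalScraperTool": ("user_prompt", "website_html"),
    "GetCreditsTool": (),
}


def missing_args(tool_class: str, payload: dict) -> list[str]:
    """Return the required argument names that are absent or empty in payload."""
    return [name for name in REQUIRED_ARGS[tool_class] if not payload.get(name)]


print(missing_args("SmartScraperTool", {"website_url": "https://scrapegraphai.com"}))
print(missing_args("GetCreditsTool", {}))
```

The first call reports `user_prompt` as missing; the second returns an empty list because the credits check takes no input.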

Setup

The integration requires the following package:

%pip install --quiet -U langchain-scrapegraph

Note: you may need to restart the kernel to use updated packages.
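
After restarting the kernel, it can be useful to confirm that the package is visible to the interpreter. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints None if the install step above has not taken effect in this kernel.
print("langchain-scrapegraph:", installed_version("langchain-scrapegraph"))
```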

Credentials

You need a ScrapeGraph AI API key to use these tools. Get one at scrapegraphai.com.

import getpass
import os

if not os.environ.get("SGAI_API_KEY"):
    os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key:\n")

It is also helpful (but not required) to set up LangSmith for best-in-class observability:

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
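
Because tracing is optional, a non-interactive variant can enable it only when a key is already present instead of prompting for one. A minimal sketch, in which `enable_langsmith_if_configured` is a hypothetical helper and only the two environment variable names come from the cell above:

```python
import os
from typing import MutableMapping


def enable_langsmith_if_configured(env: MutableMapping[str, str]) -> bool:
    """Set LANGSMITH_TRACING only when LANGSMITH_API_KEY is already present."""
    if env.get("LANGSMITH_API_KEY"):
        env["LANGSMITH_TRACING"] = "true"
        return True
    return False


# Pass os.environ in a real session; a plain dict works for a dry run.
print(enable_langsmith_if_configured(os.environ))
```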

Instantiation

Here we show how to instantiate the ScrapeGraph tools:

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    LocalScraperTool,
    MarkdownifyTool,
    SmartScraperTool,
)

smartscraper = SmartScraperTool()
markdownify = MarkdownifyTool()
localscraper = LocalScraperTool()
credits = GetCreditsTool()

Invocation

Invoke directly with args

Let's try each tool individually:

# SmartScraper
result = smartscraper.invoke(
    {
        "user_prompt": "Extract the company name and description",
        "website_url": "https://scrapegraphai.com",
    }
)
print("SmartScraper Result:", result)

# Markdownify
markdown = markdownify.invoke({"website_url": "https://scrapegraphai.com"})
print("\nMarkdownify Result (first 200 chars):", markdown[:200])

local_html = """
<html>
    <body>
        <h1>Company Name</h1>
        <p>We are a technology company focused on AI solutions.</p>
        <div class="contact">
            <p>Email: contact@example.com</p>
            <p>Phone: (555) 123-4567</p>
        </div>
    </body>
</html>
"""

# LocalScraper
result_local = localscraper.invoke(
    {
        "user_prompt": "Make a summary of the webpage and extract the email and phone number",
        "website_html": local_html,
    }
)
print("LocalScraper Result:", result_local)

# Check credits
credits_info = credits.invoke({})
print("\nCredits Info:", credits_info)

SmartScraper Result: {'company_name': 'ScrapeGraphAI', 'description': "ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis."}

Markdownify Result (first 200 chars): [![ScrapeGraphAI Logo](https://scrapegraphai.com/images/scrapegraphai_logo.svg)ScrapeGraphAI](https://scrapegraphai.com/)

PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up

Op
LocalScraper Result: {'company_name': 'Company Name', 'description': 'We are a technology company focused on AI solutions.', 'contact': {'email': 'contact@example.com', 'phone': '(555) 123-4567'}}

Credits Info: {'remaining_credits': 49679, 'total_credits_used': 914}
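
The results printed above look like plain Python dictionaries whose keys depend on the prompt, so downstream code should not assume a given field is present. A minimal sketch of defensive access, using the sample outputs above as literal data; `get_path` and `has_credits` are hypothetical helpers, not part of the library:

```python
from typing import Any

# Sample outputs copied from the run above.
result_local = {
    "company_name": "Company Name",
    "description": "We are a technology company focused on AI solutions.",
    "contact": {"email": "contact@example.com", "phone": "(555) 123-4567"},
}
credits_info = {"remaining_credits": 49679, "total_credits_used": 914}


def get_path(data: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts key by key, returning default if any level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def has_credits(info: dict, needed: int = 1) -> bool:
    """Compare the remaining_credits field against a caller-chosen threshold."""
    return int(info.get("remaining_credits", 0)) >= needed


print(get_path(result_local, "contact", "email"))  # contact@example.com
print(get_path(result_local, "contact", "fax", default="n/a"))  # n/a
print(has_credits(credits_info, needed=10))  # True
```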

Invoke with ToolCall

We can also invoke the tool with a model-generated ToolCall, in which case a ToolMessage is returned:

model_generated_tool_call = {
    "args": {
        "user_prompt": "Extract the main heading and description",
        "website_url": "https://scrapegraphai.com",
    },
    "id": "1",
    "name": smartscraper.name,
    "type": "tool_call",
}
smartscraper.invoke(model_generated_tool_call)

ToolMessage(content='{"main_heading": "Get the data you need from any website", "description": "Easily extract and gather information with just a few lines of code with a simple api. Turn websites into clean and usable structured data."}', name='SmartScraper', tool_call_id='1')
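
In the output above, the `content` of the `ToolMessage` is a JSON string rather than a dictionary. A minimal sketch of decoding it with the standard library, using the string from the sample output as literal data (in a live session it would be read from the message's `content` attribute):

```python
import json

# The content string from the ToolMessage shown above.
content = (
    '{"main_heading": "Get the data you need from any website", '
    '"description": "Easily extract and gather information with just a few '
    'lines of code with a simple api. Turn websites into clean and usable '
    'structured data."}'
)

try:
    parsed = json.loads(content)
except json.JSONDecodeError:
    # Fall back to the raw text if the tool ever returns non-JSON content.
    parsed = {"raw": content}

print(parsed["main_heading"])
```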

Chaining

Let's use our tools with an LLM to analyze a website:

Select chat model:

pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig, chain

prompt = ChatPromptTemplate(
    [
        (
            "system",
            "You are a helpful assistant that can use tools to extract structured information from websites.",
        ),
        ("human", "{user_input}"),
        ("placeholder", "{messages}"),
    ]
)

llm_with_tools = llm.bind_tools([smartscraper], tool_choice=smartscraper.name)
llm_chain = prompt | llm_with_tools


@chain
def tool_chain(user_input: str, config: RunnableConfig):
    input_ = {"user_input": user_input}
    ai_msg = llm_chain.invoke(input_, config=config)
    tool_msgs = smartscraper.batch(ai_msg.tool_calls, config=config)
    return llm_chain.invoke({**input_, "messages": [ai_msg, *tool_msgs]}, config=config)


tool_chain.invoke(
    "What does ScrapeGraph AI do? Extract this information from their website https://scrapegraphai.com"
)

API reference: ChatPromptTemplate | RunnableConfig | chain

AIMessage(content='ScrapeGraph AI is an AI-powered web scraping tool that efficiently extracts and converts website data into structured formats via a simple API. It caters to developers, data scientists, and AI researchers, offering features like easy integration, support for dynamic content, and scalability for large projects. It supports various website types, including business, e-commerce, and educational sites. Contact: contact@scrapegraphai.com.', additional_kwargs={'tool_calls': [{'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'function': {'arguments': '{"user_prompt":"Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.","website_url":"https://scrapegraphai.com"}', 'name': 'SmartScraper'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 480, 'total_tokens': 527, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_c7ca0ebaca', 'finish_reason': 'stop', 'logprobs': None}, id='run-45a12c86-d499-4273-8c59-0db926799bc7-0', tool_calls=[{'name': 'SmartScraper', 'args': {'user_prompt': 'Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.', 'website_url': 'https://scrapegraphai.com'}, 'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'type': 'tool_call'}], usage_metadata={'input_tokens': 480, 'output_tokens': 47, 'total_tokens': 527, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
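
The chain above invokes the model twice: once to obtain tool calls, and once more with the tool output appended to `messages`. That control flow can be sketched offline with stand-in functions. `fake_llm` and `fake_tool` are hypothetical placeholders that only mimic the shape of the data; they call no real model or API, and the canned tool result is illustrative.

```python
def fake_llm(inputs: dict) -> dict:
    """Stand-in for llm_chain.invoke: request a tool first, then answer."""
    if not inputs.get("messages"):
        return {
            "content": "",
            "tool_calls": [
                {
                    "name": "SmartScraper",
                    "args": {"user_prompt": inputs["user_input"]},
                    "id": "1",
                    "type": "tool_call",
                }
            ],
        }
    tool_output = inputs["messages"][-1]["content"]
    return {"content": f"Summary based on: {tool_output}", "tool_calls": []}


def fake_tool(tool_call: dict) -> dict:
    """Stand-in for the scraper tool: return a canned result for the call id."""
    return {
        "content": '{"company_name": "ScrapeGraphAI"}',
        "tool_call_id": tool_call["id"],
    }


def tool_chain_sketch(user_input: str) -> dict:
    input_ = {"user_input": user_input}
    # First pass: the model responds with one or more tool calls.
    ai_msg = fake_llm(input_)
    # Run every requested call, mirroring smartscraper.batch(...).
    tool_msgs = [fake_tool(call) for call in ai_msg["tool_calls"]]
    # Second pass: the model sees its own request plus the tool output.
    return fake_llm({**input_, "messages": [ai_msg, *tool_msgs]})


print(tool_chain_sketch("What does ScrapeGraph AI do?")["content"])
```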

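API reference

For detailed documentation of all ScrapeGraph features and configurations, head to the LangChain API reference: https://python.langchain.com/docs/integrations/tools/scrapegraph

Or see the official SDK repository: https://github.com/ScrapeGraphAI/langchain-scrapegraph

Related

Tool conceptual guide
Tool how-to guides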
