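Apify Actor
Apify Actors are cloud programs designed for a wide range of web scraping, crawling, and data extraction tasks. These Actors automate the gathering of data from the web, enabling users to extract, process, and store information efficiently. Actors can be used for tasks such as scraping product details from e-commerce sites, monitoring price changes, or gathering search engine results. They integrate seamlessly with Apify Datasets, so the structured data an Actor collects can be stored, managed, and exported in formats such as JSON, CSV, or Excel for further analysis or use.
Overview
This notebook walks you through using Apify Actors with LangChain to automate web scraping and data extraction. The langchain-apify package integrates Apify's cloud-based tools with LangChain agents, enabling efficient data collection and processing for AI applications.
Setup
This integration lives in the langchain-apify package, which can be installed using pip.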
%pip install langchain-apify
Prerequisites
import os
os.environ["APIFY_API_TOKEN"] = "your-apify-api-token"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
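Hard-coding tokens in a notebook makes them easy to leak. As a minimal sketch of an alternative, the helper below reads a variable from the environment and only prompts for it when it is missing. The name ensure_env and its prompt parameter are hypothetical and not part of langchain-apify; only the standard library is used.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env(name: str, prompt: Callable[[str], str] = getpass) -> str:
    """Return the value of environment variable `name`.

    If the variable is unset or empty, ask for it once via `prompt`
    (getpass by default, so the value is not echoed) and store it
    in os.environ for the rest of the session.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]
```

Calling ensure_env("APIFY_API_TOKEN") and ensure_env("OPENAI_API_KEY") at the top of the notebook would then replace the two assignments above.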
Instantiation
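Here we instantiate ApifyActorsTool so that it can call the RAG Web Browser Apify Actor. This Actor provides web browsing functionality for AI and LLM applications, similar to the web browsing feature in ChatGPT. Any Actor from the Apify Store can be used in this way.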
from langchain_apify import ApifyActorsTool
tool = ApifyActorsTool("apify/rag-web-browser")
Invocation
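The ApifyActorsTool takes a single argument, run_input: a dictionary that is passed to the Actor as its run input. The run input schema is documented in the Input section of the Actor's detail page; see the RAG Web Browser input schema.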
tool.invoke({"run_input": {"query": "what is apify?", "maxResults": 2}})
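Because the tool expects the Actor's input nested under a run_input key, it can help to build that payload in one place. The sketch below is illustrative only: build_rag_input is a hypothetical helper, its validation rules are assumptions, and the field names query and maxResults are taken from the call above rather than from the full input schema.

```python
def build_rag_input(query: str, max_results: int = 3) -> dict:
    """Build the argument dict for tool.invoke() for the RAG Web Browser Actor.

    The outer "run_input" key is what ApifyActorsTool expects; the inner
    keys mirror the example call ("query" and "maxResults").
    """
    if not query.strip():
        raise ValueError("query must not be empty")
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    return {"run_input": {"query": query, "maxResults": max_results}}


payload = build_rag_input("what is apify?", max_results=2)
print(payload)
```

The resulting payload has the same shape as the dictionary passed to tool.invoke above, so it could be handed to the tool directly.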
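Chaining
We can provide the created tool to an agent. When asked to search for information, the agent calls the Apify Actor, which searches the web and then retrieves the search results.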
%pip install langgraph langchain-openai
from langchain_core.messages import ToolMessage
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
model = ChatOpenAI(model="gpt-4o")
tools = [tool]
graph = create_react_agent(model, tools=tools)
inputs = {"messages": [("user", "search for what is Apify")]}
for s in graph.stream(inputs, stream_mode="values"):
    message = s["messages"][-1]
    # skip tool messages
    if isinstance(message, ToolMessage):
        continue
    message.pretty_print()
================================ Human Message =================================
search for what is Apify
================================== Ai Message ==================================
Tool Calls:
apify_actor_apify_rag-web-browser (call_27mjHLzDzwa5ZaHWCMH510lm)
Call ID: call_27mjHLzDzwa5ZaHWCMH510lm
Args:
run_input: {"run_input":{"query":"Apify","maxResults":3,"outputFormats":["markdown"]}}
================================== Ai Message ==================================
Apify is a comprehensive platform for web scraping, browser automation, and data extraction. It offers a wide array of tools and services that cater to developers and businesses looking to extract data from websites efficiently and effectively. Here's an overview of Apify:
1. **Ecosystem and Tools**:
- Apify provides an ecosystem where developers can build, deploy, and publish data extraction and web automation tools called Actors.
- The platform supports various use cases such as extracting data from social media platforms, conducting automated browser-based tasks, and more.
2. **Offerings**:
- Apify offers over 3,000 ready-made scraping tools and code templates.
- Users can also build custom solutions or hire Apify's professional services for more tailored data extraction needs.
3. **Technology and Integration**:
- The platform supports integration with popular tools and services like Zapier, GitHub, Google Sheets, Pinecone, and more.
- Apify supports open-source tools and technologies such as JavaScript, Python, Puppeteer, Playwright, Selenium, and its own Crawlee library for web crawling and browser automation.
4. **Community and Learning**:
- Apify hosts a community on Discord where developers can get help and share expertise.
- It offers educational resources through the Web Scraping Academy to help users become proficient in data scraping and automation.
5. **Enterprise Solutions**:
- Apify provides enterprise-grade web data extraction solutions with high reliability, 99.95% uptime, and compliance with SOC2, GDPR, and CCPA standards.
For more information, you can visit [Apify's official website](https://apify.com/) or their [GitHub page](https://github.com/apify) which contains their code repositories and further details about their projects.
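The streaming loop prints every non-tool message as it arrives. When only the final answer is needed, the same filtering idea can be wrapped in a small function. The sketch below is library-agnostic and runs without any API calls: final_message is a hypothetical helper, and FakeAi and FakeTool are stand-ins for the real message classes, used only to demonstrate the behaviour.

```python
from typing import Any, Iterable, Optional, Tuple


def final_message(states: Iterable[dict], skip: Tuple[type, ...] = ()) -> Optional[Any]:
    """Return the last message seen across streamed states, ignoring skipped types.

    Each state is expected to hold a "messages" list, as in the streaming
    loop; only the newest message of each state is inspected.
    """
    last = None
    for state in states:
        message = state["messages"][-1]
        if isinstance(message, skip):
            continue  # e.g. raw tool output
        last = message
    return last


# Stand-in message types for a dry run (not LangChain classes).
class FakeTool:
    pass


class FakeAi:
    def __init__(self, content: str) -> None:
        self.content = content


demo_states = [
    {"messages": [FakeAi("calling tool")]},
    {"messages": [FakeAi("calling tool"), FakeTool()]},
    {"messages": [FakeAi("calling tool"), FakeTool(), FakeAi("final answer")]},
]
result = final_message(demo_states, skip=(FakeTool,))
print(result.content)
```

With the real agent, the call would presumably look like final_message(graph.stream(inputs, stream_mode="values"), skip=(ToolMessage,)).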
API reference
For more information on how to use this integration, see the git repository or the Apify integration documentation.