Diffbot Content Extraction
Bridge402 provides access to Diffbot's content extraction APIs through x402-payment-protected endpoints. Extract structured data from articles, products, and discussions using crypto payments.
Overview
The Diffbot extraction service enables AI agents and Web3 applications to extract structured content from web pages. Instead of managing API keys or subscriptions, pay per extraction with crypto.
Primary Use Cases:
- AI Content Analysis: Extract and analyze article content, product information, or discussion threads for AI processing
- Web Scraping Alternative: Use Diffbot's computer vision-based extraction instead of fragile HTML parsing
- Content Aggregation: Build decentralized content aggregators that extract data from multiple sources
- Research Automation: Automate research workflows that require structured data extraction
Available Endpoints
Bridge402 provides five Diffbot extraction endpoints:
| Endpoint | Description | Use Case |
|---|---|---|
| `/diffbot/article` | Extract article content (title, author, date, text, images) | News articles, blog posts, editorial content |
| `/diffbot/product` | Extract product information (name, price, images, reviews) | E-commerce pages, product listings |
| `/diffbot/discussion` | Extract discussion threads and comments | Forums, comment sections, Reddit-style threads |
| `/diffbot/image` | Extract primary images from a web page | Image galleries, media pages, visual content analysis |
| `/diffbot/analyze` | Analyze a web page using the Diffbot Analyze API | General web page analysis |
Note: For Natural Language Processing from raw text (entities, sentiment, facts), see the Diffbot NL Processing documentation.
API Endpoints
Extract Article
POST /diffbot/article
Extract structured article content from a URL.
Query Parameters:
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| `url` | string | Yes | URL of the article to extract | https://techcrunch.com/2024/01/01/article |
| `network` | string | No | Payment network preference | base or sol/solana |
Headers:
| Header | Type | Required | Description |
|---|---|---|---|
| `X-PAYMENT` | string | Yes* | Base64-encoded x402 payment data |
*Required for access. If omitted, returns payment invoice (402 response).
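Because the target page is passed as a query parameter, its own `?` and `&` characters need to be percent-encoded. The following is a minimal sketch of building the request URL, assuming the `https://bridge402.tech` host used in the curl examples in this guide; the helper name `build_extraction_url` is hypothetical and not part of any Bridge402 SDK.

```python
from typing import Optional
from urllib.parse import urlencode

# Host taken from the curl examples in this guide; adjust for your deployment.
BRIDGE402_BASE = "https://bridge402.tech"


def build_extraction_url(endpoint: str, target_url: str, network: Optional[str] = None) -> str:
    """Return the request URL for a Diffbot endpoint such as "article" or "image"."""
    params = {"url": target_url}
    if network is not None:
        params["network"] = network  # optional payment network preference
    # urlencode percent-encodes the target URL so its own "?" and "&" are preserved
    return f"{BRIDGE402_BASE}/diffbot/{endpoint}?{urlencode(params)}"


request_url = build_extraction_url("article", "https://example.com/a?id=1&x=2", "sol")
print(request_url)
```

The same helper covers the product, discussion, image, and analyze endpoints, since they take the same two query parameters.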
Extract Product
POST /diffbot/product
Extract product information from an e-commerce URL.
Query Parameters: Same as /diffbot/article (replace with product URL)
Headers: Same as /diffbot/article
Extract Discussion
POST /diffbot/discussion
Extract discussion threads and comments from a forum/community URL.
Query Parameters: Same as /diffbot/article (replace with discussion URL)
Headers: Same as /diffbot/article
Extract Images
POST /diffbot/image
Extract primary images from a web page with comprehensive metadata.
Query Parameters:
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| `url` | string | Yes | URL of the page to extract images from | https://example.com/gallery |
| `network` | string | No | Payment network preference | base or sol/solana |
Headers:
| Header | Type | Required | Description |
|---|---|---|---|
| `X-PAYMENT` | string | Yes* | Base64-encoded x402 payment data |
*Required for access. If omitted, returns payment invoice (402 response).
Analyze Web Page
POST /diffbot/analyze
Analyze a web page using Diffbot Analyze API.
Query Parameters:
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| `url` | string | Yes | URL of the page to analyze | https://example.com/page |
| `network` | string | No | Payment network preference | base or sol/solana |
Headers:
| Header | Type | Required | Description |
|---|---|---|---|
| `X-PAYMENT` | string | Yes* | Base64-encoded x402 payment data |
*Required for access. If omitted, returns payment invoice (402 response).
Request Examples
Get Payment Invoice (Without Payment)
curl -X POST "https://bridge402.tech/diffbot/article?url=https://techcrunch.com/2024/01/01/article&network=sol"
Response (402 Payment Required):
{
  "x402Version": 1,
  "error": "X-PAYMENT header is required",
  "accepts": [
    {
      "scheme": "exact",
      "network": "solana",
      "maxAmountRequired": "10000",
      "asset": "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
      "payTo": "BjxbJg48jQmoBLJnRunB1CMY5SZwvcUmnXCaWNeSXBei",
      "resource": "https://bridge402.tech/diffbot/article",
      "description": "Diffbot article extraction for URL [Solana/USDC]",
      "mimeType": "application/json",
      "maxTimeoutSeconds": 120,
      "extra": {
        "product": "Bridge402 Diffbot — Article Extraction (Solana)",
        "extractionType": "article",
        "url": "https://techcrunch.com/2024/01/01/article",
        "feePayer": "2wKupLR9q6wXYppw8Gr2NvWxKBUqm4PPJKkQfoxHDBg4"
      }
    }
  ],
  "extractionType": "article",
  "url": "https://techcrunch.com/2024/01/01/article"
}
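Before paying, a client will usually want to pick the entry in `accepts` that matches its wallet and confirm the price is within budget. The sketch below assumes only the fields shown in the invoice above; `select_payment_option` and the `max_atomic_units` cap are hypothetical names introduced for illustration.

```python
from typing import Optional


def select_payment_option(invoice: dict, network: str, max_atomic_units: int) -> Optional[dict]:
    """Return the first accepted option on `network` priced at or below the cap."""
    for option in invoice.get("accepts", []):
        if option.get("network") != network:
            continue
        # maxAmountRequired is a string of atomic units, so convert before comparing
        if int(option["maxAmountRequired"]) <= max_atomic_units:
            return option
    return None


# Trimmed copy of the 402 response shown above
invoice = {
    "x402Version": 1,
    "accepts": [
        {
            "scheme": "exact",
            "network": "solana",
            "maxAmountRequired": "10000",
            "payTo": "BjxbJg48jQmoBLJnRunB1CMY5SZwvcUmnXCaWNeSXBei",
        }
    ],
}

option = select_payment_option(invoice, "solana", max_atomic_units=20_000)
print(option["payTo"] if option else "no acceptable payment option")
```

Returning `None` rather than raising lets the caller decide whether to try another network or give up.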
Get Extraction with Payment
curl -X POST "https://bridge402.tech/diffbot/article?url=https://techcrunch.com/2024/01/01/article&network=sol" \
  -H "X-PAYMENT: <base64-encoded-x402-payment>"
Response (200 Success):
{
  "extractionType": "article",
  "url": "https://techcrunch.com/2024/01/01/article",
  "data": {
    "objects": [
      {
        "type": "article",
        "title": "Article Title",
        "author": "Author Name",
        "date": "2024-01-01T12:00:00.000Z",
        "text": "Full article content...",
        "html": "<html>...</html>",
        "images": [
          {
            "url": "https://example.com/image.jpg",
            "primary": true
          }
        ],
        "tags": ["technology", "crypto"],
        "language": "en"
      }
    ],
    "request": {
      "pageUrl": "https://techcrunch.com/2024/01/01/article",
      "api": "article",
      "version": 3
    }
  },
  "payment": {
    "verified": true,
    "settled": true,
    "txHash": "5xK...",
    "network": "solana"
  },
  "metadata": {
    "provider": "Diffbot",
    "endpoint": "article",
    "timestamp": 1703123456.789
  }
}
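A successful response wraps the Diffbot result in `data` and reports the payment outcome in `payment`. The following sketch reads both defensively, assuming only the field names shown in the response above; the helper names are hypothetical.

```python
from typing import Optional


def first_object(response: dict) -> Optional[dict]:
    """Return the first extracted object, or None if the page yielded nothing."""
    objects = (response.get("data") or {}).get("objects") or []
    return objects[0] if objects else None


def payment_settled(response: dict) -> bool:
    """True only when the payment block reports both verified and settled."""
    payment = response.get("payment") or {}
    return bool(payment.get("verified")) and bool(payment.get("settled"))


# Trimmed copy of the 200 response shown above
response = {
    "data": {"objects": [{"type": "article", "title": "Article Title"}]},
    "payment": {"verified": True, "settled": True, "network": "solana"},
}

article = first_object(response)
if article is not None and payment_settled(response):
    print(article["title"])
```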
Pricing
- Cost: $0.01 USDC per extraction/request (10,000 atomic units)
- Payment Networks: Base or Solana (USDC)
- No Subscription Required: Pay-per-use model perfect for AI agents and intermittent access
- All Endpoints: Same price for article, product, discussion, image, analyze, and Natural Language processing
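The relationship between atomic units and dollars follows from USDC using 6 decimal places, so 10,000 atomic units is 0.01 USDC. A small sketch of the arithmetic, useful for budgeting a batch of extractions; the constant and function names are illustrative, and the per-request price is the one listed above.

```python
from decimal import Decimal

USDC_DECIMALS = 6              # USDC uses 6 decimal places
PRICE_ATOMIC_UNITS = 10_000    # per-request price listed above


def atomic_to_usdc(units: int) -> Decimal:
    """Convert USDC atomic units to whole USDC without float rounding."""
    return Decimal(units) / Decimal(10 ** USDC_DECIMALS)


def batch_cost_usdc(request_count: int) -> Decimal:
    """Total cost in USDC for a number of extraction requests."""
    return atomic_to_usdc(PRICE_ATOMIC_UNITS * request_count)


print(atomic_to_usdc(PRICE_ATOMIC_UNITS))  # cost of one request
print(batch_cost_usdc(250))                # cost of 250 requests
```

`Decimal` is used instead of `float` so that amounts compare exactly against invoice values.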
Response Format
Article Extraction Response
The data.objects[0] contains the extracted article:
{
  "type": "article",
  "title": "Article Title",
  "author": "Author Name",
  "date": "2024-01-01T12:00:00.000Z",
  "text": "Clean article text without HTML...",
  "html": "Full HTML content...",
  "images": [
    {
      "url": "https://example.com/image.jpg",
      "primary": true,
      "caption": "Image caption"
    }
  ],
  "videos": [],
  "tags": ["tag1", "tag2"],
  "language": "en",
  "resolvedPageUrl": "https://example.com/article"
}
Product Extraction Response
{
  "type": "product",
  "title": "Product Name",
  "brand": "Brand Name",
  "offerPrice": "$99.99",
  "regularPrice": "$129.99",
  "currencyCode": "USD",
  "availability": "inStock",
  "sku": "SKU123",
  "mpn": "MPN456",
  "gtin": "0123456789012",
  "images": [...],
  "description": "Product description...",
  "specifications": {...}
}
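In the example above, `offerPrice` and `regularPrice` are formatted strings rather than numbers, so they need parsing before any comparison. This is a rough sketch that assumes US-style formatting (comma thousands separator, dot decimal point); other locales would need different handling, and both helper names are hypothetical.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Pull the first number out of a price string such as "$1,299.99"."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


def discount_percent(product: dict) -> Optional[Decimal]:
    """Percentage saved versus the regular price, rounded to one decimal place."""
    offer = parse_price(product.get("offerPrice"))
    regular = parse_price(product.get("regularPrice"))
    if offer is None or regular is None or regular == 0:
        return None
    return ((regular - offer) / regular * 100).quantize(Decimal("0.1"))


product = {"offerPrice": "$99.99", "regularPrice": "$129.99"}
print(parse_price(product["offerPrice"]), discount_percent(product))
```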
Discussion Extraction Response
{
  "type": "discussion",
  "title": "Discussion Title",
  "author": "Author Name",
  "date": "2024-01-01T12:00:00.000Z",
  "text": "Main post content...",
  "posts": [
    {
      "author": "Commenter Name",
      "date": "2024-01-01T13:00:00.000Z",
      "text": "Comment text...",
      "parent": 0
    }
  ],
  "tags": ["tag1", "tag2"]
}
Image Extraction Response
The data.objects array contains image objects with metadata:
{
  "type": "image",
  "url": "https://example.com/image.jpg",
  "title": "Image Title or Caption",
  "naturalHeight": 1024,
  "naturalWidth": 768,
  "displayHeight": 512,
  "displayWidth": 384,
  "humanLanguage": "en",
  "anchorUrl": "https://example.com/linked-page",
  "pageUrl": "https://example.com/page",
  "xpath": "/HTML/BODY/IMG[@id='main-image']",
  "diffbotUri": "image|3|123456789",
  "tags": [
    {
      "id": 12345,
      "label": "Photograph",
      "uri": "http://diffbot.com/entity/..."
    }
  ],
  "metadata": "EXIF, XMP, ICC Profile"
}
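Since each image object carries its natural dimensions, a consumer can discard thumbnails and rank what remains by size. A minimal sketch using only the `type`, `naturalWidth`, and `naturalHeight` fields shown above; the function name and thresholds are illustrative.

```python
from typing import List


def usable_images(objects: List[dict], min_width: int = 0, min_height: int = 0) -> List[dict]:
    """Keep image objects at or above the given natural size, largest area first."""
    kept = [
        obj for obj in objects
        if obj.get("type") == "image"
        and obj.get("naturalWidth", 0) >= min_width
        and obj.get("naturalHeight", 0) >= min_height
    ]
    return sorted(
        kept,
        key=lambda obj: obj.get("naturalWidth", 0) * obj.get("naturalHeight", 0),
        reverse=True,
    )


objects = [
    {"type": "image", "url": "https://example.com/thumb.jpg", "naturalWidth": 96, "naturalHeight": 96},
    {"type": "image", "url": "https://example.com/image.jpg", "naturalWidth": 768, "naturalHeight": 1024},
    {"type": "image", "url": "https://example.com/banner.jpg", "naturalWidth": 1200, "naturalHeight": 300},
]

for image in usable_images(objects, min_width=300, min_height=300):
    print(image["url"])
```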
Integration Examples
Node.js Example (diff.js)
The diff.js example demonstrates a complete Diffbot extraction client:
# Install dependencies
npm install undici dotenv @solana/web3.js @solana/spl-token readline
# Set environment variables
export BASE_URL=http://localhost:8081
export SOLANA_RPC=https://api.mainnet-beta.solana.com
export KEYPAIR_PATH=path/to/keypair.json
# Run the script
node diff.js
The script will:
1. Prompt you for a URL
2. Request an invoice from the Diffbot endpoint
3. Pay with Solana USDC
4. Save the extraction result to output.json
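The steps above can be sketched independently of any wallet or HTTP library. In this outline, `send` and `pay` are hypothetical callables supplied by the caller: `send` performs the POST with an optional X-PAYMENT value, and `pay` turns an invoice into that value. It is an illustration of the control flow, not a reproduction of diff.js.

```python
import json
from typing import Callable, Optional, Tuple

Response = Tuple[int, dict]  # (HTTP status code, parsed JSON body)


def extract_with_payment(
    send: Callable[[Optional[str]], Response],
    pay: Callable[[dict], str],
) -> dict:
    """Run the invoice, pay, extract sequence and return the result body."""
    status, invoice = send(None)  # no payment yet, so a 402 invoice is expected
    if status != 402:
        raise RuntimeError(f"expected a 402 invoice, got {status}")
    payment_header = pay(invoice)  # wallet-specific signing happens in here
    status, body = send(payment_header)
    if status != 200:
        raise RuntimeError(f"extraction failed with status {status}")
    return body


def save_result(body: dict, path: str = "output.json") -> None:
    """Write the extraction result to disk as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(body, handle, indent=2)
```

Passing the network and payment steps in as callables keeps the flow easy to test with stand-ins before any real funds are involved.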
Python Example
import asyncio
import httpx
async def extract_article(url: str, payment_data: str):
    """Extract article using Diffbot API with x402 payment"""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://bridge402.tech/diffbot/article",
            params={
                "url": url,
                "network": "sol"  # or "base"
            },
            headers={"X-PAYMENT": payment_data}
        )
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 402:
            # Payment required - get invoice
            invoice = response.json()
            print(f"Payment required: {invoice['accepts'][0]['maxAmountRequired']} atomic units")
            return invoice
        else:
            raise Exception(f"Request failed: {response.status_code} - {response.text}")
# Usage (await is only valid inside a coroutine, so wrap the calls in main())
async def main():
    result = await extract_article("https://example.com/article", "<your-x402-payment>")
    if result.get("data"):
        article = result["data"]["objects"][0]
        print(f"Title: {article['title']}")
        print(f"Author: {article.get('author', 'Unknown')}")
        print(f"Text: {article['text'][:200]}...")
asyncio.run(main())
JavaScript/Node.js Example
import { request } from 'undici';
async function extractArticle(url, paymentData) {
  const urlEncoded = encodeURIComponent(url);
  const res = await request(
    `https://bridge402.tech/diffbot/article?url=${urlEncoded}&network=sol`,
    {
      method: 'POST',
      headers: {
        'X-PAYMENT': paymentData
      }
    }
  );
  const data = await res.body.json();
  if (data.data && data.data.objects && data.data.objects.length > 0) {
    const article = data.data.objects[0];
    return {
      title: article.title,
      author: article.author,
      date: article.date,
      text: article.text,
      images: article.images || []
    };
  }
  return data;
}
// Usage
const result = await extractArticle(
  'https://techcrunch.com/2024/01/01/article',
  '<your-x402-payment>'
);
console.log(`Title: ${result.title}`);
console.log(`Author: ${result.author}`);
Image Extraction Example
import { request } from 'undici';
async function extractImages(url, paymentData) {
  const urlEncoded = encodeURIComponent(url);
  const res = await request(
    `https://bridge402.tech/diffbot/image?url=${urlEncoded}&network=sol`,
    {
      method: 'POST',
      headers: {
        'X-PAYMENT': paymentData
      }
    }
  );
  const data = await res.body.json();
  if (data.data && data.data.objects && data.data.objects.length > 0) {
    const images = data.data.objects;
    return images.map(img => ({
      url: img.url,
      title: img.title,
      dimensions: `${img.naturalWidth}x${img.naturalHeight}`,
      displaySize: img.displayWidth && img.displayHeight
        ? `${img.displayWidth}x${img.displayHeight}`
        : null,
      tags: img.tags || []
    }));
  }
  return data;
}
// Usage
const images = await extractImages(
  'https://example.com/gallery',
  '<your-x402-payment>'
);
images.forEach(img => {
  console.log(`Image: ${img.url}`);
  console.log(`Dimensions: ${img.dimensions}`);
});
Error Handling
Common Errors
400 Bad Request
402 Payment Required
- The X-PAYMENT header was omitted; the response body contains the payment invoice
500 Internal Server Error
- Diffbot API may be unavailable
- Invalid URL or unsupported page type
- Retry the request
502 Bad Gateway
- Upstream Diffbot API error
- Check that the URL is accessible
- Verify Diffbot API key is configured on the server
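For the 500 and 502 cases, where a retry is suggested, one common approach is exponential backoff. The sketch below is an illustration under assumptions: `send` is a hypothetical zero-argument callable returning `(status, body)`, and whether a retried request can reuse the same X-PAYMENT value is not specified in this guide, so check that before retrying paid calls.

```python
import time
from typing import Callable, Tuple

RETRYABLE_STATUSES = {500, 502}  # the server-side errors listed above


def call_with_retry(
    send: Callable[[], Tuple[int, dict]],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call send() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        is_last = attempt == attempts - 1
        if status not in RETRYABLE_STATUSES or is_last:
            return status, body
        sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ... with the defaults
    raise AssertionError("unreachable")  # the loop always returns on its last pass
```

Statuses such as 400 and 402 are returned immediately, since repeating the same request would not change the outcome.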
Use Cases for AI Agents
Content Analysis Pipeline
async def analyze_content(url: str, payment_data: str):
    """AI agent workflow for content analysis"""
    # 1. Extract article (extract_article is defined in the Python example above)
    result = await extract_article(url, payment_data)
    if not result.get("data"):
        return None
    article = result["data"]["objects"][0]
    # 2. Analyze with AI (analyze_with_llm is a placeholder for your own LLM call)
    analysis = await analyze_with_llm(article["text"], {
        "extract_key_points": True,
        "sentiment_analysis": True,
        "summarize": True,
        "extract_entities": True
    })
    return {
        "url": url,
        "title": article["title"],
        "author": article.get("author"),
        "date": article["date"],
        "analysis": analysis,
        "source": "Bridge402 Diffbot"
    }
Multi-Source Content Aggregation
async function aggregateContent(urls) {
  const results = [];
  for (const url of urls) {
    // Get invoice for each URL (getInvoice and createPayment are placeholders for your payment code)
    const invoice = await getInvoice(url);
    // Pay and extract
    const payment = await createPayment(invoice);
    const extraction = await extractArticle(url, payment);
    // On success, extractArticle returns a flattened { title, author, date, text, images } object
    if (extraction.text) {
      results.push({
        url,
        title: extraction.title,
        text: extraction.text,
        extracted_at: new Date().toISOString()
      });
    }
  }
  return results;
}
Product Price Monitoring
import time
async def monitor_product_price(product_url: str, payment_data: str):
    """Monitor product price changes"""
    # extract_product is a placeholder: the same request as extract_article, sent to /diffbot/product
    result = await extract_product(product_url, payment_data)
    if result.get("data") and result["data"]["objects"]:
        product = result["data"]["objects"][0]
        return {
            "url": product_url,
            "title": product.get("title"),
            "current_price": product.get("offerPrice"),
            "regular_price": product.get("regularPrice"),
            "availability": product.get("availability"),
            "extracted_at": time.time()
        }
    return None
Best Practices
- Validate URLs: Ensure URLs are accessible and valid before requesting extraction
- Cache Results: Static pages such as published articles rarely change, so cache their results locally to avoid paying for the same URL twice; re-extract pages whose content changes, such as product prices
- Handle Errors: Always handle 402 responses to get payment requirements
- Network Selection: Choose network based on your wallet capabilities (Base or Solana)
- Rate Limiting: Be mindful of Diffbot API rate limits on the server side
- URL Encoding: Always URL-encode the target URL in query parameters
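As an illustration of the caching advice, here is a small in-memory cache keyed by endpoint and URL with a time-to-live. The class is a hypothetical sketch, not part of Bridge402; a persistent store would be needed to survive restarts, and the right TTL depends on how often the target pages change.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExtractionCache:
    """In-memory cache of extraction results keyed by (endpoint, url)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries: Dict[Tuple[str, str], Tuple[float, dict]] = {}

    def get(self, endpoint: str, url: str) -> Optional[dict]:
        """Return the cached result, or None if it is missing or expired."""
        entry = self._entries.get((endpoint, url))
        if entry is None:
            return None
        stored_at, result = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[(endpoint, url)]  # expired, so the next call pays again
            return None
        return result

    def put(self, endpoint: str, url: str, result: dict) -> None:
        """Store a fresh extraction result."""
        self._entries[(endpoint, url)] = (self._clock(), result)


cache = ExtractionCache(ttl_seconds=3600)
cache.put("article", "https://example.com/article", {"data": {"objects": []}})
print(cache.get("article", "https://example.com/article") is not None)
```

A short TTL suits product pages, while article pages can usually tolerate a much longer one.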
Support
For questions about Diffbot extraction or integration help, refer to:
- Payment Integration Guide
- Examples - Complete Node.js extraction client
- Contact the Bridge402 development team