Diffbot Content Extraction¶

Bridge402 provides access to Diffbot's content extraction APIs through x402-payment-protected endpoints. Extract structured data from articles, products, and discussions using crypto payments.

Overview¶

The Diffbot extraction service enables AI agents and Web3 applications to extract structured content from web pages. Instead of managing API keys or subscriptions, pay per extraction with crypto.

Primary Use Cases:
- AI Content Analysis: Extract and analyze article content, product information, or discussion threads for AI processing
- Web Scraping Alternative: Use Diffbot's computer vision-based extraction instead of fragile HTML parsing
- Content Aggregation: Build decentralized content aggregators that extract data from multiple sources
- Research Automation: Automate research workflows that require structured data extraction

Available Endpoints¶

Bridge402 provides the following five Diffbot extraction endpoints:

Endpoint	Description	Use Case
`/diffbot/article`	Extract article content (title, author, date, text, images)	News articles, blog posts, editorial content
`/diffbot/product`	Extract product information (name, price, images, reviews)	E-commerce pages, product listings
`/diffbot/discussion`	Extract discussion threads and comments	Forums, comment sections, Reddit-style threads
`/diffbot/image`	Extract primary images from a web page	Image galleries, media pages, visual content analysis
`/diffbot/analyze`	Analyze a web page using Diffbot Analyze API	General web page analysis

Note: For Natural Language Processing from raw text (entities, sentiment, facts), see the Diffbot NL Processing documentation.

API Endpoints¶

Extract Article¶

POST /diffbot/article

Extract structured article content from a URL.

Query Parameters:

Parameter	Type	Required	Description	Example
`url`	string	Yes	URL of the article to extract	`https://techcrunch.com/2024/01/01/article`
`network`	string	No	Payment network preference	`base` or `sol`/`solana`

Headers:

Header	Type	Required	Description
`X-PAYMENT`	string	Yes*	Base64-encoded x402 payment data

*Required for access. If omitted, returns payment invoice (402 response).
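The query parameters above can be assembled with the standard library so that the target URL is percent-encoded correctly. The following is a minimal sketch; the helper name `build_extract_url` is hypothetical, and the host is taken from the curl examples on this page.

```python
from urllib.parse import urlencode

# Host used in this page's curl examples.
BASE_URL = "https://bridge402.tech"


def build_extract_url(endpoint, target_url, network=None):
    """Return the full request URL for a Diffbot endpoint such as 'article'.

    `endpoint` is the path segment after /diffbot/ and `network` is the
    optional payment network preference ('base', 'sol' or 'solana').
    """
    params = {"url": target_url}
    if network is not None:
        params["network"] = network
    # urlencode percent-encodes ':' and '/' inside the target URL.
    return f"{BASE_URL}/diffbot/{endpoint}?{urlencode(params)}"


print(build_extract_url("article", "https://techcrunch.com/2024/01/01/article", "sol"))
```

The same helper applies to the other endpoints by changing the first argument, for example `build_extract_url("image", "https://example.com/gallery")`.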

Extract Product¶

POST /diffbot/product

Extract product information from an e-commerce URL.

Query Parameters: Same as /diffbot/article, with `url` set to the product page URL

Headers: Same as /diffbot/article

Extract Discussion¶

POST /diffbot/discussion

Extract discussion threads and comments from a forum/community URL.

Query Parameters: Same as /diffbot/article, with `url` set to the discussion page URL

Headers: Same as /diffbot/article

Extract Images¶

POST /diffbot/image

Extract primary images from a web page with comprehensive metadata.

Query Parameters:

Parameter	Type	Required	Description	Example
`url`	string	Yes	URL of the page to extract images from	`https://example.com/gallery`
`network`	string	No	Payment network preference	`base` or `sol`/`solana`

Headers:

Header	Type	Required	Description
`X-PAYMENT`	string	Yes*	Base64-encoded x402 payment data

*Required for access. If omitted, returns payment invoice (402 response).

Analyze Web Page¶

POST /diffbot/analyze

Analyze a web page using Diffbot Analyze API.

Query Parameters:

Parameter	Type	Required	Description	Example
`url`	string	Yes	URL of the page to analyze	`https://example.com/page`
`network`	string	No	Payment network preference	`base` or `sol`/`solana`

Headers:

Header	Type	Required	Description
`X-PAYMENT`	string	Yes*	Base64-encoded x402 payment data

*Required for access. If omitted, returns payment invoice (402 response).

Request Examples¶

Get Payment Invoice (Without Payment)¶

curl -X POST "https://bridge402.tech/diffbot/article?url=https://techcrunch.com/2024/01/01/article&network=sol"

Response (402 Payment Required):

{
  "x402Version": 1,
  "error": "X-PAYMENT header is required",
  "accepts": [
    {
      "scheme": "exact",
      "network": "solana",
      "maxAmountRequired": "10000",
      "asset": "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
      "payTo": "BjxbJg48jQmoBLJnRunB1CMY5SZwvcUmnXCaWNeSXBei",
      "resource": "https://bridge402.tech/diffbot/article",
      "description": "Diffbot article extraction for URL [Solana/USDC]",
      "mimeType": "application/json",
      "maxTimeoutSeconds": 120,
      "extra": {
        "product": "Bridge402 Diffbot — Article Extraction (Solana)",
        "extractionType": "article",
        "url": "https://techcrunch.com/2024/01/01/article",
        "feePayer": "2wKupLR9q6wXYppw8Gr2NvWxKBUqm4PPJKkQfoxHDBg4"
      }
    }
  ],
  "extractionType": "article",
  "url": "https://techcrunch.com/2024/01/01/article"
}
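A client usually needs to pick the payment option for its network out of the `accepts` array and convert `maxAmountRequired` into a readable amount. The sketch below assumes USDC uses 6 decimal places, which matches the pricing on this page (10,000 atomic units = $0.01). The helper names are hypothetical.

```python
from decimal import Decimal

# Assumption: USDC has 6 decimal places (10,000 atomic units == 0.01 USDC).
USDC_DECIMALS = 6


def select_payment_option(invoice, network):
    """Return the first entry of invoice['accepts'] for `network`, or None."""
    for option in invoice.get("accepts", []):
        if option.get("network") == network:
            return option
    return None


def atomic_to_usdc(amount):
    """Convert an atomic-unit string such as '10000' to a Decimal USDC amount."""
    return Decimal(amount) / (Decimal(10) ** USDC_DECIMALS)


# Trimmed copy of the 402 response shown above.
invoice = {
    "x402Version": 1,
    "accepts": [
        {
            "scheme": "exact",
            "network": "solana",
            "maxAmountRequired": "10000",
            "payTo": "BjxbJg48jQmoBLJnRunB1CMY5SZwvcUmnXCaWNeSXBei",
        }
    ],
}

option = select_payment_option(invoice, "solana")
print(atomic_to_usdc(option["maxAmountRequired"]))  # 0.01
```

Note that the invoice reports the network as `solana` even though the request used `network=sol`, so matching on the value returned in `accepts` is the safer choice.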

Get Extraction with Payment¶

curl -X POST "https://bridge402.tech/diffbot/article?url=https://techcrunch.com/2024/01/01/article&network=sol" \
  -H "X-PAYMENT: <base64-encoded-x402-payment>"

Response (200 Success):

{
  "extractionType": "article",
  "url": "https://techcrunch.com/2024/01/01/article",
  "data": {
    "objects": [
      {
        "type": "article",
        "title": "Article Title",
        "author": "Author Name",
        "date": "2024-01-01T12:00:00.000Z",
        "text": "Full article content...",
        "html": "<html>...</html>",
        "images": [
          {
            "url": "https://example.com/image.jpg",
            "primary": true
          }
        ],
        "tags": ["technology", "crypto"],
        "language": "en"
      }
    ],
    "request": {
      "pageUrl": "https://techcrunch.com/2024/01/01/article",
      "api": "article",
      "version": 3
    }
  },
  "payment": {
    "verified": true,
    "settled": true,
    "txHash": "5xK...",
    "network": "solana"
  },
  "metadata": {
    "provider": "Diffbot",
    "endpoint": "article",
    "timestamp": 1703123456.789
  }
}

Pricing¶

Cost: $0.01 USDC per extraction/request (10,000 atomic units)
Payment Networks: Base or Solana (USDC)
No Subscription Required: Pay-per-use model perfect for AI agents and intermittent access
All Endpoints: Same price for article, product, discussion, image, analyze, and Natural Language processing
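Because every request costs the same, budgeting for a batch is simple integer arithmetic. The sketch below uses the price listed above; the function names are hypothetical, and the atomic-unit ratio is inferred from "$0.01 = 10,000 atomic units".

```python
# Price per request in atomic units, from the pricing list above.
PRICE_ATOMIC = 10_000
# Inferred from $0.01 == 10,000 atomic units.
ATOMIC_PER_USDC = 1_000_000


def batch_cost_atomic(request_count):
    """Total cost, in atomic units, of `request_count` extractions."""
    return request_count * PRICE_ATOMIC


def affordable_requests(balance_atomic):
    """Number of extractions a wallet balance (atomic units) can pay for."""
    return balance_atomic // PRICE_ATOMIC


cost = batch_cost_atomic(250)
print(cost, cost / ATOMIC_PER_USDC)        # 2500000 2.5
print(affordable_requests(5_000_000))      # 500
```

Network transaction fees, if any apply to the payer, are not included in this estimate.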

Response Format¶

Article Extraction Response¶

The data.objects[0] contains the extracted article:

{
  "type": "article",
  "title": "Article Title",
  "author": "Author Name",
  "date": "2024-01-01T12:00:00.000Z",
  "text": "Clean article text without HTML...",
  "html": "Full HTML content...",
  "images": [
    {
      "url": "https://example.com/image.jpg",
      "primary": true,
      "caption": "Image caption"
    }
  ],
  "videos": [],
  "tags": ["tag1", "tag2"],
  "language": "en",
  "resolvedPageUrl": "https://example.com/article"
}

Product Extraction Response¶

{
  "type": "product",
  "title": "Product Name",
  "brand": "Brand Name",
  "offerPrice": "$99.99",
  "regularPrice": "$129.99",
  "currencyCode": "USD",
  "availability": "inStock",
  "sku": "SKU123",
  "mpn": "MPN456",
  "gtin": "0123456789012",
  "images": [...],
  "description": "Product description...",
  "specifications": {...}
}
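In the sample above, `offerPrice` and `regularPrice` are formatted strings, so they need parsing before any comparison. The following sketch assumes US-style formatting (such as `$1,299.00`) and is illustrative only; `parse_price` and `discount_percent` are hypothetical helpers, not part of the API.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(price_text):
    """Parse a price string such as '$99.99' into a Decimal, or return None."""
    if not price_text:
        return None
    # Keep digits and the decimal point; drops currency symbols and commas.
    cleaned = re.sub(r"[^\d.]", "", price_text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def discount_percent(product):
    """Return the discount of offerPrice against regularPrice, in percent."""
    offer = parse_price(product.get("offerPrice"))
    regular = parse_price(product.get("regularPrice"))
    if offer is None or regular is None or regular == 0:
        return None
    return ((regular - offer) / regular * 100).quantize(Decimal("0.1"))


product = {"type": "product", "offerPrice": "$99.99", "regularPrice": "$129.99"}
print(discount_percent(product))  # 23.1
```

Using `Decimal` rather than `float` avoids rounding surprises when prices are compared or stored.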

Discussion Extraction Response¶

{
  "type": "discussion",
  "title": "Discussion Title",
  "author": "Author Name",
  "date": "2024-01-01T12:00:00.000Z",
  "text": "Main post content...",
  "posts": [
    {
      "author": "Commenter Name",
      "date": "2024-01-01T13:00:00.000Z",
      "text": "Comment text...",
      "parent": 0
    }
  ],
  "tags": ["tag1", "tag2"]
}

Image Extraction Response¶

The data.objects array contains image objects with metadata:

{
  "type": "image",
  "url": "https://example.com/image.jpg",
  "title": "Image Title or Caption",
  "naturalHeight": 1024,
  "naturalWidth": 768,
  "displayHeight": 512,
  "displayWidth": 384,
  "humanLanguage": "en",
  "anchorUrl": "https://example.com/linked-page",
  "pageUrl": "https://example.com/page",
  "xpath": "/HTML/BODY/IMG[@id='main-image']",
  "diffbotUri": "image|3|123456789",
  "tags": [
    {
      "id": 12345,
      "label": "Photograph",
      "uri": "http://diffbot.com/entity/..."
    }
  ],
  "meta": "EXIF, XMP, ICC Profile"
}
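Since `data.objects` may contain several image objects, a common follow-up step is to keep only images above a minimum size. The sketch below filters on the `naturalWidth` and `naturalHeight` fields from the sample above; the function name and the default thresholds are arbitrary choices for illustration.

```python
def filter_large_images(objects, min_width=600, min_height=600):
    """Return url/width/height for image objects at least the given size."""
    selected = []
    for obj in objects:
        if obj.get("type") != "image":
            continue
        # Treat missing dimensions as 0 so such images are filtered out.
        width = obj.get("naturalWidth") or 0
        height = obj.get("naturalHeight") or 0
        if width >= min_width and height >= min_height:
            selected.append({"url": obj.get("url"), "width": width, "height": height})
    return selected


objects = [
    {"type": "image", "url": "https://example.com/image.jpg",
     "naturalWidth": 768, "naturalHeight": 1024},
    {"type": "image", "url": "https://example.com/thumb.jpg",
     "naturalWidth": 100, "naturalHeight": 100},
]
print(filter_large_images(objects))
```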

Integration Examples¶

Node.js Example (diff.js)¶

The diff.js example demonstrates a complete Diffbot extraction client:

# Install dependencies
npm install undici dotenv @solana/web3.js @solana/spl-token readline

# Set environment variables
export BASE_URL=http://localhost:8081
export SOLANA_RPC=https://api.mainnet-beta.solana.com
export KEYPAIR_PATH=path/to/keypair.json

# Run the script
node diff.js

The script will:
1. Prompt you for a URL
2. Request an invoice from the Diffbot endpoint
3. Pay with Solana USDC
4. Save the extraction result to output.json

Python Example¶

import asyncio
import httpx
import base64
import json

async def extract_article(url: str, payment_data: str):
    """Extract article using Diffbot API with x402 payment"""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://bridge402.tech/diffbot/article",
            params={
                "url": url,
                "network": "sol"  # or "base"
            },
            headers={"X-PAYMENT": payment_data}
        )

        if response.status_code == 200:
            data = response.json()
            return data
        elif response.status_code == 402:
            # Payment required - get invoice
            invoice = response.json()
            print(f"Payment required: {invoice['accepts'][0]['maxAmountRequired']} atomic units")
            return invoice
        else:
            raise Exception(f"Request failed: {response.status_code} - {response.text}")

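# Usage: await is only valid inside a coroutine, so wrap the call in main()
async def main():
    result = await extract_article("https://example.com/article", "<your-x402-payment>")
    # Only a 200 response carries extracted objects under "data"
    if result.get("data"):
        article = result["data"]["objects"][0]
        print(f"Title: {article['title']}")
        print(f"Author: {article.get('author', 'Unknown')}")
        print(f"Text: {article['text'][:200]}...")

asyncio.run(main())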

JavaScript/Node.js Example¶

import { request } from 'undici';

async function extractArticle(url, paymentData) {
    const urlEncoded = encodeURIComponent(url);
    const res = await request(
        `https://bridge402.tech/diffbot/article?url=${urlEncoded}&network=sol`,
        {
            method: 'POST',
            headers: {
                'X-PAYMENT': paymentData
            }
        }
    );

    const data = await res.body.json();

    if (data.data && data.data.objects && data.data.objects.length > 0) {
        const article = data.data.objects[0];
        return {
            title: article.title,
            author: article.author,
            date: article.date,
            text: article.text,
            images: article.images || []
        };
    }

    return data;
}

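// Usage
const result = await extractArticle(
    'https://techcrunch.com/2024/01/01/article',
    '<your-x402-payment>'
);
// On failure (for example a 402 invoice) the raw response body is returned instead
if (result.title) {
    console.log(`Title: ${result.title}`);
    console.log(`Author: ${result.author}`);
} else {
    console.log('No article extracted:', result);
}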

Image Extraction Example¶

import { request } from 'undici';

async function extractImages(url, paymentData) {
    const urlEncoded = encodeURIComponent(url);
    const res = await request(
        `https://bridge402.tech/diffbot/image?url=${urlEncoded}&network=sol`,
        {
            method: 'POST',
            headers: {
                'X-PAYMENT': paymentData
            }
        }
    );

    const data = await res.body.json();

    if (data.data && data.data.objects && data.data.objects.length > 0) {
        const images = data.data.objects;
        return images.map(img => ({
            url: img.url,
            title: img.title,
            dimensions: `${img.naturalWidth}x${img.naturalHeight}`,
            displaySize: img.displayWidth && img.displayHeight 
                ? `${img.displayWidth}x${img.displayHeight}` 
                : null,
            tags: img.tags || []
        }));
    }

    return data;
}

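// Usage
const images = await extractImages(
    'https://example.com/gallery',
    '<your-x402-payment>'
);
// A non-array result is the raw response body (for example a 402 invoice)
if (Array.isArray(images)) {
    images.forEach(img => {
        console.log(`Image: ${img.url}`);
        console.log(`Dimensions: ${img.dimensions}`);
    });
}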

Error Handling¶

Common Errors¶

400 Bad Request

{
  "detail": "Invalid URL. Must start with http:// or https://"
}

402 Payment Required

{
  "x402Version": 1,
  "error": "X-PAYMENT header is required",
  "accepts": [...]
}

500 Internal Server Error
- Diffbot API may be unavailable
- Invalid URL or unsupported page type
- Retry the request

502 Bad Gateway
- Upstream Diffbot API error
- Check that the URL is accessible
- Verify Diffbot API key is configured on the server
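The status codes above map naturally onto a small dispatch step in a client. The sketch below is one possible mapping, not a prescribed policy: the action labels, the retry count, and the exponential backoff schedule are all assumptions for illustration.

```python
def next_action(status_code):
    """Map an HTTP status code from a /diffbot/* endpoint to a client action."""
    if status_code == 200:
        return "done"
    if status_code == 402:
        return "pay"          # read `accepts`, build X-PAYMENT, resend
    if status_code == 400:
        return "fix_request"  # e.g. URL must start with http:// or https://
    if status_code in (500, 502):
        return "retry"        # upstream or server-side failure
    return "fail"


def backoff_delays(max_retries=3, base_seconds=1.0):
    """Return exponential backoff delays in seconds, one per retry attempt."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


print(next_action(402))   # pay
print(backoff_delays())   # [1.0, 2.0, 4.0]
```

This page does not state whether a payment is settled when a request ends in a 5xx error, so it is worth confirming that before resending a paid request automatically.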

Use Cases for AI Agents¶

Content Analysis Pipeline¶

async def analyze_content(url: str, payment_data: str):
    """AI agent workflow for content analysis"""
    # 1. Extract article (extract_article is defined in the Python example above)
    result = await extract_article(url, payment_data)

    # A 402 invoice or an empty result has no extracted objects to analyze
    if not result.get("data") or not result["data"].get("objects"):
        return None

    article = result["data"]["objects"][0]

    # 2. Analyze with AI (analyze_with_llm is a placeholder for your own LLM call)
    analysis = await analyze_with_llm(article["text"], {
        "extract_key_points": True,
        "sentiment_analysis": True,
        "summarize": True,
        "extract_entities": True
    })

    return {
        "url": url,
        "title": article.get("title"),
        "author": article.get("author"),
        "date": article.get("date"),
        "analysis": analysis,
        "source": "Bridge402 Diffbot"
    }

Multi-Source Content Aggregation¶

async function aggregateContent(urls) {
    const results = [];

    for (const url of urls) {
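        // Get invoice for each URL (getInvoice is a placeholder for an unpaid
        // request that returns the 402 response body)
        const invoice = await getInvoice(url);

        // Pay and extract (createPayment is a placeholder for your x402 payment logic)
        const payment = await createPayment(invoice);
        const extraction = await extractArticle(url, payment);

        // extractArticle (defined above) returns a flattened
        // { title, author, date, text, images } object on success
        if (extraction.title) {
            results.push({
                url,
                title: extraction.title,
                text: extraction.text,
                extracted_at: new Date().toISOString()
            });
        }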
    }

    return results;
}

Product Price Monitoring¶

import time

async def monitor_product_price(product_url: str, payment_data: str):
    """Monitor product price changes"""
    # extract_product is assumed to mirror extract_article, posting to /diffbot/product
    result = await extract_product(product_url, payment_data)

    if result.get("data") and result["data"].get("objects"):
        product = result["data"]["objects"][0]

        return {
            "url": product_url,
            "title": product.get("title"),
            "current_price": product.get("offerPrice"),
            "regular_price": product.get("regularPrice"),
            "availability": product.get("availability"),
            "extracted_at": time.time()
        }

    # No product data (for example, a 402 invoice was returned)
    return None

Best Practices¶

Validate URLs: Ensure URLs are accessible and valid before requesting extraction
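Cache Results: Extracted content for a given URL usually changes slowly, so cache results locally to avoid paying for repeat extractions; refresh pages that update often, such as product prices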
Handle Errors: Always handle 402 responses to get payment requirements
Network Selection: Choose network based on your wallet capabilities (Base or Solana)
Rate Limiting: Be mindful of Diffbot API rate limits on the server side
URL Encoding: Always URL-encode the target URL in query parameters
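Two of the practices above, URL validation and caching, can be combined in a thin client-side layer. The sketch below is an in-memory example under stated assumptions: the class name, the one-hour default TTL, and the `(endpoint, url)` cache key are illustrative choices, and the validation rule mirrors the 400 error message ("Must start with http:// or https://").

```python
import time


def is_valid_target_url(url):
    """Check the same prefix rule the API reports in its 400 error."""
    return url.startswith(("http://", "https://"))


class ExtractionCache:
    """In-memory cache of extraction results keyed by (endpoint, url)."""

    def __init__(self, ttl_seconds=3600.0):
        self._ttl = ttl_seconds
        self._entries = {}

    def put(self, endpoint, url, result, now=None):
        now = time.time() if now is None else now
        self._entries[(endpoint, url)] = (now, result)

    def get(self, endpoint, url, now=None):
        """Return the cached result, or None if missing or older than the TTL."""
        now = time.time() if now is None else now
        entry = self._entries.get((endpoint, url))
        if entry is None:
            return None
        stored_at, result = entry
        if now - stored_at > self._ttl:
            del self._entries[(endpoint, url)]
            return None
        return result


cache = ExtractionCache(ttl_seconds=3600)
cache.put("article", "https://example.com/article", {"title": "Article Title"})
print(cache.get("article", "https://example.com/article"))
print(is_valid_target_url("ftp://example.com/file"))  # False
```

Checking the cache and validating the URL before sending a paid request avoids spending on extractions that would fail or repeat.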

Support¶

For questions about Diffbot extraction or integration help, refer to:
- Payment Integration Guide
- Examples: complete Node.js extraction client
- Contact the Bridge402 development team