Diffbot Content Extraction

Bridge402 provides access to Diffbot's content extraction APIs through x402-payment-protected endpoints. Extract structured data from articles, products, and discussions using crypto payments.

Overview

The Diffbot extraction service enables AI agents and Web3 applications to extract structured content from web pages. Instead of managing API keys or subscriptions, pay per extraction with crypto.

Primary Use Cases:

  • AI Content Analysis: Extract and analyze article content, product information, or discussion threads for AI processing
  • Web Scraping Alternative: Use Diffbot's computer vision-based extraction instead of fragile HTML parsing
  • Content Aggregation: Build decentralized content aggregators that extract data from multiple sources
  • Research Automation: Automate research workflows that require structured data extraction

Available Endpoints

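Bridge402 provides six Diffbot endpoints. The five extraction endpoints are listed below; the sixth, Natural Language processing, is documented separately: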

| Endpoint | Description | Use Case |
|---|---|---|
| /diffbot/article | Extract article content (title, author, date, text, images) | News articles, blog posts, editorial content |
| /diffbot/product | Extract product information (name, price, images, reviews) | E-commerce pages, product listings |
| /diffbot/discussion | Extract discussion threads and comments | Forums, comment sections, Reddit-style threads |
| /diffbot/image | Extract primary images from a web page | Image galleries, media pages, visual content analysis |
| /diffbot/analyze | Analyze a web page using Diffbot Analyze API | General web page analysis |

Note: For Natural Language Processing from raw text (entities, sentiment, facts), see the Diffbot NL Processing documentation.

API Endpoints

Extract Article

POST /diffbot/article

Extract structured article content from a URL.

Query Parameters:

| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| url | string | Yes | URL of the article to extract | https://techcrunch.com/2024/01/01/article |
| network | string | No | Payment network preference | base or sol/solana |

Headers:

| Header | Type | Required | Description |
|---|---|---|---|
| X-PAYMENT | string | Yes* | Base64-encoded x402 payment data |

*Required for access. If omitted, returns payment invoice (402 response).
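
The header description above only states that X-PAYMENT carries Base64-encoded x402 payment data. As a minimal sketch, assuming the payment data is a JSON object produced by your x402 client, the encoding step might look like this. The field values in `example_payment` are illustrative placeholders, not a payload that would actually settle.

```python
import base64
import json


def encode_payment_header(payment: dict) -> str:
    """Serialize a payment object to JSON and Base64-encode it for X-PAYMENT."""
    raw = json.dumps(payment, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_payment_header(header: str) -> dict:
    """Inverse of encode_payment_header; useful when debugging a rejected payment."""
    return json.loads(base64.b64decode(header))


# Placeholder payload (assumption): real contents come from your x402 client or wallet code.
example_payment = {
    "x402Version": 1,
    "scheme": "exact",
    "network": "solana",
    "payload": {"transaction": "<signed-transaction>"},
}
header_value = encode_payment_header(example_payment)
```

The resulting `header_value` string is what would be passed as the `X-PAYMENT` header value.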

Extract Product

POST /diffbot/product

Extract product information from an e-commerce URL.

Query Parameters: Same as /diffbot/article (replace with product URL)

Headers: Same as /diffbot/article

Extract Discussion

POST /diffbot/discussion

Extract discussion threads and comments from a forum/community URL.

Query Parameters: Same as /diffbot/article (replace with discussion URL)

Headers: Same as /diffbot/article

Extract Images

POST /diffbot/image

Extract primary images from a web page with comprehensive metadata.

Query Parameters:

| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| url | string | Yes | URL of the page to extract images from | https://example.com/gallery |
| network | string | No | Payment network preference | base or sol/solana |

Headers:

| Header | Type | Required | Description |
|---|---|---|---|
| X-PAYMENT | string | Yes* | Base64-encoded x402 payment data |

*Required for access. If omitted, returns payment invoice (402 response).

Analyze Web Page

POST /diffbot/analyze

Analyze a web page using Diffbot Analyze API.

Query Parameters:

| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| url | string | Yes | URL of the page to analyze | https://example.com/page |
| network | string | No | Payment network preference | base or sol/solana |

Headers:

| Header | Type | Required | Description |
|---|---|---|---|
| X-PAYMENT | string | Yes* | Base64-encoded x402 payment data |

*Required for access. If omitted, returns payment invoice (402 response).
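
All five endpoints take the same `url` and `network` query parameters, so one helper can build the request URL for any of them. The sketch below uses only the standard library; the helper name `build_request_url` is illustrative, and the host is taken from the request examples in this document.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://bridge402.tech"
ENDPOINTS = {"article", "product", "discussion", "image", "analyze"}
NETWORKS = {"base", "sol", "solana"}


def build_request_url(endpoint: str, target_url: str, network: Optional[str] = None) -> str:
    """Return the full POST URL for one extraction, with the target URL percent-encoded."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    params = {"url": target_url}
    if network is not None:
        if network not in NETWORKS:
            raise ValueError(f"unknown network: {network}")
        params["network"] = network
    # urlencode escapes '&', '?', and '=' inside the target URL so they
    # cannot be mistaken for parameters of the Bridge402 request itself.
    return f"{BASE_URL}/diffbot/{endpoint}?{urlencode(params)}"


print(build_request_url("article", "https://example.com/page?a=1&b=2", "sol"))
```

Encoding matters most when the target URL has its own query string, as in the usage line above.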

Request Examples

Get Payment Invoice (Without Payment)

curl -X POST "https://bridge402.tech/diffbot/article?url=https://techcrunch.com/2024/01/01/article&network=sol"

Response (402 Payment Required):

{
  "x402Version": 1,
  "error": "X-PAYMENT header is required",
  "accepts": [
    {
      "scheme": "exact",
      "network": "solana",
      "maxAmountRequired": "10000",
      "asset": "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
      "payTo": "BjxbJg48jQmoBLJnRunB1CMY5SZwvcUmnXCaWNeSXBei",
      "resource": "https://bridge402.tech/diffbot/article",
      "description": "Diffbot article extraction for URL [Solana/USDC]",
      "mimeType": "application/json",
      "maxTimeoutSeconds": 120,
      "extra": {
        "product": "Bridge402 Diffbot — Article Extraction (Solana)",
        "extractionType": "article",
        "url": "https://techcrunch.com/2024/01/01/article",
        "feePayer": "2wKupLR9q6wXYppw8Gr2NvWxKBUqm4PPJKkQfoxHDBg4"
      }
    }
  ],
  "extractionType": "article",
  "url": "https://techcrunch.com/2024/01/01/article"
}
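
A client usually needs two things from the 402 invoice: the payment option for its network, and the price in human-readable units. The following sketch reads both from a trimmed copy of the response above. It assumes USDC's 6 decimal places, which matches the pricing stated in this document (10,000 atomic units = $0.01); the function names are illustrative.

```python
from decimal import Decimal

USDC_DECIMALS = 6  # assumption consistent with 10,000 atomic units = $0.01


def pick_payment_option(invoice: dict, network: str) -> dict:
    """Return the first entry in invoice['accepts'] that matches the given network."""
    for option in invoice.get("accepts", []):
        if option.get("network") == network:
            return option
    raise LookupError(f"invoice offers no payment option for {network}")


def atomic_to_usdc(amount: str) -> Decimal:
    """Convert an atomic-unit string such as '10000' into a USDC amount."""
    return Decimal(amount) / (Decimal(10) ** USDC_DECIMALS)


# Trimmed copy of the 402 response shown above
invoice = {
    "x402Version": 1,
    "accepts": [
        {
            "scheme": "exact",
            "network": "solana",
            "maxAmountRequired": "10000",
            "payTo": "BjxbJg48jQmoBLJnRunB1CMY5SZwvcUmnXCaWNeSXBei",
        }
    ],
}
option = pick_payment_option(invoice, "solana")
print(option["payTo"], atomic_to_usdc(option["maxAmountRequired"]))
```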

Get Extraction with Payment

curl -X POST "https://bridge402.tech/diffbot/article?url=https://techcrunch.com/2024/01/01/article&network=sol" \
  -H "X-PAYMENT: <base64-encoded-x402-payment>"

Response (200 Success):

{
  "extractionType": "article",
  "url": "https://techcrunch.com/2024/01/01/article",
  "data": {
    "objects": [
      {
        "type": "article",
        "title": "Article Title",
        "author": "Author Name",
        "date": "2024-01-01T12:00:00.000Z",
        "text": "Full article content...",
        "html": "<html>...</html>",
        "images": [
          {
            "url": "https://example.com/image.jpg",
            "primary": true
          }
        ],
        "tags": ["technology", "crypto"],
        "language": "en"
      }
    ],
    "request": {
      "pageUrl": "https://techcrunch.com/2024/01/01/article",
      "api": "article",
      "version": 3
    }
  },
  "payment": {
    "verified": true,
    "settled": true,
    "txHash": "5xK...",
    "network": "solana"
  },
  "metadata": {
    "provider": "Diffbot",
    "endpoint": "article",
    "timestamp": 1703123456.789
  }
}
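
Before using the extracted objects, a client may want to confirm that the payment block reports success. This is a minimal sketch based on the response shape above; treating an unverified or unsettled payment as an error is a design assumption rather than documented behavior, and `unpack_extraction` is an illustrative name.

```python
def unpack_extraction(response: dict) -> list:
    """Return data.objects from a 200 response after checking the payment flags."""
    payment = response.get("payment", {})
    if not (payment.get("verified") and payment.get("settled")):
        raise RuntimeError("payment was not both verified and settled")
    # A page with nothing extractable may yield an empty list
    return response.get("data", {}).get("objects", [])


# Trimmed copy of the 200 response shown above
sample = {
    "extractionType": "article",
    "data": {"objects": [{"type": "article", "title": "Article Title"}]},
    "payment": {"verified": True, "settled": True, "network": "solana"},
}
for obj in unpack_extraction(sample):
    print(obj["type"], obj["title"])
```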

Pricing

  • Cost: $0.01 USDC per extraction/request (10,000 atomic units)
  • Payment Networks: Base or Solana (USDC)
  • No Subscription Required: Pay-per-use model perfect for AI agents and intermittent access
  • All Endpoints: Same price for article, product, discussion, image, analyze, and Natural Language processing
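
Because every endpoint costs the same flat amount, budgeting for a batch is simple arithmetic. The sketch below uses the stated price of $0.01 USDC per request; the function names are illustrative, and network transaction fees (if any) are not modeled.

```python
from decimal import Decimal

PRICE_USDC = Decimal("0.01")  # per extraction, as stated above
PRICE_ATOMIC = 10_000         # the same price in atomic units


def estimate_cost(requests: int) -> Decimal:
    """Total USDC needed for the given number of extractions."""
    if requests < 0:
        raise ValueError("requests must be non-negative")
    return PRICE_USDC * requests


def max_extractions(budget_usdc: Decimal) -> int:
    """Number of whole extractions a USDC budget can pay for."""
    return int(budget_usdc // PRICE_USDC)


print(estimate_cost(250))
print(max_extractions(Decimal("1.00")))
```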

Response Format

Article Extraction Response

data.objects[0] contains the extracted article:

{
  "type": "article",
  "title": "Article Title",
  "author": "Author Name",
  "date": "2024-01-01T12:00:00.000Z",
  "text": "Clean article text without HTML...",
  "html": "Full HTML content...",
  "images": [
    {
      "url": "https://example.com/image.jpg",
      "primary": true,
      "caption": "Image caption"
    }
  ],
  "videos": [],
  "tags": ["tag1", "tag2"],
  "language": "en",
  "resolvedPageUrl": "https://example.com/article"
}

Product Extraction Response

{
  "type": "product",
  "title": "Product Name",
  "brand": "Brand Name",
  "offerPrice": "$99.99",
  "regularPrice": "$129.99",
  "currencyCode": "USD",
  "availability": "inStock",
  "sku": "SKU123",
  "mpn": "MPN456",
  "gtin": "0123456789012",
  "images": [...],
  "description": "Product description...",
  "specifications": {...}
}
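
The price fields in this example are strings with a currency symbol, so they need parsing before any comparison. The sketch below assumes prices formatted like "$99.99", with "." as the decimal separator and "," as an optional thousands separator; other locales would need different handling. Both function names are illustrative.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Extract the numeric part of a price string such as '$1,299.99'."""
    if not text:
        return None
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return Decimal(match.group()) if match else None


def discount_percent(product: dict) -> Optional[Decimal]:
    """Percentage by which offerPrice undercuts regularPrice, to one decimal place."""
    offer = parse_price(product.get("offerPrice"))
    regular = parse_price(product.get("regularPrice"))
    if offer is None or regular is None or regular == 0:
        return None
    return ((regular - offer) / regular * 100).quantize(Decimal("0.1"))


product = {"offerPrice": "$99.99", "regularPrice": "$129.99"}
print(discount_percent(product))
```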

Discussion Extraction Response

{
  "type": "discussion",
  "title": "Discussion Title",
  "author": "Author Name",
  "date": "2024-01-01T12:00:00.000Z",
  "text": "Main post content...",
  "posts": [
    {
      "author": "Commenter Name",
      "date": "2024-01-01T13:00:00.000Z",
      "text": "Comment text...",
      "parent": 0
    }
  ],
  "tags": ["tag1", "tag2"]
}

Image Extraction Response

The data.objects array contains image objects with metadata:

{
  "type": "image",
  "url": "https://example.com/image.jpg",
  "title": "Image Title or Caption",
  "naturalHeight": 1024,
  "naturalWidth": 768,
  "displayHeight": 512,
  "displayWidth": 384,
  "humanLanguage": "en",
  "anchorUrl": "https://example.com/linked-page",
  "pageUrl": "https://example.com/page",
  "xpath": "/HTML/BODY/IMG[@id='main-image']",
  "diffbotUri": "image|3|123456789",
  "tags": [
    {
      "id": 12345,
      "label": "Photograph",
      "uri": "http://diffbot.com/entity/..."
    }
  ],
  "meta": "EXIF, XMP, ICC Profile"
}
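
A common next step is to pick the best image from data.objects using the dimension fields shown above. This sketch filters by minimum natural size and returns the image with the largest pixel area; the helper name and thresholds are illustrative.

```python
from typing import Optional


def largest_image(images: list, min_width: int = 0, min_height: int = 0) -> Optional[dict]:
    """Return the image with the largest natural pixel area that meets the minimum size."""
    candidates = [
        img for img in images
        if img.get("naturalWidth", 0) >= min_width
        and img.get("naturalHeight", 0) >= min_height
    ]
    return max(
        candidates,
        key=lambda img: img.get("naturalWidth", 0) * img.get("naturalHeight", 0),
        default=None,  # no image met the size requirement
    )


images = [
    {"url": "https://example.com/thumb.jpg", "naturalWidth": 120, "naturalHeight": 90},
    {"url": "https://example.com/image.jpg", "naturalWidth": 768, "naturalHeight": 1024},
]
best = largest_image(images, min_width=300)
print(best["url"] if best else "no image large enough")
```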

Integration Examples

Node.js Example (diff.js)

The diff.js example demonstrates a complete Diffbot extraction client:

# Install dependencies (readline is built into Node.js and needs no install)
npm install undici dotenv @solana/web3.js @solana/spl-token

# Set environment variables
export BASE_URL=http://localhost:8081
export SOLANA_RPC=https://api.mainnet-beta.solana.com
export KEYPAIR_PATH=path/to/keypair.json

# Run the script
node diff.js

The script will:

  1. Prompt you for a URL
  2. Request an invoice from the Diffbot endpoint
  3. Pay with Solana USDC
  4. Save the extraction result to output.json
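
The same invoice, pay, extract, save flow can be sketched in Python. This is not a port of diff.js: the three callables (`request_invoice`, `pay_invoice`, `request_extraction`) are hypothetical placeholders for your own HTTP and wallet code, injected so the control flow can be read and tested without a network or a funded wallet.

```python
import json
from typing import Callable


def run_extraction(
    url: str,
    request_invoice: Callable[[str], dict],          # hypothetical: POST without X-PAYMENT
    pay_invoice: Callable[[dict], str],              # hypothetical: returns X-PAYMENT value
    request_extraction: Callable[[str, str], dict],  # hypothetical: POST with X-PAYMENT
    out_path: str = "output.json",
) -> dict:
    """Request an invoice, pay it, fetch the extraction, and save it as JSON."""
    invoice = request_invoice(url)
    payment_header = pay_invoice(invoice)
    result = request_extraction(url, payment_header)
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(result, fh, indent=2)
    return result
```

Prompting for the URL (step 1) is left to the caller, for example with `input("URL: ")`.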

Python Example

import asyncio
import httpx
import base64
import json

async def extract_article(url: str, payment_data: str):
    """Extract article using Diffbot API with x402 payment"""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://bridge402.tech/diffbot/article",
            params={
                "url": url,
                "network": "sol"  # or "base"
            },
            headers={"X-PAYMENT": payment_data}
        )

        if response.status_code == 200:
            data = response.json()
            return data
        elif response.status_code == 402:
            # Payment required - get invoice
            invoice = response.json()
            print(f"Payment required: {invoice['accepts'][0]['maxAmountRequired']} atomic units")
            return invoice
        else:
            raise Exception(f"Request failed: {response.status_code} - {response.text}")

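# Usage (await is only valid inside a coroutine, so wrap the call in main())
async def main():
    result = await extract_article("https://example.com/article", "<your-x402-payment>")
    # Only a 200 response carries "data"; a 402 invoice does not
    if result.get("data") and result["data"]["objects"]:
        article = result["data"]["objects"][0]
        print(f"Title: {article['title']}")
        print(f"Author: {article.get('author', 'Unknown')}")
        print(f"Text: {article['text'][:200]}...")

asyncio.run(main())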

JavaScript/Node.js Example

import { request } from 'undici';

async function extractArticle(url, paymentData) {
    const urlEncoded = encodeURIComponent(url);
    const res = await request(
        `https://bridge402.tech/diffbot/article?url=${urlEncoded}&network=sol`,
        {
            method: 'POST',
            headers: {
                'X-PAYMENT': paymentData
            }
        }
    );

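    // Anything other than 200 (result) or 402 (invoice) is a failure
    if (res.statusCode !== 200 && res.statusCode !== 402) {
        throw new Error(`Request failed: ${res.statusCode} - ${await res.body.text()}`);
    }

    const data = await res.body.json();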

    if (data.data && data.data.objects && data.data.objects.length > 0) {
        const article = data.data.objects[0];
        return {
            title: article.title,
            author: article.author,
            date: article.date,
            text: article.text,
            images: article.images || []
        };
    }

    return data;
}

// Usage
const result = await extractArticle(
    'https://techcrunch.com/2024/01/01/article',
    '<your-x402-payment>'
);
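// On a 402 response the function returns the invoice, which has no title
if (result.title) {
    console.log(`Title: ${result.title}`);
    console.log(`Author: ${result.author}`);
} else {
    console.log('No article returned:', result);
}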

Image Extraction Example

import { request } from 'undici';

async function extractImages(url, paymentData) {
    const urlEncoded = encodeURIComponent(url);
    const res = await request(
        `https://bridge402.tech/diffbot/image?url=${urlEncoded}&network=sol`,
        {
            method: 'POST',
            headers: {
                'X-PAYMENT': paymentData
            }
        }
    );

    const data = await res.body.json();

    if (data.data && data.data.objects && data.data.objects.length > 0) {
        const images = data.data.objects;
        return images.map(img => ({
            url: img.url,
            title: img.title,
            dimensions: `${img.naturalWidth}x${img.naturalHeight}`,
            displaySize: img.displayWidth && img.displayHeight 
                ? `${img.displayWidth}x${img.displayHeight}` 
                : null,
            tags: img.tags || []
        }));
    }

    return data;
}

// Usage
const images = await extractImages(
    'https://example.com/gallery',
    '<your-x402-payment>'
);
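// extractImages returns the raw response object (for example a 402 invoice)
// when no images were extracted, so check for an array before iterating
if (Array.isArray(images)) {
    images.forEach(img => {
        console.log(`Image: ${img.url}`);
        console.log(`Dimensions: ${img.dimensions}`);
    });
}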

Error Handling

Common Errors

400 Bad Request

{
  "detail": "Invalid URL. Must start with http:// or https://"
}

402 Payment Required

{
  "x402Version": 1,
  "error": "X-PAYMENT header is required",
  "accepts": [...]
}

500 Internal Server Error

  • Diffbot API may be unavailable
  • Invalid URL or unsupported page type
  • Retry the request

502 Bad Gateway

  • Upstream Diffbot API error
  • Check that the URL is accessible
  • Verify Diffbot API key is configured on the server
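
Since 500 and 502 are the transient cases here, a client can retry them with exponential backoff and return any other status to the caller at once. In this sketch, `send` is a hypothetical callable that performs one request and returns `(status_code, body)`. The source does not say whether a failed request settles its payment, so confirm that before re-sending a paid request.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {500, 502}  # statuses the error list above describes as upstream or transient


def call_with_retry(
    send: Callable[[], Tuple[int, dict]],  # hypothetical: performs one HTTP request
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... between tries
    return status, body  # still failing after the final attempt
```

Passing `sleep` as a parameter keeps the helper testable without real delays.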

Use Cases for AI Agents

Content Analysis Pipeline

async def analyze_content(url: str, payment_data: str):
    """AI agent workflow for content analysis"""
    # 1. Extract article (extract_article is defined in the Python example above)
    result = await extract_article(url, payment_data)

    # A 402 invoice or an empty extraction has nothing to analyze
    if not result.get("data") or not result["data"]["objects"]:
        return None

    article = result["data"]["objects"][0]

    # 2. Analyze with AI (analyze_with_llm is a placeholder for your own LLM call)
    analysis = await analyze_with_llm(article["text"], {
        "extract_key_points": True,
        "sentiment_analysis": True,
        "summarize": True,
        "extract_entities": True
    })

    return {
        "url": url,
        "title": article["title"],
        "author": article.get("author"),
        "date": article.get("date"),
        "analysis": analysis,
        "source": "Bridge402 Diffbot"
    }

Multi-Source Content Aggregation

// getInvoice and createPayment are placeholders for your own x402 client code
async function aggregateContent(urls) {
    const results = [];

    for (const url of urls) {
        // Get invoice for each URL
        const invoice = await getInvoice(url);

        // Pay and extract
        const payment = await createPayment(invoice);
        const extraction = await extractArticle(url, payment);

        // extractArticle (defined above) returns a flattened article on success,
        // so read title and text directly rather than from extraction.data
        if (extraction.title) {
            results.push({
                url,
                title: extraction.title,
                text: extraction.text,
                extracted_at: new Date().toISOString()
            });
        }
    }

    return results;
}

Product Price Monitoring

import time

async def monitor_product_price(product_url: str, payment_data: str):
    """Monitor product price changes"""
    # extract_product is assumed to mirror extract_article, posting to /diffbot/product
    result = await extract_product(product_url, payment_data)

    if result.get("data") and result["data"]["objects"]:
        product = result["data"]["objects"][0]

        return {
            "url": product_url,
            "title": product.get("title"),
            "current_price": product.get("offerPrice"),
            "regular_price": product.get("regularPrice"),
            "availability": product.get("availability"),
            "extracted_at": time.time()
        }

    # No product was extracted (or a 402 invoice was returned)
    return None

Best Practices

  1. Validate URLs: Ensure URLs are accessible and valid before requesting extraction
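  2. Cache Results: Extracted content for a given URL changes rarely; cache results locally to avoid paying for repeat extractions, and refresh only when freshness matters (for example, product prices)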
  3. Handle Errors: Always handle 402 responses to get payment requirements
  4. Network Selection: Choose network based on your wallet capabilities (Base or Solana)
  5. Rate Limiting: Be mindful of Diffbot API rate limits on the server side
  6. URL Encoding: Always URL-encode the target URL in query parameters
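
Practices 1 and 2 can be combined in a small client-side layer: reject malformed URLs before spending a request, and reuse earlier results for the same endpoint and URL. The sketch below is an in-memory illustration only. `fetch` is a hypothetical callable that performs the paid extraction, and the validity check covers syntax (the http:// or https:// rule from the 400 error), not whether the page is reachable.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urlparse


def is_valid_target_url(url: str) -> bool:
    """True if the URL has an http or https scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


class ExtractionCache:
    """In-memory cache keyed by (endpoint, url); entries never expire in this sketch."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], dict] = {}

    def get_or_fetch(self, endpoint: str, url: str, fetch: Callable[[str, str], dict]) -> dict:
        if not is_valid_target_url(url):
            raise ValueError("Invalid URL. Must start with http:// or https://")
        key = (endpoint, url)
        if key not in self._store:
            # Only the first request for a given key is paid for
            self._store[key] = fetch(endpoint, url)
        return self._store[key]
```

For data that does change, such as prices, a time-to-live on each entry would be a natural extension.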

Support

For questions about Diffbot extraction or integration help, refer to:

  • Payment Integration Guide
  • Examples - Complete Node.js extraction client
  • Contact the Bridge402 development team