ArgusFlow

Argus Product Data Extractor

Stop writing custom selectors for every shop. Pass any product HTML to this service and instantly receive structured prices, brands, and specifications.

Response (200 OK):

{
  "data": {
    "title": "Sony WH-1000XM5",
    "price": 349.00,
    "currency": "EUR",
    "availability": "In Stock"
  }
}

Process completed in 42ms

The Smart Way to Process Product Data

Standard parsing logic breaks the moment a website updates its layout. Argus Extractor is built to be resilient: instead of rigid rules, it uses a modular "scoreboard" system that identifies key information from content patterns, so you can process data from thousands of different sources without manual configuration.
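To make the "scoreboard" idea concrete, here is a minimal sketch of how independent parsers could each propose a value with a confidence score, with the best-supported value winning per field. The `Candidate` type, the parser names, and the rule of summing scores are illustrative assumptions, not the Argus implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    field: str         # e.g. "price" or "brand"
    value: str         # the value a parser proposes
    confidence: float  # the parser's own confidence, 0.0 to 1.0
    parser: str        # hypothetical parser name, for illustration only


def pick_winners(candidates):
    """Sum confidence per (field, value) and keep the top value for each field."""
    board = defaultdict(float)
    for c in candidates:
        # Parsers that agree on a value reinforce each other on the scoreboard.
        board[(c.field, c.value)] += c.confidence

    winners = {}
    for (field, value), score in board.items():
        if field not in winners or score > winners[field][1]:
            winners[field] = (value, score)
    return winners


candidates = [
    Candidate("price", "299.00", 0.6, "regex_parser"),
    Candidate("price", "299.00", 0.5, "json_ld_parser"),
    Candidate("price", "19.99", 0.7, "meta_tag_parser"),
    Candidate("brand", "Apple", 0.8, "label_parser"),
]
result = pick_winners(candidates)
print(result)
```

In this sketch two moderately confident parsers that agree on a price outrank one more confident parser that disagrees, which is one plausible reading of "cross-verify".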

Why developers choose Argus

Format Independent: Extract data from any domain without writing site-specific CSS or XPath selectors.
Handles Dynamic Content: Integrates with Playwright to capture data that only appears after JavaScript has fully rendered the page.
Reliable Scoring: Independent parsers cross-verify data, ensuring you receive the result with the highest confidence score.
Privacy Centric: Runs entirely on your own infrastructure. Your data never leaves your server, and you pay zero external API fees.

Quick Start & API Usage

To run the service locally, ensure you are in the `argus/services/extractor/` directory, then follow these steps:

Copy environment file: `cp .env.example .env`
Build the service: `make build-dev`
Start the service: `make up-dev`

The service is available at `http://localhost:8001` with full documentation at `/docs`.

curl -X POST "http://localhost:8001/api/v1/extract" \
-H "Content-Type: application/json" \
-H "x-api-key: default_dev_key" \
-d '{
  "url": "https://www.example.com/product/123",
  "html_content": "<html><body><h1>Smart Watch Series 5</h1><p>Price: $299.00</p><span>Brand: Apple</span></body></html>",
  "use_llm": false
}'

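<?php
// Send the product HTML to the extractor and decode the JSON response.
$ch = curl_init('http://localhost:8001/api/v1/extract');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json', 'x-api-key: default_dev_key']);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode([
    'url' => '...',
    'html_content' => '<html><body><h1>Smart Watch Series 5</h1><p>Price: $299.00</p><span>Brand: Apple</span></body></html>',
    'use_llm' => false
]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$raw = curl_exec($ch);
if ($raw === false) {
    // curl_exec() returns false on transport errors such as a refused connection.
    throw new RuntimeException('Request failed: ' . curl_error($ch));
}
$response = json_decode($raw, true);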

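import requests

# Send the product HTML to the extractor; fail fast on network or HTTP errors.
response = requests.post(
    "http://localhost:8001/api/v1/extract",
    headers={"x-api-key": "default_dev_key"},
    json={
        "url": "https://www.example.com/product/123",
        "html_content": "<html><body><h1>Smart Watch Series 5</h1><p>Price: $299.00</p><span>Brand: Apple</span></body></html>",
        "use_llm": False
    },
    timeout=10,  # requests has no default timeout, so set one explicitly
)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
print(response.json())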

fetch('http://localhost:8001/api/v1/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': 'default_dev_key'
  },
  body: JSON.stringify({
    url: 'https://www.example.com/product/123',
    html_content: '<html><body><h1>Smart Watch Series 5</h1><p>Price: $299.00</p><span>Brand: Apple</span></body></html>',
    use_llm: false
  })
})
.then(res => res.json())
.then(data => console.log(data));

Response:

{
  "data": {
    "title": "Smart Watch Series 5",
    "brand": "Apple",
    "price": 299.00,
    "currency": "USD"
  },
  "message": "Extraction successful"
}
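On the client side, a response of this shape can be mapped onto a small typed object. The sketch below assumes only the fields shown in the sample responses on this page (`title`, `brand`, `price`, `currency`, `availability`) and treats any of them as optional; the `Product` class and `parse_extraction` helper are hypothetical names, not part of the service.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Product:
    # Every field is optional because a page may not expose all of them.
    title: Optional[str] = None
    brand: Optional[str] = None
    price: Optional[float] = None
    currency: Optional[str] = None
    availability: Optional[str] = None


def parse_extraction(body: str) -> Product:
    """Turn the JSON response body into a Product, ignoring unknown keys."""
    payload = json.loads(body)
    data = payload.get("data") or {}
    known = {f.name: data.get(f.name) for f in fields(Product)}
    return Product(**known)


sample = """
{
  "data": {
    "title": "Smart Watch Series 5",
    "brand": "Apple",
    "price": 299.00,
    "currency": "USD"
  },
  "message": "Extraction successful"
}
"""
product = parse_extraction(sample)
print(product)
```

Fields the extractor did not return, such as `availability` here, stay `None`, so callers can tell "not found" apart from an empty string.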

Security

Every microservice is protected by an API key, which clients send in the `x-api-key` request header. The examples above use `default_dev_key`; for production environments, set the `AUTH__API_KEY` environment variable to a unique, secure string.
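One way to produce a suitable value is Python's standard `secrets` module. This is a minimal sketch: the helper names are illustrative, and only the `AUTH__API_KEY` variable name comes from the text above.

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random token built from num_bytes of OS randomness."""
    return secrets.token_urlsafe(num_bytes)


def env_line(key: str) -> str:
    """Format the key as a line that can be pasted into the .env file."""
    return f"AUTH__API_KEY={key}"


api_key = generate_api_key()
print(env_line(api_key))
```

Paste the printed line into `.env` in place of the development key, and keep the file out of version control.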

Multilingual Support & Customization

The extractor supports multilingual extraction by separating language-specific keywords and regex patterns from the core logic. These are defined in `config/patterns.yml`.

To add or override patterns for any language (e.g., adding German keywords), you can create a `config/custom_patterns.yml` file. This file is loaded after the default patterns and will safely merge with them, allowing you to customize the extractor without modifying core files.


# config/custom_patterns.yml
de:
  availability_in_stock:
    - 'auf lager'
    - 'sofort lieferbar'
  brand_label_regex: '\b(marke|hersteller)\b'
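
The merge behaviour described above could look roughly like the following sketch. It works on already-loaded dictionaries (for example, the result of a YAML loader) and assumes that nested mappings merge recursively, keyword lists are extended without duplicates, and scalar values such as regexes are overridden. These merge rules and the sample default patterns are assumptions for illustration; the extractor's actual rules may differ.

```python
from copy import deepcopy


def merge_patterns(defaults: dict, custom: dict) -> dict:
    """Return a new dict with custom patterns layered over the defaults."""
    merged = deepcopy(defaults)
    for key, value in custom.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            # Nested mappings (e.g. one per language) merge key by key.
            merged[key] = merge_patterns(current, value)
        elif isinstance(current, list) and isinstance(value, list):
            # Keyword lists are extended; existing entries are not duplicated.
            merged[key] = current + [v for v in value if v not in current]
        else:
            # Scalars such as regex strings, and brand-new keys, are set outright.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults standing in for config/patterns.yml.
defaults = {
    "en": {"availability_in_stock": ["in stock"]},
    "de": {"availability_in_stock": ["auf lager"]},
}
# Mirrors the config/custom_patterns.yml example above.
custom = {
    "de": {
        "availability_in_stock": ["auf lager", "sofort lieferbar"],
        "brand_label_regex": r"\b(marke|hersteller)\b",
    }
}
merged = merge_patterns(defaults, custom)
print(merged["de"])
```

Because the function copies its inputs, the default patterns stay untouched, which matches the goal of customizing without modifying core files.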