ArgusFlow

Argus Extractor

A microservice that extracts structured product information, such as price, brand, and specifications, from the HTML of a product page.

What is the Extractor?

The Argus Extractor runs multiple independent parsers for each data field and uses a "scoreboard" mechanism to select the most reliable result. It can parse both static HTML and content loaded dynamically with JavaScript.

Key Features

  • Modular Parsers:

    A scalable architecture where multiple, independent parsers can exist for each data field (e.g., price, brand).

  • Intelligent Data Selection:

    A "scoreboard" system weighs results from all parsers and selects the data with the highest confidence score.

  • Browser Automation:

    Integrates with Playwright to retrieve dynamically loaded (JavaScript) content.

  • Advanced Analysis:

    Uses BeautifulSoup for HTML parsing and a spaCy NLP model for smarter text analysis.

  • Secure by Default:

    All API endpoints (except /health) are protected by a mandatory x-api-key header.

  • Multilingual Support:

    Supports multiple languages by separating language-specific keywords and regex patterns from the core logic.
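
The scoreboard idea described under Key Features can be illustrated with a minimal sketch. It is not the extractor's actual implementation: the `Candidate` class, the two parser functions, and the confidence values are hypothetical names and numbers chosen only to show several parsers proposing a value for one field and the highest-confidence proposal winning.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One parser's proposal for a field (hypothetical structure)."""
    field: str
    value: object
    confidence: float  # assumed to be in the range 0..1
    parser: str


def parse_price_regex(html: str) -> Optional[Candidate]:
    # Hypothetical low-confidence parser: first dollar amount in the markup.
    match = re.search(r"\$(\d+(?:\.\d{2})?)", html)
    if match is None:
        return None
    return Candidate("price", float(match.group(1)), 0.5, "regex")


def parse_price_labeled(html: str) -> Optional[Candidate]:
    # Hypothetical higher-confidence parser: amount after a "Price:" label.
    match = re.search(r"Price:\s*\$(\d+(?:\.\d{2})?)", html)
    if match is None:
        return None
    return Candidate("price", float(match.group(1)), 0.9, "labeled")


def select_best(candidates):
    """Keep the highest-confidence candidate for each field."""
    best = {}
    for candidate in candidates:
        if candidate is None:
            continue  # this parser found nothing
        current = best.get(candidate.field)
        if current is None or candidate.confidence > current.confidence:
            best[candidate.field] = candidate
    return best


html = "<p>Was $150.00</p><p>Price: $123.45</p>"
results = [parser(html) for parser in (parse_price_regex, parse_price_labeled)]
winner = select_best(results)["price"]
print(winner.value, winner.parser)
```

In this sketch the plain regex parser picks up the crossed-out price first, while the label-aware parser reports the current price with a higher confidence and therefore wins.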

Quick Start & API Usage

To run the service locally, ensure you are in the `argus/services/extractor/` directory, then follow these steps:

  1. Copy the example environment file: `cp .env.example .env`
  2. Build the development image: `make build-dev`
  3. Start the service: `make up-dev`

The service will be available at `http://localhost:8001` with interactive API docs at `http://localhost:8001/docs`.

API Request Example (Direct Mode)

The `/api/v1/extract` endpoint requires an API key. The default key for development is `default_dev_key`.


cURL:

curl -X POST "http://localhost:8001/api/v1/extract" \
-H "Content-Type: application/json" \
-H "x-api-key: default_dev_key" \
-d '{
  "url": "https://www.example.com/product/123",
  "html_content": "<html><body><h1>An amazing product</h1><p>Price: $123.45</p><span>Brand: ExampleBrand</span></body></html>",
  "use_llm": false
}'

PHP:

<?php
$url = 'http://localhost:8001/api/v1/extract';
$payload = json_encode([
    'url' => 'https://www.example.com/product/123',
    'html_content' => '<html><body><h1>An amazing product</h1><p>Price: $123.45</p><span>Brand: ExampleBrand</span></body></html>',
    'use_llm' => false
]);
$headers = [
    'Content-Type: application/json',
    'x-api-key: default_dev_key' // Development key; override in production.
];

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
if ($response === false) {
    // Report transport-level failures instead of decoding an empty body.
    $error = curl_error($ch);
    curl_close($ch);
    throw new RuntimeException('Request failed: ' . $error);
}
curl_close($ch);

$data = json_decode($response, true);

Python:

import requests

api_url = "http://localhost:8001/api/v1/extract"
headers = {
    "Content-Type": "application/json",
    "x-api-key": "default_dev_key"  # Development key; override in production.
}
payload = {
    "url": "https://www.example.com/product/123",
    "html_content": "<html><body><h1>An amazing product</h1><p>Price: $123.45</p><span>Brand: ExampleBrand</span></body></html>",
    "use_llm": False
}

response = requests.post(api_url, json=payload, headers=headers, timeout=30)
response.raise_for_status()  # Raise on 4xx/5xx, e.g. a missing or invalid API key.
data = response.json()

print(data)

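JavaScript:

const apiUrl = 'http://localhost:8001/api/v1/extract';
const payload = {
  url: 'https://www.example.com/product/123',
  html_content: '<html><body><h1>An amazing product</h1><p>Price: $123.45</p><span>Brand: ExampleBrand</span></body></html>',
  use_llm: false
};
const headers = {
  'Content-Type': 'application/json',
  'x-api-key': 'default_dev_key' // Development key; override in production.
};

fetch(apiUrl, {
  method: 'POST',
  headers: headers,
  body: JSON.stringify(payload)
})
.then(response => {
  // fetch resolves on HTTP errors too, so check the status explicitly.
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
})
.then(data => {
  console.log(data);
})
.catch(error => console.error(error));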

Example Output

{
  "data": {
    "title": "An amazing product",
    "brand": "ExampleBrand",
    "price": 123.45,
    "original_price": null,
    "currency": "USD",
    "specifications": []
  },
  "message": "Extraction successful"
}
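
A client will usually want to read this response into a typed structure rather than pass raw dictionaries around. The sketch below is one way to do that, assuming the response always has the shape shown in the example output; the `ProductData` class and `parse_response` helper are hypothetical names and are not part of the service.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProductData:
    """Client-side mirror of the "data" object in the example output."""
    title: Optional[str] = None
    brand: Optional[str] = None
    price: Optional[float] = None
    original_price: Optional[float] = None
    currency: Optional[str] = None
    specifications: list = field(default_factory=list)


def parse_response(body: str) -> ProductData:
    """Decode a response body, ignoring keys this sketch does not know about."""
    payload = json.loads(body)
    data = payload.get("data") or {}
    known = {k: v for k, v in data.items() if k in ProductData.__dataclass_fields__}
    return ProductData(**known)


# The example output from this section, used as sample input.
example = """
{
  "data": {
    "title": "An amazing product",
    "brand": "ExampleBrand",
    "price": 123.45,
    "original_price": null,
    "currency": "USD",
    "specifications": []
  },
  "message": "Extraction successful"
}
"""

product = parse_response(example)
print(f"{product.brand}: {product.price} {product.currency}")
```

Fields that the service could not extract arrive as `null` and become `None`, so callers should check for `None` before doing arithmetic on `price` or `original_price`.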

Security

All API endpoints except the `/health` check require a valid API key in the `x-api-key` header.

The default development key is `default_dev_key`. For production, you must override this by setting the `AUTH__API_KEY` environment variable to a strong, randomly generated key.
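
One way to produce a strong, randomly generated key is Python's standard `secrets` module. This is a minimal sketch; the `generate_api_key` helper name and the 32-byte length are assumptions, not requirements stated by the service.

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random token built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


key = generate_api_key()
# Print a line that can be pasted into the environment configuration.
print(f"AUTH__API_KEY={key}")
```

Each run prints a different key. Store the value wherever your deployment keeps secrets rather than committing it to version control.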

Multilingual Support & Customization

The extractor supports multilingual extraction by separating language-specific keywords and regex patterns from the core logic. These are defined in `config/patterns.yml`.

To add or override patterns for any language (for example, to add German keywords), create a `config/custom_patterns.yml` file. This file is loaded after the default patterns and merged with them, so you can customize the extractor without modifying core files.


# config/custom_patterns.yml
de:
  availability_in_stock:
    - 'auf lager'
    - 'sofort lieferbar'
  brand_label_regex: '\b(marke|hersteller)\b'
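
The exact merge rules are not spelled out here, so the following is only a sketch of how a custom file could be layered over the defaults. It assumes that nested mappings are merged key by key, that lists are extended without duplicates, and that scalar values are overridden. The `merge_patterns` function and the contents of `defaults` are hypothetical; plain dictionaries stand in for the parsed YAML files.

```python
import re
from copy import deepcopy


def merge_patterns(defaults: dict, custom: dict) -> dict:
    """Return a new dict with custom layered over defaults (assumed semantics)."""
    merged = deepcopy(defaults)
    for key, value in custom.items():
        existing = merged.get(key)
        if isinstance(value, dict) and isinstance(existing, dict):
            # Nested mappings (for example one language) merge recursively.
            merged[key] = merge_patterns(existing, value)
        elif isinstance(value, list) and isinstance(existing, list):
            # Lists are extended, skipping entries that are already present.
            merged[key] = existing + [v for v in value if v not in existing]
        else:
            # Scalars and new keys simply take the custom value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults standing in for config/patterns.yml.
defaults = {
    "en": {"availability_in_stock": ["in stock"]},
    "de": {"availability_in_stock": ["auf lager"]},
}
# Mirrors the config/custom_patterns.yml example above.
custom = {
    "de": {
        "availability_in_stock": ["auf lager", "sofort lieferbar"],
        "brand_label_regex": r"\b(marke|hersteller)\b",
    }
}

patterns = merge_patterns(defaults, custom)
brand_label = re.compile(patterns["de"]["brand_label_regex"], re.IGNORECASE)

print(patterns["de"]["availability_in_stock"])
print(bool(brand_label.search("Marke: ExampleBrand")))
```

Under these assumptions the English defaults are left untouched, the German stock keywords gain one entry, and the new brand-label regex matches a label such as "Marke:" case-insensitively.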