Argus Extractor
A microservice that extracts structured product information (such as price, brand, and specifications) from the HTML of a product page.
What is the Extractor?
The Argus Extractor is a microservice built on a modular architecture: several independent parsers can run for each data field, and a "scoreboard" mechanism selects the most reliable result among them. It can parse both static HTML and content loaded dynamically with JavaScript.
Key Features
- Modular Parsers: A scalable architecture where multiple, independent parsers can exist for each data field (e.g., price, brand).
- Intelligent Data Selection: A "scoreboard" system weighs results from all parsers and selects the data with the highest confidence score.
- Browser Automation: Integrates with Playwright to retrieve dynamically loaded (JavaScript) content.
- Advanced Analysis: Uses BeautifulSoup for HTML parsing and a spaCy NLP model for smarter text analysis.
- Secure by Default: All API endpoints (except `/health`) are protected by a mandatory `x-api-key` header.
- Multilingual Support: Supports multiple languages by separating language-specific keywords and regex patterns from the core logic.
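
The scoreboard idea described above can be illustrated with a minimal sketch. This is not the extractor's actual code: the `Candidate` class, the two parser functions, their confidence values, and `select_best` are all hypothetical names chosen for illustration. It only shows the general pattern of letting several independent parsers propose a value and keeping the one with the highest confidence score.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    value: str
    confidence: float  # 0.0 to 1.0, assigned by the parser that produced it
    source: str        # parser name, kept so the winning result can be traced


# Hypothetical parser: looks for an explicit "Price:" label (high confidence).
def price_from_label(html: str) -> Optional[Candidate]:
    match = re.search(r"Price:\s*\$(\d+(?:\.\d+)?)", html)
    if match:
        return Candidate(match.group(1), 0.9, "price_from_label")
    return None


# Hypothetical fallback parser: takes the first number it finds (low confidence).
def price_from_any_number(html: str) -> Optional[Candidate]:
    match = re.search(r"\d+(?:\.\d+)?", html)
    if match:
        return Candidate(match.group(0), 0.3, "price_from_any_number")
    return None


def select_best(
    html: str, parsers: list[Callable[[str], Optional[Candidate]]]
) -> Optional[Candidate]:
    """Run every parser and return the candidate with the highest confidence."""
    candidates = [c for c in (p(html) for p in parsers) if c is not None]
    return max(candidates, key=lambda c: c.confidence, default=None)


html = "<h1>Model 42 kettle</h1><p>Price: $123.45</p>"
best = select_best(html, [price_from_label, price_from_any_number])
print(best)
```

Because each parser is a plain function with the same signature, adding a new strategy for a field means appending one more function to the list, which is the property the "Modular Parsers" feature refers to.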
Quick Start & API Usage
To run the service locally, ensure you are in the `argus/services/extractor/` directory, then follow these steps:
- Copy the example environment file: `cp .env.example .env`
- Build the development image: `make build-dev`
- Start the service: `make up-dev`
The service will be available at `http://localhost:8001` with interactive API docs at `http://localhost:8001/docs`.
API Request Example (Direct Mode)
The `/api/v1/extract` endpoint requires an API key. The default key for development is `default_dev_key`.
```bash
curl -X POST "http://localhost:8001/api/v1/extract" \
  -H "Content-Type: application/json" \
  -H "x-api-key: default_dev_key" \
  -d '{
    "url": "https://www.example.com/product/123",
    "html_content": "<html><body><h1>An amazing product</h1><p>Price: $123.45</p><span>Brand: ExampleBrand</span></body></html>",
    "use_llm": false
  }'
```
Example Output
```json
{
  "data": {
    "title": "An amazing product",
    "brand": "ExampleBrand",
    "price": 123.45,
    "original_price": null,
    "currency": "USD",
    "specifications": []
  },
  "message": "Extraction successful"
}
```
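
For callers that prefer Python over curl, the same request can be assembled with the standard library. This is a sketch under assumptions: the helper name `build_extract_request` is hypothetical, and the endpoint, header, and payload fields are taken directly from the curl example above. The network call itself is left commented out because it needs the service to be running.

```python
import json
import urllib.request


def build_extract_request(
    base_url: str,
    api_key: str,
    page_url: str,
    html_content: str,
    use_llm: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /api/v1/extract endpoint."""
    payload = {"url": page_url, "html_content": html_content, "use_llm": use_llm}
    return urllib.request.Request(
        f"{base_url}/api/v1/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


request = build_extract_request(
    base_url="http://localhost:8001",
    api_key="default_dev_key",  # development default; override in production
    page_url="https://www.example.com/product/123",
    html_content="<html><body><h1>An amazing product</h1><p>Price: $123.45</p></body></html>",
)

# Sending the request requires the service to be up (see Quick Start):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
#     print(result["data"]["price"])
```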
Security
All API endpoints (except for the `/health` check) require a valid API key to be passed in the `x-api-key` header.
The default development key is `default_dev_key`. For production, you must override this by setting the `AUTH__API_KEY` environment variable to a strong, randomly generated key.
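
One way to produce a strong, random key is Python's `secrets` module. The snippet below is a sketch rather than part of the service: `generate_api_key` and `is_valid_key` are hypothetical helper names, and the source does not say how the extractor compares keys internally. The constant-time comparison is shown only as a common practice for checking secrets.

```python
import hmac
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random key built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


def is_valid_key(presented: str, expected: str) -> bool:
    """Compare two keys in constant time to avoid leaking timing information."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


key = generate_api_key()
# Print a line that can be pasted into the .env file.
print(f"AUTH__API_KEY={key}")
```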
Multilingual Support & Customization
The extractor supports multilingual extraction by separating language-specific keywords and regex patterns from the core logic. These are defined in `config/patterns.yml`.
To add or override patterns for any language (e.g., adding German keywords), you can create a `config/custom_patterns.yml` file. This file is loaded after the default patterns and will safely merge with them, allowing you to customize the extractor without modifying core files.
```yaml
# config/custom_patterns.yml
de:
  availability_in_stock:
    - 'auf lager'
    - 'sofort lieferbar'
  brand_label_regex: '\b(marke|hersteller)\b'
```
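
To make the idea of merging custom patterns over the defaults concrete, here is a minimal sketch of a recursive dictionary merge. It is an assumption about how such a merge could work, not the extractor's implementation: the function name `merge_patterns` and the English default entries are hypothetical, and the source does not state whether lists are replaced or appended. In this sketch a custom value replaces the default one for the same key, and nested mappings are merged key by key.

```python
from copy import deepcopy


def merge_patterns(defaults: dict, custom: dict) -> dict:
    """Return a new dict where values from custom override or extend defaults."""
    merged = deepcopy(defaults)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_patterns(merged[key], value)
        else:
            # Scalars and lists from the custom file replace the default value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults, standing in for the parsed config/patterns.yml.
default_patterns = {
    "en": {
        "availability_in_stock": ["in stock", "available now"],
        "brand_label_regex": r"\b(brand|manufacturer)\b",
    }
}

# The German additions from the custom_patterns.yml example, as parsed data.
custom_patterns = {
    "de": {
        "availability_in_stock": ["auf lager", "sofort lieferbar"],
        "brand_label_regex": r"\b(marke|hersteller)\b",
    }
}

patterns = merge_patterns(default_patterns, custom_patterns)
print(sorted(patterns))
```

Returning a new dictionary instead of mutating the defaults keeps the core patterns intact, which matches the goal of customizing the extractor without modifying core files.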