diff --git a/.env.example b/.env.example new file mode 100644 index 0000000..443d207 --- /dev/null +++ b/.env.example @@ -0,0 +1,3 @@ +# Mistral AI API key +# Get yours at https://console.mistral.ai/ +MISTRAL_API_KEY=your_api_key_here diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..4c49bd7 --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +.env diff --git a/README.md b/README.md index b7c2d2b..2af6f13 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,187 @@ # ckOCR +PHP CLI tool that reads photographs of part identification labels and extracts structured data using **Mistral AI OCR**. + +Reads images from `input/`, calls the Mistral API, and writes YAML files to `output/` containing the **Serial Number**, **Model Number**, and **Date**. + +--- + +## Requirements + +- PHP **8.1 – 8.5** with the `curl` extension enabled (no Composer required) +- A [Mistral AI](https://console.mistral.ai/) account with API access + +**Arch Linux / CachyOS** — enable the curl extension after installing PHP: + +```bash +sudo pacman -S php +# uncomment "extension=curl" in /etc/php/php.ini +php -m | grep curl # verify +``` + +--- + +## Installation + +```bash +git clone ckOCR +cd ckOCR +cp .env.example .env +``` + +Edit `.env` and insert your Mistral API key: + +```env +MISTRAL_API_KEY=your_api_key_here +``` + +Alternatively, export it as an environment variable: + +```bash +export MISTRAL_API_KEY=your_api_key_here +``` + +--- + +## Usage + +Place one or more label photos in the `input/` folder, then run: + +```bash +php ocr.php +``` + +Results are written to `output/` as YAML files — one per image, same filename stem. 
+ +### Options + +| Flag | Description | +|---|---| +| `--force` | Re-process images that already have an output file | +| `--verbose` | Print the raw OCR text and API request details | +| `--help` | Show usage information | + +### Examples + +```bash +# Process all new images +php ocr.php + +# Re-run everything, show full detail +php ocr.php --force --verbose + +# Just see options +php ocr.php --help +``` + +--- + +## Input + +Supported image formats: **JPG, JPEG, PNG, WebP, GIF** + +Maximum file size: **5 MB** per image (Mistral API limit) + +``` +input/ +├── part-label-01.jpg +├── motor-sn.png +└── board-sticker.jpg +``` + +--- + +## Output + +Each processed image produces a YAML file in `output/`: + +``` +output/ +├── part-label-01.yaml +├── motor-sn.yaml +└── board-sticker.yaml +``` + +### YAML structure + +```yaml +--- +serial_number: SN-20241234 +model_number: XYZ-4K/B +date: 2024-01 +source_file: part-label-01.jpg +processed_at: "2026-03-04 15:30:00" +raw_ocr: | + Full text extracted from the label by the OCR model, + preserved exactly as returned. +``` + +| Field | Description | +|---|---| +| `serial_number` | Serial Number — labelled S/N, SN, Serial No., etc. | +| `model_number` | Model or Part Number — labelled Model, M/N, P/N, MPN, etc. | +| `date` | Any date on the label — MFG date, DOM, expiry, etc. | +| `source_file` | Original image filename | +| `processed_at` | Timestamp of processing | +| `raw_ocr` | Full OCR text returned by Mistral before extraction | + +Fields not found on the label are written as `null`. + +--- + +## How it works + +Processing runs in two API calls per image: + +``` +Image file + │ + ▼ +[1] POST /ocr (mistral-ocr-latest) + │ base64-encoded image → markdown text + │ + ▼ +[2] POST /chat/completions (mistral-small-latest) + │ OCR text + extraction prompt → JSON with the three fields + │ + ▼ +YAML file written to output/ +``` + +1. 
**OCR step** — the image is base64-encoded and sent to `mistral-ocr-latest`, which returns the full label text as markdown. +2. **Extraction step** — the OCR text is passed to `mistral-small-latest` with a structured prompt. The model returns a JSON object (`response_format: json_object`) containing `serial_number`, `model_number`, and `date`. + +Already-processed images are skipped automatically unless `--force` is used. + +--- + +## Project structure + +``` +ckOCR/ +├── ocr.php # Main script +├── .env # API key (not committed, see .env.example) +├── .env.example # Template +├── .gitignore +├── input/ # Label photos (test data included) +└── output/ # YAML results (test data included) +``` + +--- + +## Troubleshooting + +**`MISTRAL_API_KEY not set`** +Set the key in `.env` or export it as an environment variable. + +**`Mistral API 401`** +Your API key is invalid or expired. Check it at [console.mistral.ai](https://console.mistral.ai/). + +**`File too large`** +Resize the image below 5 MB before placing it in `input/`. + +**`No text found`** +The label may be blurry, low contrast, or too small. Try a clearer photo. The output YAML is still written with `null` fields so the file won't be re-processed accidentally — use `--force --verbose` to retry and inspect the raw OCR output. + +**Fields are `null` but text was extracted** +Run with `--verbose` to see the raw OCR text and check whether the label uses non-standard abbreviations. The extraction prompt covers the most common label formats. 
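The README says the API key can come either from `.env` or from an exported environment variable, but the part of `ocr.php` that performs this lookup is not visible in this diff. The following is a minimal sketch of how such a lookup could work, not the script's actual code: `resolve_setting` is a hypothetical helper, and the precedence (exported variable first, `.env` second) is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not taken from ocr.php): resolve a setting from the
 * process environment first, then fall back to a KEY=value line in a .env file.
 */
function resolve_setting(string $name, string $envFile): ?string
{
    // 1. An exported environment variable wins (assumed precedence).
    $fromEnv = getenv($name);
    if (is_string($fromEnv) && $fromEnv !== '') {
        return $fromEnv;
    }

    // 2. Otherwise scan the .env file, if it exists and is readable.
    if (!is_readable($envFile)) {
        return null;
    }

    $lines = file($envFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
    foreach ($lines as $line) {
        $line = trim($line);
        // Skip comments and anything that is not KEY=value.
        if ($line === '' || $line[0] === '#' || !str_contains($line, '=')) {
            continue;
        }
        [$key, $value] = explode('=', $line, 2);
        if (trim($key) === $name) {
            // Strip surrounding whitespace and optional quotes.
            $value = trim(trim($value), "\"'");
            return $value !== '' ? $value : null;
        }
    }

    return null;
}

$apiKey = resolve_setting('MISTRAL_API_KEY', __DIR__ . '/.env');
echo $apiKey === null ? "MISTRAL_API_KEY not set\n" : "API key loaded\n";
```

Checking the environment first lets a one-off `export MISTRAL_API_KEY=...` override the file without editing it; if the real script uses the opposite order, swap the two steps.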
diff --git a/input/WIN_20260304_15_05_25_Pro.jpg b/input/WIN_20260304_15_05_25_Pro.jpg new file mode 100644 index 0000000..774f74f Binary files /dev/null and b/input/WIN_20260304_15_05_25_Pro.jpg differ diff --git a/input/WIN_20260304_15_05_41_Pro.jpg b/input/WIN_20260304_15_05_41_Pro.jpg new file mode 100644 index 0000000..0d4266a Binary files /dev/null and b/input/WIN_20260304_15_05_41_Pro.jpg differ diff --git a/input/WIN_20260304_15_06_20_Pro.jpg b/input/WIN_20260304_15_06_20_Pro.jpg new file mode 100644 index 0000000..0c57c2b Binary files /dev/null and b/input/WIN_20260304_15_06_20_Pro.jpg differ diff --git a/input/WIN_20260304_15_09_52_Pro.jpg b/input/WIN_20260304_15_09_52_Pro.jpg new file mode 100644 index 0000000..d946fd8 Binary files /dev/null and b/input/WIN_20260304_15_09_52_Pro.jpg differ diff --git a/input/d885193d-5e69-4823-aa08-cacb618b3dd1.jpg b/input/d885193d-5e69-4823-aa08-cacb618b3dd1.jpg new file mode 100644 index 0000000..d0dbf7f Binary files /dev/null and b/input/d885193d-5e69-4823-aa08-cacb618b3dd1.jpg differ diff --git a/ocr.php b/ocr.php new file mode 100644 index 0000000..b8bf05a --- /dev/null +++ b/ocr.php @@ -0,0 +1,398 @@ + 'image/jpeg', + 'png' => 'image/png', + 'webp' => 'image/webp', + 'gif' => 'image/gif', + default => 'image/jpeg', + }; +} + +/** + * Minimal YAML serialiser — handles the flat structure we produce. + * Supports: null, bool, int, float, single-line strings, multi-line strings (literal block). + */ +function to_yaml(array $data, int $depth = 0): string +{ + $out = ''; + $pad = str_repeat(' ', $depth); + + foreach ($data as $key => $value) { + if ($value === null) { + $out .= "{$pad}{$key}: null\n"; + continue; + } + if (is_bool($value)) { + $out .= "{$pad}{$key}: " . ($value ? 'true' : 'false') . "\n"; + continue; + } + if (is_int($value) || is_float($value)) { + $out .= "{$pad}{$key}: {$value}\n"; + continue; + } + if (is_array($value)) { + $out .= "{$pad}{$key}:\n" . 
to_yaml($value, $depth + 1); + continue; + } + + $str = (string) $value; + + // Multi-line → YAML literal block scalar + if (str_contains($str, "\n")) { + $childPad = str_repeat(' ', $depth + 1); + $indented = $childPad . implode("\n{$childPad}", explode("\n", rtrim($str))); + $out .= "{$pad}{$key}: |\n{$indented}\n"; + continue; + } + + // Single-line — quote if the value contains YAML special characters + if ($str === '' || preg_match('/[:#\[\]{}|>&!\'"%@`,]|^\s|\s$/', $str)) { + $escaped = str_replace(['\\', '"'], ['\\\\', '\\"'], $str); + $out .= "{$pad}{$key}: \"{$escaped}\"\n"; + continue; + } + + $out .= "{$pad}{$key}: {$str}\n"; + } + + return $out; +} + +// ── Mistral API ─────────────────────────────────────────────────────────────── + +/** + * Generic JSON POST to the Mistral REST API. + * + * @throws RuntimeException on network error or non-200 response + */ +function mistral_post(string $endpoint, array $payload, string $apiKey, bool $verbose): array +{ + $url = MISTRAL_BASE_URL . $endpoint; + $body = json_encode($payload, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR); + + verbose("POST {$url}", $verbose); + + $ch = curl_init($url); + curl_setopt_array($ch, [ + CURLOPT_RETURNTRANSFER => true, + CURLOPT_POST => true, + CURLOPT_POSTFIELDS => $body, + CURLOPT_HTTPHEADER => [ + 'Authorization: Bearer ' . $apiKey, + 'Content-Type: application/json', + 'Accept: application/json', + ], + CURLOPT_TIMEOUT => 120, + CURLOPT_CONNECTTIMEOUT => 15, + ]); + + $response = curl_exec($ch); + $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE); + $curlError = curl_error($ch); + curl_close($ch); + + if ($curlError !== '') { + throw new RuntimeException("cURL error: {$curlError}"); + } + + if ($httpCode !== 200) { + $decoded = json_decode((string) $response, true); + $msg = $decoded['message'] + ?? $decoded['error']['message'] + ?? 
(string) $response; + throw new RuntimeException("Mistral API {$httpCode}: {$msg}"); + } + + $decoded = json_decode((string) $response, true); + if (!is_array($decoded)) { + throw new RuntimeException("Non-JSON response from Mistral API"); + } + + return $decoded; +} + +/** + * Step 1 — Send the image to mistral-ocr-latest and get markdown text back. + */ +function ocr_image(string $imagePath, string $apiKey, bool $verbose): string +{ + $mime = mime_for($imagePath); + $imageData = base64_encode((string) file_get_contents($imagePath)); + + verbose("OCR model: " . OCR_MODEL, $verbose); + + $result = mistral_post('/ocr', [ + 'model' => OCR_MODEL, + 'document' => [ + 'type' => 'image_url', + 'image_url' => "data:{$mime};base64,{$imageData}", + ], + ], $apiKey, $verbose); + + $text = ''; + foreach ($result['pages'] ?? [] as $page) { + $text .= ($page['markdown'] ?? '') . "\n"; + } + + return trim($text); +} + +/** + * Step 2 — Extract Serial Number, Model Number, Date from raw OCR text + * using a chat model with JSON response mode. + */ +function extract_fields(string $ocrText, string $apiKey, bool $verbose): array +{ + verbose("Extraction model: " . CHAT_MODEL, $verbose); + + $system = 'You are a precision industrial part-label parser. ' + . 'Extract structured fields from OCR text. ' + . 'Return ONLY valid JSON — no explanation, no markdown fences.'; + + $user = << CHAT_MODEL, + 'messages' => [ + ['role' => 'system', 'content' => $system], + ['role' => 'user', 'content' => $user], + ], + 'response_format' => ['type' => 'json_object'], + 'temperature' => 0.0, + ], $apiKey, $verbose); + + $content = $result['choices'][0]['message']['content'] ?? '{}'; + $fields = json_decode($content, true); + + if (!is_array($fields)) { + stderr("Could not parse extraction response: {$content}"); + $fields = []; + } + + return [ + 'serial_number' => isset($fields['serial_number']) ? (string) $fields['serial_number'] : null, + 'model_number' => isset($fields['model_number']) ? 
(string) $fields['model_number'] : null, + 'date' => isset($fields['date']) ? (string) $fields['date'] : null, + ]; +} + +// ── Image processing ────────────────────────────────────────────────────────── + +function process_image(string $imagePath, string $outputPath, string $apiKey, bool $verbose): bool +{ + $filename = basename($imagePath); + $size = filesize($imagePath); + + if ($size === false || $size > MAX_IMAGE_BYTES) { + stderr("File too large or unreadable ({$size} bytes): {$filename}"); + return false; + } + + // Step 1: OCR + $ocrText = ocr_image($imagePath, $apiKey, $verbose); + + if ($ocrText === '') { + stderr("No text found in: {$filename}"); + // Still write output so we don't retry repeatedly + } + + verbose("--- OCR text ---\n{$ocrText}\n---", $verbose); + + // Step 2: Structured extraction (skip if nothing to parse) + $fields = ['serial_number' => null, 'model_number' => null, 'date' => null]; + if ($ocrText !== '') { + $fields = extract_fields($ocrText, $apiKey, $verbose); + } + + // Build and write YAML + $output = [ + 'serial_number' => $fields['serial_number'], + 'model_number' => $fields['model_number'], + 'date' => $fields['date'], + 'source_file' => $filename, + 'processed_at' => date('Y-m-d H:i:s'), + 'raw_ocr' => $ocrText !== '' ? $ocrText : null, + ]; + + $yaml = "---\n" . to_yaml($output); + file_put_contents($outputPath, $yaml); + + return true; +} + +// ── Main ────────────────────────────────────────────────────────────────────── + +if (!is_dir(OUTPUT_DIR)) { + mkdir(OUTPUT_DIR, 0755, true); +} + +// Collect images +$pattern = INPUT_DIR . '/*.{' . implode(',', SUPPORTED_EXTENSIONS) . '}'; +$images = glob($pattern, GLOB_BRACE) ?: []; + +if ($images === []) { + stdout("No supported images found in " . INPUT_DIR); + exit(0); +} + +stdout(sprintf("Found %d image(s). 
Starting OCR…\n", count($images))); + +$processed = 0; +$skipped = 0; +$failed = 0; + +foreach ($images as $imagePath) { + $filename = basename($imagePath); + $stem = pathinfo($filename, PATHINFO_FILENAME); + $outputPath = OUTPUT_DIR . '/' . $stem . '.yaml'; + + if (!$force && file_exists($outputPath)) { + stdout("SKIP {$filename} (output exists, use --force to re-run)"); + $skipped++; + continue; + } + + stdout("PROCESS {$filename}"); + + try { + $ok = process_image($imagePath, $outputPath, $apiKey, $verbose); + if ($ok) { + stdout(" → output/{$stem}.yaml"); + $processed++; + } else { + $failed++; + } + } catch (RuntimeException $e) { + stderr($e->getMessage()); + $failed++; + } +} + +stdout(sprintf( + "\nDone — processed: %d skipped: %d failed: %d", + $processed, + $skipped, + $failed +)); diff --git a/output/WIN_20260304_15_05_25_Pro.yaml b/output/WIN_20260304_15_05_25_Pro.yaml new file mode 100644 index 0000000..b3d4599 --- /dev/null +++ b/output/WIN_20260304_15_05_25_Pro.yaml @@ -0,0 +1,12 @@ +--- +serial_number: Z1X6029781024 +model_number: B69199Q +date: null +source_file: WIN_20260304_15_05_25_Pro.jpg +processed_at: "2026-03-04 17:26:30" +raw_ocr: | + | POCLAIN + Hydraulics | W/N: 0126 | + | --- | --- | + | P/N: B69199Q | W/N: 0126 | + | S/N: Z1X6029781024 | FN | diff --git a/output/WIN_20260304_15_05_41_Pro.yaml b/output/WIN_20260304_15_05_41_Pro.yaml new file mode 100644 index 0000000..36e064d --- /dev/null +++ b/output/WIN_20260304_15_05_41_Pro.yaml @@ -0,0 +1,19 @@ +--- +serial_number: Z1X6029782007 +model_number: B69199Q +date: null +source_file: WIN_20260304_15_05_41_Pro.jpg +processed_at: "2026-03-04 17:26:32" +raw_ocr: | + # POCLAIN + Hydraulics + + P/N: B69199Q + S/N: Z1X6029782007 + + ![img-0.jpeg](img-0.jpeg) + + W/N: 0126 + FN + + ![img-1.jpeg](img-1.jpeg) diff --git a/output/WIN_20260304_15_06_20_Pro.yaml b/output/WIN_20260304_15_06_20_Pro.yaml new file mode 100644 index 0000000..464fd48 --- /dev/null +++ b/output/WIN_20260304_15_06_20_Pro.yaml 
@@ -0,0 +1,22 @@ +--- +serial_number: 2506053021331 +model_number: 38E3470018G1 +date: null +source_file: WIN_20260304_15_06_20_Pro.jpg +processed_at: "2026-03-04 17:26:34" +raw_ocr: | + # POWER CODE + + IN-FIELD SUPPORT by VANGUARD™ + + basco.com/patents • data rates apply + + Serial #: + 2506053021331 + + Model Number: + 38E3470018G1 + + ![img-0.jpeg](img-0.jpeg) + + ![img-1.jpeg](img-1.jpeg) diff --git a/output/WIN_20260304_15_09_52_Pro.yaml b/output/WIN_20260304_15_09_52_Pro.yaml new file mode 100644 index 0000000..ff9932a --- /dev/null +++ b/output/WIN_20260304_15_09_52_Pro.yaml @@ -0,0 +1,33 @@ +--- +serial_number: 25101001230300 +model_number: 10418 +date: 2025/04 +source_file: WIN_20260304_15_09_52_Pro.jpg +processed_at: "2026-03-04 17:26:37" +raw_ocr: | + # TTControl + + HYDAC INTERNATIONAL + + EU contact: TTControl GmbH, Schönbrunner Str. 7, 1040 Vienna AT + + UK contact: HYDAC Technology Ltd, De Havilland Way, Windrush Park, OX29 0YG Witney, UK + + HY-TTC 60-CD-594K-768K-0000-000 + + Version: 01.00-D SW: 623 Date: 2025/04 + Voltage: +12/24V S/N: 25101001230300 + + ![img-0.jpeg](img-0.jpeg) + + ![img-1.jpeg](img-1.jpeg) + + 010005D + + 10R-04 0021 + + CODESYS® + + Made in Hungary + + P/N: [10418] 921088 diff --git a/output/d885193d-5e69-4823-aa08-cacb618b3dd1.yaml b/output/d885193d-5e69-4823-aa08-cacb618b3dd1.yaml new file mode 100644 index 0000000..2059d00 --- /dev/null +++ b/output/d885193d-5e69-4823-aa08-cacb618b3dd1.yaml @@ -0,0 +1,9 @@ +--- +serial_number: null +model_number: null +date: null +source_file: d885193d-5e69-4823-aa08-cacb618b3dd1.jpg +processed_at: "2026-03-04 17:26:38" +raw_ocr: | + | ![img-0.jpeg](img-0.jpeg) | | | + | --- | --- | --- |
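The sample outputs above write all-digit values as plain scalars (`serial_number: 2506053021331`, `model_number: 10418`), and a YAML parser will typically load those as integers rather than strings. Below is a minimal sketch of a stricter quoting rule, assuming the single-line branch of `to_yaml` were factored into a helper. `yaml_scalar` is a hypothetical name and is not part of this diff. The special-character pattern is copied from `to_yaml`; the `is_numeric` and keyword checks are the additions, and they do not cover every implicit type YAML defines.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ocr.php): render a single-line string as a
 * YAML scalar, quoting it whenever a parser could read it as another type.
 */
function yaml_scalar(string $str): string
{
    // Characters and edge whitespace that to_yaml() already quotes.
    $hasSpecial = $str === ''
        || preg_match('/[:#\[\]{}|>&!\'"%@`,]|^\s|\s$/', $str) === 1;

    // Plain scalars that YAML may resolve to int, float, bool, or null.
    $looksTyped = is_numeric($str)
        || preg_match('/^(null|true|false|yes|no|on|off|~)$/i', $str) === 1;

    if ($hasSpecial || $looksTyped) {
        $escaped = str_replace(['\\', '"'], ['\\\\', '\\"'], $str);
        return "\"{$escaped}\"";
    }

    return $str;
}

echo 'model_number: ' . yaml_scalar('10418') . "\n";   // value is quoted
echo 'model_number: ' . yaml_scalar('B69199Q') . "\n"; // value stays plain
```

With a rule like this, serial and model numbers keep their string type when the file is read back, including any leading zeros that an integer conversion would drop.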