Add Mistral AI OCR script with test data and documentation

- ocr.php: two-step pipeline (mistral-ocr-latest + mistral-small-latest)
  extracts Serial Number, Model Number, and Date from part label photos
- input/: 5 test images of industrial part labels
- output/: corresponding YAML results
- README.md: full usage, setup, and troubleshooting docs
- .gitignore: excludes .env only
- .env.example: API key template

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 18:29:07 +01:00
commit 5bf9e065e4
parent 3219ea6916
14 changed files with 682 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -1,2 +1,187 @@
 # ckOCR

+PHP CLI tool that reads photos of part identification labels and extracts structured data using **Mistral AI OCR**.
+
+Reads images from `input/`, calls the Mistral API, and writes YAML files to `output/` containing the **Serial Number**, **Model Number**, and **Date**.
+
+---
+
+## Requirements
+
+- PHP **8.1 – 8.5** with the `curl` extension enabled (no Composer required)
+- A [Mistral AI](https://console.mistral.ai/) account with API access
+
+**Arch Linux / CachyOS** — enable the curl extension after installing PHP:
+
+```bash
+sudo pacman -S php
+# uncomment "extension=curl" in /etc/php/php.ini
+php -m | grep curl   # verify
+```
+
+---
+
+## Installation
+
+```bash
+git clone <repo-url> ckOCR
+cd ckOCR
+cp .env.example .env
+```
+
+Edit `.env` and insert your Mistral API key:
+
+```env
+MISTRAL_API_KEY=your_api_key_here
+```
+
+Alternatively, export it as an environment variable:
+
+```bash
+export MISTRAL_API_KEY=your_api_key_here
+```
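
A quick pre-flight check can confirm that the key is visible before the tool runs. The sketch below is not part of this commit or of `ocr.php`; `key_source` is a hypothetical helper, and it assumes the `.env` file uses plain `KEY=value` lines as in the example above. It makes no claim about which source `ocr.php` prefers when both are set; it simply reports the first one it finds.

```bash
# Hypothetical helper (not part of ocr.php): report where MISTRAL_API_KEY is defined.
# Usage: key_source [path-to-.env]
# Prints "environment", "dotenv", or "missing".
key_source() {
    env_file="${1:-.env}"
    if [ -n "${MISTRAL_API_KEY:-}" ]; then
        echo "environment"
    elif [ -f "$env_file" ] && grep -Eq '^MISTRAL_API_KEY=.+' "$env_file"; then
        echo "dotenv"
    else
        echo "missing"
    fi
}
```

Calling `key_source` from the project root checks `./.env` by default. Note that the placeholder value from `.env.example` also counts as "set", so a `dotenv` result does not prove the key is valid.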
+
+---
+
+## Usage
+
+Place one or more label photos in the `input/` folder, then run:
+
+```bash
+php ocr.php
+```
+
+Results are written to `output/` as YAML files, one per image, named after the image's filename stem (for example, `motor-sn.png` produces `motor-sn.yaml`).
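
The filename-stem rule can be sketched as a pair of shell helpers, for example to list which images still lack a result. This is a minimal sketch and not part of this commit: `output_path_for` and `pending_images` are hypothetical names, and the code assumes the `input/` and `output/` directories and the `.yaml` extension described in this README.

```bash
# Hypothetical helper: map an image path to its expected result path,
# assuming results land in output/<stem>.yaml.
output_path_for() {
    name=$(basename "$1")
    printf 'output/%s.yaml\n' "${name%.*}"
}

# Print every file in input/ that has no matching YAML file yet.
pending_images() {
    for img in input/*; do
        [ -f "$img" ] || continue
        [ -f "$(output_path_for "$img")" ] || printf '%s\n' "$img"
    done
}
```

Run `pending_images` from the project root; an empty result suggests that a plain `php ocr.php` would have nothing new to process.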
+
+### Options
+
+| Flag | Description |
+|---|---|
+| `--force` | Re-process images that already have an output file |
+| `--verbose` | Print the raw OCR text and API request details |
+| `--help` | Show usage information |
+
+### Examples
+
+```bash
+# Process all new images
+php ocr.php
+
+# Re-run everything, show full detail
+php ocr.php --force --verbose
+
+# Show the available options
+php ocr.php --help
+```
+
+---
+
+## Input
+
+Supported image formats: **JPG, JPEG, PNG, WebP, GIF**
+
+Maximum file size: **5 MB** per image (Mistral API limit)
+
+```
+input/
+├── part-label-01.jpg
+├── motor-sn.png
+└── board-sticker.jpg
+```
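
The format and size limits can be checked before any API call is made. A minimal sketch, not part of this commit: it assumes "5 MB" means 5 × 1024 × 1024 bytes (the README does not say whether the limit is binary or decimal) and that extensions are matched case-insensitively. `check_image` is a hypothetical helper name.

```bash
MAX_BYTES=$((5 * 1024 * 1024))   # assumption: 5 MB counted as 5 MiB

# Hypothetical helper: print "ok", "unsupported format", or "too large"
# for one file, returning non-zero when the file should be skipped.
check_image() {
    # Lower-case the extension so .JPG and .jpg are treated alike.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        jpg|jpeg|png|webp|gif) ;;
        *) echo "unsupported format"; return 1 ;;
    esac
    # Arithmetic expansion strips the padding some wc implementations add.
    size=$(($(wc -c < "$1")))
    if [ "$size" -gt "$MAX_BYTES" ]; then
        echo "too large"
        return 1
    fi
    echo "ok"
}
```

For example, `for f in input/*; do printf '%s: ' "$f"; check_image "$f"; done` prints one verdict per file.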
+
+---
+
+## Output
+
+Each processed image produces a YAML file in `output/`:
+
+```
+output/
+├── part-label-01.yaml
+├── motor-sn.yaml
+└── board-sticker.yaml
+```
+
+### YAML structure
+
+```yaml
+---
+serial_number: SN-20241234
+model_number: "XYZ-4K/B"
+date: 2024-01
+source_file: part-label-01.jpg
+processed_at: 2026-03-04 15:30:00
+raw_ocr: |
+  Full text extracted from the label by the OCR model,
+  preserved exactly as returned.
+```
+
+| Field | Description |
+|---|---|
+| `serial_number` | Serial Number — labelled S/N, SN, Serial No., etc. |
+| `model_number` | Model or Part Number — labelled Model, M/N, P/N, MPN, etc. |
+| `date` | Any date on the label — MFG date, DOM, expiry, etc. |
+| `source_file` | Original image filename |
+| `processed_at` | Timestamp of processing |
+| `raw_ocr` | Full OCR text returned by Mistral before extraction |
+
+Fields not found on the label are written as `null`.
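
For quick checks from the shell, the top-level scalar fields can be read without a YAML parser. This is a minimal sketch, not part of this commit, and it assumes each field sits on its own unindented `key: value` line as in the example above. `yaml_field` is a hypothetical helper; it does not handle the multi-line `raw_ocr` block or escaped quotes.

```bash
# Hypothetical helper: print the value of a top-level scalar field.
# Usage: yaml_field <file> <key>
# Surrounding double quotes are stripped; a missing field prints nothing.
# Indented lines (such as the raw_ocr body) are ignored by the ^ anchor.
yaml_field() {
    sed -n "s/^$2: *//p" "$1" | head -n 1 | sed 's/^"\(.*\)"$/\1/'
}
```

With the example file above, `yaml_field output/part-label-01.yaml model_number` would print `XYZ-4K/B`. Comparing the result against the literal string `null` is one way to find results with missing fields.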
+
+---
+
+## How it works
+
+Processing runs in two API calls per image:
+
+```
+Image file
+    │
+    ▼
+[1] POST /ocr  (mistral-ocr-latest)
+    │  base64-encoded image → markdown text
+    │
+    ▼
+[2] POST /chat/completions  (mistral-small-latest)
+    │  OCR text + extraction prompt → JSON with the three fields
+    │
+    ▼
+YAML file written to output/
+```
+
+1. **OCR step** — the image is base64-encoded and sent to `mistral-ocr-latest`, which returns the full label text as markdown.
+2. **Extraction step** — the OCR text is passed to `mistral-small-latest` with a structured prompt. The model returns a JSON object (`response_format: json_object`) containing `serial_number`, `model_number`, and `date`.
+
+Already-processed images are skipped automatically unless `--force` is used.
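
The first call can be reproduced from the shell to debug a single image. A minimal sketch, not the code in `ocr.php`: it assumes the public base URL `https://api.mistral.ai/v1`, a request body whose `document` object has type `image_url` and carries a base64 data URI, and `curl` and `base64` on the PATH. `ocr_payload` and `ocr_request` are hypothetical helper names; check the field names against the current Mistral API reference before relying on them.

```bash
# Build the JSON body for the OCR step. Base64 output contains no characters
# that need JSON escaping, so plain printf is sufficient here.
# Usage: ocr_payload <image-file> <mime-type>
ocr_payload() {
    b64=$(base64 < "$1" | tr -d '\n')
    printf '{"model":"mistral-ocr-latest","document":{"type":"image_url","image_url":"data:%s;base64,%s"}}\n' "$2" "$b64"
}

# Send the body to the OCR endpoint (assumed URL); needs MISTRAL_API_KEY.
# Usage: ocr_request <image-file> <mime-type>
ocr_request() {
    ocr_payload "$1" "$2" | curl -sS https://api.mistral.ai/v1/ocr \
        -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
        -H "Content-Type: application/json" \
        --data-binary @-
}
```

For example, `ocr_request input/part-label-01.jpg image/jpeg` prints the raw JSON response. The second step is not sketched here because the OCR text must be JSON-escaped before it is embedded in a chat request, which is awkward in plain shell.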
+
+---
+
+## Project structure
+
+```
+ckOCR/
+├── ocr.php          # Main script
+├── .env             # API key (not committed, see .env.example)
+├── .env.example     # Template
+├── .gitignore
+├── input/           # Label photos (test data included)
+└── output/          # YAML results (test data included)
+```
+
+---
+
+## Troubleshooting
+
+**`MISTRAL_API_KEY not set`**
+Set the key in `.env` or export it as an environment variable.
+
+**`Mistral API 401`**
+Your API key is invalid or expired. Check it at [console.mistral.ai](https://console.mistral.ai/).
+
+**`File too large`**
+Resize or compress the image to under 5 MB before placing it in `input/`.
+
+**`No text found`**
+The label may be blurry, low contrast, or too small. Try a clearer photo. The output YAML is still written with `null` fields so the file won't be re-processed accidentally — use `--force --verbose` to retry and inspect the raw OCR output.
+
+**Fields are `null` but text was extracted**
+Run with `--verbose` to see the raw OCR text and check whether the label uses non-standard abbreviations. The extraction prompt covers the most common label formats.