From 07a23bda90e56fa086c4bd5bd8207314ccaf30ae Mon Sep 17 00:00:00 2001
From: Fini Jastrow <ulf.fini.jastrow@desy.de>
Date: Fri, 26 May 2023 08:33:06 +0200
Subject: [PATCH] name-parser: Allow dashes between modifier and weight

[why]
Some fonts use a non-standard (i.e. broken) weight naming scheme:
They put a blank or a dash between the modifier and the weight, for
example "Extra Bold" or "Demi-Condensed", when they mean "ExtraBold"
or "DemiCondensed", respectively.

The former happens with CartographCF, the latter with IBM3270.

[how]
Automatically allow a dash between modifier and weight. In the token
list that position shows up as a CamelCase boundary, so insert an
optional dash (r'-?') at every such boundary.
For the further lookup we need to remove the dash from the found
keyword, if there is any, to get back to standard naming.
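
A minimal, standalone sketch of the two steps above. The token list and
the helper names build_token_regex() and first_token() are made up for
illustration and are not part of FontnameTools.py; only the two regex
expressions are taken from the change itself.

```python
import re

# Hypothetical token list, for illustration only; the real list is
# passed in by the callers in FontnameTools.py
TOKENS = ['ExtraBold', 'DemiCondensed', 'Bold', 'Italic']

def build_token_regex(tokens):
    """Join the tokens and allow an optional dash at CamelCase boundaries."""
    token_regex = '|'.join(tokens)
    # Zero-width match between a lower case and an upper case letter
    token_regex = re.sub(r'(?<=[a-z])(?=[A-Z])', '-?', token_regex)
    return re.compile('(.*?)(' + token_regex + ')(.*)', re.IGNORECASE)

def first_token(name, tokens):
    """Return the first token found in name, lower case and without dash."""
    m = build_token_regex(tokens).match(name)
    if not m:
        return None
    # Remove the optional dash again to get back to the standard keyword
    return m.group(2).lower().replace('-', '')

print(build_token_regex(TOKENS).pattern)
print(first_token('Font Extra-Bold Italic', TOKENS))
```

With these assumed tokens both "Font ExtraBold" and "Font Extra-Bold"
end up as the keyword "extrabold" for the further lookup.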

This might break if the font name ends in a modifier. In that case we
cannot really decide which of the two readings is meant:

       Font Name Extra Bold Italic
    => Font Name - ExtraBold Italic
    => Font Name Extra - Bold Italic

The known modifiers are 'Demi', 'Ultra', 'Semi', 'Extra'.

It is possible but unlikely that a font name ends in one of these.
For example "Modern Ultra - Bold".
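
The caveat can be reproduced with a small sketch. TOKENS and
split_name() are hypothetical and only mimic the matching loop; the
real parser does more than this.

```python
import re

# Hypothetical token list, for illustration only
TOKENS = ['UltraBold', 'Bold', 'Italic']

def split_name(name):
    """Return (unmatched rest, list of normalized tokens)."""
    token_regex = re.sub(r'(?<=[a-z])(?=[A-Z])', '-?', '|'.join(TOKENS))
    regex = re.compile('(.*?)(' + token_regex + ')(.*)', re.IGNORECASE)
    rest, found = '', []
    while True:
        m = regex.match(name)
        if not m:
            break
        rest += ' ' + m.group(1)  # blank prevents unwanted concatenation
        found.append(m.group(2).lower().replace('-', ''))
        name = m.group(3)
    rest += ' ' + name
    return ' '.join(rest.split()), found

# A family that is really called "Modern Ultra", in weight Bold,
# loses the "Ultra" part of its name to the weight:
print(split_name('Modern Ultra-Bold'))
```

In this sketch the name comes back as "Modern" with the single token
"ultrabold", which is the misparse described above.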

[note]
The question arises whether we should parse the PSname instead of the
Fullname, and stick to the dash there as boundary.
The problem might be prepatched fonts with broken naming, which would
then be parsed completely wrong. So maybe the current approach is still
the best, with the caveat given above (font names ending in a modifier).

[note 2]
Funnily enough, the variable allow_regex_token was not used at all :->
Some leftover? Anyhow, we use it now.

[note 3]
We still cannot remove the special handling for IBM3270, because its
Fullname initially looks like a PSname and is parsed as such, which
breaks the name in the wrong place:

        PSname template  = "Name-StylesWeights"
        Fullname of 3270 = "IBM 3270 Semi-Condensed"

Signed-off-by: Fini Jastrow <ulf.fini.jastrow@desy.de>
---
 bin/scripts/name_parser/FontnameTools.py | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/bin/scripts/name_parser/FontnameTools.py b/bin/scripts/name_parser/FontnameTools.py
index 0878d57a7..d5ce1bd40 100644
--- a/bin/scripts/name_parser/FontnameTools.py
+++ b/bin/scripts/name_parser/FontnameTools.py
@@ -64,7 +64,6 @@ class FontnameTools:
         known_names = {
             # Source of the table is the current sourcefonts
             # Left side needs to be lower case
-            '-':            '',
             'book':         '',
             'text':         '',
             'ce':           'CE',
@@ -150,7 +149,12 @@ class FontnameTools:
         not_matched = ""
         all_tokens = []
         j = 1
-        regex = re.compile('(.*?)(' + '|'.join(tokens) + ')(.*)', re.IGNORECASE)
+        token_regex = '|'.join(tokens)
+        if not allow_regex_token:
+            # Allow a dash between CamelCase token word parts, i.e. Camel-Case
+            # This allows for styles like Extra-Bold
+            token_regex = re.sub(r'(?<=[a-z])(?=[A-Z])', '-?', token_regex)
+        regex = re.compile('(.*?)(' + token_regex + ')(.*)', re.IGNORECASE)
         while j:
             j = regex.match(name)
             if not j:
@@ -159,6 +163,9 @@ class FontnameTools:
                 sys.exit('Malformed regex in FontnameTools.get_name_token()')
             not_matched += ' ' + j.groups()[0] # Blanc prevents unwanted concatenation of unmatched substrings
             tok = j.groups()[1].lower()
+            if not allow_regex_token:
+                # Remove dashes between CamelCase token words
+                tok = tok.replace('-', '')
             if tok in lower_tokens:
                 tok = tokens[lower_tokens.index(tok)]
             tok = FontnameTools.unify_style_names(tok)