refactor(emoji): rewrite script w/ Python and update emojis (#8069)

Closes #8069
Léopold Mebazaa 2019-08-09 21:56:57 -04:00 committed by Marc Cornellà
parent 02d8cf6e9a
commit 0613232202
No known key found for this signature in database
GPG key ID: 0314585E776A9C1B
7 changed files with 32840 additions and 2608 deletions

@ -10,7 +10,7 @@ This plugin provides support for working with Unicode emoji characters in `zsh`
Variable | Description
----------------- | --------------------------------
$emoji | Maps emoji names to characters
$emoji | Maps emoji names to characters (except flags)
$emoji_flags | Maps country names to flag characters (using region indicators)
$emoji_groups | Named groups of emoji. Keys are group names; values are whitespace-separated lists of character names
@ -55,10 +55,8 @@ The defined group names can be found with `echo ${(k)emoji_groups}`.
To list all available emoji with their names, use:
```
$> display_emoji
$> display_emoji fruits
$> display_emoji animals
$> display_emoji vehicles
$> display_emoji faces
$> display_emoji people
```
To use emoji in a prompt:
@ -73,13 +71,13 @@ PROMPT="$surfer > "
The emoji names and codes are sourced from Unicode Technical Report \#51, which provides information on emoji support in Unicode. It can be found at https://www.unicode.org/reports/tr51/index.html.
The group definitions are added by this OMZ plugin. They are not based on external definitions. (As far as I can tell. -apjanke)
The group definitions are added by this OMZ plugin. They are not based on external definitions.
The values in the `$emoji*` maps are the emoji characters themselves, not escape sequences or other forms that require interpretation. They can be used in any context and do not require escape sequence support from commands like `echo` or `print`.
The emoji in the main `$emoji` map are standalone character sequences which can all be output on their own, without worrying about combining characters. The values may actually be multi-code-point sequences, instead of a single code point, and may include combining characters in those sequences. But they're arranged so their effects do not extend beyond that sequence.
The exception to this is the skin tone variation selectors. These are included in the main `$emoji` map because they can be displayed on their own, as well as used as combining characters. (If they follow a character that is not one of the emoji characters they combine with, they are displayed as color swatches.)
The exception to this is the skin tone / hair style variation selectors. These are included in the main `$emoji` map because they can be displayed on their own, as well as used as combining characters. (If they follow a character that is not one of the emoji characters they combine with, they are displayed as color swatches.)
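To make the point about multi-code-point sequences concrete, here is a minimal sketch in Python (the language of the update script, not zsh). It assumes the code points U+1F44B (waving hand) and U+1F3FD (medium skin tone modifier); the variable and function names are illustrative and are not part of the plugin.

```python
# Sketch: an emoji value may be a sequence of several code points.
waving_hand = "\U0001F44B"       # U+1F44B, base emoji
medium_skin_tone = "\U0001F3FD"  # U+1F3FD, skin tone modifier

# A modifier placed directly after a compatible base emoji forms one sequence.
combined = waving_hand + medium_skin_tone


def code_points(text):
    """Return the code points of `text` as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(combined))
```

A terminal with emoji support would normally render `combined` as a single glyph, while the modifier printed on its own appears as a color swatch.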
## Experimental Features
@ -90,7 +88,6 @@ Variables:
Variable | Description
----------------- | --------------------------------
$emoji2 | Auxiliary and combining characters
$emoji_skintone | Skin tone modifiers (from Unicode 8.0)
@ -105,31 +102,26 @@ The "variation selectors" are combining characters which change the appearance o
The `$emoji_skintone` associative array maps skin tone IDs to the variation selector characters. To use one, output it immediately following a smiley or other human emoji.
```
echo "$emoji[smiling_face_with_open_mouth]$emoji_skintone[4]"
echo $emoji[waving_hand]$emoji_skintone[5]
```
Note that `$emoji_skintone` is an associative array, and its keys are the *names* of "Fitzpatrick Skin Type" groups, not linear indexes into a normal array. The names are `1_2`, `3`, `4`, `5`, and `6`. (Types 1 and 2 are combined into a single color.) See the [Diversity section in Unicode TR 51](https://www.unicode.org/reports/tr51/index.html#Diversity) for details.
#### Gemoji support
The [gemoji project](https://github.com/github/gemoji) seems to be the de facto main source for short names and other emoji-related metadata that isn't included in the official Unicode reports. So, our list of emojis incorporates some of their aliases to make your life more convenient:
```
echo $emoji[grinning_face_with_smiling_eyes]
echo $emoji[smile]
```
These two commands yield the same emoji (😄). The first name is the official one, in the Unicode reference, and the second one is the alias that was in Gemoji's database.
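The alias lookup can be sketched roughly as follows, in Python. This is an illustration under assumptions, not the plugin's actual code: `sample_db` is a hypothetical two-entry stand-in for Gemoji's database (only the `emoji` and `aliases` fields are used), and `names_for` is a made-up helper name.

```python
# Hypothetical sample shaped like entries of Gemoji's emoji database.
sample_db = [
    {"emoji": "\U0001F604", "aliases": ["smile"]},
    {"emoji": "\U0001F34E", "aliases": ["apple", "red_apple"]},
]

# Hypothetical official names, keyed by emoji character.
official_names = {
    "\U0001F604": "grinning_face_with_smiling_eyes",
    "\U0001F34E": "red_apple",
}

aliases_map = {entry["emoji"]: entry["aliases"] for entry in sample_db}
taken = set(official_names.values())


def names_for(emoji_char):
    """Official name first, then any alias that is not already an official name."""
    names = [official_names[emoji_char]]
    for alias in aliases_map.get(emoji_char, []):
        if alias not in taken:
            names.append(alias)
    return names


for char in official_names:
    print(char, names_for(char))
```

Skipping aliases that collide with an official name keeps one key from silently pointing at two different emoji.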
## TODO
These are things that could be enhanced in future revisions of the plugin.
* Incorporate CLDR data for ordering and groupings
* Short :bracket: style names (from gemoji)
* Incorporate `gemoji` data
* Country codes for flags
* ZWJ combining function?
#### Gemoji support
The [gemoji project](https://github.com/github/gemoji) seems to be the de facto main source for short names and other emoji-related metadata that isn't included in the official Unicode reports. (I'm saying this just from looking at the google results for "emoji short names" and related searches. -apjanke)
If this plugin is updated to provide short names, CLDR sorting data, and similar stuff, it should probably be changed to use the Gemoji project, and the `update_emoji.pl` script be rewritten in Ruby so it can use the Gemoji library directly instead of parsing its data files.
This does *not* mean that it should use Gemoji at run time. None of the `zsh` plugin stuff should call Gemoji or Ruby code. Rather, the "build time" `update_emoji.pl` script should be rewritten to use Gemoji to generate a pure-native-`zsh` character definition file which would be checked in to the repo and can be called by OMZ users without having Gemoji installed.
#### ZWJ combining function
One of the newer features of Unicode emoji is the ability to use the "Zero-Width Joiner" character to compose multiple emoji characters into a single "emoji ligature" glyph. For example, this is [how Apple supports "family" emoji with various genders and skin tones](https://www.unicode.org/reports/tr51/index.html#ZWJ_Sequences).
These are a pain to write out (and probably worse to read), and it might be convenient to have a couple functions for concisely composing them, if wider support for them appears.
* ZWJ combining function?
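A ZWJ combining helper of the kind mentioned here might look like the sketch below. It is written in Python for illustration only; `zwj_join` is a hypothetical name, no such function exists in the plugin, and whether the result renders as one glyph depends on the font and terminal.

```python
ZWJ = "\u200D"  # U+200D ZERO WIDTH JOINER


def zwj_join(*parts):
    """Join emoji characters with ZWJ so that supporting renderers draw one glyph."""
    return ZWJ.join(parts)


# man + woman + girl, the components of one of the "family" sequences
family = zwj_join("\U0001F468", "\U0001F469", "\U0001F467")
print(family)
print([f"U+{ord(ch):04X}" for ch in family])
```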

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -15,9 +15,6 @@ _omz_emoji_plugin_dir="${0:h}"
local LC_ALL=en_US.UTF-8
typeset -gAH emoji_groups
typeset -gAH emoji_con
typeset -gAH emoji2
typeset -gAH emoji_skintone
source "$_omz_emoji_plugin_dir/emoji-char-definitions.zsh"
@ -30,7 +27,6 @@ unset _omz_emoji_plugin_dir
# The digits 0-9 are already in the emoji table as keycap_digit_<N>, keycap_ten, etc.
# It's unclear whether this should be in the $emoji array, because those characters are all ones
# which can be displayed on their own.
#emoji[combining_enclosing_keycap]="\U20E3"
emoji[regional_indicator_symbol_letter_d_regional_indicator_symbol_letter_e]=$'\xF0\x9F\x87\xA9\xF0\x9F\x87\xAA'
emoji[regional_indicator_symbol_letter_g_regional_indicator_symbol_letter_b]=$'\xF0\x9F\x87\xAC\xF0\x9F\x87\xA7'
@ -43,209 +39,12 @@ emoji[regional_indicator_symbol_letter_i_regional_indicator_symbol_letter_t]=$'\
emoji[regional_indicator_symbol_letter_u_regional_indicator_symbol_letter_s]=$'\xF0\x9F\x87\xBA\xF0\x9F\x87\xB8'
emoji[regional_indicator_symbol_letter_r_regional_indicator_symbol_letter_u]=$'\xF0\x9F\x87\xB7\xF0\x9F\x87\xBA'
# Nonstandard alias names
emoji[vulcan_salute]=$'\U1F596'
# Emoji combining and auxiliary characters
# "Variation Selectors" for controlling text vs emoji style presentation
# These apply to the immediately preceding character
emoji2[text_style]=$'\UFE0E'
emoji2[emoji_style]=$'\UFE0F'
# Joiner that indicates a single combined-form glyph (ligature) should be used
emoji2[zero_width_joiner]=$'\U200D'
# Skin tone modifiers
emoji2[emoji_modifier_fitzpatrick_type_1_2]=$'\U1F3FB'
emoji2[emoji_modifier_fitzpatrick_type_3]=$'\U1F3FC'
emoji2[emoji_modifier_fitzpatrick_type_4]=$'\U1F3FD'
emoji2[emoji_modifier_fitzpatrick_type_5]=$'\U1F3FE'
emoji2[emoji_modifier_fitzpatrick_type_6]=$'\U1F3FF'
# Various other combining characters. (Incomplete list; I selected ones that sound useful)
emoji2[combining_enclosing_circle]=$'\U20DD'
emoji2[combining_enclosing_square]=$'\U20DE'
emoji2[combining_enclosing_diamond]=$'\U20DF'
emoji2[combining_enclosing_circle_backslash]=$'\U20E0'
emoji2[combining_enclosing_screen]=$'\U20E2'
emoji2[combining_enclosing_keycap]=$'\U20E3'
emoji2[combining_enclosing_upward_pointing_triangle]=$'\U20E4'
# Easier access to skin tone modifiers
emoji_skintone[1_2]=$'\U1F3FB'
emoji_skintone[3]=$'\U1F3FC'
emoji_skintone[4]=$'\U1F3FD'
emoji_skintone[5]=$'\U1F3FE'
emoji_skintone[6]=$'\U1F3FF'
# Emoji groups
# These are stored in a single associative array, $emoji_groups, to avoid cluttering up the global
# namespace, and to allow adding additional group definitions at run time.
# The keys are the group names, and the values are whitespace-separated lists of emoji character names.
emoji_groups[fruits]="
tomato
aubergine
grapes
melon
watermelon
tangerine
banana
pineapple
red_apple
green_apple
peach
cherries
strawberry
lemon
pear
"
emoji_groups[vehicles]="
airplane
rocket
railway_car
high_speed_train
high_speed_train_with_bullet_nose
bus
ambulance
fire_engine
police_car
taxi
automobile
recreational_vehicle
delivery_truck
ship
speedboat
bicycle
helicopter
steam_locomotive
train
light_rail
tram
oncoming_bus
trolleybus
minibus
oncoming_police_car
oncoming_taxi
oncoming_automobile
articulated_lorry
tractor
monorail
mountain_railway
suspension_railway
mountain_cableway
aerial_tramway
rowboat
bicyclist
mountain_bicyclist
sailboat
"
emoji_groups[animals]="
snail
snake
horse
sheep
monkey
chicken
boar
elephant
octopus
spiral_shell
bug
ant
honeybee
lady_beetle
fish
tropical_fish
blowfish
turtle
hatching_chick
baby_chick
front_facing_baby_chick
bird
penguin
koala
poodle
bactrian_camel
dolphin
mouse_face
cow_face
tiger_face
rabbit_face
cat_face
dragon_face
spouting_whale
horse_face
monkey_face
dog_face
pig_face
frog_face
hamster_face
wolf_face
bear_face
panda_face
rat
mouse
ox
water_buffalo
cow
tiger
leopard
rabbit
cat
dragon
crocodile
whale
ram
goat
rooster
dog
pig
dromedary_camel
"
emoji_groups[faces]="
grinning_face_with_smiling_eyes
face_with_tears_of_joy
smiling_face_with_open_mouth
smiling_face_with_open_mouth_and_smiling_eyes
smiling_face_with_open_mouth_and_cold_sweat
smiling_face_with_open_mouth_and_tightly_closed_eyes
winking_face
smiling_face_with_smiling_eyes
face_savouring_delicious_food
relieved_face
smiling_face_with_heart_shaped_eyes
smirking_face
unamused_face
face_with_cold_sweat
pensive_face
confounded_face
face_throwing_a_kiss
kissing_face_with_closed_eyes
face_with_stuck_out_tongue_and_winking_eye
face_with_stuck_out_tongue_and_tightly_closed_eyes
disappointed_face
angry_face
pouting_face
crying_face
persevering_face
face_with_look_of_triumph
disappointed_but_relieved_face
fearful_face
weary_face
sleepy_face
tired_face
loudly_crying_face
face_with_open_mouth_and_cold_sweat
face_screaming_in_fear
astonished_face
flushed_face
dizzy_face
face_with_medical_mask
"
}
# Prints a random emoji character
@ -264,7 +63,11 @@ function random_emoji() {
[[ $list_size -eq 0 ]] && return 1
local random_index=$(( ( RANDOM % $list_size ) + 1 ))
local name=${names[$random_index]}
echo ${emoji[$name]}
if [[ "$group" == "flags" ]]; then
echo ${emoji_flags[$name]}
else
echo ${emoji[$name]}
fi
}
# Displays a listing of emoji with their names
@ -281,12 +84,26 @@ function display_emoji() {
fi
# The extra spaces in output here are a hack for readability, since some
# terminals treat these emoji chars as single-width.
local counter=1
for i in $names; do
printf '%s ' "$emoji[$i]"
if [[ "$group" == "flags" ]]; then
printf '%s ' "$emoji_flags[$i]"
else
printf '%s ' "$emoji[$i]"
fi
# New line every 20 emoji, to avoid weirdnesses
if (($counter % 20 == 0)); then
printf "\n"
fi
let counter=$counter+1
done
print
for i in $names; do
echo "${emoji[$i]} = $i"
if [[ "$group" == "flags" ]]; then
echo "${emoji_flags[$i]} = $i"
else
echo "${emoji[$i]} = $i"
fi
done
}

21538
plugins/emoji/gemoji_db.json Normal file

File diff suppressed because it is too large

@ -1,113 +0,0 @@
#!/usr/bin/perl -w
#
# update_emoji.pl
#
# This script generates the emoji.plugin.zsh emoji definitions from the Unicode
# character data for the emoji characters.
#
# The data file can be found at https://unicode.org/Public/emoji/latest/emoji-data.txt
# as referenced in Unicode TR51 (https://www.unicode.org/reports/tr51/index.html).
#
# This is known to work with the data file from version 1.0. It may not work with later
# versions if the format changes. In particular, this reads line comments to get the
# emoji character name and unicode version.
#
# Country names have punctuation and other non-letter characters removed from their name,
# to avoid possible complications with having to escape the strings when using them as
# array subscripts. The definition file seems to use some combining characters like accents
# that get stripped during this process.
use strict;
use warnings;
use 5.010;
use autodie;
use Path::Class;
use File::Copy;
# Parse definitions out of the data file and convert
sub process_emoji_data_file {
my ( $infile, $outfilename ) = @_;
my $file = file($infile);
my $outfile = file($outfilename);
my $outfilebase = $outfile->basename();
my $tempfilename = "$outfilename.tmp";
my $tempfile = file($tempfilename);
my $outfh = $tempfile->openw();
$outfh->print("
# $outfilebase - Emoji character definitions for oh-my-zsh emoji plugin
#
# This file is auto-generated by update_emoji.pl. Do not edit it manually.
#
# This contains the definition for:
# \$emoji - which maps character names to Unicode characters
# \$emoji_flags - maps country names to Unicode flag characters using region indicators
# Main emoji
typeset -gAH emoji
# National flags
typeset -gAH emoji_flags
# Combining modifiers
typeset -gAH emoji_mod
");
my $fh = $file->openr();
my $line_num = 0;
while ( my $line = $fh->getline() ) {
$line_num++;
$_ = $line;
# Skip all-comment lines (from the header) and blank lines
# (But don't strip comments on normal lines; we need to parse those for
# the emoji names.)
next if /^\s*#/ or /^\s*$/;
if (/^(\S.*?\S)\s*;\s*(\w+)\s*;\s*(\w+)\s*;\s*(\w+)\s*;\s*(\w.*?)\s*#\s*V(\S+)\s\(.*?\)\s*(\w.*\S)\s*$/) {
my ($code, $style, $level, $modifier_status, $sources, $version, $keycap_name)
= ($1, $2, $3, $4, $5, $6, $7);
#print "code=$code style=$style level=$level modifier_status=$modifier_status sources=$sources version=$version name=$keycap_name\n";
my @code_points = split /\s+/, $code;
my @sources = split /\s+/, $sources;
my $flag_country = "";
if ( $keycap_name =~ /^flag for (\S.*?)\s*$/) {
$flag_country = $1;
}
my $zsh_code = join '', map { "\\U$_" } @code_points;
# Convert keycap names to valid associative array names that do not require any
# quoting. Works fine for most stuff, but is clumsy for flags.
my $omz_name = lc($keycap_name);
$omz_name =~ s/[^A-Za-z0-9]/_/g;
my $zsh_flag_country = $flag_country;
$zsh_flag_country =~ s/[^\p{Letter}]/_/g;
if ($flag_country) {
$outfh->print("emoji_flags[$zsh_flag_country]=\$'$zsh_code'\n");
} else {
$outfh->print("emoji[$omz_name]=\$'$zsh_code'\n");
}
# Modifiers are included in both the main set and their separate map,
# because they have a standalone representation as a color swatch.
if ( $modifier_status eq "modifier" ) {
$outfh->print("emoji_mod[$omz_name]=\$'$zsh_code'\n");
}
} else {
die "Failed parsing line $line_num: '$_'";
}
}
$fh->close();
$outfh->print("\n");
$outfh->close();
move($tempfilename, $outfilename)
or die "Failed moving temp file to $outfilename: $!";
}
my $datafile = "emoji-data.txt";
my $zsh_def_file = "emoji-char-definitions.zsh";
process_emoji_data_file($datafile, $zsh_def_file);
print "Updated definition file $zsh_def_file\n";

@ -0,0 +1,213 @@
"""
Update Emoji.py
Refreshes OMZ emoji database based on the latest Unicode spec
"""
import re
import json
spec = open("emoji-data.txt", "r")
# Regexes
# regex_emoji will return, respectively:
# the code points, its type (status), the actual emoji, and its official name
regex_emoji = r"^([\w ].*?\S)\s*;\s*([\w-]+)\s*#\s*(.*?)\s(\S.*).*$"
# regex_group returns the group or subgroup that a line opens
regex_group = r"^#\s*(group|subgroup):\s*(.*)$"
headers = """
# emoji-char-definitions.zsh - Emoji definitions for oh-my-zsh emoji plugin
#
# This file is auto-generated by update_emoji.py. Do not edit it manually.
#
# This contains the definition for:
# $emoji - which maps character names to Unicode characters
# $emoji_flags - maps country names to Unicode flag characters using region
# indicators
# $emoji_mod - maps modifier components to Unicode characters
# $emoji_groups - a single associative array to avoid cluttering up the
# global namespace, and to allow adding additional group
# definitions at run time. The keys are the group names, and
# the values are whitespace-separated lists of emoji
# character names.
# Main emoji
typeset -gAH emoji
# National flags
typeset -gAH emoji_flags
# Combining modifiers
typeset -gAH emoji_mod
# Emoji groups
typeset -gAH emoji_groups
"""
#######
# Adding country codes
#######
# This is the only part of this script that relies on an external library
# (country_converter), and is hence commented out by default.
# You can uncomment it to have country codes added as aliases for flag
# emojis. (By default, when you install this extension, country codes are
# included as aliases, but not if you re-run this script without uncommenting.)
# Warning: country_converter is very verbose, and will print warnings all over
# your terminal.
# import country_converter as coco # pylint: disable=wrong-import-position
# cc = coco.CountryConverter()
# def country_iso(_all_names, _omz_name):
#     """ Using the external library country_converter,
#         this function can detect the ISO2 and ISO3 codes
#         of the country. It takes as argument the array
#         with all the names of the emoji, and returns that array."""
#     omz_no_underscore = re.sub(r'_', r' ', _omz_name)
#     iso2 = cc.convert(names=[omz_no_underscore], to='ISO2')
#     if iso2 != 'not found':
#         _all_names.append(iso2)
#         iso3 = cc.convert(names=[omz_no_underscore], to='ISO3')
#         _all_names.append(iso3)
#     return _all_names
#######
# Helper functions
#######
def code_to_omz(_code_points):
    """ Returns a ZSH-compatible Unicode string from the code point(s) """
    return r'\U' + r'\U'.join(_code_points.split(' '))
def name_to_omz(_name, _group, _subgroup, _status):
    """ Returns a reasonable snake_case name for the emoji. """
    def snake_case(_string):
        """ Does the regex work of snake_case """
        remove_dots = re.sub(r'\.\(\)', r'', _string)
        replace_ands = re.sub(r'\&', r'and', remove_dots)
        remove_whitespace = re.sub(r'[^\#\*\w]', r'_', replace_ands)
        return re.sub(r'__', r'_', remove_whitespace)
    shortname = ""
    split_at_colon = lambda s: s.split(": ")
    # Special treatment by group and subgroup
    # If the emoji is a flag, we strip "flag" from its name
    if _group == "Flags" and len(split_at_colon(_name)) > 1:
        shortname = snake_case(split_at_colon(_name)[1])
    else:
        shortname = snake_case(_name)
    # Special treatment by status
    # Enables us to have every emoji combination,
    # even the ones that are not officially sanctioned
    # and are implemented by, say, only one vendor
    if _status == "unqualified":
        shortname += "_unqualified"
    elif _status == "minimally-qualified":
        shortname += "_minimally"
    return shortname
def increment_name(_shortname):
    """ Increment the short name by 1. If you get, say,
    'woman_detective_unqualified', it returns
    'woman_detective_unqualified_1', and then
    'woman_detective_unqualified_2', etc. """
    last_char = _shortname[-1]
    if last_char.isdigit():
        num = int(last_char)
        return _shortname[:-1] + str(num + 1)
    return _shortname + "_1"
########
# Going through every line
########
group, subgroup, short_name_buffer = "", "", ""
emoji_database = []
for line in spec:
    # First, test if this line opens a group or subgroup
    group_match = re.findall(regex_group, line)
    if group_match != []:
        gr_or_sub, name = group_match[0]
        if gr_or_sub == "group":
            group = name
        elif gr_or_sub == "subgroup":
            subgroup = name
        continue # Moving on...
    # Second, test if this line references one emoji
    emoji_match = re.findall(regex_emoji, line)
    if emoji_match != []:
        code_points, status, emoji, name = emoji_match[0]
        omz_codes = code_to_omz(code_points)
        omz_name = name_to_omz(name, group, subgroup, status)
        # If this emoji has the same shortname as the preceding one
        if omz_name in short_name_buffer:
            omz_name = increment_name(short_name_buffer)
        short_name_buffer = omz_name
        emoji_database.append(
            [omz_codes, status, emoji, omz_name, group, subgroup])
spec.close()
########
# Write to emoji-char-definitions.zsh
########
# Aliases for emojis are retrieved through the DB of Gemoji
# Retrieved on Aug 9 2019 from the following URL:
# https://raw.githubusercontent.com/github/gemoji/master/db/emoji.json
gemoji_db = open("gemoji_db.json")
j = json.load(gemoji_db)
aliases_map = {entry['emoji']: entry['aliases'] for entry in j}
all_omz_names = [emoji_data[3] for emoji_data in emoji_database]
# Let's begin writing to this file
output = open("emoji-char-definitions.zsh", "w")
output.write(headers)
emoji_groups = {"fruits": "\n", "vehicles": "\n", "hands": "\n",
"people": "\n", "animals": "\n", "faces": "\n",
"flags": "\n"}
# First, write every emoji down
for _omz_codes, _status, _emoji, _omz_name, _group, _subgroup in emoji_database:
    # One emoji can be mapped to multiple names (aliases or country codes)
    names_for_this_emoji = [_omz_name]
    # Variable that indicates in which map the emoji will be located
    emoji_map = "emoji"
    if _status == "component":
        emoji_map = "emoji_mod"
    if _group == "Flags":
        emoji_map = "emoji_flags"
        # Adding country codes (Optional, see above)
        # names_for_this_emoji = country_iso(names_for_this_emoji, _omz_name)
    # Check if there is an alias available in the Gemoji DB
    if _emoji in aliases_map.keys():
        for alias in aliases_map[_emoji]:
            if alias not in all_omz_names:
                names_for_this_emoji.append(alias)
    # And now we write to the definitions file
    for one_name in names_for_this_emoji:
        output.write(f"{emoji_map}[{one_name}]=$'{_omz_codes}'\n")
    # Storing the emoji in defined subgroups for the next step
    if _status == "fully-qualified":
        if _subgroup == "food-fruit":
            emoji_groups["fruits"] += f" {_omz_name}\n"
        elif "transport-" in _subgroup:
            emoji_groups["vehicles"] += f" {_omz_name}\n"
        elif "hand-" in _subgroup:
            emoji_groups["hands"] += f" {_omz_name}\n"
        elif "person-" in _subgroup or _subgroup == "family":
            emoji_groups["people"] += f" {_omz_name}\n"
        elif "animal-" in _subgroup:
            emoji_groups["animals"] += f" {_omz_name}\n"
        elif "face-" in _subgroup:
            emoji_groups["faces"] += f" {_omz_name}\n"
        elif _group == "Flags":
            emoji_groups["flags"] += f" {_omz_name}\n"
# Second, write the subgroups to the end of the file
for name, string in emoji_groups.items():
    output.write(f'\nemoji_groups[{name}]="{string}"\n')
output.close()
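
The parsing step of the script above can be tried in isolation with the sketch below. The two regexes and `code_to_omz` are copied from the script; `sample_lines` is hypothetical input written in the layout the regexes expect (code points, status, then a comment holding the emoji and its name), and `parse` is an illustrative wrapper, not a function from the script. Data files that use a different comment layout may not parse the same way.

```python
import re

# Regexes copied from update_emoji.py
regex_emoji = r"^([\w ].*?\S)\s*;\s*([\w-]+)\s*#\s*(.*?)\s(\S.*).*$"
regex_group = r"^#\s*(group|subgroup):\s*(.*)$"

# Hypothetical sample input in the layout the regexes expect
sample_lines = [
    "# group: Smileys & Emotion\n",
    "# subgroup: face-smiling\n",
    "1F600          ; fully-qualified     # \U0001F600 grinning face\n",
]


def code_to_omz(code_points):
    """Build a zsh $'...' escape body such as \\U1F600 from space-separated code points."""
    return r"\U" + r"\U".join(code_points.split(" "))


def parse(lines):
    """Track the current group/subgroup and collect one row per emoji line."""
    group, subgroup, rows = "", "", []
    for line in lines:
        group_match = re.findall(regex_group, line)
        if group_match:
            kind, name = group_match[0]
            if kind == "group":
                group = name
            else:
                subgroup = name
            continue
        emoji_match = re.findall(regex_emoji, line)
        if emoji_match:
            code_points, status, emoji, name = emoji_match[0]
            rows.append((code_to_omz(code_points), status, emoji, name,
                         group, subgroup))
    return rows


rows = parse(sample_lines)
print(rows)
```

Each row carries the group and subgroup that were open when the emoji line was read, which is what the later grouping step relies on.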