German nouns gender analyzer - Python ~ Ahmed AbouZaid!

I really enjoy learning new languages: not just the grammar, but the whole culture behind the language. The main problem when people try to learn a new language is "comparison"! They keep comparing the new language with their own, or with another foreign language they already know. And the result is always that they struggle with the new language!

I believe that to learn anything new (not just languages) you have to accept it "as it is"! You have to deal with it on its own terms, not on any other. You should not make comparisons, because they do not help at all!

Non-Technical details:

Recently, I finished the A1 level of German (Start Deutsch 1) at the Goethe Institute. I really enjoyed that course, and now I can read and understand most German posts on the Internet. Not just that, I totally enjoyed learning German, although it's a somewhat more complex language than its cognate English (in fact, English and German belong to the same family, the "West Germanic languages"). Why? Because I focused on learning and discovering the language instead of making comparisons!

Source: West Germanic languages - Wikipedia.

Anyway, one of the characteristics of the German language is noun gender: not real/natural gender, but grammatical gender! Unlike English, which has just "the" for all nouns regardless of gender, German has several definite articles: "der" for masculine nouns, "die" for feminine nouns, "das" for neutral nouns ... and all of them change with the grammatical case: nominative, accusative, dative, and genitive! (So German has 6 distinct definite-article forms: der, die, das, den, dem, des.)

So you have to learn and remember each word together with its gender, which is one of three: masculine, feminine, or neutral (and that last one, "neutral", is completely different from "inanimate" in English). There is no rule or standard that determines noun gender with certainty, but there are some general rules that may help you "guess" the gender if you don't know it (like suffixes, "Endungen", and common characteristics). But I was just curious about the frequency of the last character :D

Leaving this long non-technical introduction aside :D
I just wondered: if I know nothing about a word and can't remember any of the ending syllables, and I have to guess the gender of the noun, which one should I choose? :D

So I wrote a simple Python script that counts how often each last character occurs in German nouns of each gender and turns those counts into percentages, so you can make an informed guess when guessing is inevitable! :D

For example, if a noun ends with the letter "L", what is the probability that the word is masculine, feminine, or neutral? The script analyzed two data sets, one of which has more than 334,000 German nouns! For that data set, the statistics for words ending with "L" are:

#: Masculine Feminine Neutral
L: 9733 (47.9%) 3733 (18.4%) 6854 (33.7%)

So if you come across a German noun that ends with "L", it is most likely masculine! So, what did I do?

Technical details:

The script works on a list of nouns and their genders. For each line, it first checks the last character of the noun, then checks the gender of that noun (which is given on the same line), then moves on to the next line in the list ... and so on. You can find the script's repository on GitHub: German-Nouns-Gender-Analyzer.

I've tested this script with two data sets: "dict.cc" and "Wiktionary".
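To make the idea concrete, here is a minimal sketch of that counting loop. It is not the script from the repository, just an illustration written for Python 3; the function names (count_last_letters, percentages) and the tiny sample list are made up for this example, and it assumes input lines in the "Noun {g}" form shown in the result examples.

```python
from collections import Counter, defaultdict

# Gender codes as they appear in the input lines, mapped to display names.
GENDERS = {"m": "Masculine", "f": "Feminine", "n": "Neutral"}


def count_last_letters(lines):
    """Return {last_letter: Counter({gender_code: count})} for 'Noun {g}' lines."""
    stats = defaultdict(Counter)
    for line in lines:
        noun, _, tag = line.strip().rpartition(" ")
        gender = tag.strip("{}")
        if not noun or gender not in GENDERS:
            continue  # skip lines that do not look like "Noun {g}"
        last = noun[-1]
        # "ß".upper() gives "SS" in Python 3, so keep ß as it is.
        key = last if last == "ß" else last.upper()
        stats[key][gender] += 1
    return stats


def percentages(counter):
    """Share of each gender (in percent, one decimal) for one last letter."""
    total = sum(counter.values())
    return {g: round(100.0 * counter[g] / total, 1) for g in GENDERS}


# Hypothetical mini data set, only for illustration.
sample = ["Apfel {m}", "Gabel {f}", "Segel {n}", "Vogel {m}", "Blume {f}", "Fuß {m}"]
stats = count_last_letters(sample)
for letter in sorted(stats):
    shares = percentages(stats[letter])
    print(letter, {GENDERS[g]: f"{stats[letter][g]} ({shares[g]}%)" for g in GENDERS})
```

With a real data set you would pass the lines of the noun file instead of the sample list.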

Dict.cc version (About 334,000 words):

I downloaded the data set from dict.cc, but due to its license I can't share the data itself. You can find more information about this data set at the following URL:
http://www1.dict.cc/translation_file_request.php

I extracted only the single-word nouns from the dictionary using regex and the "grep" command (one-liner):

grep -P ".*?noun" dict.cc_full_dictionary.txt | \
grep -P -o "[A-ZÄÜÖß][a-zA-ZÄÜÖäüöß-]+[a-zäüöß] \{(m|f|n)\}" | \
sort | uniq > dict.cc_nouns_with_gender.txt

Final result example:

Apfel {m}
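If you prefer to stay in Python, the same extraction can be sketched with the "re" module. This is only an approximation of the grep one-liner above, written for Python 3: the regex is copied from it, but the function name (extract_nouns) and the tab-separated sample lines are assumptions made for illustration, not the real layout of the dict.cc file.

```python
import re

# Same pattern as in the grep one-liner: a capitalized word followed by {m}, {f} or {n}.
NOUN_RE = re.compile(r"[A-ZÄÜÖß][a-zA-ZÄÜÖäüöß-]+[a-zäüöß] \{(?:m|f|n)\}")


def extract_nouns(lines):
    """Collect unique 'Noun {g}' strings from lines that mention 'noun'."""
    found = set()
    for line in lines:
        if "noun" not in line:  # mirrors the first grep in the pipeline
            continue
        found.update(NOUN_RE.findall(line))
    return sorted(found)  # mirrors "sort | uniq"


# Hypothetical lines; the real file format may differ.
sample_lines = [
    "Apfel {m}\tapple\tnoun",
    "Apfel {m}\tapple\tnoun",
    "laufen\tto run\tverb",
    "Blume {f}\tflower\tnoun",
]
print(extract_nouns(sample_lines))  # ['Apfel {m}', 'Blume {f}']
```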

Wiktionary.org version (About 50,000 words):

I downloaded this data set from the following URL:
http://dumps.wikimedia.org/dewiktionary/latest/dewiktionary-latest-pages-meta-current.xml.bz2

Then I extracted only the single-word nouns using the following combination (one-liner; note that "gensub" is a GNU awk extension):

xbuffer=$(awk "END {print NR}" dewiktionary-latest-pages-meta-current.xml); \
pcregrep --buffer-size ${xbuffer} -M '.*?\=\= .*? \(\{\{Sprache\|Deutsch\}\}\) \=\=.*?\n.*?\=\=\= \{\{Wortart\|Substantiv\|Deutsch\}\}\, \{\{.\}\} \=\=\=.*?' dewiktionary-latest-pages-meta-current.xml | \
awk '{noun=$3; getline; gender=gensub(/.*?\{(\{.\})\}.*?/,"\\1",$3); print noun, gender}' | \
sort | uniq > wiktionary_nouns_with_gender.txt

Final result example:

Apfel {m}
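The two-line pattern that the pcregrep command looks for can also be sketched in Python 3. This is a simplified version, not a drop-in replacement: it only accepts the genders m, f and n, it expects both headings to start at the beginning of a line, and the function name (extract_wiktionary_nouns) and the sample wikitext are assumptions built from the markup that the regex above targets.

```python
import re

# A language heading for a German entry, directly followed by a noun heading with its gender.
ENTRY_RE = re.compile(
    r"^== (?P<noun>.+?) \(\{\{Sprache\|Deutsch\}\}\) ==[ \t]*\n"
    r"=== \{\{Wortart\|Substantiv\|Deutsch\}\}, \{\{(?P<gender>[mfn])\}\} ===",
    re.MULTILINE,
)


def extract_wiktionary_nouns(wikitext):
    """Return sorted, unique 'Noun {g}' strings found in the given wikitext."""
    pairs = {f"{m['noun']} {{{m['gender']}}}" for m in ENTRY_RE.finditer(wikitext)}
    return sorted(pairs)


# Hypothetical snippet in the shape the pattern expects; a verb entry is skipped.
SAMPLE = (
    "== Apfel ({{Sprache|Deutsch}}) ==\n"
    "=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===\n"
    "\n"
    "== laufen ({{Sprache|Deutsch}}) ==\n"
    "=== {{Wortart|Verb|Deutsch}} ===\n"
)
print(extract_wiktionary_nouns(SAMPLE))  # ['Apfel {m}']
```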

How to use:

python german-nouns-gender-analyzer.py nouns_with_gender.txt

Output example:

This is an example of the output for the "dict.cc" data set.

Total number of words: (334399) words.
 - Feminine: 145711 (43.57%)
 - Masculine: 118905 (35.56%)
 - Neutral: 69784 (20.87%)

Characters statistics:
#:  Masculine        Feminine          Neutral
------------------------------------------------------
A:  581 (14.3%)      2331 (57.2%)      1162 (28.5%)
B:  1393 (83.1%)     14 (0.8%)         270 (16.1%)
C:  33 (42.3%)       15 (19.2%)        30 (38.5%)
D:  3791 (44.6%)     535 (6.3%)        4183 (49.2%)
E:  3249 (4.7%)      63684 (91.9%)     2354 (3.4%)
F:  3947 (89.9%)     27 (0.6%)         417 (9.5%)
G:  7164 (18.4%)     29996 (77.2%)     1690 (4.4%)
H:  5963 (68.8%)     304 (3.5%)        2398 (27.7%)
I:  1556 (46.1%)     1321 (39.1%)      499 (14.8%)
J:  2 (66.7%)        0 (0.0%)          1 (33.3%)
K:  3161 (38.2%)     3304 (39.9%)      1815 (21.9%)
L:  9733 (47.9%)     3733 (18.4%)      6854 (33.7%)
M:  3245 (29.2%)     618 (5.6%)        7257 (65.3%)
N:  10320 (25.9%)    13290 (33.3%)     16310 (40.9%)
O:  915 (37.0%)      184 (7.4%)        1376 (55.6%)
P:  684 (49.2%)      46 (3.3%)         660 (47.5%)
Q:  2 (40.0%)        0 (0.0%)          3 (60.0%)
R:  32620 (73.9%)    5672 (12.9%)      5821 (13.2%)
S:  8146 (59.2%)     2375 (17.3%)      3244 (23.6%)
T:  16521 (38.6%)    15601 (36.5%)     10630 (24.9%)
U:  909 (53.5%)      317 (18.7%)       473 (27.8%)
V:  137 (37.5%)      3 (0.8%)          225 (61.6%)
W:  50 (32.5%)       46 (29.9%)        58 (37.7%)
X:  611 (64.2%)      278 (29.2%)       63 (6.6%)
Y:  221 (35.6%)      173 (27.9%)       226 (36.5%)
Z:  3299 (50.5%)     1791 (27.4%)      1441 (22.1%)
ß:  650 (68.9%)      23 (2.4%)         270 (28.6%)
Ä:  0 (0.0%)         2 (40.0%)         3 (60.0%)
Ö:  1 (3.6%)         26 (92.9%)        1 (3.6%)
Ü:  1 (1.9%)         2 (3.8%)          50 (94.3%)

Now you can guess! :D
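And if you want the computer to do the guessing, here is a small sketch of how the table can be turned into a lookup. It is written for Python 3 and is not part of the original script; the function name (guess_gender) is made up, and only three rows (E, L, M) are copied from the dict.cc table above to keep it short.

```python
# (masculine, feminine, neutral) counts copied from the E, L and M rows of the dict.cc table.
COUNTS = {
    "E": (3249, 63684, 2354),
    "L": (9733, 3733, 6854),
    "M": (3245, 618, 7257),
}
GENDERS = ("Masculine", "Feminine", "Neutral")


def guess_gender(noun):
    """Return (most frequent gender, its share in percent) for the noun's last letter."""
    last = noun[-1]
    key = last if last == "ß" else last.upper()  # "ß".upper() would give "SS"
    if key not in COUNTS:
        return None  # this sketch only knows the three letters above
    counts = COUNTS[key]
    best = max(range(len(counts)), key=counts.__getitem__)
    return GENDERS[best], round(100.0 * counts[best] / sum(counts), 1)


print(guess_gender("Apfel"))  # ('Masculine', 47.9)
print(guess_gender("Blume"))  # ('Feminine', 91.9)
```

Keep in mind that this is only a guess: for "L" the most frequent gender still covers less than half of the nouns.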

--------

BTW, the post "Solving Unicode Problems in Python 2.7" is a great resource on how to deal with Unicode in Python.