The Urdu language has 46 letters, 10 digits, 6 punctuation marks, and 6 diacritic characters.

The Urdu alphabet is the right-to-left alphabet used for the Urdu language. It is a modification of the Persian alphabet (also known as Perso-Arabic), which is itself a derivative of the Arabic alphabet. The Urdu alphabet has up to 58 letters, of which 39 are basic letters, and it has no distinct letter cases. It is typically written in the calligraphic Nastaʿlīq script.

What is Encoding

Character encoding is the assignment of a unique number to each character of a language so that it can be processed by a computer. Whenever a character is entered from a keyboard or another input device, its code is generated internally in the computer. An arbitrary encoding can be defined for any application (e.g. 80 for the letter ‘a’, 81 for the letter ‘b’). However, if different vendors define their own arbitrary encodings, those encodings may not agree with one another. With the advent of the Internet, standardizing the encoding scheme has become essential, because users access data created by a variety of sources through a single application, the web browser. The significance of a standard encoding was recognized early for English: the American Standard Code for Information Interchange (ASCII) was defined in 1968 by the American National Standards Institute (ANSI). This standard defines 128 slots using 7 bits.
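As a small illustration of the idea, Python's built-in ord and chr expose the number assigned to each character. This is only a sketch using the standard library; the variable names are chosen for this example and are not part of any library.

```python
# Each character maps to a number (its code). ASCII uses 7 bits, so codes run from 0 to 127.
letter = "a"
code = ord(letter)                    # numeric code assigned to 'a'
back = chr(code)                      # map the number back to the character
ascii_bytes = "abc".encode("ascii")   # one byte per character

print(code, back, list(ascii_bytes))

# Characters outside the 128 ASCII slots cannot be encoded as ASCII.
try:
    "\u06A9".encode("ascii")          # ک ARABIC LETTER KEHEH
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(fits_in_ascii)
```

The failed encode at the end is the practical reason a wider standard than ASCII is needed for Urdu text.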

What is Unicode

Initially, most documents were written in a single language, so 8-bit single-language code pages served the need. However, in the 1990s, with the increasing need for multilingual documents (where one could require Japanese and Arabic in the same document), it was realized that defining 8-bit code pages was not a scalable solution. Adding code pages for various languages and scripts and using them together in one application made processing difficult and complex, because users had to keep toggling between them.

To address this issue, major vendors came together and created the Unicode Consortium (www.unicode.org). The consortium set out to develop a single, unified, universal code chart containing all characters of all languages. As 8-bit code pages (256 slots) were insufficient for this requirement, the Unicode character encoding standard was initially developed using 16 bits (65,536 slots); later versions extended the code space beyond 16 bits, up to U+10FFFF. This space is divided among the various scripts, which removes the need to toggle between code pages for different languages.
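To make the "one code chart for all scripts" idea concrete, the following sketch puts Japanese and Arabic characters in a single Python string and lists their code points. The sample text and variable names are assumptions made for this example.

```python
# One string can mix scripts because every character has its own code point.
text = "\u65E5\u672C \u0639\u0631\u0628\u064A"  # "日本 عربي" (Japanese and Arabic)

code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)

# The same code points can be serialized to bytes in different ways.
utf8_len = len(text.encode("utf-8"))
utf16_len = len(text.encode("utf-16-le"))
print(len(text), utf8_len, utf16_len)
```

No code page switching is involved: the Japanese and Arabic characters simply occupy different parts of the same code space.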

Urdu Unicode Range (0600-06FF)

The Unicode Arabic block is a superset of the characters used for Urdu, Persian, and Sindhi. Urdu, Arabic, Persian, and Sindhi all occupy the same range, i.e. 0600-06FF, but some characters that look alike across these languages have different code points.
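A minimal sketch of what sharing a range means in practice: both the Arabic kaf and the kaf normally used in Urdu fall inside 0600-06FF, yet they are different code points. The helper name in_arabic_block is hypothetical and introduced only for this example.

```python
ARABIC_BLOCK = range(0x0600, 0x0700)  # 0600-06FF inclusive


def in_arabic_block(ch: str) -> bool:
    """Return True if the single character ch lies in the 0600-06FF range."""
    return ord(ch) in ARABIC_BLOCK


arabic_kaf = "\u0643"  # ك ARABIC LETTER KAF
urdu_kaf = "\u06A9"    # ک ARABIC LETTER KEHEH, the form used in Urdu text

for ch in (arabic_kaf, urdu_kaf):
    print(f"U+{ord(ch):04X}", in_arabic_block(ch))

print(arabic_kaf == urdu_kaf)
```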

!pip install 'urduhack[tf]'
Installing collected packages: urduhack
Successfully installed urduhack-1.1.1
from urduhack.urdu_characters import URDU_ALPHABETS, URDU_DIGITS, URDU_PUNCTUATIONS, URDU_DIACRITICS, URDU_ALL_CHARACTERS
print(URDU_ALPHABETS, end=' ')
frozenset({'گ', 'ز', 'ٹ', 'ص', 'س', 'ت', 'ح', 'ک', 'خ', 'ڈ', 'ر', 'ف', 'ھ', 'ج', 'چ', 'ؤ', 'ئ', 'ے', 'ق', 'ب', 'ڑ', 'ط', 'ل', 'ث', 'ۓ', 'ذ', 'ژ', 'ء', 'پ', 'آ', 'ض', 'ی', 'ن', 'ۂ', 'أ', 'ش', 'غ', 'م', 'ا', 'د', 'ہ', 'ظ', 'ۃ', 'ں', 'ع', 'و'}) 
print(URDU_DIGITS, end=' ')
frozenset({'۱', '۵', '۶', '۰', '۳', '۴', '۲', '۸', '۹', '۷'}) 
print(URDU_PUNCTUATIONS, end=' ')
frozenset({'؛', '؟', '٪', '،', '٫', '۔'}) 
print(URDU_DIACRITICS, end=' ')
frozenset({'ٍ', 'ِ', 'ً', 'ٰ', 'ُ', 'َ'}) 
print(URDU_ALL_CHARACTERS)
frozenset({'گ', 'ز', 'ؔ', '۱', 'ؑ', 'ح', '۴', 'ؓ', 'ر', '؛', 'ھ', '\u0601', 'ج', 'ب', '؍', 'ل', 'ث', 'ۓ', 'ذ', 'ژ', 'ً', 'آ', 'ؕ', 'ض', 'ی', 'ن', '۰', '۸', 'غ', 'ا', 'ٔ', '؏', '۲', '\u0602', '٬', 'ظ', 'ۃ', '۹', '۔', 'ُ', 'ٹ', 'ص', 'س', 'ٌ', '٘', 'ت', 'ک', 'خ', 'ڈ', 'ْ', 'ف', '\u0600', 'ٗ', '،', '۵', 'چ', 'ؤ', 'ئ', 'ے', 'ق', '٫', 'ٖ', 'ط', 'ڑ', '۷', 'ٍ', 'ِ', 'ء', 'پ', '۶', 'ّ', 'ؐ', '۳', 'ۂ', 'ؒ', 'أ', 'ش', 'ٰ', 'م', 'د', 'ٓ', '؟', '٪', '\u0603', 'ہ', '؎', 'ں', 'ع', 'َ', 'و'})

Urdu vs Arabic Presentation Form Characters Challenge

Unicode supports the Urdu language, but there is a problem we have to handle in order to use that support. Because Urdu is written in a script derived from Arabic, Urdu characters are included in the Arabic block of the Unicode table. This makes things a little complicated for developers building applications for the Urdu language.

For example, consider the word "خاموشی". If we look at the underlying codes for this word, we can find two different sets of codes from the Unicode table.
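The two code sequences can be inspected directly. The sketch below writes the word twice: once ending in the yeh normally used in Urdu and once ending in the Arabic yeh. Which alternative sequence the original comparison showed is not stated, so the Arabic-yeh variant here is an assumption made for illustration, as is the helper name describe.

```python
import unicodedata

# The same visible word, stored with two different final letters.
urdu_word = "\u062E\u0627\u0645\u0648\u0634\u06CC"    # خاموشی ending in U+06CC
arabic_word = "\u062E\u0627\u0645\u0648\u0634\u064A"  # ending in U+064A instead


def describe(word: str):
    """List (code point, Unicode name) pairs for every character in word."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]


for label, word in (("urdu", urdu_word), ("arabic", arabic_word)):
    print(label, describe(word))

# The two strings look alike but do not compare equal.
print(urdu_word == arabic_word)
```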

Now the problem is: which codes should we train our model on? If we train our model on a specific range (Urdu, 0600-06FF) and our dataset contains some words formed using the Arabic set of codes, our application will fail to recognize those words, resulting in low accuracy. This redundancy in the codes of words prevents us from achieving high accuracy.

So how do we handle this issue? Refer back to the Urdu Unicode Range section above: standard Urdu text uses code points from the range 0600-06FF. All we need to do is pre-process the data before running any algorithm on it. For each word in the data that contains redundant codes, we replace it with the equivalent standardized Urdu word whose characters all belong to the range 0600 to 06FF. That's it!
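The pre-processing step described above can be sketched with the Python standard library. This is not a complete or authoritative normalizer: the mapping table ARABIC_TO_URDU is a small illustrative subset chosen for this example, and normalize_urdu is a hypothetical helper name. NFKC normalization is used here to fold presentation-form characters back to their base letters; note that NFKC also changes other compatibility characters, which may or may not be wanted for a given dataset.

```python
import unicodedata

# Illustrative subset only (an assumption, not an exhaustive table):
# Arabic look-alike letters mapped to the letters Urdu text normally uses.
ARABIC_TO_URDU = str.maketrans({
    "\u0643": "\u06A9",  # ك KAF          -> ک KEHEH
    "\u064A": "\u06CC",  # ي YEH          -> ی FARSI YEH
    "\u0649": "\u06CC",  # ى ALEF MAKSURA -> ی FARSI YEH
    "\u0647": "\u06C1",  # ه HEH          -> ہ HEH GOAL
})


def normalize_urdu(text: str) -> str:
    """Fold presentation forms to base letters, then map Arabic look-alikes."""
    text = unicodedata.normalize("NFKC", text)
    return text.translate(ARABIC_TO_URDU)


raw = "\u062E\u0627\u0645\u0648\u0634\u064A"  # خاموشي typed with the Arabic yeh
clean = normalize_urdu(raw)
print([f"U+{ord(ch):04X}" for ch in clean])
```

Running the same function over every token before training means the model sees a single code sequence per word.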

Urdu Characters Shapes

Urdu characters take different forms depending on their position within a word. A character at the start of a word has one shape, and the same character in the middle or at the end of a word has a different shape. This concerns only the glyph (font shape) used for the character. For illustration, take the Urdu character "ﻑ" and notice how its shape differs in the following words.

  • Used at the Start: "آفاق"
  • Used in the middle: "مفاحمت"
  • Used at the End: "کیف"
  • Isolated Use: "موصوف"

As you can see, "ﻑ" takes on a different shape depending on the position in which it is used.
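The four shapes listed above also exist as separate presentation-form code points in Unicode. The sketch below lists them for this letter and checks that each one folds back to a single base letter under NFKC normalization. It assumes the letter stored in ordinary text is U+0641; the variable names are illustrative.

```python
import unicodedata

FEH = "\u0641"  # ف the base letter normally stored in text

# Presentation-form code points for this letter (U+FED1 to U+FED4).
forms = [chr(cp) for cp in range(0xFED1, 0xFED5)]
names = [unicodedata.name(ch) for ch in forms]

for ch, name in zip(forms, names):
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X}", name, "-> base letter:", folded == FEH)
```

In normal text only the base letter is stored, and the font selects the shape for each position.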

Urdu/Arabic Character Presentation Fonts

Now, to better understand the section above, let's look at the Unicode range for combined characters. These combined characters are given a separate Unicode range. This range was defined for illustration only: it shows how two characters appear when they are combined. It has more to do with the position in which characters are used than with the context in which they are used. In the Arabic "Qaida", for teaching purposes, learners are shown how two characters like "ل" and "ح" appear when combined: "لح". This combined form "لح" is given its own Unicode code point, FC40. Hardly any keyboard or system uses these combined characters, so the range serves only to show the different presentation forms of a character. For further illustration, see the links below.
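The relationship between the combined character and its two component letters can be checked with Python's unicodedata module. This is a sketch for inspection only; the variable names are assumptions made for this example.

```python
import unicodedata

ligature = "\uFC40"  # the combined form mentioned above

# The decomposition field records which base letters the ligature stands for.
decomposition = unicodedata.decomposition(ligature)

# NFKC normalization replaces the ligature with those base letters.
parts = unicodedata.normalize("NFKC", ligature)

print(unicodedata.name(ligature))
print(decomposition)
print([f"U+{ord(ch):04X}" for ch in parts])
```

After normalization the text contains only the two ordinary letters, both inside the 0600-06FF range.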

Comparison of Unicode values of Urdu and Arabic characters

Thanks to (https://github.com/urdutext/UrduArabicCompare)
CSV file (https://github.com/urduhack/urdu-characters/blob/master/img/Urdu_Arabic_Unicode_comparison.csv)