GitHub - Kensuke-Mitsuzawa/Python_viraster: Python implementation Persian Normalization library Virastar

What's this?

Character mapping script from arabic to latin. This original style does not have same mapping. In other words, all mappings are the one-to-one correspondances. Thus, this script can map without loosing the original character information.

For latin characters, I used Latin Supplement, Latin Extended-A, Latin Extended-B characters. To make available a keyboard that includes Latin Supplement, Latin Extended-A, Latin Extended-B, please see http://symbolcodes.tlt.psu.edu/accents/codemacext.html

Usage

To get romanized characters, codes like below in if __name__=='__main__'

ex_sent=u'من کتاب دادم';
ins=transliter(ex_sent);
print ins.main();

If you wanna get Arabic sequence form unicode sequence, use

ex_sent_2=u'mn tw ra dÝdm';
ins_2=transliter(ex_sent_2);
print ins_2.unicode_to_arabic();

Converting table

Note: I recommend to remove all diacritics before you convert to latin characters. To remove them, please use `def clean_up()`

This class converts arabic characters to ratin characters. Not all characters are supported, but basic characters are completly converted. I think that enought to express normal writing style.

In the function original_unicode(), character mapping chart is below.(2013/7/23) See also: http://jrgraphix.net/research/unicode_blocks.php?block=12

Arabic_character	Arabic_character_commonly_called	unicode_number(Hex)	mapped_ratin
ا	alef	U+0627	a
آ	alef_madda_above	U+0622	ā
ب	be	U+06FF	b
پ	pe	U+067E	p
ت	te	U+062A	t
ث	se	U+062B	ç
ج	jim	U+062C	j
چ	ˇce	U+0686	č
ح	he	U+062D	ħ
خ	xe	U+062E	x
د	dal	U+062F	d
ذ	zal	U+0630	đ
ر	re	U+0631	r
ز	ze	U+0632	z
ژ	ˇze	U+0698	ž
س	sin	U+0633	s
ش	ˇsin	U+0634	š
ص	sad	U+0635	ş
ض	zad	U+0636	ź
ط	ta	U+0637	ţ
ظ	za	U+0638	ẓ
ع	'eyn	U+0639	'
غ	qeyn	U+063A	q
ف	fe	U+0641	f
ق	qaf	U+0642	ŕ
ک	persian-kaf	U+06A9	K
	arabic-kaf	U+0643	K
گ	gaf	U+06AF	g
ل	lam	U+0644	l
م	mim	U+0645	m
ن	nun	U+0646	n
و	vav	U+0645	w
ه	he	U+0647	e
ۀ	he_yeh_above	U+06C0	X
ی	persian-ye	U+06CC	y
ي	arabic-ye	U+064A	Y
ى	arabic_alef_maksusa	U+0649	Ý
،	arabic_comma	U+060C	,
؛	arabic_semicolon	U+061B	;
؟	arabic_question	U+061F	?
ة	arabic_heh_hamza_above	U+0629	T
٪	arabic_percent	U+066A	%
	Zero-Width-Non-Joiner	U+200C	_
	arabic_hamza_above	U+0654	ú
	arabic_hamza_below	U+0655	E
	arabic_alef_hamza_above	U+0623	á
	arabic_hamza	U+0621	°
٫	arabic_decimal_separator	U+066B	⎖

Update

v0.01: First Version
v0.02: fixed bug that ث س have the same mapping 'se'. changed ث to se¥_1 and س to se¥_2.
v0.03: added 'ARABIC_ALEF_MAKSUSA' and its mapping to 'Ý'. 'PERSIAN_YEH' and its mapping is changed to 'y'.
v0.04: changed a mapping of 'persian_kaf' to 'k'. The mappig 'ه' to 'h' is disused. Instead, 'e' is used for 'ه'. 'arabic_fathatan' is disused.
v0.05: Added new map for arabic tatweel and arabic decimal separator. An arabic tatweel is \u0640 in unicode and is reffered as keyname 'arabic_tatweel' in code. An arabic decimal separator is \u066b in unicode, and is reffered as keyname 'arabic_decimal_separator' in code.
v0.06: some character + diacritic is added. u+0625, u+0629, u+06c0, u+0695, u+0624
v0.07: arabic fathatan is added
v0.08: A bug that clean_up function does not work solved.
v0.09: ۀ is added
v0.10: Per_Per2 module is added with in this system. About Per_per2.rb, please look at here

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
debug		debug
virastar		virastar
.gitignore		.gitignore
README.md		README.md
do_files_in_directory.sh		do_files_in_directory.sh
for_dataset.py		for_dataset.py
pre_per2.rb		pre_per2.rb
test_document_normalized		test_document_normalized
translitr.py		translitr.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

What's this?

Usage

Converting table

Note: I recommend to remove all diacritics before you convert to latin characters. To remove them, please use `def clean_up()`

Update

About

Uh oh!

Releases

Packages

Languages

Kensuke-Mitsuzawa/Python_viraster

Folders and files

Latest commit

History

Repository files navigation

What's this?

Usage

Converting table

Note: I recommend to remove all diacritics before you convert to latin characters. To remove them, please use def clean_up()

Update

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Note: I recommend to remove all diacritics before you convert to latin characters. To remove them, please use `def clean_up()`

Packages