I used Regex for a long time to prepare and clean text for formatting. In fact, I learned Regex even before I became a sysadmin! (Which was a little funny when I was a junior sysadmin: the junior who knows everything about Regex!)
When it comes to working with text, you always find lots of common mistakes, especially when the text isn't your own! For example, lots of people put a space between a word and the punctuation mark that follows it, e.g. a question mark "?" or an exclamation mark "!".
If you don't know what the heck is wrong with this, let me tell you. When you put a space between a word and a punctuation mark "like this !" and they land at the end of a line with no room left, the application (e.g. a web browser or word processor) will push the punctuation mark onto the next line all by itself!
So we will see something ugly and nonstandard like this:
If there is no space left in the line, this will happen! So, what do you think ?? Oh, a question mark at the beginning of line?!
This should not have happened! And this is why we should not have any space between a word and its punctuation mark, "like this!". In that case, the application shifts the whole word to the next line, not just the punctuation mark.
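To make the fix concrete, here is a minimal sketch in plain Python that removes the space before a punctuation mark. The helper name and the set of punctuation marks are my own assumptions for illustration, not something taken from the macro:

```python
import re

# Hypothetical helper: drop spaces or tabs that sit directly before ? ! . , ; :
SPACE_BEFORE_PUNCT = re.compile(r"[ \t]+([?!.,;:])")


def remove_space_before_punctuation(text):
    # Keep the punctuation mark (group 1) and throw away the whitespace before it
    return SPACE_BEFORE_PUNCT.sub(r"\1", text)


sample = "So, what do you think ?? Oh, like this !"
fixed = remove_space_before_punctuation(sample)
print(fixed)  # So, what do you think?? Oh, like this!
```

With the space gone, the word and its punctuation mark wrap to the next line together.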
Another common mistake: many Turkish people use the Turkish "i" (IPA: "/ɯ/" and "/i/" ... more info at Wikipedia: dotted and dotless I) when they write English, instead of the real English/Latin "i" letter!
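A rough sketch of how this one could be cleaned up in Python, assuming the stray letters are the dotless "ı" (U+0131) and the dotted capital "İ" (U+0130). The function name is made up for this example, and it should only be run on text that is meant to be English, since real Turkish text needs these letters:

```python
# Hypothetical mapping: Turkish-specific letters -> plain Latin ones
TURKISH_I_TO_LATIN = str.maketrans({
    "\u0131": "i",  # LATIN SMALL LETTER DOTLESS I
    "\u0130": "I",  # LATIN CAPITAL LETTER I WITH DOT ABOVE
})


def normalize_turkish_i(text):
    # str.translate swaps each mapped character and leaves the rest alone
    return text.translate(TURKISH_I_TO_LATIN)


print(normalize_turkish_i("Th\u0131s \u0130s a l\u0131ne of Engl\u0131sh text."))
# This Is a line of English text.
```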
Arabic writers have their own share of these mistakes too, and I used to rely on a LibreOffice macro written in "Basic". I didn't write the whole thing, only the Regex part. I realized how super helpful it is! It saved me a huge amount of time whenever I needed to work with a text!
But come on, we are in 2015, who still uses Basic? (Yeah yeah, I know, there are always people who do.) So I decided to write a new one in Python from scratch.
Unlike with Basic, LibreOffice doesn't support Python natively, and everything works through UNO (Universal Network Objects), so the hard part is not Python itself but understanding how Python communicates with LibreOffice. After some research, I figured out how things work and made the Python version.
It's a really simple script as far as pure Python goes, and it is heavily based on Regex, but using Python (or any modern language) makes life easier and gets things done faster. (I also made a standalone version of the script, which is useful for bulky files and the like. It may not be that cool, but it may help someone.)
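To give an idea of what a standalone, LibreOffice-free take on this could look like, here is a small sketch using only Python's built-in "re" module. It is not the standalone script from the repo: the rule list is shortened, the names are invented for this example, and since "re" does not understand \p{script=arabic}, I assume the basic Arabic Unicode block (U+0600 to U+06FF) as a stand-in:

```python
import re

# Assumption: the basic Arabic block stands in for \p{script=arabic}
ARABIC = "[\u0600-\u06FF]"

# Hypothetical, shortened rule list; rules are applied in order
RULES = [
    ("(" + ARABIC + ") ?;", "\\1\u061B"),  # Latin ";" after Arabic -> Arabic semicolon
    ("(" + ARABIC + ") ?,", "\\1\u060C"),  # Latin "," after Arabic -> Arabic comma
    (r"\( +", "("),                        # no spaces after "("
    (r" +\)", ")"),                        # no spaces before ")"
    (r"^ +| +$", ""),                      # trim each line
    (r" {2,}", " "),                       # collapse runs of spaces
    (" ([:.!\u061B\u060C\u061F])", r"\1"), # no space before punctuation
    ("\u0640", ""),                        # remove tatweel (kashida)
]


def fix_text(text):
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text


print(fix_text("( some  text ) with extra spaces !"))
# (some text) with extra spaces!
```

For a bulky file you would read it in, pass the content through fix_text, and write the result back out.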
As you can see, I use it with Arabic, but of course you can use any Regex patterns. You will find the full version and documentation at the GitHub repo:
LibreOffice macros (Basic and Python) to fix common mistakes
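# -*- coding: utf-8 -*-
import uno


def FixCommonArabicMistakes():
    # Search patterns use LibreOffice's regex syntax; "$1" refers to the first group.
    # Raw strings keep the backslashes intact, and a list of pairs guarantees
    # that the rules run in the order they are written.
    replaceList = [
        # Latin semicolon/comma after Arabic text -> Arabic semicolon/comma
        (r"(\p{script=arabic}\W?)([ ]?;)", "$1؛"),
        (r"(\p{script=arabic}\W?)([ ]?,)", "$1،"),
        # No spaces just inside parentheses
        (r"\([ ]+", "("),
        (r"[ ]+\)", ")"),
        # Blank lines, leading/trailing spaces, runs of spaces
        (r"^[ ]+$", ""),
        (r"^[ ]+", ""),
        (r"[ ]+$", ""),
        (r"[ ]+", " "),
        # No space before punctuation marks
        (r" :", ":"),
        (r" ؛", "؛"),
        (r" ،", "،"),
        (r" \.", "."),
        (r" !", "!"),
        (r" ؟", "؟"),
        # No space after the conjunction "و"
        (r" و ", " و"),
        (r"^و ", "و"),
        # Remove tatweel (kashida)
        (r"ـ", ""),
    ]

    # XSCRIPTCONTEXT is provided by LibreOffice when the macro runs
    currentDoc = XSCRIPTCONTEXT.getDocument()
    findAndReplace = currentDoc.createReplaceDescriptor()
    findAndReplace.SearchCaseSensitive = True
    findAndReplace.SearchRegularExpression = True

    for searchString, replaceString in replaceList:
        findAndReplace.SearchString = searchString
        findAndReplace.ReplaceString = replaceString
        currentDoc.replaceAll(findAndReplace)

    return None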
Have a nice day... and always watch out for typos and common mistakes :-D