Day 90 - Goodbye Capture Groups, Hello Word Boundaries
18 May 2017 100daysofwriting · regex · software · work
There was an interesting Regex problem to solve today: insert commas in a number according to the Indian numbering system. There’s an inherent asymmetry in the Indian system: the last three digits form one group, and every group to the left of that has only two digits.
The number: 1 million
Indian: Rs. 10,00,000
American: Rs. 1,000,000
The first approach that I took was to use capture groups. Capture the last three digits, put a comma in front of them, then the next two and so on. This would have worked if all the numbers that this replacement needed to happen on were of the same number of digits. That kind of thing never happens in real life.
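To make the problem with fixed capture groups concrete, here is a minimal sketch. The helper name fixedGroupCommas and the exact pattern are hypothetical; they reconstruct the idea described above and are not the code that was actually tried.

```javascript
// Hypothetical reconstruction: hard-code groups of 2, 2 and 3 digits
// at the end of the string and join them with commas.
function fixedGroupCommas(str) {
  return str.replace(/(\d{2})(\d{2})(\d{3})$/, "$1,$2,$3");
}

console.log(fixedGroupCommas("1000000"));  // "10,00,000"  (7 digits: works)
console.log(fixedGroupCommas("100000"));   // "100000"     (6 digits: no match, no commas)
console.log(fixedGroupCommas("12345678")); // "123,45,678" (8 digits: wrong grouping)
```

The pattern only fits one length of number, so every other length either gets no commas at all or gets them in the wrong places.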
Scouring the internet, I eventually landed on the Regex used to insert commas according to the American system. The regex looked like this:
str.replace(/\B(?=(\d{3})+(?!\d))/g, ",")
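Here is a small sketch of that regex in action. The wrapper name americanCommas is made up for illustration, and the input is assumed to be a string containing only digits.

```javascript
// Hypothetical wrapper around the regex quoted above.
function americanCommas(str) {
  // \B                  : a position that is NOT a word boundary,
  //                       so never the very start of the number
  // (?=(\d{3})+(?!\d))  : followed by a multiple of three digits,
  //                       and then no further digit
  return str.replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}

console.log(americanCommas("1000000")); // "1,000,000"
console.log(americanCommas("1234"));    // "1,234"
console.log(americanCommas("999"));     // "999"
```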
I know a fair amount of Regex from having used sed inside and outside vim regularly to replace text. This regex had a lot of things that I didn’t recognize: \B, ?=, ?!, and also some parentheses which weren’t working as capture groups and a comma appearing in the middle of two characters. I had never seen that before.
Then, I found this website which has incredibly detailed articles about all the operators that Regex supports and the things that can be accomplished with it.
Armed with the knowledge of what “Not a word boundary” (\B) and a lookahead assertion (?=) mean, I delved into finding a Regex that would solve the problem.
The solution turned out to be a two-step replacement:
str.replace(/\B(?=\d{3}$)/g, ",")
.replace(/\B(?=(\d{2})+(?!\d),)/g, ",")
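The first regex finds the one position that is NOT a word boundary and is followed by exactly 3 digits and then the end of the string, and puts a comma there.
The second regex looks for positions that are NOT word boundaries and are followed by one or more complete pairs of digits and then the comma that the first step inserted. That puts a comma after every two digits, counting leftwards from the first comma.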
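Wrapped in a function, the two steps can be tried on numbers of different lengths. The name indianCommas is hypothetical, and the sketch assumes the input is a string of digits only, since the first regex is anchored to the end of the string.

```javascript
// Hypothetical wrapper around the two-step replacement above.
// Assumes `str` contains only digits, e.g. "1000000".
function indianCommas(str) {
  return str
    // Step 1: comma before the last three digits.
    .replace(/\B(?=\d{3}$)/g, ",")
    // Step 2: comma before every run of digit pairs that ends
    // at the comma inserted in step 1.
    .replace(/\B(?=(\d{2})+(?!\d),)/g, ",");
}

console.log(indianCommas("100"));      // "100"
console.log(indianCommas("1000"));     // "1,000"
console.log(indianCommas("100000"));   // "1,00,000"
console.log(indianCommas("1000000"));  // "10,00,000"
console.log(indianCommas("12345678")); // "1,23,45,678"
```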
All of this magic (not really) is possible because lookahead assertions don’t consume characters. Consuming characters basically means that the regex pointer moves forward past each character it matches. A lookahead assertion keeps the pointer at the position being considered for a match, simply looks ahead, and reports whether the text that follows is what it was asked to check for.
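A quick way to see the difference is to run the same replacement with and without the lookahead. This is only a sketch, and the variable names are made up for illustration.

```javascript
// With a lookahead the match is an empty string at a position,
// so replace() inserts the comma and overwrites nothing.
const withLookahead = "1000".replace(/(?=\d{3}$)/, ",");
console.log(withLookahead); // "1,000"

// Without the lookahead the three digits are consumed,
// so replace() swaps them out for the comma.
const withoutLookahead = "1000".replace(/\d{3}$/, ",");
console.log(withoutLookahead); // "1,"

// The lookahead match itself is zero characters long.
const match = /(?=\d{3}$)/.exec("1000");
console.log(match[0].length); // 0
console.log(match.index);     // 1
```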
I would love to have a look at how this is implemented in sed or even C++ for Node, but I am unreasonably sure I wouldn’t understand the implementation. This “looking at the implementation and not understanding it” will become a post inspiration in the future.
P.S. All the code above is JavaScript, duh.
POST #90 is OVER