Day 90 - Goodbye Capture Groups, Hello Word Boundaries

There was an interesting Regex problem to solve today. Insert commas in a number according to the Indian numbering system. There’s an inherent asymmetry in the Indian system

The number: 1 million

Indian: Rs. 10,00,000 American: Rs. 1,000,000

The first approach that I took was to use capture groups. Capture the last three digits, put a comma in front of them, then the next two and so on. This would have worked if all the numbers that this replacement needed to happen on were of the same number of digits. That kind of thing never happens in real life.

Scouring the internet, I eventually landed on the Regex used to insert commas according to the American system. The regex looked like this:

str.replace(/\B(?=(\d{3})+(?!\d))/g, ",")

I know a fair amount of Regex from having used sed inside and outside vim regularly to replace text. This regex had a lot of things that I didn’t recognize. \B, ?=, ?! and also some parentheses which weren’t working as group capturers and a comma appearing in the middle of two characters. I had never seen that before.

Then, I found this website which has incredibly detailed articles about all the operators that Regex supports and the things that can be accomplished with it.

Armed with the knowledge of what “Not a word boundary \B” and Lookahead assertion using ?= means, I delved into finding a Regex that would solve the problem.

The solution turned out to be a two step:

str.replace(/\B(?=\d{3}$)/g, ",")
   .replace(/\B(?=(\d{2})+(?!\d),)/g, ",")

The first one finds the first NOT a word boundary which is succeeded by 3 digits and the end of the string

The second regex looks for NOt a word boundaries that have exactly two digits after them and stop matching when a comma is encountered.

All of this magic (not really.) is possible because Lookahead assertions don’t consume characters. Consuming characters basically means that the regex pointer moves forward from one character to the other after looking at it. Lookahead assertions keep the pointer at the character being considered for a match and simply lookahead and assert or not assert the stuff they have been asked to.

I would love to have a look at how this is implemented in sed or even C++ for Node, but I am unreasonably sure I wouldn’t understand the implementation. This “looking at the implementation and not understanding it” will become a post inspiration in the future.

P.S. All the code above is JavaScript, duh.

POST #90 is OVER