Is anyone good with #Rstats and #regex ?

guyjantic@infosec.exchange

Is anyone good with #Rstats and #regex ? I'm having issues.

strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations", "15hz", "triangle 110 hertz", "144Hz, Sine waveform", "It is a hysterical idling. More vibraton than sound.", NA)

I want to replace each string with the digits (well, the first set) found in it, if any. I try this:

sub("(^.*)(\\d{2,5})(.*$)", "\\2", strings)

I get this as a result:

[1] "50"
[2] "70"
[3] NA
[4] "00"
[5] "15"
[6] "10"
[7] "44"
[8] "It is a hysterical idling. More vibraton than sound."
[9] NA

I expect to get all digits (the first set in each string) if they are from 2 to 5 digits long. Instead, I only get 2 digits.

Using similar regex in #geany just to prepare this little example I got the expected behavior. I've updated and restarted R. I've used sub and gsub. Same result. If I specify \d{3,5} I get three digits. If I say \d{1,3} I get one digit. I always get the number of digits specified in the first value in the curly brackets.

Maybe R is just vomiting or something. But if you know of an issue with R and regex that results in this, please let me know.
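
One way to see what is going on is to look at what each capture group actually grabs. The sketch below uses base R's regexec() and regmatches() on one string from the post; the names x, pattern and groups are made up for illustration.

```r
# Inspect what each capture group matches for one input string.
x <- "150 hertz"
pattern <- "(^.*)(\\d{2,5})(.*$)"

# regexec() returns the positions of the full match and of each group;
# regmatches() turns those positions into the matched substrings.
groups <- regmatches(x, regexec(pattern, x))[[1]]
names(groups) <- c("full", "before", "digits", "after")
print(groups)
# The greedy first group takes the leading "1", so the digit group
# is left with only the minimum it needs: "50".
```

The first .* is greedy, so it takes as much as it can while still leaving \d{2,5} its minimum of two digits. That would explain why the result always has exactly as many digits as the lower bound in the curly brackets.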

rastrau@swiss.social

@guyjantic I don’t know if that’s an option, but stringr::str_extract() could be interesting to achieve this (if you don’t mind the dependency)? https://stringr.tidyverse.org/reference/str_extract.html #rstats
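
A minimal sketch of the str_extract() idea, assuming the stringr package is installed (the call is skipped otherwise). The name first_numbers is just for illustration.

```r
strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations",
             "15hz", "triangle 110 hertz", "144Hz, Sine waveform",
             "It is a hysterical idling. More vibraton than sound.", NA)

# str_extract() returns the first match of the pattern in each string,
# or NA when there is no match (or the input itself is NA).
if (requireNamespace("stringr", quietly = TRUE)) {
  first_numbers <- stringr::str_extract(strings, "\\d{2,5}")
  print(first_numbers)
}
```

Because the pattern here is only the digits, there is no surrounding .* to compete with it. Note one difference from sub(): the string without any digits comes back as NA instead of unchanged.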

thatdnaguy@genomic.social

@guyjantic Is it getting confused by the .* on either side of the number pattern?

The . matches digits too.

Maybe something like ^\\s*(\\d{2,5}).*$

guyjantic@infosec.exchange

@rastrau I don't mind tidyverse dependencies at all. I have tried stringr::str_replace() and it gave me exactly the results of sub() and gsub(), but I haven't tried str_extract() yet. I'll give it a shot. Thanks.

jorismeys@mstdn.social

@guyjantic
You need to make that non-greedy. Might be as easy as

sub("(^.*?)(\\d{2,5})(.*?$)", "\\2", strings)

This makes the matches before and after "lazy", meaning they match as few characters as possible.

Edit: I haven't tested it because I'm on my phone right now.
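
A sketch of how the lazy version could be checked. The perl = TRUE argument is an addition here, not something from the thread; it pins the regex engine to PCRE so the lazy quantifier behaves the way it is usually documented.

```r
strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations",
             "15hz", "triangle 110 hertz", "144Hz, Sine waveform",
             "It is a hysterical idling. More vibraton than sound.", NA)

# .*? consumes as little as possible, so the digit group can start at the
# first run of digits and then take up to five of them.
# perl = TRUE is an assumption added for this sketch (PCRE engine).
lazy_result <- sub("(^.*?)(\\d{2,5})(.*?$)", "\\2", strings, perl = TRUE)
print(lazy_result)
```

Under these assumptions the fourth string should give "87" and the first "150". The string without digits does not match at all, so sub() returns it unchanged.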

rastrau@swiss.social

@guyjantic Since the regex seems to swallow the first digit, I suspect “.” in the first group is too generous? Maybe \D (non-digit) would be better? But I’m not a regexpert (alas).
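
A sketch of the \D idea applied to the original call, with only the first group changed. The name non_digit_result is made up for illustration.

```r
strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations",
             "15hz", "triangle 110 hertz", "144Hz, Sine waveform",
             "It is a hysterical idling. More vibraton than sound.", NA)

# \\D* can only consume non-digits, so it has to stop right before the
# first digit and cannot steal any digits from the second group.
non_digit_result <- sub("(^\\D*)(\\d{2,5})(.*$)", "\\2", strings)
print(non_digit_result)
```

One caveat with this variant: if a string has a lone single digit before the number you want (say, a hypothetical "2 tones at 150 hz"), \D* stops at that digit, \d{2,5} cannot match there, and the string comes back unchanged.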

guyjantic@infosec.exchange

@rastrau Hey, that works! Thanks a ton!

rastrau@swiss.social

@guyjantic 🥳 Yay! You’re most welcome.

guyjantic@infosec.exchange

@rastrau I suspect you're more of a regexpert than I am, and your explanation seems plausible.

nxskok@cupoftea.social

@guyjantic if you are happy with just the first numerical thing, readr::parse_number() works really well.
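
A minimal sketch of the parse_number() route, assuming the readr package is installed (the call is skipped otherwise). The name first_values is just for illustration.

```r
strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations",
             "15hz", "triangle 110 hertz", "144Hz, Sine waveform",
             "It is a hysterical idling. More vibraton than sound.", NA)

if (requireNamespace("readr", quietly = TRUE)) {
  # parse_number() returns a numeric vector, not strings. Inputs without
  # a number may trigger a parsing warning, hence suppressWarnings().
  first_values <- suppressWarnings(readr::parse_number(strings))
  print(first_values)
}
```

Keep in mind that this is not limited to runs of 2 to 5 digits: it picks up whatever number it finds first, so it fits best when any first number is acceptable.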

jmkinen@mementomori.social

@guyjantic Is "87" what you want from the fourth string?

If so, making it non greedy seems to work:
sub(".*?(\\d{2,5}).*", "\\1", strings)
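
With sub(), a string that contains no digits is returned unchanged, as the eighth result in the original post shows. If NA is preferred in that case, one possible base R sketch is to extract the match instead of substituting; the names m, has_match and first_digits are made up for illustration.

```r
strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations",
             "15hz", "triangle 110 hertz", "144Hz, Sine waveform",
             "It is a hysterical idling. More vibraton than sound.", NA)

# regexpr() locates the first match in each string (-1 if none, NA for NA).
m <- regexpr("\\d{2,5}", strings)

# Start from all-NA and fill in only the positions that actually matched.
first_digits <- rep(NA_character_, length(strings))
has_match <- !is.na(m) & m > 0
first_digits[has_match] <- regmatches(strings, m)
print(first_digits)
```

This keeps everything in base R and gives NA both for missing inputs and for strings with no 2 to 5 digit run.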
