Is anyone good with #Rstats and #regex ?
-
Is anyone good with #Rstats and #regex ? I'm having issues.
strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations", "15hz", "triangle 110 hertz", "144Hz, Sine waveform", "It is a hysterical idling. More vibraton than sound.", NA)
I want to replace each string with the digits (well, the first set) found in it, if any. I try this:
sub("(^.*)(\\d{2,5})(.*$)", "\\2", strings)
I get this as a result:
[1] "50"
[2] "70"
[3] NA
[4] "00"
[5] "15"
[6] "10"
[7] "44"
[8] "It is a hysterical idling. More vibraton than sound."
[9] NA
I expect to get all digits (the first set in each string) if they are from 2 to 5 digits long. Instead, I only get 2 digits.
Using similar regex in #geany just to prepare this little example I got the expected behavior. I've updated and restarted R. I've used sub and gsub. Same result. If I specify \d{3,5} I get three digits. If I say \d{1,3} I get one digit. I always get the number of digits specified in the first value in the curly brackets.
Maybe R is just vomiting or something. But if you know of an issue with R and regex that results in this, please let me know.
@guyjantic I don’t know if that’s an option, but stringr::str_extract() could be interesting to achieve this (if you don’t mind the dependency)? https://stringr.tidyverse.org/reference/str_extract.html #rstats
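For anyone reading along, here is a minimal sketch of the str_extract() suggestion, assuming stringr is installed. The pattern "\\d{2,5}" and the variable name first_digits are illustrative choices, not something the posts above spell out.

```r
library(stringr)

strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations",
             "15hz", "triangle 110 hertz", "144Hz, Sine waveform",
             "It is a hysterical idling. More vibraton than sound.", NA)

# str_extract() returns the first match of the pattern in each string,
# so no surrounding ".*" groups are needed.
first_digits <- str_extract(strings, "\\d{2,5}")
first_digits
# "150" "70" NA "87" "15" "110" "144" NA NA
```

Unlike sub(), which returns a string unchanged when nothing matches, str_extract() gives NA for strings that contain no run of 2 to 5 digits.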
-
@guyjantic Is it getting confused by .*<number stuff>.*?
The . also matches digits.
Maybe something like ^\\s*(\\d{2,5}).*$
-
@rastrau I don't mind tidyverse dependencies at all. I have tried stringr::str_replace() and it gave me exactly the results of sub() and gsub(), but I haven't tried str_extract() yet. I'll give it a shot. Thanks.
-
@guyjantic
You need to make that non-greedy. Might be as easy as
sub("(^.*?)(\\d{2,5})(.*?$)", "\\2", strings)
This makes the matches before and after "lazy", meaning they match as few characters as possible.
Edit: I haven't tested it because I'm on my phone right now.
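A rough sketch of the lazy version, for reference. Note that perl = TRUE is an addition here, not part of the suggestion above: it asks R to use the PCRE engine, where the behaviour of lazy quantifiers is well defined. The step that turns non-matching strings into NA is likewise an assumption about what is wanted.

```r
strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations",
             "15hz", "triangle 110 hertz", "144Hz, Sine waveform",
             "It is a hysterical idling. More vibraton than sound.", NA)

# The lazy "^.*?" stops at the first place where \d{2,5} can match,
# and the greedy \d{2,5} then takes the whole run of digits.
lazy <- sub("(^.*?)(\\d{2,5})(.*?$)", "\\2", strings, perl = TRUE)

# sub() leaves strings with no match unchanged, so blank those out explicitly.
lazy[!grepl("\\d{2,5}", strings)] <- NA
lazy
# "150" "70" NA "87" "15" "110" "144" NA NA
```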
-
@guyjantic Since the regex seems to swallow the first digit, I suspect “.” in the first group is too generous? Maybe \D (non-digit) would be better? But I’m not a regexpert (alas).
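To check that hunch, one option is to print all three groups of the original pattern and then try the \D idea. This is only a sketch: the bracketed replacement string and the exact \D pattern are illustrative choices rather than anything tested in the posts above.

```r
strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations",
             "15hz", "triangle 110 hertz", "144Hz, Sine waveform",
             "It is a hysterical idling. More vibraton than sound.", NA)

# Show where each character ends up with the original pattern.
groups <- sub("(^.*)(\\d{2,5})(.*$)", "[\\1][\\2][\\3]", "150 hertz")
groups
# "[1][50][ hertz]": the greedy ".*" keeps the "1" and leaves only
# the two-digit minimum for the second group.

# Using \D for the prefix means it can no longer swallow digits.
digits <- sub("^\\D*(\\d{2,5}).*$", "\\1", strings)
digits
```

With this pattern the string that contains no digits still comes back unchanged, because sub() only rewrites strings that match.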
-
@rastrau Hey, that works! Thanks a ton!
-
@guyjantic 🥳 Yay! You’re most welcome.
-
@rastrau I suspect you're more of a regexpert than I am, and your explanation seems plausible.
-
@guyjantic If you are happy with just the first numerical thing, readr::parse_number() works really well.
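A small sketch of that route, assuming the readr package is installed. The variable name numbers is illustrative. Two differences from the regex approaches are worth keeping in mind: the result is numeric rather than character, and there is no 2 to 5 digit restriction. readr may also report a parsing failure for entries that contain no number at all.

```r
library(readr)

strings <- c("150 hertz", "70 hz", NA, "between 87 and 100 hz ocillations",
             "15hz", "triangle 110 hertz", "144Hz, Sine waveform",
             "It is a hysterical idling. More vibraton than sound.", NA)

# parse_number() drops the non-numeric text around the first number it finds
# and returns a double vector.
numbers <- parse_number(strings)
numbers
```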
-
@guyjantic Is "87" what you want from the fourth string?
If so, making it non-greedy seems to work:
sub(".*?(\\d{2,5}).*", "\\1", strings)