Sunday, January 6, 2013

Search and replace: Are you tired of nested `ifelse`?

It happens all the time: you have a vector of fruits and you want to replace all bananas with apples, all oranges with pineapples, and leave all the other fruits as-is, or maybe change them all to figs. The usual solution? A big old nested `ifelse`:

Ok, that didn't look too bad, especially with the code and fruits nicely aligned. But what if I had a lot of fruits to change and little patience? Wouldn't it be nice if R had a built-in function for doing multiple search and replace? Someone please tell me if there is already such a function. If not, here is one I wrote that builds a nested `ifelse` function by recursion:

Note that I named the function after the `decode` SQL function. Here are a couple examples:

Feel free to use it with your favorite fruits or vegetables! Cheers!

P.S.: I wrote this function as an answer to this S.O. question. Thank you to Matthew Lundberg for sharing ideas.

Photo source: http://www.istockphoto.com/stock-photo-19534475-mixed-fruit.php

22 comments:

  1. Hi There,
    Cool post.
    The image was removed from r-bloggers since you didn't give any copyright credits. Please do so in the future.

    Cheers,
    Tal

    ReplyDelete
    Replies
    1. Tal, sorry about that. It is a stock photo I got for a small fee and I have added the source at the bottom of the post.

      Delete

    2. •★INTEGRATED HACKS★•


      Are You Seeking For A LEGIT PROFESSIONAL HACKER Who Will Get Your Job Done Efficiently With Swift Response?? CONGRATULATIONS, Your Search Ends Right Here.

      ★ ABOUT US
      • We are a Team Of Professional HACKERS , a product of the coming together of renowned Hackers from the Dark-Web (pentaguard, CyberBerkut, Grey Hat and Black Hat,)that have seen how data and information is been stolen and spoofed and are willing to help the helpless. We have been existing for over 8 years, our system is a very strong and decentralized command structure that operates on ideas and directives.

      ★ JOB GUARANTEE:
      Whenever We Are being hired, We typically only take jobs that We find somewhat original, challenging, or especially helpful to the community. We’ve never wanted to sit around defending some video game company’s source code from network intruders – We prefer to help nonprofits, private investigators, Private Individuals, government contractors, and other traditionally underserved populations.
      And We’d rather match skills against the best in the field of state-sponsored hackers engaged in economic espionage than put some kid in prison for pranking the phone company. When a company tries to hire Us, the first question we ask is: “Who is this going to help?”
      We know INTEGRATED HACKS is Well known for LEGIT HACKING SERVICES, but we always try to make people know that INTEGRATED HACKS isn't just open to big firms, any individual desiring cyber services can contact us via: "integratedhacks@protonmail.com" You Can Reach Out To Us for Your Desired HACKING Services Ranging from:
      * Penetration Testing
      * Jail Breaking
      * PHONE HACKING (Which gives you Unnoticeable Access to Everything that is Happening on the phone such as call logs, messages, chats and all social media Apps .
      * Retrieval Of Lost Files
      * Location Tracking.
      * Clearing Of Criminal Records.
      * Hacking Of Server, Database And Social Media accounts e.g Facebook, twitter, Instagram Snapchat etc

      ★ SOME SPECIAL SERVICES WE OFFER:
      * RECOVERY OF LOST FUNDS ON BINARY OPTIONS.
      * Bank Accounts Loading ( Only USA Banks)
      * Credit Cards Loading (Only USA CC’s)’

      ★Our Team houses a separate group of specialists who are productively focused and established authorities in different platforms. They hail from a proven track record Called “HackerOne” and have cracked even the toughest of barriers to intrude and capture or recapture all relevant data needed by our Clients. Some Of These Specialist Includes Yassine Aboukir, Oemer Han, Imran parray, Anees Khan, Jobert Abma and many others.

      ★INTEGRATED HACKS is available to our clients 24 hours a day and 7 days a week. We understand that your request might be urgent, so we have a separate team of allocated hackers who interact with our Clients round the clock. You are with the right people so just get started.

      ★CONTACT:
      * Email:
      Integratedhacks@protonmail.com
      Integratedhacks@gmail.com

      ★CONTACT US AND EXPERIENCE CYBER SERVICES LIKE NEVER BEFORE

      Delete
  2. How about this two-liner:

    WHICH <- sapply(c("banana", "orange"), function(x) grep(x, basket))

    for (i in 1:length(WHICH)) basket[WHICH[[i]]] <- c("apple", "pineapple")[i]

    Cheers,
    Andrej

    ReplyDelete
    Replies
    1. Andrej, that would be quite different. First of all, this is about replacing elements in a vector so you need to match the strings exactly, not use regular expressions (grep). Second, my vector could have multiple bananas and oranges to replace and they could show up in any order: your version would always replace them with apple/pineapple/apple/pineapple/etc in that order. Thanks.

      Delete
  3. You can use factors for that

    decode<-function(x,i,o) {
    x<-factor(x);
    levels(x)[levels(x) %in% i] <- o;
    as.character(x)
    }

    I suspect though that you may lose out in conversion from factor to character. But if you already have factor, then no problems.

    ReplyDelete
    Replies
    1. Interesting approach, mpiktas. I'll admit I don't use factors very often. Should you have it the other way around: levels(x)[i %in% levels(x)] <- o; ? The question on SO was using a vector of integers so my answer is a bit more general in that sense, plus the possibility to use a default value for all mismatches. I'm sure it wouldn't be too hard t o add those to your version though. Thanks!

      Delete
  4. flodel,
    Ignoring the StackOverflow question and addressing the problem you pose in this post (which is subtly different), there is a way to get around your nesting issue while still being easy to code/read. I have to do this every so often when recoding data, and I find using a transformation legend is a quick and straightforward approach for 1:1 replacements. I coded up a quick example function for your case:

    # Function to replace a data array by evaluating each
    # item individually and in series.
    # data_Array is expected to be a list or array with
    # all of the data to be replaced.
    # original_List is expected to be a list or array with
    # a list of item values in data_Array.
    # replacement_List is expected to be a list or array with
    # a list of new item values that correspond in placement
    # with original_List.
    # For example, to transform "banana" into "pineapple", we
    # would see:
    # original_List[n] = "banana" and
    # replacement_List[n] = "pineapple"

    # For this to work, you may have to use unlist(data_Array)
    # if data_Array is using factors and replacement_List is
    # introducing new values. Adding the new array into a
    # data frame will then resolve the new factors automatically.

    replace_By_Item <- function(data_Array, original_List, replacement_List){
    for (i in 1:length(data_Array)){
    data_Array[i] <- replacement_List[match(data_Array[i], original_List)]
    }
    return(data_Array)
    }


    # Demo of function.
    # Makes a random list of "banana", "orange", and "fig"
    # values, and replaces all "banana" with "apple" and
    # all "orange" with "pineapple".
    test_Basket <- sample(c("banana","orange","fig"),100, rep=TRUE)
    new_Test_Basket <- replace_By_Item(test_Basket, c("banana","orange","fig"), c("apple","pineapple","fig"))

    # Regarding your default value: I don't ever use default values
    # when transforming my data. I find that it is too easy to
    # overlook something a lose a bunch of data, but I've coded it
    # up anyway as an example.
    replace_By_Item <- function(data_Array, original_List, replacement_List, default_Value){
    for (i in 1:length(data_Array)){
    if (data_Array[i] %in% original_List){
    data_Array[i] <- replacement_List[match(data_Array[i], original_List)]
    } else {
    data_Array[i] <- default_Value
    }
    }
    return(data_Array)
    }

    ReplyDelete
    Replies
    1. Thanks dinre. match() is definitely another great tool for the task. Note that it is a vectorized function so you can avoid iterating with a for() loop and your code will be much faster. An implementation matching the usage of my decode() function would be:

      decode <- function(x, search, replace, default = NULL) {
       
      idx <- match(x, search)
      out <- ifelse(is.na(idx), if (is.null(default)) x else default, replace[idx])
      return(out)
      }

      Quite concise! I have tested both versions with large data and I do not see that one is significantly faster than the other. Thanks again!

      Delete
    2. Interesting. I have for some reason never thought to invert the match statement's syntax like that. Match is designed to work the other way around (e.g. match(search, x)), but inverting it is far more useful for replacements. If we are doing integer replacements and trying to optimize CPU cycles over memory, it would be even faster to use a null array for the search variable with the numbers to be replaced put in the corresponding replacement index of the array.

      For instance:

      decode <- function(x, search){
      return(match(x, search))
      }

      # Change c(1,2,3,4) like so:
      # 1->2
      # 2->5
      # 3->6
      # 4->8

      decode(c(1,2,3,4),c(NA,1,NA,NA,2,3,NA,4))
      # [1] 2 5 6 8

      I will have to think about how I can work this use of match into my standard toolbox of methods.

      Delete
  5. flodel,
    I realized I should perhaps return here to address your question about built-in functions.

    If you really want to use a built-in function to replace your nested ifelse statements, you might want to consider using a switch. Switches are essentially a series of nested ifelse statements simplified to a single command. In R, switches work two different ways, and for your fruit basket, we would use the "character string" method. In the following, we specify the switch values and what they should return. The final unnamed value is the default switch. These could be expressions, but for our case, we're just using the fruit names.

    for (i in 1:length(data_Array)){
    data_Array[i] <- switch(data_Array[i], banana="apple", orange="pineapple", "fig")
    }


    For the StackOverflow question, we would have to use the "integer" method for our switch. Note that the "integer" method switch does not have a default. I'm using plyr here purely for convenience, since ddply takes a data frame and returns a data frame automatically.

    # z as specified in the SO question:
    z <- data.frame(x=1:10, y=11:20, t=21:30)

    library(plyr)
    ddply(z, .(x), summarize, y=y, t=t, q=switch(x, 1, 2, 1, 4, 1, 1, 3, 1, 1, 1) * t)

    ReplyDelete
    Replies
    1. Thanks dire. While switch() is useful when you have a few ifelse() cases, it quickly becomes intractable when you have dozens of changes to make. That's the whole idea behind recode()'s better usage: the values to search for and their replacements can be provided via vectors.

      Delete
  6. There is also the "recode()" function in the car package that essentially does the same thing. It is meant to provide the untility of the RECODE statment of SPSS.

    ReplyDelete
  7. To do the same thing with the recode function of the car package:

    library(car)
    recode(basket, "'banana' = 'apple', 'orange' = 'pineapple', else = 'fig'")

    ReplyDelete
    Replies
    1. Thanks GoldGuy. I did not know about that function, or package. The usage is very similar to switch(), hence only really useful when you have a few changes to make.

      Delete
  8. There's a new function in plyr 1.8 called revalue:
    revalue(basket, replace = c(banana="apple", orange="pineapple"))

    It is implemented using the new mapvalues function, which you can use this way:
    mapvalues(basket,
    from = c("banana", "orange"),
    to = c("apple", "pineapple"))

    They both work with character vectors and, notably, factors. If you have a numeric vector, you'll have to use mapvalues(), because revalue() uses a named vector for the replacements, and the names are always strings, not numbers.

    ReplyDelete
    Replies
    1. Thanks! It is interesting to see how a lot of people have tackled the same problem. Looking at the code, mapvalues() is using match() like dinre was suggesting above. The function also has an optional argument `warn_missing`: when TRUE, it will warn if some of the values to search for (and replace) are not present in the input vector. That step requires a little extra code: the use of a potentially expensive sort(unique()) which unlike Hadley implemented should only be conditional to `warn_missing` (IMHO.)

      Delete
    2. This comment has been removed by the author.

      Delete
    3. K %<>% mutate(
      KReturnHKPriceEdge = suppressAll(
      #'@ ifelse(Result == 'Win', KStakesHKPriceEdge * EUPrice,
      #'@ ifelse(Result == 'Half Win', KStakesHKPriceEdge * (HKPrice * 0.5 + 1),
      #'@ ifelse(Result == 'Push'|Result == 'Cancelled', KStakesHKPriceEdge,
      #'@ ifelse(Result == 'Half Loss', KStakesHKPriceEdge * 0.5,
      #'@ ifelse(Result == 'Loss', 0, NA)))))
      plyr::mapvalues(Result,
      c('Win', 'Half Win', 'Push', 'Cancelled', 'Half Loss', 'Loss'),
      c(KStakesHKPriceEdge * EUPrice,
      KStakesHKPriceEdge * (HKPrice * 0.5 + 1),
      KStakesHKPriceEdge, KStakesHKPriceEdge,
      KStakesHKPriceEdge * 0.5, 0))))
      ## Its works only to map the values but not applicable to a formula...

      Delete
  9. This comment has been removed by the author.

    ReplyDelete
  10. I know this is an old post, but the function str_replace seems to do this (stringr package).

    ReplyDelete
  11. ## weighted parameter estimation
    #'@ mbase %<>% mutate(theta = suppressAll(
    #'@ ifelse(Result == 'Win', 1,
    #'@ ifelse(Result == 'Half Win', 0.5,
    #'@ ifelse(Result == 'Push'|Result == 'Cancelled', 0,
    #'@ ifelse(Result == 'Half Loss', -0.5,
    #'@ ifelse(Result == 'Loss', -1, NA)))))),
    #'@ dWin = ifelse(Result == 'Win', 1, 0),
    #'@ dwhf = ifelse(Result == 'Half Win', 1, 0),
    #'@ dpus = ifelse(Result == 'Push'|Result == 'Cancelled', 1, 0),
    #'@ dlhf = ifelse(Result == 'Half Loss', 1, 0),
    #'@ dlos = ifelse(Result == 'Loss', 1, 0))

    mbase %<>% mutate(
    theta = decode(c('Win', 'Half Win', 'Push', 'Cancelled', 'Half Loss', 'Loss'),
    c(1, 0.5, 0, 0, -0.5, -1)),
    dWin = ifelse(Result == 'Win', 1, 0),
    dwhf = ifelse(Result == 'Half Win', 1, 0),
    dpus = ifelse(Result == 'Push'|Result == 'Cancelled', 1, 0),
    dlhf = ifelse(Result == 'Half Loss', 1, 0),
    dlos = ifelse(Result == 'Loss', 1, 0))

    ## I tried to knit ifelse() in RMarkdown files and it working fine yesterday but not today, its working fine if I run it indepdently... due to the ifelse() is a vector handle conditional function therefore it is not single element conditional function if {} else {} unless we use if(x[i]==y[i]) {} else {}. When I try to use decode(), there prompt me the error message : "Error: character string is not in a standard unambiguous format".

    ReplyDelete