Sunday, January 6, 2013

Search and replace: Are you tired of nested `ifelse`?

It happens all the time: you have a vector of fruits and you want to replace all bananas with apples, all oranges with pineapples, and leave all the other fruits as-is, or maybe change them all to figs. The usual solution? A big old nested `ifelse`:

Ok, that didn't look too bad, especially with the code and fruits nicely aligned. But what if I had a lot of fruits to change and little patience? Wouldn't it be nice if R had a built-in function for doing multiple search and replace? Someone please tell me if there is already such a function. If not, here is one I wrote that builds a nested `ifelse` function by recursion:

Note that I named the function after the `decode` SQL function. Here are a couple examples:

Feel free to use it with your favorite fruits or vegetables! Cheers!

P.S.: I wrote this function as an answer to this S.O. question. Thank you to Matthew Lundberg for sharing ideas.

Photo source: http://www.istockphoto.com/stock-photo-19534475-mixed-fruit.php

17 comments:

  1. Hi There,
    Cool post.
    The image was removed from r-bloggers since you didn't give any copyright credits. Please do so in the future.

    Cheers,
    Tal

    ReplyDelete
    Replies
    1. Tal, sorry about that. It is a stock photo I got for a small fee and I have added the source at the bottom of the post.

      Delete
  2. How about this two-liner:

    WHICH <- sapply(c("banana", "orange"), function(x) grep(x, basket))

    for (i in 1:length(WHICH)) basket[WHICH[[i]]] <- c("apple", "pineapple")[i]

    Cheers,
    Andrej

    ReplyDelete
    Replies
    1. Andrej, that would be quite different. First of all, this is about replacing elements in a vector so you need to match the strings exactly, not use regular expressions (grep). Second, my vector could have multiple bananas and oranges to replace and they could show up in any order: your version would always replace them with apple/pineapple/apple/pineapple/etc in that order. Thanks.

      Delete
  3. You can use factors for that

    decode<-function(x,i,o) {
    x<-factor(x);
    levels(x)[levels(x) %in% i] <- o;
    as.character(x)
    }

    I suspect though that you may lose out in conversion from factor to character. But if you already have factor, then no problems.

    ReplyDelete
    Replies
    1. Interesting approach, mpiktas. I'll admit I don't use factors very often. Should you have it the other way around: levels(x)[i %in% levels(x)] <- o; ? The question on SO was using a vector of integers so my answer is a bit more general in that sense, plus the possibility to use a default value for all mismatches. I'm sure it wouldn't be too hard t o add those to your version though. Thanks!

      Delete
  4. flodel,
    Ignoring the StackOverflow question and addressing the problem you pose in this post (which is subtly different), there is a way to get around your nesting issue while still being easy to code/read. I have to do this every so often when recoding data, and I find using a transformation legend is a quick and straightforward approach for 1:1 replacements. I coded up a quick example function for your case:

    # Function to replace a data array by evaluating each
    # item individually and in series.
    # data_Array is expected to be a list or array with
    # all of the data to be replaced.
    # original_List is expected to be a list or array with
    # a list of item values in data_Array.
    # replacement_List is expected to be a list or array with
    # a list of new item values that correspond in placement
    # with original_List.
    # For example, to transform "banana" into "pineapple", we
    # would see:
    # original_List[n] = "banana" and
    # replacement_List[n] = "pineapple"

    # For this to work, you may have to use unlist(data_Array)
    # if data_Array is using factors and replacement_List is
    # introducing new values. Adding the new array into a
    # data frame will then resolve the new factors automatically.

    replace_By_Item <- function(data_Array, original_List, replacement_List){
    for (i in 1:length(data_Array)){
    data_Array[i] <- replacement_List[match(data_Array[i], original_List)]
    }
    return(data_Array)
    }


    # Demo of function.
    # Makes a random list of "banana", "orange", and "fig"
    # values, and replaces all "banana" with "apple" and
    # all "orange" with "pineapple".
    test_Basket <- sample(c("banana","orange","fig"),100, rep=TRUE)
    new_Test_Basket <- replace_By_Item(test_Basket, c("banana","orange","fig"), c("apple","pineapple","fig"))

    # Regarding your default value: I don't ever use default values
    # when transforming my data. I find that it is too easy to
    # overlook something a lose a bunch of data, but I've coded it
    # up anyway as an example.
    replace_By_Item <- function(data_Array, original_List, replacement_List, default_Value){
    for (i in 1:length(data_Array)){
    if (data_Array[i] %in% original_List){
    data_Array[i] <- replacement_List[match(data_Array[i], original_List)]
    } else {
    data_Array[i] <- default_Value
    }
    }
    return(data_Array)
    }

    ReplyDelete
    Replies
    1. Thanks dinre. match() is definitely another great tool for the task. Note that it is a vectorized function so you can avoid iterating with a for() loop and your code will be much faster. An implementation matching the usage of my decode() function would be:

      decode <- function(x, search, replace, default = NULL) {
       
      idx <- match(x, search)
      out <- ifelse(is.na(idx), if (is.null(default)) x else default, replace[idx])
      return(out)
      }

      Quite concise! I have tested both versions with large data and I do not see that one is significantly faster than the other. Thanks again!

      Delete
    2. Interesting. I have for some reason never thought to invert the match statement's syntax like that. Match is designed to work the other way around (e.g. match(search, x)), but inverting it is far more useful for replacements. If we are doing integer replacements and trying to optimize CPU cycles over memory, it would be even faster to use a null array for the search variable with the numbers to be replaced put in the corresponding replacement index of the array.

      For instance:

      decode <- function(x, search){
      return(match(x, search))
      }

      # Change c(1,2,3,4) like so:
      # 1->2
      # 2->5
      # 3->6
      # 4->8

      decode(c(1,2,3,4),c(NA,1,NA,NA,2,3,NA,4))
      # [1] 2 5 6 8

      I will have to think about how I can work this use of match into my standard toolbox of methods.

      Delete
  5. flodel,
    I realized I should perhaps return here to address your question about built-in functions.

    If you really want to use a built-in function to replace your nested ifelse statements, you might want to consider using a switch. Switches are essentially a series of nested ifelse statements simplified to a single command. In R, switches work two different ways, and for your fruit basket, we would use the "character string" method. In the following, we specify the switch values and what they should return. The final unnamed value is the default switch. These could be expressions, but for our case, we're just using the fruit names.

    for (i in 1:length(data_Array)){
    data_Array[i] <- switch(data_Array[i], banana="apple", orange="pineapple", "fig")
    }


    For the StackOverflow question, we would have to use the "integer" method for our switch. Note that the "integer" method switch does not have a default. I'm using plyr here purely for convenience, since ddply takes a data frame and returns a data frame automatically.

    # z as specified in the SO question:
    z <- data.frame(x=1:10, y=11:20, t=21:30)

    library(plyr)
    ddply(z, .(x), summarize, y=y, t=t, q=switch(x, 1, 2, 1, 4, 1, 1, 3, 1, 1, 1) * t)

    ReplyDelete
    Replies
    1. Thanks dire. While switch() is useful when you have a few ifelse() cases, it quickly becomes intractable when you have dozens of changes to make. That's the whole idea behind recode()'s better usage: the values to search for and their replacements can be provided via vectors.

      Delete
  6. There is also the "recode()" function in the car package that essentially does the same thing. It is meant to provide the untility of the RECODE statment of SPSS.

    ReplyDelete
  7. To do the same thing with the recode function of the car package:

    library(car)
    recode(basket, "'banana' = 'apple', 'orange' = 'pineapple', else = 'fig'")

    ReplyDelete
    Replies
    1. Thanks GoldGuy. I did not know about that function, or package. The usage is very similar to switch(), hence only really useful when you have a few changes to make.

      Delete
  8. There's a new function in plyr 1.8 called revalue:
    revalue(basket, replace = c(banana="apple", orange="pineapple"))

    It is implemented using the new mapvalues function, which you can use this way:
    mapvalues(basket,
    from = c("banana", "orange"),
    to = c("apple", "pineapple"))

    They both work with character vectors and, notably, factors. If you have a numeric vector, you'll have to use mapvalues(), because revalue() uses a named vector for the replacements, and the names are always strings, not numbers.

    ReplyDelete
    Replies
    1. Thanks! It is interesting to see how a lot of people have tackled the same problem. Looking at the code, mapvalues() is using match() like dinre was suggesting above. The function also has an optional argument `warn_missing`: when TRUE, it will warn if some of the values to search for (and replace) are not present in the input vector. That step requires a little extra code: the use of a potentially expensive sort(unique()) which unlike Hadley implemented should only be conditional to `warn_missing` (IMHO.)

      Delete
  9. This comment has been removed by the author.

    ReplyDelete