A nutritional search engine with shiny and dplyr

TL; DR: try our shiny new nutritional search engine. Feedback welcome.

“In the middle of our life’s journey, I found myself in a dark wood.” So starts Dante’s Inferno. My midlife doesn’t feel remotely as bleak, but for reasons that will be best left untold, I had to almost completely strike two nutrients from my diet: sodium and sugar. This started a thorough examination of the body of knowledge contained in nutritional labels that, in the US and many other countries, are mandatory on most packaged foods.

Despite the wisdom thus accrued, shopping frustration was inevitable. For instance, available cold breakfasts can be roughly classifieds as granolas, which are too high in sugar, and cereals, which are too high in sodium. After a long search that also involved loving family members, we found that the intersection of no sugar and no sodium cereals contained all of two products available in our favorite grocery stores. It seems natural to think that a better solution to this problem must be e-commerce, with its enormous selection and powerful search features. While the selection is indeed enormous – roughly 1M items on Amazon alone – nutritional content search is still at most a figment of Jeff Bezos imagination. Free text searches for [low sodium low sugar cereal], despite recent advances in NLP, resulted only in contempt for my own profession. If one replaces searching with browsing, no matter how time consuming, nutritional information is sometimes available, in the form of pictures of the packaging, but not on every site and not very consistently. There has to be a better way. So I decided to write something myself.

The starting point is a USDA-provided dataset of nutritional information. It’s far from complete, as it contains only 8000 entries, but it’s a start. Please let me know if you know of additional data sets that could be relevant to this project. The USDA provides its own search engine but it is not flexible enough for my needs. This data set is also fairly wide, including information for exotic nutrients such as lycopene and beta_cryptoxanthin, if those mean anything to you. What kind of searches would we like to perform on this dataset? Setting aside my doubly restricted diet for now, let’s focus on just sodium. The average intake for the American population is higher than the recommended maximum intake. An even lower intake is recommended for people with high blood pressure, heart failure or Ménière’s syndrome. The Salt Skip Program advocates an even more radical reduction. So this is a widely useful restriction. The problem is that sodium is in many foods. As sodium chloride, or table salt, it is present in all animal source foods I can think of and even some vegetables – I am looking at you, celery. Besides that, it became an important food preservative early in history, allowing the long term storage of food. Since our taste buds get used to its taste, it’s added to a large number of foods where it’s not a strictly necessary ingredient. The Salt Skip program recommends a simple approach: eat only food with less than 120mg/100g of sodium. It’s a simple criterion, which also has the advantage of selecting equally bland foods, which helps reset our taste buds, but unfortunately it tends to steer people towards more watery food for no nutritional advantage. If I need to eat more bland food, I may end up with the same total sodium intake. Our goal is achieving our main nutritional goals while also restricting sodium intake. Arguably the most important goal of nutrition is to provide enough energy to maintain a healthy body weight, hence I think ranking foods by sodium per unit of energy provided is a good idea (also suggested in the book Salt Matters). If your daily intake is 2000 cal and you eat only or on average foods with less than 1mg of sodium per 2cal, your daily intake should be less than 1g. Once we have the right amount of energy, an important nutritional goal is to get enough protein. This is particularly challenging when restricting sodium intake because animal source foods are an important source of protein and are also naturally salty – and made saltier through processing, as in prosciutto. So these are the main three filters that I implemented in a shiny application, together with a simple pattern match on the name of the food, for when you absolutely need that low sodium snivel of cheese. But what if you have different requirements? What about my weird double restriction? Rather than implementing filters for all possible diets, I decided to implement an advanced search that allows one to do whatever searches he needs, in exchange for a bit or R alphabetization. Based on 4 of the dplyr main “verbs”, the advanced screen allows to add columns (mutate), sort (arrange), filter by any logical condition, and select the columns of interest. For instance, my search for the perfect cereal could be performed by entering sos = pmax(sodium, sugar) in the mutate field, where sos is short for “sodium or sugar”; then grepl(x = food_desc, "CEREAL") in the filter field, to focus on cereals; sos in the arrange field; and finally food_desc, sodium, sugar, protein, energy, fiber in the select field. This way we could discover that the only cereal in this USDA dataset containing neither sodium nor sugar, zero, zap, nada of either is Malt-O-Meal Original Cereal, which is somewhat hard to believe as it contains malted barely which itself contains sugar, but the producer supports this claim. The two no-sodium no-sugar cereals that I eat are nowhere to be found. Clearly we could use a more extensive dataset. It would also be nice to gather more information about the availability of each food. As far as the UI, it would make sense to add other screens specialized for specific diets, as not everybody is familiar enough with dplyr to use the advanced screen, as long as the necessary information is in the data: for instance, the concentrations of gluten or oxalate are not. Please let me know what your needs are and I will try to implement your favorite search.

Implementation

Now a little on the making of this shiny app. It is pretty run-of-the mill, with the only twist of some dynamic UI elements. The app is organized into three files, ui.R, server.R and global.R. The latter can be used to create objects shared by the other two, and since we have some data-dependent UI elements, that’s where we read in the dataset.

In the global.R file, we just read in the data, change the col names to something more readable and turn integer cols into numeric ones, as all these data are continuous in reality:

food_data = read.table("sr28abbr/ABBREV.txt", sep = "^", quote = "~")

names(food_data) =
  c("food_code", "food_desc", "water", "energy", "protein", "fat", "ash", "carbohydrate_plus_fiber", "fiber", "sugar", "calcium", "iron", "magnesium", "phosphorus", "potassium", "sodium", "zinc", "copper", "manganese", "selenium", "vitamin_c", "thiamin", "riboflavin", "niacin", "pantothenic_acid", "vitamin_b6", "folate_total", "folic_acid", "food_folate", "folate", "choline", "vitamin_b12", "vitamin_a", "vitamin_a_retinol", "retinol", "alpha_carotene", "beta_carotene", "beta_cryptoxanthin", "lycopene", "lutein", "vitamin_e", "vitamin_d", "vitamin_d_iu", "vitamin_k", "saturated_fatty_acids", "monounsaturated_fatty_acids", "polyunsaturated_fatty_acids", "cholesterol", "first_household_weight", "description_household_weight_1", "second_household_weight", "description_household_weight_2", "refuse")

food_data$food_desc =
  paste0(
    "<a href=\"http://google.com/search?q=",
    map(as.character(food_data$food_desc), URLencode), "\">",
    food_data$food_desc,
    "</a>")

food_data %>%
  map(function(x) if(is.integer(x)) as.numeric(x) else x) %>%
  data.frame -> food_data

On the server side, a couple of helper functions help filling in some empty values coming from the UI, to avoid downstream errors, and help parse text field containing comma-separated R expressions, to be fed as arguments to dplyr verbs:

default =
  function(x, value, condition = is.null)
    if(condition(x)) value else x

parse_field =
  function(field) {
    as.list(
      parse(text = paste0("list(", field, ")"))[[1]][-1])}

Let’s get the server started

shinyServer(function(input, output) {

Then we wrap the generic dplyr verb into a function that will pick the right one based on a string, parse the argument and feed them to it. This is a bridge between text inputs and actual code:

  verb =
    function(data, name, val) {
      get(paste0(name, "_"))(
        data,
        .dots =
          parse_field(default(input[[name]], val, function(x) is.null(x) || x == "")))}

The main output is a table, a simple derivative of the main data set, which is obtained in one of two possible ways, one is for the sodium-focused search and the other for the advanced search:

  output$food_data =
    DT::renderDataTable({
      switch(
        input$search_type,

The processing for the low sodium search consists of pattern matching the user input on the food_desc column, then filtering on sodium, sodium/energy and sodium/protein in a cascade based on thresholds obtained from the UI, entered by the user:

        `Low Sodium` = {
          food_data %>%
            filter(
              grepl(
                x = food_desc,
                pattern = default(input$food_type, ""),
                ignore.case = TRUE)) %>%
            filter(sodium <= default(input$sodium, 120)) %>%
            filter(sodium/energy <= default(input$sodium_energy, 0.6)) %>%
            filter(sodium/protein <= default(input$sodium_protein, 19)) %>%
            select_(.dots = c("food_desc", "sodium", colnames(food_data)))},

The advanced processing is also a cascade of dplyr verbs, 4 of the most commonly used, fed user inputs as arguments or some sensible defaults.

        Advanced = {
          food_data %>%
            verb("mutate", "") %>%
            verb("filter", "TRUE") %>%
            verb("arrange", "food_code") %>%
            verb("select", paste(names(food_data), collapse = ","))})
    },
    escape = FALSE)
})

Switching to the UI side, we define a couple of specialized input elements. The first is a slider input with data-dependent min, max and starting values:

sodium_quantiles = quantile(food_data$sodium, na.rm = TRUE)
sodium_energy_quantiles =
  quantile(food_data$sodium/(food_data$energy + 0.01), na.rm = TRUE)
sodium_protein_quantiles =
  quantile(food_data$sodium/(food_data$protein + 0.01), na.rm = TRUE)

sodium_slider_input =
  function(inputId, label, quantiles)
    sliderInput(
      inputId,
      label,
      round(quantiles[["0%"]], 2),
      round(quantiles[["75%"]], 2),
      round(quantiles[["25%"]], 2))

The second is a text input used for dplyr verb arguments, which have as name and id the verb itself and a different placeholder for each verb (the placeholder is what is displayed in an empty text input):

ti =
  function(name, placeholder)
    textInput(name, name, "", "100%", placeholder)

Now we can start defining the actual UI. We pick a fluid page because we think fluid layouts best adapt to a variety of screen sizes. We adopt a simple sidebar layout, with inputs on the left and the data output on the right:

shinyUI(
  fluidPage(
    titlePanel("Nutritional Food Search"),
    p(a(href="http://www.ars.usda.gov/Services/docs.htm?docid=25700", "Dataset"),
      "from USDA"),

The first input element is a selection between the two types of search:

    sidebarLayout(
      sidebarPanel(
        selectInput("search_type", "Search type", c("Low Sodium", "Advanced")),

Then we have two conditional panels, which appear only if a condition, written in JavaScript, is satisfied. The alternative to writing JavaScript was to build UI elements on the server side, where the output of the previous selection is available in R. I feel this is a choice between a rock and a hard place. It may be that my knowledge of shiny is not advanced enough or a design flaw. Multiple available examples point to the latter, but I don’t know how hard it would be to fix it. The first of the two panels is for the sodium-focused search and contains three sliders which provide values later used in filtering the main dataset:

        conditionalPanel(
          'input.search_type == "Low Sodium"',
          p("Simple search for low sodium foods. Minimize your sodium intake per amount of food, energy or protein"),
          textInput("food_type", "Food type", "", placeholder = "Partial food name, e.g. 'cheese' or empty"),
          sodium_slider_input(
            "sodium",
            "max sodium per weight mg/100 g",
            sodium_quantiles),
          sodium_slider_input(
            "sodium_energy",
            "max sodium per energy mg/kcal",
            sodium_energy_quantiles),
          sodium_slider_input(
            "sodium_protein",
            "max sodium per protein mg/g",
            sodium_protein_quantiles)),

The second panel has a text input for each verb, which will be converted into the .dots argument and passed to the appropriate function:

        conditionalPanel(
          'input.search_type == "Advanced"',
          p("Advanced food search with dplyr-like syntax"),
          ti("mutate",  "e.g. 'ratio = sodium/energy, sodium/protein'"),
          ti("filter",   "e.g. 'energy < 100'"),
          ti("arrange", "e.g. 'desc(energy/sodium)'"),
          ti("select", "e.g. 'food_desc, energy'"))),

Finally, the main panel displays the filtered table, accoding to any criteria the user has entered

      mainPanel(DT::dataTableOutput("food_data")))))

And that’s all folks! Let me know if this helps with your food needs and how it can be improved. Email or pull requests.