• The Strong ML Hypothesis

    Data and compute power availability are important in the resurgence of ML and AI, but two of the biggest innovations in neural networks (NN), convolutional and deep networks (CN and DN), are data- and compute-efficient ideas, which allow practitioners to do more with fewer resources. I think this observation deserves...
  • The softmedian

    Why is there a softmax function, but not a softmedian? Let’s create not one, but a few of them. Basics: People who are familiar with neural networks will likely have heard of...
  • altair_recipes: a Python package to generate essential statistical graphics for the web

    If you don’t need the full power of the grammar of graphics to generate classical plots for the web, altair_recipes is the easy way. Check it out with pip install altair_recipes.
  • A Simple Loss Function for Multi-Task learning with Keras implementation, part 2

    In this post, we show how to implement a custom loss function for multitask learning in Keras and perform a couple of simple experiments with it. TL;DR: this is the code: kb.exp(kb.mean(kb.log(kb.mean(kb.square(y_pred - y_true), axis=0)), axis=-1))
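
Read term by term, the one-liner in this excerpt computes the geometric mean of the per-task mean squared errors: an average over examples (axis=0), a log, an average over tasks (axis=-1), and an exp. The following is a minimal pure-Python sketch of that arithmetic, not the post's Keras code; the function name `multitask_loss` and the list-of-rows input layout are assumptions made for illustration.

```python
import math


def multitask_loss(y_true, y_pred):
    """Geometric mean of per-task mean squared errors.

    y_true, y_pred: lists of rows, one row per example, one column per task
    (hypothetical layout chosen for this sketch).
    """
    n_examples = len(y_true)
    n_tasks = len(y_true[0])
    log_mses = []
    for t in range(n_tasks):
        # Mean squared error of task t over all examples (the axis=0 mean).
        mse = sum((p[t] - y[t]) ** 2 for y, p in zip(y_true, y_pred)) / n_examples
        log_mses.append(math.log(mse))
    # Mean of the logs over tasks (the axis=-1 mean), then exponentiate.
    return math.exp(sum(log_mses) / n_tasks)


# Two tasks with per-task MSEs of 1 and 4.
loss = multitask_loss([[0, 0], [0, 0]], [[1, 2], [1, 2]])
```

Like the original expression, this sketch takes the log of each per-task error, so it is undefined when any task has exactly zero error.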
  • A Simple Loss Function for Multi-Task learning with Keras implementation, part 1

    In this post I walk through a recent paper about multi-task learning and fill in some mathematical details. Implementation and experiments will follow in a later post.
  • Tame the newsfeed with homemade AI

    Back in the 60s it was called information overload and it affected so-called decision makers. Fast forward to today and the situation hasn’t improved. Now it’s not only decision makers who are overloaded: it’s everyone. The information is not just too much: it’s trivial or factually incorrect or deliberately crafted...
  • Mathematical model sides with tennis players, not pundits, on serve selection

    I was watching the current Wimbledon tennis tournament when I heard a comment by former champion and coach Boris Becker that got my attention. He complained that Canadian player Milos Raonic was not using the body serve, a shot aimed directly at the opponent that allegedly results in a weak...
  • A nutritional search engine with shiny and dplyr

    TL; DR: try our shiny new nutritional search engine. Feedback welcome. “In the middle of our life’s journey, I found myself in a dark wood.” So starts Dante’s Inferno. My midlife doesn’t feel remotely as bleak, but for reasons that will be best left untold, I had to almost completely...
  • How plyrmr was ahead of the curve

    I recently attended a talk by the always excellent Hadley Wickham about his latest work on creating and visualizing many models. I combine here two snippets from that tutorial for your convenience and further discussion: gapminder %>% group_by(continent, country) %>% nest() %>% mutate(model = purrr::map(data, ~ lm(lifeExp ~ year, data...
  • Yet Another Pipe Operator in R to unify interactive and programming use

    Prologue The pipe operator, %>% in its latest incarnation, is all the rage in R circles. I first saw it in a less-well-known package called vadr. Then one was added to dplyr, but I preferred my own implementation when working on plyrmr. Then a dedicated package emerged called magrittr and...
  • Syntax Directed Diffs for R in R

    Unsatisfied with general-purpose, syntax-oblivious diff tools, I take the first step towards syntax-directed diffs for R. Like many developers, I use git to manage my source code and collaborate with others. One fundamental component of source code control is a tool to compare files, namely source code files. Most...
  • Delicious R Curry

    In R, functional::Curry is a misnomer at best. Let’s implement currying in R. I’ve always wondered why the function Curry in package functional for the language R is named that way when it actually implements partial application. What it does is transform a function into another one with a smaller...
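
The distinction this excerpt draws can be sketched in a few lines. The post itself works in R; the example below uses Python only for illustration, and the helper name `curry` is hypothetical. Partial application fixes some arguments and returns a function of the remaining ones, while currying turns an n-argument function into a chain of one-argument functions.

```python
from functools import partial
import inspect


def curry(f):
    """Turn an n-argument function into a chain of one-argument functions."""
    n = len(inspect.signature(f).parameters)

    def collect(args):
        # Once all n arguments have been gathered, call the original function.
        if len(args) == n:
            return f(*args)
        # Otherwise return a function that accepts exactly one more argument.
        return lambda x: collect(args + (x,))

    return collect(())


def add3(a, b, c):
    return a + b + c


curried_sum = curry(add3)(1)(2)(3)   # currying: one argument at a time
partial_sum = partial(add3, 1)(2, 3)  # partial application: fix a, pass the rest
```

This sketch assumes a fixed number of positional parameters; functions with defaults or variadic arguments would need a different stopping rule.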
  • 10 eigenmaps of the United States of America

    An unbiased analysis of census data reveals not one but many maps of the United States. The original inspiration for this post comes from a New York Times article. By combining 6 socio-economic observables at the county level, the author puts together a map that in his view describes...
  • Can't someone else find those differences?

    Use statistics and R instead of squinting at satellite images! You may have, like me, run into this article. Amazing stuff. A little startup pushing satellite imaging to the next level. Full planet coverage at the resolution of a few feet every 24 hours, soon, and on a shoestring budget....
  • The Greatest Sailing Race of All Time Seen Through Statistical Graphics

    I don't know if you've been following the America's Cup. It's the oldest sailing competition and, by some accounts, the oldest international sporting event bar none. This year, this time-honored contest has been thrust into the modern age with the adoption of foiling winged catamarans that skim the water...
  • Three microblogs: The Ascetic Programmer, Science in Crisis and Data Science Matters.

    I've started three thematic microblogs you may be interested in. They are all link and quote microblogs that reflect side interests related to my work but that I don't want to force onto all of my twenty-five readers. My main microblog is focused on work-related matters, projects, etc. and I plan on...
  • R anti-tips

    Not all R tips are equally good. Let's set the record straight. Anti-tip #1: For loops are slower than functions in the apply family. Why should that be the case? Let's see what the R interpreter has to say about it. Let's get some numbers to chew on first: z =...
  • The essential R packages

    Much has been said about the richness of the system of packages for R, but where is one supposed to start? The availability of a wide variety of packages has long been highlighted as one of the strengths of the R language. But the number is overwhelming: 5000 is...
  • Mapreduce everywhere

    Mapreduce could extend its reach beyond, or inside, the data center. Coming soon to a computer near you? The local Hadoop SF meetings cover a variety of topics, mostly practical. But on one occasion the discussion took a speculative turn: does Hadoop have legs or is it a stop-gap...
  • The connected components example, rewritten using RHadoop/rmr

    My new implementation of random mate for mapreduce, using the package rmr from Revolution Analytics' open source project RHadoop. This story now has three episodes. First, I got interested in how to compute connected components in map reduce in a way that works even for large diameter graphs and proposed an...
  • A map reduce algorithm for connected components: implementation

    At long last, a complete implementation of the algorithm I described some time ago. You are kindly advised to go back and check the algorithm motivation and description in my older post, but the short of it is that it is a map reduce algorithm for connected components that is not...
  • Bringing relational joins to Rhipe

    Relational operations are a very common way to express map-reduce computations at a higher level, but Rhipe, an R package for mapreduce, doesn't have any. Let's start to fix this with a basic join function. This is going to be a little dry and technical, in preparation for better things to...
  • Let a million Twitters bloom

    Why are some people uncomfortable with cloud computing? What are the limitations and is there a way forward? The recent sudden change in Twitter terms of service for developers (the consensus is, despite attempts to backtrack, that they are against third party clients) has unleashed a debate about the...
  • Looking for a map reduce language

    On a quest for an elegant and effective map reduce language, I went through a number of options and put together some considerations. And the winner is … Update: since writing this post, I was approached by Revolution Analytics to write yet another map reduce library, this time for R, and...
  • Find the odd bag

    From a job interview challenge, an interesting probability exercise in two parts. One of the themes here is pretty standard fare. You are given a clearly defined random procedure whose outcome is a mixture of two distributions. The problem is, given a certain set of outcomes, find which of the...
  • On lenses for small cameras: a data-driven counterargument

    Andy Westlake of dpreview.com takes apart the current lens offering for lightweight interchangeable lens cameras (LILC) like the micro four thirds and related mirrorless designs, but I was unconvinced. Let's see what the data says. Andy Westlake is a photographer and camera reviewer at dpreview.com and his opinions carry some weight...
  • Thoughts on A/B testing

    A/B testing is part of a push towards software engineering as an experimental science, which I support, but there are plenty of open problems. I've been mulling over these points for a long while, but, after running into this excellent and amusing post by John Moult, about the pains and perils...
  • An algorithm for sample quantiles in map reduce

    A simple but frequently occurring problem is computing sample quantiles, sometimes named top $k$ elements, in a large data set. Here I show a solution for the MapReduce model of computation. The standard in-memory algorithm for this problem is similar to quicksort, with the main difference that only one branch...
  • A map reduce algorithm for connected components

    In a recently published book about algorithms for the map reduce model of computation, a simple connected components algorithm based on label propagation is proposed, but its complexity depends on the diameter of the graph, which can be very large. It turns out we can get rid of that dependency...
  • Rapleaf Array Absurdity or On streaming problems in disguise

    From the interview challenges of an up-and-coming web startup, three problems that range from the trivial to the impossible. The key to the solution is to recognize that the setting is close to that of streaming algorithms, which allows for very limited space resources compared to the...
  • Facebook Illegal Wiretaps

    The formulation of this problem is quite creative, but overall it is just describing a matrix where the rows are workers and the columns are tasks. Workers have numbers and tasks have names, and the job completion time depends on whether the worker is odd or even, the number of...
  • Facebook Prime Bits

    This is one of Facebook's job candidate puzzles. Given a range [a,b] of positive integers, test for the primality of the number of 1 bits in the binary representation of each number, and do so in O(n) where n is b - a. Unfortunately the puzzle goes on to assume...
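
The task as stated in this excerpt can be sketched as follows. This is a straightforward reading of the puzzle statement, not the post's own solution (the excerpt is truncated before it gets there), and the names `small_prime_table` and `prime_bits` are hypothetical. The bit count of any number up to b is at most b's bit length, so a tiny sieve built once answers every primality test by table lookup.

```python
def small_prime_table(limit):
    """Sieve of Eratosthenes: table[k] is True when k is prime, for 0 <= k <= limit."""
    table = [False, False] + [True] * max(0, limit - 1)
    for i in range(2, int(limit ** 0.5) + 1):
        if table[i]:
            for j in range(i * i, limit + 1, i):
                table[j] = False
    return table


def prime_bits(a, b):
    """For each n in [a, b], report whether the number of 1 bits in n is prime."""
    # No number in the range has more 1 bits than b has binary digits.
    is_prime = small_prime_table(b.bit_length())
    return [is_prime[bin(n).count("1")] for n in range(a, b + 1)]


flags = prime_bits(1, 8)
```

Counting bits with `bin(n).count("1")` costs time proportional to the number of binary digits, so this sketch is linear in the size of the range only if that count is treated as a constant-time operation.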
  • ProjectDescription - Lucene-hadoop Wiki

    Implementation of simple parallel computing, based on Google's map-reduce, runs over Amazon's EC2. Supercomputing for the rest of us.