Skip to main content

Experimentation Platform

Accelerating software innovation through trustworthy experimentation

Home: ExP Platform
Talks
Journal Survey
Experiments at Microsoft
Practical Guide (short)
Tracking Users' Clicks an
Seven Pitfalls
Semmelweis Reflex
Bloodletting
Power Calculator
What is a HiPPO
The Experimentation Platform was a project headed by Ronny Kohavi from 2006.  It went live 6/2007 and the team was merged into Bing in 10/2010. 
 
This site now serves as a repository for the papers related to experimentation.
For feedback, send e-mail to ronnyk at live dot you know what.
 
 
 
Papers

 

 

Unexpected Results in Online Controlled Experiments, SIGKDD Dec 2010

By Ron Kohavi and Roger Longbotham

 

Abstract: Controlled experiments, also called randomized experiments and A/B tests, have had a profound influence on multiple fields, including medicine, agriculture, manufacturing, and advertising. Offline controlled experiments have been well studied and documented since Sir Ronald A. Fisher led the development of statistical experimental design while working at the Rothamsted Agricultural Experimental Station in England in the 1920s. With the growth of the world-wide-web and web services, online controlled experiments are being used frequently, utilizing software capabilities like ramp-up (exposure control) and running experiments on large server farms with millions of users. We share several real examples of unexpected results and lessons learned.

 

 

Tracking Users' Clicks and Submits: Tradeoffs between User Experience and Data Loss, Oct 2010

By Ron Kohavi, David Messner,Seth Eliot, Juan Lavista Ferres, Randy Henne, Vignesh Kannappan, and Justin Wang

 

Abstract: Tracking users’ online clicks and form submits (e.g., searches) is critical for web analytics, controlled experiments, and business intelligence.Most sites use web beacons to track user actions, but waiting for the beacon to return on clicks and submits slows the next action (e.g., showing search results or the destination page).One possibility is to use a short timeout and common wisdom is that the more time given to the tracking mechanism (suspending the user action), the lower the data loss.Research from Amazon, Google, and Microsoft showed that small delays of a few hundreds of milliseconds have dramatic negative impact on revenue and user experience (Kohavi, et al., 2009 p. 173), yet we found that many websites allow long delays in order to collect click.For example, until March 2010, multiple Microsoft sites waited for click beacons to return with a 2-second timeout, introducing a delay of about 400msec on user clicks. To the best of our knowledge, this is the first published empirical study of the subject under a controlled environment. While we confirm the common wisdom about the tradeoff in general, a surprising result is that the tradeoff does not exist for the most common browser family, Microsoft Internet Explorer (IE), where no delay suffices. This finding has significant implications for tracking users since no waits is required to prevent data loss for IE browsers and it could significantly improve revenue and user experience.The recommendations here have been implemented by the MSN US home page and Hotmail.


 

 

Online Experiments: Practical Lessons, IEEE Computer Sept 2010

By Ron Kohavi, Roger Longbotham, and Toby Walker

Abstract: From ancient times through the 19th century, physicians used bloodletting to treat acne, cancer, diabetes, jaundice, plague, and hundreds of other diseases and ailments (D. Wooton, Doctors Doing Harm since Hippocrates, Oxford Univ. Press, 2006). It was judged most effective to bleed patients while they were sitting upright or standing erect, and blood was often removed until the patient fainted. On 12 December 1799, 67-year-old President George Washington rode his horse in heavy snowfall to inspect his plantation at Mount Vernon. A day later, he was in respiratory distress and his doctors extracted nearly half of his blood over 10 hours, causing anemia and hypotension; he died that night.

Today, we know that bloodletting is unhelpful because in 1828 a Parisian doctor named Pierre Louis did a controlled experiment. He treated 78 people suffering from pneumonia with early and frequent bloodletting or less aggressive measures and found that bloodletting did not help survival rates or recovery times.

Having roots in agriculture and medicine, controlled experiments have spread into the online world of websites and services. In an earlier Web Technologies article (R. Kohavi and R. Longobotham, “Online Experiments: Lessons Learned,” Computer, Sept. 2007, pp. 85-87) and a related survey (R. Kohavi et al., “Controlled Experiments on the Web: Survey and Practical Guide,” Data Mining and Knowledge Discovery, Feb. 2009, pp. 140-181), Microsoft’s Experimentation Platform team introduced basic practices of good online experimentation.

Three years later and having run hundreds of experiments on more than 20 websites, including some of the world’s largest, like msn.com and bing.com, we have learned some important practical lessons about the limitations of standard statistical formulas and about data traps. These lessons, even for seemingly simple univariate experiments, aren’t taught in Statistics 101. After reading this article we hope you will have better negative introspection: to know what you don’t know.

 

 

 

Online Experimentation at Microsoft, Sept 2009
By Ron Kohavi, Thomas Crook, and Roger Longbotham

 

The paper won 3rd place at the Third Workshop on Data Mining Case Studies and Practice Prize.

 

Abstract: Knowledge Discovery and Data Mining techniques are now commonly used to find novel, potentially useful, patterns in data (Fayyad, et al., 1996; Chapman, et al., 2000). Most KDD applications involve post-hoc analysis of data and are therefore mostly limited to the identification of correlations. Recent seminal work on Quasi-Experimental Designs (Jensen, et al., 2008) attempts to identify causal relationships. Controlled experiments are a standard technique used in multiple fields. Through randomization and proper design, experiments allow establishing causality scientifically, which is why they are the gold standard in drug tests. In software development, multiple techniques are used to define product requirements; controlled experiments provide a way to assess the impact of new features on customer behavior. The Data Mining Case Studies workshop calls for describing completed implementations related to data mining. Over the last three years, we built an experimentation platform system (ExP) at Microsoft, capable of running and analyzing controlled experiments on web sites and services. The goal is to accelerate innovation through trustworthy experimentation and to enable a more scientific approach to planning and prioritization of features and designs (Foley, 2008).Along the way, we ran many experiments on over a dozen Microsoft properties and had to tackle both technical and cultural challenges. We previously surveyed the literature on controlled experiments and shared technical challenges (Kohavi, et al., 2009). This paper focuses on problems not commonly addressed in technical papers: cultural challenges, lessons, and the ROI of running controlled experiments.

 

Longer version of the above

Online Experimentation at Microsoft (sanitized Thinkweek 2009)

By Ron Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Juan Lavista Ferres, Tamir Melamed

 

The ThinkWeek paper was recognized as a top-30 ThinkWeek at Microsoft.

 

Controlled experiments on the web: survey and practical guide (2009)

Ron Kohavi, Roger Longbotham, Dan Sommerfield, Randy Henne

 

Abstract: The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT) and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.

 

 

Seven Pitfalls to Avoid when Running Controlled Experiments on the Web, KDD 2009

By Thomas Crook, Brian Frasca, Ron Kohavi, and Roger Longbotham

 

Abstract: Controlled experiments, also called randomized experiments and A/B tests, have had a profound influence on multiple fields, including medicine, agriculture, manufacturing, and advertising. While the theoretical aspects of offline controlled experiments have been well studied and documented, the practical aspects of running them in online settings, such as web sites and services, are still being developed. As the usage of controlled experiments grows in these online settings, it is becoming more important to understand the opportunities and pitfalls one might face when using them in practice. A survey of online controlled experiments and lessons learned were previously documented in Controlled Experiments on the Web: Survey and Practical Guide(Kohavi, et al., 2009). In this follow-on paper, we focus on pitfalls we have seen after running numerous experiments at Microsoft.The pitfalls include a wide range of topics, such as assuming that common statistical formulas used to calculate standard deviation and statistical power can be applied and ignoring robots in analysis (a problem unique to online settings). Online experiments allow for techniques like gradual ramp-up of treatments to avoid the possibility of exposing many customers to a bad (e.g., buggy) Treatment. With that ability, we discovered that it’s easy to incorrectly identify the winning Treatment because of Simpson’s paradox.

Online Experiments, Lessons Learned, IEEE 2007

Ronny Kohavi and Roger Longbotham

Copyright IEEE, published with here with permission.

 

 

Practical Guide to Controlled Experiments on the Web: Listen to Your
Customers not to the HiPPO, KDD 2007
Ron Kohavi, Randy Henne, Dan Sommerfield

 

The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment tests, and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.

 

Misc