Honest AB is an A/B testing service that uses interpretable machine learning models to provide both statistically significant test results and human-readable insights. See it in action: https://honest-ab.herokuapp.com/demo.
The app exposes a web UI and an API for creating experiments (A/B tests) and submitting data from your app. The service combines typical A/B test data (clicks and views for each variant) with any additional data you think might affect a user's decision to click. A Bayesian regression model detects linear relationships between the input features and click rate. Once the test reaches significance, the model can determine whether there is a statistically significant relationship between the features provided about the users in the experiment and their chance of clicking on either variant. The system then summarizes these trends in plain English, revealing, for example, that one of the input features strongly correlates with the success of one of the variants. Together with a statistically significant overall result for the A/B test, these trends help interpret the test at hand and offer valuable insights for future testing.
Honest AB is powered by two machine learning models: one handles significance estimation for the experiment, and the other performs regression analysis on the additional input features. Both are Bayesian inference models evaluated analytically, meaning no approximations or gradient-based learning. This property allows incoming data to be processed as a stream. Each API request first goes through basic data validation and encoding and is then appended to a write-ahead log. The log is processed asynchronously to update the models, after which the data is deleted, since it is no longer needed. To enable this architecture and the efficiency it brings, the models have been formulated in terms of sufficient statistics of fixed size and bounded magnitude. Model training takes linear time and constant space with respect to the number of data points. Thanks to the asynchronous architecture and the user- and data-parallel algorithms that power it, Honest AB scales to many users and large datasets while remaining highly available.
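The flow described above (validate, append to a log, update fixed-size statistics asynchronously, discard the raw record) can be sketched roughly as follows. This is an illustration only, not the service's actual code: the in-memory deque stands in for a durable write-ahead log, and the names Experiment, VariantStats, submit and process_log are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VariantStats:
    """Fixed-size sufficient statistics for one variant."""
    clicks: int = 0
    views: int = 0


@dataclass
class Experiment:
    # Hypothetical stand-in for a durable write-ahead log.
    log: deque = field(default_factory=deque)
    stats: dict = field(
        default_factory=lambda: {"A": VariantStats(), "B": VariantStats()}
    )

    def submit(self, variant, clicked):
        """Request path: validate and encode, then append to the log."""
        if variant not in self.stats:
            raise ValueError(f"unknown variant: {variant!r}")
        self.log.append((variant, bool(clicked)))

    def process_log(self):
        """Background path: fold each record into the statistics and drop it."""
        processed = 0
        while self.log:
            variant, clicked = self.log.popleft()  # the record is gone after this
            variant_stats = self.stats[variant]
            variant_stats.views += 1
            variant_stats.clicks += int(clicked)
            processed += 1
        return processed
```

Each record is touched once and then discarded, so the work grows linearly with the number of records while the stored state stays at two counters per variant.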
Contrary to standard procedure for an A/B test, Honest AB uses a Bayesian measure of significance for experiment results instead of a T-test. This allows the statistic to serve as a stopping condition for the test, whereas the significance value reported by a T-test is only valid for a fixed, predetermined sample size. The model estimates the probability that the true click rate for variant A is greater than that for variant B, modelling the click experiment as a Bernoulli trial and its corresponding click rate with a conjugate Beta prior. The sufficient statistics for this model are the counts of successes and failures for each variant. (Miller)
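As a rough illustration of how that probability can be computed from the counts alone, the sketch below uses the closed-form sum given in the (Miller) reference. It assumes a uniform Beta(1, 1) prior, which the description above does not specify, and the function names are hypothetical. The closed form needs integer Beta parameters, so the prior parameters are integers here.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def prob_greater(alpha_x, beta_x, alpha_y, beta_y):
    """P(p_x > p_y) for p_x ~ Beta(alpha_x, beta_x), p_y ~ Beta(alpha_y, beta_y).

    Closed-form sum; alpha_x must be a positive integer.
    """
    total = 0.0
    for i in range(alpha_x):
        total += math.exp(
            log_beta(alpha_y + i, beta_y + beta_x)
            - math.log(beta_x + i)
            - log_beta(1 + i, beta_x)
            - log_beta(alpha_y, beta_y)
        )
    return total


def prob_a_beats_b(clicks_a, views_a, clicks_b, views_b,
                   prior_alpha=1, prior_beta=1):
    """Posterior probability that variant A's true click rate exceeds B's.

    prior_alpha and prior_beta default to a uniform prior (an assumption).
    """
    # Conjugate update: successes add to alpha, failures add to beta.
    alpha_a = prior_alpha + clicks_a
    beta_a = prior_beta + (views_a - clicks_a)
    alpha_b = prior_alpha + clicks_b
    beta_b = prior_beta + (views_b - clicks_b)
    return prob_greater(alpha_a, beta_a, alpha_b, beta_b)
```

For example, prob_a_beats_b(120, 1000, 90, 1000) returns the posterior probability that A's click rate is the higher one after 1000 views of each variant. The threshold at which an experiment is declared finished is a separate choice that this sketch does not make.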
The feature insights are based on Bayesian linear regression models from the input features to the output click rate for each variant. Each feature is treated independently, since covariance between features and their regression weights is not useful for human insights, so there is one regression model per feature per variant. Each model fits two parameters, a multiplicative weight and a bias, along with the posterior covariance of those estimates. The bias is learned only to normalize the data and ensure that the multiplicative weight represents a linear trend. Trends only become insights if they are statistically significant, which is determined by a two-sided T-test with the null hypothesis that the multiplicative weight of the feature is zero. The test uses the variance of the multiplicative weight (representing the uncertainty in that estimate) learned by the Bayesian model. The prior for the weights is zero-mean, with the prior variance of the multiplicative weight set to the reciprocal of the variance of the corresponding data feature. This prior corresponds to the assumption that the features are independent, have no relationship to the click rate, and should therefore contribute equally to the variance of the regression output.
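A minimal sketch of one such per-feature model, written in terms of running sums so that it fits a streaming setting, might look like the following. Several pieces are assumptions rather than details of Honest AB: the noise variance (noise_var) is treated as known, the bias gets a broad prior (bias_prior_var), and a normal approximation stands in for the T-test's reference distribution when computing the two-sided p-value. All names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class FeatureStats:
    """Running sums that fully determine the posterior for one feature."""
    n: int = 0
    sx: float = 0.0   # sum of x
    sxx: float = 0.0  # sum of x^2
    sy: float = 0.0   # sum of y
    sxy: float = 0.0  # sum of x*y

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sxx += x * x
        self.sy += y
        self.sxy += x * y


def weight_posterior(stats, noise_var=0.25, bias_prior_var=100.0):
    """Return (weight mean, weight variance, two-sided p-value for weight == 0).

    noise_var and bias_prior_var are assumed values, not taken from the service.
    """
    if stats.n == 0:
        raise ValueError("no data")
    mean_x = stats.sx / stats.n
    var_x = stats.sxx / stats.n - mean_x ** 2
    if var_x <= 0.0:
        raise ValueError("feature has no variance")

    # Zero-mean prior; weight prior variance is 1 / var_x, so its precision is var_x.
    prec_w = var_x
    prec_b = 1.0 / bias_prior_var

    # Posterior precision matrix [[a, b], [b, d]] for the (weight, bias) pair.
    a = prec_w + stats.sxx / noise_var
    b = stats.sx / noise_var
    d = prec_b + stats.n / noise_var
    det = a * d - b * b

    # Posterior covariance entries needed here (inverse of the 2x2 precision).
    cov_ww = d / det
    cov_wb = -b / det

    # Weight component of the posterior mean: covariance times X^T y / noise_var.
    w_mean = cov_ww * (stats.sxy / noise_var) + cov_wb * (stats.sy / noise_var)

    # Test statistic for the null hypothesis that the weight is zero.
    t_stat = w_mean / math.sqrt(cov_ww)
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))  # normal approximation
    return w_mean, cov_ww, p_value
```

Because the prior depends on the feature variance, which is itself recoverable from the running sums, the posterior can be recomputed at query time without revisiting any raw data.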
(Miller): http://www.evanmiller.org/bayesian-ab-testing.html