---
title: Almost Stochastic Order
emoji: ⚖️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 3.13.0
app_file: app.py
pinned: false
tags:
- evaluate
- comparison
description: >-
  The Almost Stochastic Order test is a non-parametric test that assesses the
  difference between two prediction distributions through their Wasserstein
  distance.
---
# Comparison Card for Almost Stochastic Order

The Almost Stochastic Order test is a non-parametric test that assesses to what extent two distributions of predictions differ by measuring the Wasserstein distance between them. It can be used to compare the predictions of two models, and is especially useful for neural networks, since it compares the two full distributions rather than just their means. When model 1 produces overall higher predictions than model 2, the test statistic, called the violation ratio, will be less than 0.5.
This version of the test computes a frequentist upper bound to the violation ratio given a pre-specified confidence level (0.95 by default). For more information, refer to the README of the deep-significance package or the relevant publications (Dror et al., 2019; Ulmer et al., 2022).
## Comparison description

The Almost Stochastic Order test is a non-parametric test that measures to what extent two distributions of predictions differ from each other via their Wasserstein distance. It can be used to compare the predictions of two models.
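For intuition, the core quantity can be sketched in plain Python. This is a simplified illustration, not the module's actual implementation: it computes the raw violation ratio from empirical quantile functions and omits the bootstrap-based frequentist upper bound; `empirical_quantile` and `violation_ratio` are hypothetical helper names.

```python
def empirical_quantile(scores, t):
    """Inverse of the empirical CDF: the t-quantile of the scores."""
    ordered = sorted(scores)
    idx = min(int(t * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def violation_ratio(scores_a, scores_b, dt=0.005):
    """Share of the squared Wasserstein distance coming from quantiles
    where scores_a fails to dominate scores_b (0 = full dominance of
    scores_a, 1 = full dominance of scores_b, ~0.5 = no order)."""
    total = violation = 0.0
    steps = int(1.0 / dt)
    for i in range(1, steps):
        t = i * dt
        diff = empirical_quantile(scores_a, t) - empirical_quantile(scores_b, t)
        total += diff ** 2 * dt
        if diff < 0:  # model 1's quantile lies below model 2's: dominance violated
            violation += diff ** 2 * dt
    # Identical distributions yield no Wasserstein mass; report 0.5 (no order).
    return violation / total if total > 0 else 0.5
```

When the first sample dominates the second at every quantile, the ratio is 0; when it is dominated everywhere, the ratio is 1.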
## How to use
The Almost Stochastic Order comparison is used to analyze any kind of real-valued predictions.
### Inputs
Its arguments are:

`predictions1` (`list`): a list of predictions from the first model.

`predictions2` (`list`): a list of predictions from the second model.

Its keyword arguments are:

`confidence_level` (`float`): the confidence level under which the result is obtained. Default is 0.95.

`num_bootstrap_iterations` (`int`): number of bootstrap iterations used to compute the upper bound to the test statistic. Default is 1000.

`dt` (`float`): differential for `t` during the numerical integration. Default is 0.005.

`num_jobs` (`int`): number of jobs to use for the test. If `None`, this defaults to the value specified in the `num_process` attribute.

`show_progress` (`bool`): if `True`, a progress bar is shown when computing the test statistic. Default is `False`.

`seed` (`int`): seed for reproducibility purposes. If `None`, this defaults to the value specified in the `seed` attribute.
## Output values

The Almost Stochastic Order comparison outputs a single scalar:

`violation_ratio` (`float`, between 0 and 1): a frequentist upper bound to the degree of violation of the stochastic order. When it is smaller than 0.5, the model producing `predictions1` performs better than the other model at the confidence level specified by the `confidence_level` argument (default is 0.95).
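As a reading aid, the decision rule described above can be written out as a small helper. `interpret_aso` is illustrative only and not part of the module's API; the 0.2 threshold is the more conservative cutoff recommended by Ulmer et al. (2022).

```python
def interpret_aso(violation_ratio, threshold=0.5):
    """Map a violation ratio to a verbal conclusion.

    threshold: 0.5 is the theoretical decision boundary; Ulmer et al.
    (2022) recommend the more conservative value 0.2.
    """
    if violation_ratio < threshold:
        return "model 1 is almost stochastically greater than model 2"
    if violation_ratio > 1.0 - threshold:
        return "model 2 is almost stochastically greater than model 1"
    return "no almost stochastic order detected"
```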
## Examples
Example comparison:

```python
>>> import evaluate
>>> aso = evaluate.load("kaleidophon/almost_stochastic_order")
>>> results = aso.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
>>> print(results)
{'violation_ratio': 1.0}
```
## Limitations and bias
The Almost Stochastic Order test is a non-parametric test, so it comes with no assumptions about the underlying distribution of predictions.
We identify the following limitations:

- Even though a violation ratio below 0.5 is in principle sufficient to reject the null hypothesis, Ulmer et al. (2022) recommend rejecting it only when `violation_ratio` is below 0.2.
- Since the test involves a bootstrapping procedure, results may vary between function calls. To make results reproducible, set a seed via the `seed` argument.
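Why seeding stabilizes the result can be sketched with a toy bootstrap. This is a sketch only, using Python's `random.Random` as a stand-in for the module's internal resampling; `bootstrap_resamples` is a hypothetical helper, not part of the module's API.

```python
import random

def bootstrap_resamples(scores, num_iterations, seed=None):
    """Draw bootstrap resamples; a fixed seed makes them deterministic."""
    rng = random.Random(seed)
    return [rng.choices(scores, k=len(scores)) for _ in range(num_iterations)]

scores = [0.1, 0.4, 0.35, 0.8]
# Same seed -> identical resamples, hence an identical test statistic.
assert bootstrap_resamples(scores, 5, seed=1234) == bootstrap_resamples(scores, 5, seed=1234)
```

With `seed=None`, each call draws fresh resamples, so the upper bound to the violation ratio can fluctuate from run to run.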
## Citations
```bibtex
@article{ulmer2022deep,
  title   = {deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks},
  author  = {Ulmer, Dennis and Hardmeier, Christian and Frellsen, Jes},
  journal = {arXiv preprint arXiv:2204.06815},
  year    = {2022}
}

@inproceedings{dror2019deep,
  author    = {Rotem Dror and Segev Shlomov and Roi Reichart},
  editor    = {Anna Korhonen and David R. Traum and Llu{\'{\i}}s M{\`{a}}rquez},
  title     = {Deep Dominance - How to Properly Compare Deep Neural Models},
  booktitle = {Proceedings of the 57th Conference of the Association for Computational Linguistics, {ACL} 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers},
  pages     = {2773--2785},
  publisher = {Association for Computational Linguistics},
  year      = {2019}
}
```