=====================
Detecting regressions
=====================

Goals and requirements
======================

When preparing updates, it is a best practice to ensure that the update
does not introduce any regression in any of the QA checks. The concept
of a regression is thus to be managed at the level of the QA workflow.

Some regressions can be identified from the result of the respective
QA tasks alone. An ``autopkgtest`` that used to pass, and now fails,
is an obvious example. However, when both runs are failing, you might
want to dig deeper and see whether the same set of tests is failing. If
we can identify new tests that are failing, we still have a regression.

Conversely, a lintian analysis might indicate a success on both
versions, but generate new warning-level tags that we might want to report
as a regression.

Other considerations:

* We want the QA workflow to have a summary view showing, side by side, the
  results of the "QA tasks" on the original and updated package(s) along
  with a conclusion on whether it's a regression or not. It should be
  possible to have a detailed view of a specific comparison, showing
  a comparison of the artifacts generated by the QA tasks.
* We want that table to be updated regularly, every time a QA task
  finishes, without waiting for the completion of all QA tasks.
* We want to be able to configure the ``debian_pipeline`` workflow
  so that any failure or any regression in the ``qa`` workflow requires a
  manual confirmation to continue the root workflow. (Right now, the result
  of the ``qa`` workflow has no impact on, for example, the
  ``package_upload`` workflow.)

Implementation of QA results regression analysis
================================================

The ``output_data`` of a :workflow:`qa` workflow has a new
``regression_analysis`` key which is a dictionary of such analyses. Each
key is the name of a test (e.g. ``autopkgtest:dpkg:amd64``) without any
version, and the value is the result of the analysis, defined as another
dictionary with the following keys:

* ``original_artifact_id`` (optional, can be set later when the QA result is
  available): ID of the original artifact used for the comparison
* ``new_artifact_id`` (optional, can be set later when the QA result is
  available): ID of the new artifact used for the comparison
* ``status`` (required): one of the following string values:

  * ``no-result``: when the comparison has not been completed yet (usually
    because we lack one of the two required QA results)
  * ``error``: when the comparison (or one of the required QA tasks) errored out
  * ``improvement``: when the new QA result is better than the original
    QA result
  * ``stable``: when the new QA result is neither better nor worse than
    the original QA result
  * ``regression``: when the new QA result is worse than the original QA
    result

* ``details`` (optional): an arbitrarily nested data structure composed of
  lists and dictionaries where dictionary keys and leaf items (and/or leaf
  item values) are always strings. The expectation is that this structure is
  rendered as nested lists behind a collapsible section that can be
  unfolded to learn more about the analysis. The strings are HTML-escaped
  when rendered.
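
For illustration, here is a hypothetical excerpt of such an
``output_data`` (the artifact IDs, test names and the ``details``
structure are made up)::

    "regression_analysis": {
        "autopkgtest:dpkg:amd64": {
            "original_artifact_id": 1042,
            "new_artifact_id": 1057,
            "status": "regression",
            "details": {
                "new failing tests": ["upstream-tests"],
            },
        },
        "lintian:dpkg:source": {
            "original_artifact_id": 1043,
            "status": "no-result",
        },
    }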

The regression analysis can lead to multiple results:

* ``no-result``: when we are lacking one of the QA results for the comparison
* ``improvement``: when the new QA result is "success" while the reference
  one is "failure"
* ``stable``: when the two QA results are the same (both "success" or both
  "failure")
* ``regression``: when the new QA result is "failure" while the reference
  one is "success"
* ``error``: when one of the work requests providing the required QA results
  errored out
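
As a minimal sketch of this result-level logic (the function name and the
result constants are illustrative, not the actual implementation)::

    def compare_results(original, new):
        # One of the two QA results is not available (yet)
        if original is None or new is None:
            return "no-result"
        # A work request that errored out prevents any conclusion
        if original.result == ERROR or new.result == ERROR:
            return "error"
        if original.result == SUCCESS and new.result == FAILURE:
            return "regression"
        if original.result == FAILURE and new.result == SUCCESS:
            return "improvement"
        return "stable"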

Details can also be provided as output of the analysis; they will
typically be displayed in the summary view of the :workflow:`qa` workflow.

The regression analysis for a :workflow:`qa` workflow contains a ``""`` key
with a top-level summary.  This is mainly useful to help the UI establish
which source versions are being compared.
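
For example, that summary entry might look like this (the field names
inside ``details`` are purely illustrative)::

    "": {
        "status": "regression",
        "details": {
            "original version": "dpkg_1.2.0",
            "new version": "dpkg_1.2.1",
        },
    }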

The first level of comparison is at the level of the ``result``
of the ``WorkRequest``, following the logic above. But depending on the
output of the QA task, it is possible to have a finer-grained analysis.
The next sections detail how those deeper comparisons are performed.

For lintian
-----------

We compare the ``summary.tags_count_by_severity`` to determine the
status of the regression analysis::

    SEVERITIES = ("warning", "error")
    if any(new_count[s] > original_count[s] for s in SEVERITIES):
        return "regression"
    elif any(new_count[s] < original_count[s] for s in SEVERITIES):
        return "improvement"
    else:
        return "stable"

We also perform a comparison of the ``summary.tags_found`` to indicate
in the ``details`` field which new tags have been reported, and which tags
have disappeared.
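
A minimal sketch of that tag comparison, assuming ``summary.tags_found``
is a list of tag names (the wording of the ``details`` keys is
illustrative)::

    original_tags = set(original_summary["tags_found"])
    new_tags = set(new_summary["tags_found"])
    details = {
        "new tags": sorted(new_tags - original_tags),
        "tags that disappeared": sorted(original_tags - new_tags),
    }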

.. note::

   Among the tags that differ, there can be tags whose severities are
   lower than warning and error, but we have no way to filter them out
   without loading the full ``analysis.json`` from the artifact, which
   would be much more costly for almost no gain.


For autopkgtest
---------------

We compare the result of each individual test in the ``results`` key
of the artifact metadata. Each result is classified on its own following
the table below; the first row that matches ends the classification
process:

.. list-table::
   :header-rows: 1

   * - ORIGINAL
     - NEW
     - RESULT
   * - ``*``
     - FLAKY
     - stable
   * - PASS, SKIP
     - FAIL
     - regression
   * - FAIL, FLAKY
     - PASS, SKIP
     - improvement
   * - ``*``
     - ``*``
     - stable

Each individual regression or improvement is noted and documented in the
``details`` field of the analysis.
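
A minimal sketch of that per-test classification (the helper name is
illustrative)::

    def classify_test(original_status, new_status):
        # The first matching row of the table above wins
        if new_status == "FLAKY":
            return "stable"
        if original_status in ("PASS", "SKIP") and new_status == "FAIL":
            return "regression"
        if original_status in ("FAIL", "FLAKY") and new_status in ("PASS", "SKIP"):
            return "improvement"
        return "stable"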

To compute the global result of the regression analysis, the logic is the
following::

    if "regression" in comparison_of_tests:
        return "regression"
    elif "improvement" in comparison_of_tests:
        return "improvement"
    else:
        return "stable"

For piuparts and blhc
---------------------

The provided metadata do not allow for deep comparisons, so the comparison
is based on the ``result`` of the corresponding WorkRequest (which is
duplicated in the per-item data of the :collection:`debian:qa-results`
collection).

The algorithm is the following::

    if original.result == SUCCESS and new.result == FAILURE:
        return "regression"
    elif original.result == FAILURE and new.result == SUCCESS:
        return "improvement"
    else:
        return "stable"

About the UI to display regression analysis
===========================================

Here's an example of what the table could look like:

.. list-table::
   :header-rows: 1

   * - Test name
     - Original result for dpkg_1.2.0
     - New result for dpkg_1.2.1
     - Conclusion
   * - autopkgtest:dpkg_amd64
     - ✅
     - ❌
     - ↘️  regression
   * - lintian:dpkg_source
     - ✅
     - ✅
     - ➡️  stable
   * - piuparts:dpkg_amd64
     - ✅
     - ❔
     - ❔ no-result
   * - autopkgtest:apt_amd64
     - ❌
     - ✅
     - ↗️  improvement
   * - **Summary**
     - 1 failure
     - 1 failure, 1 missing result
     - ↘️  regression

A few comments about the desired table:

* We should use the standard WorkRequest result widgets instead of the special
  characters (✅ and ❌) shown above.
* We want to put links to the artifact for each QA result in the "Original
  result" and "New result" columns.
* The number of autopkgtest results generated by the
  ``reverse_dependencies_autopkgtest`` workflow can be overwhelming. For that
  reason, the autopkgtest lines that concern source packages other than the
  one processed in the current workflow are hidden when the regression
  analysis result is "stable" or "no-result" (see the sketch below).
