# Perf Bot Sheriffing

The perf bot sheriff is responsible for keeping the bots on the chromium.perf waterfall up and running, and triaging performance test failures and flakes.

## Key Responsibilities

### Keeping the chromium.perf waterfall green

The primary responsibility of the perf bot sheriff is to keep the chromium.perf waterfall green.

#### Understanding the Waterfall State

Everyone can view the chromium.perf waterfall at https://build.chromium.org/p/chromium.perf/, but Googlers should use [https://uberchromegw.corp.google.com/i/chromium.perf/](https://uberchromegw.corp.google.com/i/chromium.perf/) instead. In order to make the performance tests as realistic as possible, the chromium.perf waterfall runs release official builds of Chrome, and the logs from release official builds may leak information from our partners that we do not have permission to share outside of Google, so the logs are available to Googlers only. To avoid manually rewriting the URL when switching between the upstream and downstream views of the waterfall and bots, you can install the Chromium Waterfall View Switcher extension, which adds a switching button to Chrome's URL bar.

Note that there are several different views:

  1. Console view makes it easier to see a summary.
  2. Waterfall view shows more details, including recent changes.
  3. Firefighter shows traces of recent builds. It takes URL parameter arguments (see the example query after this list):
    • master can be chromium.perf or tryserver.chromium.perf
    • builder can be a builder or tester name, like "Android Nexus5 Perf (2)"
    • start_time is seconds since the epoch.
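For example, a Firefighter query scoped to a single tester might look like this (assuming the usual chromiumperfstats.appspot.com host for Firefighter; the start_time value here is an arbitrary illustration):

```
https://chromiumperfstats.appspot.com/?master=chromium.perf&builder=Android%20Nexus5%20Perf%20(2)&start_time=1458000000
```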

You can see a list of all previously filed bugs using the Performance-BotHealth label in crbug.

Please also check the recent perf-sheriffs@chromium.org postings for important announcements about bot turndowns and other known issues.

#### Handling Test Failures

You want to keep the waterfall green! So any bot that is red or purple needs to be investigated. When a test fails:

  1. File a bug using this template. You'll want to be sure to include:

    • A link to the buildbot status page of the failing build.
    • A copy and paste of the relevant failure snippet from the stdio.
    • CC the test owner from go/perf-owners.
    • The revision range the failure occurred on.
    • A list of all platforms the test fails on.
  2. Disable the failing test if it is failing more than one out of five runs (see below for instructions on disabling telemetry and other types of tests). Make sure your disable CL includes a BUG= line with the bug from step 1 and that the test owner is cc-ed on the bug.

  3. After the disable CL lands, you can downgrade the priority to Pri-2 and update the bug title to something like "Fix and re-enable testname".
  4. Investigate the failure. Some tips for investigating:
    • Debugging telemetry failures
    • If you suspect a specific CL in the range, you can revert it locally and run the test on the perf trybots.
    • You can run a return code bisect to narrow down the culprit CL:
      1. Open up the graph in the perf dashboard on one of the failing platforms.
      2. Hover over a data point and click the "Bisect" button on the tooltip.
      3. Type in the Bug ID from step 1, set Good Revision to the last commit position that produced data, set Bad Revision to the most recent commit position, and set Bisect mode to return_code.
    • On Android and Mac, you can view platform-level screenshots of the device screen for failing tests; links to them are printed in the logs. Often this immediately reveals failure causes that are opaque from the logs alone. On other platforms, DevTools will produce tab screenshots as long as the tab did not crash.

##### Disabling Telemetry Tests

If the test is a telemetry test, its name will have a '.' in it, such as thread_times.key_mobile_sites or page_cycler.top_10. The part before the first dot corresponds to a Python file in tools/perf/benchmarks; for example, thread_times.key_mobile_sites is defined in tools/perf/benchmarks/thread_times.py.

If a telemetry test is failing and there is no clear culprit to revert immediately, disable the test. You can do this with the @benchmark.Disabled decorator. Always add a comment next to your decorator with the bug id which has background on why the test was disabled, and also include a BUG= line in the CL.

Please disable the narrowest set of bots possible; for example, if the benchmark only fails on Windows Vista you can use @benchmark.Disabled('vista'). The decorator accepts platform names such as 'win', 'mac', 'linux', and 'android', specific OS versions such as 'vista', and 'all' to disable the benchmark everywhere.
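As a minimal sketch, a disable CL for a benchmark in tools/perf/benchmarks might look like the following (the class name and bug id are illustrative, not real):

```python
# tools/perf/benchmarks/thread_times.py (illustrative sketch)
from telemetry import benchmark


# Fails on Windows Vista bots; see crbug.com/123456 (hypothetical bug id)
# for background on why this benchmark is disabled.
@benchmark.Disabled('vista')
class ThreadTimesKeyMobileSites(benchmark.Benchmark):
  """Illustrative benchmark class; real definitions carry more machinery."""
```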

If the test fails consistently in a very narrow set of circumstances, you may consider implementing a ShouldDisable method on the benchmark instead, as sketched below. Here is an example of disabling a benchmark which OOMs on svelte.
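A minimal sketch of that pattern, assuming a telemetry Benchmark subclass and a platform-level IsSvelte() check like the one used in the example above:

```python
from telemetry import benchmark


class IllustrativeMemoryBenchmark(benchmark.Benchmark):
  @classmethod
  def ShouldDisable(cls, possible_browser):
    # Skip svelte (low-memory Android) configurations, where this
    # benchmark runs out of memory; see the tracking bug for details.
    return possible_browser.platform.IsSvelte()
```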

CLs that disable tests can be TBR-ed to anyone in tools/perf/OWNERS, but please do not submit with NOTRY=true.

##### Disabling Other Tests

Non-telemetry tests are configured in chromium.perf.json. You can TBR any of the per-file OWNERS, but please do not submit with NOTRY=true.

#### Handling Device and Bot Failures

##### Purple bots

When a bot goes purple, it's usually because of an infrastructure failure outside of the tests. But you should first check the logs of a purple bot to try to better understand the problem; sometimes a telemetry test failure can turn the bot purple, for example.

If the bot goes purple and you believe it's an infrastructure issue, file a bug with this template, which will automatically add the bug to the trooper queue. Be sure to note which step is failing, and paste any relevant info from the logs into the bug.

##### Android Device failures

There are two types of device failures:

  1. A device is blacklisted in the device_status_check step. You can look at the buildbot status page to see how many devices were listed as online during this step. You should always see 7 devices online. If you see fewer than 7 devices online, there is a problem in the lab.
  2. A device is passing device_status_check but still in poor health. The symptom of this is that all the tests are failing on it. You can see that on the buildbot status page by looking at the Device Affinity. If all tests with the same device affinity number are failing, it's probably a device failure.

For both types of failures, please file a bug with this template which will add an issue to the infra labs queue.

If you need help triaging, here are the common labels you should use:

#### Follow up on failures

Pri-0 bugs should have an owner or contact on the speed infra team and be worked on as a top priority. Pri-0 generally implies an entire waterfall is down.

Pri-1 bugs should be pinged daily, and checked to make sure someone is following up. Pri-1 bugs are for a red test (not yet disabled), purple bot, or failing device.

Pri-2 bugs are for disabled tests. These should be pinged weekly, and work toward fixing them should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the list of Pri-2 bugs that have not been pinged in a week.

### Triaging data stoppage alerts

Data stoppage alerts are listed on the perf dashboard alerts page. Whenever the dashboard is monitoring a metric, and that metric stops sending data, an alert is fired. Some of these alerts are expected:

If there doesn't seem to be a valid reason for the alert, file a bug on it using the perf dashboard, and cc the owner. Then do some diagnosis: