The perf bot sheriff is responsible for keeping the bots on the chromium.perf waterfall up and running, and triaging performance test failures and flakes.
### Keeping the chromium.perf waterfall green
The primary responsibility of the perfbot health sheriff is to keep the chromium.perf waterfall green.
#### Understanding the Waterfall State
Everyone can view the chromium.perf waterfall at https://build.chromium.org/p/chromium.perf/, but Googlers should use [https://uberchromegw.corp.google.com/i/chromium.perf/](https://uberchromegw.corp.google.com/i/chromium.perf/) instead. The reason is that, to make the performance tests as realistic as possible, the chromium.perf waterfall runs release official builds of Chrome. But the logs from release official builds may leak information from our partners that we do not have permission to share outside of Google, so the logs are available to Googlers only. To avoid manually rewriting the URL when switching between the upstream and downstream views of the waterfall and bots, you can install the Chromium Waterfall View Switcher extension, which adds a switching button to Chrome's URL bar.
Note that there are four different views:
You can see a list of all previously filed bugs using the Performance-BotHealth label in crbug.
Please also check the recent perf-sheriffs@chromium.org postings for important announcements about bot turndowns and other known issues.
You want to keep the waterfall green! So any bot that is red or purple needs to be investigated. When a test fails:
File a bug using this template. You'll want to be sure to include:
Disable the failing test if it is failing more than one out of five runs (see below for instructions on telemetry and other types of tests). Make sure your disable CL includes a BUG= line with the bug from step 1, and cc the test owner on the bug.
##### Disabling Telemetry Tests
If the test is a telemetry test, its name will have a '.' in it, such as `thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the first dot is the name of a Python file in `tools/perf/benchmarks/`.
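For instance, the name-to-file mapping can be computed mechanically (the helper below is purely illustrative and not part of the Chromium tooling):

```python
# Illustrative helper: the text before the first '.' in a Telemetry test
# name is the Python module under tools/perf/benchmarks/ that defines it.
def benchmark_module(test_name):
    prefix = test_name.split('.', 1)[0]
    return 'tools/perf/benchmarks/%s.py' % prefix

print(benchmark_module('thread_times.key_mobile_sites'))
# tools/perf/benchmarks/thread_times.py
```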
If a telemetry test is failing and there is no clear culprit to revert immediately, disable the test. You can do this with the `@benchmark.Disabled` decorator. Always add a comment next to your decorator with the bug id which has background on why the test was disabled, and also include a BUG= line in the CL. Please disable the narrowest set of bots possible; for example, if the benchmark only fails on Windows Vista you can use `@benchmark.Disabled('vista')`.
Supported disabled arguments include:

* win
* mac
* chromeos
* linux
* android
* vista
* win7
* win8
* yosemite
* elcapitan
* all (please use as a last resort)

If the test fails consistently in a very narrow set of circumstances, you may consider implementing a ShouldDisable method on the benchmark instead. Here is an example of disabling a benchmark which OOMs on svelte.
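The decorator and `ShouldDisable` patterns look roughly like this. The real `telemetry.benchmark` module is only importable inside a Chromium checkout, so the sketch below uses a stand-in decorator; the class name and bug id are hypothetical:

```python
# Stand-in for telemetry.benchmark.Disabled; illustrative only. The real
# decorator similarly records which platforms a benchmark is disabled on.
def Disabled(*platforms):
    def decorator(cls):
        cls.disabled_platforms = platforms
        return cls
    return decorator


@Disabled('vista')  # crbug.com/123456 (hypothetical bug id)
class ThreadTimesKeyMobileSites(object):
    """Hypothetical benchmark, disabled only on Windows Vista."""

    @classmethod
    def ShouldDisable(cls, possible_browser):
        # For narrow, environment-specific failures (such as an OOM on
        # svelte devices), a run-time check can disable the benchmark
        # only where it actually fails.
        return possible_browser.platform.IsSvelte()
```

Note how the crbug reference sits right next to the decorator, so the next sheriff can find the background with one click.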
Disabling CLs can be TBR-ed to anyone in `tools/perf/OWNERS`, but please do not submit with `NOTRY=true`.
Non-telemetry tests are configured in `chromium.perf.json`. You can TBR any of the per-file OWNERS, but please do not submit with `NOTRY=true`.
#### Handling Device and Bot Failures
When a bot goes purple, it's usually because of an infrastructure failure outside of the tests. But you should first check the logs of a purple bot to try to better understand the problem; sometimes a telemetry test failure can turn the bot purple, for example.
If the bot goes purple and you believe it's an infrastructure issue, file a bug with this template, which will automatically add the bug to the trooper queue. Be sure to note which step is failing, and paste any relevant info from the logs into the bug.
There are two types of device failures:

* A device is down (blacklisted in the `device_status_check` step). You can look at the buildbot status page to see how many devices were listed as online during this step. You should always see 7 devices online; if you see fewer than 7 devices online, there is a problem in the lab.
* A device is passing `device_status_check` but is still in poor health. The symptom of this is that all the tests are failing on it. You can see that on the buildbot status page by looking at the Device Affinity of the test. If all tests with the same device affinity number are failing, it's probably a device failure.

For both types of failures, please file a bug with this template, which will add an issue to the infra labs queue.
If you need help triaging, here are the common labels you should use:

* `Cr-Tests-AutoBisect` for bisect and perf try job failures.
If you still need help, ask the speed infra chat, or escalate to sullivan@.
Pri-0 bugs should have an owner or contact on speed infra team and be worked on as top priority. Pri-0 generally implies an entire waterfall is down.
Pri-1 bugs should be pinged daily, and checked to make sure someone is following up. Pri-1 bugs are for a red test (not yet disabled), purple bot, or failing device.
Pri-2 bugs are for disabled tests. These should be pinged weekly, and work towards fixing should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the list of Pri-2 bugs that have not been pinged in a week.
### Triaging data stoppage alerts
Data stoppage alerts are listed on the perf dashboard alerts page. Whenever the dashboard is monitoring a metric, and that metric stops sending data, an alert is fired. Some of these alerts are expected:
If there doesn't seem to be a valid reason for the alert, file a bug on it using the perf dashboard, and cc the owner. Then do some diagnosis:
* Click the `buildbot stdio` link in the tooltip to find the buildbot status page for the last good build, and increment the build number to get the first build with no data; note that in the bug as well. Check for any changes to the test in the revision range.
* Look at the `json.output` link on the buildbot status page for the test. This is the data the test sent to the perf dashboard. Are there null values? Sometimes it lists a reason as well. Please put your finding in the bug.
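When eyeballing a large `json.output` blob, a quick programmatic scan for null values can save time. A minimal sketch (the nested structure shown is an assumption about the chart-json layout, not the exact schema):

```python
import json

def find_nulls(obj, path=''):
    # Yield a dotted path for every null value in parsed chart JSON.
    if obj is None:
        yield path or '<root>'
    elif isinstance(obj, dict):
        for key, value in obj.items():
            new_path = '%s.%s' % (path, key) if path else key
            for p in find_nulls(value, new_path):
                yield p
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            for p in find_nulls(value, '%s[%d]' % (path, i)):
                yield p

# Hypothetical fragment of a test's json.output:
data = json.loads('{"charts": {"warm_times": {"summary": {"values": [12.3, null]}}}}')
print(list(find_nulls(data)))
# ['charts.warm_times.summary.values[1]']
```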