Overview¶
What is Socorro?¶
Socorro is a crash ingestion pipeline.
The crash ingestion pipeline that we have at Mozilla looks like this:

[Diagram: the crash ingestion pipeline. Arrow direction represents connections to access services.]
Important services in the diagram:
Collector: Collects incoming crash reports via HTTP POST. It throttles crash reports: some get accepted and some get rejected. It generates a crash id for each crash report. It parses the HTTP POST payload into a raw crash with crash annotations and minidump files. It saves this crash data to AWS S3 and publishes crash ids to AWS SQS for processing.
Processor: Processes crash reports, normalizes and validates data, extracts data from minidumps, generates crash signatures, performs other analysis, and saves everything as a processed crash to AWS S3 and Elasticsearch.
Webapp (aka Crash Stats): Web user interface for looking at, searching, and analyzing processed crash data.
Crontabber: Runs periodic housekeeping tasks using the cronrun Django command.
The collector we use is called Antenna and the code is in https://github.com/mozilla-services/antenna/.
The processor, webapp, and crontabber services are in the Socorro repository at https://github.com/mozilla-services/socorro/.
Let’s take a more detailed tour through the crash ingestion pipeline!
A tour through the crash ingestion pipeline¶
Crash report generated by a crash reporter¶
When Firefox crashes, the crash reporter collects information about the crash (stack, register contents, bits of heap) and generates a minidump. The crash reporter also captures crash annotations. The crash annotations plus zero or more minidumps are collectively called a crash report.
Depending on what kind of crash just happened, a crash reporter dialog may prompt the user for additional information and whether the user wants to send the crash report to Mozilla.
If the user says “yes” or has opted-in to sending crash reports [1], the crash reporter will send the crash report as a multipart/form-data payload via an HTTP POST to the crash ingestion pipeline collector.
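Here is a hedged sketch of what that submission can look like on the wire, written with the third-party requests library. The endpoint URL, annotation values, and minidump field name are illustrative placeholders, not the real submission parameters:

# Illustrative sketch: submit annotations as form fields plus a minidump
# file as a multipart/form-data HTTP POST.
import requests

annotations = {
    "ProductName": "Firefox",
    "Version": "115.0",
    "ReleaseChannel": "release",
}
files = {
    # requests encodes entries in `files` as multipart file parts
    "upload_file_minidump": ("upload_file_minidump", open("crash.dmp", "rb")),
}

resp = requests.post(
    "https://crash-reports.example.com/submit",  # placeholder collector URL
    data=annotations,
    files=files,
    timeout=30,
)
# On acceptance, the collector's response body contains the crash id.
print(resp.status_code, resp.text)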
This process is complicated: each product and platform has different crash reporters, crash annotations, and crash reporter dialogs, and this code is spread out across a bunch of repositories.
See also
- Breakpad overview
https://chromium.googlesource.com/breakpad/breakpad/+/master/docs/getting_started_with_breakpad.md
- minidump
https://docs.microsoft.com/en-us/windows/win32/debug/minidump-files
- Crash reporter documentation
https://firefox-source-docs.mozilla.org/toolkit/crashreporter/crashreporter/index.html
- Crash report specification
- Crash Annotations
Collected by the Collector¶
The collector (Antenna) is the beginning of the crash ingestion pipeline.
The collector handles the incoming crash reports and does the following:
- assigns the crash report a unique crash id
- adds a submitted timestamp and some other metadata to the crash report
- determines whether Socorro should process this crash report or not
If Socorro shouldn’t process this crash report, then the crash report is rejected and the collector is done.
If Socorro should process this crash report, then the collector returns the crash id to the crash reporter in the HTTP response. The crash reporter records the crash id on the user's machine. The user can see crash reports in about:crashes.
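As a rough illustration of the crash id assignment, here is a minimal sketch that builds an id as a UUID4 whose last seven characters are replaced with a throttle character and the submission date (yymmdd), matching the crash id format described below; the real implementation lives in Antenna and differs in detail:

import datetime
import uuid

def make_crash_id(submitted=None):
    """Generate a crash id embedding a throttle character and yymmdd date."""
    submitted = submitted or datetime.datetime.now(datetime.timezone.utc)
    base = str(uuid.uuid4())
    # last 7 characters: throttle result instruction ("0") + yymmdd
    return base[:-7] + "0" + submitted.strftime("%y%m%d")

print(make_crash_id())  # e.g. 00007bd0-2d1c-4865-af09-80bc02160513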
The collector saves the crash report data to AWS S3 as a raw crash and minidumps in a directory structure like this:
v1/
    raw_crash/
        20160513/
            00007bd0-2d1c-4865-af09-80bc02160513   (crash annotations and collection metadata)
    dump_names/
        00007bd0-2d1c-4865-af09-80bc02160513       (list of minidumps for this crash)
    dump/
        00007bd0-2d1c-4865-af09-80bc02160513       (minidump file)
A crash id looks like this:
de1bb258-cbbf-4589-a673-34f800160918
                             ^^^^^^^
                             ||____|
                             |  yymmdd
                             |
                             throttle result instruction
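A minimal sketch, assuming the format above, of pulling the embedded date back out of a crash id and building the matching raw_crash key from the directory structure shown earlier:

import datetime

def parse_crash_id(crash_id):
    """Return the (throttle_char, date) encoded in the last 7 characters."""
    throttle_char = crash_id[-7]  # throttle result instruction
    date = datetime.datetime.strptime(crash_id[-6:], "%y%m%d").date()
    return throttle_char, date

def raw_crash_key(crash_id):
    """Build the raw_crash storage key for this crash id."""
    _, date = parse_crash_id(crash_id)
    return f"v1/raw_crash/{date:%Y%m%d}/{crash_id}"

print(raw_crash_key("de1bb258-cbbf-4589-a673-34f800160918"))
# v1/raw_crash/20160918/de1bb258-cbbf-4589-a673-34f800160918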
Note that the throttle result instruction character is no longer used and is always set to 0.

The collector then publishes the crash id to AWS SQS for processing.
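A hedged sketch of that publish step with boto3; the queue URL and region are placeholders, and Antenna's actual publishing code differs:

import boto3

# Publish the crash id as a plain-text message for the processor to pick up.
sqs = boto3.client("sqs", region_name="us-west-2")
sqs.send_message(
    QueueUrl="https://sqs.us-west-2.amazonaws.com/123456789012/example-queue",
    MessageBody="de1bb258-cbbf-4589-a673-34f800160918",
)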
See also
- Antenna code
https://github.com/mozilla-services/antenna/
Processed by the Processor¶
The processor pulls crash ids from the AWS SQS queues. It fetches the raw crash and minidumps from AWS S3.
It passes the crash data through the processing pipeline which generates a processed crash.
One of the rules runs the stackwalker on the minidump to extract information about the process and stack. It symbolicates the stack frames and determines some other things about the crash.
Another rule generates a crash signature from the stack of the crashing thread. We use crash signatures to group crashes that have similar symptoms so that we can more easily see trends and causes.
There are other rules, too.
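To make the pipeline shape concrete, here is a minimal sketch of the predicate/action rule pattern; the names and signatures are simplified, and the real rules live in the Socorro repository:

class Rule:
    def predicate(self, raw_crash, processed_crash):
        """Return True if this rule applies to this crash."""
        return True

    def action(self, raw_crash, processed_crash):
        """Transform the processed crash in place."""
        raise NotImplementedError

class CopyProductRule(Rule):
    """Hypothetical rule that copies an annotation into the processed crash."""

    def predicate(self, raw_crash, processed_crash):
        return "ProductName" in raw_crash

    def action(self, raw_crash, processed_crash):
        processed_crash["product"] = raw_crash["ProductName"]

def run_pipeline(rules, raw_crash):
    """Run each applicable rule in order, building up the processed crash."""
    processed_crash = {}
    for rule in rules:
        if rule.predicate(raw_crash, processed_crash):
            rule.action(raw_crash, processed_crash)
    return processed_crash

print(run_pipeline([CopyProductRule()], {"ProductName": "Firefox"}))
# {'product': 'Firefox'}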
After the crash gets through the processing pipeline, the processed crash is saved to several places:
- AWS S3
- Elasticsearch
- AWS S3 (different bucket) to be ingested into Telemetry BigQuery
See also
- Code
- Documentation
- Stack walking
https://chromium.googlesource.com/breakpad/breakpad/+/master/docs/stack_walking.md
- rust-minidump
- Breakpad symbols files format
https://chromium.googlesource.com/breakpad/breakpad/+/master/docs/symbol_files.md
- Mozilla symbols server
- Socorro processor documentation
- Signature generation
Investigated with Webapp aka Crash Stats¶
The webapp is located at https://crash-stats.mozilla.org.
The webapp lets you search through and facet on processed crash data with Super Search.
The webapp shows Top Crashers.
The webapp has a set of APIs for accessing data.
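For example, here is a hedged sketch of hitting the Super Search API with the requests library to facet crashes on signature; treat the parameter and response field names as assumptions and check the API documentation before relying on them:

import requests

resp = requests.get(
    "https://crash-stats.mozilla.org/api/SuperSearch/",
    params={
        "product": "Firefox",
        "_facets": "signature",   # facet the results on crash signature
        "_results_number": 0,     # return facets only, no individual crashes
    },
    timeout=30,
)
resp.raise_for_status()
for facet in resp.json()["facets"]["signature"]:
    print(facet["count"], facet["term"])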
You can create an account in the webapp by logging in.
Administrators can grant you access to protected data in crash reports. Without access to protected data, you can’t see data in crash reports like the URL the user was visiting.
See also
- Code
- Documentation
- Crash Stats user documentation
- Crash Stats Super search
- Crash Stats APIs
- Privacy policy
- Socorro webapp documentation
Housekeeping with cronrun¶
We have a cronrun Django command that acts as a self-healing command runner: it can run any Django command with specified arguments at scheduled times. We use it to run jobs that perform housekeeping functions in the crash ingestion pipeline like:
- updating product/version information for the Beta version lookup
- updating data about bugs associated with crash signatures
- updating “first time we saw this signature” type information
cronrun jobs that fail are re-run. Some cronrun jobs are set up to backfill, so if they fail, they will eventually be re-run for all the scheduled times they missed.
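As a hypothetical illustration of what such a job definition might hold, here is a sketch with made-up command names and fields; the real job definitions live in the Socorro repository and differ in detail:

from dataclasses import dataclass, field

@dataclass
class JobSpec:
    cmd: str                         # Django command to run
    args: list = field(default_factory=list)
    schedule: str = "0 * * * *"      # crontab-style schedule
    backfill: bool = False           # re-run missed time slots after failures

JOBS = [
    JobSpec(cmd="updatebugs", schedule="*/15 * * * *", backfill=True),
    JobSpec(cmd="updateproductversions", args=["--verbose"]),
]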
See also
- Code (Jobs)
- Socorro scheduled tasks (cronrun) documentation
Sent to Telemetry (External system)¶
Socorro exports a subset of crash data to Telemetry where it can be queried. It's in the telemetry.socorro_crash dataset.
The exported data is considered publicly safe: there's no protected data in it.
See Telemetry (telemetry.socorro_crash) for more details.
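For instance, a hedged sketch of querying that dataset with the google-cloud-bigquery client; the column names and the unqualified table reference are assumptions for illustration, so check the dataset schema first:

from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials
query = """
    SELECT signature, COUNT(*) AS crash_count
    FROM `telemetry.socorro_crash`
    GROUP BY signature
    ORDER BY crash_count DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.crash_count, row.signature)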