Overview

What is Socorro?

Socorro is a crash ingestion pipeline.

The crash ingestion pipeline that we have at Mozilla looks like this:

digraph G {
   rankdir=LR;
   splines=line;

   subgraph coll {
      rank=same;

      client [shape=box3d, label="firefox"];
      collector [shape=rect, label="collector"];
   }

   subgraph stores {
      rank=same;

      s3raw [shape=tab, label="S3 (Raw)", style=filled, fillcolor=gray];
      sqs [shape=tab, label="SQS", style=filled, fillcolor=gray];
   }

   processor [shape=rect, label="processor"];

   subgraph stores2 {
      rank=same;

      postgres [shape=tab, label="Postgres", style=filled, fillcolor=gray];
      elasticsearch [shape=tab, label="Elasticsearch", style=filled, fillcolor=gray];
      s3telemetry [shape=tab, label="S3 (Telemetry)", style=filled, fillcolor=gray];
      s3processed [shape=tab, label="S3 (Processed)", style=filled, fillcolor=gray];
   }

   subgraph processing {
      rank=same;

      crontabber [shape=rect, label="crontabber"];
      webapp [shape=rect, label="webapp"];
      telemetry [shape=rect, label="telemetry"];
   }


   client -> collector [label="HTTP"];
   collector -> s3raw [label="save raw"];
   collector -> sqs [label="publish"];

   sqs -> processor [label="crash id"];
   s3raw -> processor [label="load raw"];
   postgres -> processor [label="betaversionrule"];
   processor -> { s3processed, elasticsearch, s3telemetry } [label="save processed"];

   postgres -> webapp;
   webapp -> postgres;
   s3raw -> webapp [label="load raw"];
   s3processed -> webapp [label="load processed"];
   elasticsearch -> webapp;

   postgres -> crontabber;
   crontabber -> postgres;
   elasticsearch -> crontabber;

   s3telemetry -> telemetry [label="telemetry ingestion"];

   { rank=min; client; }
}

Arrow direction represents actions and flow of information through the ingestion pipeline.

Important services in the diagram:

  • Collector: Collects incoming crash reports via HTTP POST. It saves crash data to AWS S3 and publishes crash ids to AWS SQS for processing.

  • Processor: Processes crashes. It extracts data from minidumps, generates crash signatures, performs other analysis, and saves everything as a processed crash.

  • Webapp (aka Crash Stats): Web user interface for analyzing crash data.

  • Crontabber: Runs periodic housekeeping tasks.

The collector we use is called Antenna; the code is at https://github.com/mozilla-services/antenna/.

The processor, webapp, and crontabber services are in the Socorro repository.

Let’s take a tour through the crash ingestion pipeline!

A tour through the crash ingestion pipeline

Breakpad-style crash report generated by a crash reporter

When Firefox crashes, the Breakpad client assembles information about the crash into a minidump. The crash reporter dialog prompts the user for additional information and asks whether to send the crash report to Mozilla.

If the user presses “Send crash report”, then the crash reporter sends the crash report as a multipart/form-data payload via an HTTP POST to the collector.
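
To make the payload concrete, here is a minimal sketch of such a submission in Python. The collector URL is hypothetical, and the field names ("ProductName", "Version", "upload_file_minidump") follow common Breakpad conventions rather than an exhaustive list, so treat them as illustrative:

import requests

# Crash annotations go in as ordinary form fields; each minidump is
# attached as a file part.  requests builds the multipart/form-data
# payload for us.
with open("minidump.dmp", "rb") as fp:
    resp = requests.post(
        "https://crash-reports.example.com/submit",  # hypothetical collector URL
        data={"ProductName": "Firefox", "Version": "100.0"},
        files={"upload_file_minidump": ("minidump.dmp", fp)},
    )

# On success, the collector's response contains the assigned crash id.
print(resp.text)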

At Mozilla, this is a bit complicated: each product and platform has its own Breakpad client bits and crash reporter, and that code is spread across a bunch of repositories.

Collected by the Collector

The collector (Antenna) is the beginning of the crash ingestion pipeline.

The collector handles the incoming crash reports and does the following:

  1. assigns the crash report a unique crash id

  2. adds a submitted timestamp to the crash report

  3. figures out whether Socorro should process this crash report or not

If Socorro shouldn’t process this crash report, then the crash report is rejected and the collector is done.

If Socorro should process this crash report, then the collector will return the crash id to the crash reporter in the HTTP response. The crash reporter records the crash id on the user’s machine. The user can see crash reports in about:crashes.
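
The decision in step 3 is called throttling. Purely as an illustration of the idea (the real rules live in the collector's configuration; the channel names and sampling rate here are made up for the example):

import random

def throttle(raw_crash):
    # Accept everything from pre-release channels; sample the rest.
    if raw_crash.get("ReleaseChannel") in ("nightly", "beta"):
        return "ACCEPT"
    return "ACCEPT" if random.random() < 0.1 else "REJECT"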

The collector then saves the crash report data to Amazon S3 as a raw crash in a directory structure like this:

v2/
  raw_crash/
    000/
      20160513/
        00007bd0-2d1c-4865-af09-80bc02160513    raw crash metadata
v1/
  dump_names/
    00007bd0-2d1c-4865-af09-80bc02160513        list of minidumps for this crash
  dump/
    00007bd0-2d1c-4865-af09-80bc02160513        minidump file
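
For illustration, these keys can be derived from the crash id alone. A minimal sketch with hypothetical helpers, not Socorro's actual code (the date portion comes from the tail of the crash id, explained next):

def raw_crash_key(crash_id):
    # v2/raw_crash/<first 3 chars of crash id>/<yyyymmdd>/<crash id>
    return "v2/raw_crash/{}/20{}/{}".format(crash_id[:3], crash_id[-6:], crash_id)

def dump_names_key(crash_id):
    # v1/dump_names/<crash id>
    return "v1/dump_names/{}".format(crash_id)

def dump_key(crash_id):
    # v1/dump/<crash id>
    return "v1/dump/{}".format(crash_id)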

A crash id looks like this:

de1bb258-cbbf-4589-a673-34f800160918
                             ^^^^^^^
                             ||____|
                             |  yymmdd
                             |
                             throttle result instruction
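
Pulling those fields out is straightforward. A quick sketch, where parse_crash_id is a hypothetical helper rather than part of Socorro's API:

import datetime

def parse_crash_id(crash_id):
    # 7th-from-last character: throttle result instruction
    # last 6 characters: submission date as yymmdd
    throttle_result = int(crash_id[-7])
    date = datetime.datetime.strptime(crash_id[-6:], "%y%m%d").date()
    return throttle_result, date

parse_crash_id("de1bb258-cbbf-4589-a673-34f800160918")
# -> (0, datetime.date(2016, 9, 18))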

The collector then publishes the crash report id to AWS SQS for processing.

Processed by the Processor

The processor pulls crash report ids from the AWS SQS queues. It fetches the raw crash report data and minidumps from Amazon S3.

It processes the crash report with a pipeline of rules that transform the raw crash into a processed crash.

One of the rules runs minidump-stackwalk on the minidump to extract information about the process and its stack. It symbolicates the stack frames and determines some other things about the crash.

Another rule generates a crash signature from the stack of the crashing thread. We use crash signatures to group crashes that have similar symptoms so that we can more easily see trends and causes.

There are other rules, too.
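
The pipeline follows a predicate/action pattern: each rule decides whether it applies to a given crash and then mutates the processed crash. A simplified sketch of the idea (class and field names are illustrative, not Socorro's actual implementation):

class Rule:
    """One transformation step in the processing pipeline."""

    def predicate(self, raw_crash, processed_crash):
        # Should this rule run for this crash?
        return True

    def action(self, raw_crash, processed_crash):
        # Mutate the processed crash in place.
        pass

class OSNameRule(Rule):
    # Illustrative rule: copy an annotation from the raw crash.
    def predicate(self, raw_crash, processed_crash):
        return "OS" in raw_crash

    def action(self, raw_crash, processed_crash):
        processed_crash["os_name"] = raw_crash["OS"]

def run_pipeline(raw_crash, rules):
    processed_crash = {}
    for rule in rules:
        if rule.predicate(raw_crash, processed_crash):
            rule.action(raw_crash, processed_crash)
    return processed_crash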

After the crash gets through the processing pipeline, the processed crash is saved to several places:

  1. Amazon S3

  2. Elasticsearch

  3. Amazon S3 (different bucket) to be ingested into the Telemetry data set

Investigated with the Webapp (aka Crash Stats)

The webapp is located at https://crash-stats.mozilla.org.

The webapp lets you search through crash reports and facet on aspects of them with Super Search. It shows top crashers and has a set of APIs for accessing data.
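
For example, here's a sketch of querying the Super Search API with requests (the parameters shown are illustrative; see the API documentation on Crash Stats for the full list):

import requests

resp = requests.get(
    "https://crash-stats.mozilla.org/api/SuperSearch/",
    params={
        "product": "Firefox",
        "_results_number": 0,    # skip individual crashes
        "_facets": "signature",  # facet the results on crash signature
    },
)
for bucket in resp.json()["facets"]["signature"][:5]:
    print(bucket["term"], bucket["count"])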

You can create an account in the webapp by logging in.

By default, personally identifiable information in a crash report is hidden. This includes the user's email address and the URL the user was visiting when Firefox crashed.

Housekeeping with cronrun

We have a cronrun Django command: a self-healing command runner that can run any Django command with specified arguments at scheduled times. We use it to run jobs that perform housekeeping functions in the crash ingestion pipeline, like:

  1. updating product/version information

  2. updating information about bugs associated with crash signatures

  3. updating “first time we saw this signature” type information

cronrun jobs that fail are re-run. Some cronrun jobs are set up to backfill: if they fail, they eventually run for every period they missed.
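
A sketch of that self-healing/backfill behavior, simplified and hypothetical rather than the actual cronrun implementation:

import datetime
from django.core import management

def run_job(job, last_success):
    """Run a job for "now", or for every period missed since last success."""
    now = datetime.datetime.now(datetime.timezone.utc)
    if job.get("backfill") and last_success:
        # A backfill job runs once per missed period until it catches up.
        run_times = []
        when = last_success + job["frequency"]
        while when <= now:
            run_times.append(when)
            when += job["frequency"]
    else:
        run_times = [now]
    for run_time in run_times:
        # Each job is just a Django command with arguments.  (A real
        # backfill job would also be told which period to cover.)
        management.call_command(job["command"], *job.get("args", []))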

See also

  • Code (Jobs): https://github.com/mozilla-services/socorro/

  • Documentation (Jobs): https://socorro.readthedocs.io/

  • Socorro scheduled tasks (cronrun) documentation: Crontabber

Telemetry (External system)

Socorro exports a subset of crash data to Telemetry where it can be queried. It’s in the socorro_crash dataset.

See Telemetry (telemetry.socorro_crash) for more details.