1. Overview

1.1. What is Socorro?

Socorro is software that implements a crash ingestion pipeline.

The Socorro code is hosted in a GitHub repository at https://github.com/mozilla-services/socorro and released and distributed under the Mozilla Public License v2.

The crash ingestion pipeline that we have at Mozilla looks like this:

digraph G {
   rankdir=LR;
   splines=lines;

   subgraph coll {
      rank=same;

      client [shape=box3d, label="firefox"];
      collector [shape=rect, label="collector"];
   }

   subgraph stores {
      rank=same;

      s3raw [shape=tab, label="S3 (Raw)", style=filled, fillcolor=gray];
      pigeon [shape=cds, label="Pigeon"];
      rabbitmq [shape=tab, label="RabbitMQ", style=filled, fillcolor=gray];
   }

   processor [shape=rect, label="processor"];

   subgraph stores2 {
      rank=same;

      postgres [shape=tab, label="Postgres", style=filled, fillcolor=gray];
      elasticsearch [shape=tab, label="Elasticsearch", style=filled, fillcolor=gray];
      s3telemetry [shape=tab, label="S3 (Telemetry)", style=filled, fillcolor=gray];
      s3processed [shape=tab, label="S3 (Processed)", style=filled, fillcolor=gray];
   }

   subgraph processing {
      rank=same;

      crontabber [shape=rect, label="crontabber"];
      webapp [shape=rect, label="webapp"];
      telemetry [shape=rect, label="telemetry"];
   }


   pigeon -> rabbitmq [label="crash id"];
   s3raw -> pigeon [label="S3 ObjectCreated:Put"];

   client -> collector [label="HTTP"];
   collector -> s3raw [label="save raw"];

   rabbitmq -> processor [label="crash id"];
   s3raw -> processor [label="load raw"];
   processor -> { s3processed, elasticsearch, s3telemetry } [label="save processed"];

   postgres -> webapp;
   webapp -> postgres;
   s3raw -> webapp [label="load raw"];
   s3processed -> webapp [label="load processed"];
   elasticsearch -> webapp;

   postgres -> crontabber;
   crontabber -> postgres;
   elasticsearch -> crontabber;

   s3telemetry -> telemetry [label="telemetry ingestion"];

   { rank=min; client; }
}

Arrow direction represents the flow of interesting information, such as saving crash data, loading crash data, pulling crash ids from queues, and so on.

Important services in the diagram:

  • Collector: Collects incoming crash reports via HTTP POST. The collector we use is called Antenna. It saves crash data to AWS S3.
  • Processor: Processes crashes and extracts data from minidumps, generates crash signatures, performs other analysis, and saves everything as a processed crash.
  • Webapp (aka Crash Stats): Web user interface for analyzing crash data.
  • Crontabber: Runs periodic housekeeping tasks.

Let’s take a tour through the crash ingestion pipeline!

1.2. A tour through the crash ingestion pipeline

1.2.1. Crash report generated by Breakpad Client (breakpad)

When Firefox crashes, the Breakpad client assembles information about the crash in minidump format. The crash reporter dialog prompts the user for additional information and asks whether to send the crash report to Mozilla.

If the user presses “Send crash report”, then the breakpad client sends the crash report as a multipart/form-data payload via an HTTP POST to the collector.
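To make the submission format concrete, here is a minimal sketch of posting a crash report as multipart/form-data with Python's requests library. The endpoint URL and the field names (ProductName, Version, upload_file_minidump) are illustrative assumptions, not the documented collector contract.

# Sketch: how a Breakpad-style client might submit a crash report.
# The endpoint path and field names are illustrative assumptions.
import requests

CRASH_REPORTS_URL = "https://crash-reports.example.com/submit"  # hypothetical collector URL

annotations = {
    "ProductName": "Firefox",
    "Version": "99.0",
}

with open("crash.dmp", "rb") as f:
    resp = requests.post(
        CRASH_REPORTS_URL,
        data=annotations,  # crash annotations as regular form fields
        files={"upload_file_minidump": ("crash.dmp", f)},  # minidump as a file part
        timeout=30,
    )

# The collector replies with the crash id it assigned (see the next section).
print(resp.status_code, resp.text)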

1.2.2. Collected by the Collector (Python, Falcon)

The collector is the beginning of the crash ingestion pipeline. It accepts the incoming crash and does several things to it:

  1. assigns it a unique crash id
  2. tags it with a time stamp
  3. figures out whether the pipeline should process this crash or not

The collector returns the crash id to the crash reporter which records it on the user’s machine.

The collector saves the crash data to Amazon S3 as a raw crash in a directory structure like this:

v2/
  raw_crash/
    000/
      20160513/
        00007bd0-2d1c-4865-af09-80bc02160513    raw crash metadata
v1/
  dump_names/
    00007bd0-2d1c-4865-af09-80bc02160513        list of minidumps for this crash
  dump/
    00007bd0-2d1c-4865-af09-80bc02160513        minidump file
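As a rough sketch, the key names above can be derived from the crash id alone. Assuming the "000" component is the first three characters of the crash id and the date directory comes from the yymmdd suffix of the id (both assumptions for illustration), building the keys might look like this:

# Sketch of building the S3 key names shown above from a crash id.
# Assumptions: the "000" component is the first three characters of the
# crash id and the date directory is the yymmdd suffix of the id.
def raw_crash_key(crash_id):
    entropy = crash_id[:3]
    date = "20" + crash_id[-6:]          # e.g. "160513" -> "20160513"
    return f"v2/raw_crash/{entropy}/{date}/{crash_id}"

def dump_names_key(crash_id):
    return f"v1/dump_names/{crash_id}"

def dump_key(crash_id):
    return f"v1/dump/{crash_id}"

print(raw_crash_key("00007bd0-2d1c-4865-af09-80bc02160513"))
# v2/raw_crash/000/20160513/00007bd0-2d1c-4865-af09-80bc02160513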

A crash id looks like this:

de1bb258-cbbf-4589-a673-34f800160918
                             ^^^^^^^
                             ||____|
                             |  yymmdd
                             |
                             throttle result instruction
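Here is a small sketch of pulling those two pieces out of a crash id. The positions follow the diagram above; what the throttle result value means (for example, whether "0" means "accepted for processing") is an assumption for illustration.

from datetime import datetime

def parse_crash_id(crash_id):
    """Extract the submission date and throttle result from a crash id.

    The last six characters are yymmdd; the character just before them
    is the throttle result instruction.  (Interpreting its value, e.g.
    "0" == accepted for processing, is an assumption.)
    """
    date = datetime.strptime(crash_id[-6:], "%y%m%d").date()
    throttle_result = crash_id[-7]
    return date, throttle_result

print(parse_crash_id("de1bb258-cbbf-4589-a673-34f800160918"))
# (datetime.date(2016, 9, 18), '0')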

The collector we currently use is called Antenna.

1.2.3. Queued for processing by Pigeon (Python, AWS Lambda)

When the raw crash is saved to Amazon S3, Pigeon is invoked with an S3 ObjectCreated:Put event with the filename for the raw crash. The filename contains the crash id. Pigeon looks at the throttle result instruction character in the crash id to determine if the crash was deferred or accepted for processing.

If the crash is not accepted for processing, then its story ends here. [EXEUNT STAGE LEFT.]

If the crash is accepted for processing, Pigeon adds the crash id to the socorro.normal processing queue in RabbitMQ.
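A hedged sketch of what such an S3-triggered handler could look like, using pika to publish to RabbitMQ. The event parsing, the ACCEPTED value, and the connection wiring are illustrative assumptions, not Pigeon's actual implementation.

# Sketch of an S3-triggered handler that queues accepted crash ids,
# in the spirit of Pigeon.  Event parsing, the ACCEPTED value, and the
# queue wiring are illustrative assumptions.
import os
import pika

ACCEPTED = "0"  # assumed throttle-result value meaning "accept for processing"

def handler(event, context):
    channel = pika.BlockingConnection(
        pika.ConnectionParameters(host=os.environ["RABBITMQ_HOST"])
    ).channel()

    for record in event["Records"]:
        key = record["s3"]["object"]["key"]        # e.g. v2/raw_crash/.../<crash id>
        crash_id = key.rsplit("/", 1)[-1]

        if crash_id[-7] != ACCEPTED:
            continue  # deferred crash: not queued for processing

        channel.basic_publish(
            exchange="",
            routing_key="socorro.normal",
            body=crash_id,
        )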

1.2.4. Processed by Processor (Python, Configman)

The processor gets a crash id from the socorro.normal queue in RabbitMQ. It fetches the raw crash data and related minidumps from Amazon S3.
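A rough sketch of that fetch step, assuming pika for the queue and boto3 for S3. The queue and bucket names are hypothetical, and the key layout reuses the assumptions from the collector section.

# Sketch: pull a crash id off the RabbitMQ queue and load the raw crash
# from S3.  Queue/bucket names and the key layout are illustrative
# assumptions.
import json

import boto3
import pika

RAW_BUCKET = "example-raw-crash-bucket"  # hypothetical bucket name
s3 = boto3.client("s3")

def process(crash_id, raw_crash):
    """Hypothetical hand-off to the rule pipeline (sketched below)."""
    print("processing", crash_id)

def on_message(channel, method, properties, body):
    crash_id = body.decode("utf-8")
    key = f"v2/raw_crash/{crash_id[:3]}/20{crash_id[-6:]}/{crash_id}"
    raw_crash = json.loads(
        s3.get_object(Bucket=RAW_BUCKET, Key=key)["Body"].read()
    )
    process(crash_id, raw_crash)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.basic_consume(queue="socorro.normal", on_message_callback=on_message)
channel.start_consuming()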

It passes all that information through the processing pipeline which consists of a series of rules that transform the crash into a processed crash.

One of the rules runs the minidump-stackwalker over the minidump to extract information about the crashed process and symbolicate the frames on the stack. It also determines other details about the state of the process when Firefox crashed.

Another rule generates a crash signature from the stack of the crashing thread. We use crash signatures to group crashes that have similar symptoms so that we can more easily see trends and causes.

There are other rules, too.
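The rule-based transformation can be pictured as a loop over rule objects, where each rule decides whether it applies and then mutates the processed crash. This is a conceptual sketch with made-up class and method names, not Socorro's actual rule API.

# Conceptual sketch of the rule-based transformation: each rule decides
# whether it applies and then mutates the processed crash.  Class and
# method names here are illustrative, not Socorro's actual rule API.
class StackwalkerRule:
    def predicate(self, raw_crash, processed_crash):
        return True

    def action(self, raw_crash, processed_crash):
        # Placeholder for running the minidump-stackwalker and
        # symbolicating the frames of the crashing thread.
        processed_crash["crashing_thread_stack"] = ["mozilla::Foo()", "mozilla::Bar()"]

class SignatureRule:
    def predicate(self, raw_crash, processed_crash):
        return "crashing_thread_stack" in processed_crash

    def action(self, raw_crash, processed_crash):
        # Derive a signature from the top frames of the crashing thread.
        frames = processed_crash["crashing_thread_stack"][:5]
        processed_crash["signature"] = " | ".join(frames)

def run_pipeline(rules, raw_crash):
    processed_crash = {}
    for rule in rules:
        if rule.predicate(raw_crash, processed_crash):
            rule.action(raw_crash, processed_crash)
    return processed_crash

print(run_pipeline([StackwalkerRule(), SignatureRule()], raw_crash={}))
# {'crashing_thread_stack': [...], 'signature': 'mozilla::Foo() | mozilla::Bar()'}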

After the crash gets through the processing pipeline, it’s saved to several destinations in various forms:

  1. Amazon S3
  2. Elasticsearch
  3. Amazon S3 (different bucket) to be ingested into the Telemetry data set

1.2.5. Investigated with Webapp aka Crash Stats (Python, Django)

The webapp is located at https://crash-stats.mozilla.com.

The webapp lets you search through crash reports and facet on aspects of them with Super Search.

The webapp shows top crashers.

The webapp has a set of APIs for accessing data.
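For example, Super Search is exposed through the public API. Here is a hedged sketch of querying it with requests; the parameter names and the response shape shown here are assumptions to be checked against the API documentation.

# Sketch: querying crash data through the webapp's Super Search API.
# Treat the exact parameters and response shape as assumptions and
# check the API documentation.
import requests

resp = requests.get(
    "https://crash-stats.mozilla.com/api/SuperSearch/",
    params={
        "product": "Firefox",
        "_facets": "signature",       # facet results by crash signature
        "_results_number": 0,         # we only want the facets, not hits
    },
    timeout=30,
)
resp.raise_for_status()

for bucket in resp.json()["facets"]["signature"]:
    print(bucket["count"], bucket["term"])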

You can create an account in the webapp by logging in.

By default, personally identifiable information in a crash report is hidden. This includes the user's email address and the URL the user was visiting when Firefox crashed.

1.2.6. Housekeeping with Crontabber (Python, Crontabber, Configman)

Crontabber is a self-healing periodic task manager. We use it to run jobs that perform housekeeping functions in the crash ingestion pipeline like:

  1. updating product/version information
  2. updating information about bugs associated with crash signatures
  3. updating “first time we saw this signature” type information

Crontabber jobs that fail are re-run. You can see the state of Crontabber jobs on the Crontabber State page.
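To illustrate the "failed jobs are re-run" behavior, here is a tiny conceptual sketch of a self-healing job runner. It is not crontabber's actual API, only the idea behind it; the job name and interval are made up.

# Conceptual sketch of "self-healing" periodic jobs: a failed run is
# retried on the next pass instead of being skipped.  This is not
# crontabber's actual API, only the idea behind it.
import time

def run_forever(jobs, interval_seconds=300):
    last_success = {job.__name__: None for job in jobs}
    while True:
        for job in jobs:
            try:
                job()
            except Exception as exc:
                # Leave last_success untouched; the job runs again next pass.
                print(f"{job.__name__} failed: {exc!r}; will retry")
            else:
                last_success[job.__name__] = time.time()
        time.sleep(interval_seconds)

def update_product_versions():
    """Placeholder for a housekeeping job, e.g. updating product/version info."""

# run_forever([update_product_versions])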

See also

Code (Jobs)
    https://github.com/mozilla-services/socorro
Documentation (Jobs)
    https://socorro.readthedocs.io/
Code (Crontabber)
    https://github.com/mozilla/crontabber
Documentation (Crontabber)
    https://crontabber.readthedocs.io/
Crontabber state
    https://crash-stats.mozilla.com/crontabber-state/
Socorro crontabber documentation
    Service: Crontabber

1.2.7. Telemetry (External system)

Socorro exports a subset of crash data to Telemetry where it can be queried.