Socorro Schemas

This folder (socorro/schemas/ in the repo) contains JSON Schema files describing the documents that Socorro generates.

These files will be used as a contract between Socorro and other systems to which we might send our data.

raw_crash.schema.yaml schema

Adding new annotations

Note

Before adding an annotation to this file, it should be documented in:

https://hg.mozilla.org/mozilla-central/file/tip/toolkit/crashreporter/CrashAnnotations.yaml

Further, it requires a data review.

See Crash Annotations for adding new annotations.

New annotations are added in alphabetical order. The first few annotations in the schema are generated by the collector. New annotations generated by products should be after the line:

# Add new annotations from products below this line.

The template for adding an annotation looks like this:

AnnotationName:
  description: |
    Description goes here.
  type: string
  permissions: ["protected"]
  data_reviews:
    - url
  bugs:
    - url
  examples: ["xxx", "yyy"]

See the file for examples.
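
As a rough illustration of the template, the sketch below checks one annotation entry for the keys that the field details in this document do not mark as optional. The helper name `check_annotation` and the idea of passing the entry as a plain dict (for instance, after loading the YAML file) are assumptions made for this example; this is not code from Socorro.

```python
# Hypothetical helper for illustration only; not part of Socorro.
# Assumes every key not marked "(optional)" in this document is required.
REQUIRED_KEYS = {"description", "type", "permissions", "data_reviews"}


def check_annotation(name, entry):
    """Return a list of human-readable problems found in one annotation entry."""
    problems = []
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        problems.append(f"{name}: missing keys {sorted(missing)}")
    # The only values this document describes are ["public"] and ["protected"].
    if "permissions" in entry and entry["permissions"] not in (["public"], ["protected"]):
        problems.append(f'{name}: permissions must be ["public"] or ["protected"]')
    return problems


entry = {
    "description": "Description goes here.",
    "type": "string",
    "permissions": ["protected"],
    "data_reviews": ["url"],
    "bugs": ["url"],
    "examples": ["xxx", "yyy"],
}
problems = check_annotation("AnnotationName", entry)
```

An empty list means the entry has the expected shape. Collector-generated annotations may follow different rules, so treat this as a check for product annotations only.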

Field details:

description

The description for the annotation. Start with whatever is in CrashAnnotations.yaml and then improve it so that its meaning is clear to someone who’s looking at crash report data.

Add any data gotchas to the description. For example, note differences for this annotation between products or platforms.

The description is formatted with Markdown.

The schema is specified with YAML, so you can use > to denote multiline descriptions which don’t have empty lines and | to denote multiline descriptions that do have empty lines. If in doubt, use |.

Example:

description: |
  Amount of free physical memory in bytes.
  - Under Windows, populated with the contents of the `MEMORYSTATUSEX`
    structure `ullAvailPhys` field.
  - Under macOS, populated with `vm_statistics64_data_t::free_count`.
  - Under Linux, populated with `/proc/meminfo` `MemFree`.
  - Not available on other platforms.
type

This is the type of the field value.

With the exception of some of the collector-generated data, this should always be string because crash annotation values are encoded as strings in the submitted crash report.

Example:

type: string
permissions

This is the list of permissions required to view this field.

The value is EITHER:

  • ["public"]

  • the list of permissions required to view this field; currently the only supported permission is “protected”

If the field can be public, do this:

permissions: ["public"]

Otherwise, the field is protected, so do this:

permissions: ["protected"]
data_reviews

This is a list of one or more URLs to data reviews for this annotation.

Example:

data_reviews:
  - https://bugzilla.mozilla.org/show_bug.cgi?id=1697875#c8
bugs (optional)

List of bugs related to adding this field to Socorro.

Examples:

bugs:
  - https://bugzilla.mozilla.org/show_bug.cgi?id=1741131
examples (optional)

Array of example values for this field.

Example:

examples:
  - "Auto"
  - "Infobar"
  - "AboutCrashes"
  - "CrashedTab"
  - "Client"

Removing annotations

Annotations can be removed when they’re no longer supported.

Make sure the annotation isn’t used in processing and isn’t the source_annotation of a processed crash field.
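
One way to do the second half of that check is sketched below. It assumes the processed crash schema has already been loaded into a dict and that fields live in its top-level properties section; the function name `fields_using_annotation` is made up for this example. It does not tell you whether the annotation is used elsewhere in processing, so that still needs a search of the code.

```python
# Hypothetical helper for illustration only; not part of Socorro.
def fields_using_annotation(processed_schema, annotation):
    """Return the top-level processed crash fields copied from the annotation."""
    properties = processed_schema.get("properties", {})
    return sorted(
        name
        for name, field in properties.items()
        if field.get("source_annotation") == annotation
    )


# Tiny stand-in for the loaded processed_crash.schema.yaml contents.
schema = {
    "properties": {
        "process_type": {"type": "string", "source_annotation": "ProcessType"},
        "crashing_thread": {"type": ["integer", "null"]},
    }
}
users = fields_using_annotation(schema, "ProcessType")
```

If the returned list is not empty, the annotation is still the source of a processed crash field and should not be removed yet.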

Testing schema changes

After editing the raw_crash.schema.yaml file, verify it still validates.

First, download raw crash files into a crash directory that you want to validate your changes against. The crash directory can contain raw crash files or it could contain raw crash files in a directory structure generated by fetch_crash_data.

Second, run validate_raw_crash.py against that crash directory.

$ python socorro/schemas/validate_raw_crash.py [CRASHDIR]

Then run the tests, which check several invariants for the schemas.

Then look at the field in the data dictionary in the webapp to make sure the description and other parts are formatted well.

processed_crash.schema.yaml schema

Supporting new fields

When adding new fields, add them to the top-level properties section of the schema.

The template for adding a field looks like this:

field_name:
  description: |
    Description goes here.
  bugs:
    - related bug
  deprecated: true                                # optional
  examples: ["some", "examples"]                  # optional
  type: string
  permissions: ["protected"]
  source_annotation: AnnotationName               # optional
field_name

Field names are snake case.

If the field is derived from a crash annotation, the field name should be a snake case version of the crash annotation name. For example, WindowsErrorReporting would become windows_error_reporting.
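
A minimal sketch of that conversion follows. The function name `annotation_to_field_name` is an assumption for this example, and the regular expressions are a common CamelCase-to-snake_case idiom rather than anything Socorro prescribes. Names containing runs of capitals (acronyms) may not split the way you want, so check those by hand.

```python
import re


# Hypothetical helper for illustration only; not part of Socorro.
def annotation_to_field_name(annotation):
    """Convert a CamelCase annotation name to a snake_case field name."""
    # Put an underscore before each capitalized word that follows another character.
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", annotation)
    # Put an underscore between a lowercase letter or digit and a capital letter.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


field_name = annotation_to_field_name("WindowsErrorReporting")
```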

description

The description for the field. If this field is based on an annotation, start with the annotation’s description and then improve it so that its meaning is clear to someone who’s looking at crash report data.

Add any data gotchas to the description. For example, note differences for this annotation between products or platforms.

The description is formatted with Markdown.

The schema is specified with YAML, so you can use > to denote multiline descriptions which don’t have empty lines and | to denote multiline descriptions that do have empty lines. If in doubt, use |.

Example:

description: |
  If Firefox crashes while some code is spinning manually the event loop on
  the main thread, this will be the stack of nested annotations.

  If the crashing process was killed (e.g. due to an IPC error), this
  annotation may refer to the parent process that killed it, look out for
  the prefix (`default` means parent) and see bug 1741131 for details.
type

This is the type or types of the processed crash field value.

Valid types:

  • string

  • boolean

  • integer (integer)

  • number (float)

  • null

  • array

  • object

Example of a field that can only be a string:

type: string

Example of a field that can be a string or null:

type: ["string", "null"]

Types are covered in more detail in the Types section later in this document.

permissions

This is the list of permissions required to view this field.

The value is EITHER:

  • ["public"]

  • the list of permissions required to view this field; currently the only supported permission is “protected”

If the field can be public, do this:

permissions: ["public"]

Otherwise, the field is protected, so do this:

permissions: ["protected"]
bugs (optional)

List of bugs related to adding this field to Socorro.

Examples:

bugs:
  - https://bugzilla.mozilla.org/show_bug.cgi?id=1741131
deprecated (optional)

If this field is deprecated and slated to be removed, mark it as such in the schema.

Example:

deprecated: true
examples (optional)

Array of example values for this field.

Example:

examples:
  - "`Auto`"
  - "`Infobar`"
  - "`AboutCrashes`"
  - "`CrashedTab`"
  - "`Client`"
source_annotation (optional)

The processor has a CopyFromRawCrashRule which will use the source_annotation value to copy the crash annotation value from the raw crash to the processed crash. It will use the type value to determine how to convert and validate the value.

This works as expected for boolean, integer, number, and string types.

For objects, the CopyFromRawCrashRule will JSON-decode the crash annotation value and then validate the resulting structure against the subschema from the raw crash schema.

If a crash annotation value doesn’t validate or an error occurs during normalization, then a note is added to the processor_notes and the field is skipped.
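
The behavior described above can be sketched roughly as follows. This is not the CopyFromRawCrashRule implementation: the function names, the accepted boolean spellings, and the note text are all assumptions, the type is assumed to be a single string rather than a list, and validation of decoded objects against a subschema is left out.

```python
import json


# Hypothetical sketch for illustration only; not Socorro's CopyFromRawCrashRule.
def convert_annotation(value, schema_type):
    """Convert a string annotation value to schema_type; raise ValueError if it can't be."""
    if schema_type == "string":
        return value
    if schema_type == "integer":
        return int(value)
    if schema_type == "number":
        return float(value)
    if schema_type == "boolean":
        # Assumed spellings; the real rule may accept a different set.
        if value in ("1", "true"):
            return True
        if value in ("0", "false"):
            return False
        raise ValueError(f"not a boolean: {value!r}")
    if schema_type == "object":
        # json.JSONDecodeError is a subclass of ValueError.
        return json.loads(value)
    raise ValueError(f"unsupported type: {schema_type!r}")


def copy_field(raw_crash, processed_crash, field_name, field_schema, notes):
    """Copy one annotation into the processed crash, or add a note and skip it."""
    annotation = field_schema["source_annotation"]
    if annotation not in raw_crash:
        return
    try:
        value = convert_annotation(raw_crash[annotation], field_schema["type"])
    except ValueError as exc:
        notes.append(f"{annotation} skipped: {exc}")
        return
    processed_crash[field_name] = value


notes = []
processed = {}
schema = {"source_annotation": "AvailablePhysicalMemory", "type": "integer"}
copy_field({"AvailablePhysicalMemory": "1024"}, processed, "available_physical_memory", schema, notes)
```

Here the string "1024" from the raw crash ends up as the integer 1024 in the processed crash, and a value such as "lots" would leave the field unset and add one note.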

default (optional)

If source_annotation is set, then default lets you have a default value for the processed crash field if the crash report doesn’t contain the annotation at all.

For example, if the crash report doesn’t contain a ProcessType annotation, then the CopyFromRawCrashRule will set process_type to parent because that’s the default:

process_type:
  description: >
    Type of the process that crashed. This will be `parent` if the crash
    report has no ProcessType annotation.
  default: "parent"
  examples:
    - "any"
    - "content"
    - "gpu"
    - "parent"
    - "plugin"
  type: string
  permissions: ["public"]
  source_annotation: ProcessType
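
The lookup order can be sketched as below, under the assumption that the default applies only when the annotation key is entirely absent from the raw crash. The function name `annotation_or_default` and the `MISSING` sentinel are made up for this example.

```python
# Hypothetical sketch for illustration only; not Socorro's implementation.
MISSING = object()  # sentinel meaning "no annotation and no default"


def annotation_or_default(raw_crash, field_schema):
    """Return the annotation value, else the schema default, else MISSING."""
    annotation = field_schema["source_annotation"]
    if annotation in raw_crash:
        return raw_crash[annotation]
    return field_schema.get("default", MISSING)


process_type_schema = {
    "type": "string",
    "default": "parent",
    "source_annotation": "ProcessType",
}
from_default = annotation_or_default({}, process_type_schema)
from_annotation = annotation_or_default({"ProcessType": "content"}, process_type_schema)
```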

Types

There are primitive types:

  • string: a string

  • boolean: a boolean

  • integer: integer

  • number: float

  • null: the value can be null

Then there are complex types:

  • array: an array of things

  • object: an object of things

null

If a field can be something or null, then that needs to be noted in the type.

For example, this field can be an integer or a null:

crashing_thread:
  description: >
    Index of the crashing thread.
  type: ["integer", "null"]
  permissions: ["public"]

This is valid:

{
  "crashing_thread": 100
}

As is this:

{
   "crashing_thread": null
}

If the type doesn’t specify null, then the value cannot be null.
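
A small sketch of that rule, using plain Python values as stand-ins for decoded JSON. The function name `matches_type` is an assumption, and the check follows JSON Schema in letting an integer satisfy number; it covers only the top-level type, not items or properties.

```python
# Hypothetical sketch for illustration only; not Socorro's validator.
def matches_type(value, schema_type):
    """Return True if value satisfies a type given as a string or a list of strings."""
    allowed = [schema_type] if isinstance(schema_type, str) else schema_type
    checks = {
        "string": lambda v: isinstance(v, str),
        "boolean": lambda v: isinstance(v, bool),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "null": lambda v: v is None,
        "array": lambda v: isinstance(v, list),
        "object": lambda v: isinstance(v, dict),
    }
    return any(checks[name](value) for name in allowed)


crashing_thread_type = ["integer", "null"]
ok_number = matches_type(100, crashing_thread_type)
ok_null = matches_type(None, crashing_thread_type)
null_rejected = matches_type(None, "integer")
```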

Arrays

For arrays, you need to specify what it is an array of using items.

For example, this defines a field crash_report_keys which has a value that is an array of strings:

crash_report_keys:
  description: >
    The keys in the crash report
  type: array
  items:
    type: string
    permissions: ["public"]
  permissions: ["public"]

Example JSON for that field:

{
  "crash_report_keys": ["Product", "ReleaseChannel", "Version"]
}

Objects

For objects, you need to specify the structure of the object by defining properties and pattern_properties.

You use properties when you know the keys. For example, threads is an array of objects with keys frame_count, frames, last_error_value, and thread_name.

Example:

threads:
  items:
    description: Information on a thread.

    properties:
      frame_count:
        description: How many stack frames there are.
        type: ["integer", "null"]
        permissions: ["public"]
      frames:
        description: >
          Stack frames of the thread from top (the code that
          was currently executing) to bottom (start of the thread's
          execution).
        items:
          $ref: "#/definitions/json_dump_frame"
          permissions: ["public"]
        type: ["array", "null"]
        permissions: ["public"]
      last_error_value:
        description: >
          The Windows `GetLastError()` value for this thread.
        type: ["string", "null"]
        permissions: ["public"]
      thread_name:
        description: The name of the thread.
        type: ["string", "null"]
        permissions: ["public"]
    type: object
    permissions: ["public"]

  type: ["array", "null"]
  permissions: ["public"]

If a processed crash has other keys in this object, the schema knows nothing about them: they can’t be validated, they’re ignored, and they’re treated as if they were protected data.
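
One reading of "treated as if they were protected data" is sketched below: a key with no schema entry is handled like a field whose permissions are ["protected"]. This is an interpretation for illustration, not Socorro's implementation, and the function names are made up.

```python
# Hypothetical sketch for illustration only; not Socorro's implementation.
def can_view(required_permissions, user_permissions):
    """Public fields are visible to everyone; others need every listed permission."""
    if required_permissions == ["public"]:
        return True
    return all(perm in user_permissions for perm in required_permissions)


def visible_keys(obj, object_schema, user_permissions):
    """Return the parts of obj that a user with user_permissions may see."""
    properties = object_schema.get("properties", {})
    visible = {}
    for key, value in obj.items():
        # Keys the schema does not know about fall back to protected.
        required = properties.get(key, {}).get("permissions", ["protected"])
        if can_view(required, user_permissions):
            visible[key] = value
    return visible


thread_schema = {
    "properties": {
        "thread_name": {"type": ["string", "null"], "permissions": ["public"]},
    }
}
thread = {"thread_name": "Main", "unknown_key": 1}
public_view = visible_keys(thread, thread_schema, set())
protected_view = visible_keys(thread, thread_schema, {"protected"})
```

With no permissions, only thread_name is returned; with the protected permission, the unknown key is returned as well.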

You use pattern_properties when you don’t know the keys. For example, CPU registers have a variety of names that differ from CPU to CPU, so we can’t know the key names in advance. pattern_properties maps regular expressions that match keys to the field definition for the matching values.

For example, registers is an object which has keys that are at least one character long and each value is a field specifying a single register:

registers:
  description: |
    The values the general purpose registers contained.

    This can contain sensitive data.

  pattern_properties:
    ^.+$:
      nickname: REGISTER
      description: Register contents as a hexstring.
      type: ["string", "null"]
      permissions: ["protected"]

  type: ["object", "null"]
  permissions: ["protected"]

When you use pattern_properties, the field must define a nickname. In this case, the nickname is REGISTER. The nickname shows up in the data dictionary and has an uppercase value.
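
To make the key matching concrete, the sketch below looks up the field definition for a key, trying properties first and then each pattern_properties regular expression. The function name `schema_for_key` and the lookup order are assumptions for this example, not Socorro code.

```python
import re


# Hypothetical sketch for illustration only; not Socorro's implementation.
def schema_for_key(object_schema, key):
    """Return the field definition that applies to key, or None if nothing matches."""
    properties = object_schema.get("properties", {})
    if key in properties:
        return properties[key]
    for pattern, field in object_schema.get("pattern_properties", {}).items():
        if re.search(pattern, key):
            return field
    return None


registers_schema = {
    "pattern_properties": {
        "^.+$": {
            "nickname": "REGISTER",
            "type": ["string", "null"],
            "permissions": ["protected"],
        }
    },
    "type": ["object", "null"],
    "permissions": ["protected"],
}
rax_field = schema_for_key(registers_schema, "rax")
empty_field = schema_for_key(registers_schema, "")
```

Any key with at least one character, such as rax, resolves to the REGISTER field; an empty key matches nothing.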

Supporting changes in stackwalker output

Stackwalker output is in the following places in the processed crash schema:

  • json_dump

  • upload_file_minidump_browser

To reduce redundancy, these two sections use references to subschemas in the top-level definitions section of the schema.

The stackwalker we use is the rust-minidump stackwalker and the JSON output is documented at:

https://github.com/rust-minidump/rust-minidump/blob/main/minidump-processor/json-schema.md

When we update the stackwalker to a new version, we may have new fields show up in the stackwalker output. Use the documentation to add the appropriate bits to the processed crash schema.

For permissions, everything should be protected unless we’re sure it’s category 1 or 2. See Data Collection Categories.

Otherwise, everything is the same as for supporting new fields.

Testing schema changes

After editing the processed_crash.schema.yaml file, verify it still validates.

First, download processed crash files into a crash directory that you want to validate your changes against. The crash directory can contain processed crash files or it could contain processed crash files in a directory structure generated by fetch_crash_data.

Second, run validate_processed_crash.py against that crash directory.

$ python socorro/schemas/validate_processed_crash.py [CRASHDIR]

Then run the tests, which check several invariants for the schemas.

Then look at the field in the data dictionary in the webapp to make sure the description and other parts are formatted well.

socorro-data-1-0-0.schema.yaml

This is the JSON Schema that defines the schema that we use for crash ingestion data. It’s heavily inspired by jsonschema itself and the metrics schema and I took from that as much as I could such that there was some consistency for engineers defining metrics and annotations.

When in doubt, the bits in the schema structure work like they do in metrics schema and jsonschema.

For any changes we make to socorro-data-1-0-0 schema, we need to make sure they work with both the raw and processed crash schemas. We should also make sure any changes don’t conflict with the metrics schema.

telemetry_socorro_crash.json schema

This schema covers documents being sent to Telemetry ingestion.

Modifying

The JSON Schema should contain a key called $target_version.

  • This is a monotonically increasing integer

  • Don’t increment the version if you’re…

    • Adding more keys at the root level.

    • Editing comments (content of description values).

  • Do increment the version if you’re…

    • Adding more keys inside a nested object.

    • Changing the type definition of an existing key in any way.

    • Adding or removing keys from a required sub-key. For example, if a key was required but you’ve now removed it. This is applicable at any nested level.

For example, if you want to add a new field to the root like this:

+ "addons_checksum": {
+     "type": ["string", "null"],
+     "description": "Sample specimen"
+ }

then don’t change the version.

However, if you add a key inside a nested structure, you have to bump the $target_version number by 1. For example:

@@ -286,8 +286,12 @@
     "json_dump": {
         "type": "object",
         "description": "The dump as a JSON object.",
         "properties": {
+            "for_example": {
+                "type": ["string", "null"],
+                "description": "Brand spanking new field inside json_dump"
+            },
             "crash_info": {
                 "type": "object",
                 "properties": {
                     "address": {

Don’t change the type definition. That breaks existing data. You must create a new field and deprecate the old one.
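
The versioning rules above can be approximated with a comparison of the old and new schema, sketched below on the assumption that both are loaded as dicts. The function names are made up. Removing a root key is not covered by the rules in this document, so the sketch treats it conservatively as needing a bump.

```python
# Hypothetical sketch for illustration only; not part of Socorro.
def strip_descriptions(node, in_properties=False):
    """Return a copy of a schema node with description values removed."""
    if isinstance(node, dict):
        return {
            key: strip_descriptions(
                value, in_properties=(key == "properties" and not in_properties)
            )
            for key, value in node.items()
            # Inside a properties map the keys are field names, so keep them all.
            if in_properties or key != "description"
        }
    if isinstance(node, list):
        return [strip_descriptions(item) for item in node]
    return node


def needs_version_bump(old_schema, new_schema):
    """Return True if the change from old_schema to new_schema calls for a new version."""
    if old_schema.get("required", []) != new_schema.get("required", []):
        return True
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    for key, old_field in old_props.items():
        if key not in new_props:
            return True  # assumption: removals are treated conservatively
        if strip_descriptions(old_field) != strip_descriptions(new_props[key]):
            return True  # nested key, type, or required change
    # Keys that exist only in new_props are root-level additions: no bump.
    return False
```

Comparing the schema with descriptions stripped means that edits to description text alone never trigger a bump, while any nested addition or type change does.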

Testing schema changes

After editing the telemetry_socorro_crash.json file, verify it still validates.

After any change, you should test it against at least 100 randomly picked crashes from prod. To do that, from a checkout of socorro, run:

$ python socorro/schemas/validate_telemetry_socorro_crash.py

Running validate_telemetry_socorro_crash.py will download 100 crashes and run the JSON Schema validator against them using your local telemetry_socorro_crash.json file.

Note

By default, validate_telemetry_socorro_crash.py runs a Super Search query for essentially product=Firefox and takes the 100 most recent crash IDs. This can miss rarer crashes whose additional values might better test your JSON Schema changes. To remedy that, go to Super Search in your browser, run a search that you know includes good crash IDs to test, and pass that URL like this:

$ python socorro/schemas/validate_telemetry_socorro_crash.py \
      'https://crash-stats.mozilla.org/search/?dom_ipc_enabled=%21__null__&memory_images=%3E10&version=54.0a1' \
      'https://crash-stats.mozilla.org/api/SuperSearch/?memory_private=%3E100&product=Firefox&date=%3E%3D2017-02-24T16%3A14%3A00.000Z&date=%3C2017-03-03T16%3A14%3A00.000Z'
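
The percent-encoded filters in such URLs are hard to read at a glance. As a convenience, and only as an illustration using the standard library, the sketch below decodes the query string of a search URL so you can confirm which filters you are about to pass. The function name `search_filters` is made up for this example.

```python
from urllib.parse import parse_qs, urlsplit


# Hypothetical helper for illustration only; not part of Socorro.
def search_filters(url):
    """Return the decoded query parameters of a Super Search URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


filters = search_filters(
    "https://crash-stats.mozilla.org/search/"
    "?dom_ipc_enabled=%21__null__&memory_images=%3E10&version=54.0a1"
)
# filters == {"dom_ipc_enabled": ["!__null__"], "memory_images": [">10"], "version": ["54.0a1"]}
```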