Open Data Contract Standard
Executive summary
This document describes the keys and values expected in a YAML data contract, per the Open Data Contract Standard. It is divided in multiple sections: demographics, dataset & schema, data quality, pricing, stakeholders, roles, service-level agreement, and other properties. Each section starts with at least an example followed by definition of each field/key.
Table of content
- Fundamentals & demographics
- Datasets & schema
- Data quality
- Pricing
- Stakeholders
- Roles
- Service-level agreement
- Other properties
- Full example
Notes
- This contract is containing example values, we reviewed very carefully the consistency of those values, but we cannot guarantee that there are no errors. If you spot one, please raise an issue.
- Some fields have
null
value: even if it is equivalent to not having the field in the contract, we wanted to have the field for illustration purpose. - This contract leverages BigQuery but should be platform agnostic. If you think it is not the case, please raise an issue.
Demographics
This section contains general information about the contract.
Example
# What's this data contract about?
datasetDomain: seller # Domain
quantumName: my quantum # Data product name
userConsumptionMode: Analytical
version: 1.1.0 # Version (follows semantic versioning)
status: current
uuid: 53581432-6c55-4ba2-a65f-72344a91553a
# Lots of information
description:
purpose: Views built on top of the seller tables.
limitations: null
usage: null
tenant: ClimateQuantumInc
# Getting support
productDl: product-dl@ClimateQuantum.org
productSlackChannel: '#product-help'
productFeedbackUrl: null
# Physical parts / GCP / BigQuery specific
sourcePlatform: googleCloudPlatform
sourceSystem: bigQuery
project: edw # BQ dataset
datasetName: access_views # BQ dataset
kind: DataContract
apiVersion: 2.3.0 # Standard version (follows semantic versioning, previously known as templateVersion)
type: tables
# Physical access
driver: null
driverVersion: null
server: null
database: pypl-edw.pp_access_views
username: '${env.username}'
password: '${env.password}'
schedulerAppName: name_coming_from_scheduler # NEW 2.1.0 Required if you want to schedule stuff, comes from DataALM.
# Data Quality
quality: null # See more information below
# Tags
tags: null
Definitions
Key | UX label | Required | Description |
---|---|---|---|
datasetDomain | Dataset domain | No | Name of the logical domain dataset the contract describes. This field is only required for output data contracts. Examples: imdb_ds_aggregate , receiver_profile_out , transaction_profile_out . |
quantumName | Quantum name | Yes | The name of the data quantum or data product. |
userConsumptionMode | Consumption mode | No | List of data modes for which the dataset may be used. Expected sample values might be analytical or operational . Note: in the future, this will probably be replaced by ports. |
version | Version | Yes | Current version of the data contract. |
status | Status | Yes | Current status of the dataset. Valid values are production , test , or development . |
uuid | Identifier | Yes | A unique identifier used to reduce the risk of dataset name collisions; initially the UUID will be created using a UUID generator tool (example). However, we may want to develop a method that accepts a seed value using a combination of fields–such as name, kind and source–to create a repeatable value. |
description | Description | No | Object. |
description.purpose | Purpose | No | Purpose of the dataset, table, or column (depending on the level); the key may appear at the dataset, table, or column level. |
description.limitations | Limitations | No | Limitations of the dataset, table, or column (depending on the level); the key may appear at the dataset, table, or column level. |
description.usage | Usage | No | Intended usage of the dataset, table, or column (depending on the level); the key may appear at the dataset, table, or column level. |
tenant | Tenant | No | Indicates the property the data is primarily associated with. Value is case insensitive. |
productDl | E-mail distribution list | No | The email distribution list (DL) of the persons or team responsible for maintaining the dataset. |
productSlackChannel | Support Slack channel | No | Slack channel of the team responsible for maintaining the dataset. |
productFeedbackUrl | Feedback URL | No | The URL for submitting feedback to the team responsible for maintaining the dataset. |
sourcePlatform | Source platform | No | The platform where the dataset resides. Expected value is GoogleCloudPlatform, IBMCloud, Azure... |
sourceSystem | Source system | No | The system where the dataset resides. Expected value can be BigQuery. |
project | Project | No | Associated cloud project name, can be used for billing or administrative purpose. Used to be datasetProject. |
datasetName | Dataset name | No | May be required in cloud instance like GCP BigQuery dataset name. |
server | Server | No | The server where the dataset resides. |
kind | Kind | Yes | The kind of file this is. Valid value is DataContract . |
apiVersion | Standard version | No | Version of the standard used to build data contract. Default value is v2.2.2. |
type | Type | Yes | Identifies the types of objects in the dataset. For BigQuery or any other database, the expected value would be Tables. |
driver | Driver | No | The connection driver required to connect to the dataset. |
driverVersion | Driver version | No | The version of the connection driver to be used to connect to the dataset. |
username | Username | No | User credentials for connecting to the dataset; how the credentials will be stored/passed is outside of the scope of the contract. |
password | Password | No | User credentials for connecting to the dataset; how the credentials will be stored/passed is out of the scope of this contract. |
schedulerAppName | Scheduler App Name | No | Required if you want to schedule stuff. The name of the application used to schedule jobs |
Dataset and schema
This section describes the dataset and the schema of the data contract. It is the support for data quality, which is detailed in the next section.
Example
dataset:
- table: tbl
physicalName: tbl_1 # NEW in v2.1.0, Optional, default value is table name + version separated by underscores, as table_1_2_0
priorTableName: null # if needed
description: Provides core payment metrics
authoritativeDefinitions: # NEW in v2.2.0, inspired by the column-level authoritative links
- url: https://catalog.data.gov/dataset/air-quality
type: businessDefinition
- url: https://youtu.be/jbY1BKFj9ec
type: videoTutorial
tags: null
dataGranularity: Aggregation on columns txn_ref_dt, pmt_txn_id
columns:
- column: txn_ref_dt
isPrimary: false # NEW in v2.1.0, Optional, default value is false, indicates whether the column is primary key in the table.
primaryKeyPosition: -1
businessName: transaction reference date
logicalType: date
physicalType: date
isNullable: false
description: null
partitionStatus: true
partitionKeyPosition: 1
clusterStatus: false
clusterKeyPosition: -1
criticalDataElementStatus: false
tags: []
classification: public
transformSourceTables:
- table_name_1
- table_name_2
- table_name_3
transformLogic: sel t1.txn_dt as txn_ref_dt from table_name_1 as t1, table_name_2 as t2, table_name_3 as t3 where t1.txn_dt=date-3
transformDescription: Defines the logic in business terms.
sampleValues:
- 2022-10-03
- 2020-01-28
- column: rcvr_id
isPrimary: true # NEW in v2.1.0, Optional, default value is false, indicates whether the column is primary key in the table.
primaryKeyPosition: 1
businessName: receiver id
logicalType: string
physicalType: varchar(18)
isNullable: false
description: A description for column rcvr_id.
partitionStatus: false
partitionKeyPosition: -1
clusterStatus: true
clusterKeyPosition: 1
criticalDataElementStatus: false
tags: []
classification: restricted
encryptedColumnName: enc_rcvr_id
- column: rcvr_cntry_code
isPrimary: false # NEW in v2.1.0, Optional, default value is false, indicates whether the column is primary key in the table.
primaryKeyPosition: -1
businessName: receiver country code
logicalType: string
physicalType: varchar(2)
isNullable: false
description: null
partitionStatus: false
partitionKeyPosition: -1
clusterStatus: false
clusterKeyPosition: -1
criticalDataElementStatus: false
tags: []
classification: public
authoritativeDefinitions:
- url: https://collibra.com/asset/742b358f-71a5-4ab1-bda4-dcdba9418c25
type: businessDefinition
- url: https://github.com/myorg/myrepo
type: transformationImplementation
- url: jdbc:postgresql://localhost:5432/adventureworks/tbl_1/rcvr_cntry_code
type: implementation
encryptedColumnName: rcvr_cntry_code_encrypted
Definitions
Key | UX label | Required | Description |
---|---|---|---|
dataset | Dataset | Yes | Array. A list of tables within the dataset to be cataloged. |
dataset.table | Table | Yes | Name of the table being cataloged; the value should only contain the table name. Do not include the project or dataset name in the value. |
dataset.table.physicalName | Physical Name | No | Physical name of the table, default value is table name + version separated by underscores, as table_1_2_0 . |
dataset.table.priorTableName | Prior Table Name | No | Name of the previous version of the dataset, if applicable. |
dataset.table.description | Description | No | Description of the dataset. |
dataset.table.authoritativeDefinitions | Authoritative Definitions | No | List of links to sources that provide more details on the table; examples would be a link to an external definition, a training video, a GitHub repo, Collibra, or another tool. Authoritative definitions follow the same structure in the standard. |
dataset.table.dataGranularity | Data Granularity | No | Granular level of the data in the table. Example would be pmt_txn_id . |
dataset.table.columns | Columns | Yes | Array. A list of columns in the table. |
dataset.table.columns.column | Column | Yes | The name of the column. |
dataset.table.columns.column.isPrimaryKey | Primary Key | No | Boolean value specifying whether the column is primary or not. Default is false. |
dataset.table.columns.column.primaryKeyPosition | Primary Key Position | No | If column is a primary key, the position of the primary key column. Starts from 1. Example of account_id, name being primary key columns, account_id has primaryKeyPosition 1 and name primaryKeyPosition 2. Default to -1. |
dataset.table.columns.column.businessName | Business Name | No | The business name of the column. |
dataset.table.columns.column.logicalType | Logical Type | Yes | The logical column datatype. |
dataset.table.columns.column.physicalType | Physical Type | Yes | The physical column datatype. |
dataset.table.columns.column.description | Description | No | Description of the column. |
dataset.table.columns.column.isNullable | Nullable | No | Indicates if the column may contain Null values; possible values are true and false. Default is false. |
dataset.table.columns.column.isUnique | Unique | No | Indicates if the column contains unique values; possible values are true and false. Default is false. |
dataset.table.columns.column.partitionStatus | Partition Status | No | Indicates if the column is partitioned; possible values are true and false. |
dataset.table.columns.column.partitionKeyPosition | Partition Key Position | No | If column is used for partitioning, the position of the partition column. Starts from 1. Example of country, year being partition columns, country has partitionKeyPosition 1 and year partitionKeyPosition 2. Default to -1. |
dataset.table.columns.column.clusterStatus | Cluster Status | No | Indicates of the column is clustered; possible values are true and false. |
dataset.table.columns.column.clusterKeyPosition | Cluster Key Position | No | If column is used for clustering, the position of the cluster column. Starts from 1. Example of year, date being cluster columns, year has clusterKeyPosition 1 and date clusterKeyPosition 2. Default to -1. |
dataset.table.columns.column.classification | Classification | No | Can be anything, like confidential, restricted, and public to more advanced categorization. Some companies like PayPal, use data classification indicating the class of data in the column; expected values are 1, 2, 3, 4, or 5. |
dataset.table.columns.column.authoritativeDefinitions | Authoritative Definitions | No | List of links to sources that provide more detail on column logic or values; examples would be URL to a GitHub repo, Collibra, on another tool. |
dataset.table.columns.column.encryptedColumnName | Encrypted Column Name | No | The column name within the table that contains the encrypted column value. For example, unencrypted column email_address might have an encryptedColumnName of email_address_encrypt . |
dataset.table.columns.column.transformSourceTables | Transform Source Tables | No | List of sources used in column transformation. |
dataset.table.columns.column.transformLogic | Transform Logic | No | Logic used in the column transformation. |
dataset.table.columns.column.transformDescription | Transform Description | No | Describes the transform logic in very simple terms. |
dataset.table.columns.column.sampleValues | Sample Values | No | List of sample column values. |
dataset.table.columns.column.criticalDataElementStatus | Critical Data Element Status | No | True or false indicator; If element is considered a critical data element (CDE) then true else false. |
dataset.table.columns.column.tags | Tags | No | A list of tags that may be assigned to the dataset, table or column; the tags keyword may appear at any level. |
Authorative definitions
Updated in ODCS (Open Data Contract Standard) v2.2.1.
Key | UX label | Required | Description |
---|---|---|---|
type | Definition type | Yes | Type of definition for authority: v2.2.1 adds standard values: businessDefinition , transformationImplementation , videoTutorial , tutorial , and implementation . |
url | URL to definition | Yes | URL to the authority. |
Data quality
This section describes data quality rules & parameters. They are tightly linked to the schema described in the previous section.
Example of data quality at the dataset level
Note: * This example relies on a data quality tool called Elevate. It should be easily transformed to any other tool. If you have questions, please raise an issue. * The Open Data Contract Standard has a provision for supporting multiple data quality tools.
dataset:
- table: tab1
# ...
quality:
- code: countCheck # Required, name of the rule
templateName: CountCheck # NEW in v2.1.0 Required
description: Ensure row count is within expected volume range # Optional
toolName: Elevate # Required
toolRuleName: DQ.rw.tab1.CountCheck # NEW in v2.1.0 Optional (Available only to the users who can change in source code edition)
dimension: completeness # Optional
type: reconciliation # Optional NEW in v2.1.0 default value for column level check - dataQuality and for table level reconciliation
severity: error # Optional NEW in v2.1.0, default value is error
businessImpact: operational # Optional NEW in v2.1.0
scheduleCronExpression: 0 20 * * * # Optional NEW in v2.1.0 default schedule - every day 10 a.m. UTC
- code: distinctCheck
description: enforce distinct values
toolName: Elevate
templateName: DistinctCheck
dimension: completeness
toolRuleName: DQ.rw.tab1.DistinctCheck
columns:
- txn_ref_dt
- rcvr_id
column: null
type: reconciliation
severity: error
businessImpact: operational
scheduleCronExpression: 0 20 * * *
customProperties:
- property: FIELD_NAME
value: rcvr_id
- property: FILTER_CONDITIONS
value: 1>0
- code: piiCheck
description: null
toolName: Elevate
templateName: PIICheck
toolRuleName: DQ.rw.tab1_2_0_0.piiCheck
type: dataQuality
severity: error
businessImpact: operational
scheduleCronExpression: 0 20 * * *
customProperties:
- property: FIELD_NAME
value:
- property: FILTER_CONDITIONS
value:
Example of data quality at the column level
dataset:
- table: tab1
- column: rcvr_id
isPrimary: true # NEW in v2.1.0, Optional, default value is false, indicates whether the column is primary key in the table.
businessName: receiver id
# ...
quality:
- code: nullCheck
templateName: NullCheck
description: column should not contain null values
toolName: Elevate
toolRuleName: DQ.rw.tab1_2_0_0.rcvr_cntry_code.NullCheck
dimension: completeness # dropdown 7 values
type: dataQuality
severity: error
businessImpact: operational
scheduleCronExpression: 0 20 * * *
customProperties:
- property: FIELD_NAME
value:
- property: COMPARE_TO
value:
- property: COMPARISON_TYPE
value: Greater than
Definitions
Key | UX label | Required | Description |
---|---|---|---|
quality | Quality | No | Quality tag with all the relevant information for rule setup and execution. |
quality.code | Quality Code | No | The Rosewall data quality code(s) indicating which quality checks need to be performed at the dataset, table, or column level. The quality keyword may appear at any level. Some quality checks require parameters, so the check can be completed (e.g., a list of fields used to identify a distinct row). Therefore, some quality checks may be followed by a single value or an array. See the appendix for a link to quality checks. |
quality.templateName | Template Name | Yes | The template name that indicates what is the equivalent template from the tool. |
quality.description | Description | No | Describe the quality check to be completed. |
quality.toolName | Tool Name | Yes | Name of the tool used to complete the quality check; Most will be Elevate initially. |
quality.toolRuleName | Tool Rule Name | No | Name of the quality tool's rule created to complete the quality check. |
quality.dimension | Dimension | No | The key performance indicator (KPI) or dimension for data quality. |
quality.columns | Columns | No | List of columns to be used in the quality check. |
quality.column | Column | No | To be used in lieu of quality.columns when only a single column is required for the quality check. |
quality.type | Type | No | The type of quality check. |
quality.severity | Severity | No | The severity of the quality rule. |
quality.businessImpact | Business Impact | No | Consequences of the rule failure. |
quality.scheduleCronExpression | Schedule Expression | No | Rule execution schedule details. |
quality.customProperties | Custom Properties | No | Additional properties required for rule execution. |
Pricing
This section covers pricing when you bill your customer for using this data product. Pricing is experimental in v2.1.1 of the data contract.
Example
Definitions
Key | UX label | Required | Description |
---|---|---|---|
price | Price | No | Object |
price.priceAmount | Price Amount | No | Subscription price per unit of measure in priceUnit . |
price.priceCurrency | Price Currency | No | Currency of the subscription price in price.priceAmount . |
price.priceUnit | Price Unit | No | The unit of measure for calculating cost. Examples megabyte, gigabyte. |
Stakeholders
This section lists stakeholders and the history of their relation with this data contract.
Example
stakeholders:
- username: ceastwood
role: Data Scientist
dateIn: 2022-08-02
dateOut: 2022-10-01
replacedByUsername: mhopper
- username: mhopper
role: Data Scientist
dateIn: 2022-10-01
- username: daustin
role: Owner
comment: Keeper of the grail
dateIn: 2022-10-01
Definitions
The UX label is the label used in the UI and other user experiences. It is not limited to BlueRacket.
Key | UX label | Required | Description |
---|---|---|---|
stakeholders | Stakeholders | No | Array |
stakeholders.username | Username | No | The stakeholder's username or email. |
stakeholders.role | Role | No | The stakeholder's job role; Examples might be owner, data steward. There is no limit on the role. |
stakeholders.dateIn | Date In | No | The date when the user became a stakeholder. |
stakeholders.dateOut | Date Out | No | The date when the user ceased to be a stakeholder |
stakeholders.replacedByUsername | Replaced By Username | No | The username of the user who replaced the stakeholder |
Roles
This section lists the roles that a consumer may need to access the dataset depending on the type of access they require.
Example
roles:
- role: microstrategy_user_opr
access: read
firstLevelApprovers: Reporting Manager
secondLevelApprovers: 'mandolorian'
- role: bq_queryman_user_opr
access: read
firstLevelApprovers: Reporting Manager
secondLevelApprovers: na
- role: risk_data_access_opr
access: read
firstLevelApprovers: Reporting Manager
secondLevelApprovers: 'dathvador'
- role: bq_unica_user_opr
access: write
firstLevelApprovers: Reporting Manager
secondLevelApprovers: 'mickey'
Definitions
Key | UX label | Required | Description |
---|---|---|---|
roles | Roles | Yes | Array. A list of roles that will provide user access to the dataset. |
roles.role | Role | Yes | Name of the IAM role that provides access to the dataset; the value will generally come directly from the "BQ dataset to IAM roles mapping" document. |
roles.access | Access | Yes | The type of access provided by the IAM role; the value will generally come directly from the "BQ dataset to IAM roles mapping" document. |
roles.firstLevelApprovers | 1st Level Approvers | No | The name(s) of the first-level approver(s) of the role; the value will generally come directly from the "BQ dataset to IAM roles mapping" document. |
roles.secondLevelApprovers | 2nd Level Approvers | No | The name(s) of the second-level approver(s) of the role; the value will generally come directly from the "BQ dataset to IAM roles mapping" document. |
Service-level agreement
This section describes the service-level agreements (SLA). SLA have been extended in version v2.1.0.
- Use the
Table.Column
to indicate the number to do the checks on, as inSELECT txn_ref_dt FROM tab1
. - Separate multiple table.columns by a comma, as in
table1.col1
,table2.col1
,table1.col2
. - If there is only one table in the contract, the table name is not required.
Example
slaDefaultColumn: tab1.txn_ref_dt # Optional, default value is partitionColumn.
slaProperties:
- property: latency # Property, see list of values in DP QoS
value: 4
unit: d # d, day, days for days; y, yr, years for years
column: tab1.txn_ref_dt # This would not be needed as it is the same table.column as the default one
- property: generalAvailability
value: 2022-05-12T09:30:10-08:00
- property: endOfSupport
value: 2032-05-12T09:30:10-08:00
- property: endOfLife
value: 2042-05-12T09:30:10-08:00
- property: retention
value: 3
unit: y
column: tab1.txn_ref_dt
- property: frequency
value: 1
valueExt: 1
unit: d
column: tab1.txn_ref_dt
- property: timeOfAvailability
value: 09:00-08:00
column: tab1.txn_ref_dt
driver: regulatory # Describes the importance of the SLA: [regulatory|analytics|operational|...]
- property: timeOfAvailability
value: 08:00-08:00
column: tab1.txn_ref_dt
driver: analytics
Definitions
Key | UX label | Required | Description |
---|---|---|---|
slaDefaultColumn | Default SLA column(s) | No | Columns (using the Table.Column notation) to do the checks on. By default, it is the partition column. |
slaProperties | SLA | No | A list of key/value pairs for SLA specific properties. There is no limit on the type of properties (more details to come). |
slaProperties.property | Property | Yes | Specific property in SLA, check the periodic table. May requires units (more details to come). |
slaProperties.value | Value | Yes | Agreement value. The label will change based on the property itself. |
slaProperties.valueExt | Extended value | No - unless needed by property | Extended agreement value. The label will change based on the property itself. |
slaProperties.unit | Unit | No - unless needed by property | d, day, days for days; y, yr, years for years, etc. Units use the ISO standard. |
slaProperties.column | Column(s) | No | Column(s) to check on. Multiple columns should be extremely rare and, if so, separated by commas. |
slaProperties.driver | Driver | No | Describes the importance of the SLA from the list of: regulatory , analytics , or operational . |
Other properties
This section covers other properties you may find in a data contract.
Example
customProperties:
- property: refRulesetName
value: gcsc.ruleset.name
- property: somePropertyName
value: property.value
- property: dataprocClusterName # Used for specific applications like Elevate
value: [cluster name]
systemInstance: instance.ClimateQuantum.org
contractCreatedTs: 2022-11-15 02:59:43
Definitions
Key | UX label | Required | Description |
---|---|---|---|
customProperties | Custom Properties | No | A list of key/value pairs for custom properties. Initially created to support the REF ruleset property. |
customProperties.property | Property | No | The name of the key. Names should be in camel case–the same as if they were permanent properties in the contract. |
customProperties.value | Value | No | The value of the key. |
systemInstance | System Instance | No | System Instance name where the dataset resides. |
contractCreatedTs | Contract Created UTC | No | Timestamp in UTC of when the data contract was created. |