Introduction

Your landing tables are the initial point of entry for your valuable data. By following these best practices, you can ensure a smooth, efficient, and insightful data journey.

Consistent and Clear Naming Conventions

Adopt a standardized naming convention for your landing tables. This makes it easier for everyone to understand the purpose and content of each table at a glance. Consider including prefixes or suffixes that indicate the source system or data domain.
Use descriptive and consistent column names. Clarity here minimizes confusion during transformation and analysis.

Include Essential Metadata Columns

Beyond the core data from your source systems, incorporating specific metadata columns into your landing tables is a crucial best practice. These fields provide essential context, streamline deduplication processes, and empower effective landing table pruning.

Recommended Metadata Column Names and Reactor Mappings

Ingestion Timestamp

Recommended column name: ingestion_timestamp
Data type: timestamp
Description: The timestamp when Reactor initially ingested the event that led to the creation of this record in the landing table. Useful for tracking the data pipeline timeline.
Recommended Reactor mapping expression: source.__reactor.event_timestamp

Is Deleted

Recommended column name: reactor_is_deleted
Data type: boolean
Description: Used when processing data deletion and redaction requests, to indicate whether a record has been purged from Reactor and should be deleted from downstream destinations. For any record with true in the is_deleted field, this version and all prior versions of the record should be purged from downstream destinations. See The Importance of Landing Table Pruning for more information on purging deleted records.
Recommended Reactor mapping expression: IF(source.__reactor.should_delete, TRUE, FALSE)

Is Redacted

Recommended column name: reactor_is_redacted
Data type: boolean
Description: Used when processing data deletion and redaction requests, to indicate whether personally identifiable data in a record has been redacted by Reactor. For any record with true in the has_redaction field, all PII field values in the record have been redacted, masked, or nulled, and those fields should not be used in downstream customer activations.
Recommended Reactor mapping expression: IF(source.__reactor.has_redactions, TRUE, FALSE)

Loaded At

Recommended column name: loaded_at
Data type: timestamp
Description: The precise timestamp when the row was mapped. Crucial for determining data recency.
Recommended Reactor mapping expression: CURRENT_DATETIME()

Message Version

Recommended column name: message_version
Data type: timestamp
Description: An optional timestamp indicating when the record was last updated in the source system. This is a version field you can use to keep a history of an entity, akin to a Type 2 slowly-changing dimension (SCD). It is also used to determine the latest version during deduplication.
Recommended Reactor mapping expression: Map to the source system's update timestamp

Node

Recommended column name: node
Data type: string
Description: A unique identifier for the originating source system of the record. Useful for tracing data lineage and troubleshooting.
Recommended Reactor mapping expression: SPLIT(source.__reactor.input_event_scid,":")[2]

Scope

Recommended column name: scope
Data type: string
Description: A parent-level entity identifier used to differentiate records with the same source_id that are distinct within a broader business context.
Recommended Reactor mapping expression: SPLIT(source.__reactor.input_event_scid,":")[3]

Source ID

Recommended column name: source_id
Data type: string
Description: The unique identifier of the record as it exists in the source system. Essential for matching and deduplication.
Recommended Reactor mapping expression: Map to the source system's unique identifier

Version

Recommended column name: version
Data type: integer
Description: Indicates which version of mapping configurations was applied to the record. Essential for deduplication.
Recommended Reactor mapping expression: Reactor auto-outputs this field as an epoch timestamp (in milliseconds) representing the latest update timestamp of any model configuration applied to the record; no mapping expression is necessary

Optimal Data Types

Choose the most appropriate data types for each column to optimize storage and query performance. For example, use integer types for numerical IDs and appropriate date and time types for date and time information.

Partitioning and Clustering (for Large Tables):

For very large landing tables, consider partitioning based on a relevant date or timestamp column (like the load timestamp or update timestamp). This can significantly improve query performance and make landing table pruning more efficient.

Within partitions, explore clustering on columns frequently used in your deduplication logic (like the unique identifier and update timestamp).

Design for Append-Only Inserts

Reinforce the understanding that Reactor appends new rows to the landing table. Avoid direct updates or deletes on the landing table itself, as this can complicate your deduplication and historical tracking strategies.

Establish Clear Deduplication Strategies

We highly recommend building a deduplication view on top of your landing table to ensure your analytical efforts are laser-focused on the most current information and avoid any data duplication headaches. Think of this view as a smart filter, ensuring that only the latest and greatest version of each record shines through, leading to more accurate and efficient data analysis.

Define the primary key or unique identifier that will be used for deduplication for each landing table.
Determine the column that indicates the order of updates (typically a timestamp).
Document the specific deduplication logic implemented in your views.

See Keeping Your Data Warehouse Lean and Insightful for more details on deduplication and an example deduplication query.

Implement Robust Landing Table Pruning Policies

Over time, not all historical data retains its value. Holding onto this outdated information can lead to increased storage costs and potentially impact query performance. This is where landing table pruning comes in.

Define clear retention policies for the data in your landing tables based on analytical, operational, or AI requirements.
Establish a regular schedule for pruning stale data.
Consider archiving older data in a separate storage location if it is needed for compliance or infrequent historical analysis.
Document your pruning policies and procedures.

See Keeping Your Data Warehouse Lean and Insightful for more details on pruning and example pruning queries.

Monitoring and Alerting

Implement monitoring to track the size and growth of your landing tables. Set up alerts for unexpected data volumes or potential issues.

Documentation is Key

Maintain comprehensive documentation for each landing table, including its purpose, source system, column definitions, deduplication strategy, and pruning policy.

Creating Landing Tables in the Data Warehouse