Database Administrator FAQ - Knowledge Base

This FAQ is designed to help database administrators understand how Reactor's ETL platform impacts their day-to-day work, particularly concerning data ingestion, schema management, and data lifecycle within their data warehouse.

Understanding Reactor's Data Delivery

Q: What is Reactor, and how does it deliver data to my data warehouse?

A: Reactor is an intelligent ETL (Extract, Transform, Load) pipeline platform that ingests, transforms, and delivers data in real-time or batch to your data warehouse or data lake. Reactor can export structured data (JSON, Avro, or Parquet) directly to Amazon S3 or Google Cloud Storage (GCS) buckets. Additionally, Reactor natively supports direct export to Google BigQuery and Snowflake. When exporting to S3 or GCS, your DWH platform can then directly access and load these files using native connectors or commands, allowing you to create external tables or ingest the data into your DWH tables. This approach leverages the scalability and cost-effectiveness of cloud object storage for interim data staging.

See our help article Reactor Overview for a fuller overview of Reactor's capabilities.

Understanding Reactor's Data Transformation

Q: How can I understand how data is transformed within Reactor before it lands in my data warehouse?

A: While you, as a DBA, may not directly configure the transformations within Reactor, understanding the logic applied to the data before it reaches your data warehouse is crucial for data validation, troubleshooting, and downstream analysis. Reactor supports both its native expression language and Python-based expressions to transform data between the source and the destination schema and in intermediate models.

To gain insight into these transformations, you can use the Electron Data Lineage Agent, Reactor's built-in AI agent. This agent specializes in tracing data from source to output. You can use it to:

Understand how a specific native expression works.
Trace the origin and transformations of any field.
Debug unexpected values by following the data path.

For example, you could ask Electron Assistant (your central AI guide in Reactor) to "Trace the origin and transformations of the 'order_total' field". Electron would then route this request to the Data Lineage Agent, providing you with a clear understanding of how that specific field is calculated or modified within Reactor's pipeline.

See our help article Introduction to Electron AI for more information about using Electron to better understand data lineage and data transformations.

Data Privacy and Compliance

Q: How does Reactor handle Personally Identifiable Information (PII) and what are my responsibilities as a DBA?

A: Reactor provides PII controls at the source level. Reactor Data users can define PII fields for each connected data source, categorizing them by PII classification (PII, None, or Other) and PII level (No PII, Direct PII, or Indirect PII). Reactor also supports two kinds of PII handling actions. The first applies on storage: the field is anonymized before data is stored, so the original PII value is never stored, never visible in the Reactor interface, and never transmitted to any downstream destination. The second applies on anonymization: the field is anonymized when Reactor processes a data deletion request. Anonymization options include hashing, masking, or replacing PII with null or a default value.
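
To make the three anonymization options concrete, here is a minimal Python sketch of what hashing, masking, and replacement do to a field value. It is an illustration only, not Reactor's implementation: the function names, the salt parameter, the choice of SHA-256, and the sample record are all assumptions made for this example.

```python
import hashlib


def hash_value(value: str, salt: str = "") -> str:
    """Hashing: replace the value with a one-way SHA-256 hex digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_value(value: str, visible: int = 2, mask_char: str = "*") -> str:
    """Masking: hide everything except the last `visible` characters."""
    if visible <= 0 or len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def replace_value(_value, default=None):
    """Replacement: swap the value for null (None) or a default."""
    return default


def anonymize(record: dict, rules: dict) -> dict:
    """Return a copy of `record` with each rule applied to its field."""
    out = dict(record)
    for field, action in rules.items():
        if out.get(field) is not None:
            out[field] = action(out[field])
    return out


# Hypothetical record and per-field rules, for illustration only.
record = {
    "email": "ana@example.com",
    "phone": "5551234567",
    "name": "Ana",
    "order_total": 42.5,
}
rules = {"email": hash_value, "phone": mask_value, "name": replace_value}
clean = anonymize(record, rules)
print(clean)
```

Fields without a rule (order_total here) pass through unchanged, which mirrors the idea that only fields defined as PII are affected.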

As a DBA, you should be aware that your organization is responsible for promptly notifying Reactor Data when a consumer requests the deletion of their information. Reactor will then mask and anonymize all records associated with the identified individuals within 10 business days, per the PII Handling settings. After masking and anonymizing the records, Reactor will emit two versions of every message it has ingested that is associated with that individual:

Version 1: This version contains the unredacted message, with the Reactor metafield should_delete set to true. This field indicates that this specific version and all prior versions of the record should be purged from downstream destinations.
Version 2: This version contains the redacted version of the message (redacted at the field level as defined in the source's Data Standard), with the Reactor metafield has_redactions set to true. This indicates that all PII field values in the record have been redacted, masked, or nulled, and those fields should not be used in downstream customer activations.

For clients who use Reactor to export data to their own data warehouse (BigQuery, Snowflake, Databricks) or file-store solution (GCS, S3/Redshift), the way to honor Reactor's deletion policy with minimal maintenance is to follow the best practices for deduplication and pruning. The provided deduplication and pruning queries (discussed below) ensure that messages marked for deletion (reactor_is_deleted = TRUE in the landing table, mapped from should_delete) and all other previously ingested versions are removed from your data warehouse.
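
The effect of the two emitted versions on a landing table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Reactor's own logic: it assumes rows carry the landing-table columns reactor_is_deleted, reactor_is_redacted, node, scope, source_id, and loaded_at described later in this FAQ, that loaded_at values are ISO-8601 strings in a single format (so string comparison matches time order), and that the redacted version lands at or after the delete marker. The function name and sample rows are hypothetical.

```python
def apply_deletion_markers(rows):
    """Return the rows that may remain after honoring delete markers."""
    # Newest delete-marker time for each (node, scope, source_id) key.
    cutoff = {}
    for row in rows:
        if row["reactor_is_deleted"]:
            key = (row["node"], row["scope"], row["source_id"])
            if key not in cutoff or row["loaded_at"] > cutoff[key]:
                cutoff[key] = row["loaded_at"]

    kept = []
    for row in rows:
        key = (row["node"], row["scope"], row["source_id"])
        if row["reactor_is_deleted"]:
            continue  # the marker itself still carries unredacted data
        at_or_before_marker = key in cutoff and row["loaded_at"] <= cutoff[key]
        if at_or_before_marker and not row["reactor_is_redacted"]:
            continue  # a prior, unredacted version of a deleted record
        kept.append(row)
    return kept


def make_row(source_id, loaded_at, deleted, redacted, email):
    """Build a hypothetical landing-table row for the demo below."""
    return {
        "node": "shop", "scope": "us", "source_id": source_id,
        "loaded_at": loaded_at, "reactor_is_deleted": deleted,
        "reactor_is_redacted": redacted, "email": email,
    }


rows = [
    make_row("c1", "2024-01-01T00:00:00", False, False, "ana@example.com"),
    make_row("c1", "2024-02-01T00:00:00", True, False, "ana@example.com"),
    make_row("c1", "2024-02-01T00:00:01", False, True, None),
    make_row("c2", "2024-01-15T00:00:00", False, False, "bo@example.com"),
]
kept = apply_deletion_markers(rows)
print([(r["source_id"], r["loaded_at"]) for r in kept])
```

Only the redacted version of c1 and the untouched c2 row survive, so no unredacted value for the deleted individual remains downstream.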

See our help articles Requesting Removal of Consumer Data for more information on customer data deletion and redaction, and California Consumer Privacy Act (CCPA) Compliance for more information about how Reactor complies with consumer privacy regulations.

Managing Landing Tables

Q: What are the best practices for staging landing tables for data from Reactor?

A: Your landing tables are the initial point of entry for your valuable data. To ensure a smooth, efficient, and insightful data journey, consider these best practices:

Consistent and Clear Naming Conventions: Adopt a standardized naming convention for your landing tables. This makes it easier for everyone to understand the purpose and content of each table at a glance. Use descriptive and consistent column names.
Include Essential Metadata Columns: Incorporating specific metadata columns into your landing tables is a crucial best practice. These fields provide essential context, streamline deduplication processes, and empower effective landing table pruning. Recommended metadata columns and their Reactor mappings include:
- ingestion_timestamp: The timestamp when Reactor initially ingested the event that led to the creation of this record. Recommended Reactor mapping expression: source._reactor.event_timestamp.
- reactor_is_deleted: A boolean indicating whether a record has been purged from Reactor and should be deleted from downstream destinations. Recommended Reactor mapping expression: IF (source._reactor.should_delete, TRUE, FALSE).
- reactor_is_redacted: A boolean indicating whether personally identifiable data in a record has been redacted by Reactor. Recommended Reactor mapping expression: IF (source._reactor.has_redactions, TRUE, FALSE).
- loaded_at: The precise timestamp when the row was mapped. Recommended Reactor mapping expression: CURRENT_DATETIME().
- message_version: An optional timestamp indicating when the record was last updated in the source system. Useful for keeping a history of an entity and determining the latest version during deduplication. Recommended Reactor mapping expression: Map to the source system's update timestamp.
- node: A unique identifier for the originating source system of the record. Recommended Reactor mapping expression: SPLIT(source._reactor.input_event_scid, ":") [2].
- scope: A parent-level entity identifier used to differentiate records with the same source_id that are distinct within a broader business context. Recommended Reactor mapping expression: SPLIT(source._reactor.input_event_scid, ":") [3].
- source_id: The unique identifier of the record as it exists in the source system. Recommended Reactor mapping expression: Map to the source system's unique identifier.
- version: Indicates which version of mapping configurations was applied to the record. Reactor auto-outputs this field as an epoch timestamp representing the latest update timestamp of any model configuration applied to the record; no mapping expression is necessary.
Optimal Data Types: Choose the most appropriate data types for each column to optimize storage and query performance.
Partitioning and Clustering (for Large Tables): For very large landing tables, consider partitioning based on a relevant date or timestamp column (like loaded_at or an update timestamp). Within partitions, explore clustering on columns frequently used in your deduplication logic (like the unique identifier and update timestamp).
Design for Append-Only Inserts: Reinforce the understanding that Reactor appends new rows to the landing table. Avoid direct updates or deletes on the landing table itself, as this can complicate your deduplication and historical tracking strategies.
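
As a rough illustration of the practices above, the following Python sketch creates a landing table with the recommended metadata columns in an in-memory SQLite database and appends two versions of the same record. The table name orders_landing, the order_total column, the column types, and the sample values are assumptions for this example. SQLite has no partitioning or clustering, so a plain index on the deduplication key stands in for them here; your warehouse's DDL will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders_landing (
        source_id            TEXT    NOT NULL,  -- record id in the source system
        node                 TEXT    NOT NULL,  -- originating source system
        scope                TEXT    NOT NULL,  -- parent-level entity identifier
        ingestion_timestamp  TEXT    NOT NULL,  -- when Reactor ingested the event
        loaded_at            TEXT    NOT NULL,  -- when the row was mapped
        message_version      TEXT,              -- source system update timestamp
        version              INTEGER,           -- mapping config version (epoch)
        reactor_is_deleted   INTEGER NOT NULL DEFAULT 0,
        reactor_is_redacted  INTEGER NOT NULL DEFAULT 0,
        order_total          REAL               -- hypothetical business column
    )
""")
# Stand-in for clustering on the columns used by deduplication logic.
conn.execute("""
    CREATE INDEX idx_orders_landing_key
    ON orders_landing (node, scope, source_id, loaded_at)
""")


def land(row: dict) -> None:
    """Append one row; landing tables are never updated in place."""
    cols = ", ".join(row)
    marks = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO orders_landing ({cols}) VALUES ({marks})",
        tuple(row.values()),
    )


base = {
    "source_id": "o-1", "node": "shop", "scope": "us",
    "ingestion_timestamp": "2024-03-01T10:00:00", "version": 1709280000,
}
land({**base, "loaded_at": "2024-03-01T10:00:05",
      "message_version": "2024-03-01T10:00:00", "order_total": 10.0})
land({**base, "loaded_at": "2024-03-02T09:00:04",
      "message_version": "2024-03-02T09:00:00", "order_total": 12.0})

row_count = conn.execute("SELECT COUNT(*) FROM orders_landing").fetchone()[0]
print(row_count)  # both versions are retained
```

Because an update arrives as a second row rather than an overwrite, the table keeps the full history until a deduplication view and a pruning job decide what to expose and what to remove.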

See our help article Creating Landing Tables in the Data Warehouse for more information on how to stage landing tables.

Data Lifecycle Management (Deduplication & Pruning)

Q: How do I deduplicate data exported from Reactor in my data warehouse?

A: Reactor appends a new row to your landing table for each update instead of modifying existing rows in place. As a result, multiple rows accumulate over time, each representing a different version of the same underlying record. To keep your analysis focused on the most current information and to avoid duplicate records, we highly recommend building a deduplication view on top of your landing table.

This view acts as a "smart filter," ensuring that only the latest and greatest version of each record shines through. You should define the primary key or unique identifier that will be used for deduplication (e.g., source_id, node, scope) and determine the column that indicates the order of updates (typically a timestamp like message_version or loaded_at).

Reactor's help documentation provides example deduplication queries tailored for:

Databricks (Spark SQL)
Google BigQuery
Microsoft SQL Server
Amazon Redshift
Snowflake

These queries typically use window functions (e.g., ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)) to identify the most recent version of a record, often filtering out records where reactor_is_deleted is TRUE.
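
The window-function pattern can be tried locally with Python's built-in sqlite3 module. The sketch below is not one of Reactor's published queries: the table name orders_landing, the view name orders_current, the order_total column, the sample rows, and the choice to order by message_version and then loaded_at are assumptions for illustration. It requires SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_landing (
        node TEXT, scope TEXT, source_id TEXT,
        message_version TEXT, loaded_at TEXT,
        reactor_is_deleted INTEGER, order_total REAL
    );

    INSERT INTO orders_landing VALUES
        ('shop', 'us', 'o-1', '2024-03-01T10:00:00', '2024-03-01T10:00:05', 0, 10.0),
        ('shop', 'us', 'o-1', '2024-03-02T09:00:00', '2024-03-02T09:00:04', 0, 12.0),
        ('shop', 'us', 'o-2', '2024-03-01T11:00:00', '2024-03-01T11:00:03', 0, 99.0),
        ('shop', 'us', 'o-2', '2024-03-03T08:00:00', '2024-03-03T08:00:02', 1, 99.0);

    -- Rank the versions of each record from newest to oldest, keep the
    -- newest, then hide records whose newest version is a delete marker.
    CREATE VIEW orders_current AS
    SELECT node, scope, source_id, message_version, loaded_at, order_total
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY node, scope, source_id
                   ORDER BY message_version DESC, loaded_at DESC
               ) AS rn
        FROM orders_landing
    )
    WHERE rn = 1 AND reactor_is_deleted = 0;
""")

current = conn.execute(
    "SELECT source_id, order_total FROM orders_current ORDER BY source_id"
).fetchall()
print(current)  # [('o-1', 12.0)]
```

The deleted flag is checked after ranking rather than before. Filtering deleted rows first would let an older, undeleted version of o-2 reappear as the "latest" row.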

See our help article Keeping Your Data Warehouse Lean and Insightful for more information on deduplication.

Q: How do I prune stale data from my landing tables in the data warehouse?

A: Over time, not all historical data retains its value. Rows in your landing table that were once critical for analysis might become stale and no longer contribute meaningful insights. Holding onto this outdated information can lead to increased storage costs and potentially impact query performance. This is where the concept of landing table pruning comes in, which is the process of systematically deleting these "stale" rows from your landing table.

You should define clear retention policies for the data in your landing tables based on analytical, operational, or AI requirements and establish a regular schedule for pruning stale data. You may also consider archiving older data in a separate storage location if it is needed for compliance or infrequent historical analysis.

Reactor's help documentation provides example pruning queries that select rows in the landing table that are not included in a deduplicated view of the landing table, referencing fields like node, scope, source_id, version, and loaded_at:

Databricks (Spark SQL)
Google BigQuery
Microsoft SQL Server
Amazon Redshift
Snowflake
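
The pruning pattern, selecting landing rows that are absent from the deduplicated view and then deleting them, can be sketched with Python's built-in sqlite3 module. This is an illustration rather than one of Reactor's published queries: the table and view names, the sample rows, and ordering versions by loaded_at are assumptions. In production you would typically also apply your retention window before deleting. It requires SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_landing (
        node TEXT, scope TEXT, source_id TEXT,
        version INTEGER, loaded_at TEXT, reactor_is_deleted INTEGER
    );

    INSERT INTO orders_landing VALUES
        ('shop', 'us', 'o-1', 1709280000, '2024-03-01T10:00:05', 0),
        ('shop', 'us', 'o-1', 1709280000, '2024-03-02T09:00:04', 0),
        ('shop', 'us', 'o-2', 1709280000, '2024-03-01T11:00:03', 0),
        ('shop', 'us', 'o-2', 1709280000, '2024-03-03T08:00:02', 1),
        ('shop', 'us', 'o-3', 1709280000, '2024-03-04T12:00:01', 0);

    -- Deduplicated view: newest row per record, hidden if it is deleted.
    CREATE VIEW orders_current AS
    SELECT node, scope, source_id, version, loaded_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY node, scope, source_id
                   ORDER BY loaded_at DESC
               ) AS rn
        FROM orders_landing
    )
    WHERE rn = 1 AND reactor_is_deleted = 0;
""")

# Step 1: select landing rows with no match in the deduplicated view.
stale = conn.execute("""
    SELECT l.node, l.scope, l.source_id, l.version, l.loaded_at
    FROM orders_landing AS l
    LEFT JOIN orders_current AS c
           ON c.node = l.node
          AND c.scope = l.scope
          AND c.source_id = l.source_id
          AND c.version = l.version
          AND c.loaded_at = l.loaded_at
    WHERE c.source_id IS NULL
""").fetchall()

# Step 2: delete exactly those rows.
conn.executemany("""
    DELETE FROM orders_landing
    WHERE node = ? AND scope = ? AND source_id = ?
      AND version = ? AND loaded_at = ?
""", stale)

remaining = conn.execute(
    "SELECT source_id, loaded_at FROM orders_landing ORDER BY source_id"
).fetchall()
print(len(stale), remaining)
```

Superseded versions of o-1 and every row of the deleted record o-2 are removed, which is also how rows marked for deletion and their earlier versions leave the warehouse.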

For Databricks users who ingest files from S3 and do not retain these files after ingesting their contents to Databricks, it is necessary to prune the files directly in S3. It is also recommended to consider moving these old JSON files to a lower-cost storage tier or archiving them if they are no longer actively needed for analysis or compliance purposes.
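
One way to automate both the tiering and the pruning of staged files is an S3 lifecycle rule. The Python sketch below only builds the rule as a dictionary in the shape that boto3's put_bucket_lifecycle_configuration accepts. The helper name, the prefix, the rule ID, the bucket name in the comment, and the day counts are hypothetical and should be replaced with values that match your retention policy.

```python
def build_lifecycle_config(prefix: str, transition_days: int,
                           expire_days: int) -> dict:
    """Build an S3 lifecycle configuration for staged export files."""
    if expire_days <= transition_days:
        raise ValueError("expire_days must be greater than transition_days")
    return {
        "Rules": [
            {
                "ID": "prune-staged-exports",   # hypothetical rule name
                "Filter": {"Prefix": prefix},   # only files under this prefix
                "Status": "Enabled",
                # Move files to a lower-cost storage tier first...
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                # ...then delete them once the retention window has passed.
                "Expiration": {"Days": expire_days},
            }
        ]
    }


config = build_lifecycle_config("reactor/exports/", 30, 180)
print(config)

# Applying it requires boto3 and AWS credentials, for example:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-staging-bucket",          # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```

A lifecycle rule runs inside S3 itself, so no scheduled job is needed on the Databricks side. If files must be removed only after a successful ingest, an explicit delete step in the ingestion job may be a better fit.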

See our help article Keeping Your Data Warehouse Lean and Insightful for more information on pruning.