Data Stack
Telgea's data pipeline moves information from our production databases into a centralized data warehouse, enabling analytics and business intelligence reporting. Our analytics stack consists of Google BigQuery as the data warehouse and Metabase as the business intelligence and visualization tool. Metabase connects directly to BigQuery to query the data and build dashboards. Using a data warehouse (a central repository optimized for business analytics) gives us a clean, organized archive of all of our data, separate from our live production databases. This separation protects our live services: running large analytical queries on a production database could cause slowdowns or crashes, directly impacting customers. Additionally, data warehouses are built for scalability and performance, answering complex questions across years of historical data in seconds. The warehouse also acts as a single source of truth by unifying data from different systems, ensuring consistent reporting.
Available Data
Data snapshots are captured from our source databases and loaded into BigQuery daily at 1:00 AM UTC. The following table shows the current status of key data sources.

| Data Source | Available in BigQuery | Notes |
|---|---|---|
| Production Database Tables | ✅ Yes | All core application tables. |
| Call Detail Records (CDR) | ❌ No | On the roadmap. |
| Number Inventory | ❌ No | On the roadmap. |
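Because loads land once per day at 1:00 AM UTC, a common question when reading a dashboard is how fresh the underlying data is. The sketch below (plain Python, no real table names involved) computes the timestamp of the most recent completed load for any given moment:

```python
from datetime import datetime, timedelta, timezone

def latest_snapshot_cutoff(now: datetime) -> datetime:
    """Return the timestamp of the most recent daily load (1:00 AM UTC).

    Snapshots land at 01:00 UTC, so a query run before that time on a
    given day still reflects the previous day's load.
    """
    cutoff = now.replace(hour=1, minute=0, second=0, microsecond=0)
    if now < cutoff:
        cutoff -= timedelta(days=1)
    return cutoff

# Example: at 00:30 UTC on May 2nd, the freshest data is from May 1st's load.
now = datetime(2024, 5, 2, 0, 30, tzinfo=timezone.utc)
print(latest_snapshot_cutoff(now).isoformat())  # 2024-05-01T01:00:00+00:00
```

In short: anything in BigQuery reflects the state of the source databases as of the most recent 1:00 AM UTC load, not real time.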
Technical Implementation
This section details the technical architecture and data flow from the source databases to the data warehouse.
High-Level Data Flow
The data travels from our Azure-hosted production databases to Google BigQuery through a secure, multi-cloud network configuration. Flow: Azure PostgreSQL -> Azure Private Endpoint -> Azure Virtual Network (VNet) -> Azure VPN Gateway -> IPsec Tunnel -> Google Cloud VPN -> Google Cloud VPC -> Google Cloud Data Fusion -> Google BigQuery -> Metabase
Networking Details
The core of the connection is a secure site-to-site VPN tunnel between our Azure and Google Cloud environments.
- Virtual Private Clouds (VPCs) in both clouds are the key enablers.
- In Azure, the Virtual Network (VNet) securely connects the VPN Gateway to our PostgreSQL servers via a Private Endpoint. This ensures the database isn't exposed to the public internet.
- In Google Cloud, Data Fusion runs its data processing workers inside our VPC. This allows them to access the Azure PostgreSQL source databases directly and securely over the VPN tunnel.
For setup details, see the official guides:
- Azure: Tutorial: Create a site-to-site VPN connection in the Azure portal
- Google Cloud: Creating an HA VPN gateway to a Peer VPN gateway
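When bringing up or debugging the tunnel, a quick sanity check is whether a machine on the Google Cloud side can open a TCP connection to the database's private endpoint over the VPN. This is a generic sketch using only Python's standard library; the IP address and port below are placeholders, not our actual endpoint:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# "10.0.1.4" is a placeholder private-endpoint IP; 5432 is PostgreSQL's
# default port. Run this from a host inside the Google Cloud VPC.
# print(can_reach("10.0.1.4", 5432))
```

A successful connection confirms routing through the tunnel and the Private Endpoint; it does not verify database credentials, which are handled separately in the Data Fusion connection settings.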
Google Cloud Data Fusion Configuration
We use Google Cloud Data Fusion as our ETL (Extract, Transform, Load) tool to move the data.
- Pipeline Structure: Each table replicated from a source database has its own dedicated pipeline in Data Fusion. This isolates failures and simplifies management.
- Components: Each pipeline is very simple, consisting of a single Source (the PostgreSQL JDBC connector) and a single Sink (the BigQuery connector). No complex transformations are applied during transit.
- Schedule: When setting up new pipelines, use the standard daily trigger scheduled for 1:00 AM UTC to maintain consistency.
- Access: You need the Cloud Data Fusion Developer role in the telgea-data GCP project to edit Data Fusion pipelines.
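For orientation, each per-table pipeline boils down to one source, one sink, and the shared 1:00 AM UTC schedule. The snippet below is an illustrative Python descriptor of that shape, not Data Fusion's actual pipeline export format; the host, database, and dataset names are placeholders:

```python
def pipeline_config(table: str) -> dict:
    """Build a minimal source->sink descriptor for one replicated table.

    Illustrative only: field names and values (host, database, dataset)
    are hypothetical placeholders, not Data Fusion's real config schema.
    """
    return {
        "name": f"replicate_{table}",
        "schedule": "0 1 * * *",  # cron for daily at 1:00 AM UTC
        "source": {
            "plugin": "PostgreSQL JDBC",
            # Standard PostgreSQL JDBC URL shape; host and database
            # name are placeholders.
            "connection": "jdbc:postgresql://db.internal.example:5432/telgea",
            "table": table,
        },
        "sink": {
            "plugin": "BigQuery",
            "dataset": "production",  # hypothetical dataset name
            "table": table,
        },
    }

cfg = pipeline_config("users")
print(cfg["name"], cfg["schedule"])  # replicate_users 0 1 * * *
```

Keeping one descriptor per table mirrors the one-pipeline-per-table convention above: a failed load for one table never blocks the others, and adding a new table is a copy of a known-good shape with only the table name changed.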
