Data Stack
Telgea's data pipeline moves information from our production databases into a centralized data warehouse, enabling analytics and business intelligence reporting. Our analytics stack consists of Google BigQuery as the data warehouse and Metabase as the business intelligence and visualization tool. Metabase connects directly to BigQuery to query the data and build dashboards. Using a data warehouse (a central repository optimized for business analytics) gives us a clean, organized archive of all of our data, separate from our live production databases. This separation protects our live services: running large analytical queries on a production database could cause slowdowns or crashes, directly impacting customers. Additionally, data warehouses are built for scalability and performance, answering complex questions across years of historical data in seconds. The warehouse also acts as a single source of truth by unifying data from different systems, ensuring consistent reporting.
Available Data
Data snapshots are captured from our source databases and loaded into BigQuery daily at 1:00 AM UTC. The following table shows the current status of key data sources.

| Data Source | Available in BigQuery | Notes |
|---|---|---|
| Production Database Tables | ✅ Yes | All core application tables. |
| Call Detail Records (CDR) | ❌ No | On the roadmap. |
| Number Inventory | ❌ No | On the roadmap. |
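Because loads land once per day at 1:00 AM UTC, a common question when reading a dashboard is how fresh the underlying data is. The sketch below (plain Python, no real table names involved) computes the timestamp of the most recent completed load for any given moment:

```python
from datetime import datetime, timedelta, timezone

def latest_snapshot_cutoff(now: datetime) -> datetime:
    """Return the timestamp of the most recent daily load (1:00 AM UTC).

    Snapshots land at 01:00 UTC, so a query run before that time on a
    given day still reflects the previous day's load.
    """
    cutoff = now.replace(hour=1, minute=0, second=0, microsecond=0)
    if now < cutoff:
        cutoff -= timedelta(days=1)
    return cutoff

# Example: at 00:30 UTC on May 2nd, the freshest data is from May 1st's load.
now = datetime(2024, 5, 2, 0, 30, tzinfo=timezone.utc)
print(latest_snapshot_cutoff(now).isoformat())  # 2024-05-01T01:00:00+00:00
```

In short: anything in BigQuery reflects the state of the source databases as of the most recent 1:00 AM UTC load, not real time.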
Technical Implementation
This section details the technical architecture and data flow from the source databases to the data warehouse.
High-Level Data Flow
The data travels from our Azure-hosted production databases to Google BigQuery through a secure, multi-cloud network configuration. Flow: Azure PostgreSQL -> Azure Private Endpoint -> Azure Virtual Network (VNet) -> Azure VPN Gateway -> IPsec Tunnel -> Google Cloud VPN -> Google Cloud VPC -> Google Cloud Data Fusion -> Google BigQuery -> Metabase
Networking Details
The core of the connection is a secure site-to-site VPN tunnel between our Azure and Google Cloud environments.
- Virtual Private Clouds (VPCs) in both clouds are the key enablers.
- In Azure, the Virtual Network (VNet) securely connects the VPN Gateway to our PostgreSQL servers via a Private Endpoint. This ensures the database isn't exposed to the public internet.
- In Google Cloud, Data Fusion runs its data processing workers inside our VPC. This allows them to access the Azure PostgreSQL source databases directly and securely over the VPN tunnel.
For setup details, see the official guides:
- Azure: Tutorial: Create a site-to-site VPN connection in the Azure portal
- Google Cloud: Creating an HA VPN gateway to a Peer VPN gateway
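When bringing up or debugging the tunnel, a quick sanity check is whether a machine on the Google Cloud side can open a TCP connection to the database's private endpoint over the VPN. This is a generic sketch using only Python's standard library; the IP address and port below are placeholders, not our actual endpoint:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# "10.0.1.4" is a placeholder private-endpoint IP; 5432 is PostgreSQL's
# default port. Run this from a host inside the Google Cloud VPC.
# print(can_reach("10.0.1.4", 5432))
```

A successful connection confirms routing through the tunnel and the Private Endpoint; it does not verify database credentials, which are handled separately in the Data Fusion connection settings.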
Google Cloud Data Fusion Configuration
We use Google Cloud Data Fusion as our ETL (Extract, Transform, Load) tool to move the data.
- Pipeline Structure: Each table replicated from a source database has its own dedicated pipeline in Data Fusion. This isolates failures and simplifies management.
- Components: Each pipeline is very simple, consisting of a single Source (the PostgreSQL JDBC connector) and a single Sink (the BigQuery connector). No complex transformations are applied during transit.
- Schedule: When setting up new pipelines, use the standard daily trigger scheduled for 1:00 AM UTC to maintain consistency.
- Access: You need the Cloud Data Fusion Developer role in the telgea-data GCP project to edit Data Fusion pipelines.
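For orientation, each per-table pipeline boils down to one source, one sink, and the shared 1:00 AM UTC schedule. The snippet below is an illustrative Python descriptor of that shape, not Data Fusion's actual pipeline export format; the host, database, and dataset names are placeholders:

```python
def pipeline_config(table: str) -> dict:
    """Build a minimal source->sink descriptor for one replicated table.

    Illustrative only: field names and values (host, database, dataset)
    are hypothetical placeholders, not Data Fusion's real config schema.
    """
    return {
        "name": f"replicate_{table}",
        "schedule": "0 1 * * *",  # cron for daily at 1:00 AM UTC
        "source": {
            "plugin": "PostgreSQL JDBC",
            # Standard PostgreSQL JDBC URL shape; host and database
            # name are placeholders.
            "connection": "jdbc:postgresql://db.internal.example:5432/telgea",
            "table": table,
        },
        "sink": {
            "plugin": "BigQuery",
            "dataset": "production",  # hypothetical dataset name
            "table": table,
        },
    }

cfg = pipeline_config("users")
print(cfg["name"], cfg["schedule"])  # replicate_users 0 1 * * *
```

Keeping one descriptor per table mirrors the one-pipeline-per-table convention above: a failed load for one table never blocks the others, and adding a new table is a copy of a known-good shape with only the table name changed.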
