Data Enrichment Flow
How the MXP Catalog Enrichment pipeline discovers and fills missing product attributes using Gemini AI on GCP
The Catalog Enrichment pipeline is a GCP-native, fully automated system that detects and fills gaps in product attribute data — missing colors, materials, descriptions, and other fields that degrade search ranking and recommendation quality.
The pipeline consumes the product snapshot already produced by the GD Indexing Pipeline, so no separate catalog export is needed. Each tenant has its own isolated pipeline instance. The pipeline runs daily via Cloud Scheduler and is orchestrated end-to-end by Google Cloud Workflows.
Overview
Products flow from your PIM or commerce platform into the product snapshot produced by the GD Indexing Pipeline. The Catalog Enrichment pipeline then fills missing attributes using Gemini AI and imports the results into Vertex AI Retail, where they feed search and recommendations. Every AI-generated value carries a confidence score, and merchandisers can revert any change at any time.
Pipeline stages
Step 1 — Catalog Analytics
Reads the product snapshot and produces attribute coverage statistics: which fields are sparse, how many products are missing each attribute, and the top-N most common values per field. These statistics surface configuration gaps and improve enrichment quality over time.
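The coverage computation can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual implementation: `attribute_coverage` and its inputs are hypothetical names, and the real stage reads the snapshot from BigQuery rather than an in-memory list.

```python
from collections import Counter

def attribute_coverage(products, fields, top_n=3):
    """Per-field coverage stats over a product snapshot (illustrative).

    For each field, reports how many products are missing it and the
    top-N most common values among products that do have it.
    """
    stats = {}
    total = len(products)
    for field in fields:
        values = [p[field] for p in products if p.get(field)]
        counts = Counter(values)
        stats[field] = {
            "missing": total - len(values),          # products lacking the field
            "top_values": counts.most_common(top_n), # most frequent known values
        }
    return stats
```

The "missing" counts identify which attributes are worth enriching; the top values can seed prompts or validation rules for later runs.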
Step 2 — Gemini Enrichment Processor
The core enrichment stage. For each product in the snapshot:
- Reads the product's existing attributes and the per-tenant enrichment rules from Cloud Storage
- Checks Cloud SQL for any previously stored enrichment results (avoids re-enriching unchanged products)
- Calls Gemini AI in parallel batches to fill in missing attribute values and replace incorrect ones
- Applies a two-phase generate → validate pattern to catch hallucinations and assign a confidence level
- Writes only changed products to the updates table — incremental by design
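The two-phase generate → validate pattern can be sketched as follows. This is a simplified model of the flow, with the Gemini calls stubbed out as plain callables; the function and rule names are hypothetical, and the real processor batches model calls and persists results to Cloud SQL.

```python
def enrich_product(product, rules, generate, validate):
    """Two-phase generate -> validate enrichment for one product (sketch).

    `generate(product, field)` proposes a value for a missing field;
    `validate(product, field, value)` independently re-checks the
    proposal and returns a confidence in [0, 1]. Only proposals that
    clear the configured confidence threshold are kept.
    """
    changes = {}
    for field in rules["required_fields"]:
        if product.get(field):
            continue  # attribute already present; nothing to enrich
        proposed = generate(product, field)
        confidence = validate(product, field, proposed)
        if confidence >= rules.get("min_confidence", 0.7):
            changes[field] = {"value": proposed, "confidence": confidence}
    return changes  # empty dict => unchanged product, nothing written
```

Running validation as a separate pass against the original product data is what catches hallucinated values: a proposal the validator cannot confirm from the product's own evidence gets a low confidence and is dropped rather than written to the updates table.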
Step 3 — Product Importer
Reads the updates table and performs an incremental import back into Vertex AI Retail, updating only the products that changed. The enriched attributes become immediately available for search and recommendations.
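The incremental import amounts to batching the changed products and sending each batch to the catalog. A minimal sketch, with the Vertex AI Retail import call stood in by an injected `import_batch` callable (the real client library handles request formatting and retries):

```python
def import_updates(updated_products, import_batch, batch_size=100):
    """Send only changed products to the catalog, in fixed-size batches.

    `import_batch(batch)` is a placeholder for the Vertex AI Retail
    product import call; batch_size here is illustrative, not the
    API's actual limit.
    """
    batches_sent = 0
    for start in range(0, len(updated_products), batch_size):
        import_batch(updated_products[start:start + batch_size])
        batches_sent += 1
    return batches_sent
```

Because only rows from the updates table are imported, a run that enriches a handful of products touches only those products in the index, keeping daily runs cheap even for large catalogs.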
GCP services
| Service | Role |
|---|---|
| Google Cloud Workflows | Sequences and monitors all three pipeline stages end-to-end; one instance per tenant |
| Cloud Scheduler | Triggers the workflow on a daily cron (default: 02:00 UTC) |
| Google Kubernetes Engine | Runs each pipeline stage as a short-lived Kubernetes Job; also hosts the long-running Catalog Enrichment API (attribute configuration, review UI backend) |
| Vertex AI Retail API | Destination for enriched product imports |
| Gemini AI | Generates AI attribute values per product |
| BigQuery | Stores product snapshots (from GD Indexing Pipeline), enrichment updates, and analytics results |
| Cloud Storage | Hosts per-tenant enrichment configuration and rule definitions |
| Cloud SQL (PostgreSQL) | Persists enrichment results and per-product enrichment history |
| Artifact Registry | Stores Docker images for the Catalog Analytics and Gemini Enrichment jobs |
Per-tenant isolation
Each tenant has its own Cloud Workflow instance, its own enrichment configuration in Cloud Storage, and its own set of BigQuery tables and Cloud SQL tables. Running or updating one tenant's pipeline has no impact on other tenants.
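One way to picture the isolation model is as a per-tenant resource-naming scheme, sketched below. All of these names are hypothetical examples for illustration; the actual resource names and layout are deployment-specific.

```python
def tenant_resources(tenant_id: str) -> dict:
    """Illustrative per-tenant resource names (all hypothetical).

    Each tenant gets its own workflow, config prefix, and tables, so
    running or updating one tenant's pipeline never touches another's.
    """
    return {
        "workflow": f"catalog-enrichment-{tenant_id}",            # Cloud Workflows instance
        "config_prefix": f"gs://enrichment-config/{tenant_id}/",  # Cloud Storage rules
        "bq_updates_table": f"{tenant_id}_enrichment_updates",    # BigQuery updates table
        "sql_schema": tenant_id,                                  # Cloud SQL schema
    }
```

Deriving every resource name from the tenant ID makes cross-tenant interference structurally impossible: no stage ever holds a reference to another tenant's tables or configuration.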