MXP Platform

Data Enrichment Flow

How the MXP Catalog Enrichment pipeline discovers and fills missing product attributes using Gemini AI on GCP

The Catalog Enrichment pipeline is a GCP-native, fully automated system that detects and fills gaps in product attribute data — missing colors, materials, descriptions, and other fields that degrade search ranking and recommendation quality.

The pipeline consumes the product snapshot already produced by the GD Indexing Pipeline, so no separate catalog export is needed. Each tenant has its own isolated pipeline instance. The pipeline runs daily via Cloud Scheduler and is orchestrated end-to-end by Google Cloud Workflows.

Overview

Products flow from your PIM or commerce platform through the GD Indexing Pipeline, which enriches attributes using Gemini AI and pushes the results directly into the search index. Every AI-generated value carries a confidence score — and merchandisers can revert any change at any time.

[Diagram: Enrichment Flow. Elements shown: Merchandiser (① Make changes), PIM / Commerce Tools, GD Indexing Pipeline, Catalog Enrichments, Search Index]

Pipeline stages

[Figure: Data enrichment flow diagram]

Step 1 — Catalog Analytics

Reads the product snapshot and produces attribute coverage statistics: which fields are sparse, how many products are missing each attribute, and the top-N most common values per field. These statistics surface configuration gaps and improve enrichment quality over time.
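As a rough illustration of the statistics described above, the sketch below computes a missing count, a coverage ratio, and the top-N values per field over an in-memory list of products. The function name `attribute_coverage`, the dict-based product shape, and the rule that `None` and blank strings count as missing are assumptions made for this example; the actual job reads the snapshot from BigQuery and may define "missing" differently.

```python
from collections import Counter
from typing import Any


def is_missing(value: Any) -> bool:
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, str) and not value.strip())


def attribute_coverage(
    products: list[dict[str, Any]], fields: list[str], top_n: int = 3
) -> dict[str, dict[str, Any]]:
    """Return per-field missing count, coverage ratio, and top-N values."""
    total = len(products)
    stats: dict[str, dict[str, Any]] = {}
    for field in fields:
        present = [p[field] for p in products if not is_missing(p.get(field))]
        stats[field] = {
            "missing": total - len(present),
            "coverage": len(present) / total if total else 0.0,
            "top_values": Counter(present).most_common(top_n),
        }
    return stats


# Hypothetical sample data, not taken from any real catalog.
sample = [
    {"id": "1", "color": "red", "material": "cotton"},
    {"id": "2", "color": "red", "material": ""},
    {"id": "3", "color": "blue"},
    {"id": "4", "color": None, "material": "wool"},
]
coverage = attribute_coverage(sample, ["color", "material"], top_n=2)
```

Fields with low coverage are the natural candidates to add to a tenant's enrichment rules.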

Step 2 — Gemini Enrichment Processor

The core enrichment stage. For each product in the snapshot:

  1. Reads the product's existing attributes and the per-tenant enrichment rules from Cloud Storage
  2. Checks Cloud SQL for any previously stored enrichment results (avoids re-enriching unchanged products)
  3. Calls Gemini AI in parallel batches to fill in missing attribute values and propose corrections for incorrect ones
  4. Applies a two-phase generate → validate pattern to catch hallucinations and assign a confidence level
  5. Writes only changed products to the updates table — incremental by design
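The steps above can be sketched as a small loop: skip products whose fingerprint matches a stored one, split the rest into batches processed in parallel, run a generate call followed by a validate call, and emit updates only for products that gained an accepted value. Everything named here is hypothetical. `fake_generate` and `fake_validate` stand in for the two Gemini calls, the `seen_fingerprints` dict stands in for the Cloud SQL lookup, and the sketch handles only missing values, not corrections of incorrect ones.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

Product = dict[str, str]
Generate = Callable[[Product, list[str]], dict[str, str]]
Validate = Callable[[Product, dict[str, str]], dict[str, Optional[str]]]


def fingerprint(product: Product) -> str:
    """Stable hash of a product's attributes, used to detect unchanged products."""
    return hashlib.sha256(json.dumps(product, sort_keys=True).encode()).hexdigest()


def enrich_batch(batch: list[Product], fields: list[str],
                 generate: Generate, validate: Validate) -> list[dict]:
    updates = []
    for product in batch:
        missing = [f for f in fields if not product.get(f)]
        if not missing:
            continue
        proposed = generate(product, missing)   # phase 1: propose values
        verdicts = validate(product, proposed)  # phase 2: confidence, or None to reject
        accepted = {f: {"value": proposed[f], "confidence": c}
                    for f, c in verdicts.items() if c is not None}
        if accepted:  # only changed products produce an update row
            updates.append({"id": product["id"], "attributes": accepted})
    return updates


def run_enrichment(products: list[Product], seen_fingerprints: dict[str, str],
                   fields: list[str], generate: Generate, validate: Validate,
                   batch_size: int = 2, max_workers: int = 4) -> list[dict]:
    # Skip products already enriched in their current state.
    todo = [p for p in products if seen_fingerprints.get(p["id"]) != fingerprint(p)]
    batches = [todo[i:i + batch_size] for i in range(0, len(todo), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda b: enrich_batch(b, fields, generate, validate), batches)
        return [update for batch_updates in results for update in batch_updates]


def fake_generate(product: Product, missing: list[str]) -> dict[str, str]:
    """Stand-in for the generation call: guesses values from the title."""
    words = product.get("title", "").lower().split()
    guesses = {
        "color": next((w for w in words if w in {"red", "blue"}), "beige"),
        "material": next((w for w in words if w in {"cotton", "wool"}), "plastic"),
    }
    return {f: guesses[f] for f in missing if f in guesses}


def fake_validate(product: Product, proposed: dict[str, str]) -> dict[str, Optional[str]]:
    """Stand-in for the validation call: accepts a value only if the title supports it."""
    title = product.get("title", "").lower()
    return {f: ("high" if value in title else None) for f, value in proposed.items()}
```

Keeping generation and validation as separate callables means a value the first call invents without support (here, the fallback guesses `beige` and `plastic`) can be rejected by the second call before it reaches the updates table.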

Step 3 — Product Importer

Reads the updates table and performs an incremental import back into Vertex AI Retail, updating only the products that changed. The enriched attributes become immediately available for search and recommendations.
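For orientation, an incremental import of this kind could be expressed as a request body for the Retail API's product import method. The sketch below only builds that body as a plain dict and sends nothing. The field names (`inputConfig`, `bigQuerySource`, `reconciliationMode`, `updateMask`) follow the public REST reference as best understood here and should be checked against the current documentation. The project, dataset, and table names are placeholders, and whether the Product Importer uses this exact request is an assumption.

```python
from typing import Optional


def build_import_request(
    project_id: str,
    dataset_id: str,
    table_id: str,
    update_mask: Optional[list[str]] = None,
) -> dict:
    """Build a products.import request body that reads from a BigQuery table."""
    body: dict = {
        "inputConfig": {
            "bigQuerySource": {
                "projectId": project_id,
                "datasetId": dataset_id,
                "tableId": table_id,
                "dataSchema": "product",
            }
        },
        # INCREMENTAL upserts the listed products and leaves all others untouched.
        "reconciliationMode": "INCREMENTAL",
    }
    if update_mask:
        # A field mask is serialized in JSON as a comma-separated string.
        body["updateMask"] = ",".join(update_mask)
    return body


# Placeholder identifiers for illustration only.
request_body = build_import_request(
    "example-project", "enrichment_acme", "updates", update_mask=["attributes", "description"]
)
```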

GCP services

| Service | Role |
| --- | --- |
| Google Cloud Workflows | Sequences and monitors all three pipeline stages end-to-end; one instance per tenant |
| Cloud Scheduler | Triggers the workflow on a daily cron (default: 02:00 UTC) |
| Google Kubernetes Engine | Runs each pipeline stage as a short-lived Kubernetes Job; also hosts the long-running Catalog Enrichment API (attribute configuration, review UI backend) |
| Vertex AI Retail API | Destination for enriched product imports |
| Gemini AI | Generates AI attribute values per product |
| BigQuery | Stores product snapshots (from GD Indexing Pipeline), enrichment updates, and analytics results |
| Cloud Storage | Hosts per-tenant enrichment configuration and rule definitions |
| Cloud SQL (PostgreSQL) | Persists enrichment results and per-product enrichment history |
| Artifact Registry | Stores Docker images for the Catalog Analytics and Gemini Enrichment jobs |
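The sequencing role played by Cloud Workflows can be pictured with a short sketch: run the three stages in order and report which stage, if any, failed. This is plain Python rather than Workflows syntax. The stage identifiers and the `run_stage` callable (a stand-in for launching a Kubernetes Job and waiting for it) are hypothetical, and stopping at the first failed stage is an assumption the text does not state.

```python
from typing import Callable

# Hypothetical stage identifiers matching the three documented steps.
STAGES = ["catalog-analytics", "gemini-enrichment", "product-importer"]


def run_pipeline(tenant: str, run_stage: Callable[[str, str], bool]) -> dict:
    """Run the stages in order for one tenant, stopping at the first failure."""
    completed: list[str] = []
    for stage in STAGES:
        if not run_stage(tenant, stage):
            return {"tenant": tenant, "status": "FAILED",
                    "failed_stage": stage, "completed": completed}
        completed.append(stage)
    return {"tenant": tenant, "status": "SUCCEEDED",
            "failed_stage": None, "completed": completed}


# Example with a stub runner in which the enrichment stage fails.
result = run_pipeline("acme", lambda tenant, stage: stage != "gemini-enrichment")
```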

Per-tenant isolation

Each tenant has its own Cloud Workflow instance, its own enrichment configuration in Cloud Storage, and its own set of BigQuery tables and Cloud SQL tables. Running or updating one tenant's pipeline has no impact on other tenants.
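One way to picture this isolation is a function that derives every resource name from the tenant identifier, so that no two tenants share a workflow, configuration path, or table prefix. All naming patterns below (the bucket name, the prefixes, the `rules.yaml` file name) are invented for illustration and are not the platform's actual conventions. The schedule `0 2 * * *` is the standard cron form of the documented daily 02:00 UTC default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantResources:
    workflow: str
    config_uri: str
    bigquery_table_prefix: str
    sql_table_prefix: str
    schedule: str
    time_zone: str


def resources_for(tenant: str, schedule: str = "0 2 * * *") -> TenantResources:
    """Derive per-tenant resource names (hypothetical naming scheme)."""
    name = tenant.lower()
    slug = name.replace("-", "_")  # SQL-friendly table identifiers
    return TenantResources(
        workflow=f"catalog-enrichment-{name}",
        config_uri=f"gs://example-enrichment-config/{name}/rules.yaml",
        bigquery_table_prefix=f"{slug}_enrichment_",
        sql_table_prefix=f"{slug}_",
        schedule=schedule,
        time_zone="Etc/UTC",
    )


acme = resources_for("acme")
globex = resources_for("globex-eu")
```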