Syncing STAC Catalogs with pystac-client

The SpatioTemporal Asset Catalog (STAC) specification has become the de facto standard for organizing, discovering, and accessing geospatial raster and vector assets. For data engineers and GIS analysts building automated ingestion pipelines, manually traversing STAC endpoints is inefficient and error-prone. Syncing STAC catalogs with pystac-client provides a programmatic, Python-native interface to query, paginate, validate, and extract metadata and assets at scale. This workflow integrates directly into modern ETL architectures, enabling reproducible data ingestion that scales from local development to cloud-native orchestration.

If you are building foundational ingestion routines, this approach complements broader strategies outlined in Mastering Geospatial Data Ingestion in Python, where standardized parsing and schema validation form the backbone of reliable pipelines. By treating STAC endpoints as structured APIs rather than static file directories, teams can implement deterministic sync jobs that respect bandwidth limits, handle transient network failures, and maintain strict data lineage.

Prerequisites & Environment Configuration

Before implementing catalog synchronization, ensure your environment meets the following baseline requirements:

  • Python 3.9+: Required for the built-in generic type hints used in the examples (tuple[float, ...], list[str]) and for compatibility with contemporary HTTP clients.
  • Core Dependencies: pystac-client>=0.7.0, requests>=2.31.0, pydantic>=2.0 (for optional schema validation), and tenacity (for resilient retry logic).
  • Network Access: Unrestricted outbound HTTPS to STAC API endpoints (e.g., https://earth-search.aws.element84.com/v1, https://planetarycomputer.microsoft.com/api/stac/v1).
  • Storage Target: Local filesystem path, S3 bucket, or cloud storage mount with write permissions and appropriate IAM credentials.

Install the stack in an isolated environment:

pip install pystac-client requests tenacity pydantic

Familiarity with the STAC specification is recommended, particularly the Item, Collection, and Catalog JSON structures. Understanding how STAC separates metadata from binary assets is critical when designing sync routines that only pull what is necessary. The official pystac-client documentation provides comprehensive API references that align closely with the implementation patterns below.

Production Sync Architecture

A production-grade STAC sync follows a deterministic sequence. Each phase isolates a specific failure domain, making the pipeline easier to monitor, test, and scale.

1. Client Initialization & Conformance Validation

STAC APIs are not uniform. Some endpoints implement the full ItemSearch and Collections conformance classes, while others expose only legacy catalog browsing. Check the conformance classes the API advertises (the conformsTo array on its landing page, mirrored at /conformance) immediately after opening the client, so that unsupported endpoints fail at startup rather than midway through a sync.

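from pystac_client import Client

def initialize_client(api_url: str) -> Client:
    """Open a STAC API and fail fast if it lacks the classes this sync relies on."""
    client = Client.open(api_url)
    # Exact v1.0.0 URIs; an endpoint advertising another version will not match.
    required_conformance = {
        "https://api.stacspec.org/v1.0.0/core",
        "https://api.stacspec.org/v1.0.0/item-search",
    }
    # conformsTo URIs advertised by the API
    supported = set(client.get_conforms_to())
    missing = required_conformance - supported
    if missing:
        raise RuntimeError(
            f"Endpoint {api_url} lacks required STAC conformance classes: {sorted(missing)}"
        )
    return client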
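
Exact URI comparison rejects endpoints that advertise the same class under a different version string, such as a release candidate. The following is a minimal sketch of a version-tolerant check; has_conformance is a hypothetical helper, not part of pystac-client, and it assumes the usual https://api.stacspec.org/<version>/<class> URI layout.

```python
import re

def has_conformance(conforms_to: list[str], class_name: str) -> bool:
    """Return True if any advertised URI names the given STAC API class, at any version."""
    pattern = re.compile(
        r"^https://api\.stacspec\.org/v[^/]+/" + re.escape(class_name) + r"$"
    )
    return any(pattern.match(uri) for uri in conforms_to)

# Example list shaped like a landing page "conformsTo" array (values are illustrative)
advertised = [
    "https://api.stacspec.org/v1.0.0-rc.2/core",
    "https://api.stacspec.org/v1.0.0-rc.2/item-search",
]
missing = [
    name for name in ("core", "item-search", "collections")
    if not has_conformance(advertised, name)
]
print(missing)  # → ['collections']
```

In a pipeline, the list returned by client.get_conforms_to() could be passed in place of the illustrative advertised list.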

2. Query Construction & Spatial-Temporal Filtering

STAC queries combine RFC 3339 timestamps, WGS84 bounding boxes, and property filters expressed through either the Query extension (used in the function below) or CQL2. Constructing precise queries minimizes payload size and reduces sync latency. Always define explicit temporal windows and spatial extents; open-ended queries can return very large result sets and are more likely to hit API rate limits.

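from datetime import datetime, timezone

def _to_utc(dt: datetime) -> str:
    # RFC 3339 requires an explicit offset; naive datetimes are assumed to be UTC
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()

def build_query_params(
    bbox: tuple[float, float, float, float],
    start_dt: datetime,
    end_dt: datetime,
    collections: list[str],
    max_cloud_cover: float = 20.0
) -> dict:
    return {
        "bbox": bbox,  # (west, south, east, north) in WGS84 degrees
        "datetime": f"{_to_utc(start_dt)}/{_to_utc(end_dt)}",
        "collections": collections,
        # Query extension syntax; the endpoint must support the query extension
        "query": {"eo:cloud_cover": {"lt": max_cloud_cover}},
        "limit": 100  # Page size hint; providers may cap or ignore it
    }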
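
The "query" key above uses Query extension syntax. On endpoints that implement the Filter extension instead, the same constraint can be written as a CQL2-JSON expression and passed through the filter and filter_lang arguments of client.search. A minimal sketch, assuming the endpoint supports CQL2-JSON; build_cql2_filter is a hypothetical helper and the platform value is illustrative.

```python
from typing import Optional

def build_cql2_filter(max_cloud_cover: float, platform: Optional[str] = None) -> dict:
    """Build a CQL2-JSON expression: cloud cover below a threshold, optionally one platform."""
    cloud = {"op": "<", "args": [{"property": "eo:cloud_cover"}, max_cloud_cover]}
    if platform is None:
        return cloud
    same_platform = {"op": "=", "args": [{"property": "platform"}, platform]}
    return {"op": "and", "args": [cloud, same_platform]}

# These two keys would take the place of "query" in the search parameters
filter_params = {
    "filter": build_cql2_filter(20.0, platform="sentinel-2a"),
    "filter_lang": "cql2-json",
}
print(filter_params["filter"]["op"])  # → and
```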

3. Paginated Retrieval & Cursor Management

STAC APIs return results in discrete pages. The sync routine must iterate through pages, track cursors, and handle API-specific pagination drift. For deeper implementation details, review Best practices for STAC catalog pagination in Python.
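
To make the cursor mechanics concrete, here is a minimal sketch of what page traversal involves: each search response carries a link with rel="next" until the result set is exhausted. The fetch callable and the example.com URLs are placeholders, and the sketch covers GET-style links only; it is an illustration, not pystac-client's internal implementation.

```python
from typing import Callable, Iterator, Optional

def next_href(page: dict) -> Optional[str]:
    """Return the href of the rel="next" link in a search response, if any."""
    for link in page.get("links", []):
        if link.get("rel") == "next":
            return link.get("href")
    return None

def iter_features(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every feature across pages, following "next" links until none is left."""
    seen: set[str] = set()
    url: Optional[str] = first_url
    while url is not None and url not in seen:  # stop if the API loops back on itself
        seen.add(url)
        page = fetch(url)
        yield from page.get("features", [])
        url = next_href(page)

# Two fake pages standing in for HTTP responses
fake_api = {
    "https://example.com/search": {
        "features": [{"id": "a"}, {"id": "b"}],
        "links": [{"rel": "next", "href": "https://example.com/search?token=2"}],
    },
    "https://example.com/search?token=2": {"features": [{"id": "c"}], "links": []},
}
ids = [f["id"] for f in iter_features("https://example.com/search", fake_api.__getitem__)]
print(ids)  # → ['a', 'b', 'c']
```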

Relying on the generators exposed by pystac-client’s ItemSearch, such as .items() and .items_as_dicts(), abstracts away manual cursor parsing, but you should still implement explicit backoff strategies when encountering 429 Too Many Requests responses. Pagination should never block the main thread in async environments; consider wrapping sync calls in concurrent.futures.ThreadPoolExecutor for parallelized catalog traversal.
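
One way to parallelize traversal is to split the temporal window into slices and run one blocking search per slice in a worker thread. The sketch below assumes that approach; split_window, parallel_search, and fake_search are hypothetical names, and fake_search stands in for a real search call.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone
from typing import Callable, Iterable

Window = tuple[datetime, datetime]

def split_window(start: datetime, end: datetime, parts: int) -> list[Window]:
    """Split [start, end] into `parts` contiguous windows of equal length."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    step = (end - start) / parts
    edges = [start + step * i for i in range(parts)] + [end]
    return list(zip(edges[:-1], edges[1:]))

def parallel_search(
    windows: Iterable[Window],
    search_window: Callable[[Window], Iterable[dict]],
    max_workers: int = 4,
) -> list[dict]:
    """Run one blocking search per window in worker threads, merging results by item id."""
    merged: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in window order
        for items in pool.map(search_window, windows):
            for item in items:
                # Adjacent windows share an edge, so the same item can appear twice
                merged[item["id"]] = item
    return list(merged.values())

def fake_search(window: Window) -> list[dict]:
    # Stand-in for a real search; it should return a materialized list so that
    # paging happens inside the worker thread rather than in the caller.
    return [{"id": window[0].date().isoformat()}, {"id": "on-the-edge"}]

windows = split_window(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 5, tzinfo=timezone.utc),
    parts=4,
)
results = parallel_search(windows, fake_search)
print(len(windows), len(results))  # → 4 5
```

Because the thread safety of a shared HTTP session is not guaranteed, a cautious design opens a separate client inside each worker rather than sharing one across threads.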

4. Metadata Validation & Asset Resolution

Parsing returned Item objects requires strict validation. STAC providers occasionally return malformed assets dictionaries or missing geometry fields. Implementing a lightweight Pydantic model catches structural drift before assets are downloaded. Once validated, resolve asset URLs to their final storage locations (S3, Azure Blob, GCS, or HTTPS) and apply idempotent sync logic to avoid redundant downloads.
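
The idempotent part of this step can be sketched as a planning pass that lists only the assets still missing from the target directory, tagged with their storage scheme so each can be routed to a suitable downloader. plan_downloads is a hypothetical helper that works on the raw item dictionary; the naming convention for destination files is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse

def plan_downloads(item: dict, target_dir: Path) -> list[dict]:
    """List assets of `item` that are not yet present in `target_dir`."""
    plan = []
    for key, asset in item.get("assets", {}).items():
        parsed = urlparse(asset.get("href", ""))
        # Keep the source extension (.tif, .jpg, .xml, ...) in the local name
        suffix = Path(parsed.path).suffix
        dest = target_dir / f"{item['id']}_{key}{suffix}"
        if dest.exists():
            continue  # already synced on a previous run
        plan.append({"key": key, "href": asset["href"], "scheme": parsed.scheme, "dest": dest})
    return plan
```

Running the plan twice against an unchanged directory yields the same list, and running it after a completed sync yields an empty one, which is the property a repeatable sync job needs.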

Implementation Blueprint

The following script demonstrates a resilient, production-ready sync routine. It integrates retry logic, schema validation, and local asset persistence.

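import json
import logging
from pathlib import Path
from typing import Iterator
from urllib.parse import urlparse

import requests
from pydantic import BaseModel, ValidationError
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from pystac_client import Client
from pystac import Item

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

class STACItemValidator(BaseModel):
    """Minimal structural check; a null geometry or missing collection fails validation."""
    id: str
    geometry: dict
    properties: dict
    assets: dict
    collection: str

def fetch_and_validate_items(client: Client, params: dict) -> Iterator[Item]:
    # Pages are fetched lazily during iteration, so a retry decorator on this
    # generator function would never see a network error. Search failures
    # propagate to the caller.
    search = client.search(**params)
    for item in search.items_as_dicts():
        try:
            STACItemValidator(**item)
            yield Item.from_dict(item)
        except ValidationError as e:
            logging.warning(f"Skipping malformed item {item.get('id')}: {e}")

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type((
        requests.exceptions.HTTPError,
        requests.exceptions.ConnectionError,
        requests.exceptions.Timeout,
    )),
    reraise=True,
)
def download_asset(href: str, dest_path: Path) -> None:
    # Write to a temporary file first so an interrupted transfer is never
    # mistaken for a completed asset on the next run.
    tmp_path = dest_path.with_name(dest_path.name + ".part")
    with requests.get(href, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(tmp_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    tmp_path.replace(dest_path)

def sync_assets_to_disk(item: Item, target_dir: Path) -> None:
    target_dir.mkdir(parents=True, exist_ok=True)
    for asset_key, asset in item.assets.items():
        parsed = urlparse(asset.href)
        if parsed.scheme not in ("http", "https"):
            # requests cannot fetch s3://, gs://, or similar hrefs
            logging.warning(f"Skipping {asset_key} for {item.id}: unsupported href {asset.href}")
            continue
        # Keep the source file extension instead of assuming every asset is a GeoTIFF
        suffix = Path(parsed.path).suffix
        dest_path = target_dir / f"{item.id}_{asset_key}{suffix}"
        if dest_path.exists():
            logging.debug(f"Asset already exists: {dest_path}")
            continue

        logging.info(f"Downloading {asset_key} for {item.id}")
        download_asset(asset.href, dest_path)

    # Persist metadata alongside assets
    meta_path = target_dir / f"{item.id}_metadata.json"
    with open(meta_path, "w") as f:
        json.dump(item.to_dict(), f, indent=2)

def run_stac_sync(api_url: str, query_params: dict, output_dir: str) -> None:
    # initialize_client is defined in step 1 above
    client = initialize_client(api_url)
    output_path = Path(output_dir)

    for item in fetch_and_validate_items(client, query_params):
        # pystac exposes the parent collection as Item.collection_id
        sync_assets_to_disk(item, output_path / item.collection_id)

    logging.info("Sync completed successfully.")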

This blueprint wraps network calls in tenacity retry logic, validates structural integrity before disk I/O, and stores metadata separately from binary payloads. It exposes a single entry point, run_stac_sync, that a scheduler such as cron, Airflow, or AWS Step Functions can invoke.
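
For scheduled runs, one common pattern is to record the end of the last successful window and resume from there, so that each run queries only new acquisitions. The following is a minimal sketch of that idea using a JSON state file; next_window, commit_window, and the file layout are hypothetical and not part of the blueprint above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def next_window(
    state_file: Path, now: datetime, default_start: datetime
) -> tuple[datetime, datetime]:
    """Return the (start, end) window for this run, resuming from the last recorded end."""
    if state_file.exists():
        state = json.loads(state_file.read_text())
        last_end = datetime.fromisoformat(state["last_end"])
    else:
        last_end = default_start  # first run: start from the configured origin
    return last_end, now

def commit_window(state_file: Path, end: datetime) -> None:
    """Record the end of a synced window; call this only after the run succeeds."""
    state_file.write_text(json.dumps({"last_end": end.isoformat()}))
```

The returned pair can feed the start and end arguments of build_query_params. Note the limitation: filtering on acquisition datetime misses items that a provider publishes late, so a periodic full re-scan or an overlap between windows may still be needed.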

Integration Patterns & Data Pipeline Context

STAC synchronization rarely operates in isolation. Modern geospatial pipelines frequently blend satellite raster data with open-source vector layers and legacy administrative datasets. When orchestrating multi-source ingestion, align your STAC sync cadence with complementary workflows. For example, if your pipeline requires road networks or building footprints alongside Sentinel-2 imagery, coordinate your STAC sync with the vector extraction routines detailed in Fetching OSM Data via Overpass API. This ensures temporal alignment between raster acquisitions and vector baselines.

Similarly, many organizations still rely on fragmented government data portals that lack STAC compliance. Bridging these legacy systems with modern catalog syncs requires a normalization layer. The patterns discussed in Automating Government Portal Downloads demonstrate how to scrape, reformat, and inject non-STAC metadata into your ingestion queue before it reaches the pystac-client validation stage. By treating legacy portals as upstream data sources rather than primary catalogs, teams maintain a clean, queryable STAC index while preserving historical data continuity.

Troubleshooting & Performance Optimization

Handling Rate Limits & API Quotas

Public STAC endpoints often enforce rate limits. Implement exponential backoff (as shown in the tenacity configuration) and cache search results locally when running iterative development cycles. For enterprise deployments, authenticate via API keys or OAuth2 tokens and pass them through the headers parameter in Client.open().
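
When a server includes a Retry-After hint with a 429 response, honoring it is usually better than guessing a delay. The sketch below shows that logic in isolation, using only the standard library; RateLimited and call_with_backoff are hypothetical names, and how a 429 is detected and converted into RateLimited depends on the HTTP layer in use.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

class RateLimited(Exception):
    """Raised by the wrapped call when the server answers 429 Too Many Requests."""
    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the server, if any

def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on RateLimited with the server's hint or exponential delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            sleep(min(delay, max_delay))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the behavior can be exercised in tests without waiting.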

Resolving Cloud-Native Asset URLs

Many STAC providers store assets in cloud storage behind signed URLs or in requester-pays buckets. Ensure your execution environment has the appropriate IAM roles or AWS credentials configured. If assets return 403 Forbidden, check whether the provider expects hrefs to be signed before download (Planetary Computer, for example, requires a SAS token appended to each asset URL), whether the bucket is requester-pays, and whether the credentials in use are allowed to read it.
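
Assets referenced with s3:// hrefs cannot be fetched over plain HTTPS and need to be split into a bucket and key for an S3 client. A minimal sketch of that step follows; split_s3_href is a hypothetical helper, the bucket name is illustrative, and the boto3 call in the comment is shown as a commonly used form rather than something this document's pipeline prescribes.

```python
from urllib.parse import urlparse

def split_s3_href(href: str) -> tuple[str, str]:
    """Split an s3://bucket/key href into (bucket, key)."""
    parsed = urlparse(href)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an s3:// href: {href}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_href("s3://example-bucket/tiles/10/T/ES/B04.tif")
print(bucket, key)  # → example-bucket tiles/10/T/ES/B04.tif

# With boto3 installed and credentials configured, a requester-pays download
# is commonly written as:
#   boto3.client("s3").download_file(
#       bucket, key, "B04.tif", ExtraArgs={"RequestPayer": "requester"}
#   )
```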

Managing Large-Scale Syncs

Syncing entire collections can exhaust local disk space. Implement streaming validation and process items in micro-batches. Consider writing metadata to a lightweight database (SQLite, PostgreSQL, or DuckDB) while streaming assets directly to cloud storage using boto3 or google-cloud-storage SDKs. This decouples metadata indexing from binary transfer, allowing parallelized, fault-tolerant pipelines.
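
As an illustration of the metadata side of this split, the sketch below indexes item dictionaries into SQLite in micro-batches, committing one transaction per batch. The table layout, batched, and index_items are assumptions made for the example; the same shape would apply to PostgreSQL or DuckDB with their own drivers.

```python
import json
import sqlite3
from itertools import islice
from typing import Iterable, Iterator

def batched(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `size` items without loading the whole stream."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def index_items(conn: sqlite3.Connection, items: Iterable[dict], batch_size: int = 500) -> int:
    """Upsert item metadata in micro-batches; returns the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "collection TEXT NOT NULL, id TEXT NOT NULL, datetime TEXT, body TEXT, "
        "PRIMARY KEY (collection, id))"
    )
    written = 0
    for batch in batched(items, batch_size):
        rows = [
            (
                item.get("collection") or "",
                item["id"],
                item.get("properties", {}).get("datetime"),
                json.dumps(item),
            )
            for item in batch
        ]
        with conn:  # commits the batch, or rolls it back on error
            conn.executemany("INSERT OR REPLACE INTO items VALUES (?, ?, ?, ?)", rows)
        written += len(rows)
    return written
```

Because rows are keyed on collection and id, re-running the index over the same items replaces existing rows instead of duplicating them.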

Debugging Malformed Geometries

Some providers return null or invalid GeoJSON in the geometry field. The Pydantic validator rejects a null geometry because it requires a dict, but it does not verify that the dict is valid GeoJSON, so add a stricter check if downstream tools depend on it. You may also want a fallback that reconstructs a bounding polygon from the item's top-level bbox field. Always log validation failures to a dedicated error queue for manual review rather than silently dropping items.
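
A bbox fallback can be as small as the sketch below, which builds a rectangular GeoJSON Polygon from a 2D or 3D bounding box. geometry_from_bbox is a hypothetical helper; it does not handle boxes that cross the antimeridian, and the result is only an approximation of the true footprint.

```python
def geometry_from_bbox(bbox: list[float]) -> dict:
    """Build a rectangular GeoJSON Polygon from [west, south, east, north] (2D) or
    [west, south, min_elev, east, north, max_elev] (3D)."""
    if len(bbox) == 4:
        west, south, east, north = bbox
    elif len(bbox) == 6:
        west, south, _, east, north, _ = bbox  # elevation bounds are dropped
    else:
        raise ValueError(f"bbox must have 4 or 6 values, got {len(bbox)}")
    # Exterior ring in counter-clockwise order, closed on the first vertex
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],
    ]
    return {"type": "Polygon", "coordinates": [ring]}

print(geometry_from_bbox([10.0, 45.0, 11.0, 46.0])["coordinates"][0][2])  # → [11.0, 46.0]
```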

Conclusion

Syncing STAC catalogs with pystac-client transforms geospatial data ingestion from a manual, error-prone process into a deterministic, scalable engineering workflow. By enforcing conformance validation, implementing resilient pagination, and applying strict schema checks, teams can build ingestion pipelines that cope with large satellite archives. When combined with robust retry logic, cloud-native storage targets, and clear integration patterns for complementary data sources, pystac-client becomes a cornerstone of modern spatial ETL architecture. Start with targeted queries, validate early, and scale horizontally as your data requirements grow.