This guide is part of Mastering Geospatial Data Ingestion in Python.

Bulk Downloading Satellite Imagery with Python

Pulling one Sentinel-2 tile by hand is a two-minute task. Pulling 4 000 tiles for a multi-year land-cover study is a multi-day engineering problem — and the naive approach (a for-loop around requests.get) will fail spectacularly. Ephemeral signed URLs expire mid-download. Rate limiters kick in after the fifth concurrent socket. A corrupted TIFF that passes HTTP 200 silently poisons downstream model training. This guide builds a production-ready ingestion pipeline — from catalog discovery through integrity-checked archival — that handles those failure modes by design.

Why Naive Download Loops Fail at Scale

A simple for scene_id in scenes: requests.get(url) loop collapses for several compounding reasons:

Signed URL expiration. AWS S3, Google Cloud Storage, and ESA’s CDSE can all hand out signed download URLs with a limited lifetime, often 15–60 minutes. Resolving every URL upfront and then downloading sequentially means many will have expired by the time their transfer starts.
Rate limiting without backoff. Providers enforce per-IP request quotas. A tight loop triggers 429 responses, and without exponential backoff every immediate retry is rejected too, so the pipeline stops making progress.
No resumption. A dropped connection at byte 900 MB of a 1 GB Sentinel-2 scene means re-downloading the full scene. At scale, retrying from zero wastes hours of bandwidth.
Silent corruption. HTTP 200 does not imply the payload is intact. Truncated responses, network bit errors, and mid-stream proxy interventions produce files that os.path.exists() happily reports as present but that rasterio refuses to open.
Single-threaded I/O starvation. Download is I/O-bound, not CPU-bound. A single thread sits idle waiting on the network while one download completes.
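
The backoff behaviour mentioned in the list above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual retry logic: the function name backoff_delay and its base/cap defaults are assumptions chosen for the example. It honours a numeric Retry-After header when the server sends one and otherwise uses exponential backoff with full jitter.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 2.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # A numeric Retry-After header is the server's own instruction; follow it.
        return float(retry_after.strip())
    # Exponential ceiling: base, 2*base, 4*base, ... limited to `cap`.
    ceiling = min(cap, base * 2 ** (attempt - 1))
    # Full jitter: a uniform draw in [0, ceiling] keeps workers out of lockstep.
    return (rng or random).uniform(0.0, ceiling)
```

Passing a seeded random.Random as rng makes the delays reproducible in tests; in production the default module-level generator is enough.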

Prerequisites & Environment

Python 3.10 or later is recommended. Install the following dependencies into a dedicated virtual environment:

# requirements.txt
pystac-client==0.8.*
requests==2.32.*
rasterio==1.4.*
tqdm==4.67.*
tenacity==9.*
boto3==1.35.*          # only if archiving to S3

pip install -r requirements.txt

Verify your GDAL / rasterio installation can open a remote COG without materializing the full file:

import rasterio
from rasterio.env import Env

with Env(GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR", CPL_VSIL_CURL_CACHE_SIZE="200000000"):
    with rasterio.open("https://sentinel-cogs.s3.amazonaws.com/sentinel-s2-l2a-cogs/36/Q/YD/2023/6/S2A_36QYD_20230601_0_L2A/B04.tif") as ds:
        print(ds.profile)  # should print without downloading the full ~100 MB

Version/Compatibility Matrix

pystac-client	rasterio	requests	Notes
0.8.x	1.4.x	2.32.x	Recommended; supports STAC 1.1 conformance classes
0.7.x	1.3.x	2.31.x	No `sortby` parameter support in catalog searches
0.6.x	1.3.x	2.30.x	`ItemSearch` pagination differs; `pages()` not available
0.5.x	1.2.x	2.28.x	Use `get_all_items()` only; large catalogs exhaust memory

Step 1: Catalog Query & Scene Discovery

Use pystac-client to sync STAC catalogs for asset-level discovery. The key discipline here is paginating lazily to avoid loading tens of thousands of item records into memory simultaneously.

from __future__ import annotations

import json
import logging
from pathlib import Path
from typing import Iterator

import pystac_client

logger = logging.getLogger(__name__)

STAC_ENDPOINT = "https://earth-search.aws.element84.com/v1"
COLLECTION = "sentinel-2-l2a"


def discover_scenes(
    bbox: list[float],          # [west, south, east, north] in EPSG:4326
    date_range: str,            # e.g. "2023-01-01/2023-12-31"
    max_cloud_cover: float = 20.0,
    cache_path: Path | None = None,
) -> list[dict]:
    """
    Query a STAC catalog and return a flat list of per-band asset dicts.
    Results are cached as newline-delimited JSON to avoid re-querying on restart.
    """
    if cache_path and cache_path.exists():
        logger.info("Loading scene list from cache: %s", cache_path)
        return [json.loads(line) for line in cache_path.read_text().splitlines() if line]

    client = pystac_client.Client.open(STAC_ENDPOINT)
    search = client.search(
        collections=[COLLECTION],
        bbox=bbox,
        datetime=date_range,
        query={"eo:cloud_cover": {"lt": max_cloud_cover}},
        sortby=["+datetime"],
        max_items=None,          # let pystac-client paginate
    )

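    # Earth Search v1 keys assets by common name ("red", "nir", ...), while some
    # other catalogs use band IDs ("B04", "B08", ...). Normalize to band IDs.
    band_aliases = {"red": "B04", "green": "B03", "blue": "B02", "nir": "B08", "scl": "SCL"}
    scenes: list[dict] = []
    for page in search.pages():
        for item in page:
            for key, asset in item.assets.items():
                band = band_aliases.get(key, key)
                if band in ("B04", "B03", "B02", "B08", "SCL"):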
                    scenes.append({
                        "scene_id": item.id,
                        "band": band,
                        "href": asset.href,
                        "datetime": item.datetime.isoformat() if item.datetime else None,
                    })
        logger.info("Discovered %d assets so far…", len(scenes))

    if cache_path:
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text("\n".join(json.dumps(s) for s in scenes))
        logger.info("Cached %d assets to %s", len(scenes), cache_path)

    return scenes

Cache the query result as newline-delimited JSON. Re-querying a catalog across 3 000+ scenes can take minutes; a local cache enables instant pipeline restarts without repeating that cost.
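
One weakness of a plain write_text() cache is that a crash mid-write leaves a truncated file that the next run will happily load. A possible hardening, sketched here with the hypothetical helpers write_ndjson_atomic and read_ndjson (neither is part of the pipeline above), writes to a temporary file in the same directory and swaps it into place with os.replace.

```python
import json
import os
import tempfile
from pathlib import Path


def write_ndjson_atomic(records: list[dict], cache_path: Path) -> None:
    """Write one JSON object per line, replacing cache_path in a single step."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=cache_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
        os.replace(tmp_name, cache_path)
    except BaseException:
        # Clean up the partial temp file, then let the error propagate.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_ndjson(cache_path: Path) -> list[dict]:
    """Load the cache, skipping blank lines."""
    with cache_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Readers either see the previous complete cache or the new complete cache, never a half-written one.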

Step 2: URL Resolution & Authentication Lifecycle

STAC href values for public buckets are permanent. For protected providers — ESA’s Copernicus Data Space, USGS EarthExplorer, Planet — the catalog returns indirect endpoints that require token exchange. Resolve signed download URLs immediately before each transfer, not at query time.
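
To decide whether an already-resolved URL is still worth using, a worker can inspect the URL itself. The sketch below assumes AWS SigV4-style pre-signed URLs, which carry X-Amz-Date and X-Amz-Expires query parameters; other providers use different parameter names, so treat this as one illustrative case. The function names and the 120-second safety margin are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def presigned_expiry(url: str) -> Optional[datetime]:
    """UTC expiry of a SigV4 pre-signed URL, or None if it has no such parameters."""
    params = parse_qs(urlsplit(url).query)
    try:
        signed_at = datetime.strptime(params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
        lifetime_s = int(params["X-Amz-Expires"][0])
    except (KeyError, ValueError):
        return None
    return signed_at.replace(tzinfo=timezone.utc) + timedelta(seconds=lifetime_s)


def needs_fresh_url(url: str, now: datetime, margin_s: int = 120) -> bool:
    """True when the URL expires within margin_s seconds of `now` (or already has)."""
    expiry = presigned_expiry(url)
    return expiry is not None and now >= expiry - timedelta(seconds=margin_s)
```

A permanent public href has no expiry parameters, so needs_fresh_url returns False and the worker downloads it directly.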

For government portal authentication specifics, automating USGS EarthExplorer bulk downloads with requests covers multi-step login flows, CSRF token extraction, and session cookie management in detail.

For OAuth2-protected STAC APIs (e.g. Copernicus Data Space), implement a token-refresh wrapper:

import time
import threading
import requests as req

class TokenManager:
    """Thread-safe OAuth2 access token manager with proactive refresh."""

    def __init__(self, token_url: str, client_id: str, client_secret: str) -> None:
        self._url = token_url
        self._id = client_id
        self._secret = client_secret
        self._token: str = ""
        self._expires_at: float = 0.0
        self._lock = threading.Lock()

    def get_token(self) -> str:
        with self._lock:
            # Refresh 60 s before actual expiry to avoid race conditions
            if time.monotonic() >= self._expires_at - 60:
                self._refresh()
            return self._token

    def _refresh(self) -> None:
        resp = req.post(
            self._url,
            data={
                "grant_type": "client_credentials",
                "client_id": self._id,
                "client_secret": self._secret,
            },
            timeout=15,
        )
        resp.raise_for_status()
        payload = resp.json()
        self._token = payload["access_token"]
        self._expires_at = time.monotonic() + payload["expires_in"]
        logger.debug("OAuth2 token refreshed; expires in %ds", payload["expires_in"])

Inject token_manager.get_token() into the Authorization header of each download request. Store credentials in environment variables; never interpolate them into source files or logs.
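
A small sketch of both points follows. The environment variable names STAC_CLIENT_ID and STAC_CLIENT_SECRET and the helper names are hypothetical; substitute whatever your deployment uses. Building the header per request, rather than storing it on a shared session, is one way to keep the token from being sent to hosts that do not need it.

```python
import os
from typing import Callable, Mapping


def load_credentials(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Read OAuth2 client credentials from the environment (names are assumptions)."""
    try:
        return env["STAC_CLIENT_ID"], env["STAC_CLIENT_SECRET"]
    except KeyError as exc:
        # Report only the variable name, never a credential value.
        raise RuntimeError(f"Missing environment variable: {exc.args[0]}") from None


def auth_headers(get_token: Callable[[], str]) -> dict[str, str]:
    """Build a fresh Authorization header immediately before a request."""
    return {"Authorization": f"Bearer {get_token()}"}
```

With the TokenManager above, a request would then look like session.get(href, headers=auth_headers(token_manager.get_token), stream=True, timeout=30).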

Step 3: Concurrent Chunked Download with Resumption

Satellite downloads are I/O-bound. ThreadPoolExecutor lets many transfers overlap, because Python releases the GIL while a thread waits on the network, without the overhead of multiprocessing. Limit workers to 8–12 per node to stay within provider rate limits.

Implement HTTP Range requests so that a connection dropped after 800 MB resumes at byte offset 800 000 000 (Range offsets are zero-based) rather than restarting:

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import NamedTuple

import requests as req
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from tqdm import tqdm

CHUNK_SIZE = 1 << 20   # 1 MiB
MAX_WORKERS = 10
MAX_RETRIES = 5


class DownloadResult(NamedTuple):
    scene_id: str
    band: str
    dest: Path
    sha256: str
    success: bool
    error: str | None


@retry(
    stop=stop_after_attempt(MAX_RETRIES),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    retry=retry_if_exception_type((req.exceptions.RequestException, OSError)),
    reraise=True,
)
def _download_asset(
    href: str,
    dest: Path,
    session: req.Session,
) -> str:
    """
    Download a single asset with HTTP Range resumption.
    Returns the hex SHA-256 digest of the complete file.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    existing_bytes = dest.stat().st_size if dest.exists() else 0
    headers = {}
    if existing_bytes:
        headers["Range"] = f"bytes={existing_bytes}-"
        logger.debug("Resuming %s from byte %d", dest.name, existing_bytes)

    with session.get(href, headers=headers, stream=True, timeout=30) as resp:
        if resp.status_code == 416:
            # Server says the file is already complete
            pass
        elif resp.status_code in (200, 206):
            mode = "ab" if existing_bytes and resp.status_code == 206 else "wb"
            with dest.open(mode) as fh:
                for chunk in resp.iter_content(CHUNK_SIZE):
                    if chunk:
                        fh.write(chunk)
        else:
            resp.raise_for_status()

    # Compute digest over the complete file
    sha256 = hashlib.sha256()
    with dest.open("rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK_SIZE), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def download_scenes_concurrent(
    scenes: list[dict],
    output_dir: Path,
    token_manager: TokenManager | None = None,
) -> list[DownloadResult]:
    """
    Download all assets in parallel. Returns one DownloadResult per asset.
    """
    session = req.Session()
    adapter = req.adapters.HTTPAdapter(
        pool_connections=MAX_WORKERS,
        pool_maxsize=MAX_WORKERS,
        max_retries=0,   # tenacity handles retries
    )
    session.mount("https://", adapter)

    results: list[DownloadResult] = []

    def _worker(scene: dict) -> DownloadResult:
        if token_manager:
            session.headers.update({"Authorization": f"Bearer {token_manager.get_token()}"})
        dest = output_dir / scene["scene_id"] / f"{scene['band']}.tif"
        try:
            digest = _download_asset(scene["href"], dest, session)
            return DownloadResult(scene["scene_id"], scene["band"], dest, digest, True, None)
        except Exception as exc:
            logger.error("Failed %s/%s: %s", scene["scene_id"], scene["band"], exc)
            return DownloadResult(scene["scene_id"], scene["band"], dest, "", False, str(exc))

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(_worker, s): s for s in scenes}
        with tqdm(total=len(scenes), unit="asset", desc="Downloading") as pbar:
            for fut in as_completed(futures):
                results.append(fut.result())
                pbar.update(1)

    return results
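
Appending to a partial file is only safe if the 206 response really starts where the local file ends. One way to check, sketched here as an optional extra and not part of the function above, is to parse the Content-Range response header before opening the file in append mode. The helper name resume_is_consistent is an assumption.

```python
import re
from typing import Optional

# Matches e.g. "bytes 800-999/1000" or "bytes 800-999/*" (total length unknown).
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def resume_is_consistent(content_range: Optional[str], existing_bytes: int) -> bool:
    """True when a 206 response begins exactly at the end of the local partial file."""
    if content_range is None:
        return False
    match = _CONTENT_RANGE.match(content_range.strip())
    if match is None:
        return False
    first_byte = int(match.group(1))
    # Offsets are zero-based: a file holding N bytes resumes at offset N.
    return first_byte == existing_bytes
```

If the check fails, the conservative choice is to discard the partial file and restart that asset from byte 0.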

Step 4: Integrity Verification with rasterio

An HTTP 200 response does not guarantee the payload is intact. Validate every file against two complementary checks: a cryptographic digest and a structural rasterio open.

import rasterio
from rasterio.env import Env

EXPECTED_BAND_COUNT = {"B04": 1, "B03": 1, "B02": 1, "B08": 1, "SCL": 1}
EXPECTED_DTYPE = {"B04": "uint16", "B03": "uint16", "B02": "uint16", "B08": "uint16", "SCL": "uint8"}


def validate_raster(result: DownloadResult, catalog_checksum: str | None = None) -> bool:
    """
    Returns True only when both checksum and structural checks pass.
    Moves corrupt files to a quarantine directory rather than deleting them.
    """
    if catalog_checksum and result.sha256 != catalog_checksum:
        logger.error(
            "Checksum mismatch for %s/%s: expected %s got %s",
            result.scene_id, result.band, catalog_checksum, result.sha256,
        )
        _quarantine(result.dest)
        return False

    try:
        with Env(GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR"):
            with rasterio.open(result.dest) as ds:
                # Explicit raises rather than assert, which is stripped under `python -O`
                if ds.count != EXPECTED_BAND_COUNT[result.band]:
                    raise ValueError(f"Band count mismatch: {ds.count}")
                if ds.dtypes[0] != EXPECTED_DTYPE[result.band]:
                    raise ValueError(f"Dtype mismatch: {ds.dtypes[0]}")
                if ds.crs is None:
                    raise ValueError("CRS is None")
                if ds.transform == rasterio.transform.IDENTITY:
                    raise ValueError("Transform is identity")
    except Exception as exc:
        logger.error("Structural validation failed for %s: %s", result.dest, exc)
        _quarantine(result.dest)
        return False

    return True


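def _quarantine(path: Path) -> None:
    quarantine_dir = path.parent.parent / "_quarantine"
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    # Prefix with the scene directory name: every scene has a "B04.tif", so
    # moving by bare filename would overwrite earlier quarantined files.
    target = quarantine_dir / f"{path.parent.name}_{path.name}"
    path.replace(target)
    logger.warning("Quarantined corrupt file: %s -> %s", path, target)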
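
Before paying for a full rasterio open, a cheap first filter is to look at the file's first four bytes. An HTML error page or an empty file saved with a .tif extension fails this immediately. The sketch below is an optional pre-check under the assumption that the assets are TIFF or BigTIFF files; the name looks_like_tiff is hypothetical, and the check does not replace the structural validation above.

```python
from pathlib import Path

# Byte-order mark ("II" little-endian, "MM" big-endian) followed by the
# format version: 42 ("*") for classic TIFF, 43 ("+") for BigTIFF.
_TIFF_MAGIC = (
    b"II*\x00",
    b"MM\x00*",
    b"II+\x00",
    b"MM\x00+",
)


def looks_like_tiff(path: Path) -> bool:
    """True when the file starts with a TIFF or BigTIFF signature."""
    with path.open("rb") as fh:
        return fh.read(4) in _TIFF_MAGIC
```

A file that passes can still be truncated further in, so the checksum and rasterio checks remain necessary.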

When CRS validation is relevant to downstream joins or rasterization, chain this step with CRS Normalization Across Mixed Datasets to re-project any scenes that arrive in an unexpected EPSG code before they enter your model training pipeline.

Step 5: Idempotent State Tracking & Logging

Never re-download a validated file. Maintain a lightweight manifest that records every successfully verified asset. On restart, the pipeline checks this manifest and skips already-completed transfers.

import sqlite3
from datetime import datetime, timezone

def open_manifest(manifest_path: Path) -> sqlite3.Connection:
    conn = sqlite3.connect(manifest_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS completed_assets (
            scene_id  TEXT NOT NULL,
            band      TEXT NOT NULL,
            sha256    TEXT NOT NULL,
            dest      TEXT NOT NULL,
            validated_at TEXT NOT NULL,
            PRIMARY KEY (scene_id, band)
        )
    """)
    conn.commit()
    return conn


def is_already_validated(conn: sqlite3.Connection, scene_id: str, band: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM completed_assets WHERE scene_id=? AND band=?",
        (scene_id, band),
    ).fetchone()
    return row is not None


def record_validated(conn: sqlite3.Connection, result: DownloadResult) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO completed_assets VALUES (?,?,?,?,?)",
        (result.scene_id, result.band, result.sha256,
         str(result.dest), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
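
On restart, the manifest can be used to trim the work list before anything is submitted to the download pool. A minimal sketch, assuming the completed_assets table defined above and scene dicts with scene_id and band keys; the helper name pending_assets is hypothetical.

```python
import sqlite3


def pending_assets(conn: sqlite3.Connection, scenes: list[dict]) -> list[dict]:
    """Return only the scenes that have no row in the completed_assets manifest."""
    # One query up front is cheaper than one SELECT per asset for large work lists.
    done = {
        (scene_id, band)
        for scene_id, band in conn.execute(
            "SELECT scene_id, band FROM completed_assets"
        )
    }
    return [s for s in scenes if (s["scene_id"], s["band"]) not in done]
```

The original order of the work list is preserved, so any date-based scheduling done earlier still holds.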

Use a structured logger with correlation IDs (scene_id + band) so that log aggregation tools (Loki, CloudWatch Logs Insights, Datadog) can reconstruct the full lifecycle of any individual asset across retry attempts.
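
With only the standard library, one way to get there is a LoggerAdapter that attaches the IDs plus a formatter that emits JSON. The class and function names below are assumptions for illustration, and the field set is deliberately small.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Present only when the record came through asset_logger().
            "scene_id": getattr(record, "scene_id", None),
            "band": getattr(record, "band", None),
        })


def asset_logger(base: logging.Logger, scene_id: str, band: str) -> logging.LoggerAdapter:
    """Wrap a logger so every message carries the asset's correlation IDs."""
    return logging.LoggerAdapter(base, {"scene_id": scene_id, "band": band})
```

Each worker would create its adapter once per asset, for example log = asset_logger(logger, scene["scene_id"], scene["band"]), and use log.info(...) in place of logger.info(...).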

Advanced Patterns & Edge Cases

Handling Partial Scene Availability in STAC Collections

Some STAC items exist in the catalog but their underlying assets have been moved to cold storage (Glacier, nearline) or are temporarily unavailable. The asset.roles field may contain "data" while the actual object returns a 403 or 503. Build a pre-flight availability check that issues a HEAD request before the full GET:

def preflight_check(href: str, session: req.Session) -> tuple[bool, int]:
    """Returns (is_available, content_length_bytes)."""
    try:
        resp = session.head(href, timeout=10, allow_redirects=True)
        if resp.status_code == 200:
            return True, int(resp.headers.get("Content-Length", 0))
        return False, 0
    except req.exceptions.RequestException:
        return False, 0

Filter unavailable assets into a separate dead-letter queue before committing them to the concurrent download pool. This prevents worker slots from being wasted on guaranteed failures.
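
The split itself is a few lines. In this sketch the availability test is passed in as a callable so the logic can be exercised without network access; in the pipeline it would wrap the pre-flight HEAD request. The name partition_by_availability is hypothetical.

```python
from typing import Callable


def partition_by_availability(
    scenes: list[dict],
    is_available: Callable[[str], bool],
) -> tuple[list[dict], list[dict]]:
    """Split scenes into (ready, dead_letter) based on a per-href availability test."""
    ready: list[dict] = []
    dead_letter: list[dict] = []
    for scene in scenes:
        (ready if is_available(scene["href"]) else dead_letter).append(scene)
    return ready, dead_letter
```

Only the ready list goes to the download pool; the dead-letter list can be persisted and retried later, since cold-storage assets may become available again.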

Large-Area Mosaics and Tile Boundary Management

Spatial queries for continental-scale analysis return thousands of scenes with overlapping footprints. Ingesting duplicate coverage wastes storage and biases training data. Use shapely.unary_union on scene footprint polygons to identify true spatial gaps before submitting download jobs. Pair this with Fetching OSM Data via Overpass API when you need to mask imagery to administrative boundaries or road buffer zones — the OSM footprints serve as the reference geometry for the overlap analysis.

from shapely.geometry import shape
from shapely.ops import unary_union

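def deduplicate_by_coverage(scenes: list[dict], aoi_geom) -> list[dict]:
    """Keep only scenes that add new coverage inside the AOI, earliest first.

    Each scene dict must carry a GeoJSON "footprint" (for example item.geometry);
    discover_scenes() above does not add one.
    """
    covered = None
    selected = []
    for scene in sorted(scenes, key=lambda s: s["datetime"]):
        # Only the part of the footprint inside the AOI counts as coverage
        footprint = shape(scene["footprint"]).intersection(aoi_geom)
        if footprint.is_empty:
            continue
        if covered is None or not covered.contains(footprint):
            selected.append(scene)
            covered = footprint if covered is None else unary_union([covered, footprint])
    return selected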

COG Validation and Overview-Level Integrity

Cloud Optimized GeoTIFFs include internal overviews and a specific IFD layout. A COG that passes rasterio.open() may still be malformed if a mid-transfer interruption truncated the overview pyramid rather than the main image data. Use GDAL’s validate_cloud_optimized_geotiff utility for stricter checks:

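from osgeo_utils.samples import validate_cloud_optimized_geotiff as cog_check

def validate_cog(path: Path) -> bool:
    """Returns True if the file is a valid, fully-formed COG.

    Relies on the validate_cloud_optimized_geotiff sample script bundled with
    the GDAL Python bindings (osgeo_utils, GDAL 3.2 or later).
    """
    try:
        # full_check=True enables the script's more expensive block-level checks
        warnings, errors, _details = cog_check.validate(str(path), full_check=True)
    except (cog_check.ValidateCloudOptimizedGeoTIFFException, RuntimeError) as exc:
        logger.warning("COG validation could not run for %s: %s", path.name, exc)
        return False
    if errors:
        logger.warning("COG validation errors for %s: %s", path.name, errors)
    return not errors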

Performance Optimization

The dominant cost in a bulk download pipeline is network I/O, not CPU. Profile before optimizing — but for networks with high bandwidth-delay product (BDP), the following configuration changes yield consistent throughput gains:

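import socket

from urllib3.connection import HTTPConnection


class LargeBufferAdapter(req.adapters.HTTPAdapter):
    """HTTPAdapter that requests a larger receive buffer for high-BDP links."""

    def init_poolmanager(self, *args, **kwargs):
        # Keep urllib3's defaults (which already set TCP_NODELAY) and add SO_RCVBUF.
        # The 4 MiB figure is illustrative; the OS may clamp it, and on Linux a
        # fixed SO_RCVBUF turns off receive-buffer autotuning, so measure first.
        kwargs["socket_options"] = HTTPConnection.default_socket_options + [
            (socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024),
        ]
        super().init_poolmanager(*args, **kwargs)


session = req.Session()
adapter = LargeBufferAdapter(
    pool_connections=MAX_WORKERS,
    pool_maxsize=MAX_WORKERS * 2,   # headroom for connection churn
)
session.mount("https://", adapter)

# requests speaks HTTP/1.1 only; HTTP/2 needs a different client such as httpx.
# A Session already reuses connections (keep-alive), so no extra header is needed.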

Benchmark the effect of CHUNK_SIZE on your specific network rather than relying on a rule of thumb. As a starting point, on high-latency links (>100 ms RTT) larger chunks (8–16 MiB) may reduce per-chunk overhead; on low-latency local networks, 1–2 MiB is usually enough and keeps per-worker memory low.

For post-download compression or reprojection, use ProcessPoolExecutor — those operations are CPU-bound and benefit from bypassing the GIL. Keep download and processing in separate pools to avoid a slow CPU task starving a download worker.

Integration into ETL Pipelines

The download stage should be a discrete, idempotent node in your broader pipeline DAG. In Apache Airflow, wrap download_scenes_concurrent in a @task decorated function and pass the output manifest path downstream to validation and feature-extraction tasks. Dead-letter assets (failed after all retries) should be written to a separate table or queue for human review rather than silently dropped.

For automating government portal downloads where STAC is not available, the same manifest and concurrency patterns apply — only the catalog discovery and URL resolution stages differ.

Schema enforcement is straightforward: the manifest table acts as the contract. Rows are inserted only after validation succeeds, and validated_at is declared NOT NULL, so any row in completed_assets identifies a verified asset and downstream tasks can select from the table directly. CI pipelines can assert on COUNT(*) in the manifest after a synthetic test run over a small AOI to detect regressions in the download or validation logic without transferring production volumes.

Failure-Mode Reference

Failure Mode	Root Cause	Mitigation Strategy
HTTP 403 on signed URL	Token expired before download started	Resolve signed URLs per-worker, immediately before GET
HTTP 429 Too Many Requests	Concurrent workers exceed provider rate limit	Cap `MAX_WORKERS` at 8–12; add per-domain rate-limiter
Truncated file, HTTP 200	Proxy or CDN closed connection early	SHA-256 check + rasterio structural validation; quarantine and retry
COG overview pyramid incomplete	Transfer interrupted during overview IFD	`validate_cog()` check; re-download if COG invalid
Scene in catalog but 503	Asset in cold/nearline storage	HEAD preflight filter; route to dead-letter queue
Duplicate tiles inflate dataset	Overlapping STAC footprints	Spatial deduplication via `shapely.unary_union` before scheduling
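
The per-domain rate limiter named in the table is not shown elsewhere in this guide. One simple form, sketched here under the assumption that a fixed minimum interval between request starts per hostname is acceptable to the provider, looks like this. The class name, the interval value, and the injectable clock and sleep parameters are all choices made for the example.

```python
import threading
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Enforce a minimum interval between request starts for each hostname."""

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._next_slot: Dict[str, float] = {}
        self._lock = threading.Lock()

    def wait(self, url: str) -> float:
        """Block until the URL's host may be contacted; return the seconds slept."""
        host = urlsplit(url).hostname or ""
        with self._lock:
            now = self._clock()
            # Reserve the next free slot for this host, then release the lock
            # so other threads can reserve theirs while this one sleeps.
            slot = max(now, self._next_slot.get(host, now))
            self._next_slot[host] = slot + self._min_interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Each worker would call limiter.wait(href) immediately before its GET, sharing one limiter instance across the whole thread pool.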

Automating USGS EarthExplorer Bulk Downloads with requests — step-by-step credential exchange and HTML session management for the EarthExplorer portal
Syncing STAC Catalogs with pystac-client — paginating STAC search results and resolving asset hrefs at scale
Fetching OSM Data via Overpass API — aligning satellite imagery with road networks and administrative boundaries
Automating Government Portal Downloads — applying the same concurrent-download patterns to non-STAC government data sources
CRS Normalization Across Mixed Datasets — re-projecting imagery that arrives in unexpected EPSG codes before downstream joins