Automating USGS EarthExplorer Bulk Downloads with requests

Automating USGS EarthExplorer bulk downloads with requests requires authenticating against the USGS Machine-to-Machine (M2M) API, generating temporary download URLs, and streaming large archives using connection pooling and exponential backoff. Direct HTTP requests to the EarthExplorer web interface are blocked by anti-bot protections and violate USGS terms of service. The only supported automation path uses the M2M API v1.5+ to generate authenticated endpoints that requests can consume safely at scale.

The workflow follows a strict three-phase pipeline:

  1. Token Acquisition – Exchange credentials for a time-limited X-Auth-Token (~2 hour validity)
  2. Scene Metadata Resolution – Query available scenes by spatial/temporal bounds
  3. Chunked File Streaming – Fetch temporary URLs and write GeoTIFFs/archives to disk without memory overflow

This pattern is foundational for teams building ingestion pipelines of the kind covered in Mastering Geospatial Data Ingestion in Python, where reliability, resumable transfers, and strict rate-limit compliance are non-negotiable.

Production-Ready Architecture

Satellite archives routinely exceed 1–2 GB per scene, so reading a whole response into memory can exhaust RAM and kill the process. Instead, use requests.Session with urllib3 retry logic, connection pooling, and iter_content() chunking. The configuration below automatically retries failed connection attempts, HTTP 429 throttling, and server-side 5xx responses. A connection that drops partway through a streamed body is not retried by the adapter, so the download code has to handle that case itself.

import os
import logging
from pathlib import Path
from urllib.parse import urljoin
from requests import Session
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

USGS_M2M_BASE = "https://m2m.cr.usgs.gov/api/api/json/stable/"
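# Dataset alias for Landsat 8-9 OLI/TIRS Collection 2 Level-1; confirm it with the dataset-search endpoint
DATASET_ID = "landsat_ot_c2_l1"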
OUTPUT_DIR = Path("./usgs_downloads")
OUTPUT_DIR.mkdir(exist_ok=True)

def get_session() -> Session:
    """Configure session with connection pooling and exponential backoff."""
    session = Session()
    retry_strategy = Retry(
        total=5,
        backoff_factor=1.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "POST"]
    )
    adapter = HTTPAdapter(
        max_retries=retry_strategy,
        pool_connections=10,
        pool_maxsize=10
    )
    session.mount("https://", adapter)
    session.headers.update({"Content-Type": "application/json"})
    return session

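def authenticate(session: Session, username: str, password: str) -> str:
    """Exchange credentials for a temporary M2M token."""
    # If password login is rejected, check the current M2M documentation:
    # USGS also provides application-token login via the "login-token" endpoint.
    payload = {"username": username, "password": password}
    resp = session.post(urljoin(USGS_M2M_BASE, "login"), json=payload, timeout=60)
    resp.raise_for_status()
    body = resp.json()
    # The API can report failures in the JSON body, so check errorCode as well
    if body.get("errorCode"):
        raise RuntimeError(f"M2M login failed: {body['errorCode']}: {body.get('errorMessage')}")
    token = body["data"]
    logger.info("Authentication successful. Token valid for ~2 hours.")
    return token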

def search_scenes(session: Session, token: str, bbox: list, start_date: str, end_date: str) -> list:
    """Retrieve scene metadata for a bounding box and date range.

    bbox is [min_lon, min_lat, max_lon, max_lat] in decimal degrees.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    payload = {
        "datasetName": DATASET_ID,
        "maxResults": 50,
        "sceneFilter": {
            "acquisitionFilter": {"start": start_date, "end": end_date},
            # The M2M API expects coordinate objects, not bare [lon, lat] lists
            "spatialFilter": {
                "filterType": "mbr",
                "lowerLeft": {"latitude": min_lat, "longitude": min_lon},
                "upperRight": {"latitude": max_lat, "longitude": max_lon},
            },
        },
    }
    headers = {"X-Auth-Token": token}
    resp = session.post(urljoin(USGS_M2M_BASE, "scene-search"), json=payload, headers=headers, timeout=60)
    resp.raise_for_status()
    return resp.json()["data"]["results"]

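def get_download_urls(session: Session, token: str, scene_ids: list) -> dict:
    """Resolve temporary download URLs for a list of scene IDs."""
    headers = {"X-Auth-Token": token}
    # Step 1: list the products offered for each scene (the parameter is "entityIds")
    payload = {"datasetName": DATASET_ID, "entityIds": scene_ids}
    resp = session.post(urljoin(USGS_M2M_BASE, "download-options"), json=payload, headers=headers)
    resp.raise_for_status()
    # Keep the first available product per scene; filter on productName to pick a specific one
    wanted = {}
    for product in resp.json()["data"] or []:
        if product.get("available"):
            wanted.setdefault(product["entityId"], product["id"])
    if not wanted:
        return {}

    # Step 2: request temporary URLs for the selected products
    downloads = [{"entityId": eid, "productId": pid} for eid, pid in wanted.items()]
    payload = {"downloads": downloads, "label": "bulk-download"}
    resp = session.post(urljoin(USGS_M2M_BASE, "download-request"), json=payload, headers=headers)
    resp.raise_for_status()
    # Products still being staged are listed under "preparingDownloads" and must be
    # collected later via the download-retrieve endpoint; this function skips them.
    url_map = {}
    for item in resp.json()["data"].get("availableDownloads") or []:
        # Fall back to downloadId if the response does not echo the entityId
        url_map[item.get("entityId") or str(item["downloadId"])] = item["url"]
    return url_map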

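def download_file(session: Session, url: str, dest_path: Path, chunk_size: int = 1024 * 1024) -> None:
    """Stream a large file to disk, logging progress roughly every 100 MB."""
    log_step = 100 * 1024 * 1024
    # Write to a temporary name so an interrupted transfer never looks complete
    tmp_path = dest_path.with_name(dest_path.name + ".part")
    with session.get(url, stream=True, timeout=(10, 300)) as resp:
        resp.raise_for_status()
        total_size = int(resp.headers.get("content-length", 0))
        downloaded = 0
        next_log = log_step

        with open(tmp_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                downloaded += len(chunk)
                if total_size and downloaded >= next_log:
                    logger.info(f"{dest_path.name}: {downloaded/1024/1024:.1f}MB / {total_size/1024/1024:.1f}MB")
                    next_log += log_step
    # Rename only after the whole body has been written
    tmp_path.replace(dest_path)
    logger.info(f"Download complete: {dest_path}")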

def run_pipeline(username: str, password: str, bbox: list, start_date: str, end_date: str) -> None:
    session = get_session()
    token = authenticate(session, username, password)
    
    scenes = search_scenes(session, token, bbox, start_date, end_date)
    if not scenes:
        logger.warning("No scenes found for the specified criteria.")
        return
        
    scene_ids = [s["entityId"] for s in scenes]
    url_map = get_download_urls(session, token, scene_ids)
    
    for scene_id, dl_url in url_map.items():
        # Collection 2 Level-1 bundles are normally delivered as uncompressed .tar archives
        dest = OUTPUT_DIR / f"{scene_id}.tar"
        if dest.exists():
            logger.info(f"Skipping existing: {dest.name}")
            continue
        try:
            download_file(session, dl_url, dest)
        except Exception as e:
            logger.error(f"Failed to download {scene_id}: {e}")

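if __name__ == "__main__":
    # Read credentials from the environment instead of hardcoding them
    # (the variable names USGS_USERNAME and USGS_PASSWORD are only a convention)
    run_pipeline(
        username=os.environ["USGS_USERNAME"],
        password=os.environ["USGS_PASSWORD"],
        bbox=[-122.5, 37.7, -122.3, 37.9],  # San Francisco Bay Area: min_lon, min_lat, max_lon, max_lat
        start_date="2023-06-01",
        end_date="2023-06-30"
    )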

Operational Hardening

Token Lifecycle Management

M2M tokens expire after approximately 120 minutes. For pipelines exceeding this window, wrap API calls in a lightweight refresh loop that catches 401 Unauthorized responses and re-authenticates. Never hardcode credentials; inject them via environment variables or a secrets manager.
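
A minimal sketch of such a refresh loop, using only the standard library. TokenManager and call_with_refresh are hypothetical names, not part of requests or the M2M API. The 110-minute TTL is an assumed safety margin under the roughly two-hour lifetime, and the sketch assumes an expired token surfaces as an HTTP 401 status on the response object.

```python
import time
from typing import Callable, Optional


class TokenManager:
    """Caches a token and refreshes it once the assumed TTL has elapsed."""

    def __init__(self, login: Callable[[], str], ttl_seconds: float = 110 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._login = login          # callable that performs the real login
        self._ttl = ttl_seconds      # assumed margin below the ~2 hour lifetime
        self._clock = clock          # injectable clock, mainly for testing
        self._token: Optional[str] = None
        self._issued_at = 0.0

    def get(self) -> str:
        """Return the cached token, logging in again if it is missing or stale."""
        if self._token is None or self._clock() - self._issued_at >= self._ttl:
            self.refresh()
        return self._token

    def refresh(self) -> str:
        """Force a new login and remember when it happened."""
        self._token = self._login()
        self._issued_at = self._clock()
        return self._token


def call_with_refresh(tokens: TokenManager, send: Callable[[str], object]):
    """Call send(token); on a 401 response, re-authenticate once and retry."""
    resp = send(tokens.get())
    if getattr(resp, "status_code", None) == 401:
        resp = send(tokens.refresh())
    return resp
```

With the functions from the listing above, the manager could be built as TokenManager(lambda: authenticate(session, username, password)), and each API call wrapped in a small function that takes the token and returns the response. Retrying only once keeps a genuinely invalid credential from looping forever.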

Rate Limit Compliance

The USGS enforces request quotas. The Retry configuration above backs off automatically on HTTP 429 responses. urllib3 does not sleep before the first retry and then waits backoff_factor * 2^(n-1) seconds before retry n, so backoff_factor=1.5 with total=5 yields delays of roughly 0s, 3s, 6s, 12s, and 24s; a Retry-After header sent by the server takes precedence over this schedule. If you consistently hit limits, reduce worker concurrency or schedule downloads during off-peak UTC hours.
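
To sanity-check a retry budget before deploying, the schedule can be approximated in a few lines. This is a sketch of urllib3's documented backoff formula, not a call into the library: backoff_schedule is a hypothetical helper, the 120-second cap mirrors urllib3's default maximum, and jitter and Retry-After headers are ignored.

```python
def backoff_schedule(backoff_factor: float, retries: int, backoff_max: float = 120.0) -> list:
    """Approximate the sleep (in seconds) before each retry attempt."""
    delays = []
    for n in range(1, retries + 1):
        # urllib3 retries once immediately, then doubles the delay each time
        delay = 0.0 if n == 1 else backoff_factor * (2 ** (n - 1))
        delays.append(min(backoff_max, delay))
    return delays


schedule = backoff_schedule(1.5, 5)
print(schedule)       # [0.0, 3.0, 6.0, 12.0, 24.0]
print(sum(schedule))  # 45.0 seconds of waiting in the worst case
```

The worst-case total is useful when sizing worker timeouts: a request that exhausts all five retries spends about 45 seconds sleeping on top of the time taken by the attempts themselves.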

Resumable & Partial Downloads

Large satellite archives frequently drop mid-transfer. Implementing HTTP Range headers enables true resumable downloads, but requires tracking byte offsets across process restarts. For most ETL workloads, chunked streaming combined with an idempotent file check (if dest.exists(): continue) provides sufficient fault tolerance without added complexity, provided a truncated file is never left under its final name. Writing to a temporary file and renaming it on success is the simplest way to guarantee that.
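
For cases where re-downloading from zero is too costly, a Range-based resume might look like the sketch below. resume_headers and resume_download are hypothetical helpers; session is assumed to be any object with a requests-style get() method, and the server is assumed to answer a Range request with 206 Partial Content when it supports resuming. The byte offset is simply the size of the ".part" file on disk, which survives process restarts without extra bookkeeping.

```python
from pathlib import Path


def resume_headers(part_path: Path) -> dict:
    """Build a Range header asking only for the bytes not yet on disk."""
    offset = part_path.stat().st_size if part_path.exists() else 0
    return {"Range": f"bytes={offset}-"} if offset else {}


def resume_download(session, url: str, dest_path: Path, chunk_size: int = 1024 * 1024) -> None:
    """Continue a transfer into '<name>.part' and rename it when finished."""
    part_path = dest_path.with_name(dest_path.name + ".part")
    headers = resume_headers(part_path)
    with session.get(url, headers=headers, stream=True, timeout=(10, 300)) as resp:
        resp.raise_for_status()
        # 206 means the range was honored; 200 means the full file is being resent
        mode = "ab" if headers and resp.status_code == 206 else "wb"
        with open(part_path, mode) as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    # Only a fully written file ever appears under the final name
    part_path.replace(dest_path)
```

If the partial file is already complete, a server will typically reject the range with a 416 status, which raise_for_status() turns into an exception, so callers may want to treat that case separately. Temporary download URLs can also expire between attempts, in which case a fresh URL has to be requested before resuming.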

Scaling for Geospatial ETL

When integrating this pattern into production data infrastructure, decouple metadata resolution from file transfer. Query scene metadata once, cache results in a lightweight database (SQLite/PostgreSQL), and distribute download workers across a queue system like Celery or Prefect. This prevents redundant API calls and allows horizontal scaling across multiple nodes.
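
The caching half of that design can be sketched with the standard library's sqlite3 module. The table layout, status values, and function names here are illustrative assumptions rather than a prescribed schema; the only fields taken from the search results are entityId and displayId.

```python
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the scene cache; pass a file path for persistence."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scenes (
               entity_id  TEXT PRIMARY KEY,
               display_id TEXT,
               status     TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    return conn


def _count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM scenes").fetchone()[0]


def cache_scenes(conn: sqlite3.Connection, scenes: list) -> int:
    """Store search results, skipping scenes already cached; returns rows added."""
    before = _count(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO scenes (entity_id, display_id) VALUES (?, ?)",
        [(s["entityId"], s.get("displayId")) for s in scenes],
    )
    conn.commit()
    return _count(conn) - before


def pending_scenes(conn: sqlite3.Connection) -> list:
    """Return entity IDs that still need to be downloaded."""
    rows = conn.execute(
        "SELECT entity_id FROM scenes WHERE status = 'pending' ORDER BY entity_id"
    )
    return [row[0] for row in rows]


def mark_done(conn: sqlite3.Connection, entity_id: str) -> None:
    """Record a finished download so later runs skip it."""
    conn.execute("UPDATE scenes SET status = 'done' WHERE entity_id = ?", (entity_id,))
    conn.commit()
```

A scheduler would call cache_scenes() once per search, hand the output of pending_scenes() to the download workers, and have each worker call mark_done() after a successful transfer. With several worker nodes, a server database such as PostgreSQL is the safer choice, since SQLite allows only one writer at a time.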

For teams evaluating broader ingestion strategies, the Bulk Downloading Satellite Imagery guide covers alternative protocols (S3, STAC, OGC API) and storage optimization techniques. Always validate checksums post-download using hashlib to ensure archive integrity before feeding data into raster processing pipelines.
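
A streaming checksum keeps verification as memory-friendly as the download itself. The sketch below assumes you already have an expected digest from a trusted source; where that value comes from, and which algorithm it uses, depends on the data provider and is not specified here. file_digest and verify_file are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large archives never sit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the computed digest with an expected hex string, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Running verify_file() before renaming or registering an archive means a corrupted transfer is caught at ingestion time rather than surfacing later as an unreadable raster.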

The requests library handles connection reuse and TLS negotiation efficiently when configured correctly. Pair it with urllib3 retry policies and chunked I/O, and you can reliably automate terabyte-scale EarthExplorer downloads without violating platform constraints or corrupting geospatial assets.