In October 1971 a small group of crystallographers gathered at Cold Spring Harbor and agreed to pool their raw data. Seven protein structures were deposited into what became the Protein Data Bank. The atomic coordinates travelled between its two founding institutions on punched cards.[1] The format was primitive—myoglobin alone required more than a thousand cards to exchange—and the incentive to share almost non-existent.
Fifty years later, the PDB holds over 200,000 structures,[2] growth driven by a textbook network effect operating not on users but on deposited objects. Each new structure increases the marginal value of every prior one, because composability at shared atomic coordinates lets downstream researchers cross-reference and reanalyze work deposited for unrelated reasons.[3] A gravity well opened under structural biology that no working crystallographer could escape.
Spatial science—the loose cluster of disciplines that study phenomena anchored to physical coordinates, from ecology and hydrology to urban demographics and spatial economics—has orders of magnitude more data, users, and funding than 1971 crystallography ever had, but no Protein Data Bank. The absence of a compositional layer for place-specific knowledge is an institutional coordination failure, not a technological one.
The wall
Satellite data lives in Copernicus, distributed as Sentinel-2 MGRS tiles.[4] Economic data lives in Eurostat, organized by NUTS administrative boundaries at spatial scales one to three orders of magnitude larger than any single ecological project.[5] Biodiversity observations live in GBIF, indexed as taxon occurrence records.[6] Grid carbon intensity lives across ENTSO-E (indexed by bidding zone)[7] and the EIA (by balancing authority)[8]; PJM alone publishes over 10,000 locational marginal prices every five minutes.[9]
These six datasets all describe the same river basins, but they do not fit together. The questions this blocks are exactly the ones most worth asking. Does a forest restoration project in a specific watershed actually deliver the climate benefit it claims, once you account for what the local economy is doing, what the weather has done over the last decade, and how the land is being managed? What are the second-order and systemic effects of different spatial policies? Hundreds of studies have answered pieces of that question somewhere in the world. None of them can talk to each other.
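A minimal sketch of what "the same place" looks like to each of these systems makes the mismatch concrete. Every identifier below is illustrative, not looked up from the live services, and the resolver stubs stand in for the crosswalk each project currently has to hand-build:

```python
# Native spatial keys for (roughly) the same stretch of the Ebro basin.
# All identifiers are illustrative, not queried from the live services.
native_keys = {
    "copernicus": {"key": "MGRS 30TYL",              "unit": "100 km satellite tile"},
    "eurostat":   {"key": "NUTS ES24",               "unit": "administrative region"},
    "gbif":       {"key": "occurrenceId 4021988321", "unit": "point observation"},
    "entsoe":     {"key": "bidding zone ES",         "unit": "national power market"},
}

# The only join key these systems share is the coordinate itself, and none
# of them exposes it as a primary index. Every cross-dataset question starts
# with a hand-built crosswalk like this one (resolvers left as stubs):
def to_mgrs(lat: float, lon: float) -> str:
    raise NotImplementedError  # e.g. via the `mgrs` package

def point_in_nuts(lat: float, lon: float) -> str:
    raise NotImplementedError  # point-in-polygon over Eurostat NUTS geometries

def crosswalk(lat: float, lon: float) -> dict:
    return {"mgrs": to_mgrs(lat, lon), "nuts": point_in_nuts(lat, lon)}
```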
Why the obvious technical answers don't close the gap
A careful reader will object that spatial data already has interoperability—Open Geospatial Consortium standards, INSPIRE, GEOSS, the SpatioTemporal Asset Catalog, Google Earth Engine (GEE), Microsoft's Planetary Computer.[10] These are real and well-funded, yet they do not close the gap. The distinction that matters is between standards for exchanging data and a corpus of stably-identified objects that accretes over time. OGC-family standards are the former: they let systems speak the same format. The PDB is the latter: over 200,000 deposited objects with permanent identifiers that any researcher can cite and reanalyze, in perpetuity, regardless of the format the original was in. GEE, the most ambitious effort in the space, is explicitly framed by its team as a planetary-scale geospatial analysis platform—a computation substrate over public rasters—not a deposition registry of stably-identified ecological claims at specific coordinates over time.[11] You can run beautiful analyses on GEE. You cannot cite a place-object in GEE the way a structural biologist cites PDB entry 1HHO.
An institutional failure, not a technical one
Michael Nielsen, in Reinventing Discovery, tells the canonical story of the Bermuda Principles—the 1996 funder pact that established a norm of 24-hour public release for Human Genome Project data.[12] Jorge Contreras later documented that these principles were achieved not primarily through consensus but through funder leverage.[13] The NIH, the US Department of Energy, and the Wellcome Trust together funded over 90% of the initial sequencing work, and in February 1996 they made 24-hour release a condition of the money.[14] The commons came into being because three institutions had enough concentrated financial power to compel it.
Spatial science has no equivalent concentration. No single institutional body can convene every field that touches "place"—geography, ecology, remote sensing, urban planning, hydrology, climate economics—and tell them that composable deposition will happen or grants will stop. And no single discipline owns "place" the way crystallographers owned protein structure. Structural biology had an obvious atomic unit: a solved protein. A coastal bioregion is claimed partially by hydrologists, ecologists, municipal planners, climate modelers, and the people who live in it. None will accept a primitive they did not help define.
The deeper question is ontological. A Protein Data Bank is possible because a solved protein structure is the same object to every laboratory that works on it. A place is not the same object to the hydrologist, the ecologist, and the municipal planner, and there is no reason to expect a single canonical representation to emerge top-down from any one of them. The open question is whether a working atomic unit for place can instead be constructed bottom-up—through distributed validation, multi-agent protocols for object identity, or something else entirely. Paul Edwards, writing on climate data, called the top-down path infrastructural globalism: shareable data requires institutions to first agree on the models that make it comparable.[15] The bottom-up path has no name yet.
A minimal demonstration
Before anyone commits to building the missing infrastructure, the compositional lens has to pass a cheap demonstration. What follows is an illustrative case-study protocol, not a statistical study. The question it answers is not "what percentage of environmental projects fail" but "does composing independent spatial layers catch things any single layer would miss, with enough force that the pattern is visible in a handful of cases?"
- Method. Take five Mediterranean forestry projects from the Verra carbon registry that publish usable area geometry through their public project design documents or project area data access (PADA) files. For each project, compose three independent spatial layers over the crediting period using open tooling: (1) Dynamic World 10-metre land-cover trajectory against the project's claimed avoided loss;[16] (2) a counterfactual built from climatically and topographically similar unprotected parcels within 50 km, following West et al. (2023), which is the current state of the art for remote-sensing-based REDD+ verification;[17] (3) ERA5-Land 9-kilometre temperature and precipitation anomalies against the 1991–2020 baseline,[18] to flag crediting periods whose outcomes are better explained by favourable weather than by the intervention itself. For each project, produce a short narrative verdict: how does the reported outcome square with the land-cover trajectory, the counterfactual, and the climate window, and where do the layers disagree? Minimal code sketches of layers (1) and (3) follow this list.
- What would make this informative. If at least two of the five cases show cross-layer discrepancies that no single layer would have flagged alone, the compositional lens is doing work that single-domain verification cannot. If all five look clean under composition, either the lens is not adding value for this class of project or single-domain verification is already adequate. Both outcomes are publishable. Both redirect effort.
- Scope caveats. The five-case illustration is deliberately not a statistical test. A properly powered study of the same question would require ten times as many cases—out of scope for an essay, and in any case blocked by a bigger structural obstacle: Verra PADA access requirements, which hide project geometry behind a paywall. That a credible compositional audit cannot be run at all on most registry listings is itself representative of the institutional bottleneck.
- Why this is newly feasible. Composition across incompatible spatial formats was, until recently, prohibitively expensive. Open geospatial tooling (STAC, stackstac, the Planetary Computer public catalog) combined with pre-classified products like Dynamic World and ERA5-Land has lowered the floor enough that the case-study protocol is roughly one person-week of work in 2026, once project geometry is in hand. The bottleneck is no longer computational: a determined vibe coder can run the whole analysis solo.
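As a concreteness check on layer (1), here is a minimal sketch of the land-cover trajectory step. The STAC endpoint and collection id are placeholders, not verified catalog entries (Dynamic World's canonical home is Earth Engine, as GOOGLE/DYNAMICWORLD/V1); the label asset name and the trees class index follow Dynamic World's published conventions but should be checked against whichever catalog serves the product.

```python
# Layer (1) sketch: annual tree-cover fraction over a project polygon.
# STAC_API and COLLECTION are placeholders; TREES = 1 is Dynamic World's
# published class index for "trees".
import stackstac
from pystac_client import Client

STAC_API = "https://example.com/stac/v1"  # placeholder endpoint
COLLECTION = "dynamic-world"              # placeholder collection id
TREES = 1

def annual_tree_fraction(project_geojson: dict, year: int, epsg: int = 32630) -> float:
    """Share of project-area pixels whose dominant label for `year` is trees."""
    items = Client.open(STAC_API).search(
        collections=[COLLECTION],
        intersects=project_geojson,
        datetime=f"{year}-01-01/{year}-12-31",
    ).item_collection()
    # Lazy (time, band, y, x) array over the project's bounding box; a real
    # run would also mask to the polygon itself rather than its bbox.
    stack = stackstac.stack(items, assets=["label"], epsg=epsg, resolution=10)
    labels = stack.squeeze("band").compute()
    # Per-pixel frequency of the trees label across the year's scenes,
    # thresholded at 0.5 as a cheap stand-in for the per-pixel mode.
    tree_freq = (labels == TREES).mean(dim="time")
    return float((tree_freq > 0.5).mean())

# trajectory = {y: annual_tree_fraction(geom, y) for y in range(2016, 2024)}
```

Comparing the resulting trajectory against the project's claimed avoided loss is then a plot and a paragraph, not an engineering effort.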
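Layer (3) is even cheaper. A minimal sketch, assuming ERA5-Land monthly means have already been downloaded from the Copernicus Climate Data Store as NetCDF (the filename is a placeholder; t2m and tp are ERA5-Land's standard variable names for 2 m temperature and total precipitation):

```python
# Layer (3) sketch: was the crediting window anomalously warm or wet
# relative to the 1991-2020 baseline?
import xarray as xr

ds = xr.open_dataset("era5_land_monthly.nc")  # placeholder file; t2m in K, tp in m

def climate_anomaly(lat: float, lon: float, start: str, end: str) -> dict:
    """Mean temperature/precipitation anomaly over [start, end] versus the
    1991-2020 monthly climatology, at the cell nearest the project centroid."""
    cell = ds.sel(latitude=lat, longitude=lon, method="nearest")
    clim = cell.sel(time=slice("1991", "2020")).groupby("time.month").mean("time")
    anomalies = cell.groupby("time.month") - clim  # subtract month-matched means
    window = anomalies.sel(time=slice(start, end))
    return {
        "t2m_anomaly_K": float(window["t2m"].mean()),
        "tp_anomaly_m": float(window["tp"].mean()),
    }

# A strongly positive tp anomaly over the crediting window, e.g.
# climate_anomaly(41.6, -0.9, "2016", "2023"), flags outcomes that
# favourable weather could explain without the intervention.
```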
If the case study does not reveal a pattern, the bottleneck is upstream of composition. This is also a useful result. Keeping the demonstration small is how the wrong version stays cheap.
What the outcome could mean
The argument applies beyond voluntary carbon markets, but environmental verification is the cheapest domain to run these tests. The same bottleneck applies in public health (disease-environment-economic composition at neighborhood scale), urban planning, climate adaptation, and indigenous land management. It also blocks the emergence of heterodox fields—ecological economics, political ecology, socio-hydrology, the reparative economics of bioregional restoration—that need to reason about ecological condition, economic activity, and governance in the same coordinate frame. Today, they are limited to the few researchers who can personally reconcile enough datasets to make a point.
If the case study shows composition catching discrepancies that single-layer audits would miss, the implication is that a large share of environmental capital allocation is being evaluated with structurally incomplete information, and the missing infrastructure is worth building—as a public good, not a private product. Not a new database: a spatial index that maps existing open datasets to common coordinates—the same way the PDB did not generate new structures but made existing ones composable, and in doing so grew from seven deposited objects to more than two hundred thousand.
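What would one entry in that index look like? A minimal sketch of one possible shape, with illustrative field names rather than a proposed standard:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PlaceObject:
    """One deposited, citable claim about a place: a stable identifier plus
    pointers into existing open datasets, never a copy of the data itself."""
    place_id: str          # permanent identifier, citable like a PDB entry
    footprint_wkt: str     # canonical geometry (WGS84, WKT)
    valid_interval: tuple  # (start, end) the claim covers
    sources: dict = field(default_factory=dict)
    # e.g. {"land_cover":   "stac://<catalog>/<collection>/<item>",
    #       "economics":    "eurostat:nuts/ES24/<table>",
    #       "biodiversity": "gbif:occurrence/<id>"}
```

The design choice that matters is the one the PDB made: the registry holds identity and coordinates, and everything else is a pointer.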
In every successful case, an institutional forcing function arrived from somewhere. For crystallography it was the conviction of a small group at Cold Spring Harbor who controlled the only structures worth depositing. For genomics it was Bermuda funder leverage. For place, the analogous concentration is beginning to assemble. Developed-country climate finance reached USD 115.9 billion in 2022, the first year the USD 100 billion goal was met, with multilateral channels carrying the largest share.[19] Multilateral development banks alone delivered USD 74.7 billion to low- and middle-income economies in 2023;[20] the Green Climate Fund, the major development banks, and EU instruments under the Green Deal account for the overwhelming majority of that multilateral flow. The Green Climate Fund's Integrated Results Management Framework already requires geospatially disaggregated outcome measurement.[21] If even one of these funders made composable spatial verification a condition of disbursement, the result would be the Bermuda moment place has never had: concentrated capital paired with a verification demand, forcing a commons into existence because the cost of not having one is suddenly higher than the cost of building it.
Ecofrontiers is an applied research agency working at the intersection of spatial economics, environmental computing, and AI coordination. Get in touch if you're working on any of this.