Imagine there were a database containing all the world’s Earth science data, from those characterizing hazards like earthquakes and volcanoes to those recording changes in the atmosphere and ocean. This hypothetical resource had been built up, little by little, through untold piecemeal contributions from individuals and small teams of researchers, many of the observations gathered from sites now inaccessible or destroyed. Then, after decades (centuries, even) of collection, a catastrophic flood took the server offline and wiped out the data.
How long would it take to replace these data? How much would it cost to do so? And how much cheaper would it have been to instead invest in better safeguarding this critical scientific infrastructure?
Versions of this hypothetical database, containing not all but, nevertheless, substantial collections of the world’s Earth science data, are real (in fact, there are dozens of them), and the threat of their demise is real as well. The looming catastrophes—funding cuts, software obsolescence, and loss of community support—may sound more mundane than massive flooding but may pose a greater risk and be no less damaging.
These sorts of database disasters can be subtle and are challenging to prevent because maintaining research databases is a complex, interdisciplinary practice requiring technical expertise in cyberinfrastructure development, deep engagement with disciplinary experts, and sustained political and financial support. Prevention of database loss is also challenging because problems like software obsolescence are both slow moving and omnipresent. The destruction they cause may go unnoticed until it’s too late.

Scientific databases are particularly valuable—and particularly vulnerable—in the Earth and environmental sciences (EES). Whereas in fields like astronomy and high-energy physics, many researchers may primarily use data collected by a single high-powered tool (e.g., the Hubble Space Telescope or the Large Hadron Collider), EES researchers often must go through a laborious process of integrating data from various sources to enable large-scale studies. Indeed, much EES research depends on the reuse and integration of heterogeneous, “long-tail” datasets—those collected over years to decades by small, distributed teams with relatively small amounts of funding (Figure 1, top) [Heidorn, 2008].
Such data are typically accessible only through community-curated data resources (CCDRs) [Williams et al., 2018a], which are researcher-led data infrastructures where scientists contribute, curate, and integrate data with the goal of studying phenomena such as the global carbon cycle and biodiversity dynamics. Expert data curation and stewardship of these long-tail data are critical to advancing understanding of Earth system dynamics across scales. However, demonstrating these resources’ value to science and society can be difficult because they are built over long periods of time and by many people via curatorial labor that is often invisible to standard academic metrics of productivity.
Small Teams Build a Billion-Dollar Asset
As a case study to illustrate the hidden value of expert-curated datasets in EES, we calculated the simplest metric of value, replacement cost, for the Neotoma Paleoecology Database [Williams et al., 2018b]. Neotoma is a CCDR that gathers and shares datasets of species abundances, as recovered from the sedimentary record, that provide long records of past ecological dynamics spanning centuries to millions of years.
More than 1,600 papers citing Neotoma data have themselves been cited nearly 50,000 times. Among many other timely research applications, the database is widely used to study the effects of human activity and changing climates on species and ecosystems, which can improve forecasts of ecological responses to current climate trends. Neotoma data have shown, for example, that current rates of ecological change are as fast as or faster than those that accompanied widespread ecological transformations at the end of the last ice age roughly 11,700 years ago [Mottl et al., 2021].
Each dataset in Neotoma was collected and submitted by a small team of scientists, typically with 2 years or more elapsing between fieldwork and publication. The database is continually growing and currently includes 34,100 datasets (Figure 1, bottom) comprising analyses of 450,000 individual samples from 155 countries and 20,800 field sites. Neotoma also stores more than 44,000 radiometric sample ages, which provide temporal context for sedimentary archives.
The total data volume of Neotoma is small (about 6 gigabytes), but its replacement cost is quite high: at least $1.5 billion (Figure 2). We calculated this overall amount using estimates informed by decades of experience building and reviewing grant budgets for fieldwork, sample preparation and analysis, labor, and overhead. These estimates include $3,000 per site for fieldwork, $150 per sample to cover expert time and analytical costs, $350 per radiometric date, and $45,000 per site for labor (i.e., 1 year of a graduate student's time to analyze and publish data), all increased by 55% to account for institutional overhead costs. Of course, exact costs can vary; overall, though, these estimates are conservative because, for example, graduate students often need more than 1 year per site, principal investigator salaries are not included in the estimates, and infrastructure-dependent research such as ice core or oceanographic drilling has higher costs.
The fiscal year 2022 budget of the U.S. National Science Foundation (NSF) Sedimentary Geology and Paleobiology (SGP) Program, the major U.S. funder of this research, was $10.4 million. At that level, it would take SGP roughly 145 years to rebuild Neotoma’s archives—assuming all program resources were devoted to this activity. In fact, many records would be impossible to re-collect because some sites are now destroyed or inaccessible. This example underscores a central characteristic of long-tail scientific data: At best, they are extraordinarily expensive to replace, and at worst, they are irreplaceable.
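For readers who want to retrace the arithmetic, the minimal sketch below (written in Python purely for illustration) combines the per-unit cost estimates and database holdings quoted above. All inputs are approximate, and the itemized total comes out slightly above the conservative $1.5 billion floor cited in the text.

# Back-of-the-envelope replacement cost for Neotoma, using the approximate
# per-unit costs and holdings figures quoted above.
sites = 20_800       # field sites
samples = 450_000    # individual samples
dates = 44_000       # radiometric sample ages

fieldwork = sites * 3_000     # $3,000 per site for fieldwork
analysis = samples * 150      # $150 per sample for expert time and analytical costs
dating = dates * 350          # $350 per radiometric date
labor = sites * 45_000        # ~1 graduate-student year of labor per site

direct = fieldwork + analysis + dating + labor   # ~$1.08 billion
total = direct * 1.55                            # plus 55% institutional overhead, ~$1.7 billion
print(f"Estimated replacement cost: ${total / 1e9:.2f} billion")

# Time to rebuild if the entire FY2022 NSF SGP budget ($10.4 million) were
# devoted to re-collection, using the conservative $1.5 billion floor.
print(f"Years at the SGP budget: {1.5e9 / 10.4e6:.0f}")   # ~144, i.e., roughly 145 years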

The Pricelessness of Community Curation
Our direct replacement estimate for Neotoma is also conservative compared to the resource’s true value because it fails to capture the worth that community curation adds. For instance, the use of community-developed standards and common vocabularies in Neotoma (and other CCDRs) helps tame the inherent heterogeneity of long-tail data and makes Neotoma’s data fundamentally more reusable than data deposited in generalist repositories.
Whereas generalist repositories, which accept any kind of data, typically lack strict curatorial standards, domain-specific CCDRs like Neotoma often reformat data into a common structure. They also describe data using standardized or controlled vocabularies for species names, measurement units, and other variable names. Controlled vocabularies are essential for large data syntheses using geoscientific data from many sites. For example, if two scientists use different versions of a species name for the same fossil specimen, then estimates of past biodiversity will be artificially inflated.
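As a minimal, hypothetical illustration (a toy synonym table and made-up site records, not Neotoma's actual vocabulary or schema), the short Python sketch below shows how mapping name variants to a single canonical name keeps a cross-site richness count from being inflated by synonymy.

# Two hypothetical sites record the same spruce taxon under different names.
canonical = {
    "P. mariana": "Picea mariana",
    "Picea mariana (Mill.)": "Picea mariana",
}

site_a = ["Picea mariana", "Pinus strobus"]
site_b = ["P. mariana", "Picea mariana (Mill.)"]

raw_richness = len(set(site_a + site_b))  # 4 apparent "species": artificially inflated
harmonized = {canonical.get(name, name) for name in site_a + site_b}
print(raw_richness, len(harmonized))      # 4 versus 2 after harmonization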
Curation by community experts also builds trust with researchers, who will not reuse data they don't trust [Yoon, 2017]. Neotoma's reputation as a well-curated resource is thus as valuable as the technical work of data curation itself. The replacement cost estimate further omits the value of time and resources saved by researchers and others as these data are repeatedly reused: Each time a study reuses Neotoma data, this value increases, reflecting the mounting savings for teams not needing to collect, aggregate, and harmonize the data themselves. Clearly, these intangibles, though harder to quantify than field, labor, and analytical costs, carry great benefit.
Better Support for the Long Tail
Neotoma is one of many vital CCDRs in the Earth and environmental sciences. Other examples include AmeriFlux and FLUXNET; the Botanical Information and Ecology Network (BIEN); the Global Biodiversity Information Facility; Interdisciplinary Earth Data Alliance databases like SESAR; the Paleobiology Database; the Nutrient Network (NutNet); Water Isotopes; and VertNet. Each of these has its own potential replacement costs, some lower and some higher than Neotoma's. Each also offers its own case study in resilience and precarity.
Although CCDRs have, in some cases, persisted across multiple generations of scientists—indeed, they are where many EES researchers get essential data science training—their financial and technical sustainability is an ever-present concern. Many continue to be or were at least initially funded with federal grants, but their persistence and growth also depend—sometimes to a large extent—on grassroots, volunteer labor by experts with highly specialized and in-demand skill sets. This reliance on specialized, uncompensated labor introduces precarity because traditional metrics of productivity and impact in academic careers continue to undervalue, and thus disincentivize, the essential work of data science and stewardship [Merow et al., 2023; Goring et al., 2014; Williams et al., 2018a].
How might society better support these crucial, multigenerational data resources and the people who maintain them?
CCDRs must be recognized as among the most effective ways to compound the value of long-tail scientific research: The more they're used, the more valuable they become and the more downstream costs they save scientists and funders. The scientific community also needs to acknowledge that both the human and technical components of CCDRs are critical to these resources' long-term functionality and that both components need sustained support. The roles of data stewards and curators must be formally recognized as important professional work.
Most important, CCDRs should be funded as sustained infrastructure. The biggest challenge to maintaining these resources is the current system by which they’re supported on a short-term, grant-by-grant basis on 3- to 4-year cycles. Ideally, federal grant models for CCDRs should both lengthen the time horizon of funding cycles to 5–10 years and increase funding levels to better support infrastructure and personnel for core functionality.
NSF programs such as Geoinformatics and Infrastructure Capacity for Biological Research have moved in this direction by enabling longer support horizons for data resources. Still, sustainability and tight budgets for CCDRs remain significant issues. More robust support for infrastructure and personnel can be complemented by science-driven campaigns on typical 3-year grant cycles that both add data to these resources and use them for research at spatiotemporal scales not achievable by individual site-level studies.
Diversifying funding streams for CCDRs could also further enhance their sustainability [Virapongse et al., 2024]. One possibility is to implement pay-to-submit models, in which researchers pay a fee to archive their data. These models increasingly sustain generalist repositories (though, often, the fees are hidden from researchers) but unfortunately tend to work less well for CCDRs. Although CCDRs' data curation requirements add value, they rely both on work by paid skilled experts and on donated time, goodwill, and expertise from scientists. These elements may necessitate data submission fees that are cost prohibitive for researchers and make it harder to achieve the economies of scale that work for generalist repositories.
Building data pipelines that carry data from the point of collection to availability in a community data resource would reduce curatorial effort in CCDRs and thus lower costs. Still, work is needed to explore reasonable fee-for-service models that account for the value of time donated to CCDRs while offering generous waiver policies for individuals and institutions that simply don't have the capacity to contribute either time or money.
Opportunities also exist for private foundations to engage more in supporting community-led scientific data resources as loci of transformative scientific discovery, artificial intelligence development, and next-generation scientific training. And apart from funding bodies like NSF, other federal agencies that increasingly rely on (and sometimes mandate the use of) CCDRs for data archiving and sharing should also develop sustained service agreements and funding support models for these resources.
CCDRs in the Earth and environmental sciences assemble long-tail data, keep them accessible, and compound their value, thereby powering critical scientific research into how our planet is changing and into the societal and ecosystem effects of that change. By supporting these resources inadequately, we risk losing them and having to rebuild them from scratch, an extremely costly, yet preventable, outcome.
References
Goring, S. J., et al. (2014), Improving the culture of interdisciplinary collaboration in ecology by expanding measures of success, Front. Ecol. Environ., 12, 39–47, https://doi.org/10.1890/120370.
Heidorn, P. B. (2008), Shedding light on the dark data in the long tail of science, Library Trends, 57(2), 280–299, https://doi.org/10.1353/lib.0.0036.
Merow, C., et al. (2023), Better incentives are needed to reward academic software development, Nat. Ecol. Evol., 7, 626–627, https://doi.org/10.1038/s41559-023-02008-w.
Mottl, O., et al. (2021), Global acceleration in rates of vegetation change over the past 18,000 years, Science, 372, 860–864, https://doi.org/10.1126/science.abg1685.
Virapongse, A., J. Gallagher, and B. Tikoff (2024), Insights on sustainability of Earth science data infrastructure projects, Data Sci. J., 23, 14, https://doi.org/10.5334/dsj-2024-014.
Williams, J. W., et al. (2018a), Building open data: Data stewards and community-curated data resources, PAGES Mag., 26, 50–51, https://doi.org/10.22498/pages.26.2.50.
Williams, J. W., et al. (2018b), The Neotoma Paleoecology Database, a multiproxy, international, community-curated data resource, Quat. Res., 89(1), 156–177, https://doi.org/10.1017/qua.2017.105.
Yoon, A. (2017), Data reusers’ trust development, J. Assoc. Inf. Sci. Technol., 68(4), 946–956, https://doi.org/10.1002/asi.23730.
Authors’ Note: This article was initially drafted in 2024, before the Trump administration’s dramatic cuts to multiple federal scientific agencies. Although we have not updated it to reflect on these changes, we believe our core argument is all the more timely: Scientific collections and data repositories are among the most vital, valuable, and cost-effective ways of supporting science, and they need our continued funding and support.
Author Information
Andrea K. Thomer (athomer@arizona.edu), College of Information Science, University of Arizona, Tucson; John W. Williams, Department of Geography and Center for Climatic Research, University of Wisconsin–Madison; Simon Goring, Department of Geography, Center for Climatic Research, and Data Science Institute, University of Wisconsin–Madison; and Jessica L. Blois, Department of Life and Environmental Sciences, University of California, Merced