Imagine strolling through your favourite forest on a clear Sunday morning. The air is moist and fresh, you hear birds singing in the trees and smell that distinct scent of the forest after a light rain. Water drops from light green leaves that burst from thick tree buds, and you can somehow sense the energy that is flowing through the forest.

Have you ever wondered what processes govern these energy and matter fluxes in the forest, or the date of bud burst in leaf-shedding trees? My fellow forest modellers and I are trying to build tools that help us find answers. Understanding these and other ecosystem processes and their interactions is essential for assessing the impacts of climate change on forests, because forests are directly affected by both the current and the average state of the atmosphere, and both are shifting under climate change.



Picture by Blake Cheek

Forests play a crucial role in the global carbon and water cycles, are hotspots for biodiversity and provide a multitude of ecosystem services, such as timber, air filtration and protection against floods, rockfall or desertification. They sequester carbon dioxide from the atmosphere through photosynthesis and store it in biomass. From there, the carbon is either released back into the atmosphere during decomposition or stored further in forest soils or timber products. Many species are adapted to forest ecosystems, and intact forests provide habitat for a myriad of species, from tiny soil-dwelling insects to large mammals like the tiger. Climate change is already impacting forests in many ways, and we ask ourselves what impacts we should expect in the future and how to design appropriate adaptation and mitigation measures so that our forests stay intact while also providing ecosystem services for society.

To understand the forest dynamics under climate change that result from ecosystem processes and their interactions over time, we build models that integrate our current knowledge of what is going on in the forest. These forest models are simplified abstractions of real forests. A considerable number of forest models have been developed in recent decades. They focus on different objectives, but many of them are used to assess climate change impacts on forests.

Unfortunately, the current knowledge integrated in these models is incomplete. Therefore, real forest dynamics differ from the virtual forest dynamics in the models. Uncertain input data (e.g. climate data) and parameter uncertainty, i.e. uncertainty about single values describing specific aspects in the model, add to this difference along the whole modeling chain, from global general circulation models to local forest models. This is an issue because we want to use these models to infer real-world forest dynamics. The issue becomes smaller if we know about the offset between real-world forests and virtual model forests, and in which contexts (e.g. forest types) or for which aspects of the forest it is larger or smaller, because we can then interpret the model outcomes with these offsets in mind. Furthermore, understanding the offsets better lets us work on improving our models. Hence, we need information on the size of the offset between real forests and the virtual forests represented in our models.

How to assess what we do not know?

To assess what we do not know about forest dynamics, or rather, what our models do not represent accurately and realistically, we evaluated 13 widely used, state-of-the-art, stand-scale forest models that have been used in Europe. We used field measurements of forest structure and tower-based measurements of carbon and water fluxes over multiple decades in different environments at nine typical European forest stands as the reference for comparison with the model output (Reyer et al. 2020). The models had to be set up and run at the nine sites according to a harmonized framework to ensure comparability of the outputs. Therefore, we included only model simulations that followed the framework of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP). This harmonized framework defined the forest structure at the start of the simulation, the soil conditions, driving climate data from nearby meteorological stations, atmospheric CO2 concentration, nitrogen deposition and the forest management as deduced from observations of stem numbers over time. The models were run over 13 to 63 years and we then looked at three forest structure variables, basal area (BA), average change in mean diameter at breast height (DBHinc) and average change in mean height (Hinc), as well as four carbon and water flux variables, namely gross primary productivity (GPP), net ecosystem exchange (NEE), ecosystem respiration (Reco) and actual evapotranspiration (AET) (Fig. 1 & Fig. 2). In addition to the individual model outputs we also derived the ensemble mean predictions from the model outputs to check how the ensemble mean would perform in relation to individual models.
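As a rough illustration of how an ensemble mean can be derived from individual model outputs, here is a minimal Python sketch. It assumes that each model provides one value per year and that the ensemble mean for a year is averaged over whichever models report that year; the study's actual procedure may differ. The function name, model names and numbers are hypothetical and not taken from the study.

```python
from statistics import mean


def ensemble_mean(predictions):
    """Average yearly predictions across models.

    `predictions` maps a model name to a {year: value} dictionary.
    Assumption: a year is averaged over the models that report it,
    because not every model covers every year.
    """
    years = sorted({year for series in predictions.values() for year in series})
    return {
        year: mean(series[year] for series in predictions.values() if year in series)
        for year in years
    }


# Hypothetical annual GPP values for one site (illustrative numbers only).
predictions = {
    "model_a": {2000: 1500.0, 2001: 1600.0},
    "model_b": {2000: 1700.0, 2001: 1400.0, 2002: 1550.0},
}

ens = ensemble_mean(predictions)
for year, value in ens.items():
    print(year, value)
```

The ensemble mean can then be evaluated against the observations in exactly the same way as any single model.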

After running the models in the harmonized framework and formatting all the output data, we started comparing the reference data with the model outputs. Getting all the model output data into shape, however, was a considerable challenge that took more than a year, as it is always complicated to make datasets compatible across multiple models. For example, the models could not all simulate the same sites or time periods because of their structure or parameters, and they provided neither the same set of variables nor the same temporal resolution.

One task that caused a lot of headaches throughout this study was therefore the processing of the model output data before the evaluation against reference data could start.

We then tested the models' performance on these sites for the seven variables in three dimensions: realism of environmental responses, general applicability and accuracy of local predictions.

These three dimensions of model performance are based on a theory of model building introduced by Levins (1966). The theory stipulates that no model can maximise all three aspects of realism, generality and accuracy at once. Realism is the representation of all causal relationships that are significant for the examined phenomenon, i.e. how well the model resembles causal relationships between the ecosystem components. Generality is the usefulness in various temporal and spatial contexts, i.e. how well the model performs across the whole range of sample applications. Accuracy is the goodness-of-fit of the model, i.e. how well model outputs agree with observations. Operationalising this conceptual theory proved to be difficult, but we could approximate these dimensions of model performance with three heuristic metrics: realism of environmental responses, general applicability and accuracy of local predictions. Based on the theory by Levins (1966), we hypothesized that there is no “silver bullet” model that maximizes all three model performance dimensions, but that the models rather complement each other.

In our approach, realism of environmental responses is the agreement of modeled and observed relationships between environmental drivers (temperature, radiation, vapor pressure deficit) and GPP, since GPP is sensitive to several interacting environmental drivers. General applicability is the share of tree species that are covered in the model, taking into account the species’ abundance across Europe; it is hence independent of the model simulations. For the accuracy of local predictions we computed multiple metrics describing different aspects of the disagreement between predictions and observations to get a better idea of where the disagreement might originate. A detailed explanation of these metrics would go beyond the scope of this article, but in general they describe four kinds of information: an overall error (mean squared deviation); a systematic error, i.e. constant over- or underestimation (squared bias); random errors, i.e. errors that do not result from constant over- or underestimation but are positive or negative by chance (lack of correlation); and patterns in the residuals, i.e. errors that grow or shrink with the size of the variable (non-unity slope) (Gauch, Hwang, and Fick 2003). Furthermore, we followed a standardization and aggregation procedure to make the metrics comparable over models, sites and variables. For more detailed information on the methods for deriving the metrics as well as the standardization and aggregation procedure, see Mahnken et al. (2022).
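To make the four accuracy components more tangible, the following Python sketch computes the decomposition described by Gauch, Hwang, and Fick (2003), in which the mean squared deviation (MSD) is the sum of squared bias (SB), non-unity slope (NU) and lack of correlation (LC). It is only an illustration: the function name and the example numbers are hypothetical, and the standardization and aggregation steps used in the study are not reproduced here.

```python
def msd_components(predicted, observed):
    """Split the mean squared deviation into SB + NU + LC.

    The slope is that of the least-squares regression of the
    observations on the predictions.
    """
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two equally long series with at least 2 values")

    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    dev_p = [p - mean_p for p in predicted]
    dev_o = [o - mean_o for o in observed]

    var_p = sum(d * d for d in dev_p) / n  # mean square of prediction deviations
    var_o = sum(d * d for d in dev_o) / n  # mean square of observation deviations
    cov = sum(a * b for a, b in zip(dev_p, dev_o)) / n
    if var_p == 0 or var_o == 0:
        raise ValueError("both series must vary")

    slope = cov / var_p                 # observed regressed on predicted
    r_squared = cov ** 2 / (var_p * var_o)

    return {
        "MSD": sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n,
        "SB": (mean_p - mean_o) ** 2,       # systematic error
        "NU": (1 - slope) ** 2 * var_p,     # pattern in the residuals
        "LC": (1 - r_squared) * var_o,      # random error
    }


# Hypothetical predictions and observations (illustrative numbers only).
parts = msd_components([2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 2.0, 6.0])
print(parts)
```

Because the three components add up to the overall error, their relative sizes show whether a model mainly suffers from a constant offset, a distorted response or scatter.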

What we do not know, and what we know

Comparing model outputs with reference data, we found that multiple models are available in Europe that excel according to our three dimensions of model performance.

Realism of environmental responses

Observed relationships of daily GPP to temperature, radiation and vapor pressure deficit followed plausible patterns for all models, while the distinct patterns differed from site to site (Figure 3). Increasing temperature and increasing radiation were related to increasing daily GPP, except for temperature relationships at higher temperature ranges in Bily-Kriz, while an increase in vapor pressure deficit (i.e. increasingly dry conditions) was related to decreasing daily GPP. Most models were able to reproduce these observed patterns. Distinct site-specific patterns, however, were not predicted well at all sites by all models. Strong non-linear patterns were observed for the temperature relationship in GOTILWA+ at Collelongo and for the vapor pressure deficit relationship of 4C at Sorø. Models tended to overestimate daily GPP under dry conditions (i.e. high vapor pressure deficit). High daily GPP at high levels of vapor pressure deficit for 4C at Bily-Kriz and Sorø and for many models at Le Bray and Hyytiälä indicated unrealistic productivity responses.

On average, the ensemble mean showed the most realistic environmental responses while Landscape-DNDC and 3D-CMCC-FEM BGC also showed more realistic responses of daily GPP to different environmental drivers than other models in our ensemble. Yet, there is no individual model that has the most realistic responses of GPP to all three environmental variables at all sites. Some models featured intermediate realism of environmental responses to all environmental variables, for example, 3D-CMCC-FEM LUE. The most realistic response to radiation was obtained by the ensemble mean. In the ensemble, Landscape-DNDC had the most realistic GPP response to vapor pressure deficit, while GOTILWA+ had the most realistic GPP response to temperature. At the same time GOTILWA+ had the least realistic GPP response to radiation, 4C had the least realistic GPP response to temperature and BASFOR had the least realistic GPP response to vapor pressure deficit.


Figure 3: Forest productivity explained by climatic variables. Environmental drivers of gross primary productivity of forests.

General applicability

The most common tree species and species groups in Europe are Scots pine, spruce, European beech, pedunculate oak and sessile oak, which dominate around 75% of Europe's forests. Almost all models covered these species with species-specific parameters. Only PREBAS and BASFOR were missing the oak species, whereas GOTILWA+ was missing spruce and the oak species. Additionally, most models covered other species that are less common in Europe; hence, the species covered by most models are the dominant tree species on 73%–98% of Europe's forest cover. The two models covering the least of Europe's forest cover are BASFOR and GOTILWA+ with 66% and 54%, respectively. Naturally, the ensemble mean had the highest general applicability since it combined the species covered by all models.
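The idea behind this score can be sketched in a few lines of Python. The sketch assumes that general applicability is the share of forest cover dominated by species for which a model has parameters, and that the ensemble covers the union of all models' species. The species names and cover values below are placeholders for illustration, not the data used in the study.

```python
def general_applicability(covered_species, forest_cover):
    """Share of total forest cover dominated by species the model covers.

    `forest_cover` maps a species to the forest cover it dominates
    (any consistent unit, here percentage points).
    """
    total = sum(forest_cover.values())
    covered = sum(
        cover for species, cover in forest_cover.items() if species in covered_species
    )
    return covered / total


# Placeholder species and cover values (hypothetical, for illustration only).
forest_cover = {"species_a": 40, "species_b": 30, "species_c": 20, "species_d": 10}

model_1 = {"species_a", "species_c"}
model_2 = {"species_a", "species_b"}

score_1 = general_applicability(model_1, forest_cover)
score_2 = general_applicability(model_2, forest_cover)
# The ensemble is assumed to cover every species covered by any of its members.
ensemble_score = general_applicability(model_1 | model_2, forest_cover)
print(score_1, score_2, ensemble_score)
```

Under this assumption the ensemble can never score lower than its best member, which matches the observation that the ensemble mean had the highest general applicability.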

Accuracy of local predictions

Overall, no model was able to predict all variables at all sites with high accuracy, and only a few models showed a high accuracy of local predictions for all variables at one site (SALEM at Bily-Kriz, 3PG at Solling-spruce and 3D-CMCC-FEM BGC at Solling-beech). At the same time, every model predicted at least one variable at one site with an adequate accuracy of local predictions, except for 3PGN-BW, which showed consistently low performance.

Random errors made up the largest share of the overall error, except for BA and AET. Systematic errors of the structure variables may have resulted from offsets between the models' initial state and the initialization data. Flux variables were also prone to systematic errors due to systematic over- or underestimation. Persistent under- or overestimation of GPP and AET was evident for some models at single sites. Predicted-observed offsets from linear patterns in the residuals were generally low, with some exceptions.

Furthermore, forest structure variables displayed a higher overall accuracy of local predictions than the carbon and water flux variables. On average, simulated BA showed the highest accuracy of local predictions. This is partly related to the temporal autocorrelation of the variable, as the basal area naturally increases as the forest grows older. Annual carbon variables had the lowest accuracy of local predictions, while NEE had the lowest accuracy of the annual carbon variables. None of the sites' observed data could be predicted with a high accuracy of local predictions for all carbon and water variables simultaneously by any given model.

Interestingly, we found that the accuracy of local predictions in the historical period is not related to the level of complexity of a model; that is, empirical models do not necessarily provide less accurate predictions than hybrid or process-based models under current climate conditions.


Figure 4: Statistical metrics to compare model accuracy.

Takeaway

We performed a large forest model comparison with a wide range of observed data in a model performance framework that complements existing knowledge from model-model and model-data comparisons.

While our results confirm that there is no “silver bullet” model, we could not find explicit trade-offs such as a systematic negative relation between general applicability and accuracy of local predictions either. Models that have a high general applicability score such as 3D-CMCC-FEM BGC also perform well in terms of accuracy of local predictions and realism of environmental responses. In general, the scores of the three dimensions of model performance seem to be balanced for most models although at different overall levels. While a balance between the three dimensions is advisable, it may not always be necessary. Qualitatively correct insights about forest growth and forest dynamics under climate change may be sufficient to guide adaptation planning. For example, insights about the growth dominance of one species over the other may provide valuable implications for the afforestation plans in a specific stand. This indicates that realism and generality may be more important for this purpose than accuracy.

The quantified model performances need to be interpreted in light of our heuristic approach for quantifying the metrics. Even though we tested the models’ performance with carbon and water variables, further refinements of the model performance framework should test the realism of environmental responses for other variables, such as structure and mortality variables or autotrophic and heterotrophic respiration, to assess model realism across a broader range of processes. Also, we did not derive the general applicability across time but focused on the general applicability in space. Furthermore, we need to be aware that models that simulate variables which are generally more difficult to predict accurately will have lower levels of accuracy of local predictions than models that only simulate variables that are less difficult to predict.

Finally, we conclude that, if accuracy is the objective, individual models may provide the best results at single specific locations. Which model will provide optimal results depends on the environmental conditions, structural properties, disturbances, etc. of those locations. Accurate predictions of carbon variables at annual scale are more difficult to obtain than accurate predictions of structure variables. The realism of environmental responses in model simulations provides an approximation of how well relationships that are crucial to assessing climate impacts are covered. We showed that the model ensemble mean has the most realistic daily GPP responses to environmental variables. Moreover, most individual models cover the most relevant European tree species, but to cover all species, and particularly the less abundant ones, multiple models need to be applied. This is particularly important because less common species may become more important under climate change. Furthermore, we highlight the importance of evaluating several model output variables with a wide range of data, because models struggle to achieve high accuracies for several variables at the same time. Because multiple models already exist to study climate impacts on forests, we expect that our study will provide a common benchmark to test whether new modelling efforts outperform the models presented here and thus add value to the existing set of tools.

In the end, we need to keep in mind that the models we currently have do not perfectly describe real-world forest dynamics. Nevertheless, we still use these models for climate change impact assessment because we are able to view the model outcomes through the ‘lens of the model’, i.e. we interpret the model outputs while considering the limitations of the models, such as those identified in this study. This way we can still make use of these important tools to assess impacts, as real-world experiments in forests are often not feasible due to several constraints. One of them is the slow pace of forest dynamics: trees can take several hundred years to reach their maximum biological age, while we need answers now.

Ever since finishing this study, the walks through my favourite forest have become excursions with my mind wandering around, trying to think of the processes we are not capturing well in our models. I am constantly trying to figure out what we need to work on next to increase the performance of our models: is it the way trees suffer from stress, when they die, how they regenerate or some other process outside our current focus?

References

Gauch, Hugh G., J. T. Gene Hwang, and Gary W. Fick. 2003. “Model Evaluation by Comparison of Model-Based Predictions and Measured Values.” Agronomy Journal 95 (6): 1442–46. https://doi.org/10.2134/agronj2003.1442.
Levins, Richard. 1966. “The Strategy of Model Building in Population Biology.” American Scientist 54 (4): 421–31. http://www.jstor.org/stable/27836590.
Reyer, Christopher P. O., Ramiro Silveyra Gonzalez, Klara Dolos, Florian Hartig, Ylva Hauf, Matthias Noack, Petra Lasch-Born, et al. 2020. “The PROFOUND Database for Evaluating Vegetation Models and Simulating Climate Impacts on European Forests.” Earth System Science Data 12 (2): 1295–1320. https://doi.org/10.5194/essd-12-1295-2020.

Affiliations

1 Potsdam Institute for Climate Impact Research