User Manual

US Department of Agriculture Snow Survey and Water Supply Forecasting Program Multi-Model Machine-Learning Metasystem (M 4 )


First complete draft 10 March 2023, S.W. Fleming, NRCS National Water and Climate Center
Last modified 5 May 2023, S.W. Fleming, NRCS National Water and Climate Center

For information contact Beau Uriona ( beau.uriona@usda.gov ) or Sean Fleming ( sean.fleming@usda.gov ) at the USDA Natural Resources Conservation Service


  1. What is M 4 ?

    The multi-model machine-learning metasystem (M 4 ) is the prototype next-generation mathematical and software model for operational water supply forecasting (seasonal river flow volume prediction) built for and employed by the Snow Survey and Water Supply Forecasting (SSWSF) Program of the US Department of Agriculture (USDA) Natural Resources Conservation Service (NRCS), which spans close to 600 forecast locations across the American West. It will replace the linear principal component regression (PCR) method that the NRCS SSWSF Program, and in particular the National Water and Climate Center (NWCC), has used as its primary (but not sole) water supply forecasting (WSF) model in this West-wide NRCS operational role since the early 1990s. M 4 leverages a broadly PCR-like architecture and philosophy but augments it with multi-model ensemble modeling for improved forecast stability and accuracy, explainable and automated machine learning/artificial intelligence (ML/AI) for improved forecast accuracy, evolutionary computing to support global feature optimization, more advanced statistical methods for prediction uncertainty estimation under non-Gaussian and heteroscedastic error distributions, and other advances that were directly motivated by the day-to-day operational requirements of a governmental service-delivery organization performing operational water supply forecasting at scale. It is written in the free, open-source R scientific/statistical computing language.


  2. Disclaimer

    The NWCC posts the M 4 code publicly here for four reasons: (a) to satisfy internal NRCS archiving, access, and version-control needs around the code and its documentation, including instructions for its use; (b) to demonstrate full transparency in the techniques used by the NRCS SSWSF Program, and in so doing, and in conjunction with extensive vetting of the M 4 methodology and performance in the peer-reviewed science and engineering literature, to support the technical credibility of the program and public confidence in its products; (c) as a practical mechanism for sharing M 4 with partner agencies and organizations; and (d) to facilitate and encourage further refinement and development of M 4 by making it available as open-source code to the STEM research community.

    Any purpose to which this code is put by external users downloading M 4 from this site is entirely at the discretion of those users, who accept full responsibility for such uses and their consequences, which are in general beyond our knowledge or control. M 4 is an operational prototype, and while it has undergone successful testing, much of which has been documented in the peer-reviewed science and engineering literature, the possibility of bugs or errors cannot be ruled out (see also Section 6 below).

    Note that operational forecasts officially issued by the NRCS under a specific set of user protocols and institutional controls, irrespective of whether those forecasts were or were not developed in part using guidance from M 4 , are distinct from the predictions and performance of the M 4 code itself as posted here to GitHub and for which no claims are made as to accuracy or value. Additionally, all environmental analysis and prediction software, including M 4 , has inherent limitations and idiosyncrasies. The characteristics of the prototype M 4 probabilistic nonlinear regression metasystem are such that it may be useful for purposes other than the WSF applications for which it was designed, but while we applaud exploration of such additional applications, we emphasize that absolutely no testing of any such applications has been undertaken by the NRCS and no claims as to its performance in or suitability for such applications are made. No warranties, guarantees, or any other assurances whatsoever around M 4 and its results in any application, WSF-related or not, are provided. Users who believe they have discovered a problem are encouraged to report it to the NRCS.

    Users of M 4 should read and understand this user manual in its entirety, including the externally peer-reviewed journal articles in its Appendices, which describe the M 4 prototype. Note also that while it is hoped that the current distribution of the M 4 prototype will be taken as a starting point for additional M 4 development, and there is no shortage of potential future directions that such further development might take, developers are responsible for their own results and should familiarize themselves thoroughly with the peer-reviewed literature on M 4 and with the code itself in its entirety: there are a lot of moving parts in M 4 , many of which are interconnected, especially as they relate to functionalities specifically directed to operational WSF environments, standard WSF use-cases, and AutoML (see below). Commenting in the code is fairly ubiquitous, which should help scientific software developers decipher the intent and rationales of certain code snippets and the overall workflows in M 4 .

    While just about anyone is welcome to experiment with M 4 , please be aware that some required, but not necessarily sufficient, prerequisites for using M 4 correctly, using its results appropriately, and accurately understanding its nature and capabilities are: good familiarity with current standard procedures in statistical WSF in the western US as implemented by NRCS and other similar operational service-delivery organizations; solid experience with R or, at a minimum, other similar interpreted scientific computing languages; and a thorough understanding of the principles, goals, methods, and capabilities and limitations of M 4 as described in the peer-reviewed literature and this manual.

    The code, as it is posted here, should be viewed as experimental – use at your own risk! See also the GitHub license agreement.

  3. A synopsis of M 4 philosophy, context, and basic workflow

    Full explanations of M 4 and the issues motivating its development and design choices are provided in the Appendices, which users should read and digest. The following is a short summary of the general context, philosophy, and workflows associated with the new NRCS forecasting metasystem.


    1. Design criteria

      M 4 did not transition from research to operations (R2O) per se. Rather, a more pragmatic applied-science and engineering paradigm was used: operational requirements were first defined, and M 4 was then designed and built to meet those requirements. These specific design criteria for operational WSF at NRCS, many of which are applicable more generally to operational hydrologic forecasting at other service-delivery organizations, included:

      1. Improved forecast accuracy

      2. Improved potential for automation (largely an “over-the-loop” system)

      3. Relatively low cost & ease of development, implementation, and operation

      4. Address known technical issues with the existing system by seamlessly accommodating nonlinear functional forms, and heteroscedastic and non-Gaussian residuals requiring time-varying & asymmetric prediction intervals, without the use of predictand transforms

      5. Flexible, modular, expandable: relatively easy integration of emerging new techniques as appropriate

      6. Physics-aware AI: hydrologic theory-guided system, generating geophysically explainable forecasts

      7. Multi-model ensemble paradigm: address equifinality and model selection uncertainty, with robustness over diverse geophysical environments across the US West and Alaska

      8. Accommodate high-dimensional predictor datasets & potential for multiple independent input signals: will only grow more important with increasing use of highly spatially distributed input datasets such as those from remote sensing or climate modeling products

      9. Balance innovation and performance gains vs. established building blocks and proven tools: apply the most advanced yet acceptable (MAYA) industrial design principle.


    2. MAYA

      The MAYA principle was introduced by the renowned French-American industrial designer Raymond Loewy. In the context of a next-generation WSF system, MAYA signifies a compromise between adopting the most advanced techniques coming out of the research community, which offer great promise but remain largely unproven in real-world high-stakes applications having strong organizational accountabilities, and established technologies and practices that have proven themselves and achieved buy-in from the water resource science, engineering, and management communities but may not represent the best-performing methods now available. The goal is a unique new mission-specific prediction model, using a carefully selected suite of up-to-date methods from the geophysical modeling and data science communities, yet one that also corresponds to a ‘hot-rodded’ version of the proven and accepted PCR-based approach to WSF.

      1. Synopsis of some conventional elements of M 4

        The conventional elements of M 4 include a PCR-like structure, adoption of the canonical use-case for statistical WSF in the western US, quantitative estimation of forecast uncertainty in the form of prediction intervals, incorporation of individually well-proven techniques, and an emphasis on the ability to easily generate a relatable storyline for any given forecast on any given day.

        Linear PCR was introduced to operational WSF by the NRCS in the early 1990s as a replacement for earlier linear regression methods. Its primary advantage is that it provides a theoretical and practical framework for performing linear regression with multicollinear predictors. The basic workflow of PCR involves performing a principal component analysis (PCA) on the input data matrix, and using one or more of the resulting modes, starting with the leading mode, as predictors in an otherwise standard linear regression to predict seasonal river flow volume. Because inputs are mainly measurements of snow water equivalent (SWE) and other similar hydroclimatic parameters, the leading PCA mode effectively serves as an index of watershed-scale hydroclimatic conditions. Higher modes may be retained, reflecting more subtle hydroclimatic patterns, or capturing other processes like aquifer-stream interactions if antecedent streamflow is additionally included as a predictor.

        Out-of-sample prediction performance is normally measured as the leave-one-out cross-validated (LOOCV; sometimes called jackknifed, though strictly speaking that term refers to something slightly different) standard error, root mean square error, coefficient of determination, or a similar metric. This LOOCV procedure is standard practice in statistical WSF systems deployed operationally by service-delivery organizations in western North America, and larger cross-validation folds are generally not required because, while many hydroclimatic variables are serially correlated, the residuals from WSF regression models typically demonstrate little or no statistically significant autocorrelation at the annual timescale considered in the canonical western US statistical WSF use-case (see below; see also examples given later in this manual, and in the M 4 distribution files; the Appendices also provide more information).

        That cross-validated performance measure then provides the basis for forming prediction intervals, expressed for example in NRCS practice as 90%, 70%, 30%, and 10% exceedance flows. In particular, a simple but common heuristic is used: the probability density function for model error is taken to be a stationary Gaussian (normal) distribution, with a mean equal to the deterministic prediction and a standard deviation equal to the LOOCV standard error or, more or less equivalently, the root mean square error; the exceedance flows are then estimated as the 0.10, 0.30, 0.70, and 0.90 quantiles of that distribution. The entire workflow may be wrapped in an automated system for identifying optimal predictors and PCA modes to retain.
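        To make this concrete, the following is a minimal R sketch of that conventional PCR workflow, using synthetic data; it is illustrative only, it is not M 4 code, and all variable names are hypothetical (a stricter LOOCV variant would also redo the PCA in each fold):

          set.seed(1)
          swe <- matrix(rnorm(30 * 4), nrow = 30)            # 30 years x 4 multicollinear predictors
          vol <- 50 + 10 * rowMeans(swe) + rnorm(30, sd = 3) # observed seasonal flow volumes
          pca <- prcomp(swe, center = TRUE, scale. = TRUE)   # PCA (on the correlation matrix)
          d   <- data.frame(vol = vol, pc1 = pca$x[, 1])     # leading mode as the single feature
          fit <- lm(vol ~ pc1, data = d)                     # otherwise-standard linear regression
          # Leave-one-out cross-validation (regression refit in each fold):
          errs <- sapply(seq_len(nrow(d)), function(i) {
            f <- lm(vol ~ pc1, data = d[-i, ])
            d$vol[i] - predict(f, newdata = d[i, , drop = FALSE])
          })
          se <- sqrt(mean(errs^2))                           # LOOCV standard error
          # Stationary Gaussian heuristic: the 90/70/30/10% exceedance flows are the
          # 0.10/0.30/0.70/0.90 quantiles of a normal centered on the best estimate:
          best <- predict(fit, newdata = d[30, , drop = FALSE])
          qnorm(c(0.10, 0.30, 0.70, 0.90), mean = best, sd = se)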

        The foregoing processes have been implemented in many software systems over the years, including the VIPER program used operationally by NRCS. This proven overall procedure was in many, but not all, respects largely replicated in M 4 , which augments it with the use of multi-model ensemble modeling, a genetic algorithm for optimal feature extraction, and more sophisticated and flexible methods for estimating prediction uncertainty.

        The canonical use-case for statistical WSF in the western US involves a single predictand, which is the upcoming spring-summer flow volume at a given location on a given river; predictors consisting mainly of current SWE and year-to-date precipitation at various locations in and near the watershed, and possibly antecedent streamflow or an ENSO index; and a different model for each forecast point-publication date-target period triplet. To elaborate, operational statistical WSF models at practical service-delivery organizations (SDOs) having strong accountabilities for reliably producing and publicly disseminating forecast information are organized around solving a specific forecasting problem (SFP). An SFP is a unique combination of annual publication date, target period, and forecast point.

        Publication date is the day of the year that the forecast is issued. Common examples are January 1, February 1, March 1, April 1, May 1, or June 1, but others are possible depending on the particular river, reflecting local hydroclimatic characteristics and water management concerns. The target period defines the upcoming seasonal timeframe for which accumulated river runoff volume is being forecasted; a prevalent example is April-July but others are commonly used as well, again depending on the particular river, reflecting local hydroclimatic characteristics and water management concerns. The forecast point is a particular location on a particular river for which a runoff forecast is being made.

        There may be more than one forecast point on a given river, especially on long rivers; but each forecast point corresponds to a separate SFP. An example of an SFP is a water supply forecast, issued on February 1 (publication date), of the total April-July flow volume (target period) of the Deschutes River below Snow Creek (forecast point). A unique model is built for each SFP. However, there may be a great deal of overlap between models for similar SFPs. For example, the January 1 and April 1 publication-date models for the Deschutes River below Snow Creek may use updated versions of the same predictors, such as SWE at some SNOTEL station on January 1 and on April 1, respectively. However, the relevance of certain predictors may also change over the course of the forecast season reflecting seasonal hydroclimatic processes, so the optimal predictor set may gradually change as well; for example, ENSO indices, in regions where they are useful, tend to provide much of their value relatively early in the forecast season when they provide some information about upcoming winter-spring rainfall and snowfall. M 4 was built for this canonical use-case, albeit with an eye to making better use of emerging predictors.

        Western water management can be a very high-stakes enterprise, and the water managed using WSFs may have values ranging into the billions or trillions of dollars every year. While this argues on the one hand for using the best available prediction technologies, responsibilities around sound decision-making, and the economic and political ramifications of making bad choices, also call for a certain level of prudence and conservatism. This risk-aversion leads to a reliance on proven methods. M 4 employs a proven PCR-like architecture and adheres to the canonical WSF use-case, as described above, and its constituent elements consist of methods that have individually been widely tested and accepted in physical, statistical, and machine learning applications.

        Similarly, most water resource scientists, engineers, and managers place little trust in predictions that are not geophysically explainable. A readily identifiable and relatable ‘storyline’ for any given forecast on any given day is, in practice, a necessary ingredient of any operational WSF modeling system at a SDO. This typically consists of a general explanation of why a certain forecast was obtained, in light of well-understood terrestrial hydrologic processes and current watershed-scale climatic (snowpack, soil moisture, etc.) conditions. Producing such a physical explanation of the forecast does not necessarily require a process simulation-based hydrologic model. Use of PCR by well-trained operational hydrologists, for example, has produced clear hydroclimatic storylines for decades. The ability to continue this forward into the machine-learning world was one of the most important and challenging design goals for M 4 .

      2. Synopsis of some fresh elements

        New elements in M 4 include a multi-model ensemble forecasting framework, the first wide-scale incorporation of AI into a governmental SDO operational WSF system, evolutionary computing, and advanced statistical prediction uncertainty estimation.

        The final product of M 4 is a multi-model ensemble mean forecast. M 4 consists of six semi-independent forecast modeling systems acting in parallel and may be viewed as a system of systems (hence ‘metasystem’). Each of the constituent modeling systems has a PCR-like architecture as noted above but replaces simple linear regression and prediction intervals based on stationary normal error distributions with various advanced statistical or machine learning algorithms. The results of all six are averaged together to create the ensemble mean forecast. Advantages of this approach are that it deals with model equifinality, that is, similar overall performance from several quite different models, and that it provides more stable and, in some cases, more accurate results than any of the individual models alone. It is a standard operational approach in numerical weather prediction and climate modeling, it is very well-established in applied hydrologic research using process-based models, and it is used within some machine learning approaches, such as bootstrap aggregation (bagging) within some neural networks.
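        Schematically, the ensemble-mean step amounts to a member-wise average across the constituent models; a tiny R illustration follows (the model names and numerical values are made up for illustration and are not M 4 output):

          # Rows: constituent models; columns: 90% and 70% exceedance flows,
          # best estimate, and 30% and 10% exceedance flows (hypothetical values):
          member_preds <- rbind(
            pcr   = c(Q90 = 10.2, Q70 = 13.1, best = 15.8, Q30 = 18.4, Q10 = 21.9),
            pcsvm = c(10.8, 13.5, 16.1, 18.9, 22.4),
            pcrf  = c( 9.9, 12.8, 15.2, 17.9, 21.1))
          colMeans(member_preds)   # member-wise average forms the ensemble mean forecast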

        However, multi-model ensembles have not, in general, migrated into operational river forecasting systems at SDOs (although related but simpler concepts like extended (or ensemble) streamflow prediction (ESP), which uses a single hydrologic model with multiple climate traces, made the jump decades ago). We emphasize that using the results of a multi-model hydrologic prediction system wisely can require operating in a different headspace than some operational hydrologists may be accustomed to. For example, hunting-and-pecking for which of the six constituent models works ‘best’ for some particular combination of forecast point, publication date, and target period misses the point of the exercise and can lead to wasted time and, ultimately, an underperforming overall modeling strategy.

        More background on this sometimes-counterintuitive approach can be found in the Appendices, and the publications cited in the Appendices.

        Research using AI for river forecasting dates back to 1995 or possibly earlier, and its superior prediction accuracy has been demonstrated repeatedly. Mainstreaming of AI into wider and operational hydrologic applications has been much more recent and limited due to various stumbling blocks. These included a lack of prediction uncertainty intervals; a lack of explainability, i.e., the so-called black-box problem; skepticism in high-stakes operational settings where existing methods are well-established and new methods coming out of the research community are often viewed as lacking real-world relevance and credibility; and misunderstandings and a lack of technical training and professional familiarity among many physical scientists in both the research and operational communities regarding AI and its real limitations and strengths. But things are changing: almost everyone is getting used to the idea of AI, machine learning tools have become far more accessible and methods far more diversified, and explainability, overtraining, prediction uncertainty estimation, and other questions can be addressed with appropriate AI design and implementation choices. WSF applications of AI, however, still have to be approached carefully and with the goal of satisfying specific user needs, rather than throwing machine learning at a problem and expecting the result to be useful and accepted.

        Accomplishing this, as noted above, was a key goal when developing M 4 . The approach and results are documented in the Appendices, and some cursory summaries are provided in the following subsections.

        Before forging ahead, it’s probably also useful to provide a glossary of some basic machine learning jargon, bearing in mind however that these terms are evolving quickly and may be used in slightly different ways in different fields:

        • Data science: Algorithms for building data-based information/knowledge pipelines, often to enable automated predictions and actions; spans traditional statistical modeling and AI

        • Data mining: Extracting descriptive, explanatory, or predictive patterns from data, typically without clear a priori expectations around specific causal relationships

        • Big data: Satisfies the three Vs of volume, velocity, and variety – gigantic and never-ending torrents of data having a wide & unpredictable content range (e.g., YouTube uploads). Strictly speaking, few environmental datasets satisfy this definition, but big-data methods can nevertheless be useful. A fourth V – veracity – is also sometimes included.

        • Artificial intelligence (AI): Technologies that emulate human intelligence in various ways, in turn including several subdisciplines, like machine learning and robotics

        • Machine learning (ML): An AI field using algorithms to identify patterns in data and, typically, applying those patterns to make predictions; divided into classification or regression, and unsupervised or supervised, methods; examples include a wide variety of artificial neural network (ANN) types, random forests (RFs), support vector machines (SVMs), and deep learning (DL) architectures like long short-term memory (LSTM) networks and convolutional neural networks (CNNs); these algorithms are often inspired by biological systems, such as the human brain (ANNs) or human decision-making processes (the decision trees in RFs)

        • Hyperparameters: AIs have parameters loosely analogous to coefficients in a linear statistical model (e.g., a neural network’s neuron weights) that are optimized in training; but AIs additionally have higher-level hyperparameters (e.g., a neural network’s learning rate) that control this process of estimating parameter values and the overall architecture

        • Features and targets: Akin to predictors and predictands, respectively, in traditional statistical modeling; features engineering is the procedure of processing/manipulating input data to extract and select features to be passed to the AI as predictor variables, a major element of AI application; one of the goals of deep learning is to integrate features engineering into the core of the machine learning algorithm, avoiding the need to perform features engineering explicitly; conversely, user-directed features engineering has been identified as a type of physics-informed machine learning

        • AutoML: System to automatically build the best AI for a given dataset; may include optimal hyperparameter selection and is intended to make ML easier for non-AI specialists to apply in their respective practice domains; also creates the need for still-higher-level ‘hyperhyperparameters’

        • Overtraining and regularization: Overtraining is memorization of the training data, compromising generalization accuracy; it happens in all models involving calibration or derivation of parameters from observational data, not just ML, including physics-based and conventional linear statistical models, but it can be particularly acute for ML due to the tremendous fitting ability of machine learning algorithms; however, it is easily detectable in validation; regularization refers to technical methods that mitigate it.

        Evolutionary computing is also used in M 4 . This refers to a family of computing techniques that emulate biological evolutionary processes. M 4 deploys a genetic algorithm, which is a type of optimization procedure, for optimizing the features (specifically, the combination of input predictor variables and PCA modes) fed to its constituent modeling systems. The goal is the same as that of techniques conventionally used for optimizing these choices, such as the tree algorithm used in the NRCS VIPER implementation of PCR, but the methods it uses are more advanced. Of particular note is that it is a stochastic optimization scheme, which helps avoid having the solution process get trapped in local minima in a complex nonlinear error space, contributing to a greater likelihood of obtaining a global (true) optimum.
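        As an illustration of the underlying idea – not of M 4 ’s actual GA module, whose details differ – genetic-algorithm feature selection can be sketched in a few lines using the genalg package listed in Section 4.1 (data and penalty value are synthetic and hypothetical):

          library(genalg)
          set.seed(1)
          X <- matrix(rnorm(30 * 6), nrow = 30)      # pool of 6 candidate predictors
          y <- rowSums(X[, 1:3]) + rnorm(30)         # synthetic predictand
          cost <- function(bits) {                   # RMSE of a PCR fit on the chosen subset
            if (sum(bits) < 2) return(1e6)           # penalize retaining fewer than 2 predictors
            pc1 <- prcomp(X[, bits == 1], scale. = TRUE)$x[, 1]
            sqrt(mean(resid(lm(y ~ pc1))^2))
          }
          ga <- rbga.bin(size = ncol(X), popSize = 50, iters = 40,
                         mutationChance = 0.05, evalFunc = cost)
          ga$population[which.min(ga$evaluations), ] # best predictor subset (a 0/1 mask)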

        Considered in aggregate, the foregoing attributes mean that M 4 can be viewed as a community of artificial lifeforms. Each of these lifeforms is an artificial intelligence that, following rules adapted from genetics, evolves over a series of generations in such a way as to perform the best it can in its environment. Ultimately, this community of six artificial lifeforms reaches a consensus decision on the best WSF and on how much certainty it collectively has in that decision.


    3. Explainable machine learning

      Several pragmatic, somewhat WSF-specific steps were taken in M 4 design to dovetail machine learning with the underlying process physics of watershed hydrology:

      • Operational hydrologist-based features engineering. User-directed selection of predictive data, and of how those data are compressed and fed to supervised learning algorithms, reflects end-user knowledge around the representativeness, reliability, quirks, and capabilities of potential input variables and measurement sites, and geophysical interpretations of PCA modes. That the hydrologist should pick what data to use and how to process them may seem obvious, but it is somewhat of a counterpoint to some big data-focused deep learning approaches, where large datasets are fed to complex multi-layer neural networks that, in a nutshell, perform features engineering automatically. It is a key point in the AI development process at which domain experts can insert their physical hydrologic knowledge.

      • A priori physicality constraints. Some feature-target relationships are known to be significantly nonlinear but monotonic. For example, a high-snow winter, all other things being equal, won’t produce a low-flow spring and summer. Certain AI methods allow a monotonicity constraint to be enforced (see the brief sketch following this list). While overall monotonicity of the ensemble mean relationship is not strictly enforced, M 4 strongly encourages that net behavior, as two of its six methods are intrinsically monotonic and another two have the option, which is activated in the default application of M 4 , of enforcing monotonicity constraints. Similarly, negative-valued volume predictions are non-physical yet can happen in some prediction systems, and certain AI methods allow a non-negativity constraint to be enforced. Several such methods are used here. Moreover, the use of nonlinear ML techniques tends to avoid the generation of negative-valued best-estimate predictions. M 4 also includes algorithmic logic to sequentially prune ensemble members if the 90% exceedance flow in the ensemble mean forecast distribution violates the nonnegativity constraint for any sample in the training interval, though this does not necessarily guarantee that a forecast going forward is strictly nonnegative. Note that monotonicity and nonnegativity constraints are simple forms of theory-guided data science, and that by severely restricting the solution space available to the MLs, such geophysical plausibility constraints improve both regularization and explainability.

      • WSF is purposefully framed as a low-dimensional problem with a parsimonious solution. PCA pre-processing dramatically reduces problem dimensionality, and the predictand is a simple scalar. Additionally, considerable effort and testing was invested in identifying default hyperparameters and AutoML algorithms that help create the most parsimonious ML architecture possible for the canonical statistical WSF use-case. The result is, in most such applications, an extremely compact ML problem with one or two inputs and a single output. The relationships between those inputs and outputs can be easily visualized in an x-y or contour plot, allowing the user to see what the ML is thinking. Another advantage of low-dimensional features and targets and compact architectures (e.g., minimizing the number of hidden-layer neurons) is that it reduces the number of parameters that must be trained, making the ML more amenable to application using small sample sizes.
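      As a small illustration of an a priori monotonicity constraint of the kind described above, the following R sketch uses the monmlp package from Section 4.1; the data are made up, and this shows the general technique rather than M 4 ’s internal usage:

        library(monmlp)
        set.seed(1)
        x <- matrix(seq(0, 3, length.out = 30), ncol = 1)  # e.g., a watershed SWE index
        y <- as.matrix(20 * sqrt(x) + rnorm(30, sd = 2))   # nonlinear but monotone response
        fit <- monmlp.fit(x, y, hidden1 = 2, monotone = 1, # constrain column 1 to be monotone
                          n.trials = 1, iter.max = 500)
        pred <- monmlp.predict(x, weights = fit)           # predictions respect the constraint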


    4. AutoML

      While completely hands-off operation of a WSF system is in general not desired by SDOs, streamlining and automation of many tasks is, both in the interest of operational efficiency and to make a system like M 4 accessible to users whose expertise lies in areas other than AI. This leads to judicious use of automatic, or autonomous, machine learning (commonly dubbed AutoML). Its use dovetails with “over-the-loop” hydrometeorological forecasting concepts. In particular:

      • Algorithmic logic is introduced to automate most optimization/decision points (including ML hyperparameters). However, manual overrides remain available.

      • Other hyperparameters are set to robust default values on the basis of extensive experimentation using various NRCS WSF test cases and problem setups. For example, hundreds of metasystem runs were completed to locate values for the population size and the number of generations in the genetic algorithm that work as generally serviceable defaults for the canonical statistical WSF use-case.

      The ability to override these defaults and AutoML algorithms is also considered important at SDOs, providing hydrologists with an opportunity to make adjustments in accordance with their professional judgement.


    5. M 4 development approach, process, and status

      Why choose R? There were four reasons. First, it’s free for everyone. Second, the design criteria for M 4 included some very specific requirements around a priori physicality constraints and non-standard statistical properties of the forecast distributions, and to our knowledge, these are most readily, and perhaps exclusively, available in certain existing R packages previously written by others and freely available on the CRAN server. Third, by downloading RStudio (and R along with it) and installing the needed packages directly from within that GUI by a simple point-and-click process, everything needed to run M 4 can be obtained very easily by one-stop (free) shopping with very little or no system configuration on the user’s part, further improving access. And fourth, it’s one of the most widely used scientific computing languages, and therefore likely to be familiar to a relatively wide range of users.

      Basically, it came down to a question of some specific technical requirements plus general accessibility.

      The M 4 development process began in early 2017 with a partnership between NRCS and White Rabbit R&D LLC facilitated through Elyon International Inc. Over the next three years, through a collaborative and consultative process, the existing PCR-based VIPER system at NRCS was evaluated, alternative approaches were explored under the constraint that only data-driven methods were to be considered, the attractiveness of developing a new system was identified along with its desired characteristics, and an initial research prototype was jointly developed with some preliminary testing, documentation, and external vetting through the engineering journal peer-review process. Work on M 4 since 2019 has been undertaken exclusively by NRCS and has included extensive retrospective testing, live operational testing for selected test sites during the 2020 and 2021 water supply forecast seasons, some further code debugging and refinement, and extensive documentation with additional external vetting through scientific journal peer-review processes, leading to the more refined and better-proven prototype archived here.

      Considerable effort has also been undertaken by NRCS in partnership with a software development team at Colorado State University to build Persephone, a prototype production platform for large-scale operational deployment of M 4 . The existing M 4 code sits inside the Persephone wrapper. Persephone adopts a Software-as-a-Service (SaaS) framework residing on a private cloud. This enables increased levels of parallelization, multi-user server-based access for an operational forecast team, and a variety of enhanced user-focused practical capabilities: GUI, IT/database linkages, interactive capabilities around graphics, mapping and geospatial analysis, features engineering, data pre-processing and forecast distribution post-processing, and job-queuing, for example. A prototype of Persephone-M 4 became operational at NRCS on a limited testing and forecast process-augmentation basis during the 2023 water supply forecast season. Note that this GitHub distribution of the M 4 prototype does not include Persephone.


    6. General workflow of M 4

      There are two main steps in the M 4 WSF modeling process – building a model, and then running it in forecast operations – which closely mirror the steps in other hydrologic forecast models.

      1. The build step

        Training is the first step. It amounts to building an M 4 model suite, including data selection, setting some run control parameters, fitting the various constituent modeling systems within M 4 , and saving the outcomes. It can loosely be viewed as a geophysical model inversion process, in which model parameters are estimated from observational data. Training is akin to fitting a linear regression model in statistical WSF, or calibrating model parameters in a process-based model used for WSF.

        In general, and in brief, training involves the following processes within M 4 (see also the foregoing sections in this manual and, in particular, the Appendices). As noted above, there are currently six semi-independent models in M 4 , each built around a different regression-type supervised learner. These are a monotone artificial neural network, a monotone composite quantile regression neural network, a support vector machine (more specifically, support vector regression), random forests, linear quantile regression, and a standard linear regression as in conventional PCR. A sample of input variables is drawn from a pool of candidates, a principal component analysis (PCA) is performed on it, and principal component time series for a certain number of PCA modes – the features, in ML jargon – are fed to one of the six constituent models as predictor variables. That model is then trained by fitting parameters, and to some extent hyperparameters through AutoML processes, using that sample dataset. The accuracy of these intermediate results is assessed. A type of global optimization approach drawn from evolutionary computing, called a genetic algorithm, then efficiently iterates through various combinations of input variables and corresponding PCA modes, ultimately selecting the most accurate combination. This is repeated independently for each of the six constituent models. (As such, each of the constituent modeling systems is a variant on the PCR approach; for example, principal component support vector machine, which we abbreviate here as PCSVM.) When completed, the results from all the models – consisting of a best estimate, and prediction intervals (uncertainty estimates) expressed as 10%, 30%, 70%, and 90% exceedance volumes – are averaged together to create the ensemble mean prediction. Because river runoff volume cannot be negative, the final training-period ensemble is tested for that condition, and if needed, the results of individual models contributing most to the problem are removed one by one. In general, all of this is done internally and automatically by M 4 .
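        For orientation only, the rough R-package correspondence of the six constituent learners is sketched below. M 4 wraps these in PCA-based features, AutoML, and uncertainty estimation, so the calls are schematic; ‘pc’ denotes hypothetical principal component scores, and the data are synthetic:

          set.seed(1)
          pc <- matrix(rnorm(30), ncol = 1)                  # leading PCA-mode scores
          y  <- 10 + 5 * pc[, 1] + rnorm(30)                 # seasonal flow volumes (target)
          fit_pcr     <- lm(y ~ pc)                                        # linear PCR
          fit_pcqr    <- quantreg::rq(y ~ pc, tau = 0.5)                   # linear quantile regression
          fit_pcsvm   <- e1071::svm(pc, y)                                 # support vector regression
          fit_pcrf    <- randomForest::randomForest(pc, y)                 # random forests
          fit_pcann   <- monmlp::monmlp.fit(pc, as.matrix(y), hidden1 = 2, # monotone ANN
                                            monotone = 1, n.trials = 1)
          fit_pcmcqrnn <- qrnn::mcqrnn.fit(pc, as.matrix(y), n.hidden = 2, # monotone composite QRNN
                                           tau = c(0.1, 0.3, 0.5, 0.7, 0.9),
                                           monotone = 1, n.trials = 1)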

        From a user’s perspective, the procedure only involves providing a historical dataset to M 4 for the combination of forecast point, publication date, and target period that the model is being built for, along with some instructions to M 4 about how the user would like M 4 to run for that particular problem.

        For the canonical use-case in statistical WSF as described above, the default instructions (this is explained more below) often don’t need to be modified in practice. In that case all the user needs to do is decide what predictor data they’d like to use – which SNOTEL snow and precipitation sites they want, maybe antecedent streamflow at the forecast point, perhaps an ENSO index thrown in for good measure, and so forth – just like in conventional statistical WSF. After the input data file containing that information is created by the user – an external process completed in a text editor or Excel spreadsheet, for instance – M 4 is then opened and run in R. For typical canonical statistical WSF use-case applications, depending on the computer and other factors, and using default M 4 parameters and settings including genetic algorithm-based feature optimization with a population size and number of generations found by experimentation to be generally suitable for such applications at a prototype level, the build run might take somewhere between 15 and 35 minutes or so. When it’s done, all important information is written to files by M 4 . This information includes the trained models.
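        For instance, one way to launch a build run from the R console is sketched below; the folder name is the hypothetical example from Section 4.1, and RStudio users can equivalently open the main program and use the Source button:

          setwd("C:/ThisWhereMyM4RunsAre")  # folder holding the 14 scripts and input files
          source("MMPE-Main_MkII.R")        # executes the main program (see Section 4.1)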

        In general, training is performed only once, to create the trained suite of models and metadata describing them. Note that in usual operational WSF practice, however, models are updated roughly every five years by redeveloping them with newly acquired data. There may also be situations where the user would like to try creating a few different versions of an M 4 modeling suite – using different candidate input variables, for example, to test the usefulness of new kinds of predictive data, or giving the genetic algorithm in M 4 either more leeway, or less, to optimize features than in the default case.

      2. The forecast step

        Creating operational forecasts using the previously trained model and new predictive input data is the second step in the M 4 WSF modeling process. It can loosely be viewed as a forward run of a geophysical model developed through an inversion process. For example, in the initial build step, an M 4 ensemble modeling suite might first be trained to predict, on February 1, the April-July flow volume at some particular location on some particular river using, say, 30 years of historical SNOTEL data as input and observed April-July flow volumes over that same 30-year interval as the predictand. Then, in forecast operations, the user takes that trained modeling suite and runs it on February 1 of the current water year, using the present values for the SNOTEL stations, to create this year’s forecast of upcoming April-July flow volume. The user will do the same thing next water year, using the same old trained model and new values for the SNOTEL stations, when their February 1 forecast date rolls around again, and so forth. This is exactly like taking an existing fitted linear regression model in statistical WSF, or an existing calibrated process-simulation model in ESP-based WSF, and running it using new forcing data in order to issue a WSF for the current forecast season.

        From a user’s perspective, workflows for such forecast runs are in some sense much simpler than for training: the finished model and the metadata around it, along with up-to-date data for the predictors, are all that’s needed. However, the user does need to set some slightly persnickety values in a run-control file; the user simply pulls that information from the output files of the build run (if the genetic algorithm was used for M 4 feature optimization during the build phase) or, even more simply, repeats information they provided to M 4 during the build run (if the user switched off genetic algorithm-based feature optimization during the build phase, which they have the option to do, as described below).


  4. How to use M 4

    1. M 4 installation

      The M 4 prediction analytics engine consists of 14 R script files, described in Section 4.2.1 below. These are a main program (MMPE-Main_MkII.R) plus 13 externally defined functions which perform principal component analysis, neural network modeling, genetic algorithm-based feature optimization, and so forth. The “MMPE” acronym refers to a previous name (multi-model prediction engine) given to an earlier development version of the M 4 metasystem.

      Installation is straightforward and essentially the same as for more-or-less any R-based software, amounting to downloading the 14 script files and installing the required packages. M 4 is run by executing the main program, MMPE-Main_MkII.R.

      M 4 was developed, tested, and run in the RStudio freeware development environment on a Windows PC, and the following, more detailed, installation instructions are tailored to that case:

      1. Download the 14 R script files (which are simply text files with a .R extension) from the GitHub repo. To keep things tidy, you'll presumably want to create a new subdirectory to place them in; you can name it whatever you like. All the model, input, and output files will be in the same folder. It’s good practice – especially when using RStudio, which can be less stable than running R in its basic console – to place this folder high up in the directory structure, to run it on a local hard drive rather than a cloud, and to use reasonably short folder names; a good choice could be something like C:\ThisWhereMyM4RunsAre\.

      2. Download the 3 regular text files (i.e., with a .txt extension), which are input files, to the same location. MMPE_RunControlFile.txt is where the user specifies run options. The other text files are input data files (the sample input file provided here is for the Yellowstone River). As noted above, M 4 has two basic modes. In inverse or model-building mode, the AIs are trained, optimized, saved, etc., using the calibration-period observational data in MMPEInputData_ModelBuildingMode.txt. In forward or operational forecasting mode, the completed models are read in from files previously created by M 4 during the build phase, and are then run with a new sample of the input data, contained in MMPEInputData_ForecastingMode.txt.

      3. In Windows file explorer, navigate to this new subdirectory and double-click on the main program file, MMPE-Main_MkII.R, to open the prediction engine in RStudio. (Alternatively, open RStudio, then from the main menu bar at the top select File -> Open File and choose MMPE-Main_MkII.R.)

      4. You'll have to install the non-default R packages used by the M 4 main program and function files. This is done in RStudio by going to its bottom right window, selecting Packages -> Install and finding the ones you want by typing their names into the bar that pops up. The full list of additional packages needed is:

        akima

        forecast

        qrnn

        e1071

        randomForest

        monmlp

        genalg

        stringr

        doParallel

        foreach

        quantreg

        quantregGrowth

        matrixStats

      5. The user-selected options in MMPE_RunControlFile.txt are discussed below. However, one of the parameters the user sets there counts as an installation step. It's the number of cores in the computer's CPU, num_cores, which is used for parallelization of the time-intensive cross-validation calculations in the two neural network methods included among the six supervised learning systems in M 4 . On a Windows PC, you can find the number of cores your processor has by pressing Ctrl + Shift + Esc to open Task Manager; then select the Performance tab to see how many cores (not logical processors!) there are on your machine.
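      As an alternative to the point-and-click process in step 4, the same packages can be installed from the R console, and a physical core count for num_cores (step 5) can be queried directly; a short sketch:

        # Install the non-default CRAN packages required by M 4 (equivalent to step 4):
        install.packages(c("akima", "forecast", "qrnn", "e1071", "randomForest",
                           "monmlp", "genalg", "stringr", "doParallel", "foreach",
                           "quantreg", "quantregGrowth", "matrixStats"))
        # One way to find the physical (not logical) core count for num_cores (step 5):
        parallel::detectCores(logical = FALSE)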

      Details on how to run M 4 from within the RStudio environment are given in Section 4.3 below. As noted above, it is of course possible to run M 4 directly in R as well, which is less user-friendly but can have advantages like avoiding graphics-related crashes that sometimes occur in RStudio. If a crash occurs and cryptic errors are recorded in the log file (see below), a simple hack that can sometimes work is to just resize the plot (lower right) window in RStudio to make it larger and rerun. It also seems to be good practice to shut down and restart RStudio after every few M 4 runs. Some pointers around directory placement and names that were provided above can help too.

      Note that M 4 has thus far only been tested on a Windows PC.


    2. File inventory and descriptions

      1. M 4 code files

        The R files that collectively make up M 4 are as follows. All are invoked in M 4 runs during the model-build phase; in forecast runs, only the main program and the three ensemble-creation utility functions are used:

        • MMPE-Main_MkII.R The M 4 main program. Clears the workspace, reads in data, declares external functions, tracks errors, and either creates (in build mode) the six modeling systems – including ensemble creation and other tasks, along with graphics and output files – or runs (in forecast mode) the six existing modeling systems using output files created during the build phase.


        • PCA-Module_MkII.R Function performing principal component analysis on the predictor variates provided to it; this may be all the predictors in the input data file, MMPEInputData_ModelBuildingMode.txt (described in detail below), if the user specifies in MMPE_RunControlFile.txt (also described in detail below) that genetic algorithm-based feature optimization is not to be used, or a subset of the predictors in that input data file if GA-based feature optimization is used (see Sections 4.2.2 and 4.3 below). PCA is performed on the correlation, not covariance, matrix (a short illustration of this distinction appears at the end of this subsection).


        • PCA-Graphics-Module_MkII.R Function creating some PCA-related plots.


        • GA-PredictorSelection-Module_MkII.R Function performing genetic algorithm-based feature optimization, if that option is selected by the user in the input file, MMPE_RunControlFile.txt; includes some automatic hyperparameter-setting capabilities around the monotone artificial neural network and monotone composite quantile regression neural network algorithms; cost function is root mean square error. Note that the cost is given a penalty if the trial feature set provided by the genetic algorithm in a given iteration includes fewer than MinNumVars of the input predictor variables provided by the user in the building-phase input data file, MMPEInputData_ModelBuildingMode.txt, where MinNumVars is settable by the user in MMPE_RunControlFile.txt but must be > 1 and < the number of predictor variates provided by the user in MMPEInputData_ModelBuildingMode.txt, and should in general be left at 2. The effect is to ensure that the genetic algorithm is guided away from cases where only a single predictor variable is retained, which is inconsistent with the canonical use-case for statistical WSF in the western US (generally framed around several multicollinear predictor variables). A single-predictor application of M 4 is possible for unusual circumstances but does not use genetic algorithm feature optimization (see use instructions below in Section 4.3).


        • PCR-Module_MkII.R Function performing linear regression model-fitting along with cross-validated performance evaluation and estimation of both stationary Gaussian, and heteroscedastic non-Gaussian, prediction intervals around the best-fit PCR estimates; the latter uses a method based on a Box-Cox transform on the prediction residuals, including estimation of the best-fit Box-Cox transform parameters.


        • PCSVM-Module_MkII.R As in PCR-Module_MkII.R but for the support vector machine; additionally includes some AutoML capabilities for identifying optimal SVM hyperparameters in accordance with options set by the user in the input file, MMPE_RunControlFile.txt.


        • PCRF-Module_MkII.R As in PCR-Module_MkII.R but for random forests.


        • PCANN-Module_MkII.R As in PCR-Module_MkII.R but for a monotone artificial neural network; additionally includes setting some hyperparameters in accordance with values specified by the user in MMPE_RunControlFile.txt and/or delivered to it by AutoML algorithms in MMPE-Main_MkII.R and/or GA-PredictorSelection-Module_MkII.R; uses (if the option is enabled in MMPE_RunControlFile.txt) light parallelization of the embarrassingly parallel ANN cross-validation task across processor cores.


        • PCQR-Module_MkII.R As in PCR-Module_MkII.R but using quantile regression with enforcement of a quantile non-crossing constraint that can be crucial for small sample sizes; the expectation value is taken to be the median; heteroscedastic non-Gaussian prediction intervals around the best-fit QR predictions are estimated using the linear quantile-regression capabilities inherent to the method; due to some quirks with one of the R packages used, while a .Rdata file is produced, the primary information about the trained model is written to a text file instead, unlike the other five modeling systems.


        • PCMCQRNN_MkII.R As in PCANN-Module_MkII.R, but for the monotone composite quantile regression neural network; the expectation value is taken to be the median; heteroscedastic non-Gaussian prediction intervals around the best-fit predictions are estimated using the nonlinear quantile-regression capabilities inherent to the method.


        • InitializeEnsemble-Module_MkII.R, AppendEnsemble-Module_MkII.R, FinalizeEnsemble-Module_MkII.R Utility functions used in creating the multi-model ensemble in accordance with user-specified preferences set in the MMPE_RunControlFile.txt input file.


        • Diagnostics-Module_MkII.R Creates visual diagnostics and statistical measures describing model fit for each of the six constituent modeling systems and the multi-model ensemble mean forecast distribution. Some of these results are written to file, whereas others are graphically depicted in windows that open in R during an M 4 run.

        Note that for standard- or intermediate-level use of M 4 (see below) no edits are required to any of these files. In other words, it’s strongly suggested you don’t mess with them unless you really know what you’re doing!
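        As a short illustration of one detail noted in the file descriptions above – PCA-Module_MkII.R operates on the correlation rather than the covariance matrix – in base R terms this is simply a scaling choice (illustrative only, with synthetic data; not M 4 code):

          X <- matrix(rnorm(30 * 4), nrow = 30)
          pca_corr <- prcomp(X, center = TRUE, scale. = TRUE)   # correlation-matrix PCA (as in M 4 )
          pca_cov  <- prcomp(X, center = TRUE, scale. = FALSE)  # covariance-matrix PCA, for contrast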

      2. Input files

        M 4 has only three input files, and only two of them are used for a given run of the program:

        • MMPEInputData_ModelBuildingMode.txt The input data file for the training phase of M 4 application. These are the historical data used to fit the M 4 modeling suite. The file is only required during the build step. Don’t change the name of the file.


          The file contains an (N+1)-row by (M+2)-column matrix, where N is the number of samples and M is the number of predictors. The first column is, in the canonical use-case for western US statistical WSF, the year of the observation, but could conceivably be a different sort of numerical sample index; the second column gives the observed flow volume in that year, i.e., the predictand; and subsequent columns give the observed values of various predictor variables in that year, which in the canonical use-case will typically be quantities like SWE, accumulated precipitation, and so forth (see previous sections). The top row is a header, providing information about each of the columns; avoid using ambiguous wildcard characters or spaces in the title for a given column. Columns are tab-separated.

          If the genetic algorithm in M 4 is employed for feature optimization, these predictor variables collectively constitute a pool of candidates, and the GA will pick the ones that work best (see the definition of ‘best’ in Section 4.2.1 or the Appendices) for a given model. This is done independently for each of the six constituent modeling systems, as each of the corresponding supervised regression-type models has its own strengths and weaknesses and may work best with a different subset of the candidate predictors. The canonical western US statistical WSF use-case is built around several multicollinear predictors. If feature optimization is used, at least three, and in normal WSF use typically somewhere around a half-dozen to two dozen, candidate input variables would be provided in this file.

          If on the other hand the genetic algorithm is not used in M 4 , all the input variables provided in this file will be used in all the models. In this case, only one predictor variable is necessary (though for canonical western US statistical WSF applications more should generally be provided; see above).

          Here’s a complete example of the contents of MMPEInputData_ModelBuildingMode.txt, using data from 1986 through 2015:


          Year  Obs      LM_P  LM_SWE  Sig_P  Sig_SWE  Sil_P  Sil_SWE
          1986  33.036   9.3   0       13.8   1        16.6   8.3
          1987  50.664   11.2  5.1     15.8   6.9      19.8   10.5
          1988  34.106   9.2   4.2     11.5   7.7      17.1   10.8
          1989  12.24    4.7   0       7.2    0        9.8    5.4
          1990  15.647   5.4   1.1     8.3    4.3      12.8   7.2
          1991  97.577   7.8   1.6     16.4   7.9      20.2   10.8
          1992  116.254  11.2  6.8     14.7   9.3      20     13.5
          1993  78.318   11    4.7     28.4   12.8     29.2   20.4
          1994  18.621   4.5   0.5     8.3    2.5      12.2   8.2
          1995  39.69    10.5  0.1     23.6   5.3      21.6   8.1
          1996  8.884    3.4   0.1     6.5    0.5      8.5    6.1
          1997  43.454   9.7   3.8     12.8   4.9      15.2   11.4
          1998  84.973   7.3   4       14.8   9.4      18.3   13.7
          1999  10.171   5.4   0       9      0        7.4    1
          2000  9.007    2.1   0       1.9    0        1.2    1.1
          2001  34.079   10.2  3.4     14.3   4.7      22.8   10.8
          2002  8.822    2.5   0       5.1    1.4      4.6    4.1
          2003  19.799   4.1   0.2     7      3.4      12.2   6.8
          2004  46.132   4.8   0.4     7.2    4.5      12.3   6.1
          2005  54.77    10.2  0.3     21.5   4.2      22.5   11.5
          2006  9.223    2.3   0       3.1    0        3.4    0.6
          2007  33.017   5.3   0.8     13.3   6.6      12.6   7.5
          2008  30.537   5     0.2     11.6   0.3      19.1   8.4
          2009  14.418   5.3   0       5.9    0        13.1   6.2
          2010  93.842   8.5   5.1     15.5   13.4     18.7   16.4
          2011  9.288    2.8   0.1     4.7    0        5.5    3.9
          2012  16.623   8.3   0       12.3   4.9      11.5   7
          2013  19.035   3.7   0.4     6.2    2        9      5.5
          2014  13.127   3.6   0       2.6    0        5.2    1.3
          2015  22.494   6.5   0       8.1    0        9.8    4.4


This is the predictand and forcing data to train an M 4 modeling suite that forecasts, on March 1, the total March-through-May flow volume of the Gila River near Gila in New Mexico. Thirty years of historical data are used in this example. The first column is the year of observation; the second column is the observed March 1-May 31 Gila flow volume at this location over those years; the third and fourth columns give the water-year-to-date (October 1 through February 28) accumulated precipitation and March 1 SWE at the Lookout Mountain SNOTEL site; the fifth and sixth columns do the same for the Signal Peak SNOTEL site; and the seventh and eighth columns do the same for the Silver Creek Divide SNOTEL site. If the genetic algorithm is invoked by the user, M 4 will pick, independently for each of its constituent modeling systems, which of these six predictors to keep. If the genetic algorithm isn’t used, all six predictors in this example will be used in each of the constituent modeling systems.

Here are a few tips on creating this input dataset. M 4 isn’t a magical exception to the garbage-in, garbage-out rule. One example of something to avoid is a time series predictor (i.e., a column in the data matrix shown above) with no, or almost no, variability (e.g., a SWE value from some SNOTEL site that is 0 or near-0 in all, or almost all, years for the Julian day of the forecast, which can happen for, say, a low-elevation SNOTEL site in a late-season forecast after significant snowmelt has occurred). Such a constant-valued dataset in general has no use for making predictions. More generally, the user has to employ some professional judgement – potentially based on prior experience with the watershed, and exploratory statistical or graphical analysis – to choose predictors for the input data matrix that are likely to offer predictive value. Also use sufficient sample sizes. M 4 ’s use of PCA-based dimensionality reduction, compact ML architectures, and a priori monotonicity and nonnegativity constraints in default applications collectively result in a comparatively small number of parameters having to be trained and, in effect, only over restricted ranges (see again background above and in the appendices). This has a few advantages, one of which is that it allows the use of what would normally be considered, in the AI community, tiny datasets – like the one shown above, containing only 30 samples. Testing has suggested it may have the capacity to work effectively in canonical western US statistical WSF use-cases as described above with as few as 20 samples. But 10 samples, for example, is insufficient to fit and test any statistical WSF model to be used in operational applications with much confidence, including even a simple linear regression, and it’s blatantly inadequate for a ML approach. Mistakes like these will result at best in poor models and at worst in a crash of one of the constituent machine learning algorithms (a short sketch of these basic input checks follows this paragraph). More broadly, recall that M 4 was explicitly and specifically designed to leverage the professional expertise of the hydrologic modeler during the process of feature engineering – in particular, when creating the input data pool – as a pragmatic means toward achieving theory-guided and explainable machine learning.
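For users who want a quick mechanical screen before a build run, the following is a minimal R sketch of the kinds of checks described above. It assumes a whitespace-delimited input file with a header row, as in the example shown earlier; the file name and the roughly-20-sample guideline are taken from this manual, and the threshold for "near-constant" is an arbitrary illustrative choice.

# Minimal pre-build sanity screen for the training input file (illustrative only)
dat <- read.table("MMPEInputData_ModelBuildingMode.txt", header = TRUE)

# Sample size: testing suggests M4 may work with as few as ~20 samples,
# but more is better; 10 is clearly inadequate (see discussion above)
if (nrow(dat) < 20) warning("Fewer than 20 samples; results may be unreliable.")

# Drop the Year and Obs columns to isolate the candidate predictors
pred <- dat[, -(1:2)]

# Flag (near-)constant predictors, e.g., a SNOTEL SWE column that is 0 in
# (almost) all years for this forecast date; such columns carry no signal
sds <- apply(pred, 2, sd)
if (any(sds < 1e-6)) {
  warning("Near-constant predictor(s): ",
          paste(names(pred)[sds < 1e-6], collapse = ", "))
}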

• MMPEInputData_ForecastingMode.txt The input data file for the forecasting phase of M 4 application. It must be identical to the MMPEInputData_ModelBuildingMode.txt file used to build the M 4 model suite that’s now being run in forecast mode, except:


          • There aren’t any columns for the year or for the observed flow volume; and

          • Below the header, there’s only a single row of data, which is the present value for the various candidate input predictors included in the build phase, arranged left to right in exactly the same order as they were in MMPEInputData_ModelBuildingMode.txt.

            Here is a complete example for the Gila River case considered immediately above:

            LM_P LM_SWE Sig_P Sig_SWE Sil_P Sil_SWE

            3.6 0 2.6 0 5.2 1.3


Given that the model was built on 1986-2015 data, these might be the observed values during forecast operations on, say, March 1, 2016, or March 1, 2017, and so forth. Note that the predictors, going from left to right (Lookout Mountain SNOTEL accumulated year-to-date precipitation, Lookout Mountain SNOTEL March 1 SWE, etc.), are in exactly the same order as in the training data file (see above). This file is only required when M 4 is being run in forecast mode. Don’t change the name of the file.
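As a quick consistency check before a forecast run, one might verify in R that the forecast-mode file has exactly the predictor columns of the training file, in the same order, and a single row of data. This is an illustrative sketch only, assuming whitespace-delimited files with header rows as in the examples above:

# Check forecast-mode file against the training file (illustrative only)
train <- read.table("MMPEInputData_ModelBuildingMode.txt", header = TRUE)
fcst  <- read.table("MMPEInputData_ForecastingMode.txt",   header = TRUE)

# Forecast file drops the Year and Obs columns but keeps predictor order
stopifnot(identical(names(fcst), names(train)[-(1:2)]))

# ...and contains exactly one row of current predictor values
stopifnot(nrow(fcst) == 1)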

• MMPE_RunControlFile.txt The user specifies here, by setting various parameters or flags, how they would like M 4 to run. This includes whether a build or forecast run is desired; for a build run, some controls over how that is approached; and for a forecast run, some outcomes from the build run and a couple of options. This file is required for every M 4 run. The example MMPE_RunControlFile.txt provided in the M 4 installation package includes:


          • Default values for every parameter or flag. Testing has suggested that these settings work well for the canonical use-case for statistical WSF in the western US. Such standard WSF use should generally not require diverging from these default values (see Section 4.3 below). There are two exceptions: num_cores, which is a parameter set by the user as part of the M 4 installation (see step 5 in Section 4.1 above), and a few settings in the bottom, forecast-mode portion of the file that need to be set by the user on the basis of the results from the build-mode M 4 run.

          • Annotation, on each line, of the variable name used within the M 4 code for that parameter or flag, followed by a brief description and some guidance on its allowable values and format.

        The following is drawn from the example file provided in the distribution. (Open the file and compare to the descriptions provided below.) The file is divided into three sections.

        The first section simply specifies whether a build or forecast run is desired, and whether the user wishes to switch on or off a log file:

        # BASIC PARAMETERS: CONTROL WHETHER INVERSE OR FORWARD RUN, AND WHETHER MESSAGES ARE SENT TO AN EXTERNAL LOG FILE

        "BUILD" (variable name: RunTypeFlag)

        Write in “BUILD” if a build run is desired, or “FORECAST” if a forecast run is desired. Use all caps and don’t drop the quotation marks.

        "Y" (variable name: errorlog_flag)

Do you want a log file to be created? Enter “Y” or “N”, again all caps; don’t drop the quotation marks. There is generally little reason to use anything other than “Y”, but the user is given the choice.

In the second section, the user provides information about how they would like a build run to proceed. Correctly formatted values must always be provided; however, M 4 ignores the values of these parameters and flags if RunTypeFlag = “FORECAST” (see above):

# RUN PARAMETERS FOR MODEL DEVELOPMENT (Used only if RunTypeFlag = "BUILD")

        # CONTROL FEATURE CREATION & SELECTION:

        "Y" (variable name: GeneticAlgorithmFlag)

        Use the genetic algorithm in M 4 for optimal predictor selection? If “Y”, code finds optimal combination of input variables, and which PCA modes to retain, separately for each of the modeling technologies, and fits each of the models accordingly. Use all caps and don’t drop the quotation marks.

        2 (variable name: MaxModes)

        If GeneticAlgorithmFlag = "Y" (ignored otherwise): specify the maximum number of PCA modes the genetic algorithm is allowed to retain as it tries to optimize feature selection, counting upward from the leading mode, which is mode 1.

        The leading mode is always retained, so the minimum allowable value of MaxModes is 1. Note that if MaxModes is set to its minimum allowable value of 1, i.e., if only the leading PCA mode is retained, then the genetic algorithm only optimizes which predictors in the candidate pool to retain.

        If MaxModes is set to 2, for example, the genetic algorithm may retain PCA modes 1 and 2; or it may only retain the leading mode, depending on whether or not mode 2 helps improve prediction skill. Note that the genetic algorithm may skip modes (except for mode 1, which as noted is always retained), e.g., if MaxModes is 4 it might choose to keep modes 1, 2, and 4, skipping 3; in practice, though, this generally doesn’t happen often for the canonical WSF use-case described above in this manual.

The following is some guidance and constraints around the largest allowable value. The largest value MaxModes can take is 4; that is, the 4th PCA mode is the highest available in the current version of M 4 . This is thought to be generally far more than needed for canonical western US WSF use-cases, where often only the first, and rarely more than the first and second, PCA modes are retained in operational WSF models. Additionally, because a principal component analysis of an M-variable by N-sample dataset will produce M modes, MaxModes must also be no larger than the number of predictors that the user provides in the candidate predictor pool of MMPEInputData_ModelBuildingMode.txt (a toy illustration follows below). It should also be recalled that a goal of PCA in applications like the canonical statistical WSF use-case in the western US is data compression/dimensionality reduction from a multicollinear data pool, so that as a matter of best practice, MaxModes should in general be set << M.
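To see why MaxModes is bounded by the number of predictors, recall that a PCA of M variables returns exactly M modes; base R’s prcomp() makes this easy to check. A toy illustration with synthetic data (for demonstration only; real multicollinear WSF data would concentrate variance in the leading mode, unlike this random example):

# PCA of M = 6 synthetic predictors over N = 30 samples
set.seed(42)
X <- matrix(rnorm(30 * 6), nrow = 30, ncol = 6)
pca <- prcomp(X, center = TRUE, scale. = TRUE)
ncol(pca$rotation)   # returns 6: one mode per predictor, so MaxModes <= M
summary(pca)         # variance explained by each mode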

        "Y" (variable name: DisplayFlag)

        If GeneticAlgorithmFlag = "Y" (ignored otherwise), display to screen the progress of genetic algorithm? (note: progress written to file irrespective of this setting). Use all caps and don’t drop the quotation marks.

        2 (variable name: MinNumVars)

        If GeneticAlgorithmFlag = "Y" (ignored otherwise): specify the minimum allowable number of input variables to retain. Discussion of this parameter and how it’s used was provided in the description of GA-PredictorSelection-Module_MkII.R in Section 4.2.1. It cannot be lower than 2, and obviously it can’t be higher than the number of candidate predictors provided by the user in the input data file, MMPEInputData_ModelBuildingMode.txt. In general, the default value of 2 can be used.

        25 (variable name: GAPopSize)

        If GeneticAlgorithmFlag = "Y" (ignored otherwise): specify the population size to be used in the genetic algorithm. The larger the population, the better (roughly speaking) the result, up to a point, but the longer the build run. GAPopSize = 25 with GANumGens = 10 (see below) is a good compromise for canonical WSF use-cases of the type described above in this manual and is a recommended default. If in a rush, GAPopSize = 15 with GANumGens = 7 generally gives usable results with a shorter run time. If the size of the candidate predictor pool is larger, or if build-phase run time is not an issue, it may be a good idea to run a larger genetic algorithm problem to help ensure that the global best estimate is obtained. GAPopSize = 100 with GANumGens = 15, for example, can take hours to turn but will more reliably obtain a truly optimal feature selection for a canonical WSF use-case scenario, though the improvement will often be modest relative to the default, and there’s a point of diminishing returns with increasing GAPopSize and GANumGens.

        10 (variable name: GANumGens)

        If GeneticAlgorithmFlag = "Y" (ignored otherwise): specify the number of generations used in the genetic algorithm. See description of GAPopSize above.

        1,2 (variable name: ManualPCSelection)

        If GeneticAlgorithmFlag = "N" (ignored otherwise): specify which PCA modes to fit the models to. 1 = leading mode, 2 = second mode, etc., 3 = third mode, etc.; use comma between modes if more than one retained; the same modes will be used for all modeling

        techniques; all input variables in the input data file will be used; modes may be skipped, except for the leading mode; the 4th PCA mode is highest allowable; recall that a PCA returns only as many modes as there are input variables to it, and that a goal of PCA in an application like this is data compression/dimensionality reduction, so in general it must be no larger than and generally should be << M, the number of predictor variables provided by the user in MMPEInputData_ModelBuildingMode.txt.

        # CONTROL ENSEMBLE GENERATION:

        "Y" (variable name: AutoEnsembleFlag)

        Use automatic ensemble generation? If "Y", the code starts with default ensemble members, automatically checks ensemble output for non-physical (negative) values, and adjusts member composition accordingly.

        The 6 default ensemble members are linear regression with post hoc Box-Cox transform space-based prediction intervals (denoted PCR_BC below); quantile regression (PCQR); a monotone artificial neural network with Box-Cox transform space-based prediction intervals (PCANN_BC); a monotone composite quantile regression neural network (PCMCQRNN); random forests with Box-Cox transform space-based prediction intervals (PCRF_BC); and a support vector machine with Box-Cox transform space-based prediction intervals (PCSVM_BC).

        If instead set to “N”, the user specifies which models to include in the ensemble mean forecast distribution using the next 10 run control parameters described immediately below. This capability is rarely used, and AutoEnsembleFlag should generally be left at the default, “Y”. In that case, the values used for the next 10 run control parameters below don’t matter, though some value, in the correct format, has to be provided.

        However, it can be useful to provide the user with some flexibility around which models, and which variants of which models, to include in the ensemble mean forecast distribution.

For example, PCR and PCR_BC (see below) are the same linear regression model, producing the same best estimate, but one (PCR) uses conventional (stationary Gaussian) prediction intervals around the best estimate, much like the current NRCS PCR model, and the other (PCR_BC) uses prediction intervals calculated in a similar fashion but in Box-Cox transform space. The latter is in general superior because it seamlessly and automatically accommodates non-Gaussian and heteroscedastic error distributions yet can also adequately capture Gaussian and homoscedastic conditions when appropriate. However, once in a blue moon, PCR_BC produces unrealistically wide prediction intervals, a quirk that will be addressed in some future version of M 4 , and in such cases the user may prefer to use PCR instead in the final ensemble mean forecast distribution. Similar considerations apply for PCANN vs. PCANN_BC, PCRF vs. PCRF_BC, and PCSVM vs. PCSVM_BC. Note that in this case, the default values for the next 10 run control parameters as provided in the descriptions below and in the M 4 standard distribution are set in such a way that, if the user wishes to avoid the Box-Cox transform space-based prediction intervals altogether, the user merely has to switch AutoEnsembleFlag from “Y” to “N”. Recall, though, that this also turns off the negativity-check and ensemble-pruning algorithm.

(Note also that in very rare cases, the post hoc Box-Cox transform space-based prediction intervals are so wide they include infinity, which would cause subsequent visualization steps in M 4 , such as diagnostic plots, to crash. To avoid this, an algorithm tests for Inf values in any of the prediction intervals for PCR_BC, and if found, all the PCR_BC prediction intervals are replaced by those from PCR. In this case, the outputs for PCR and PCR_BC are identical and represent PCR only. When this happens, it is immediately obvious on inspection of the output files; additionally, one of the output files (see the detailed list of output files provided in a subsequent section) explicitly tracks whether this has happened, so there is no room for confusion as to what has been produced. The same applies for the other three models in M 4 that are available with both conventional prediction intervals and Box-Cox transform space-based intervals: the monotone artificial neural network, support vector machine, and random forests. None of this applies to the two (nonlinear neural network, and linear) quantile regression methods, which intrinsically generate non-Gaussian heteroscedastic prediction intervals.)
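The fallback just described amounts to a simple non-finiteness test. Purely as an illustration of the logic (the variable names and values below are hypothetical stand-ins, not M 4 internals):

# Toy stand-ins for the two sets of prediction bounds (hypothetical values;
# columns = 90%, 70%, 30%, 10% exceedance volumes for one sample)
pcr_bounds    <- matrix(c(5, 8, 14, 19), nrow = 1)
pcr_bc_bounds <- matrix(c(4, 7, 16, Inf), nrow = 1)

# If any Box-Cox-space interval is non-finite, fall back to the conventional
# PCR intervals; the PCR_BC outputs then duplicate PCR's
if (any(!is.finite(pcr_bc_bounds))) {
  pcr_bc_bounds <- pcr_bounds
}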

        As another example of an unusual case in which the user might wish to switch AutoEnsembleFlag to N, in an application that is known a priori to be strongly nonlinear, the user may wish to exclude linear regression (PCR) and linear quantile regression (PCQR) from the ensemble.

Note, though, that all training runs of M 4 produce all 10 (the 6 defaults, plus non-Box-Cox prediction interval versions of PCR, PCRF, PCANN, and PCSVM) sets of individual model outputs and write them to files, so the user can simply create their own multi-model ensemble mean with a fully custom ensemble composition post hoc, outside of M 4 (a short sketch follows below). This may be more practical than setting AutoEnsembleFlag to “N” and specifying which ensemble members to retain using the following flags, because in many cases you might not know until after you’ve completed the entire M 4 build run, and evaluated its results, whether you want to change the ensemble composition. In general, however, one would want to avoid using, for example, both PCANN and PCANN_BC, as they differ only in their prediction intervals, and including both therefore doesn’t add another independent best estimate to the multi-model ensemble. Note also that the diagnostics M 4 produces for the final multi-model ensemble mean it generates, like the statistical measures of model fit in the output file ensemble_StandardPerformanceReportingSuite.csv (see output file list below), are based on the ensemble composition it generates, not what the user might have manually created after the run finished.
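For example, a custom post hoc ensemble mean of best estimates can be assembled outside M 4 in a few lines of R. The sketch below assumes the per-model prediction files listed later in this manual, with the best estimate in the fourth column as described in the output-file section; the particular member mix shown is arbitrary and purely illustrative:

# Arbitrary custom member mix (illustrative only)
files <- c("PCQR_predictions.csv",
           "PCMCQRNN_predictions.csv",
           "PCRF_predictions_BC.csv")
preds <- lapply(files, read.csv)

# Column 4 holds each model's best estimate (see output file layout below)
best  <- sapply(preds, function(d) d[[4]])

# Row-wise average of the selected members' best estimates
custom_ensemble_best <- rowMeans(best)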

AutoEnsembleFlag pertains to the training phase of M 4 . A forecast run of M 4 (see below) produces and writes to file a prediction from all 10 of these models as well. For details on the ensemble mean forecast distribution that forward runs generate, see below; note that if the user decided on the basis of the training run to keep a custom mix of constituent models, this may not be reflected in the forward-run ensemble mean, and the user may have to construct the forward-run ensemble mean manually in accordance with what they decided the model mix should be on the basis of the training run. Recall that it is generally poor practice to use different mixes of constituent models in the forecast multi-model ensemble mean than in the training-phase multi-model ensemble mean (in part because, if the user takes such an unusual step, the out-of-sample performance measures for the training-phase multi-model ensemble mean, like cross-validated RMSE, may not speak to the performance that could be expected from the different custom mix of models used in the forward run).

        Overall, the current version of M 4 is built around the six default models and the negativity-check and ensemble-pruning algorithm (i.e., AutoEnsembleFlag = Y), which testing has suggested works well for the canonical western US WSF use-case described above. Long story short: testing and experience to date applying M 4 to NRCS water supply forecast problems suggests that the user should, generally speaking, only stray away from this default if they can clearly and objectively demonstrate a good reason to, and if they are prepared to take on the additional training, responsibility, and work around building and handling custom ensemble compositions.

        Use caps and don’t drop the quotation marks.

        "Y" (variable name: EnsembleFlag_PCR)

        If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from conventional linear regression (actually principal component regression because of the PCA preprocessing) in the multi-model ensemble? The prediction intervals in this case are calculated using a simple heuristic widely used in WSF applications of linear PCR assuming Gaussian homoscedastic error distributions (see review in Section 3.2.1 and the Appendices). The best-fit model and its best-estimate predictions are identical to PCR_BC (see immediately below) but the prediction intervals are, in general, different. Capital "Y" = yes, capital "N" = no. Use all caps and don’t drop the quotation marks.

        "N" (variable name: EnsembleFlag_PCR_BC)

If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from linear regression (again, actually principal component regression because of the PCA preprocessing) with Box-Cox transform-space prediction bounds in the multi-model ensemble? This is a widely used heuristic for estimating prediction intervals in WSF applications of PCR but is modified here to allow non-Gaussian and heteroscedastic error distributions, though in general it can also adequately capture Gaussian and homoscedastic distributions if appropriate. It is one of the default models in M 4 when automatic ensemble generation is used (see above). The best-fit model and its best-estimate predictions are identical to PCR (see immediately above) but the prediction intervals are, in general, different. Capital "Y" = yes, capital "N" = no. Use all caps and don’t drop the quotation marks.

        "Y" (variable name: EnsembleFlag_PCQR)

        If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from quantile regression (actually principal component quantile regression due to the use of PCA preprocessing, hence “PCQR”) in the multi- model ensemble? Quantile regression is a variant of linear regression that calculates multiple regression lines representing quantiles of the error distribution, and permits non-Gaussian and heteroscedastic residuals. It is one of the default models in M 4 when automatic ensemble generation is used (see above). Capital "Y" = yes, capital "N" = no. Use all caps and don’t drop the quotation marks.

        "Y" (variable name: EnsembleFlag_PCANN)

        If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from the monotone artificial neural network (actually a principal component artificial neural network because of the PCA preprocessing, hence “PCANN”) in the multi-model ensemble? The prediction intervals in this case are calculated using a simple heuristic widely used in WSF applications of linear PCR assuming Gaussian homoscedastic error distributions (see EnsembleFlag_PCR above). The best-fit model and its best-estimate predictions are identical to PCANN_BC (see immediately below) but the prediction intervals are, in general, different. Capital "Y" = yes, capital "N" = no. Use all caps and don’t drop the quotation marks.

        "N" (variable name: EnsembleFlag_PCANN_BC)

        If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from the artificial neural network (actually a principal component artificial neural network because of the PCA preprocessing, hence “PCANN”) with Box-Cox transform-space prediction bounds in the multi-model ensemble? This is a widely used heuristic for estimating prediction intervals in WSF applications of PCR but is modified to allow non-Gaussian and heteroscedastic error distributions (see EnsembleFlag_PCR_BC above). It is one of the default models in M 4 when automatic ensemble generation is used (see above). The best-fit model and its best-estimate predictions are identical to PCANN (see immediately above) but the prediction intervals are, in general, different. Capital "Y" = yes, capital "N" = no. Use all caps and don’t drop the quotation marks.

"N" (variable name: EnsembleFlag_PCRF)

If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from the random forests model (actually principal component random forests because of the PCA preprocessing, hence “PCRF”) in the multi-model ensemble? The prediction intervals in this case are calculated using a simple heuristic widely used in WSF applications of linear PCR assuming Gaussian homoscedastic error distributions (see EnsembleFlag_PCR above). The best-fit model and its best-estimate predictions are identical to PCRF_BC (see immediately below) but the prediction intervals are, in general, different. Capital "Y" = yes, capital "N" = no. Use all caps and don’t drop the quotation marks.

        "N" (variable name: EnsembleFlag_PCRF_BC)

If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from the random forests with Box-Cox transform-space prediction bounds in the multi-model ensemble? This is a widely used heuristic for estimating prediction intervals in WSF applications of PCR but is modified to allow non-Gaussian and heteroscedastic error distributions (see EnsembleFlag_PCR_BC above). It is one of the default models in M 4 when automatic ensemble generation is used (see above). The best-fit model and its best-estimate predictions are identical to PCRF (see immediately above) but the prediction intervals are, in general, different. Capital "Y" = yes, capital "N" = no. Use all caps and don’t drop the quotation marks.

        "Y" (variable name: EnsembleFlag_PCMCQRNN)

        If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from the monotone composite quantile regression neural network (actually principal component monotone composite quantile regression neural network due to the use of PCA preprocessing, hence “PCMCQRNN”) in the multi-model ensemble? This is an ANN variant that calculates multiple nonlinear regression lines representing quantiles of the error distribution, and permits non-Gaussian and heteroscedastic residuals. It is one of the default models in M 4 when automatic ensemble generation is used (see above). Capital "Y" = yes, capital "N" = no. Use caps and don’t drop the quotation marks.

        "Y" (variable name: EnsembleFlag_PCSVM)

If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from the support vector machine (actually a principal component support vector machine because of the PCA preprocessing, hence “PCSVM”) in the multi-model ensemble? The prediction intervals in this case are calculated using a simple heuristic widely used in WSF applications of linear PCR assuming Gaussian homoscedastic error distributions (see EnsembleFlag_PCR above). The best-fit model and its best-estimate predictions are identical to PCSVM_BC (see immediately below) but the prediction intervals are, in general, different. Capital "Y" = yes, capital "N" = no. Use all caps and don’t drop the quotation marks.

        "N" (variable name: EnsembleFlag_PCSVM_BC)

If AutoEnsembleFlag = "N" (ignored otherwise, but a value is always required): for the case where the user manually specifies a la carte which models to include in the ensemble, include output from the support vector machine with Box-Cox transform-space prediction bounds in the multi-model ensemble? This is a widely used heuristic for estimating prediction intervals in WSF applications of PCR but is modified to allow non-Gaussian and heteroscedastic error distributions (see EnsembleFlag_PCR_BC above). It is one of the default models in M 4 when automatic ensemble generation is used (see above). The best-fit model and its best-estimate predictions are identical to PCSVM (see immediately above) but the prediction intervals are, in general, different. Capital "Y" = yes, capital "N" = no. Use all caps and don’t drop the quotation marks.

        # CONTROL CONFIGURATION & PARALLELIZATION OF NEURAL NETWORKS (BOTH mANN AND MCQRNN):

        "Y" (variable name: AutoANNConfigFlag)

        Use automatic configuration selection in both the artificial neural network (ANN) and monotone composite quantile regression neural network (MCQRNN)? If "Y", code starts with a basic configuration, automatically compares prediction quality to that of other models, and uses an alternative configuration if necessary and appropriate, using information theoretic and other criteria. “Configuration” in this context refers to various hyperparameters controlling network properties like neural network topology and bootstrap aggregation. The available configurations in this AutoML step were determined by extensive experimentation with canonical use-cases for statistical WSF in the western US (see above). It is recommended for most WSF users that this be left at the default value of “Y”. Use all caps and don’t drop the quotation marks.

        25 (variable name: ANN_config1_cutoff)

        If AutoANNConfigFlag = "Y" (ignored otherwise, but some value is always required): this is a hyperhyperparameter in the AutoML routine for autoconfiguring the two neural networks (ANN and MCQRNN) in M 4 ; maximum allowable % performance deficit in R 2 or RMSE of default configuration relative to other (non-neural) models.

        1 (variable name: mANN_config_selection)

        If AutoANNConfigFlag = "N" (ignored otherwise): if 1, use basic monotone artificial neural network configuration (1 hidden-layer neuron with no bagging), if 2 use standard backup (alternative) configuration (2 hidden-layer neurons with 20 bootstraps), if 3 use custom configuration (to be hard-coded by an experienced user within the ANN module itself). Note that because there is only one output neuron (predictand/target), and a maximum of four and typically only one or two input neurons (predictor PCA modes), the neural networks used here are, deliberately and by construction, extremely parsimonious; moreover, the monotonicity constraint generally used (see below) for applications of M 4 to canonical WSF use-cases greatly reduces the flexibility required by the neural networks. Hence, very compact hidden layers are suitable for the canonical use-case applications that are the primary focus to date of M 4 development and testing. Expert users, however, may invoke mANN_config_selection = 3 and hard-code the module implementing the ANN to use a multilayer perceptron topology just about as complex as they like.
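For orientation, the two standard configurations can be illustrated with the monmlp package, a publicly available R implementation of monotone multilayer perceptrons. This is shown purely as an illustration of the hyperparameters involved (hidden-layer size, monotonicity, bagging); no claim is made here about which package or settings the M 4 neural network module itself calls, and the data below are synthetic:

library(monmlp)
set.seed(1)
x <- matrix(rnorm(30), ncol = 1)                    # e.g., leading PC scores
y <- matrix(2 * x + rnorm(30, sd = 0.3), ncol = 1)  # synthetic target

# Basic configuration: 1 hidden-layer neuron, no bagging (cf. selection = 1)
fit1 <- monmlp.fit(x, y, hidden1 = 1, monotone = 1,
                   bag = FALSE, n.ensemble = 1)

# Backup configuration: 2 hidden-layer neurons, 20 bootstraps (cf. selection = 2)
fit2 <- monmlp.fit(x, y, hidden1 = 2, monotone = 1,
                   bag = TRUE, n.ensemble = 20)

pred1 <- monmlp.predict(x, weights = fit1)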

        1 (variable name: MCQRNN_config_selection)

        If AutoANNConfigFlag = "N" (ignored otherwise): As above for mANN_config_selection but for the monotone composite quantile regression neural network

        "Y" (variable name: ANN_monotone_flag)

        Enable monotonicity constraint in the monotone artificial neural network and monotone composite quantile regression neural network? If "Y" then relationships between all predictors (i.e., principal component scores for the retained PCA modes) and the predictand will be forced to be monotonic. This flag is not used if configuration selection is 3 but a value must always be given.

        "Y" (variable name: ANN_parallel_flag)

Enable simple parallelization across processor cores? If "Y", a foreach loop is used to perform cross-validation for the two neural network prediction systems (the monotone artificial neural network and the monotone composite quantile regression neural network).

        4 (variable name: num_cores)

        If ANN_parallel_flag = "Y" (ignored otherwise): set to number of processor cores across which neural network cross-validation tasks will be distributed. There is no default for this value; the user must find the correct value as per the M 4 installation instructions in step 5 of Section 4.1.

        # CONTROL SVM CONFIGURATION:

        2 (variable name: SVM_config_selection)

        Set AutoML hyperhyperparameter for support vector machine: if 1, tune gamma, epsilon, and cost; if 2 tune epsilon and cost only; if 3 no automated tuning is performed and hyperparameters are hard-coded by experienced user at corresponding location within the R module implementing SVM. Experimentation with canonical statistical WSF use-cases in the western US suggested that 1 may tend to excessively overtrain; the default configuration is 2, using the default gamma value in the next run control parameter below.

        0.2112 (variable name: fixedgamma)

If SVM_config_selection = 2 (ignored otherwise): specify the fixed value of gamma to use in the support vector machine if partial tuning is selected. The default value is 0.2112.
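For reference, partial tuning of this kind can be expressed with, e.g., the e1071 package's tune.svm(), shown here for illustration only: the candidate grids below are arbitrary, the data are synthetic, and no claim is made that these match the grids hard-coded in the M 4 SVM module.

library(e1071)
set.seed(1)
x <- matrix(rnorm(60), ncol = 2)                 # e.g., retained PC scores
y <- x[, 1] + 0.5 * x[, 2] + rnorm(30, sd = 0.2) # synthetic target

# Analogue of SVM_config_selection = 2: tune epsilon and cost only,
# with gamma held at a fixed value
tuned <- tune.svm(x, y,
                  gamma   = 0.2112,
                  epsilon = seq(0.1, 0.5, by = 0.1),
                  cost    = 2^(0:4))
tuned$best.parameters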

        # CONTROL SOME OUTPUTS:

        "Y" (variable name: PC1form_plot_flag)

Create scatter plots of the observed leading-mode principal component versus observed flow and predicted flow, for each of the six models in M 4 ? Apart from producing or not producing this plot, the choice has no impact on M 4 functionality. In general, this is often left at “Y”. However, users should be aware of some complexities in this graphic.

        If more than one PCA mode is retained, this plot is generally not very meaningful. Also, bear in mind that if feature optimization, i.e., the genetic algorithm, is employed, then in general each of the six models in M 4 will use (1) different input variables and (2) different PCA modes (unless MaxModes is 1), so each model will in this case have its own set of observed PC1 values.

        "Y" (variable name: PC12form_plot_flag)

        Create contour plots of model predictions as a function of the first and second PCA modes, for each model? This will only be done if the second PCA mode is available, e.g., retained by the genetic algorithm for that model. Note also that if more than two PCA modes are retained, the meaning of this plot may be difficult to interpret.

        In the third and final section of the run control file, the user provides information about how they would like a forecast run to proceed. Correctly formatted values must always be provided; however, M 4 ignores the values of these parameters and flags if RunTypeFlag = “BUILD” (see above). Setting these values requires some hands-on work from the user, but most of the values are actually determined by (i) what the user asked for during the build phase and/or (ii) what M 4 produced at the end of that build phase; in general, little to no user judgement is required.

        # RUN PARAMETERS FOR FORECASTING (Used only if RunTypeFlag = "FORECAST")

        # SPECIFY WHICH PCA MODES TO RETAIN:

        1,2 (variable name: PCSelection_Frwrd_LR)

Specify which PCA modes to retain for linear regression (PCR); 1 = leading mode, 2 = second mode, etc.; if there is more than one mode, separate by commas. It is a combination of up to the first four PCs (as the current M 4 version functionality is limited by design to a maximum of four PCA modes, as discussed above), determined by what was either manually specified, or determined by the genetic algorithm, during model-building for PCR. Recall that the leading PCA mode is always retained but other modes may be skipped. For example, if only the leading PCA mode is retained, then the value of PCSelection_Frwrd_LR is 1, but if the first, second, and fourth PCA modes are retained then the value is 1,2,4. Recall also that the maximum number of PCA modes available is limited by the number of input variables the user provided in the input data file (again, see above).

        If automatic feature optimization wasn’t used during model training (i.e., if GeneticAlgorithmFlag was set to “N” during the build phase) then the value is exactly identical to the value the user specified for ManualPCSelection in the build phase, and may simply be copied and pasted from there.

If automatic feature optimization was used during model training (i.e., if GeneticAlgorithmFlag was set to “Y” during the build phase) it’s a little more complicated. In that case, the value of PCSelection_Frwrd_LR is taken by the user from the optimal genetic sequence stored by M 4 at the end of model training. This is the last line (‘Best Solution’) in the build-phase M 4 output file, GA_RunSummary_PCR.txt (not GA_RunLog_PCR.txt, which is something slightly different); open the example file in the M 4 distribution and take a look.

That genetic algorithm output information in GA_RunSummary_PCR.txt is used as follows. The first M entries in this binary sequence switch on (1) or off (0) predictor variates, going left to right, as listed in the input data file MMPEInputData_ModelBuildingMode.txt, which are of course the same M input variables listed in the same order in the forecast-phase data file, MMPEInputData_ForecastingMode.txt (we return to this below). There are then none, one, two, or three more binary digits in the sequence after those first M bits, depending on how many PCA modes were specified by the user during the build phase for possible retention. The leading PCA mode is always retained, so it is not encoded here; if MaxModes was set to 1 by the user during the build phase, then the ‘Best Solution’ genetic sequence in GA_RunSummary_PCR.txt is only M binary digits long, and PCSelection_Frwrd_LR is simply set by the user to 1. If MaxModes was set to 2, then the sequence is M+1 binary digits long, with the (M+1)th digit indicating whether the second PCA mode was retained (1) or not (0). If that digit is 1, the user would enter 1,2 for PCSelection_Frwrd_LR because both the first and second modes are retained; if it is 0, the user would enter 1 because only the first mode is retained. The idea is similar for MaxModes = 3 or 4. If the user set MaxModes = 4 in the build phase, then the sequence is M+3 binary digits long, with the (M+1)th digit indicating whether the second PCA mode was retained (1) or not (0), the (M+2)th digit doing the same for the third mode, and the (M+3)th digit doing the same for the fourth mode. For example, if MaxModes = 4 and the last three digits in the binary string turned out to be 1 0 1, then, recalling that the leading PCA mode is always used, the user would set PCSelection_Frwrd_LR to 1,2,4.

Further instructions are given in Section 4.3 below, and examples are provided in the M 4 distribution files. There is no default value for PCSelection_Frwrd_LR; the user must provide the correct value, and the value shown here is just a randomly chosen example from a past M 4 run. It may look complicated at first, but it is straightforward in practice; the short R sketch below illustrates the decoding.
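The following is a minimal R sketch of this decoding, for a hypothetical ‘Best Solution’ bit string. The helper function name is ours, introduced for illustration; it is not part of M 4 . It returns both the VariableSelection_Frwrd_* string (see below) and the PCSelection_Frwrd_* string:

# Decode a GA 'Best Solution' bit string into the two forecast-run settings.
# bits: 0/1 vector copied from GA_RunSummary_<model>.txt; M: number of
# candidate predictors; MaxModes: as set during the build phase.
decode_best_solution <- function(bits, M, MaxModes) {
  vars <- paste(bits[1:M], collapse = " ")      # VariableSelection_Frwrd_* string
  modes <- 1                                    # leading PCA mode always retained
  if (MaxModes > 1) {
    extra <- bits[(M + 1):(M + MaxModes - 1)]   # flags for modes 2..MaxModes
    modes <- c(1, 1 + which(extra == 1))
  }
  list(VariableSelection = vars,
       PCSelection = paste(modes, collapse = ","))
}

# Example from the text: M = 6 predictors, MaxModes = 4, last three bits 1 0 1
decode_best_solution(c(0, 1, 0, 1, 0, 1, 1, 0, 1), M = 6, MaxModes = 4)
# $VariableSelection = "0 1 0 1 0 1"; $PCSelection = "1,2,4"
# (With GeneticAlgorithmFlag = "N", VariableSelection is simply all 1s, e.g.,
# paste(rep("1", 6), collapse = " ").)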

        1 (variable name: PCSelection_Frwrd_QR)

        As above in PCSelection_Frwrd_LR, but for quantile regression (PCQR) instead. In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCQR.txt. There is no default value for PCSelection_Frwrd_QR; the user must provide the correct value.

        1,2 (variable name: PCSelection_Frwrd_mANN)

        As above with PCSelection_Frwrd_LR, but for the monotone artificial neural network (PCANN). In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCANN.txt. There is no default value for PCSelection_Frwrd_mANN; the user must provide the correct value.

        1 (variable name: PCSelection_Frwrd_MCQRNN)

As above with PCSelection_Frwrd_LR, but for the monotone composite quantile regression neural network (PCMCQRNN). In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCMCQRNN.txt. There is no default value for PCSelection_Frwrd_MCQRNN; the user must provide the correct value.

        1 (variable name: PCSelection_Frwrd_RF)

        As above with PCSelection_Frwrd_LR, but for random forests (PCRF). In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCRF.txt. There is no default value for PCSelection_Frwrd_RF; the user must provide the correct value.

        1 (variable name: PCSelection_Frwrd_SVM)

        As above with PCSelection_Frwrd_LR, but for the support vector machine (PCSVM). In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCSVM.txt. There is no default value for PCSelection_Frwrd_SVM; the user must provide the correct value.

        # SPECIFY WHICH INPUT VARIABLES TO RETAIN: 1 TO KEEP, 0 TO OMIT, POSITIONS CORRESPOND TO LEFT-TO-RIGHT CONSECUTIVE LOCATIONS IN INPUT DATASET

        "0 1 0 1 0 1" (variable name: VariableSelection_Frwrd_LR)

Specify which of the input variables to use for linear regression, i.e., PCR. Use quotation marks, and single spaces with no commas between each binary digit in the string, as shown in the example above. This length-M binary string corresponds to the M input variables listed left-to-right in the build-phase data file, MMPEInputData_ModelBuildingMode.txt, which are of course the same M input variables listed in the same order in the forecast-phase data file, MMPEInputData_ForecastingMode.txt. A 1 indicates that the variable is retained; a 0 indicates that it is not.


If automated feature optimization wasn’t used, i.e., if GeneticAlgorithmFlag was set to “N” during the build phase, all the variables in the input data file are retained (see above), and the user simply sets VariableSelection_Frwrd_LR to a length-M string of 1s. For instance, if MMPEInputData_ModelBuildingMode.txt contained seven variables, then set VariableSelection_Frwrd_LR to “1 1 1 1 1 1 1”.

If automated feature optimization was used, i.e., if GeneticAlgorithmFlag was set to “Y” during the build phase, then VariableSelection_Frwrd_LR is set to the first M digits in the binary string at the bottom of the file, GA_RunSummary_PCR.txt, that was saved by M 4 during the build phase. (This string may be longer than M digits, depending on whether MaxModes was set to > 1; for setting VariableSelection_Frwrd_LR, ignore everything after the first M digits.) That is, the first M entries in the binary string at the end of GA_RunSummary_PCR.txt can simply be copied and pasted into VariableSelection_Frwrd_LR.

Further instructions are given in Section 4.3 below, and examples are provided in the M 4 distribution files. See also the description of variable PCSelection_Frwrd_LR above. There is no default value for VariableSelection_Frwrd_LR; the user must provide the correct value. The value shown here is just a randomly chosen example from a past run of M 4 .

        "0 1 0 1 1 0" (varianble name: VariableSelection_Frwrd_QR)

        As above in VariableSelection_Frwrd_LR, but for quantile regression (PCQR) instead. In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCQR.txt.

There is no default value for VariableSelection_Frwrd_QR; the user must provide the correct value.

        "0 0 1 1 1 1" (variable name: VariableSelection_Frwrd_mANN)

As above in VariableSelection_Frwrd_LR, but for the monotone artificial neural network (PCANN) instead. In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCANN.txt. There is no default value for VariableSelection_Frwrd_mANN; the user must provide the correct value.

        "0 1 0 1 0 1" (variable name: VariableSelection_Frwrd_MCQRNN)

As above in VariableSelection_Frwrd_LR, but for the monotone composite quantile regression neural network (PCMCQRNN) instead. In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCMCQRNN.txt. There is no default value for VariableSelection_Frwrd_MCQRNN; the user must provide the correct value.

        "0 0 0 1 1 0" (variable name: VariableSelection_Frwrd_RF)

As above in VariableSelection_Frwrd_LR, but for the random forests model (PCRF) instead. In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCRF.txt. There is no default value for VariableSelection_Frwrd_RF; the user must provide the correct value.

        "0 0 0 1 0 1" (variable name: VariableSelection_Frwrd_SVM)

        As above in VariableSelection_Frwrd_LR, but for the support vector machine (PCSVM) instead. In this case, if the genetic algorithm was employed in the build phase, the optimal genetic sequence is provided by the build-phase M 4 output file, GA_RunSummary_PCSVM.txt. There is no default value for VariableSelection_Frwrd_SVM; the user must provide the correct value.

        # HOUSEKEEPING FOR MONOTONIC NEURAL NETS:

        "Y" (variable name: ANN_monotone_flag_Frwrd)

Was the monotonicity constraint enabled for the neural networks (PCANN, PCMCQRNN) in the build run? Enter “Y” if it was, and “N” otherwise; use caps and keep the quotation marks. If default values were used during the build phase, this will always be “Y”.

        # SPECIFY WHETHER ENSEMBLE MEAN FORECAST DISTRIBUTION IS TO BE GENERATED AND IF SO FROM WHICH MODELS

        "Y" (variable name: Ensemble_flag_frwrd)

Create a multi-model ensemble mean forecast distribution by calculating the average of the forecast distributions (specifically, the 10, 30, 70, and 90% exceedance volumes and the best-estimate prediction) from the individual models in the forward runs? If set to "N" then only the forecast distributions from the individual models are written to file; if "Y" then the ensemble mean is additionally calculated and written to file. Although this flag is functional and the user can set it to “N” if they wish, it is largely an artifact of an early development version of M 4 and in general should always be set to “Y”. Use caps and don’t drop the quotation marks.

        "AUTO" (variable name: Ensemble_type_frwrd)

        If Ensemble_flag_frwrd = "Y" (ignored otherwise): "ALL" creates an ensemble mean from the full default model set (PCR-BC, PCQR, PCANN-BC, PCMCQRNN, PCRF-BC, PCSVM-BC), "AUTO" does the same but removes any individual ensemble members that may have been pruned from the final build-phase ensemble if M 4 ’s negativity check algorithm was switched on by the user during the build phase by setting AutoEnsembleFlag to “Y”.

        Whether such exclusions were needed, and if so which models were excluded, was recorded by M 4 in the build-phase output file, AutomatedEnsembleMemberExclusions.csv, which M 4 reads in during forecast runs if Ensemble_type_frwrd is set to “AUTO”. Generally speaking, the default for AutoEnsembleFlag is “Y” (see above) and the default for Ensemble_type_frwrd is “AUTO”. Use caps and don’t drop the quotation marks.
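Conceptually, the ensemble mean forecast distribution is just the member-wise average of the individual forecast distributions. A toy R sketch follows; the member values are arbitrary illustrative numbers, and the member set shown assumes no build-phase exclusions:

# Rows = ensemble members; columns = best estimate and the 90/70/30/10%
# exceedance volumes. All values below are arbitrary, for illustration only.
members <- rbind(
  PCR_BC   = c(18.0, 10.2, 14.1, 22.4, 27.9),
  PCQR     = c(17.2,  9.8, 13.5, 21.8, 26.5),
  PCANN_BC = c(18.6, 10.9, 14.8, 23.0, 28.8),
  PCMCQRNN = c(17.8, 10.1, 14.0, 22.1, 27.2),
  PCRF_BC  = c(18.3, 10.5, 14.4, 22.6, 28.1),
  PCSVM_BC = c(17.9, 10.0, 13.9, 22.2, 27.4)
)
colnames(members) <- c("best", "ex90", "ex70", "ex30", "ex10")

# Member-wise average = the ensemble mean forecast distribution
ens_mean <- colMeans(members)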

      3. Output files

        1. Output file naming conventions

          Some file name conventions used here are:

• PCR_BC: refers to output from the linear regression model within M 4 (i.e., principal component regression, given the PCA preprocessing used in all six modeling systems within M 4 ) when Box-Cox transform-space residuals are used as the basis for prediction interval estimation capable of handling non-Gaussian and heteroscedastic error distributions.


          • PCR: refers to output from the linear regression model in M 4 for (i) outputs that are specific to the use of classical Gaussian stationary prediction intervals as in most traditional applications of PCR to western US WSF, in contrast to the Box-Cox (BC) space method described above, and (ii) outputs which are the same irrespective of the method used for prediction interval estimation. The deterministic regression model and associated properties, like which input variables and PCA modes are retained for instance, are identical between the two cases, so for the most part that kind of information is not repeated separately for PCR_BC. Similarly, the best-estimate models and best-estimate predictions are identical between PCR and PCR_BC. See also detailed descriptions of “CONTROL ENSEMBLE GENERATION” flags in MMPE_RunControlFile.txt portion of Section 4.2.2 above.

          • PCRF_BC: as for PCR_BC above, but for random forests.


          • PCRF: as for PCR above, but for random forests.


          • PCSVM_BC: as for PCR_BC above, but for the support vector machine.


          • PCSVM: as for PCR above, but for the support vector machine.


          • PCANN_BC: as for PCR_BC above, but for the monotone artificial neural network.


          • PCANN: as for PCR above, but for the monotone artificial neural network.


          • PCQR: refers to output from linear quantile regression. Because quantile regression inherently generates non-Gaussian and heteroscedastic prediction intervals, there is no separate PCQR_BC.


          • PCMCQRNN: refers to output from the monotone composite quantile regression neural network. Because quantile regression techniques inherently generate non-Gaussian and heteroscedastic prediction intervals, there is no separate PCMCQRNN_BC.

        2. Output files created by build-step M 4 runs irrespective of whether feature optimization is used

          The following are output files created by M 4 during model-building runs (that is, if RunTypeFlag = “BUILD”), regardless of whether the genetic algorithm is switched on or off by the user in the run control file, that is, for GeneticAlgorithmFlag = “Y” or “N” alike.

          Files and their descriptions are grouped below by content. Many of these files contain information about the build-phase results that are required by M 4 during the forecasting-mode runs but which may not otherwise be of much interest to the user, whereas other files contain key results.

          • Standard performance metrics:

            • These files give a standard set of fit measures for each of the models, following the naming convention described in Section 4.2.3.1 above.

            • The files are:

              • ensemble_StandardPerformanceReportingSuite.csv

              • PCANN_BC_StandardPerformanceReportingSuite.csv

              • PCANN_StandardPerformanceReportingSuite.csv

              • PCMCQRNN_StandardPerformanceReportingSuite.csv

              • PCQR_StandardPerformanceReportingSuite.csv

              • PCR_BC_StandardPerformanceReportingSuite.csv

              • PCR_StandardPerformanceReportingSuite.csv

              • PCRF_BC_StandardPerformanceReportingSuite.csv

              • PCRF_StandardPerformanceReportingSuite.csv

              • PCSVM_BC_StandardPerformanceReportingSuite.csv

              • PCSVM_StandardPerformanceReportingSuite.csv

            • The core fit measures provided in these files for each model are:

              • Cross-validated (out-of-sample) coefficient of determination (R 2 ), a standard deterministic model fit metric, where 1 is perfect and lower values denote progressively poorer performance with 0 indicating no ability to reproduce the data

              • Cross-validated (out-of-sample) root mean square error (RMSE), a standard deterministic model fit metric, where 0 is perfect and higher values denote progressively poorer performance

              • Cross-validated (out-of-sample) ranked probability skill score (RPSS), a standard probabilistic model fit metric which in some sense gives an estimate of how good the prediction intervals are, where 1 is perfect and lower values denote progressively poorer performance; the value is based on post facto probabilities estimated on the assumption of either a homoscedastic Gaussian error distribution centered around the best estimate with a standard deviation equal to the RMSE (e.g., for PCR) or similarly but in Box-Cox transform space (e.g., for PCR_BC) which allows for heteroscedastic and non-Gaussian residuals; the exception is the two modeling systems based on quantile regression-type calculations, where the probability estimates are intrinsic to the technique

• Cross-validated (out-of-sample) categorical hit rates for the lowermost (HR-L), middle (HR-M), and uppermost (HR-H) terciles of observations, where perfect values are 1, and any value > 1/3 indicates the presence of skill relative to the climatology forecast (to borrow a term from the numerical weather prediction community); a short sketch after this list of measures illustrates the hit-rate calculation

            • Some subsidiary in-sample fit measures also included in these files are:

              • Whether the best-estimate prediction is negative-valued (i.e., for runoff volume, nonphysical) for any sample during the training period (BE<0?); note that for the ensemble mean, and if the option to use the nonnegativity pruning algorithm is selected by the user as it is in the default M 4 application, this metric is evaluated after any such ensemble member removal has taken place

• Whether the lowermost prediction bound considered (90% exceedance probability flow volume) is negative-valued (i.e., for runoff volume, nonphysical) for any sample during the training period (PB<0?); note that for the ensemble mean, and if the option to use the nonnegativity pruning algorithm is selected by the user as it is in the default M 4 application, this metric is evaluated after any possible ensemble member removal has taken place

• The forecast distribution, expressed as categorical bin counts where the bands are segmented according to the prediction intervals considered (between the minimum and the 90% exceedance flow, FD 100-90; between the 90% and 70% exceedance flows, FD 90-70; between the 70% exceedance flow and the best estimate, FD 70-BE; between the best estimate and the 30% exceedance flow, FD BE-30; between the 30% and 10% exceedance flows, FD 30-10; and between the 10% exceedance flow and the maximum, FD 10-0); the expected frequencies are the number of samples, N , multiplied by 0.1, 0.2, 0.2, 0.2, 0.2, and 0.1 respectively; these statistics do not seem particularly stable for the small sample sizes often encountered in canonical WSF use-cases and should not be accorded a great deal of meaning, but are provided as additional information nevertheless

              • The maximum absolute error, the maximum relative (percentage) error, and the years (or sample indices) in which they occur; 0 is a perfect value for both; note that the years during which the maximum absolute and relative errors occur are not necessarily the same.
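            • To make these definitions concrete, the following minimal R sketch computes the deterministic measures above for a pair of observed and predicted series; obs and pred are hypothetical vectors, not M 4 identifiers, and for the cross-validated versions pred would hold the out-of-sample predictions:

                # Minimal sketch of the core deterministic fit measures described above.
                obs  <- c(12, 45, 30, 8, 60, 25, 18, 40, 55, 22)   # hypothetical observed volumes
                pred <- c(15, 40, 28, 10, 52, 27, 20, 38, 50, 25)  # hypothetical predicted volumes
                r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)  # coefficient of determination
                rmse <- sqrt(mean((obs - pred)^2))                          # root mean square error
                # Tercile hit rates: fraction of low/middle/high observed years categorized correctly.
                breaks  <- quantile(obs, probs = c(1/3, 2/3))
                tercile <- function(x) findInterval(x, breaks) + 1          # 1 = low, 2 = middle, 3 = high
                hr      <- tapply(tercile(obs) == tercile(pred), tercile(obs), mean)  # HR-L, HR-M, HR-H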


          • Augmented performance metrics:

            • For each of the models, and following the naming convention described in Section 4.2.3.1, these files give both the in-sample and cross-validated (out-of-sample) values for a key subset of the metrics described above.

            • The files are:

              • ensemble_AdditionalReportingSuite.csv

              • PCANN_BC_AdditionalReportingSuite.csv

              • PCANN_AdditionalReportingSuite.csv

              • PCMCQRNN_AdditionalReportingSuite.csv

              • PCQR_AdditionalReportingSuite.csv

              • PCR_BC_AdditionalReportingSuite.csv

              • PCR_AdditionalReportingSuite.csv

              • PCRF_AdditionalReportingSuite.csv

              • PCRF_BC_AdditionalReportingSuite.csv

              • PCSVM_AdditionalReportingSuite.csv

              • PCSVM_BC_AdditionalReportingSuite.csv

            • The measures considered are R 2 , RMSE, and the tercile hit rates (HR-L, HR-M, HR-H) (see above for descriptions).


          • Model predictions:

            • For each of the models, and following the naming convention described in Section 4.2.3.1, these files give the forecast distributions for each sample time over the training period for each of the constituent modeling systems. Following typical practice, in-sample values are provided, which speak primarily to the ability of the model to reproduce historical data and can be helpful for diagnosing and understanding model dynamics. For measures of predictive performance, focus instead on the cross-validated (out-of-sample) performance statistics mentioned above. Out-of-sample predictions can also be found in some of the files listed under “Additional Information” below.

            • The files are:

              • ensemble_predictions.csv

              • PCANN_predictions.csv

              • PCANN_predictions_BC.csv

              • PCMCQRNN_predictions.csv

              • PCQR_predictions.csv

              • PCR_predictions.csv

              • PCR_predictions_BC.csv

              • PCRF_predictions.csv

              • PCRF_predictions_BC.csv

              • PCSVM_predictions.csv

              • PCSVM_predictions_BC.csv

            • The first column is a sequential index, the second column is the user-specified sample index (assumed to be the year), the third column is the observed predictand (flow volume), the fourth column gives the best estimate of the predictand, and the next four columns give the prediction intervals as the 90%, 70%, 30%, and 10% exceedance flow volumes (i.e., the 10th, 30th, 70th, and 90th percentiles of the error distribution, respectively).

            • An additional file compiles the best estimates for each of the models into a single table:

              • BestEstimatesAllModels.csv
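            • As an illustration of this layout, the following R sketch reads one of these files and computes best-estimate residuals and the empirical coverage of the outer prediction band. This is a sketch only; it assumes a header row, and the column names assigned here are for readability rather than the actual file headers:

                # Minimal sketch for inspecting a build-phase predictions file.
                p <- read.csv("ensemble_predictions.csv")
                colnames(p) <- c("index", "year", "observed", "best_estimate",
                                 "exc90", "exc70", "exc30", "exc10")  # assumed 8-column layout
                resid    <- p$observed - p$best_estimate              # best-estimate residuals
                coverage <- mean(p$observed >= p$exc90 & p$observed <= p$exc10)
                coverage                                # share of years inside the 90%-10% band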


          • Principal component analysis (PCA) results:

            • For each of the models, and following the naming convention described in Section 4.2.3.1, these files give outputs from the PCA data preprocessing step.

            • If the user specified that genetic algorithm-based feature optimization was to be used, the PCA results will in general be different for each of the six constituent modeling systems, as each is independently optimized and will typically retain different input variables from the candidate pool, leading to different PCA results.

            • If feature optimization was not used, then the same data were used for all the models, and the PCA results will be identical between the constituent modeling systems, and many of the files below will therefore have identical content.

            • The files are:

              • PCANN_eigenspectrum.csv

              • PCANN_eigenvector.csv

              • PCANN_MeanOfEachRetainedInputVariate.csv

              • PCANN_PCtimeseries.csv

              • PCANN_StdevOfEachRetainedInputVariate.csv

              • PCMCQRNN_eigenspectrum.csv

              • PCMCQRNN_eigenvector.csv

              • PCMCQRNN_MeanOfEachRetainedInputVariate.csv

              • PCMCQRNN_PCtimeseries.csv

              • PCMCQRNN_StdevOfEachRetainedInputVariate.csv

              • PCQR_eigenspectrum.csv

              • PCQR_eigenvector.csv

              • PCQR_MeanOfEachRetainedInputVariate.csv

              • PCQR_PCtimeseries.csv

              • PCQR_StdevOfEachRetainedInputVariate.csv

              • PCR_eigenspectrum.csv

              • PCR_eigenvector.csv

              • PCR_MeanOfEachRetainedInputVariate.csv

              • PCR_PCtimeseries.csv

              • PCR_StdevOfEachRetainedInputVariate.csv

              • PCRF_eigenspectrum.csv

              • PCRF_eigenvector.csv

              • PCRF_MeanOfEachRetainedInputVariate.csv

              • PCRF_PCtimeseries.csv

              • PCRF_StdevOfEachRetainedInputVariate.csv

              • PCSVM_eigenspectrum.csv

              • PCSVM_eigenvector.csv

              • PCSVM_MeanOfEachRetainedInputVariate.csv

              • PCSVM_PCtimeseries.csv

              • PCSVM_StdevOfEachRetainedInputVariate.csv

            • As the foregoing list shows, there is a set of five files for the final retained predictor set (if the genetic algorithm is used) or for the full set of predictors in the input data file (if the genetic algorithm is not used) for each model:

              • One file (modelabbreviation_eigenspectrum.csv) gives the eigenvalues for each mode, expressed in terms of the percentage of dataset variance explained by each PCA mode

              • One file (modelabbreviation_eigenvector.csv) gives the eigenvectors (loading patterns) expressed as a matrix: each column contains an eigenvector, with the leading mode in the left-most column, and each row corresponds to a different variable, sorted in the same order as in the input data file (but omitting variables not retained if the genetic algorithm was invoked).

              • One file (modelabbreviation_PCtimeseries.csv) gives the principal component time series (scores) expressed as a matrix, where the rows correspond to PCA modes (1 is the leading mode, etc.) and the columns correspond to sequential sampling times

              • One file (modelabbreviation_MeanOfEachRetainedInputVariate.csv) captures the mean of each of the retained input variables, as calculated during the process of standardizing each of those variables to zero mean and unit variance, such that PCA is performed on the correlation rather than covariance matrix.

              • One file (modelabbreviation_StdevOfEachRetainedInputVariate.csv) captures the standard deviation of each of the retained input variables, as calculated during the process of standardizing each of those variables to zero mean and unit variance, such that PCA is performed on the correlation rather than covariance matrix.

            • The primary purpose of these PCA output files is to store information that will be used later in a forecasting run of M 4 .

            • They also can provide valuable diagnostic information for intermediate- to advanced- level users of M 4 and, in particular, support explainable machine learning in the context of WSF (see Appendices).
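            • To illustrate how a forecasting run later uses this stored information, the following R sketch standardizes a new predictor vector with the saved means and standard deviations and projects it onto the saved eigenvectors to obtain PC scores. File-parsing details, such as headers and column layout, are assumptions to be checked against files from an actual run:

                # Minimal sketch: project a new predictor vector onto the stored PCA basis.
                mu <- as.numeric(read.csv("PCR_MeanOfEachRetainedInputVariate.csv")[, 1])
                sg <- as.numeric(read.csv("PCR_StdevOfEachRetainedInputVariate.csv")[, 1])
                E  <- as.matrix(read.csv("PCR_eigenvector.csv"))  # columns = eigenvectors
                x  <- c(10.2, 8.7, 15.1)       # hypothetical new predictor values (length must match mu)
                z  <- (x - mu) / sg            # standardize to zero mean and unit variance
                pcs <- as.numeric(t(E) %*% z)  # PC scores, leading mode first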


          • Models:

            • For each of the models, and following the naming convention described in Section 4.2.3.1, these files define the final supervised learning models for each of the modeling systems within the current version of M 4 .

            • The files are:

              • PCANNmodel.Rdata

              • PCMCQRNNmodel.Rdata

              • PCRFmodel.Rdata

              • PCRmodel.Rdata

              • PCSVMmodel.Rdata

              • PCQRmodel.Rdata

              • PCQR_model_summary.txt

            • As noted in Section 4.2.1, the quantile regression module provides a slightly different output format than the other five constituent modeling systems within the current version. See also comments in the R file containing the quantile regression function.
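            • These saved models can be inspected directly in R. A minimal sketch follows; the names of the objects stored inside each .Rdata file are not documented here, so they are discovered with ls():

                # Minimal sketch: load a saved model file and inspect its contents.
                e <- new.env()
                load("PCRmodel.Rdata", envir = e)  # load into a clean environment
                ls(e)                              # names of the stored object(s)
                str(e[[ls(e)[1]]])                 # structure of the first stored object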


          • Additional information:

            • Additional files providing information of potential interest to users are:

              • AutomatedEnsembleMemberExclusions.csv shows which, if any, of the constituent modeling systems were excluded from the final ensemble by the pruning algorithm, if used (see run control flags around automatic ensemble generation above), for contributing to a negative-valued ensemble mean prediction interval at any sample time during the training period (see above and appendices for more details); for default applications, this information is also used by M 4 during a subsequent forecast run.

              • BoxCoxPredictionIntervalsSwitches.csv shows which, if any, of PCR_BC, PCSVM_BC, PCRF_BC, and/or PCANN_BC have had their prediction intervals replaced by those from PCR, PCSVM, PCRF, and/or PCANN, respectively. This occurs in rare cases, on a model-by-model basis, when the Inf check algorithm has been tripped for that model; see the description of AutoEnsembleFlag in Section 4.2.2 above. Given that the best estimates are always the same between PCR and PCR_BC, PCRF and PCRF_BC, PCSVM and PCSVM_BC, and PCANN and PCANN_BC, in such cases PCR_BC, PCSVM_BC, PCRF_BC, and/or PCANN_BC are made fully identical to PCR, PCSVM, PCRF, and/or PCANN. Of course, these changes, on the rare occasions they occur, are also reflected in the calculation of the multi-model ensemble mean forecast distribution.

              • residuals_all_models.csv gives the training-period in-sample and cross-validated (LOOCV) residuals for the best-estimate predictions from each constituent modeling system and from the ensemble mean.

              • error_message_log.txt logs output messages, if the option to write it is turned on by the user in the run control file (as in the default case; see above); this should in general be briefly consulted after a M 4 run is complete; most warnings and other information can usually be ignored, but in the event of a crash the errors logged here can be useful for diagnosis, particularly for experienced R and RStudio users.

              • The files PCANN_AdditionalDataForPostProcessing.csv, PCANN_BC_AdditionalDataForPostProcessing.csv, PCMCQRNN_AdditionalDataForPostProcessing.csv, PCQR_AdditionalDataForPostProcessing.csv, PCR_AdditionalDataForPostProcessing.csv, PCR_BC_AdditionalDataForPostProcessing.csv, PCRF_AdditionalDataForPostProcessing.csv, PCRF_BC_AdditionalDataForPostProcessing.csv, PCSVM_AdditionalDataForPostProcessing.csv, and PCSVM_BC_AdditionalDataForPostProcessing.csv each provide information that is used by the Persephone UI (not included in this M 4 distribution) and may not be of much interest to most users, but each also includes both the in-sample and out-of-sample best-estimate predictions and residuals for the corresponding constituent modeling system.

            • Additional files providing information that in general may be of little interest to users but is required in subsequent M 4 forward runs using the built models:

              • AutomaticallyOptimizableHyperparameters.csv records some hyperparameter tuning results.

              • BoxCoxInfo.csv records some automated Box-Cox transform parameter estimation results.

              • MonotonicityConstraintRequiredAdjustments.csv records some AutoML outcomes for the neural networks.

              • QcritandN.csv records information about the terciles of the observed predictand over the training period, used for some statistical performance diagnostics, and the number of samples used in the build phase.

        3. Additional output files created by build-step M 4 runs if automated feature optimization is used

          The following are additional output files created by M 4 during model-building runs (that is, if RunTypeFlag = “BUILD”) only if the genetic algorithm is switched on by the user in the run control file, that is, for GeneticAlgorithmFlag = “Y”.

          • Genetic algorithm status reports piped to run logs:

            • For each of the models, and following the naming convention described in Section 4.2.3.1, these files give intermediate outputs from the genetic algorithm as it evolves toward an optimal solution.

            • This information is not used in subsequent forecast operations and will usually be of limited interest to the user, but there may be cases where a user may find it instructive to review the run logs to determine whether a set of features (a combination of retained predictor variates and the principal component time series derived from that retained set of candidates) other than the final optimized feature set might provide similar performance but offer other more desirable characteristics (e.g., retaining more predictors for better operational redundancy), in accordance with the hydrologist’s professional judgement.

            • The files are:

              • GA_RunLog_PCANN.txt

              • GA_RunLog_PCMCQRNN.txt

              • GA_RunLog_PCQR.txt

              • GA_RunLog_PCR.txt

              • GA_RunLog_PCRF.txt

              • GA_RunLog_PCSVM.txt.

          • Final genetic algorithm results:

            • For each of the models, and following the naming convention described in Section 4.2.3.1, these files give the final results from the genetic algorithm and describe which input variables (and, if MaxModes is set by the user to > 1 in the run control file, which PCA modes) are retained in each of the six final modeling systems.

            • Each file is small. It records some of the genetic algorithm parameters set by the user in the run control file, or which are hard-wired in the M 4 code but accessible to expert users. The most important portion is the final line, which provides key information for the user when setting up the run control file for subsequent forecasts using the final build-phase models. This information is encoded in the string of binary digits (that is, the string of 0s and 1s) following “Best Solution : ”.

            • The first M binary digits in this string indicate which of the input variables in the candidate pool supplied by the user to M 4 in MMPEInputData_ModelBuildingMode.txt are retained in the final model. For a detailed explanation, see the description given above in Section 4.2.2 for how to set VariableSelection_Frwrd_LR in MMPE_RunControlFile.txt. To recap, reading this string from left to right, 1 means the input predictor variable is kept and 0 means it is discarded, where the order is identical to that of the input predictor variables as listed left to right in MMPEInputData_ModelBuildingMode.txt. Think of the string as a template placed over the headers in MMPEInputData_ModelBuildingMode.txt: 1 means keep that variable, 0 means discard it.

            • If MaxModes is set by the user to 1 in the run control file, the binary string is only M digits long.

            • If MaxModes is set by the user to Q, where 2 ≤ Q ≤ 4, in the run control file, there will be Q−1 additional digits in the binary string after the first M digits, which encode which of the PCA modes (beyond the first mode, which is always retained and therefore not explicitly encoded here) are retained. For a detailed explanation, see the description given above in Section 4.2.2 for how to set PCSelection_Frwrd_LR in MMPE_RunControlFile.txt. To summarize, the (M+1)th through (M+Q−1)th digits in the binary string encode whether the 2nd through Qth modes, respectively, are retained.

            • The files are:

              • GA_RunSummary_PCANN.txt

              • GA_RunSummary_PCMCQRNN.txt

              • GA_RunSummary_PCQR.txt

              • GA_RunSummary_PCR.txt

              • GA_RunSummary_PCRF.txt

              • GA_RunSummary_PCSVM.txt.

        4. Unsaved graphical outputs

          • M 4 generates a large number of plots during the build phase. They are intended primarily for diagnostic purposes and are not written to file. While users are encouraged to explore these graphics, some of their outputs are complex or come with caveats, and the output information that most users will want is contained in the .csv and .txt files written by M 4 .

            • A partial exception is the multi-panel diagnostics plot; one is produced for each of the constituent modeling systems, but the plot for the final multi-model ensemble mean prediction will generally be of most interest to users. An example, for an application of M 4 to the Gila River, is described panel by panel below.

            • The top panel gives historical in-sample performance of M 4 during training. The red dots connected by a thick red line are the observations, the thick black line is the M 4 ensemble best-estimate prediction, and the gray lines are the prediction intervals (dashed gray lines give estimated 30% and 70% exceedance flows, and solid gray lines give estimated 10% and 90% exceedance flows). A thin red horizontal line indicates zero flow for reference. This multi-model ensemble mean forecast distribution was generated using the default model combination (linear quantile regression and a monotone composite quantile regression neural network with prediction intervals intrinsic to those quantile regression methods; and linear regression, a monotone artificial neural network, random forests, and a support vector machine, each individually using prediction intervals generated in Box-Cox transform space on the basis of the out-of-sample root mean square errors of each model). The nonnegativity check algorithm did not require removal of any ensemble members. Note that the prediction intervals around the ensemble mean prediction, which are the means of the prediction intervals from the six modeling systems, are asymmetric around the best estimate and vary in width from year to year with wider intervals during high-flow years, suggesting heteroscedastic and non-Gaussian error distributions; this is corroborated by the results in the next two rows of the diagnostics plot (see immediately below).

            • The second row down gives scatter plots of observed vs. predicted flows and of model error as a function of the predicted flow. The diagonal line in the lefthand plot gives the 1:1 ratio that a perfect prediction would follow. These two plots also provide qualitative indications around whether heteroscedasticity is present. In this example, the error variance is obviously larger when flows are high; this nonstationarity in the variance is not unusual in many environmental systems.

            • The third row down gives information around the distributional form of the residuals, including a quantile-quantile plot (a normal distribution would follow the diagonal line), a Shapiro-Wilk test on the normality of the error distribution (p-value < 0.05 to 0.10 or so is typically taken to indicate that the null hypothesis of a normal distribution can be rejected), and a simple histogram. The error distribution is, in this example, clearly non-Gaussian.

            • The fourth row down gives information about linear autocorrelation in the residuals at lags up to 10 (in the canonical WSF use-case this corresponds to a lag of 10 years given the use of annual time series; see example dataset given in Section 4.2.2 above and the accompanying M 4 application examples), where the blue lines indicate rejection of the null hypothesis of no serial dependence at p = 0.05. Both the autocorrelation and the partial autocorrelation (which can be helpful for localizing autocorrelation to a particular lag) functions are provided. There is little evidence in this particular example for any substantial autocorrelation in model error.

            • The bottom panels give some information on categorical and distributional forecast performance. On the left, the ability of the model to predict above-normal, normal, or below-normal conditions, where the category cutoffs are defined as terciles of the observed predictand, is depicted. A perfect hit rate is 1 (dashed blue line) and the simple climatology forecast is 1/3 (dashed red line), so a result above the red line indicates the presence of prediction skill. The right-most panel breaks up the forecast distribution according to the prediction intervals. Expected performance is shown by the black bars, whereas actual performance is shown by the gray bars; note, however, that these statistics do not seem to be particularly stable for the small sample sizes often encountered in canonical WSF use-cases and might not be accorded a great deal of meaning. The information in the bottom panels is also provided in the set of standard performance measures written to a .csv file as described in Section 4.2.3.2.
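            • Because these plots are not saved, a user wishing to revisit the residual diagnostics after a run can regenerate close analogues from residuals_all_models.csv using base R. In the sketch below, the choice of column is hypothetical and should be checked against the file’s actual header:

                # Minimal sketch: regenerate key residual diagnostics from a build run.
                res <- read.csv("residuals_all_models.csv")
                r   <- res[[ncol(res)]]       # hypothetical: last column holds ensemble-mean residuals
                qqnorm(r); qqline(r)          # quantile-quantile plot against a normal distribution
                print(shapiro.test(r))        # Shapiro-Wilk test of the normality null hypothesis
                hist(r, main = "Residuals")   # simple histogram
                acf(r, lag.max = 10)          # autocorrelation function to lag 10
                pacf(r, lag.max = 10)         # partial autocorrelation function to lag 10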


        5. Output files created by forecast-step M 4 runs

          A forecasting run of M 4 creates only three files:

          • ForwardRunMultiModelEnsembleMeanPrediction.csv gives the best-estimate predictions and associated prediction intervals, generated using the finalized models created during the preceding build-phase run of M 4 and the new sample fed to M 4 by the user in MMPEInputData_ForecastingMode.txt, in accordance with the options specified by the user in MMPE_RunControlFile.txt. Recall that the ensemble mean will in general use only a subset of the available models; if AutoEnsembleFlag is “Y” then these will be the six default models, potentially pruned by the automated nonnegativity check algorithm to help ensure nonnegativity of the runoff prediction, as described above. Note also that if the Inf check algorithm was triggered in training, then the prediction intervals in forecasts generated using PCR_BC, PCRF_BC, PCSVM_BC, and/or PCANN_BC are replaced by those in PCR, PCRF, PCSVM, and/or PCANN, respectively (refer back to the description of BoxCoxPredictionIntervalsSwitches.csv in Section 4.2.3.2 and the description of AutoEnsembleFlag in Section 4.2.2), and this information is propagated into the multi-model ensemble mean prediction.

          • ForwardRunAllModelsPredictions.csv gives the best-estimate predictions and associated prediction intervals for each of the 10 available constituent modeling systems (6 models, plus both conventional and Box-Cox transform space-based prediction intervals for random forests, the support vector machine, linear regression, and the monotone artificial neural network, as described above), generated using the finalized models created during the preceding build-phase run of M 4 and the new sample fed to M 4 by the user in MMPEInputData_ForecastingMode.txt, in accordance with the options specified by the user in MMPE_RunControlFile.txt. Note that if the Inf check algorithm was triggered in training, then the prediction intervals in forecasts generated using PCR_BC, PCRF_BC, PCSVM_BC, and/or PCANN_BC are replaced by those in PCR, PCRF, PCSVM, and/or PCANN, respectively (refer back to the description of BoxCoxPredictionIntervalsSwitches.csv in Section 4.2.3.2 and the description of AutoEnsembleFlag in Section 4.2.2).

          • error_message_log.txt, as in the build phase described above; this overwrites the file of the same name created during the build phase.

          • No graphics are produced by M 4 in forecast-step runs.


    3. Use instructions

      M 4 use instructions are provided below and divided into three groups by user type: basic (Section 4.3.1), intermediate (Section 4.3.2), and expert (Section 4.3.3). These instructions necessarily assume that the user has read and understood the preceding sections of this manual and the background provided in the Appendices. Detailed examples distributed in the repo are described below in Section 5.

      1. Basic instructions

        The basic instructions provided in this section correspond to a simple, default approach for applying M 4 to the NRCS WSF canonical use-case (see above). This approach is intended for users whose primary area of interest and expertise is hydrology and, in particular, operational water supply forecasting in the American West, but who may have limited knowledge of machine learning or the other technical methods deployed in M 4 .

        This mode of application relies heavily on the AutoML capabilities of M 4 . Its capabilities as a WSF approach should not be underestimated; based on retrospective and live operational testing conducted so far, this standardized approach appears to consistently out-perform current WSF methods in common use at governmental service-delivery organizations in the US West, and it may be anticipated to yield reasonable results for the WSF canonical use-case while maintaining reasonable run times and geophysical explainability of results. That said, we emphasize again (see Disclaimer above) that M 4 remains a prototype. Additionally, the basic instructions may not provide the best performance potentially available from M 4 in a given application; ideally, they should be viewed as a starting point for learning more about M 4 , the details of the techniques it deploys, and how to use those methods to maximum advantage.

        1. Building a M 4 model

          To apply M 4 to some particular forecast problem (i.e., SFP; see Section 3.2.1), the user begins by building a model (see above). To do this, the user needs to modify, in general, only two input text files as described in detail in Section 4.2.2 above:

          • The build-phase input data file, MMPEInputData_ModelBuildingMode.txt. Examples are provided in the M 4 distribution; one of these, for the Gila River, is shown and described in detail in Section 4.2.2 above. An easy way to prepare this file is to compile the datasets you want to use in a spreadsheet program such as Excel, using the format shown in the example, and then save the result as a tab-delimited text file.

          • The run-control file, MMPE_RunControlFile.txt. For the basic-level WSF user, almost nothing needs to be set in the run control file for the build-phase run. The values and flags as set in the run control file provided in the installation package, and as shown above in Section 4.2.2, are defaults which have been tested for general applicability to the NRCS WSF canonical use-case at a prototype level and do not need to be changed (for basic application) from one SFP to the next. The user need only ensure that, near the top of the file, RunTypeFlag is set to “BUILD”. This can be done in any text editor, for example. Note also that, as mentioned above, the values and flags in the third and final section of the run control file, where the user provides information about how they would like a forecast run to proceed, must be provided during the build phase but their actual values are irrelevant.

            Note that (as mentioned previously) these filenames must not be changed.

            To run M 4 in RStudio in a Windows environment, simply navigate in Windows Explorer to the subdirectory in which M 4 was installed, double-click on the main program file (MMPE-Main_MkII.R) to open it in RStudio, place the cursor in the upper-left RStudio subwindow which contains the main program script, hit Ctrl-A to highlight the script, and press the Run button at the top right of that window. As noted previously, for the default configuration as applied to a canonical western US statistical WSF use-case as described above, a build-mode run will in general take about 15-35 minutes or so, depending on the computer and other factors. (This run time can be greatly reduced by modifying the genetic algorithm parameters or using manual feature selection instead of the genetic algorithm, but doing so falls into the intermediate-level user instructions provided in Section 4.3.2 below.)
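            Equivalently, and assuming the working directory has been set to the M 4 installation subdirectory, an experienced R user can launch the same end-to-end run from the R console rather than through the RStudio Run button. This is a sketch; the path shown is hypothetical:

                # Run the main M4 program from the R console.
                setwd("C:/path/to/M4")                   # hypothetical installation folder
                source("MMPE-Main_MkII.R", echo = TRUE)  # executes the full script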

        2. Using the finished model to make a forecast

          To take the M 4 model suite built in accordance with Section 4.3.1.1 and run it with a new sample to make a forecast as per the usual M 4 workflow (see background provided earlier in this manual), the user again needs to modify, in general, only two input text files as described in detail in Section 4.2.2 above:

          • The forecasting-phase input data file, MMPEInputData_ForecastingMode.txt. Examples are provided in the M 4 distribution; one of these, for the Gila River, is shown and described in detail in Section 4.2.2 above.

          • The run-control file, MMPE_RunControlFile.txt. The basic-level WSF user should keep most of the values and flags at the defaults provided in the run control file included in the installation package, and as shown above in Section 4.2.2, which have been tested for general applicability to the NRCS WSF canonical use-case at a prototype level as discussed above and in the Appendices. However, the user needs to make three sets of modifications, which can be done in any text editor:

            • Near the top of the file, RunTypeFlag must be set to “FORECAST”

            • In the “RUN PARAMETERS FOR FORECASTING” section near the bottom of the file, the user must specify:

              • Which PCA modes to retain for each of the models, i.e., the user must provide values for PCSelection_Frwrd_LR, PCSelection_Frwrd_QR, PCSelection_Frwrd_mANN, PCSelection_Frwrd_MCQRNN, PCSelection_Frwrd_RF, and PCSelection_Frwrd_SVM, as described in the detailed summary of MMPE_RunControlFile.txt provided in Section 4.2.2 above, with further elaboration in Section 4.2.3.3

              • Which predictor variables to retain for each of the models, i.e., the user must provide values for VariableSelection_Frwrd_LR, VariableSelection_Frwrd_QR, VariableSelection_Frwrd_mANN, VariableSelection_Frwrd_MCQRNN, VariableSelection_Frwrd_RF, and VariableSelection_Frwrd_SVM, as described in the detailed summary of MMPE_RunControlFile.txt provided in Section 4.2.2 above, with further elaboration in Section 4.2.3.3.

                Let’s look at an example of how to set the values of PCSelection_Frwrd_MCQRNN and VariableSelection_Frwrd_MCQRNN for a forecast run. As explained previously, this information is found in the genetic algorithm’s final output file for the monotone composite quantile regression neural network, GA_RunSummary_PCMCQRNN.txt, written by M 4 during the build-phase run. Say that the file looks like this for the Gila River application considered earlier in this manual:

                GA Settings

                Type = binary chromosome

                Population size = 25

                Number of Generations = 10

                Elitism = TRUE

                Mutation Chance = 0.05

                Search Domain Var 1 = [,]

                Var 0 = [,]

                GA Results

                Best Solution : 0 1 0 1 0 1 0

                Focus on the string of binary digits in the last line, which encodes the genetic algorithm’s conclusions about how to optimize the features for the monotone composite quantile regression neural network. The user provided M = 6 candidate predictors in MMPEInputData_ModelBuildingMode.txt during the build step, so the first 6 binary digits in this string, going from left to right, tell us whether to keep (1) or discard (0) the six predictors that were listed, again from left to right, in MMPEInputData_ModelBuildingMode.txt. All we have to do is copy-and-paste those first six digits into the slot for VariableSelection_Frwrd_MCQRNN in MMPE_RunControlFile.txt:

                "0 1 0 1 0 1" # VariableSelection_Frwrd_MCQRNN - For MCQRNN

                With the default values used in the basic-level use instructions, MaxModes = 2. Recalling from earlier in this manual that the leading PCA mode is always retained, we need only look at the (M+1)th (here, seventh) digit in the binary string and see whether it’s a 1 (keep the second PCA mode) or a 0 (don’t keep the second PCA mode). In this particular example, it’s 0, so only the leading mode is retained, and we enter the following in the slot for PCSelection_Frwrd_MCQRNN in MMPE_RunControlFile.txt:

                1 # PCSelection_Frwrd_MCQRNN - For PCMCQRNN

                If that seventh digit in the binary string had happened to be 1 instead, we would’ve entered this in the slot for PCSelection_Frwrd_MCQRNN in MMPE_RunControlFile.txt:

                1,2 # PCSelection_Frwrd_MCQRNN - For PCMCQRNN

                The same process needs to be done for the other five modeling systems in M 4 , drawing information from their corresponding build-phase genetic algorithm run summary files.

                As in the build run, the filenames must not be changed. The process for running M 4 in RStudio is described in Section 4.3.1.2 above and is identical for the build and forecast runs. For the default configuration as applied to a canonical western US statistical WSF use-case as described above in this manual, a forecast-mode run will in general take < 1 minute.

                A quick and easy check on whether MMPE_RunControlFile.txt and MMPEInputData_ForecastingMode.txt have been set up correctly for the forecast run is as follows:

          • Create a test version of MMPEInputData_ForecastingMode.txt that contains the input data from one of the years during the training period, i.e., copying-and-pasting the predictor data from any selected line of MMPEInputData_ModelBuildingMode.txt.

          • Run M 4 in forecast mode.

          • Check that the forecast distributions from each of the constituent modeling systems in M 4 (i.e., as listed in the ForwardRunAllModelsPredictions.csv forecast-mode output file) match the hindcast predictions for that same year for those same models in the M 4 build-phase outputs (i.e., as collectively listed in ensemble_predictions.csv, PCANN_predictions_BC.csv, PCMCQRNN_predictions.csv, PCQR_predictions.csv, PCR_predictions_BC.csv, PCRF_predictions_BC.csv, and PCSVM_predictions_BC.csv for basic-level users employing the default automatic ensemble composition).
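          The final comparison can be done by eye or scripted. A minimal R sketch follows; the column positions used are those described in Section 4.2.3.2 for the build-phase predictions files and are assumed, for illustration only, to apply to the forecast-mode output as well:

              # Minimal sketch: compare a forecast-mode run against the hindcast for a training year.
              hind  <- read.csv("ensemble_predictions.csv")
              fwd   <- read.csv("ForwardRunMultiModelEnsembleMeanPrediction.csv")
              yr    <- 2015                                # hypothetical year from the training period
              row_h <- unlist(hind[hind[[2]] == yr, 4:8])  # best estimate plus the four bounds
              row_f <- unlist(fwd[1, 4:8])                 # assumed matching layout in forecast output
              all.equal(as.numeric(row_h), as.numeric(row_f), tolerance = 1e-6)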

        3. Other considerations

          Both the build and forecast phase must take place in the same subdirectory, because forecast runs need to read information from the output files created by M 4 during the build step.

          Note that two identical training runs of M 4 , using exactly the same dataset and run controls, will in general lead to slightly different results. This may be unfamiliar to users accustomed to linear statistical WSF models, though it is common in automated parameter estimation for process-based hydrologic models. In M 4 , this behavior has two main sources. One is the genetic algorithm, which is a stochastic global optimization approach. Without going into detail on the topic (there is plenty of literature on the subject elsewhere), it uses both a gradual march toward better selections of retained input variables and PCA modes, based on how the last iteration of the algorithm did, and random jumps to help ensure that it finds the true optimum across all possible combinations of candidate input variables and PCA modes without getting stuck in local minima in the hyperdimensional error surface; this can be important for nonlinear, noisy, and high-dimensional problems. Because of the randomized portion of the process, the final result may not be identical from run to run. The other main source of run-to-run variability in build-phase outcomes is that portions of the training process for the various constituent machine learning algorithms also have a stochastic component; for example, initial guesses for neural network weights and biases are typically randomized. Overall, the variation in the final ensemble mean M 4 WSF from one build run to the next is, in general, slight and of little practical consequence; indeed, it will not be noticed unless multiple identical M 4 training runs are performed, which would likely only occur if some other issue with the training run was encountered. Multi-model ensembles also tend to damp out the practical impacts of these effects on the final predictions. Steps like increasing the population size and number of generations in the genetic algorithm, or increasing the number of bootstraps in the neural networks, can help mitigate the issue if it is felt to be a problem, and will typically lead to incremental prediction improvements more generally; but as discussed previously, this comes at the expense of increased computational time.

          Note that this run-to-run variation does not occur in forecast runs using a given finalized model. That is, taking a M 4 model suite developed as per Section 4.3.1.1 and running it in forecast mode as per Section 4.3.1.2 over and over again using the same forecast-phase input data will yield exactly the same forecast each time (though of course there is no reason to actually do this in practice).
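          If exact build-run repeatability is ever needed (for debugging, say), one possibility, offered here as an untested sketch that assumes all stochastic steps in the genetic algorithm and model training draw from R’s global random number generator, is to fix the seed before launching the main script:

              # Hypothetical reproducibility aid: fix the global RNG state before a build run.
              set.seed(20230501)          # arbitrary seed value
              source("MMPE-Main_MkII.R")  # only effective if all stochastic steps use the global RNG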

          Finally, we note that the simple, quick, but nitpicky procedure of manually taking genetic algorithm results from the build-step output files and entering them into the run control file for forecasting runs could be automated with some additional M 4 code fairly easily. It was left this way, though, because it was known that M 4 would ultimately be integrated into an operational platform and user interface, in which that functionality could be implemented alongside additional UI-driven features like options for post-hoc user-adjusted ensemble compositions.
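          As an indication of how little code such automation would take, the following R sketch parses the Best Solution line of a GA run summary into the two selection strings needed by the run control file. It is a sketch only, assuming the file format shown in Section 4.3.1.2 and M = 6 candidate predictors:

              # Minimal sketch: convert a GA run summary into forecast-run selection strings.
              M    <- 6                                       # number of candidate predictors
              txt  <- readLines("GA_RunSummary_PCMCQRNN.txt")
              best <- txt[grepl("Best Solution", txt)]
              bits <- unlist(strsplit(sub(".*Best Solution\\s*:\\s*", "", best), "\\s+"))
              vars <- paste(bits[1:M], collapse = " ")        # e.g., "0 1 0 1 0 1"
              modes <- c(1, which(bits[-(1:M)] == "1") + 1)   # leading mode always retained
              paste(modes, collapse = ",")                    # e.g., "1" or "1,2"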

      2. Intermediate instructions

        Intermediate instructions are intended for users who, in addition to the hydrologic expertise described in Section 4.3.1 above, also have a good understanding of the principles and approaches of the technical methods employed in M 4 , though not necessarily in-depth knowledge of the specific algorithms and R packages used to implement those methods.

        The intermediate-level application involves user selection, based on professional expertise, of a few non-default options in the run control file, MMPE_RunControlFile.txt, during M 4 training. Intermediate- level applications should cover, at a prototype level, most if not all of the practical situations arising in contemporary operational WSF as routinely undertaken by service-delivery organizations in the western US and by applied water resource science and engineering researchers working to improve the methods used by those service-delivery organizations.

        The intermediate-level instructions are identical to the basic-level instructions but accommodate a few additional possibilities.

        1. Modify the genetic algorithm parameters

          As discussed in Section 4.2.2, the larger the population size and the more generations used, the better (roughly speaking, and up to a point) the result of the feature optimization process, but the longer the build run. Experimentation with western US statistical WSF canonical use-cases suggests that a point of diminishing returns is approached relatively quickly. GAPopSize = 25 with GANumGens = 10 appears to be a good compromise and is the recommended default.

          However, other choices can be useful in some cases. If the user is in a rush to develop a M 4 modeling suite for some SFP, GAPopSize = 15 with GANumGens = 7 has generally been found, at a prototype level, to give reasonable results with a shorter run time. Conversely, if the size of the candidate predictor pool is larger, or if build-phase run time is not an issue, it may be a good idea to use a larger genetic algorithm problem size to help ensure that the global optimum is more closely approximated in the final modeling suite. GAPopSize = 100 with GANumGens = 15, for example, can take several hours to complete on an otherwise identical M 4 run that would complete in perhaps 15-35 minutes with GAPopSize = 25 and GANumGens = 10, or even less with GAPopSize = 15 and GANumGens = 7, but will more reliably obtain a truly optimal feature selection for a canonical WSF use-case scenario; that said, in testing so far the improvement has not been dramatic relative to the default genetic algorithm settings for the canonical WSF use-cases we focus on here.

          Note that GAPopSize and GANumGens can be varied independently of each other by the user, and similarly, can be varied independently of other run control parameters.
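          For illustration, and assuming the value-then-comment layout used for the other run control entries shown in this manual, the recommended defaults would appear in MMPE_RunControlFile.txt along the following lines:

              25 # GAPopSize
              10 # GANumGens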

        2. Retain fewer or more PCA modes

          The default case is MaxModes = 2, that is, the genetic algorithm has to decide whether or not to include the second PCA mode. This is again a compromise solution.

          Physical interpretation is one issue to consider in WSF applications. For canonical use-cases where the main predictive inputs are variables like current SWE and accumulated precipitation, the leading PCA mode effectively constitutes a watershed-scale index of wintertime hydroclimatic conditions, and in practice often captures 80% or more of the total input dataset variance. In a substantial minority of existing NRCS PCR-based WSF models, though, antecedent streamflow is included in the predictor candidate pool to index river flow inputs from natural water storages in a watershed, like soil moisture, groundwater, or wetlands; these geophysical processes are obviously influenced by, but physically distinct from, SWE and precipitation and may appear as a distinct second PCA mode that will in general be retained by a feature optimization routine and has clear physical meaning. Even if such distinct input variables are not included in the candidate pool, more than one PCA mode may be retained in some cases, presumably reflecting second-order spatiotemporal patterns in snowpack or precipitation accumulation across the watershed.

          That said, there may be cases where the hydrologist wishes to consider fewer or more PCA modes.

          The case for fewer modes (that is, MaxModes = 1) is simple: in the canonical use-cases described above the leading mode is usually the strongly dominant and most physically explainable PCA mode; this choice reduces the amount of work that the genetic algorithm has to do; and it also makes the forward run easier to set up in the run control file, as PCSelection_Frwrd_LR, PCSelection_Frwrd_QR, PCSelection_Frwrd_mANN, PCSelection_Frwrd_MCQRNN, PCSelection_Frwrd_RF, and PCSelection_Frwrd_SVM are all simply 1 in this case. The trade-off is a potential loss of some predictive skill, and possibly loss of some physical insight if patterns in the predictors chosen for the candidate pool can be legitimately decomposed by PCA into more than one significant geophysical process.

          Conversely, testing so far suggests it’s not unusual in canonical use-case scenarios to be able to eke out a little more predictive skill by increasing MaxModes from its default value to 3 or 4. The trade-offs for this choice in the canonical use-cases are that it may be difficult to reliably attribute physical meaning to the third or fourth PCA mode; these higher modes typically explain very little of the total input dataset variance; and the scale of the optimization problem may be substantially increased, with potential implications for run times or global optimality of the solution. It also makes setting PCSelection_Frwrd_LR, PCSelection_Frwrd_QR, PCSelection_Frwrd_mANN, PCSelection_Frwrd_MCQRNN, PCSelection_Frwrd_RF, and PCSelection_Frwrd_SVM more complicated (see above detailed instructions, such as in Sections 4.2.2 and 4.2.3.3).

          Note that MaxModes can be varied independently of other run control parameters.

        3. Turn off the genetic algorithm

          It may be useful to turn off automated feature optimization in some cases. The user may wish all variables in MMPEInputData_ModelBuildingMode.txt to be used, perhaps for testing purposes or because the user is confident that all the predictor variables provided in the input data file should be included in the model, perhaps for geophysical or operational redundancy reasons. Another possibility is that very few input variables are available; in this case, the genetic algorithm may not have much to do. Speediness is another factor; a build-phase run that could take roughly 15-35 minutes with feature optimization using default genetic algorithm parameters may run in < 1 minute if feature optimization is switched off.

          To do this, in the run control file, during the build run set GeneticAlgorithmFlag to “N” and ManualPCSelection to whichever PCA modes the user wishes to retain (e.g., 1 or 1,2 or 1,3,4 etc.) (see detailed instructions in Section 4.2.2). Generally speaking, all other flags and values in MMPE_RunControlFile.txt can be left at their default settings.

          For the forecast run, in the run control file, just copy-and-paste the values set by the user for ManualPCSelection during the build run into the slots for which PCA modes to retain in the forward run. For instance, if the user specified retention of the first and second PCA modes in training:

          1,2 # ManualPCSelection

          then use the following in MMPE_RunControlFile.txt for the forecast run:


          1,2 # PCSelection_Frwrd_LR - For PCR

          1,2 # PCSelection_Frwrd_QR - For PCQR

          1,2 # PCSelection_Frwrd_mANN - For PCANN

          1,2 # PCSelection_Frwrd_MCQRNN - For PCMCQRNN

          1,2 # PCSelection_Frwrd_RF - For PCRF

          1,2 # PCSelection_Frwrd_SVM - For PCSVM

          Analogously, for the forecast run, set the input variable selections to a length-M string of 1s. For example, if the MMPEInputData_ModelBuildingMode.txt file supplied by the user for the build run (and therefore the MMPEInputData_ForecastingMode.txt file as well) contains six variables as in the Gila case shown above in this manual, then in MMPE_RunControlFile.txt set these selections to a string of six 1s separated by a single space with quotation marks at the start and end of the string:


          "1 1 1 1 1 1"

          # VariableSelection_Frwrd_LR

          - For

          PCR

          "1 1 1 1 1 1"

          # VariableSelection_Frwrd_QR

          - For

          PCQR

          "1 1 1 1 1 1"

          # VariableSelection_Frwrd_mANN

          - For

          PCANN

          "1 1 1 1 1 1"

          # VariableSelection_Frwrd_MCQRNN

          - For

          MCQRNN

          "1 1 1 1 1 1"

          # VariableSelection_Frwrd_RF

          - For

          PCRF

          "1 1 1 1 1 1"

          # VariableSelection_Frwrd_SVM

          - For

          PCSVM

        4. Single predictor variable

          This is merely a special case of Section 4.3.2.3 in which MMPEInputData_ModelBuildingMode.txt contains only one predictor variable. That also means that only one PCA mode is available. (Of course, no PCA is actually required if only a single variable exists, but because this is an edge case, rather than coding a separate workflow the single predictor is simply routed through the normal PCA workflow.) In this case, during the build phase use the following in MMPE_RunControlFile.txt:

          "N" # GeneticAlgorithmFlag


          and:

          1 # ManualPCSelection


          Then in the forecasting step, use the following:


          1 # PCSelection_Frwrd_LR - For PCR

          1 # PCSelection_Frwrd_QR - For PCQR

          1 # PCSelection_Frwrd_mANN - For PCANN

          1 # PCSelection_Frwrd_MCQRNN - For PCMCQRNN

          1 # PCSelection_Frwrd_RF - For PCRF

          1 # PCSelection_Frwrd_SVM - For PCSVM

          and:




          "1"

          #

          VariableSelection_Frwrd_LR

          - For PCR

          "1"

          #

          VariableSelection_Frwrd_QR

          - For PCQR

          "1"

          #

          VariableSelection_Frwrd_mANN

          - For PCANN

          "1"

          #

          VariableSelection_Frwrd_MCQRNN

          - For MCQRNN

          "1"

          #

          VariableSelection_Frwrd_RF

          - For PCRF

          "1"

          #

          VariableSelection_Frwrd_SVM

          - For PCSVM


      3. Expert instructions

        In addition to hydrologic expertise and a solid understanding of the principles of the techniques deployed in M 4 as described in Section 4.3.2 above, expert users have an in-depth understanding of the M 4 code itself, the various R packages it incorporates, and the underlying theory and applications around the algorithms used by M 4 and those packages.

        As noted above, the basic-level application relies heavily on AutoML, which in M 4 essentially amounts to a combination of custom automation algorithms and carefully selected and tested default hyperparameter choices. This remains largely true of the intermediate-level application as well, which for the most part merely changes some hyperhyperparameters or other predefined options in the run control file.

        In contrast, expert users access various additional capabilities, such as:

        • Modifying the values of additional flags and variables in the run control file beyond the specific cases described in section 4.3.2.

        • Modifying additional hyperparameters and flags within the M 4 calls to the R packages through small edits to the M 4 code. One example would be to go into the custom R function file implementing the monotone artificial neural network (PCANN-Module_MkII.R; see Section 4.2.1) and manually increase the number of hidden-layer neurons in the artificial neural network beyond the possibilities considered in the M 4 AutoML routine designed for this task; this could conceivably be required if larger or more diverse predictor datasets are used relative to the canonical or near-canonical use-cases described above in 4.3.1 and 4.3.2. Indeed, even a light version of deep learning, i.e., a neural network with more than one hidden layer and many neurons in those hidden layers, could be implemented in this way; a hypothetical sketch is given after this list.

        • Modifying the open-source M 4 code itself more extensively; as just one of many possible examples, adding a seventh forecasting model of their choice to the metasystem.
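        As a purely hypothetical illustration of the second bullet above: if the monotone ANN module were built on a package such as monmlp (an assumption; check the header of PCANN-Module_MkII.R for the package actually used), a larger hidden layer could be requested along these lines:

            # Hypothetical sketch only: a monotone ANN with a larger hidden layer via monmlp.
            library(monmlp)
            set.seed(1)
            x <- matrix(rnorm(120), ncol = 2)             # toy standardized predictor matrix
            y <- matrix(x[, 1] + 0.5 * x[, 2]^3 + rnorm(60, sd = 0.1))
            fit  <- monmlp.fit(x, y, hidden1 = 8,         # more hidden-layer neurons than a small default
                               monotone = 1,              # monotone response to the first predictor
                               n.ensemble = 5, bag = TRUE)  # small bootstrap ensemble
            pred <- monmlp.predict(x, weights = fit)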


  5. Files provided in the M 4 distribution


    1. M 4 installation files

      The M 4 code files were described above in Section 4.2.1 and can be installed as per the instructions given in that section.

      One example of each of the three required M 4 input files described above in Section 4.2.2 (MMPE_RunControlFile.txt, MMPEInputData_ModelBuildingMode.txt, and MMPEInputData_ForecastingMode.txt) is also provided along with the code because these need to be in quite specific formats; these examples therefore provide a template for the user to employ when setting up the model for a given SFP. The run control file given here contains default flags and values as described in detail in Section 4.2.2, though any forecast-mode run of M 4 will require the user to set a few values manually, also as described above. For reference, the two input data files happen to correspond to an April forecast of April-July flow volume of the Yellowstone River at Corwin Springs, using 30 years of data for training and 18 predictor variates corresponding to NRCS SNOTEL and snow course sites.


    2. Completed examples

      The distribution also includes four completed examples of how to run M 4 , each in a different folder. Each folder contains the source code, input files, and build-step and forecast-step output files for a different M 4 application. The four examples are described below.

      1. Example 1: Gila River at Gila, March 1 forecast of March-through-May volume, genetic algorithm feature selection with up to 2 PCA modes

        This is an example of the default, basic-level use instructions in Section 4.3.1 above.

        The model-building step followed the workflow described in Section 4.3.1.1: (a) created the input data file, MMPEInputData_ModelBuildingMode.txt, containing historical flow volume data and a pool of candidate predictor variables, which in this example was the Gila River data file shown above in Section 4.2.2; (b) made sure that RunTypeFlag was set to “BUILD” in MMPE_RunControlFile.txt, left everything else in that file at its default values as shown in Section 4.2.2, and saved and closed it; (c) ran M 4 in RStudio. When complete, M 4 wrote the large number of output files listed in Sections 4.2.3.2 and 4.2.3.3.

        The forecasting step followed the workflow described in Section 4.3.1.2: (a) put together a new sample of the predictor set in MMPEInputData_ForecastingMode.txt, which in this example was the Gila River data file shown above in Section 4.2.2; (b) went back into MMPE_RunControlFile.txt, changed RunTypeFlag to “FORECAST”, set the values of PCSelection_Frwrd_LR, PCSelection_Frwrd_QR, PCSelection_Frwrd_mANN, PCSelection_Frwrd_MCQRNN, PCSelection_Frwrd_RF, PCSelection_Frwrd_SVM, VariableSelection_Frwrd_LR, VariableSelection_Frwrd_QR, VariableSelection_Frwrd_mANN, VariableSelection_Frwrd_MCQRNN, VariableSelection_Frwrd_RF, and VariableSelection_Frwrd_SVM, following the instructions provided in Section 4.3.1.2, and saved and closed it again; (c) ran M 4 in RStudio. This was all done in the same folder as the build step, because the forecast step needs to have the build-step output files available to it, as noted previously. When complete, M 4 wrote the output files listed in Section 4.2.3.5.

        Recall from Section 4.3.1.3 that each time M 4 is run in training mode, it gets slightly different models at the end due to stochasticity in the genetic algorithm (e.g., random mutations) and some of the machine learning training processes (e.g., random initial weights and biases in an ANN); refer back to Section 4.3.1.3 above for details. In practice, this means that even if you run M 4 in build mode on exactly the same training dataset with exactly the same run control file as in Example 1, the final result (potentially including which input variables and PCA modes to retain) will in general show some slight differences from the Example 1 output files posted to the GitHub repo. Consequently, if you want to take those new models you just created and use them in a forecast run, you’ll have to follow the steps laid out above; that is, in addition to changing “BUILD” to “FORECAST” in the run control file, you’ll also have to set the PCSelection_Frwrd_(modelname) and VariableSelection_Frwrd_(modelname) values in accordance with the outcomes from your new build-phase M 4 run. Conversely, if you just want to run M 4 in forecast mode using the Example 1 training outputs posted to the GitHub repo, without retraining the model, no changes to the run control file would be needed.

      2. Example 2: Gila River at Gila, March 1 forecast of March-through-May volume, genetic algorithm feature selection with up to 4 PCA modes

        This is an example of the intermediate-level use instructions in Section 4.3.2.2. That is, this is the same problem as in Example 1 and described in Section 5.2.1 above, following exactly the same instructions, except that the genetic algorithm was instructed to retain up to a maximum of four PCA modes by setting MaxModes to 4 in MMPE_RunControlFile.txt during the build phase, instead of the default value of 2. Nothing else was different.

        In practice, one would rarely want to set MaxModes to anything over 2 for the western US canonical statistical WSF use-case, but other situations could conceivably arise where more PCA modes are needed. (The situation where MaxModes is decreased from the default of 2 down to 1 is a much more realistic scenario; see Section 4.3.2.2 above.)

      3. Example 3: Owyhee River at Rome, April 1 forecast of April-through-July volume, genetic algorithm feature selection with up to 2 PCA modes

        This is another example of the default, basic-level use instructions in Section 4.3.1 above. It is identical to the example of Section 5.2.1, but for a different river, and so obviously uses different observed flow and predictor data in MMPEInputData_ModelBuildingMode.txt and MMPEInputData_ForecastingMode.txt.

        In this case, the candidate predictor pool happens to be much larger than for the Gila. This example also demonstrates what the results look like when the nonnegativity check algorithm removes some models as it finalizes the multi-model ensemble (in this case, linear regression and linear quantile regression were removed).
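        To make the screening step concrete, the following R fragment is a conceptual sketch only, not the actual M 4 implementation; the function name screen_nonnegative and the toy hindcast numbers are hypothetical. It simply drops any constituent model whose hindcast volumes go negative, which is the flavor of behavior seen in this example.

            # Conceptual sketch only, not the M4 source code. Drops any constituent
            # model whose hindcast predictions dip below zero before the multi-model
            # ensemble is finalized, and reports which models were removed.
            screen_nonnegative <- function(hindcasts) {
              keep <- vapply(hindcasts, function(x) all(x >= 0, na.rm = TRUE), logical(1))
              removed <- names(hindcasts)[!keep]
              if (length(removed) > 0) {
                message("Removed from ensemble: ", paste(removed, collapse = ", "))
              }
              hindcasts[keep]
            }

            # Toy example: LR and QR produce negative hindcast volumes and are dropped,
            # mirroring the outcome reported for this Owyhee case.
            hindcasts <- list(LR = c(120, -5, 300), QR = c(90, -2, 250),
                              mANN = c(110, 8, 280), RF = c(105, 12, 270))
            ensemble <- screen_nonnegative(hindcasts)   # retains mANN and RF only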

      4. Example 4: Deschutes River below Snow Creek, February 1 forecast of April-through-July volume, genetic algorithm switched off

        This is an example of the intermediate-level use instructions in Section 4.3.2.3 above.

        That is, this example demonstrates the use of manual feature selection, in which the genetic algorithm is turned off by setting GeneticAlgorithmFlag to “N” and (in this particular example) ManualPCSelection to 1,2 such that the leading and second PCA modes are used for all six constituent modeling systems. All predictors in the input data files, MMPEInputData_ModelBuildingMode.txt and MMPEInputData_ForecastingMode.txt, are used in all six constituent modeling systems.
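        Schematically, the two run control settings that define this example are shown below; as in the earlier sketches, the keyword-value layout is illustrative rather than an exact excerpt of MMPE_RunControlFile.txt.

            GeneticAlgorithmFlag N
            ManualPCSelection 1,2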

        This example also demonstrates how predictors other than SWE and precipitation are included in the input data files. In this particular case, antecedent streamflow at a location on the Deschutes River is included as the right-most predictor variable in the input data matrix (it could, of course, be placed anywhere among the predictor variables, not necessarily on the far right-hand side). The example also illustrates an early-season forecast, which in general has substantially lower predictive skill than a forecast issued later in the season. Note, however, that neither the inclusion of antecedent streamflow nor the early forecast date depends on any of the settings in MMPE_RunControlFile.txt, or vice versa; for instance, exactly the same input data file, containing antecedent streamflow, could be used when applying the default, basic-level use instructions described in Section 4.3.1 above.
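        Purely for illustration, a fragment of such an input data matrix might be laid out as follows; the column names and numbers here are hypothetical, and Section 4.2.2, not this sketch, defines the actual required file layout. The point is simply that the antecedent-flow column sits alongside the SWE and precipitation columns as just another candidate predictor.

            Year  TargetVol  SWE_A  SWE_B  Precip_A  AntecedentFlow
            1991  210.5      14.2   18.9   6.1       35.0
            1992  150.2      9.8    12.3   4.4       21.7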

  6. Known issues and bug fixes

Tracking of issues and bug fixes in this document and on the GitHub site begins with the first GitHub distribution of M 4 in March 2023, which consisted of the following files:

MMPE-Main_MkII.R

PCA-Module_v5.R

PCA-Graphics-Module_v3.R

GA-PredictorSelection-Module_v15.R

PCR-Module_v11.R

PCSVM-Module_v6.R

PCRF-Module_v3.R

PCANN-Module_v17.R

PCQR-Module_v7.R

PCMCQRNN_v5.R

InitializeEnsemble-Module_v1.R

AppendEnsemble-Module_v1.R

FinalizeEnsemble-Module_v1.R

Diagnostics-Module_v6.R

Known issues with this version of M 4 are:

The current GitHub repo (as of May 2023) for M 4 consists of the following files:

MMPE-Main_MkII.R

PCA-Module_MkII.R

PCA-Graphics-Module_MkII.R

GA-PredictorSelection-Module_MkII.R

PCR-Module_MkII.R

PCSVM-Module_MkII.R

PCRF-Module_MkII.R

PCANN-Module_MkII.R

PCQR-Module_MkII.R

PCMCQRNN_MkII.R

InitializeEnsemble-Module_MkII.R

AppendEnsemble-Module_MkII.R

FinalizeEnsemble-Module_MkII.R

Diagnostics-Module_MkII.R

Two functionality changes were made relative to the previous version: (1) during training runs, in rare severe cases where a misbehaving Box-Cox transform space-based prediction interval leads to Inf values that would cause subsequent graphing steps to crash, M 4 now automatically overwrites the Box-Cox transform space-based prediction intervals with conventional stationary Gaussian prediction intervals and notifies the user; and (2) the forecast run now produces forecasts and prediction intervals for all 10 models in M 4 . These changes occur in the main program. They cause M 4 to generate an additional training-run output file, BoxCoxPredictionIntervalsSwitches.csv, and they change the forecast-run output files, of which there are now two: ForwardRunMultiModelEnsembleMeanPrediction.csv and ForwardRunAllModelsPredictions.csv. See Sections 4.2.2 and 4.2.3 for further technical background, and Sections 4.2.3.2 and 4.2.3.5 for further details on the output file changes.
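As a conceptual illustration of change (1), the fallback logic amounts to something like the following R fragment. This is emphatically not the M 4 main-program code; the function and argument names are hypothetical, and the 90% interval level is an assumption for illustration only.

    # Conceptual sketch only, not the M4 main program. If a Box-Cox-space prediction
    # interval comes back non-finite, replace it with a conventional stationary
    # Gaussian interval and flag the switch (cf. BoxCoxPredictionIntervalsSwitches.csv).
    gaussian_fallback <- function(bc_lower, bc_upper, yhat, resid_sd, level = 0.90) {
      z <- qnorm(1 - (1 - level) / 2)               # approx. 1.645 for a 90% interval
      bad <- !is.finite(bc_lower) | !is.finite(bc_upper)
      if (any(bad)) {
        bc_lower[bad] <- yhat[bad] - z * resid_sd   # stationary Gaussian lower bound
        bc_upper[bad] <- yhat[bad] + z * resid_sd   # stationary Gaussian upper bound
        message(sum(bad), " Box-Cox interval(s) replaced with Gaussian intervals")
      }
      data.frame(lower = bc_lower, upper = bc_upper, switched = bad)
    }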

Additionally, the files were renamed for better consistency with standard GitHub version-control practice and, in some sense, with semantic versioning more generally (mitigating version-lock issues, for example). However, the custom function files (i.e., all M 4 .R files other than the main program) are otherwise substantially unchanged.

Other issues or caveats are as noted above for the original distribution.