Regression-based Daugava River Flood Forecasting and Monitoring

Abstract. The paper discusses the application of linear and symbolic regression to forecast and monitor river floods. The main tasks of the research are to find an analytical model of river flow and to forecast it. The challenges are a small set of flow measurements and a small number of input factors. Genetic programming is used for the symbolic regression task. To train the model, historical data from the Daugava River monitoring station near Daugavpils city are used. Several regression scenarios are discussed and compared. Models obtained by the methods discussed in the research show good results and are applicable to predicting the river flow and forecasting floods.


I. INTRODUCTION
Early forecasting of river floods and prediction of the areas to be flooded is a pressing problem in territories located on the banks of large rivers with regular or irregular flood behaviour. Solving this problem makes it possible to prevent damage and losses in advance in inhabited or agricultural territories in risk areas. An essential part of forecasting is monitoring, which collects data on river behaviour parameters over long time periods. In turn, the application of mathematical methods allows finding relations and patterns in this behaviour. Moreover, monitoring the physical parameters of the river allows forecasting the river behaviour in the near future and, correspondingly, predicting floods.
In research [1], modelling for the evaluation of the aftermath of spring floods of the Daugava River is discussed, together with the estimation of the flooded areas. For this estimation, a heightmap model of the investigated area is applied together with data from the local river monitoring station. The main river monitoring data used for the flood estimation are the river flow or river discharge, i.e., the volume of water flowing through the current cross-section of the river in a defined time interval. The problem is that the local station evaluates the river flow rarely and irregularly. Nonetheless, the river water level is monitored regularly and can be used to estimate the unknown data.
In this study, the development of an analytical model of river flow depending on the current or recent river level is proposed for the determination of the river flow discharge values, which are used in [1]. Linear and symbolic regression are applied to find such a model.
The problem statement is defined in Section II. Section III discusses the regression methods proposed to find the analytical flow model. Section IV gives an overview of the problem input data. The application of linear and symbolic regression methods is discussed in Sections V and VI, respectively. Conclusions are summarised in Section VII.

II. PROBLEM STATEMENT
The goal of the research is to find a way to calculate the river flow from a number of river water level samples taken at the same river monitoring station. The river level data are dynamic.
The available input data of the stated problem are given in two tables:
1. A table of river water level values for each hour.
2. The data of the river discharge measurements.
Both data tables for the considered area are taken from an open data source [2]. The data describe the condition of the Daugava River at the monitoring station near Daugavpils city.
The river water level data are given with a precision of one centimetre for each hour of each day in the analysed time interval (see Table I). The river level is measured relative to a reference level and, thus, can have negative values. In turn, the data table of river flow discharge (see Table II) contains the dates when the measurements were sampled, as well as the measurement time and the determined water flow in m³/s. Flow measurements are highly irregular: some measurements are separated by intervals of several months, whereas others are performed within a single week.

In the research, it is proposed that the water discharge in the river is related to the water level. Thus, it is possible to determine the required water flow value if the current and recent river levels are known. Due to the high availability of river level measurements, it is proposed to calculate the flow from a number of recent level measurements that are separated by specific time intervals.
An analytical mathematical model in the closed form of an algebraic expression, which relates the water level to the forecasted water flow with reliable precision, has to be found in the research.
The following solution steps and subtasks are planned and described in this research:
1. to analyse the input data to reveal patterns and data incompleteness;
2. to process and prepare the data for the subsequent analysis methods;
3. to perform a statistical analysis of the data;
4. to solve the regression task with the least squares method and to analyse the results;
5. to solve the symbolic regression task with genetic programming;
6. to compare the results and to draw conclusions.

III. REGRESSION METHODS

A. Linear Regression
One of the most common and widely used approaches to finding the relationship between one dependent variable and a number of explanatory variables is linear regression. Linear regression implies that the dependent variable can be expressed as a linear equation of the explanatory variables. The task is to find the coefficients of the linear equation that fit the data with the smallest error [3].
The commonly used method to fit linear regression data is the least squares method. The least squares method is an exact mathematical method whose goal is to minimise the sum of squared residuals, where a residual is the difference between the observed value and the value provided by the regression model [4].
In the current research, linear regression is performed with the statistical tools embedded in the Microsoft Excel spreadsheet application, which fit the coefficients of the linear model.
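As an illustration of the least squares method for a single explanatory variable, the closed-form solution can be sketched as follows. This is a minimal sketch and not the Excel procedure used in the research; the function names and the sample values are assumptions made up for the example and are not taken from the Daugava dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared differences between observed and modelled values."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative values (hypothetical levels and flows)
levels = [0.0, 100.0, 200.0, 300.0]
flows = [210.0, 390.0, 610.0, 790.0]

a, b = fit_line(levels, flows)
ssr = sum_squared_residuals(levels, flows, a, b)
print(f"flow = {a:.3f} + {b:.3f} * level, SSR = {ssr:.3f}")
```

Any other choice of the two coefficients gives a larger sum of squared residuals on the same data, which is the property the method minimises.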

B. Symbolic Regression
Symbolic regression, or function identification, is an approach to finding mathematical expressions in symbolic form that fit the regression data in the best way and predict a dependent variable from the explanatory variables with the smallest error. In symbolic regression, both the symbolic form of the model and the coefficients of the model variables are found. Symbolic regression thus differs from traditional linear or polynomial regression, where only the best coefficients for a given linear or polynomial model have to be found [5].
The symbolic regression approach is closely related to genetic programming, which is the natural choice for finding symbolic expressions that fit the data.

C. Genetic Programming
Genetic programming (GP) is an evolutionary algorithmic approach to finding computer programs that perform a defined task in the best way [5]. Genetic programming is derived from the genetic algorithm: it works with a population of solution candidates (i.e., individuals) and performs evolution via the iterative execution of selection, crossover and mutation operators. The main distinguishing feature of GP is that the individuals are represented in the form of functional trees and the fitness function determines how well the program of a solution candidate performs the given task.
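The tree representation of an individual can be illustrated with a small sketch. The nested-tuple encoding, the `FUNCTIONS` table and the example tree below are assumptions chosen for brevity; they do not reproduce the internal data structures of any particular GP framework.

```python
import operator

# Function set of the illustrative trees (hypothetical, arithmetic only)
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
}


def evaluate(node, variables):
    """Recursively evaluate an expression tree.

    A node is either a tuple (function symbol, left subtree, right subtree),
    a string naming an explanatory variable, or a numeric constant.
    """
    if isinstance(node, tuple):
        symbol, left, right = node
        return FUNCTIONS[symbol](evaluate(left, variables),
                                 evaluate(right, variables))
    if isinstance(node, str):
        return variables[node]
    return float(node)


# Hypothetical individual encoding the expression (2.0 * h12) + 5.0
tree = ("+", ("*", 2.0, "h12"), 5.0)
value = evaluate(tree, {"h12": 10.0})
print(value)
```

In such an encoding, crossover would exchange subtrees between two individuals and mutation would replace a node or a subtree.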
In symbolic regression, the mathematical expression to be identified is interpreted as a computer program whose input data are the explanatory variables and whose output is the dependent variable. The following evaluators can be used as the fitness function in symbolic regression: the mean squared error, the mean absolute error, and the Pearson R squared (R²) coefficient of determination [6].
In the current research, the implementation of genetic programming based symbolic regression in HeuristicLab optimisation framework [7], [8] is applied.
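The three fitness evaluators mentioned above can be sketched in a few lines. This is a minimal sketch under the usual textbook definitions, not the HeuristicLab implementation; the function names and the sample vectors are assumptions made up for the example.

```python
def mean_squared_error(target, estimate):
    """Average of squared differences between target and estimate."""
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


def mean_absolute_error(target, estimate):
    """Average of absolute differences between target and estimate."""
    return sum(abs(t - e) for t, e in zip(target, estimate)) / len(target)


def pearson_r_squared(target, estimate):
    """Squared Pearson correlation coefficient between target and estimate."""
    n = len(target)
    mean_t = sum(target) / n
    mean_e = sum(estimate) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(target, estimate))
    var_t = sum((t - mean_t) ** 2 for t in target)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return cov * cov / (var_t * var_e)


# Made-up values: the estimate is exactly twice the target
target = [1.0, 2.0, 3.0, 4.0]
estimate = [2.0, 4.0, 6.0, 8.0]

print(mean_squared_error(target, estimate))
print(mean_absolute_error(target, estimate))
print(pearson_r_squared(target, estimate))
```

The example shows a property worth keeping in mind: the Pearson R² measures only the linear association between the two series, so an estimate that is a scaled copy of the target still reaches R² = 1, although its squared and absolute errors are not zero.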

IV. INPUT DATA

A. Input Data Preparation
For the identification of the regression model, the following data pre-processing tasks are performed.
The measurements from Table II are taken as the values of the dependent variable in the training dataset. As the number of measured samples is relatively small, all data in the corresponding table are used. The explanatory variables are derived from Table I, but the table is transformed in the following way.
An analysis of the dataset from Table I shows that the hourly water level changes only slightly between neighbouring samples; moreover, the water level often does not change from one hour to the next. Thus, only a small part of the input data is selected for the regression task.
It is assumed that the following data are related to each sample of the water flow discharge measurements:
1. the current water level in the river (the measurement taken at the time when the corresponding water flow is measured);
2. the water level of the river several hours before the flow measurement;
3. the water level of the river several days before the flow measurement;
4. the water level one week before the flow measurement is carried out.
In the transformation process of the input data, the following table with prepared data is obtained (see Table III).
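The transformation can be sketched as a lookup of lagged level values for each flow measurement. The lag offsets follow the attributes of Table III (h0 to d7); the data layout (a dictionary of hourly levels keyed by timestamp), the function name `build_record` and the rounding of the measurement time down to the full hour are assumptions of this sketch, and the sample series is synthetic.

```python
from datetime import datetime, timedelta

# Lag of each attribute in hours before the flow measurement
LAGS_HOURS = {
    "h0": 0, "h3": 3, "h6": 6, "h12": 12, "h18": 18,
    "d1": 24, "d2": 48, "d3": 72, "d7": 168,
}


def build_record(levels, measured_at, flow):
    """Build one dataset row from hourly levels and a flow measurement.

    levels: dict mapping datetime (on the full hour) to the water level.
    measured_at: datetime of the flow measurement.
    flow: measured discharge in m3/s.
    """
    # Assumption: the measurement time is rounded down to the full hour
    base = measured_at.replace(minute=0, second=0, microsecond=0)
    record = {"flow": flow}
    for name, lag in LAGS_HOURS.items():
        record[name] = levels[base - timedelta(hours=lag)]
    return record


# Synthetic hourly series: the level equals the hour index (illustrative only)
start = datetime(2010, 1, 1)
levels = {start + timedelta(hours=i): float(i) for i in range(200)}

record = build_record(levels, start + timedelta(hours=180, minutes=20), 500.0)
print(record)
```

A missing hourly value raises a `KeyError` in this sketch, which makes data incompleteness visible instead of silently filling it.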

The table has the following attributes:
• flow - the river water discharge in m³/s;
• h0 - the water level at the time of the water flow measurement;
• h3 - the water level 3 hours before the water flow measurement (h6, h12 and h18 are the levels 6, 12 and 18 hours, respectively, before the flow is measured);
• d1 - the water level measured 24 hours before the flow measurement (d2 and d3 are the river levels 2 and 3 days before the flow measurement, respectively);
• d7 - the water level measured a week before the current flow is measured.
The regression experiments are performed with the dataset described in Table III.
The correlation analysis shows that the lowest correlation is between the water flow value and the water level measured a week earlier. The highest correlation with the water flow value is shown by the water level values collected during the day before the flow is measured. For the analysed dataset, the highest correlation is shown by the water level measured 12 hours before the water discharge is measured. Moreover, the explanatory variables in the training dataset are highly correlated with each other; thus, most of them can be omitted.
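A correlation analysis of this kind can be sketched as follows, assuming that the Pearson correlation coefficient is the measure used. The function name and all values below are made up for illustration; they are not the coefficients of Table IV nor records of Table III.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical columns in the layout of Table III (values are invented)
columns = {
    "flow": [300.0, 450.0, 700.0, 1200.0, 900.0],
    "h12": [10.0, 60.0, 150.0, 320.0, 220.0],
    "d7": [40.0, 20.0, 90.0, 150.0, 260.0],
}

# Correlation of every explanatory attribute with the flow
with_flow = {
    name: pearson(columns["flow"], values)
    for name, values in columns.items()
    if name != "flow"
}
for name, r in sorted(with_flow.items(), key=lambda item: -item[1]):
    print(f"{name}: r = {r:.3f}")
```

Applying the same function to pairs of explanatory attributes would reveal the mutual correlation between them, which is the reason given above for omitting most of them.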

V. LINEAR REGRESSION
Several experiments to obtain a linear regression model of the water flow value were performed with the regression toolbox of the Microsoft Excel application.
At first, a model with only one explanatory variable, h12, which showed the highest correlation with the dependent variable in the correlation analysis, is fitted with the least squares method. The obtained model has the following mathematical expression: (1)
The model with one coefficient has a coefficient of determination R² ≈ 0.9011, which indicates its high reliability.
For the linear model with all the dataset attributes listed above, except the one-week-old measurement of the water level, the following linear approximation has been obtained: (2)
Such a model fits the data with the coefficient R² ≈ 0.908; thus, it does not increase the accuracy of the regression model, while the model becomes large and uses a large number of attributes.
For the validation of the obtained regression models, the dependency (1) of the river flow discharge on the river level is visualised in a chart and shown in Fig. 1. As can be seen, the linear regression fits the empirical data poorly. The flow data have a trend that looks close to a polynomial trend (see Fig. 1). The linear model performs badly for high water levels; therefore, it can be concluded that it is not feasible for flood forecasting, where river levels are high.

VI. SYMBOLIC REGRESSION WITH GENETIC PROGRAMMING
In search of more precise models, symbolic regression experiments with the application of genetic programming are performed. In the following experiments, the HeuristicLab optimisation framework is applied [7]. The following parameters are defined for the genetic programming algorithm: a population size of 200 individuals; a subtree swapping crossover; and all GP mutation operators implemented in HeuristicLab, with a mutation rate of 5% [9]. The proportional selection operator is used in the algorithm with one elite individual. The fitness function is evaluated by the Pearson R² coefficient. The available tree nodes are: real-valued constants in the range [-100, 100], explanatory variables, arithmetic functions (+, -, *, /), trigonometric functions (sin, cos, tg), and exponential, logarithm and power functions. Maximal
Data visualised in the scatter plot (Fig. 4) show that the model is accurate for the majority of the records. A high mismatch between the estimated and target values is observed for several records in the low-water data. It should be noted that all models obtained after the termination condition of genetic programming include only simple mathematical operators, such as multiplication, addition and subtraction, whereas the solutions with trigonometric, exponential and logarithmic functions have poor fitness. Thus, the flow discharge should be described by a polynomial model.
Results of the validation experiments for the regression model (4) are shown in Fig. 5. It can be seen that the model is well fitted, and there are only a few records in the dataset that differ strongly from the model. These outliers are probably caused by other physical parameters of the river, which were not included in the input data of the problem statement.
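For reference, the regression model (4) can be written as a short function. The coefficients are copied from the expression (4) given in the paper; the function name and the sample arguments are assumptions of this sketch, and the levels are assumed to be expressed in the same units as in Table III.

```python
def flow_model_4(h0, h12, h18, d3):
    """Estimate the river flow (m3/s) with the regression model (4).

    h0, h12, h18: water levels 0, 12 and 18 hours before the estimate.
    d3: water level 3 days before the estimate.
    """
    first = 0.645 * h0 - 0.666 * h12 + 0.361 * h18 - 0.066 * d3
    second = -0.361 * h18 + 0.290 * h12 - 24.078
    return 216.678 - 0.202 * first * second


# Hypothetical steady-state inputs: the same level at all four times
for level in (0.0, 100.0, 200.0):
    estimate = flow_model_4(level, level, level, level)
    print(f"level = {level:6.1f} -> flow = {estimate:8.3f}")
```

Since the model is a product of two expressions that are linear in the levels, the estimate is a second-degree polynomial of the inputs, which agrees with the polynomial trend of the empirical data.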
Nevertheless, the model (4) shows good results for records with a high water level; thus, it is applicable to flood forecasting.
Fig. 5. Dependence of a river flow on the river level h12 for the regression model (4) and empirical data
As a majority of the explanatory variables are measurements taken in the near past, symbolic regression can be applied to forecasting the flow in the near future. In this case, the regression model is obtained by applying GP to a dataset from which the most recent river level measurements (e.g., h0, h3) are excluded. In a series of 20 GP experiments with the above-mentioned algorithm parameters and a dataset with h0 excluded, the following regression model was obtained: (5)
The model (5) can be applied in situations when the current water level h0 in the river is not known, but measurements made at least 3 hours earlier are available. The model (5) has a Pearson R² ≈ 0.957 for the training set and R² ≈ 0.981 for the test set.
It should be noted that the models obtained in different runs of GP have different algebraic forms and coefficient values, but at the same time they describe the training dataset in the same way and with a very close error. It can be concluded that the search algorithm performs well and converges to similar models that are merely expressed in different forms.

VII. CONCLUSION
The main result of the research is that the river flow discharge can be estimated from recent water level measurements taken at a particular monitoring station. To obtain the analytical model of the flow discharge, the regression model has to be fitted by applying genetic programming. The obtained river flow regression models, used in the real-life validation of the river flood prediction [1], have shown good results, and the proposed methods are applicable to the solution of similar tasks.
The linear model obtained with the Microsoft Excel tool can be used as a simple equation for the flow calculation at a medium river level, but the model is not feasible in flood situations, when the water level is high. For a higher accuracy of the output data, the model obtained by genetic programming has to be applied.
However, even the best models obtained by symbolic regression have small errors and do not perfectly fit several records in the dataset. This can be explained by the small number of input factors, which include only values from the river level measurements. Thus, more parameters that affect the water flow and are obtained at the river monitoring station should be included in the dataset to search for more precise flow discharge models in the future.

Fig. 1. Dependence of a river flow on the river level h12 for the regression model (1) and empirical data
The linear regression model (2) applied to the training set also produces a chart that is visually close to Fig. 1.

Fig. 2. Regression model (3) represented in the form of a tree in the optimisation framework
The model (3) can be easily transformed into a more readable form, such as:
flow ≈ 216.678 - 0.202 · (0.645·h0 - 0.666·h12 + 0.361·h18 - 0.066·d3) · (-0.361·h18 + 0.290·h12 - 24.078)   (4)
As can be seen, the main factors that affect the water flow are the same attributes h12 and h18. The model fits the data with the coefficient R² ≈ 0.963 for the training set and R² ≈ 0.953 for the test set. The model expresses the river flow in polynomial form and has a higher accuracy than the linear model. A line chart of the model is shown in Fig. 3 and a scatter plot is shown in Fig. 4. Both charts are obtained from the output of the experiment in the HeuristicLab framework. In the line chart, the blue line corresponds to the empirical values, and the red and yellow lines correspond to the model response. The division of the data into training and test datasets can be observed. In the

TABLE I
WATER LEVEL DATA

TABLE III
THE DATASET OF THE REGRESSION TASK (FRAGMENT)

A full table contains 93 records with the Daugava water flow discharge measurements from January 1, 2008 to February 28, 2013.

B. Input Data Statistical Analysis
A correlation analysis is performed on the data of Table III with a statistical tool of the Microsoft Excel application. The results of this analysis are shown in Table IV.

TABLE IV
CORRELATION BETWEEN INPUT DATA ATTRIBUTES