The methodological workflow developed in this study comprised the following steps (Fig. 10): preparation of dependent and independent variables, running the standalone SVR model, development of the two hybridized models (Wavelet-SVR-BAT and Wavelet-SVR-GWO), model validation, and comparison of model performance.

### Urban flood inventory

Historical flood inundation events in urban environments provide vital information for modeling. In this study, a total of 118 flood inundation locations were recorded by field investigations and available reports in the Amol Municipality (see Fig. 1). A ratio of 70:30 was used to randomly split the flood inventory dataset into two groups, for training (n = 83) and validation (n = 35) of the models.
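The 70:30 random split can be sketched as follows. This is a minimal illustration with NumPy; the function name and seed are assumptions for the example, as the study specifies only the ratio and the resulting group sizes.

```python
import numpy as np

def split_inventory(n_points, train_frac=0.7, seed=42):
    """Randomly split point indices into training and validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_points)          # shuffle the inventory indices
    n_train = int(round(train_frac * n_points))
    return idx[:n_train], idx[n_train:]

# 118 inundation points -> 83 for training, 35 for validation
train_idx, valid_idx = split_inventory(118)
```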

### Factors influencing urban flood inundation

There are no universal guidelines for selecting the factors that affect urban flood susceptibility, and previous researchers have considered different factors depending on the model used and data availability. In this study, the selection of flood conditioning factors was based on previous work by Darabi et al.^{11} and Falah et al.^{23} and on data availability in the study region. The factors selected to map urban flood susceptibility were elevation, slope percent, distance from river, distance from channel, curve number (CN), and precipitation (Fig. 11). The river network was derived from an elevation layer with 5-m spatial resolution provided by the Amol city authority. The elevation of Amol city ranges between 59 and 137 m (Fig. 11a). Slope is a topographical factor that influences water flow characteristics such as velocity and discharge, as well as erosion and sedimentation processes. Lower slopes generally occur in the lowland areas of a watershed, making them susceptible to flooding^{40}; hence slope can be an important factor in flood susceptibility investigations. Slope percent in Amol city varies from 0 to 25 (Fig. 11b). Rivers and channels are the main paths of water flow in cities, where rivers are natural features and channels are human-made. Major floods usually originate from rivers, but inappropriate design of the channel network can also cause considerable flooding. Therefore, this study considered distance from rivers and distance from channels, calculating both layers with the Euclidean distance function. Distance from river in Amol city ranges between 0 and 2,322 m (Fig. 11c), and distance from channel varies between 0 and 1,499 m (Fig. 11d). Curve number represents the permeability of the soil or ground surface, reflecting the combined influence of land use and soil characteristics. The CN layer was created with the ArcCN-Runoff tool.
CN in Amol city spans a wide range (40–100), indicating substantial spatial variation (Fig. 11e). Precipitation is the main source of water flow and hence of flooding. To take its impact into account, we acquired the precipitation map from Mazandaran's Forest, Range, and Watershed Management Organization. According to their report, 20 rain gauging stations were used to generate the precipitation map, interpolated with the Kriging method because it yielded a lower RMSE than other interpolation methods such as inverse distance weighting (IDW). Precipitation in the study area shows only small spatial variation (672 to 684 mm year^{−1}) (Fig. 11f). Further, the variance inflation factor (VIF) was calculated to investigate multicollinearity between the factors and avoid bias in the results. A VIF value higher than 5 indicates a strong correlation between factors and, accordingly, critical multicollinearity. The tolerance index, the reciprocal of VIF, was also used, with values lower than 0.2 indicating critical multicollinearity.
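The VIF screening described above can be sketched as follows; this is a minimal NumPy illustration (the function name is an assumption, and the factor matrix would hold the six conditioning-factor values sampled at the inventory points):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the factor matrix X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all the other columns (with an intercept). Tolerance is 1 / VIF.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other factors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / (((y - y.mean()) ** 2).sum())
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

A factor with VIF above 5 (tolerance below 0.2) would be flagged as critically collinear and reconsidered.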

### Application of models

#### Support Vector Regression Algorithm

The support vector regression (SVR) algorithm, a supervised learning technique, can be applied to solve regression problems and perform short-term forecasting^{41}. The SVR algorithm, which is based on statistical learning theory, was first proposed by Vapnik^{42}. It can be employed to find relationships between input and output data based on structural risk minimization, whereas neural network algorithms (NNA) and conventional statistical methods are based on empirical risk minimization. SVR therefore has an advantage in reducing the generalization error, rather than only the learning error^{43}. The purpose of SVR is to produce a function that explains the relationship between input and output data:

$$f(x) = w^{T} \psi(x) + b$$

(1)

where \(x \in R^{n}\) denotes the input data vector, \(w \in R^{n}\) the weight vector, \(b \in R\) a bias (deviation) term, and \(\psi(x)\) a non-linear mapping function that transforms the input data into a high-dimensional feature space. The parameters \(w\) and \(b\) are determined according to the principle of structural risk minimization by solving:

$$\begin{gathered} \text{Minimize:}\quad \left[ \frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{l} \left( \xi_{i} + \xi_{i}^{*} \right) \right] \\ \text{s.t.}\quad \left\{ \begin{gathered} y_{i} - w^{T} x_{i} - b \le \varepsilon + \xi_{i} \\ - y_{i} + w^{T} x_{i} + b \le \varepsilon + \xi_{i}^{*} \\ \xi_{i},\; \xi_{i}^{*} \ge 0 \\ \end{gathered} \right. \end{gathered}$$

(2)

where \(y_{i} \in R\) is the target value, \(C > 0\) is a penalty factor determining the tradeoff between training error and model complexity, \(\xi_{i}, \xi_{i}^{*}\) are slack variables that adjust the upper and lower constraints on the function \(f(x)\), and \(\varepsilon\) denotes the insensitive loss, which controls the quality of approximation^{43}. Equation (2) can be solved using the Lagrangian formulation, and its final solution is:

$$f(x) = \sum\limits_{i = 1}^{n} (\alpha_{i} - \alpha_{i}^{*})\, k(x, x_{i}) + b$$

(3)

where \(\alpha_{i}, \alpha_{i}^{*}\) are Lagrange multipliers and \(k(x_{i}, x_{j}) = \langle \psi(x_{i}), \psi(x_{j}) \rangle\) is the kernel function. Many different types of kernel function can be used (e.g., polynomial, Gaussian radial basis, exponential radial basis).

It should be noted that the performance of the SVR technique depends strongly on its parameters: the penalty coefficient (\(C\)), the permitted error range (\(\varepsilon\)), and the kernel function parameters. If \(C\) is very small, the training error will be high; conversely, if \(C\) is very large, accuracy on the training data improves but the model generalizes poorly to unseen data. A large \(\varepsilon\) reduces the number of support vectors, so the forecasting model becomes proportionately simpler but its accuracy declines; a small \(\varepsilon\) increases model complexity and improves regression accuracy on the training data at the cost of generalization. To find optimized values of these parameters, the Bat and Grey Wolf evolutionary optimization algorithms were applied in this study^{44}. A detailed description of the SVR algorithm can be found in Smola and Schölkopf^{45}.
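The effect of \(\varepsilon\) on model complexity can be illustrated with a standard RBF-kernel SVR. This is a hedged sketch using scikit-learn on synthetic one-dimensional data, not the study's actual model or data:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data standing in for the flood conditioning factors.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 4, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# A wider epsilon tube admits more points without penalty, so fewer
# training points become support vectors and the model is simpler.
tight = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X, y)
wide = SVR(kernel="rbf", C=10.0, epsilon=0.5).fit(X, y)
print(len(tight.support_), len(wide.support_))
```

Sweeping \(C\) and \(\varepsilon\) over a grid (or, as in this study, with an evolutionary optimizer) and scoring each candidate on held-out data is how the optimized values are chosen.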

#### Translation invariant wavelet kernel

Kernel functions have been used in many different pattern analysis and ML techniques. They assist the SVR method in processing high-dimensional and infinite data, allowing linear separability by mapping input-space data to higher dimensions, while keeping the computational complexity of SVR in feature space reasonable^{46}. Thus not only the number of support vectors and their weights but also the type of kernel function affects the results^{47,48}. A wavelet function can be employed as the kernel function^{49}. Generally, a wavelet function is defined as:

$$\psi_{a,b}(x) = \frac{1}{\sqrt{a}}\,\psi\left( \frac{x - b}{a} \right)\quad (a > 0,\; b \in R)$$

(4)

where \(a\) and \(b\) represent the dilation (scaling) and translation (shift) parameters, respectively, \(\psi(x)\) is the wavelet base or mother wavelet, and indexed families of wavelets are obtained by changing *a* and *b*^{39,46}. If *a* and *b* are chosen as powers of 2 (*a* = 2^{j} and *b* = *k* × 2^{j}), the discrete wavelet transform (DWT) is as follows^{50}:

$$\psi_{j,k}(x) = \frac{1}{2^{j/2}}\,\psi\left( \frac{x}{2^{j}} - k \right)$$

(5)

where \(k\) is a translation parameter and \(j\) is an integer representing the resolution level, i.e., the dilation parameter. Based on the above, the kernel function can be computed from inner products of wavelet functions as:

$$k(X, X^{\prime}) = \langle \psi(X), \psi(X^{\prime}) \rangle = \prod\limits_{i = 1}^{N} \left( \psi\left( \frac{x_{i} - b_{i}}{a} \right) \cdot \psi\left( \frac{x_{i}^{\prime} - b_{i}^{\prime}}{a} \right) \right)\quad a,\; b_{i},\; b_{i}^{\prime},\; x_{i},\; x_{i}^{\prime} \in R$$

(6)

where \(X, X^{\prime} \in R^{N}\) are N-dimensional vectors, and the translation-invariant wavelet kernel can be expressed as^{51}:

$$k(X, X^{\prime}) = \prod\limits_{i = 1}^{N} \psi\left( \frac{x_{i} - x_{i}^{\prime}}{a} \right)$$

(7)

It should be pointed out that there are various wavelet functions, such as the Haar, spline, Daubechies, Marr, and Morlet wavelets. The Gaussian function (Eq. 8) was used in this study due to its excellent reputation in terms of performance, its many useful features, and its strong learning capability^{52,53,54}. Thus Eq. (7) was re-formulated as:

$$\psi_{Gaussian}(x) = \exp\left( \frac{-x^{2}}{2} \right)$$

(8)

$$k(X, X^{\prime}) = \prod\limits_{i = 1}^{N} \exp\left( -\frac{\left| x_{i} - x_{i}^{\prime} \right|^{2}}{2a^{2}} \right)$$

(9)

Finally, the general form of the wavelet SVR (WSVR) regression function can be written as:

$$f(x) = \sum\limits_{i = 1}^{n} (\alpha_{i} - \alpha_{i}^{*}) \prod\limits_{j = 1}^{N} \exp\left( -\frac{\left| x_{j} - x_{j}^{\prime} \right|^{2}}{2a^{2}} \right) + b$$

(10)

It should be noted that since WSVR has multidimensional analysis properties and high flexibility, and is a generalization of the SVR algorithm, it can be employed to solve non-linear problems as well as approximation and classification issues^{46}. Studies have shown that the approximation effect of the wavelet kernel is far better than that of the Gaussian kernel^{52}. Additionally, due to the redundancy and correlative features of the Gaussian kernel, training of the wavelet-kernel SVM is rather fast in comparison with the Gaussian-kernel SVM^{49}. More detailed explanations of wavelet SVR can be found in Zhang and Han^{49} and Su et al.^{55}.
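Equations (7)–(9) can be sketched as a custom kernel plugged into a standard SVR implementation. The following is an illustrative sketch (the dilation value, data, and function names are assumptions, not the study's setup); note that with the Gaussian mother wavelet of Eq. (8), the product in Eq. (9) algebraically reduces to a Gaussian RBF form, since the product of the per-dimension exponentials equals the exponential of the summed squared differences:

```python
import numpy as np
from sklearn.svm import SVR

A_DILATION = 1.0  # dilation parameter a (assumed value for illustration)

def gaussian_mother(u):
    """Mother wavelet of Eq. (8): psi(u) = exp(-u^2 / 2)."""
    return np.exp(-(u ** 2) / 2.0)

def wavelet_kernel(X, Y, a=A_DILATION, psi=gaussian_mother):
    """Translation-invariant wavelet kernel of Eq. (7) as a Gram matrix:
    k(x, x') = prod_i psi((x_i - x'_i) / a)."""
    diff = X[:, None, :] - Y[None, :, :]      # (n, m, d) pairwise differences
    return np.prod(psi(diff / a), axis=2)

# Usage sketch: scikit-learn accepts a callable kernel returning the Gram matrix.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (80, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])
model = SVR(kernel=wavelet_kernel, C=10.0, epsilon=0.05).fit(X, y)
```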

#### Bat Algorithm (BA)

In 2010, Xin-She Yang proposed the Bat metaheuristic algorithm, which derives from the echolocation behavior of bats^{35}. At night, bats can pinpoint their path and build detailed images of their prey by comparing the emitted pulse with its echoes. In the Bat algorithm, the position of each bat represents a potential solution, and the quality of a solution is determined by how close the bat is to its prey. The approach was developed based on the following three rules^{56}:

1. All bats use echolocation to sense distance and to distinguish between food/prey and obstacles.

2. During hunting, bats fly randomly with a certain velocity at position \(X_{i}\), with a fixed frequency \(f_{min}\), varying wavelength \(\lambda\), and loudness \(A\). They automatically adjust the wavelength and the pulse emission rate \(r \in [0, 1]\) according to their distance from the target.

3. Since loudness can vary in different directions, it is assumed to change from a large (positive) value to a minimum constant value.

Like other metaheuristic algorithms, the Bat algorithm (BA) balances its two main features, intensification (exploitation) and diversification (exploration), a tradeoff that enables it to find globally good solutions^{57}. In the BA-SVR method, bats move within the parameter space and try to detect optimized parameters; in other words, defining optimal values for the SVR parameters is the main objective of implementing this approach^{37}. For further information, see Ali^{58}, Sambariya et al.^{59}, Yang^{35}, and Yang and Gandomi^{60}.
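Under the three rules above, a minimal Bat Algorithm for a generic minimization problem might look as follows. This is a sketch, not the authors' implementation; in the hybrid BA-SVR model, `obj` would be the SVR validation error as a function of \(C\), \(\varepsilon\), and the kernel parameter, and all default settings here are assumptions:

```python
import numpy as np

def bat_algorithm(obj, lb, ub, n_bats=20, n_iter=150, fmin=0.0, fmax=2.0,
                  loudness=0.9, pulse_rate=0.5, seed=0):
    """Minimal Bat Algorithm sketch minimizing obj over the box [lb, ub]."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    pos = rng.uniform(lb, ub, (n_bats, dim))        # bat positions = solutions
    vel = np.zeros((n_bats, dim))
    fit = np.array([obj(p) for p in pos])
    best = pos[fit.argmin()].copy()
    best_fit = fit.min()
    for _ in range(n_iter):
        for i in range(n_bats):
            f = fmin + (fmax - fmin) * rng.random()  # random pulse frequency
            vel[i] += (pos[i] - best) * f            # move relative to the best bat
            cand = np.clip(pos[i] + vel[i], lb, ub)
            if rng.random() > pulse_rate:            # local random walk around best
                cand = np.clip(best + 0.05 * rng.normal(size=dim), lb, ub)
            f_cand = obj(cand)
            if f_cand <= fit[i] and rng.random() < loudness:
                pos[i], fit[i] = cand, f_cand        # accept the new solution
            if f_cand < best_fit:
                best, best_fit = cand.copy(), f_cand
    return best, best_fit
```

Applied to SVR tuning, each bat position would encode one candidate parameter triple, and the best position after the final iteration gives the tuned model.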

#### Grey wolf optimization (GWO) algorithm

The Grey Wolf Optimization (GWO) algorithm can be employed as another method to set optimum values of the SVR parameters. The GWO algorithm was introduced by Mirjalili et al.^{33} as a new metaheuristic method. It is a bio-inspired global optimization algorithm and, like other metaheuristic methods, belongs to the swarm intelligence approaches. The GWO algorithm is based on the leadership hierarchy and social behavior of grey wolves during hunting. Wolves are social animals that live in packs with a strict hierarchy. The leader of each pack, the alpha wolf (\(\alpha\)), dictates decisions about hunting, sleeping, and walking time, and all other group members must follow its orders. Despite the dominant manner of alpha wolves, however, democratic behavior can also be seen in the group. In the hierarchy, the other wolves fall into three levels: beta (\(\beta\)), delta (\(\delta\)), and omega (\(\omega\)). The beta wolves, at the second level, assist the alpha in making decisions and are the best candidates to replace the alpha in due course. The delta wolves are in charge of defending the boundaries, warning the pack about any danger or threat to its territory, and ensuring the safety of the group; during hunting or sourcing food, deltas help the alpha and betas. The omega wolves have the lowest rank and rights in the group: they are not allowed to eat until all other wolves have finished, play no role in decision making, and in fact have a victim role^{33}. In reality, however, these wolves are a crucial part of the group, and their elimination can lead to fighting and serious social problems for the pack as a whole^{61}.

In addition to the social hierarchy of grey wolves, another remarkable characteristic is their group hunting, which can be summarized in three phases: (i) Tracking, chasing, and approaching prey, (ii) running after prey, surrounding, and harassing the prey up to the moment it stops running, and (iii) attacking and killing^{62,63}.

In the GWO algorithm, the wolves' dominant social hierarchy is modeled mathematically by generating a random set of solutions, of which the solution with the best fit is considered the alpha (\(\alpha\)). The next two best solutions are called beta (\(\beta\)) and delta (\(\delta\)), respectively, and the remaining candidate solutions are called omega. The hunting process (optimization) is led by the \(\alpha\), \(\beta\), and \(\delta\) wolves, and the \(\omega\) wolves have to comply with these three groups. The following stages are performed: encircling and surrounding the prey, detecting the prey's location, and attacking and killing the prey (exploitation)^{5}.

It is worth mentioning that GWO, as an optimization algorithm, has better search ability and higher accuracy than the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO)^{33}. The GWO algorithm can be employed to solve non-convex optimization problems. Its main advantages are its simplicity, its applicability to real-world optimization problems, and its small number of control parameters^{13}. Supplementary information can be found in Sulaiman et al.^{64}, Niu et al.^{65}, Jayakumar et al.^{66}, Saxena et al.^{34}, and Luo^{51}.
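The hierarchy-driven search can be sketched as follows. This is a minimal illustration of the standard GWO update (not the authors' code); in the Wavelet-SVR-GWO hybrid, `obj` would again score candidate SVR parameter sets on validation data:

```python
import numpy as np

def grey_wolf_optimizer(obj, lb, ub, n_wolves=20, n_iter=100, seed=0):
    """Minimal GWO sketch: the three best wolves (alpha, beta, delta) steer
    the pack; the coefficient 'a' decays from 2 to 0, shifting the search
    from exploration to exploitation."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    pos = rng.uniform(lb, ub, (n_wolves, dim))
    for t in range(n_iter):
        fit = np.array([obj(p) for p in pos])
        order = fit.argsort()
        alpha, beta, delta = pos[order[:3]]        # three best solutions
        a = 2.0 * (1 - t / n_iter)                 # linear decay 2 -> 0
        new_pos = np.empty_like(pos)
        for i in range(n_wolves):
            steps = []
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a                 # encircling coefficient
                C = 2 * r2
                D = np.abs(C * leader - pos[i])    # distance to the leader
                steps.append(leader - A * D)
            new_pos[i] = np.clip(np.mean(steps, axis=0), lb, ub)
        pos = new_pos
    fit = np.array([obj(p) for p in pos])
    return pos[fit.argmin()], fit.min()
```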

### Accuracy assessment

In the present study, the prediction performance of the models was assessed by analyzing the agreement between observed data (flood inventory) and model results in terms of both presences (i.e., flooded locations) and absences (i.e., non-flooded locations)^{11}. Three evaluation approaches were used to assess model accuracy in both the training (goodness-of-fit) and validation (predictive performance) steps: cutoff-dependent metrics (sensitivity, specificity, accuracy, Matthews correlation coefficient, F-score, and misclassification rate), a cutoff-independent metric (the area under the receiver operating characteristic (ROC) curve), and the root mean square error (RMSE).

#### Cutoff-dependent metrics

All cutoff-dependent metrics were calculated based on a confusion matrix (also known as a contingency table). Its components are true negative (TN), true positive (TP), false negative (FN), and false positive (FP) (Table 5), where FP and FN are the numbers of pixels erroneously classified (type I and type II errors, respectively) and TN and TP are the numbers of pixels correctly classified. Note that a probability cutoff of 0.5 was chosen since the presence and absence locations were equally balanced, in accordance with Frattini et al.^{67} and Camilo et al.^{68}.

##### True positive rate

True positive rate (TPR) (also termed sensitivity) is one of the most common evaluation metrics and can be calculated as:

$$TPR = \frac{TP}{TP + FN}$$

(11)

##### False positive rate

False positive rate (FPR) (also known as 1 − specificity) can be calculated as:

$$FPR = \frac{FP}{FP + TN}$$

(12)

However, it should be noted that TPR and FPR are individually insufficient performance metrics: TPR ignores false positives (here, the number of pixels erroneously identified as flooded) and FPR ignores false negatives (here, the number of pixels erroneously identified as non-flooded). They are therefore informative only when used together.

##### Accuracy

Accuracy (A, also known as efficiency) is another common metric for evaluating model performance. It gives the proportion of all points, flooded and non-flooded, that are correctly classified by the model:

$$A = \frac{TP + TN}{TP + TN + FP + FN}$$

(13)

##### Matthews correlation coefficient

The Matthews correlation coefficient (MCC) assesses model performance based on the correlation between observed and predicted data^{69}. MCC ranges from −1 to 1, where −1 indicates complete disagreement between observed and predicted data and 1 indicates perfect agreement. MCC is calculated as:

$$MCC = \frac{TP \times TN - FP \times FN}{\left[ \left( TP + FP \right) \times \left( FN + TN \right) \times \left( FP + TN \right) \times \left( TP + FN \right) \right]^{1/2}}$$

(14)

##### F-score

The F-score (also called the F1 score or F measure) is calculated as:

$$F\text{-}score = \frac{2TP}{2TP + FP + FN}$$

(15)

It can also be obtained based on the TPR and another evaluation metric, Positive Predictive Value (PPV), as:

$$F\text{-}score = 2\,\frac{PPV \times TPR}{PPV + TPR}$$

(16)

where PPV is *TP*/(*TP* + *FP*).

##### Misclassification rate, MR

Misclassification rate considers both the false positive and false negative components and therefore reflects an overall error rate. MR can be computed as:

$$MR = \frac{FP + FN}{FP + FN + TP + TN}$$

(17)
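Equations (11)–(17) can all be computed directly from the four confusion-matrix counts; a minimal sketch (function name assumed for illustration):

```python
def cutoff_metrics(tp, tn, fp, fn):
    """Cutoff-dependent metrics of Eqs. (11)-(17) from confusion-matrix counts."""
    tpr = tp / (tp + fn)                         # Eq. (11), sensitivity
    fpr = fp / (fp + tn)                         # Eq. (12), 1 - specificity
    acc = (tp + tn) / (tp + tn + fp + fn)        # Eq. (13)
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (fn + tn) * (fp + tn) * (tp + fn)) ** 0.5)  # Eq. (14)
    f_score = 2 * tp / (2 * tp + fp + fn)        # Eq. (15)
    mr = (fp + fn) / (fp + fn + tp + tn)         # Eq. (17)
    return {"TPR": tpr, "FPR": fpr, "Accuracy": acc,
            "MCC": mcc, "F-score": f_score, "MR": mr}
```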

#### Cutoff-independent metric

The area under the ROC curve (AUC) is the most important evaluation metric in natural hazard assessment^{40}. The ROC curve simply plots the TPR (i.e., sensitivity) on the Y-axis against the FPR (i.e., 1 − specificity) on the X-axis. It is considered the real measure of model evaluation because it simultaneously includes all components of the confusion matrix and equitably estimates the overall quality of a model^{67}. AUC is bounded by [0, 1]: the larger the AUC value, the better the performance of the model over the whole range of possible cutoffs. Based on the analytical expression for the ROC curve, denoted *f*, AUC can be calculated as:

$$AUC = \int_{0}^{1} f\left( \mathrm{FPR} \right) d\,\mathrm{FPR} = 1 - \int_{0}^{1} f^{-1}\left( \mathrm{TPR} \right) d\,\mathrm{TPR}$$

(18)
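Empirically, AUC is obtained by sweeping the cutoff over the predicted susceptibility scores and integrating the resulting ROC curve; a sketch with the trapezoidal rule (function and variable names are illustrative, not from the study):

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """Empirical AUC: rank all scores, trace the ROC curve point by point,
    and integrate TPR over FPR (a discrete version of Eq. 18)."""
    scores = np.concatenate([scores_pos, scores_neg])
    labels = np.concatenate([np.ones(len(scores_pos)), np.zeros(len(scores_neg))])
    order = np.argsort(-scores, kind="stable")   # descending by score
    labels = labels[order]
    tpr = np.concatenate([[0.0], np.cumsum(labels) / labels.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / (1 - labels).sum()])
    # trapezoidal integration of TPR against FPR
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
```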

#### Root mean square error

Root Mean Square Error (RMSE) is a frequently used measure of the agreement between observed and predicted values. In the present study, the RMSE was used to evaluate all models. It can be calculated as:

$$RMSE = \left[ \frac{1}{N}\sum\limits_{i = 1}^{N} \left( S_{i} - O_{i} \right)^{2} \right]^{1/2}$$

(19)

where *S*_{i} and *O*_{i} are the predicted and observed values, respectively.
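Equation (19) translates directly to code; a minimal sketch (function name assumed):

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean square error of Eq. (19)."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))
```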

### Contribution of conditioning factors

The contribution of the flood conditioning factors (i.e., predictor variables) to the modeling process (relative importance) was investigated using a map-removal sensitivity analysis. The relative decrease (RD) in AUC values, which reflects the dependency of the model output on the conditioning factors, was calculated. RD can be calculated using the following equation:

$$RD = \frac{AUC_{all} - AUC_{i}}{AUC_{all}} \times 100$$

(20)

where AUC_{all} and AUC_{i} are the AUC values obtained from the flood susceptibility prediction using all conditioning factors and the prediction when the *i*th conditioning factor was excluded, respectively.
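The map-removal sensitivity analysis of Eq. (20) can be sketched as follows; the AUC values below are hypothetical, chosen only to illustrate the computation, not results from the study:

```python
def relative_decrease(auc_all, auc_without):
    """Map-removal sensitivity (Eq. 20): percentage drop in AUC when one
    conditioning factor is excluded from the model."""
    return (auc_all - auc_without) / auc_all * 100.0

# Hypothetical AUC values for illustration: a larger RD means the model
# depends more strongly on that conditioning factor.
auc_all = 0.92
for factor, auc_i in {"elevation": 0.85, "slope": 0.90}.items():
    print(factor, round(relative_decrease(auc_all, auc_i), 2))
```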