### Dataset

For PGA data, we referred to records of the national strong motion network in Japan, deployed and operated by NIED, K-NET and KiK-net^{17,18}. The ground-motion records have been publicly available on the associated website since May 1996 for K-NET and October 1997 for KiK-net. For earthquake source information (event location and moment magnitude), we referred to the F-net moment tensor solution catalog^{19,20}, which has also been publicly available on the associated website since January 1997. For event depth, we used the centroid depth of F-net. In this study, an earthquake is regarded as a point source, and we ignored the source finiteness of earthquakes to simplify the prediction problem. For *D*_{1400} and *Vs30* information at each site, we referred to the site-below underground information from a deep subsurface structure model of Japan^{22} and the *Vs30* map of Japan suggested by Matsuoka and Wakamatsu^{23}, both of which are publicly available on the website of J-SHIS^{21,24}.

To construct the dataset, we first collected available ground-motion data observed by K-NET and KiK-net and event data from the F-net moment tensor solution catalog and unified hypocenter catalog provided by the Japan Meteorological Agency (JMA). Then, we integrated these data and added site information (*D*_{1400} and *Vs30*) for each station. Finally, we retrieved ground-motion records for events satisfying the following conditions: (1) 4.5 ≤ *M*_{w} ≤ 7.5, (2) epicentral distance shorter than 200 km, (3) event depth shallower than 200 km, and (4) ground-motion records observed from at least five stations. At this step, we eliminated ground-motion records that did not include the S-wave part by checking the theoretical arrival time of the S-wave. The upper limit of *M*_{w} was set because the effect of the source finiteness in huge earthquakes (*M* 8 or 9) is expected to be large and the assumption of point source does not hold in very large earthquakes. The upper limit of the event depth was set to exclude deep earthquakes, which cause anomalous distributions of ground-motion intensity^{38,39}, from the dataset. The upper limit of epicentral distance was the same as in Morikawa and Fujiwara^{8}. We separated the dataset into training data with 186,310 samples (2082 events) consisting of records from 1997 to 2015 and test data with 22,323 samples (208 events) consisting of records from 2016 to 2017. For data preprocessing, we took the common logarithm of the epicentral distance, event depth, and PGA and then standardized the five explanatory variables and the target variable.

### ERT predictor

In constructing the ERT predictor, we used the Extra Trees Regressor in the scikit-learn Python programming package^{40}. The parameter setting of Extra Trees Regressor is the same as that used in the hybrid predictor, which is mentioned later.

### ERT predictor with weighted training data

For data weighting, the training data were divided into four groups; G1: 1 cm/s/s ≤ PGA < 10 cm/s/s, G2: 10 cm/s/s ≤ PGA < 100 cm/s/s, G3: 100 cm/s/s ≤ PGA < 1,000 cm/s/s, and G4: PGA ≥ 1,000 cm/s/s. G1–G4 had 113,806, 60,698, 3,891, and 22 records, respectively, in the training data (13,581, 7,243, 518, and 3 records, respectively, in the test data). We excluded records where the PGA was below 1 cm/s/s. From the data groups, a weighted dataset was prepared where the records of each group were replicated depending on their group weights. For example, a group weight of 2 means duplicating the group records. In this data weighting, a data group with a small volume of data was weighted heavily. Then, the weighted dataset was learned by the ML algorithm. This cycle was repeated using different weight combinations on the dataset and checking the ML results. Figure 4 provides an example with weighted training data where the weights of G1, G2, G3, and G4 were one, one, four, and sixteen, respectively.

### Hybrid predictor of ERT and GMPE

We developed a hybrid predictor where the predicted PGA is represented by adding the predicted value of an ML predictor to the predicted value of a conventional GMPE. For the GMPE, we used base model 1 for crustal earthquakes in Morikawa and Fujiwara^{8}. Their variables are the hypocentral distance (left( {sqrt {D^{2} + H^{2} } } right)) and *M*_{w}. The ML predictor with the variables *D*, *H*, *M*_{w}, *D*_{1400}, and *Vs30* learned the residual between the observation and GMPE prediction as the training data. For the ML algorithm, we used the Extra Trees Regressor in the scikit-learn Python programming package^{40}. Table 2 indicates the corresponding parameter setting. The number of trees was set to 1,000 because of the performance limit of the machine server we used. The maximum depth of the tree, the minimum number of samples required to be at a leaf node, and the number of features to consider when looking for the best split were determined based on the balance of the data fit and variance in the cross-validation test. In the cross-validation test, because seismic records are strongly correlated in time series and the use of random data split may cause results to be misunderstood, we sequentially split the training data into 10 groups based on the time series of events and conducted K-fold cross-validation test using 10 data groups.

### GMPE

Following Morikawa and Fujiwara^{8}, we reconstructed the GMPE considering site amplification due to deep sedimentary layers and shallow soft soils:

$$log PGA = G_{o} + G_{d} + G_{s}$$

(1)

$$G_{o} = a cdot left[ {{min}left( {M_{w} ,M_{0} } right) – M^{prime } } right]^{2} + b cdot X + c – log left[ {X + d cdot 10^{{e cdot {min}left( {M_{w} ,M_{0} } right)}} } right]$$

(2)

$$G_{d} = p_{d} cdot log left[ {{max}left( {D_{1400 , min} ,D_{1400} } right)/D_{0} } right]$$

(3)

$$G_{s} = p_{s} cdot log left[ {{min}left( {V_{s ,max} ,V_{s} 30} right)/V_{0} } right]$$

(4)

where (G_{o}) is the base GMPE model with (M_{w}) and hypocentral distance (X left( { = sqrt {D^{2} + H^{2} } } right)), (G_{d}) is a correction term for site amplification by deep sedimentary layers with *D*_{1400}, (G_{s}) is a correction term for site amplification by shallow soft soils with *Vs30*, and (left( {a,b,c,d,e,M_{0} ,M^{prime } ,p_{d} ,D_{1400 ,min} , D_{0} ,p_{s} ,V_{s ,max} ,V_{0} } right)) are parameters to be determined. In these parameters, (left( {d,e,M_{0} ,M^{prime } , D_{0} ,V_{0} } right)) is same as in Morikawa and Fujiwara^{8}. The rest of the parameters are determined by the following scheme. First, (left( {a, b, c} right)) in Eq. (2) is determined by the least-squares method from the training data with a distance-based weighting scheme8. Then, (left( {p_{d} ,D_{1400 ,min} } right)) in Eq. (3) is determined by the least-squares method and grid search of the residuals between the observations and (G_{o}) in the training data. Finally, (left( {p_{s} ,V_{s ,max} } right)) in Eq. (4) is determined by the least-squares method and grid search of the residuals between the observations and (G_{o} + G_{d}) in the training data.

### Measurement of predictor performance

To measure the predictor performance, we used the coefficient of determination (R^{2}):

$$R^{2} = 1 – frac{{mathop sum nolimits_{i = 1}^{n} mathop sum nolimits_{j = 1}^{{m_{i} }} left( {o_{ij} – p_{ij} } right)^{2} }}{{mathop sum nolimits_{i = 1}^{n} mathop sum nolimits_{j = 1}^{{m_{i} }} left( {o_{ij} – overline{o}} right)^{2} }}$$

(5)

where *n* is the number of earthquakes, (m_{i}) is the number of recordings for the *i*th earthquake, (o_{ij}) is the observed value of (log_{10} PGA) for the *i*th earthquake at the *j*th site, (p_{ij}) is the predicted value, and (overline{o}) is the mean of the observed data. We also calculated the total standard deviation ({upsigma }), the between-event standard deviation (tau), and the within-event standard deviation (phi). The total error of the GMPE is decomposed into the between-event and within-event errors, which are zero-mean, independent, normally distributed random variables with standard deviations (tau) and (phi), respectively^{41,42}. The between- and within-events residuals are assumed uncorrelated, so σ can be written as:

$${upsigma } = sqrt {tau^{2} + phi^{2} } .$$

(6)

To estimate values of the standard deviations, first, we calculated the residuals (r_{ij}):

$$r_{ij} = o_{ij} – p_{ij} .$$

(7)

From (r_{ij}) for all data, the total standard deviation ({upsigma }) was obtained. The between-event error for each earthquake can be described as follows^{42,43}:

$$eta_{i} = frac{{tau^{2} mathop sum nolimits_{j = 1}^{{m_{i} }} r_{ij} }}{{m_{i} tau^{2} + phi^{2} }}$$

(8)

where (eta_{i}) is the between-event error for the *i*th earthquake. This equation implies that if there are a large number of recordings from an earthquake, the between-event error can be approximated by the mean residuals for that event:

$$eta_{i} approx frac{{mathop sum nolimits_{j = 1}^{{m_{i} }} r_{ij} }}{{m_{i} }}.$$

(9)

Using Eq. (9), we obtained (eta_{i}) for earthquakes for which (m_{i}) is larger than 100. Then, from (eta_{i}) for the selected earthquakes, we estimated the between-event standard deviation (tau) and obtained the within-event standard deviation (phi) using Eq. (6).