### Preprocessing

Trials with incorrect responses (12.83% of total trials), missing responses (reaction times more than 3 standard deviations above the mean; 1.90% of total trials), or anticipatory responses (reaction times more than 3 standard deviations below the mean; 0% of total trials) were excluded from the data.
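The exclusion rule can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the trial keys `rt` and `correct` are hypothetical, and the sketch assumes that the mean and standard deviation are computed over all trials pooled together, which the text does not specify.

```python
from statistics import mean, stdev


def exclude_trials(trials, n_sd=3.0):
    """Keep correct trials whose reaction time lies within n_sd SDs of the mean.

    `trials` is a list of dicts with hypothetical keys:
      "rt"      -- reaction time (any consistent unit)
      "correct" -- True if the response was correct
    Assumption: mean and SD are computed over all trials pooled together.
    """
    rts = [t["rt"] for t in trials]
    m, sd = mean(rts), stdev(rts)
    lower, upper = m - n_sd * sd, m + n_sd * sd
    # Drop incorrect trials, then trials outside the [lower, upper] band.
    return [t for t in trials if t["correct"] and lower <= t["rt"] <= upper]
```

Computing the bounds per participant instead would only require calling the function once per participant's trial list.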

### Behavioural results

To test our behavioral assumptions, we used generalized linear mixed models (GLMMs). An advantage of GLMMs over more classical analyses of variance is that random effects, such as inter-individual differences in performance, can be added to the statistical models^{30,31}. Moreover, GLMMs handle non-Gaussian data by allowing various distributions to be specified, such as a binomial distribution for the accuracy analysis, and they model the total variance of the data, including the inter-trial variance^{30}.

Our dependent variables were the reaction time and the accuracy of decisions about the musical nuance played by the violinist. To investigate the contribution of each variable and their interactions, we compared different models using Fisher’s F test for the reaction time, and the chi-square difference test on binomial models for the accuracy. Our fixed effects were congruency (congruent vs. incongruent) and instrumental practice (trained vs. untrained). To focus participants’ attention on the violinists’ gestures and reduce potential effects of their appearance, the violinists’ performance was represented only by 13 point-lights attached to their head and major articulation points (neck, shoulders, elbows and wrists), as well as to their violin and bow (see Fig. 2; for sample videos of the stimuli, see Video A and Video B of the supplementary materials section). This control was strengthened by adding another fixed effect: the video display (normal vs. scrambled). In the scrambled condition, videos displayed scrambled versions of the violinists’ performance, created by randomizing the starting positions of the point-lights. In the normal condition, videos were presented as captured. Our random effects were inter-individual performance and violinists’ identity.
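The chi-square difference test compares two nested models through twice the difference in their log-likelihoods. The sketch below illustrates only that final step for models differing by a single parameter; the function name and the log-likelihood inputs are hypothetical, and fitting the binomial GLMMs themselves is outside its scope.

```python
from math import erfc, sqrt


def chi2_difference_test(loglik_reduced, loglik_full):
    """Likelihood-ratio test for nested models differing by ONE parameter.

    Returns (chi2, p). With 1 degree of freedom, the chi-square survival
    function has the closed form erfc(sqrt(x / 2)), so no external
    statistics library is needed.
    """
    # The full model cannot fit worse than the reduced one; clamp tiny
    # negative values that can arise from numerical noise.
    chi2 = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p = erfc(sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical log-likelihoods for a model without and with one extra term.
chi2, p = chi2_difference_test(loglik_reduced=-1002.075, loglik_full=-1000.0)
print(f"chi2(1) = {chi2:.2f}, p = {p:.3f}")
```

For comparisons involving more than one added parameter, the closed form no longer applies and a general chi-square survival function would be needed.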

#### Reaction time

The only significant effect in the reaction time analysis was the main effect of video display, *F*(1, 3,800) = 15.01, *p* < 0.001, $R_{m}^{2}$ < 0.01, $R_{c}^{2}$ = 0.33 (note: $R_{m}^{2}$ represents the variance explained by the fixed factors, while $R_{c}^{2}$ is the variance explained by both fixed and random effects^{32}). Participants were faster to judge the nuance expressed by the violinist when the video was displayed normally than when it was scrambled. The main effects of congruency, *F*(1, 3,800) = 0.76, *p* = 0.38, and instrumental practice, *F*(1, 18) = 0.14, *p* = 0.71, as well as the interaction between them, *F*(1, 3,800) = 0.36, *p* = 0.55, were not significant for the reaction time. Moreover, none of the remaining interactions was significant: video display × congruency, *F*(1, 3,800) = 0.18, *p* = 0.67; video display × instrumental practice, *F*(1, 3,800) = 1.64, *p* = 0.20; and the three-way interaction between video display, congruency and instrumental practice, *F*(1, 3,800) = 0.17, *p* = 0.68.
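For reference, the marginal and conditional coefficients are commonly defined as below. This assumes the widely used variance-partitioning formulation for Gaussian mixed models; whether reference 32 uses exactly this form is an assumption, and the symbols are introduced here for illustration only.

$$
R_{m}^{2} = \frac{\sigma_{f}^{2}}{\sigma_{f}^{2} + \sum_{l} \sigma_{l}^{2} + \sigma_{\varepsilon}^{2}},
\qquad
R_{c}^{2} = \frac{\sigma_{f}^{2} + \sum_{l} \sigma_{l}^{2}}{\sigma_{f}^{2} + \sum_{l} \sigma_{l}^{2} + \sigma_{\varepsilon}^{2}}
$$

Here $\sigma_{f}^{2}$ is the variance of the fixed-effect predictions, $\sigma_{l}^{2}$ is the variance of the $l$-th random effect, and $\sigma_{\varepsilon}^{2}$ is the residual variance. Under this definition, a small marginal value alongside a much larger conditional value indicates that most of the explained variance is carried by the random effects.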

#### Accuracy

Although neither congruency, χ^{2}(1, *N* = 20) = 1.10, *p* = 0.29, nor instrumental practice, χ^{2}(1, *N* = 20) = 0.84, *p* = 0.36, showed a significant main effect, the two-way interaction between these variables was significant, χ^{2}(1, *N* = 20) = 4.15, *p* < 0.05, $R_{m}^{2}$ < 0.01, $R_{c}^{2}$ = 0.19 (Fig. 3A). As shown by simple effects, untrained participants were more accurate in the congruent condition than in the incongruent condition, χ^{2}(1, *N* = 20) = 4.58, *p* < 0.05. The same effect was not significant for trained participants, χ^{2}(1, *N* = 20) = 0.70, *p* = 0.40. Interestingly, the main effect of video display was significant, χ^{2}(1, *N* = 20) = 62.66, *p* < 0.001, $R_{m}^{2}$ = 0.03, $R_{c}^{2}$ = 0.22, as was its interaction with instrumental practice, χ^{2}(1, *N* = 20) = 4.30, *p* < 0.05, $R_{m}^{2}$ = 0.05, $R_{c}^{2}$ = 0.23 (Fig. 3B). Concerning the main effect, participants were more accurate in judging the nuance expressed by the violinist when the word was preceded by a normal rather than a scrambled video. For the interaction, both trained and untrained participants were more accurate in the normal than in the scrambled condition, χ^{2}(1, *N* = 20) = 42.61, *p* < 0.001 and χ^{2}(1, *N* = 20) = 21.17, *p* < 0.001 respectively, but the normal-scrambled difference was significantly larger for trained than for untrained participants, χ^{2}(1, *N* = 20) = 4.48, *p* < 0.05. Neither the interaction between video display and congruency, χ^{2}(1, *N* = 20) = 0.50, *p* = 0.48, nor the three-way interaction between video display, congruency and instrumental practice, χ^{2}(1, *N* = 20) = 0.73, *p* = 0.39, was significant.

To ensure that the differences observed between trained and untrained participants were not due to differences in strategy (or response biases) between groups, we calculated the response bias “c” index. The c index is a statistic used in signal detection theory^{33} that represents response biases by taking into account both “hits” (correctly detecting the nuance when the video and word are congruent) and “false alarms” (responding according to the word when it is incongruent with the video) in the calculation of task performance. The c index was calculated across the normal and scrambled videos combined. Comparison of the c parameter for trained (*M* = −0.10; *SD* = 0.13) and untrained (*M* = −0.05; *SD* = 0.18) participants, using Student’s independent-samples *t*-test, revealed no significant difference between the groups, *t*(21) = 0.66, *p* = 0.52, *d* = 0.32.
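In signal detection theory, the criterion c is conventionally computed from the z-transformed hit and false-alarm rates. The sketch below assumes this standard formula; the function name is hypothetical, and the text does not state whether a correction was applied to rates of exactly 0 or 1.

```python
from statistics import NormalDist


def criterion_c(hit_rate, false_alarm_rate):
    """Standard signal-detection criterion: c = -(z(H) + z(FA)) / 2.

    Both rates must lie strictly between 0 and 1; NormalDist.inv_cdf
    raises an error at 0 or 1, so extreme rates would need a correction
    chosen by the analyst (not specified in the text).
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return -0.5 * (z(hit_rate) + z(false_alarm_rate))


# Hypothetical rates for one participant.
print(criterion_c(hit_rate=0.90, false_alarm_rate=0.30))
```

A value of zero indicates no bias, and negative values indicate a liberal criterion, which in this task corresponds to a tendency to respond in line with the word.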

### ERP results

We focused our ERP analyses on the interval from 200 ms before to 1,000 ms after the presentation of the words *forte* and *piano*. Average ERPs were computed for the congruency and instrumental practice conditions, and the mean amplitude was extracted from four 3-electrode arrays (Fig. 4A): frontal (F1, Fz, F2), central (C1, Cz, C2), parietal (P1, Pz, P2) and occipital (O1, Oz, O2). We grouped these electrodes based on previous studies of the N400 component showing that its effect is maximal at centro-parietal electrode sites^{34,35}.

A mixed ANOVA was conducted on the mean amplitude of the ERPs from 350 to 400 ms after stimulus onset for the N400, with the factors congruency (congruent vs. incongruent, within-participants) × instrumental practice (trained vs. untrained, between-participants) × electrode site (frontal vs. central vs. parietal vs. occipital, within-participants) × video display (normal vs. scrambled, within-participants). From 210 to 260 ms after stimulus onset, a deflection possibly reflecting the P200 component seemed to differentiate trained from untrained participants. To test this component statistically, we conducted another mixed ANOVA with the same factors on this time window. Moreover, to test the relation between the P200 and N400 event-related potentials, we performed a Pearson correlation analysis on the P200 at the occipital site and the N400 at the central site, independently of the congruency and video display conditions. The dependent variable was the mean amplitude.
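The mean-amplitude measure can be illustrated with a short sketch that averages an ERP over a time window and over the electrodes of one zone. The data layout (a dict mapping channel names to amplitude lists aligned with a time axis in milliseconds) and the inclusive window bounds are assumptions made for this example.

```python
def mean_amplitude(erp, times_ms, window_ms, channels):
    """Average amplitude over a time window and a set of channels.

    erp       -- dict: channel name -> list of amplitudes (e.g. in microvolts),
                 one value per entry of times_ms
    times_ms  -- list of sample times relative to word onset, in ms
    window_ms -- (start, end) tuple, bounds treated as inclusive (assumption)
    channels  -- channel names forming one zone, e.g. ["C1", "Cz", "C2"]
    """
    start, end = window_ms
    idx = [i for i, t in enumerate(times_ms) if start <= t <= end]
    if not idx:
        raise ValueError("no samples fall inside the requested window")
    values = [erp[ch][i] for ch in channels for i in idx]
    return sum(values) / len(values)


# Windows and zones taken from the text; the ERP itself would come from
# the averaged epochs (hypothetical variables shown in the comment):
# n400_central = mean_amplitude(erp, times_ms, (350, 400), ["C1", "Cz", "C2"])
# p200_occipital = mean_amplitude(erp, times_ms, (210, 260), ["O1", "Oz", "O2"])
```

One such value per participant, condition and zone would then form the dependent variable entered into the ANOVA.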

#### N400

A summary of N400 results can be found in Table 1. We found a significant main effect of the electrode site. A contrast analysis showed that N400 mean amplitude was significantly more negative in the central electrode site than in the other electrode sites, *t*(18) = −4.56, *p* < 0.001. N400 mean amplitude was significantly more positive in the occipital electrode site than in the other electrode sites, *t*(18) = 2.15, *p* < 0.05. N400 mean amplitude did not differ significantly between the frontal and parietal electrode sites, *t*(18) = 0.03, *p* = 0.98. Neither the main effect of congruency nor that of instrumental practice was significant. However, the video display showed a significant main effect: the N400 was more negative when the word was preceded by a normal than by a scrambled video.

Concerning interactions, two significant effects were obtained for the N400: an interaction between congruency and electrode site, and an interaction between instrumental practice and electrode site. For the congruency-electrode site interaction, simple effects showed that in the central electrode site, the mean amplitude of the N400 was more negative for the incongruent condition, *t*(18) = 2.35, *p* < 0.05 (Fig. 4B). In the occipital electrode site, the mean amplitude of the N400 was more negative for the congruent condition, *t*(18) = −2.19, *p* < 0.05 (Fig. 4C). No significant differences between the two congruency conditions were found for the frontal, *t*(18) = 1.74, *p* = 0.10, and parietal, *t*(18) = 0.02, *p* = 1, electrode sites (Figure A of the supplementary materials section).

For the instrumental practice-electrode site interaction, N400 mean amplitude was more negative in the central electrode site for untrained than trained participants, *t*(18) = −2.43, *p* < 0.05 (Fig. 4C). No significant differences between trained and untrained participants were found for the frontal, *t*(18) = −1.30, *p* = 0.21, parietal, *t*(18) = 0.10, *p* = 0.92, and occipital, *t*(18) = 1.88, *p* = 0.08, electrode sites.

#### P200

A summary of P200 results can be found in Table 2. Only one main effect was significant for the P200: the electrode site. A contrast analysis showed that the early component mean amplitude was significantly more negative in the occipital electrode site than in the other electrode sites, *t*(18) = −4.97, *p* < 0.001.

The only significant interaction observed for the P200 was the one between electrode site and instrumental practice. Simple effects showed that the early component mean amplitude was significantly more negative in the occipital electrode site for trained than untrained participants, *t*(18) = 2.14, *p* < 0.05 (Fig. 4B). There were no significant differences between the instrumental practice groups for the frontal, *t*(18) = −1.63, *p* = 0.12, central, *t*(18) = −1.80, *p* = 0.09, and parietal, *t*(18) = 0.24, *p* = 0.82, electrode sites (Figure A of the supplementary materials section).

#### P200 vs. N400 correlation

The Pearson correlation analysis testing the relation between the amplitude of the P200 at the occipital site and the amplitude of the N400 at the central site revealed a significant negative correlation, *r*(18) = −0.66, *p* < 0.01 (Figure B of the supplementary materials section).
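A Pearson coefficient and its associated *t* statistic (with *n* − 2 degrees of freedom) can be computed as in the sketch below. The function names are hypothetical, and the inputs would be one mean amplitude per participant for each component.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_to_t(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1.0 - r ** 2))


# With 18 degrees of freedom, the sample size is n = 20.
print(r_to_t(-0.66, 20))
```

For *r* = −0.66 and *n* = 20, this gives *t* ≈ −3.73, which is consistent with the reported significance level.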