A counterbalanced within-subject design was used, and each subject was randomly assigned to a set of tasks and conditions. The control condition was never administered first, but otherwise the order of conditions was individualized and pseudorandomized. No criteria for data inclusion/exclusion were set, and there was no definition of, or policy for, outliers. All data were included in the study, and outliers were not detected. Each test trial, consisting of up to 10 attempts, was performed once. All testing was carried out by one experimenter who was known to the subjects.
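The ordering constraint described above can be illustrated with a minimal sketch. It is not the procedure actually used: the subject and task-set labels, the seed, and the round-robin assignment of admissible orders are all assumptions made for illustration.

```python
import itertools
import random

CONDITIONS = ("control", "no-conflict", "conflict")
TASK_SETS = ("set_A", "set_B", "set_C")  # hypothetical labels for the three task sets


def admissible_orders():
    """All orders of the three conditions in which control is not first."""
    return [p for p in itertools.permutations(CONDITIONS) if p[0] != "control"]


def assign_schedule(subjects, seed=0):
    """Give each subject a condition order and a pairing of task sets.

    Assumption: admissible orders are shuffled once and handed out
    round-robin, so that they are used as evenly as possible.
    """
    rng = random.Random(seed)
    orders = admissible_orders()
    rng.shuffle(orders)
    schedule = {}
    for index, subject in enumerate(subjects):
        order = orders[index % len(orders)]
        sets = rng.sample(TASK_SETS, k=len(TASK_SETS))  # one task set per condition
        schedule[subject] = list(zip(order, sets))
    return schedule


if __name__ == "__main__":
    for subject, plan in assign_schedule(["S1", "S2", "S3", "S4", "S5", "S6"]).items():
        print(subject, plan)
```

With three conditions there are four admissible orders, so six subjects cannot be spread perfectly evenly across them; the sketch only keeps the imbalance as small as possible.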
To set up a competition between two memories, we introduced a test task that partially overlapped with each of two training tasks. The test task overlapped functionally with one training task (henceforth: the functionally overlapping task, the FOT), as they required the same tool and a similar motor pattern for solution (for details see “Apparatus”). The test task overlapped perceptually with the other training task (henceforth: the perceptually overlapping task, the POT). Each task involved opening a puzzle box with appropriate tools and retrieving food rewards from behind a transparent door. Whereas the FOT and the test could be opened with the same tool (henceforth: the right tool), the POT could also be opened with another tool, which, however, did not allow for solving the FOT and the test (henceforth: the wrong tool). The right and the wrong tool were sticks identical in material and dimensions, but only the right tool had two functional tips that allowed for opening the test task. Simply put, only the right tool was relevant for the solution of the problematic situation embodied by the test. The right and wrong tools were accompanied by a third tool that served as a distractor. This useless tool was a thin twig or a string of appropriate length but not rigid enough to be useful [42,43]. The wrong tool was not available in the training on the FOT for two reasons: (1) because of the subjects’ tendency toward tool destruction (a larger number of available tools gave an opportunity to destroy a larger number of tools and trade more tool pieces for food); (2) to maximize the perceptual overlap between the training on the POT and the test, as these problems were always accompanied by all three tools: the right, the wrong and the useless one.
Solving the tasks required using the right tool on those components of the puzzle box that were relevant for the solution; that is, that were necessary to interact with in order to open the box. Other components were defined as irrelevant for the solution. To measure the subject’s behaviour in the test, we quantified interactions between the right/wrong tool and the relevant/irrelevant components. In particular, we were interested in the interactions between the right tool and the relevant components (henceforth: the relevant interactions) because such interactions would indicate attending to the relevant aspects of the solution and the problem, respectively. Such attending, in turn, would indicate whether and to what extent the apes used the overlapping past situations to solve the problem at hand. As attending to the irrelevant aspects of the solution and the problem would indicate that the apes did not benefit from the past situations, we were also interested in the interactions between the wrong tool and the irrelevant components (henceforth: the irrelevant interactions).
To investigate whether any training was prerequisite for solving the test, we introduced (1) a control condition, in which the subject did not receive any trainings. Further, to investigate whether the apes would benefit from a relevant memory in the test, we introduced (2) a no-conflict condition, in which the subject had only the training on the FOT and no conflicting trainings ensued. To investigate whether the subjects would suffer from retrieval competition, we introduced (3) a conflict condition, in which the subject had two trainings, on the FOT and the POT, whose memories would potentially compete for retrieval in the test (see Fig. 1).
All conditions (control, no-conflict, and conflict) began with a baseline trial to assess whether subjects might spontaneously open the test apparatus, prior to training. If their baseline attempts were unsuccessful, subjects advanced to the FOT (in no-conflict and conflict conditions) and POT (only in conflict condition) training apparatuses. Baseline, FOT, and POT trials were conducted in succession, within the same session. The test session was conducted 24 h after the training session (Fig. 1). The baselines included a single trial, contrary to the tests that included multiple trials. For all successful subjects, the tests were slightly longer or much shorter than the baselines (see Table S1).
Each subject completed three sets of puzzle boxes, one in each condition (control, no-conflict, conflict). Each set of puzzle boxes was accompanied by a unique set of tools and had a unique configuration of the relevant/irrelevant components (see Fig. 2 and Fig. S1 online).
Six great apes (1 male orangutan, Pongo abelii, 1 male and 4 female chimpanzees, Pan troglodytes) participated. They lived with conspecifics at Lund University Primate Research Station Furuvik (Sweden), and had previous experimental experience. Ages varied between 9 and 39 years (see Table 1). They were never food deprived and participated in the tasks voluntarily. Data from all subjects were included in the analyses. The research was approved by the Regional Ethical Review Board at Uppsala District Court (Sweden), permit no. C110/15 and was performed in accordance with relevant guidelines and regulations.
Power estimation for binomial outcomes was carried out before testing, based on accuracy mean, inter-individual variation, number of animals, number of trials, number of simulations, significance level and minimum required power (see Supplementary Information 2). However, sample size was predetermined: there were six great apes that could have been tested in this scheme. Although two other apes were housed at the zoo at the time of testing, one, a female orangutan, was tending to a newborn, and the other, a male chimpanzee, was not trained in bartering.
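A simulation-based power estimate of this kind can be sketched as follows. This is not the script from Supplementary Information 2: the logit-normal model of inter-individual variation, the chance level of 0.5, the pooled one-sided binomial test, and all parameter values are assumptions chosen only to show how the listed inputs could enter such a simulation.

```python
import math
import random


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_power(mean_accuracy, sd_logit, n_animals, n_trials,
                   n_sims=2000, alpha=0.05, chance=0.5, seed=1):
    """Share of simulated experiments in which pooled successes exceed chance.

    Assumptions: each animal's success probability is drawn on the logit
    scale around mean_accuracy, and significance is judged with a pooled
    one-sided binomial test that ignores clustering within animals.
    """
    rng = random.Random(seed)
    mu = math.log(mean_accuracy / (1 - mean_accuracy))
    total = n_animals * n_trials
    significant = 0
    for _ in range(n_sims):
        successes = 0
        for _ in range(n_animals):
            p = logistic(rng.gauss(mu, sd_logit))
            successes += sum(rng.random() < p for _ in range(n_trials))
        if binom_upper_tail(successes, total, chance) < alpha:
            significant += 1
    return significant / n_sims


if __name__ == "__main__":
    power = simulate_power(mean_accuracy=0.8, sd_logit=0.5, n_animals=6, n_trials=3)
    print(f"estimated power: {power:.2f}")
```

The estimate would then be compared against the minimum required power; because the sample size was fixed at six animals, such a simulation serves to describe the design rather than to choose the sample size.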
Although the apes had some varying previous testing experience, none of them had participated in a setup that required tool use from behind the cage bars. Further, the apes used sticks as tools on a daily basis within the enclosures, and only two of them (Selma and Naong) were observed to modify the tips.
Three analogous sets of tasks were devised, each with three puzzle boxes and three tools (see Fig. 2, Fig. S1). All boxes were made from wood with a Plexiglas door making the food reward (a grape or a marshmallow) visible. To open a puzzle box, the subject first had to choose a tool and thereafter perform actions on the relevant elements of the box with a tip of the tool (for an example see Fig. S2). The solution always required three actions. The first action was one of the following: (1) inserting the tip into a gap, (2) hooking the tip behind a surface, (3) casting the tip onto a protruding hook. The second action required stabilizing the hand in a fixed position, and the third action involved either pulling the tool or pushing it to the side/upwards.
The FOT and the test always required the use of the right tool, and the same first and third action, but a different second action (a different position of hand). The POT could have been solved both with the right and the wrong tool, and always required different first, second and third actions than the FOT and the test task. All tools were made of soft wood to avoid injury or damage. To prevent flipping, the boxes were fastened onto a sliding table attached to cage bars which could be moved back and forth by the experimenter. The apes could choose and use tools exclusively from behind the grid patterned bars (4.5 × 4.5 cm) that allowed extending only single digits toward the apparatus.
At least two aspects varied between the sets: (1) the degree of perceptual overlap, and (2) difficulty of required motor patterns. The perceptual overlap between the POT and the test was maximized through identical shapes and dimensions (height, width, length) and a similar distribution of wood and Plexiglas on the front side of the puzzle boxes. The degree of the perceptual overlap varied between the sets, but, as there was only an effect of condition, and not task set on the test score, the degree of the overlap most likely had little effect on the subjects’ performance. However, in the training on the FOT, the holeset required the largest number of demonstrations on the experimenter’s part and the largest number of interactions on the subject’s part before mastering the task (see Table S2).
Baselines and tests
The subjects always had unlimited access to all tools available in a given trial, as the tools always lay beside the apparatus within the subjects’ reach. As all apes were proficient at bartering, each tool was retrieved by the end of each trial. Due to tool material choice, at the beginning of trials all subjects but two destroyed the tools, either by biting into the top of the tool or splitting it into smaller wooden pieces, which they subsequently attempted to trade for food items. To avoid impairment of their bartering skills and reinforcement of tool destruction, the experimenter always moved the apparatus away from the subject when it inserted the tool into its mouth. Only if the subject stopped this behaviour would the apparatus be moved back. In the test, if the subject destroyed a tool in any way twice in a row, the trial was terminated and qualified as failed.
At baseline, the subjects had a limited time for interaction with the test task, and their first response (taking the tool(s), using them on the apparatus, destroying them, or returning them) was recorded. Only a single response was recorded to avoid a prolonged negative exposure to the task (that is, one not ending with the food item’s release). If the subject carried out a correct action with the functional tool and released the food item at this point, it was excluded from further testing on a given apparatus. This was the case with two apes (Selma and Santino) in the control condition.
In the test, a number of up to 10 attempts, defined as (1) laying the tools out on the apparatus’ tray and moving the apparatus toward the subjects, (2) holding the tray, (3) removing the tray and retrieving the tools, was set as a maximum allowed. The apes could not swap tools within a single attempt; if they did, this was counted as another attempt. For details on individual performance, see Table S1. Otherwise, the apes had unlimited time for interaction with the task, unless they (1) destroyed the tool twice in a row, (2) left the apparatus, or (3) expressed behavioural signs of frustration (e.g., spitting, crossing arms on the chest). These behaviours were mostly evinced during test trials in the control condition, and always led to trial termination (for details see Table S3 online).
Although pre-defined duration of baselines and tests could have been specified, this approach would do less justice to the subjects’ performance than an exploratory approach. Therefore, only the first attempt (defined as above) was recorded at baseline, and the subjects could have interacted with all available tools and the apparatus unless they evinced the above-mentioned behaviours in the test (see Table S3 online). By doing so, unwanted behaviours (tool destruction, leaving the apparatus) were not reinforced, and frustration of the subjects as well as a lack of cooperation with the experimenter in future encounters were minimized.
During the trainings on the FOT and the POT, the subjects learned how to use the right tool to open the given puzzle box. The training always started with a demonstration by the experimenter, in which she used the right tool to release the food item from the apparatus. The subject always received the released food item and, immediately afterwards, could interact with the apparatus. If the subject did not succeed despite repeated interactions, the experimenter traded the tool for a food item and demonstrated the correct solution again. This sequence was repeated until the subject was able to execute a correct response without the experimenter’s help. If the subject was not able to execute a correct response twice in a row, the experimenter demonstrated the solution again, and this procedure was repeated until the subject was able to release the food item five times in a row. In all cases, once the subject succeeded twice in a row on its own, it did not need further demonstrations to reach the learning criterion. With some subjects, the experimenter was also able to guide the use of the tool by holding the tip of the tool and executing the first action of the motor pattern (hooking, casting, or inserting), which was then completed by the subject with the second (adjusting hand position) and the third (pulling or pushing) actions. Training times were not limited; that is, the subjects could exchange the tools and attempt to solve the FOT and the POT as many times as needed until reaching the learning criterion (see Table S2). However, once the subject reached the learning criterion, the training was terminated.
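The learning criterion (five releases of the food item in a row) can be expressed as a simple streak check. The sketch below is illustrative only: the function name and the representation of a training session as a list of booleans, one per attempt, are assumptions, with True taken to mean that the subject released the food item on its own.

```python
def attempts_to_criterion(outcomes, required_streak=5):
    """Return the number of attempts needed to reach the criterion, or None.

    outcomes: iterable of booleans, one per training attempt, in order.
    The criterion is met at the first attempt that completes
    required_streak consecutive successes.
    """
    streak = 0
    for index, success in enumerate(outcomes):
        streak = streak + 1 if success else 0  # any failure resets the streak
        if streak >= required_streak:
            return index + 1
    return None


if __name__ == "__main__":
    session = [True, True, False, True, True, True, True, True]
    print(attempts_to_criterion(session))  # criterion completed on the 8th attempt
```

Since training was terminated once the criterion was reached, any attempts listed after the returned index would not occur in practice.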
All trials were video-recorded. For each video, all interactions with the apparatus executed by the subjects were coded frame-by-frame in ELAN 4.9.3. An interaction was defined as a time interval between an onset and an offset of physical contact between a tool held by the subject and an element of the puzzle box used in a given trial. As each interaction involved a certain tool and a certain element of the box, two aspects of the interaction were always determined: (1) the tool used, right (F) or wrong (NF), (2) the component of the apparatus touched, relevant (rel) or irrelevant (irrel).
Two raters coded the videos: one rater coded 100%, and the second rater coded 17.4% of the videos. The second rater downloaded the written instructions and the videos from an online resource without face-to-face contact with the first rater to ensure his independence. Time-unit kappa [44] was subsequently computed to estimate inter-observer agreement, understood as the accuracy of the overlap between the interval patterns generated by the raters for the same recording. Each of the recordings was divided into consecutive one-second intervals, and for each interval a 0–1 response was determined. Occurrence of coding on the rater’s part was counted as 1, and its lack was counted as 0. The 0–1 responses for each interval were subsequently assembled into a rater-specific pattern, and finally, an inter-rater kappa coefficient was calculated between these two patterns. The agreement was high and equaled 0.995. The analysis was conducted in R (v.3.5.1, the R Foundation for Statistical Computing: https://www.R-project.org). Significance level was set at 0.05.
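The agreement computation described above can be sketched as follows. This is an illustration in Python rather than the R analysis actually used; it assumes that a one-second interval counts as coded whenever any coded interaction overlaps it, and it uses plain Cohen's kappa on the two 0–1 patterns, which may differ in detail from the time-unit kappa of the cited reference.

```python
import math


def binarize(intervals, duration_s):
    """Turn coded (onset, offset) intervals in seconds into a 0/1 pattern.

    Bin b covers [b, b + 1) seconds and is set to 1 if any interval touches it.
    """
    n_bins = math.ceil(duration_s)
    pattern = [0] * n_bins
    for onset, offset in intervals:
        first = max(0, math.floor(onset))
        last = min(n_bins - 1, math.ceil(offset) - 1)
        for b in range(first, last + 1):
            pattern[b] = 1
    return pattern


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long binary sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("patterns must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        return math.nan  # kappa is undefined when both patterns are constant
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    rater_1 = binarize([(0.5, 2.2), (5.0, 7.5)], duration_s=10)
    rater_2 = binarize([(0.7, 2.0), (5.2, 7.9)], duration_s=10)
    print(cohen_kappa(rater_1, rater_2))
```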
Coding was terminated either with the offset of the last recorded interaction or with the offset of the first interaction that led to the food item’s release. For each recording, several variables were computed from the coded intervals. To obtain these variables, certain interaction times were divided either by the overall time between the onset of the first interaction and the offset of the last interaction, or by one half or one fourth of this overall time. For a full list of variables see Table S4.
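As an illustration of how such variables could be derived, the sketch below computes, for each tool/component combination, the share of the overall time (onset of the first interaction to offset of the last) spent in that kind of interaction. The tuple layout and function name are assumptions; the labels F/NF and rel/irrel follow the coding scheme above, and the variables normalized by one half or one fourth of the overall time (Table S4) are not reproduced here.

```python
from collections import defaultdict


def interaction_proportions(interactions):
    """Share of overall time per (tool, component) category.

    interactions: list of (onset_s, offset_s, tool, component) tuples,
    with tool in {"F", "NF"} and component in {"rel", "irrel"}.
    """
    if not interactions:
        return {}
    start = min(item[0] for item in interactions)
    end = max(item[1] for item in interactions)
    overall = end - start
    if overall <= 0:
        raise ValueError("overall time must be positive")
    totals = defaultdict(float)
    for onset, offset, tool, component in interactions:
        totals[(tool, component)] += offset - onset
    return {category: duration / overall for category, duration in totals.items()}


if __name__ == "__main__":
    coded = [
        (0.0, 2.0, "F", "rel"),
        (3.0, 4.0, "NF", "irrel"),
        (6.0, 10.0, "F", "rel"),
    ]
    print(interaction_proportions(coded))
```

Because the overall time also includes pauses between interactions, the proportions need not sum to one.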
A Generalized Linear Mixed Model with Bernoulli distribution was fit to determine the fixed effect of condition on the score (pass/fail; brms package [45,46]), controlling for the random effect of subject ID (control: n = 4, no-conflict: n = 6, conflict: n = 6). A series of Generalized Linear Mixed Models with Dirichlet distribution was used to estimate the fixed effect of condition on proportions of interaction time for specific tools and components of the apparatus in the test (brms package). The Dirichlet analyses were carried out only for those subjects that succeeded in the test. Note that, in the results, the range of the effect sizes equaled [−200, 200] because two differences in %, each between [−100, 100], were compared in the analysis.
A Generalized Linear Mixed Model with Bernoulli distribution was fit to determine the fixed effect of task on the score (pass/fail; brms package), controlling for the random effect of subject ID. Times spent on particular trainings (FOT only, or FOT and POT), from the beginning of the first to the end of the last interaction, were used to determine whether such time influenced the success in the test. The duration of the test equaled the overall time of all interactions executed in the test. Generalized Linear Mixed Models with Gamma distribution were used to estimate the effect of times spent on particular trainings and the effect of condition and task on the duration of the test. Further, Generalized Linear Mixed Models with Poisson distribution were used to estimate the effect of times spent on particular trainings on the duration of the test. All data were plotted using the ggplot2 [47], soundgen [48] and reshape [49] packages. For details see the R script in the Supplementary Information 2.