# Context-aware experience sampling reveals the scale of variation in affective experience

#### ByAUTHOR

Jul 27, 2020

All experimental protocols described below were approved by the Northeastern University Institutional Review Board. These methods were carried out in accordance with the relevant guidelines and regulations for research with human subjects.

### Participants

Sixty-seven participants ranging in age from 18–36 years (55% female; 38.8% White, 3.0% Black, 29.8% Asian, 28.4% other; Mage = 22.8 years, SDage = 4.4 years) were recruited from the greater Boston area through posted advertisements, as well as Northeastern University classrooms and online portals. Eligible participants were non-smoking, fluent English-speakers, and excluded if they had a history of cardiovascular illness or stroke, chronic medical conditions, mental illness, asthma, skin allergies, or very sensitive skin. Eligible participants also confirmed they were not taking any medications known to influence physiological arousal including medications for ADHD, insomnia, anxiety, hypertension, rheumatoid arthritis, epilepsy/seizures, cold/flu, or fever/allergies. Informed consent was obtained from all participants before beginning the study. Participants received $490 as compensation for completing all parts of the study, plus up to an additional$55 in compliance and task incentives, as detailed on page 4 of the supplementary information.

Of the 67 recruited participants, six withdrew and an additional nine were dismissed due to poor compliance. A total of 52 participants completed the full protocol (56% female; 38.5% White, 1.9% Black, 38.5% Asian, 21.2% other; Mage = 22.5 years, SDage = 4.4 years). Six of these participants were excluded from the main analyses due to a lack of sufficient usable physiological data, bringing the final sample size to 46 (50% female; 43.5% White, 0% Black, 41.3% Asian, 15.2% other; Mage = 22.4 years, SDage = 4.4 years).

### Procedure Overview

Each participant completed approximately 14 days (M = 14.4, SD = 0.6) of context-aware experience sampling distributed across a three- to four-week period (M = 24.9 days, SD = 5.5 days). On each day of experience sampling, participants came into the lab and were instrumented for peripheral physiological recordings. Participants could not begin the experience sampling portion of the study without functioning ambulatory equipment, and so did not complete the daily protocol if they did not attend the morning session to be instrumented for the recordings. In the event a participant was unable to make a scheduled session or the equipment was not functioning properly during instrumentation, they were rescheduled to complete the day of sampling on another day. Participants generally started each experience sampling day between 8 and 9 am, although this varied between 7:30am and 2:30 pm according to participants’ schedules. Participants were provided an Android smartphone with a custom application, MESA (MindWare Technologies LTD, Westerville, OH), which was used to determine when a heart rate response of sufficient duration and amplitude occurred which then triggered an experience sampling prompt. Participants were instructed to continue physiological recordings for eight hours each day, after which they were able to remove and recharge all equipment. Upon completing experience sampling each day, participants automatically received an end-of-day survey via SurveyMonkey (San Mateo, CA), which they used to provide additional details about the prompts they completed throughout the day. Before and after the two-week experience sampling protocol, participants also completed two in-lab sessions. In each session, participants completed tasks and questionnaires that are not reported here (see page 5 of supplementary information and supplementary Table S1 for an overview).

#### Physiological measurement

All ambulatory peripheral physiological measures were recorded at 500 Hz on a mobile impedance cardiograph from MindWare Technologies LTD (Model # 50-2,303-02, Westerville, OH), which participants wore clipped onto their clothing on the hip.

ECG and ICG were obtained using pre-gelled ConMed (Westborough, MA) Cleartrace Ag/AgCl sensors, connected via wires to the cardiograph. Sensor sites were cleaned with alcohol and abraded lightly with gauze. ECG was obtained using a modified lead II configuration, with recording electrodes placed on the distal right collarbone and an inferior left rib, respectively, and a reference electrode placed on an inferior right rib. The ECG signal was acquired using a low cutoff of 0.5 Hz and a high cutoff of 45 Hz.

ICG was obtained using a four-spot electrode configuration90. Two inner recording electrodes were placed on the front of the torso: one at the base of the neck at the top of the sternum, and a second at the bottom of the sternum over the xiphisternal junction. Two outer source electrodes were placed on the back along the midline approximately 4 cm above and below the inner recording electrodes, respectively. The source electrodes passed a 4 mA, 100 kHz alternating current across the thorax. The distance between the inner recording electrodes (cm) was measured for each day of experience sampling. Basal impedance (Z0) was acquired using a low cutoff of 10 Hz. The first derivative, dZ/dt, was acquired using a low cutoff of 0.5 Hz and a high cutoff of 45 Hz.

Electrodermal activity (EDA) was also recorded over an unobtrusive neck placement, but data were not used in the present analysis due to difficulties in detecting and removing artifacts, as well as generally low mean skin conductance levels across participants.

The mobile impedance cardiograph collected continuous three-axis accelerometry data used to assess movement. Additionally, participants wore two inertial measurement units (IMUs) purchased from LP-Research (Minato-ky, Tokyo, Japan) to derive measures of posture and changes in posture. One IMU was placed medially on the sternum beneath the top inner impedance recording electrode and affixed to the skin using a double-sided adhesive patch. The other IMU was placed on the front of the thigh using either a cloth holder attached to the clothing, or a second adhesive patch affixed to the skin. Participants did not remove sensors until the end of each experience sampling day, unless instructed by the experimenters (e.g., due to synchronizing issues).

#### Context-aware experience sampling

Peripheral physiological data and accelerometric data were recorded continuously throughout the 8-h sampling period and communicated via Bluetooth to a Motorola Moto G4 smartphone. The smartphone application, MESA, processed the continuous ECG and accelerometer data in real time, and initiated an experience sampling prompt anytime a substantial, sustained change in heart period was detected in the absence of movement or posture change, with an imposed minimum interval of five minutes between prompts. A substantial, sustained change in heart period was operationalized on the first day of sampling as occurring when the interbeat interval (IBI) changed by more than ± 167 ms over an eight-second period (at a typical resting heart rate of 60 bpm or IBI of 1,000 ms, this is equivalent to a decrease of about 9 bpm or an increase of about 12 bpm). On subsequent days, this IBI parameter was manually adjusted up or down to ensure each participant received approximately 20 prompts per day. The average adjustment was ± 26.17 ms (SD = 25.91 ms), with the final parameter setting ranging between ± 65 ms and ± 266 ms across participants. For background on threshold selection and more information on threshold adjustment, see page 4 of the supplementary information.

Movement was determined from the continuous accelerometer data from the mobile impedance cardiograph. Minimal movement was operationalized as any time none of the three accelerometry channels (alone or in aggregate) exceeded a threshold of 10 cm/s2 within the preceding 30 s. Posture (standing, sitting, reclining) was determined by comparing the relative orientation of the two IMUs on participants’ torso and thigh using their continuous accelerometry data. Absence of posture change was operationalized as any time when the relative orientation of the two IMUs did not change within the preceding 30 s.

On average, participants received 21.70 (SD = 6.90) prompts per day. We observed that prompts were relatively evenly distributed throughout the day within experience sampling days. On average, participants completed 34% of prompts in the morning (before 12 pm), 52% of prompts in the afternoon (12–5 pm), and 14% of prompts in the evening (after 5 pm). This makes sense given the typical start times for participants in our sample. Participants also received on average two ‘random’ prompts per experience sampling day, which occurred in the absence of movement or posture change, but which were not contingent on a change in IBI. Random prompts were spread throughout the experience sampling day, such that they could only receive one in the first four hours and one in the second four hours of experience sampling. Participants were informed that some experience sampling prompts would be generated randomly while others would be generated based on changes in their cardiac activity. By conveying this information, we were able to instruct participants to avoid responding to prompts following specific physiological events (e.g., coughing, sneezing) and minimize the extent to which they paid special attention to their cardiac activity.

To remain in the study, participants were required to complete at least three prompts each day. In addition, as detailed on page 4 of the supplementary information, for the purposes of incentivizing participation and limiting attrition, the experience sampling protocol was broken into three pay periods (days 1–5, 6–10, and 11–14). Participants were required to respond to an average of at least six prompts per day during each pay period to remain in the study, and received a bonus payment for each pay period where they completed an average of eight prompts per day. The average number of completed prompts was 8.80 per day (SD = 1.22), which is in line with previous experience sampling studies that have asked participants to complete 10 prompts per day91,92.

At each sampling event, participants were prompted to respond to a series of questions presented in the MESA application. Participants first provided a brief free-text description of what was happening at the time they received the prompt. Next, participants rated their current valence and arousal, each on a 100-point continuous slider scale ranging from -50 (very unpleasant or very deactivated) to + 50 (very pleasant or very activated). Participants then self-generated words to label their current affective experience. Specifically, participants were asked to “list any emotion(s) you were feeling when you received the prompt”. Participants were able to provide as many words as they felt necessary to describe their affective experience but were required to input at least one word. For each self-generated word, participants were asked to provide an intensity rating on a five-point scale: “not at all” (1), “a little” (2), “moderately” (3), “a lot” (4), “very much” (5). We asked participants to self-generate emotion words as this allowed us to capture naturally-occurring variation in how participants categorized their affective experience, as opposed to having them select words from an experimenter-stipulated list, which would necessarily constrain variation. Participants also responded to additional questions that are not included in the present report (see page 4 of the supplementary information for details).

At the end of each experience sampling day, participants received a modified day reconstruction survey93, in which they were presented with their self-generated brief description and social context from each prompt they completed during the day. For each prompt, participants were asked to provide additional details about the event, its social context, and accompanying affective features. Participants were requested to select three sampling events for which they provided a longer, more detailed description (> 200 words). Full details of the end-of-day survey are provided on pages 5–6 of the supplementary information. Data from these surveys are not reported here.

#### Physiological signal processing and feature extraction

Peripheral physiological signals were processed using an in-house pipeline coded in Python to accommodate the volume of physiological data collected (400 + hours), as well as the variability in signal morphologies and artifacts produced during long-term ambulatory monitoring. Artifacts were identified using a series of quality checks, detailed for each signal below. Data affected by artifacts were excluded from analysis (i.e., artifacts were not corrected). This resulted in an average event-related data loss of 8.18% per day (SD = 3.35%) across ECG (5.79% average data loss; SD = 3.22%) and ICG (2.56% average daily data loss; SD = 2.03%) signals. The research team, including two trained psychophysiologists (J.B.W., K.S.Q.), met regularly throughout development of the processing pipeline to visually examine sample output and provide feedback on quality checks.

#### Electrocardiogram (ECG)

The ECG signal was processed following previous work94. First, the raw signal was passed through an elliptic bandpass filter to remove baseline and high frequency noise without affecting the waveform. Initial quality checks were then performed for each beat, checking for overall waveform shape, and acceptable minimum, maximum, and minimum-to-maximum values94. The default parameters, originally selected based on data collected in laboratory settings, were found to be too aggressive for this ambulatory data set, so we relaxed these thresholds based on manual review of the results by experts (J.B.W., K.S.Q.) for a random subset of the data. The parameters we implemented are reported in supplementary Table S2, along with the original recommendations94.

R-peak detection for ECG was performed using established methods95 and implemented using the BioSPPy package in Python96. Interbeat interval (IBI) was then derived as the average R-R interval. Additional quality checks (supplementary Table S2) were performed on each IBI series to ensure that values were within acceptable ranges (300–2000 ms), and that expected beat-to-beat differences were consistent with normal beats and unlikely to be artifacts (following established benchmarks97). ECG data failing any quality check were excluded from analysis.

Respiratory sinus arrhythmia (RSA) was derived from the IBI series. RSA reflects high-frequency variability in IBI which occurs at the respiratory frequency and is often used as an estimate of parasympathetic influence on the heart57. RSA calculations were coded to mimic the processing steps of standard heart rate variability (HRV) analysis software (MindWare Technologies LTD, Westerville, Ohio), including: cubic interpolation of beat-to-beat IBI, detrending to minimize non-stationarity, tapering using a Hamming window, and lastly, fast Fourier transformation (FFT). RSA was calculated as the log of the area under the power spectrogram that lies between 0.12 and 0.4 Hz.

#### Impedance cardiogram (ICG)

ICG feature detection was performed using an in-house implementation based on methods described in previous work94,98, and adjusted for ambulatory data using thresholds determined by comparing performance of the original algorithms with hand-scored impedance cardiographic data on a subset of participants. The first derivative of the basal impedance (Z0) signal, dZ/dt, was used as the basis for all analyses. The signal was segmented into time windows corresponding to 250 ms before the ECG R-peak to 500 ms after; eight such segments (i.e., 8 beats) were averaged together to form overlapping ensembles94. B points were detected in each ensemble by taking the first and second derivatives of the dZ/dt signal and comparing them with thresholds based on signal frequency (supplementary Table S2)98. Following standard procedures98, forward and reverse autoregressive modeling was then used to perform outlier detection and correction, such that all B points in a series were fit with the autoregressive model. X points were detected by examining the second derivative of the dZ/dt signal within each ensemble94. Segments of the ICG signal from which we could not detect B or X points and ICG data that corresponded to periods of unusable ECG data were excluded from analysis.

Two features we used were systolic time intervals, namely, pre-ejection period (PEP) and left ventricular ejection time (LVET). PEP represents the time interval in ms between the electrical event that signals the start of ventricular contraction to the opening of the aortic valve. PEP is often used as an inverse estimate of sympathetic control of the heart53,55,58. Here, PEP was calculated as the time in ms between the ECG R peak and the ICG B point (also referred to as PEPR99). LVET is the time interval in ms between the opening and closing of the aortic valve55, and was calculated as the time in ms between the ICG B point and X point. Quality checks (supplementary Table S2) were performed and we retained only values that occurred within an acceptable range (30–200 ms for PEP, 100–500 ms for LVET), and that did not result in changes in the gradient greater than 30 ms from one ensemble to the next.

Two additional features we used were based on volumetric measures of cardiac function, namely, stroke volume (SV) and cardiac output (CO). SV represents the volume of blood ejected by the heart with each heartbeat (in mL), and here we calculated SV using Kubicek’s equation100. CO is the volume of blood circulated by the heart in L/min55.

### Analysis

#### Within-person clustering analysis

We submitted data from each individual to a separate clustering analysis using Dirichlet Process-Gaussian Mixture Modeling (DP-GMM)49, 50. DP-GMM is a specialized variant of Gaussian Mixture Modeling (GMM)49. Data points (i.e., six-dimensional vectors of change scores for IBI, RSA, PEP, LVET, SV, and CO) were standardized prior to clustering by subtracting the mean and dividing by the standard deviation of each feature. Specific parameter values are reported in supplementary Table S3.

The first step in implementing standard GMM is to initialize a set of parameters for a predefined number of clusters represented in the data. Parameters (i.e., values) include each cluster’s location (i.e., mean), shape (i.e., covariance), and relative size (i.e., the mixture proportion or the prior probability of a point belonging to that cluster relative to others). Initially, these cluster parameters are chosen at random, with means and covariances selected from a Gaussian distribution that is proportional to the data, and with prior probabilities of equal value. GMM uses these initialized values to calculate the posterior probabilities of each point belonging to each cluster. Under mixture models, each point is given a posterior probability of belonging to every cluster; however, if patterns in the data are distinctive, posterior probabilities during later iterations will often be close to 1 for a single cluster and negligible for all others. Based on the posterior probabilities, the means and covariance matrices are updated iteratively using optimization techniques such as Expectation Maximization49 or Variational Inference49 until reaching convergence (i.e., until the values stop changing).

In the present analyses, we sought to avoid tuning and selecting the number of clusters for each participant. DP-GMM allowed us to accomplish this goal as, in contrast to standard GMM, DP-GMM discovers the number of clusters from input data through the use of a Dirichlet Process Prior101. The Dirichlet Process Prior enables a dataset to include an unknown (i.e., infinite) number of clusters, where the mixture proportion of each cluster is randomly assigned a value from a continuous probability distribution such that the mixture proportions must sum to 1. This random assignment process is repeated ad infinitum to create a vector of mixing proportions which then serves as the prior probabilities of a standard GMM. To place reasonable bounds on this process, the number of clusters made discoverable through Variational Inference was limited to the number of data points in the sample. Because DP-GMM iteratively optimizes parameters (means, covariances, and mixture proportions), clusters which initially acquire more points become more likely to continue acquiring points in the future. Consequently, as the model runs, some clusters grow while others shrink and are ultimately discarded.

As a fully Bayesian method, DP-GMM is more likely to avoid overfitting because it is self-regularized by the priors101. There are, however, model parameters that can constrain the number of clusters discovered. As detailed on page 28 of the supplementary information, we ran additional analyses to validate the values selected for these parameters. We also conducted parallel analyses on the example participants using a standard GMM approach, in which Bayesian information criterion (BIC) values were used to select the optimal number of clusters per participant. These analyses replicated our main findings with regard to the heterogeneity of clusters discovered and their relationship with mental features of affective experience (see pages 29–30 of the supplementary information for details).

#### Effect size estimates

We derived effect size estimates (similar to Cohen’s d56) for the mean change scores for IBI, RSA, PEP, LVET, SV, and CO for each cluster for each participant using the following formula:

$$d= frac{m- mu }{sigma }$$

where m is the weighted cluster mean for a given physiological feature, µ is the assumed population mean corresponding to no change (i.e., 0), and σ is the standard deviation for the given physiological feature for that participant across all events included in the analysis. According to published recommendations for interpreting effect sizes54, we interpreted the magnitude of the weighted mean change score for each feature for each cluster as: |d|< 0.2 (negligible change); 0.2 ≤|d|< 0.5 (small change); 0.5 ≤|d|< 0.8 (medium change); |d|≥ 0.8 (large change).

#### Common patterns of change in physiological activity

We classified change score effect sizes as a negligible change (|d|< 0.2), decrease (d ≤ −0.2), or increase (d ≥ 0.2). Before classification, we rounded the effect size estimates to the nearest tenth (i.e., one decimal place) to facilitate comparison with published recommendations. Logical syntax was used to identify duplicate (i.e., repeated) strings of effect size classifications.

We performed further classifications of RSA and PEP change scores to assess overall trends in the change in PNS and SNS function, respectively. Larger change score effect sizes for RSA (i.e., greater variability in interbeat interval within the respiratory frequency band) indicate increased PNS function, such that: |d|< 0.2 (negligible PNS change); d ≥ 0.2 (PNS activation); and d ≤ −0.2 (PNS withdrawal). For PEP, larger change score effect sizes (i.e., longer pre-ejection periods) indicate SNS withdrawal, such that: |d|< 0.2 (negligible SNS change); d ≥ 0.2 (SNS withdrawal); and d ≤ −0.2 (SNS activation).