# Knowledge and social relatedness shape research portfolio diversification

Aug 28, 2020

### Data description

We use the American Physical Society (APS) dataset to reconstruct the activities of 197,682 physicists who published at least one paper in one of the APS outlets in the period ranging from 1977 to 2009 (see the “Data” section for details). All articles in APS journals are classified according to hierarchical codes that map into physics fields and sub-fields (i.e., PACS codes). For our analyses (see the “Modeling and assessment of predictors’ contributions” section), we filter out authors and sub-fields that appear only sporadically in the data. Specifically, we focus on 105,558 authors who published at least two articles, covering a minimum of two sub-fields over a restricted set of 68 PACS which appear in at least four articles.

Figure 1 provides a general description of the data and some insights. Figure 1a shows the popularity, in terms of number of articles, of fields and sub-fields (one- and two-digit level PACS codes, respectively). As expected, PACS popularity is highly heterogeneous and reflects the prominence of condensed matter research in the last decades. Figure 1b shows scientists’ degree of diversification and their relative specialization, as defined in the “Scientists’ specializations” section. The research portfolio of most scholars in our dataset is fairly limited in scope, with a large majority of scientists diversifying in no more than 5 sub-fields. The choice of subjects, however, is not random—as we demonstrate in the next section.

### Diversification is not random

Do scientists, much like firms11,12, shape their research portfolios based on specific strategies and constraints? To address this question quantitatively, we draw a parallel with ecology: as species may co-occur in distinct sites, sub-fields may overlap in research portfolios. Measuring the relatedness of species based on their geographical co-occurrence is analogous to measuring the relatedness of sub-fields based on their overlap in scientists’ ranges of activity. Thus, the PACS-Authors binary bipartite network resembles a presence-absence matrix13. The monopartite projection of this bipartite network (see the “Monopartite projections of bipartite networks” section) on the PACS layer carries a critical piece of information: for each pair of PACS, it tells us how many scientists are active in both sub-fields, irrespective of the number of articles. This projection defines a diversification network.
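As an illustration of the projection step, the following minimal Python sketch counts, for each pair of sub-fields, how many scientists are active in both. The function name `project_on_pacs`, the author identifiers, and the toy portfolios are hypothetical and chosen only for this example; the actual procedure is described in the “Monopartite projections of bipartite networks” section.

```python
from collections import Counter
from itertools import combinations


def project_on_pacs(portfolios):
    """Count, for each pair of sub-fields, the scientists active in both.

    `portfolios` maps a scientist id to the set of sub-fields (PACS codes)
    in which that scientist published at least once. The number of articles
    per sub-field is deliberately ignored (binary presence/absence).
    """
    overlap = Counter()
    for subfields in portfolios.values():
        # Every unordered pair in a portfolio contributes one co-occurrence.
        for a, b in combinations(sorted(subfields), 2):
            overlap[(a, b)] += 1
    return overlap


# Toy data: hypothetical authors and portfolios, not taken from the APS dataset.
portfolios = {
    "author_1": {"03", "05"},
    "author_2": {"03", "05", "74"},
    "author_3": {"05", "74"},
    "author_4": {"03"},
}
overlap = project_on_pacs(portfolios)
for pair, count in sorted(overlap.items()):
    print(pair, count)
```

Each key of `overlap` is an edge of the projected network and each value is its weight, i.e., the number of scientists shared by the two sub-fields.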

We can assess this network by contrasting it against an appropriate null model. Which sub-field overlaps are over- or under-represented relative to what we would expect if scientists picked research topics at random, taking into account the popularity of sub-fields? Under a random model, the probability that \(x\) scientists are active in both sub-field \(a\) and sub-field \(b\), given that \(S_a\) and \(S_b\) scientists are active in these sub-fields, follows a hypergeometric distribution14

$$
\begin{aligned} P(X=x) = \frac{\binom{S_a}{x}\binom{S-S_a}{S_b-x}}{\binom{S}{S_b}} \end{aligned} \tag{1}
$$

where S is the total number of scientists in the sample.
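To make Eq. (1) concrete, here is a small Python sketch that evaluates the hypergeometric probability and the two one-sided tail probabilities for an observed overlap. The function names and the toy numbers are assumptions made for illustration. Whether the original analysis uses exactly these tails is not stated in this section, and exact integer binomials are practical only for small examples.

```python
from math import comb


def hypergeom_pmf(x, S, S_a, S_b):
    """P(X = x) from Eq. (1): x scientists shared by sub-fields a and b."""
    return comb(S_a, x) * comb(S - S_a, S_b - x) / comb(S, S_b)


def tail_pvalues(x_obs, S, S_a, S_b):
    """Return (P(X >= x_obs), P(X <= x_obs)) under the random model."""
    lo = max(0, S_a + S_b - S)  # smallest feasible overlap
    hi = min(S_a, S_b)          # largest feasible overlap
    pmf = {x: hypergeom_pmf(x, S, S_a, S_b) for x in range(lo, hi + 1)}
    p_over = sum(p for x, p in pmf.items() if x >= x_obs)
    p_under = sum(p for x, p in pmf.items() if x <= x_obs)
    return p_over, p_under


# Toy example: 20 scientists, 8 active in a, 6 active in b, 5 active in both.
S, S_a, S_b, x_obs = 20, 8, 6, 5
p_over, p_under = tail_pvalues(x_obs, S, S_a, S_b)
print(f"expected overlap = {S_a * S_b / S:.2f}")
print(f"P(X >= {x_obs}) = {p_over:.4f}, P(X <= {x_obs}) = {p_under:.4f}")
```

A small upper-tail probability flags a pair as over-represented (positive association), while a small lower-tail probability flags it as under-represented (negative association).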

Figure 2 describes the steps of our procedure. Starting from the bipartite network (panel a), we derive its monopartite projection (panel b) and test whether the resulting structure is non-random, summarizing statistically validated diversification patterns (panel c). Out of 2,278 pairs of PACS, 72% are classified as non-random with a Bonferroni-corrected p-value below 0.05. Of these, 1,151 pairs show a positive association and 486 a negative one. Given the severity of the Bonferroni correction (i.e., power decreases significantly as the number of tests increases) and possible issues related to dependency, we also employ the False Discovery Rate (FDR) Benjamini-Hochberg and Benjamini-Yekutieli corrections (see section S2 and Table S2). These results strongly support the coherent nature of scientists’ diversification choices, but do not directly quantify the role played by specific features in shaping such coherence. Next, we investigate potential drivers of diversification, considering measures of cognitive and social proximity.
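The three multiple-testing corrections mentioned above differ in how strict they are. The sketch below implements their textbook forms in plain Python on a handful of made-up p-values. The function names and the p-values are hypothetical, and this is not the code used for section S2.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject H0 for every test whose p-value is at most alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def fdr_step_up(pvalues, alpha=0.05, dependent=False):
    """Benjamini-Hochberg step-up procedure.

    With dependent=True the thresholds are divided by c(m) = sum_{k<=m} 1/k,
    which gives the Benjamini-Yekutieli variant valid under arbitrary dependence.
    """
    m = len(pvalues)
    c = sum(1.0 / k for k in range(1, m + 1)) if dependent else 1.0
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k such that p_(k) <= alpha * k / (m * c).
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / (m * c):
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject


# Made-up p-values, for illustration only.
pvals = [0.001, 0.012, 0.014, 0.03, 0.2]
print("Bonferroni:         ", bonferroni(pvals))
print("Benjamini-Hochberg: ", fdr_step_up(pvals))
print("Benjamini-Yekutieli:", fdr_step_up(pvals, dependent=True))
```

On this toy input the Benjamini-Hochberg procedure validates more tests than Bonferroni, which illustrates the loss of power that motivates the FDR-based checks.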

### Knowledge and social relatedness predict diversification

The relationships among scientific fields, like those among technologies, can be mapped using network science tools. To chart a knowledge space we need a measure of distance between fields. Several different metrics have been proposed to quantify the relatedness of technologies or scientific domains (see Bowen et al.15 for a review). When we consider the monopartite projection on the PACS layer of the bipartite PACS-Articles network, counting the co-occurrences of all pairs of PACS produces a first approximation of the relatedness of sub-fields. A similar approach was used by Lamperti et al.16 for patent data. However, we need a measure of proximity that: (i) does not depend on the absolute popularity of the fields, and (ii) is symmetric. The most straightforward metric that fulfils both requirements is the cosine similarity (see Fig. 3a–c and the “Measures of knowledge and social relatedness” section). As expected, the proximity matrix has a clear hierarchical block structure, with blocks largely overlapping with fields. Interestingly, several off-block elements reveal the proximity of sub-fields belonging to different PACS fields.
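One common way to obtain a cosine similarity from co-occurrence counts is to divide the number of articles shared by two sub-fields by the geometric mean of their article counts. The Python sketch below follows that reading. It is an assumption made for illustration, since the exact definition is given in the “Measures of knowledge and social relatedness” section, and the function name and toy articles are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_relatedness(articles):
    """Cosine similarity between sub-fields from article-level co-occurrence.

    `articles` is an iterable of sets of PACS codes, one set per article.
    For binary incidence vectors, cos(a, b) = n_ab / sqrt(n_a * n_b), where
    n_a is the number of articles in a and n_ab the number in both a and b.
    """
    n = Counter()   # articles per sub-field
    co = Counter()  # articles shared by each pair of sub-fields
    for codes in articles:
        for a in codes:
            n[a] += 1
        for a, b in combinations(sorted(codes), 2):
            co[(a, b)] += 1
    return {(a, b): c / sqrt(n[a] * n[b]) for (a, b), c in co.items()}


# Toy articles with hypothetical PACS assignments.
articles = [{"03", "05"}, {"03", "05"}, {"03"}, {"05", "74"}, {"74"}]
relatedness = cosine_relatedness(articles)
for pair, value in sorted(relatedness.items()):
    print(pair, round(value, 3))
```

The normalization by `sqrt(n[a] * n[b])` is what removes the dependence on absolute popularity, and swapping `a` and `b` leaves the value unchanged, so both requirements (i) and (ii) hold.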

As science becomes an increasingly “social” enterprise, it is also important to capture the relatedness of scholars, which can be done by analysing co-authorships3. Similar to what we did for knowledge relatedness, we construct a measure of social relatedness starting from the bipartite Authors-Articles network. The monopartite projection on the Authors layer defines the co-authorship network from which we compute our desired metric. In addition, to investigate whether diversification is associated with the exploitation of social relationships, we include information on authors’ specialization as node attributes in the network and we introduce a dummy \(SR_{ib}\) equal to 1 if scientist \(i\) can reach sub-field \(b\) through direct social interactions (see Fig. 3d and the “Measures of knowledge and social relatedness” section).
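A minimal Python sketch of the social relatedness dummy follows. It assumes that “reaching sub-field b through direct social interactions” means having at least one co-author whose specialization is b. This is one plausible reading, and the precise construction is in the “Measures of knowledge and social relatedness” section. All names and data below are hypothetical.

```python
from collections import defaultdict


def coauthor_network(articles):
    """Project the Authors-Articles network on the Authors layer.

    `articles` is a list of author lists; the result maps each author
    to the set of her direct co-authors.
    """
    neighbours = defaultdict(set)
    for authors in articles:
        for i in authors:
            neighbours[i].update(j for j in authors if j != i)
    return neighbours


def social_relatedness(i, b, neighbours, specialization):
    """SR_ib = 1 if some direct co-author of i is specialized in b, else 0."""
    return int(any(specialization.get(j) == b for j in neighbours.get(i, ())))


# Toy data: hypothetical authors, articles, and specializations.
articles = [["anna", "bo"], ["bo", "chen"], ["dana"]]
specialization = {"anna": "03", "bo": "05", "chen": "74", "dana": "03"}
neighbours = coauthor_network(articles)

print(social_relatedness("anna", "05", neighbours, specialization))  # co-author bo
print(social_relatedness("anna", "74", neighbours, specialization))  # chen is two steps away
```

Only direct neighbours count in this sketch, so a sub-field reachable solely through a co-author of a co-author yields a value of 0.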

Next, we evaluate the effects of knowledge and social relatedness on diversification with logistic regressions. The binary dependent variable encodes whether a scientist is active in a sub-field, the main explanatory variables are our measures of cognitive and social proximity, and a control is introduced for the core field. In practice, each scientist is assigned to a core sub-field (specialization) and can possibly diversify in one or more target sub-fields different from her own (see the “Scientists’ specializations” section). In this first set of regressions, each scientist appears 67 times, one for every possible target PACS different from her own specialization (see the “Modeling and assessment of predictors’ contributions” section for more details).
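The structure of the regression dataset can be sketched as follows: one row per scientist and per candidate target sub-field, excluding the scientist's own specialization. With 68 sub-fields this gives the 67 rows per scientist mentioned above. The function name, field names, and toy data are hypothetical, and the relatedness covariates are omitted for brevity.

```python
def build_rows(portfolios, specialization, all_subfields):
    """Build one observation per (scientist, target sub-field) pair.

    The binary outcome `diversified` is 1 when the scientist is active
    in the target sub-field, and 0 otherwise.
    """
    rows = []
    for scientist, active in portfolios.items():
        core = specialization[scientist]
        for target in all_subfields:
            if target == core:
                continue  # the core sub-field is never a diversification target
            rows.append({
                "scientist": scientist,
                "core": core,
                "target": target,
                "diversified": int(target in active),
            })
    return rows


# Toy data with four sub-fields instead of 68.
all_subfields = ["03", "05", "42", "74"]
portfolios = {"author_1": {"03", "05"}, "author_2": {"74", "42", "05"}}
specialization = {"author_1": "03", "author_2": "74"}

rows = build_rows(portfolios, specialization, all_subfields)
for row in rows:
    print(row)
```

Because each scientist contributes several rows, observations are not independent, which is the clustering issue addressed later with robust standard errors.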

Figure 4 provides evidence that both social and knowledge relatedness are associated with scientists’ diversification strategies. Social relatedness matters irrespective of the field, as scientists who can acquire new knowledge through social relationships are more likely to be active in a sub-field different from their own specialization (panel a). Knowledge relatedness also increases the probability of a scientist being active outside her own specialization, and again this is true for all fields (panel b). These results strongly suggest that cognitive and social proximity do contribute to shaping diversification strategies.

### Model extensions and robustness checks

To move further in our investigation of research portfolio diversification, we broaden our analysis in several ways. First, we expand our logistic regression model including a larger set of control variables, such as the number of co-authors or the popularity and citations of the target sub-field (see Table S3 for a complete list). All numerical variables in the expanded model are normalized, and log-transformed to reduce right-skew when necessary (see the “Modeling and assessment of predictors’ contributions” section for more details). Since the effect of knowledge relatedness on the probability of diversification may be modulated by social relatedness, we also include an interaction term in our analysis.

Second, we tackle two potential limitations of our original analysis; that is, defining a single specialization for each scientist (while core specializations may actually be multiple), and not separating sub-field movements within and between fields, i.e., one-digit PACS codes (which may be differently affected by various features). We run additional model fits allowing scientists to have multiple specializations (see the “Scientists’ specializations” section) and separating within and between field diversification. Specifically, we perform the following fits: (i) single specialization with full diversification, (ii) multiple-specialization with full diversification, (iii) single specialization with within field diversification, (iv) multiple specialization with within field diversification, (v) single specialization with between field diversification and (vi) multiple specialization with between field diversification.

Third, we account for the fact that the data employed in our fits are “clustered”, with several observations associated with each scientist and potential heteroskedasticity across clusters/scientists. We estimate clustering-robust standard errors using the clustered sandwich estimator from the R package sandwich17.
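For reference, the textbook form of the cluster-robust sandwich covariance for a logistic regression is written out below, with scientists as clusters. The notation (\(x_i\) for the covariate vector of observation \(i\), \(\hat{p}_i\) for its fitted probability, \(g\) for a cluster) is introduced here for illustration, and software implementations may apply additional finite-sample adjustments that are not shown.

```latex
\hat{V}_{\text{cluster}}(\hat{\beta})
  = A^{-1} \Big( \sum_{g=1}^{G} s_g s_g^{\top} \Big) A^{-1},
\qquad
A = \sum_{i} \hat{p}_i (1 - \hat{p}_i)\, x_i x_i^{\top},
\qquad
s_g = \sum_{i \in g} (y_i - \hat{p}_i)\, x_i .
```

Summing the score contributions within each cluster before taking outer products is what allows arbitrary correlation among the observations of the same scientist.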

Fits for specifications (i)–(vi), all including the interaction between knowledge and social relatedness and clustering-corrected standard errors, are summarized in Table 1, confirming the high significance of the relatedness metrics in shaping research diversification. Figure 5 focuses on the full diversification case. Panels a (single specialization, (i)) and c (multiple specialization, (ii)) show the log-odds difference in the probability of diversification as a function of knowledge and social relatedness, accounting for all controls. Social relatedness positively affects the chances of diversification and the effect is moderated by knowledge relatedness in both specifications, though more markedly in (i) than in (ii). Panels b (for (i)) and d (for (ii)) further illustrate this, showing how the estimated coefficient of social relatedness decreases as knowledge relatedness increases. This result indicates that when diversifying toward “close” sub-fields, the role of social relatedness becomes less crucial.
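The moderation effect can be read directly off a logistic model with an interaction term. In the schematic form below, the symbols \(KR_{ib}\) (knowledge relatedness between the specialization of scientist \(i\) and target sub-field \(b\)), \(z_{ib}\) (controls), and the coefficient names are notation assumed for this illustration rather than taken from Table 1.

```latex
\operatorname{logit} P(y_{ib} = 1)
  = \beta_0 + \beta_S\, SR_{ib} + \beta_K\, KR_{ib}
  + \beta_{SK}\, SR_{ib}\, KR_{ib} + \gamma^{\top} z_{ib},
\qquad
\underbrace{\beta_S + \beta_{SK}\, KR_{ib}}_{\text{log-odds difference between } SR_{ib}=1 \text{ and } SR_{ib}=0}
```

With \(\beta_S > 0\) and \(\beta_{SK} < 0\), the log-odds gain associated with social relatedness shrinks linearly as knowledge relatedness grows, which is the pattern described for panels b and d.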

Next, we contrast scientists moving within their specialization field with scientists moving out of their field and towards a completely different subject (i.e., a different one-digit PACS code). A within-field move links two sub-fields (two-digit PACS codes) belonging to the same field (one-digit PACS code); for example, PACS 12 (Specific theories and interaction models; particle systematics) and PACS 13 (Specific reactions and phenomenology) both belong to PACS 1 (High Energy physics). These choices may be driven by different factors. Scientists moving within their field may be less dependent on external collaborations, since such a diversification strategy requires a smaller learning effort. Our estimates do highlight differences. Looking at the within field diversification case, single specialization (Table 1, (iii)), we see that knowledge and social relatedness, as well as their interaction, are still significant, but the magnitude of the coefficients is smaller than in the full diversification case. When we consider multiple specialization (Table 1, (iv)), coefficients shrink even further and the interaction is no longer significant (see also Figure S2). By contrast, in the between field diversification case, the general trends outlined for the full diversification case are confirmed, including the negative interaction term, which remains sizeable and significant for both single and multiple specialization (see Table 1, (v) and (vi), and Figure S3). These results are in line with expectations: while having a co-author in a different sub-field may well be useful, knowledge is not a barrier to entry when scientists move within the same general area of inquiry. This explains why the interaction between social and knowledge relatedness becomes less prominent or non-significant in our estimates.

### Quantifying the relative importance of knowledge and social relatedness

Can we quantify the (relative) role of knowledge and social relatedness in explaining research portfolio diversification? How important are these quantities when evaluated in the presence of several control covariates, and under a range of model specifications? To answer these questions we follow two approaches.

First, we run a LASSO feature selection procedure to gauge the relative importance and role of different predictors by tracking how they are excluded/included in a model as one varies the regularization penalty. Since our predictors include categorical variables (i.e., groups of dummies), as well as naturally grouped variables (e.g., scientists’ individual characteristics, sub-fields’ popularity and competition, etc.), we run a group LASSO algorithm18 with features grouped as shown in Table S3. Moreover, to counteract collinearity and finite sample issues which can render the LASSO unstable19, we split our data forming ten random subsamples of 1,000 scientists each, and repeat the group LASSO fit on each of the subsamples for all the considered model specifications. Figure 6a–f shows the (grouped) coefficient norms as a function of the penalization parameter \(\lambda\). Results clearly demonstrate the crucial role played by social and knowledge relatedness. They also confirm that the role of knowledge relatedness weakens markedly in the case of within-field diversification (panels c and d).
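For readers unfamiliar with the method, a standard formulation of the group LASSO for logistic regression is given below. Here \(\ell(\beta)\) denotes the log-likelihood, \(\beta_g\) the coefficients of group \(g\), and \(p_g\) the size of that group. The \(\sqrt{p_g}\) weights and the scaling of the likelihood are common conventions assumed for this sketch and may differ from the implementation actually used.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta}
    \Big\{ -\ell(\beta) + \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \lVert \beta_g \rVert_2 \Big\}
```

Because the penalty acts on the Euclidean norm of each group, all coefficients in a group enter or leave the model together as \(\lambda\) varies, which is why the figure tracks grouped coefficient norms rather than individual coefficients.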

Second, we compute the Relative Contributions to Deviance Explained (RCDEs; see the “Modeling and assessment of predictors’ contributions” section for details). This index measures what percentage of the logistic regression deviance is captured by a predictor. Figure 6g strongly supports a prominent role for social relatedness, with RCDEs around or above 30% across all specifications. The RCDEs of knowledge relatedness are smaller, around 5–10%, and again become negligible in the case of within-field diversification. In summary, our results provide additional evidence that both social and knowledge proximity shape scientists’ diversification strategies, but highlight social interactions as the dominant channel through which knowledge is exchanged and acquired.

### Digging deeper: multidisciplinarity and time

Next, we tackle two additional potential limitations of our original analysis, which might overestimate the probability of diversification for truly multidisciplinary scientists and suffer from reverse causality issues. To investigate diversification into truly unexplored sub-fields, we fitted model specification (i) (see the “Model extensions and robustness checks” section) considering scientists’ specialization (see the “Scientists’ specializations” section) and limiting their diversification choices to sub-fields in which they have no revealed scientific advantage (see section S5.1). To at least partially address causality in the effects of knowledge and social relatedness on diversification, we included a temporal dimension: we split the original dataset into three time periods, re-computed our measures of relatedness in each, and used them to predict scientists’ diversification with time lags (see section S5.2). In both exercises, results confirmed our previous findings: social relatedness shapes scientists’ diversification strategies more than knowledge relatedness.

Finally, and again related to time, our findings may be influenced by underlying trends in the temporal evolution of PACS co-occurrence networks—and thus knowledge proximity. A detailed study of the evolution of relationships among sub-fields, which is of course of interest per se, is beyond the scope of the present article. Nevertheless, to gather at least some approximate sense of its potential impact, we recomputed our measure of knowledge relatedness separately for each of the different decades in the original dataset. Based on results shown in section S4, the physics knowledge space remained rather stable over the time span considered. A valuable alternative approach to take into account the temporal evolution of the physics knowledge space is provided by Chinazzi et al.20