Interplay between $$k$$ k -core and community structure in complex networks


Data

We have considered networks corresponding to systems of different types: from social to technological, from semantic to transportation. Table 1 summarises main properties of such networks. Except for Cookpad networks, all the data sets are publicly available and have been retrieved from the Stanford Large Network data set Collection55 (Facebook 1, Twitter, Emails, and Cond. Matter), the Network Repository56,57,58 (Facebook 2, 3, 4, and 5), the Koblenz Network Collection (KONECT)59 (Comp. Science, and Words), Mark E. J. Newman’s personal network data repository60 (Web-blogs), and the OpenFlights data repository61 (Global airline). In the following text, we provide a brief description of each data set.

  • Facebook and Twitter These networks describe social relationships. Nodes are people. Edges represent their friendship relations.

  • Web-blogs This network is composed of the hyperlinks (edges) between weblogs on US politics (nodes) recorded in 2005.

  • Emails This is a network of email data from a large European research institution. Nodes are people. Edges connect pairs of individuals who have exchanged at least one e-mail.

  • Cond. Matter and Comp. Science The former network is the co-authorship network of the authors of preprint manuscripts submitted to the Condensed Matter Physics arXiv e-print archive from January 1993 to April 2003. The latter network is similarly defined using manuscripts appearing in the DBLP computer science bibliography, using a comprehensive list of research papers in computer science. The submission time of the papers of the DBLP collection is unavailable. A node is an author. An edge represents the existence of at least one manuscript co-authored by two authors.

  • Global airline In this network nodes are airports across the globe. An edge indicates direct commercial flights between two airports.

  • Words This network accounts for the lexical relationships among words extracted from the WordNet data set. Nodes are English words. Edges are relationships (synonymy, antonymy, meronymy, etc.) between pairs of words.

  • Cookpad These networks are extracted from the Cookpad online recipe sharing platform62. Users can post and browse recipes, as well as interact with other users through recipes in multiple ways including liking, sharing, and posting a comment. The platform is present in many countries (e.g., Japan, Indonesia, United Kingdom, and Italy). Here, we consider the data collected from September to November of 2018 in Greece, Spain, and the United Kingdom, separately for each country. In the three networks, nodes are users. An edge between a pair of users exists if one or more of the following types of events takes place: like or follow a user, viewing, bookmarking, commenting, or making a cooksnap of another user’s recipe.

All the networks considered in this work are treated as undirected and unweighted, even when the original data contains more information. Finally, we also consider synthetic networks, generated using the LFR (Lancichinetti–Fortunato–Radicchi) model43 (see Sect. 3 of SM for details).

Network shuffling

Given a network, G, with N nodes and L edges, we generate a randomised counterpart, (G^prime), that has the same nodes and the same number of edges by shuffling the edges of G. We consider three shuffling methods denoted by deg, commA, and commB; each shuffling method preserves different properties of G. The shuffling consists in selecting uniformly at random two edges (ab) and (cd), and replacing them with, e.g., (ac) and (bd), if the swapping of the edges is accepted. An attempt to swap edges is accepted, in which case we call the swapping effective, if and only if it respects the rule of the specific shuffling method and the swapping does not generate self-loops or multiple edges. We continued the shuffling until we carried out 2L effective swaps, such that an edge was swapped four times on average.

In the following text, we provide the details of each shuffling method. Assume that network G partitions into communities such that the set of the communities is ({mathcal {C}} = {C_1, ldots , C_{N_text {c}} }), where (N_text {c}) is the number of communities. Furthermore, let (g(i) in {mathcal {C}}, ; i = 1, dots , N), be the community to which the ith node belongs and (k_i) be the degree of node i. We have:

  • Degree-preserving shuffling (deg) This method preserves degree (k_i) of each node i and is equivalent to the configuration model37.

  • Community-preserving shuffling of type A (commA) On top of the degree of each node, this method preserves the total number of edges within each community and between each pair of communities. In attempts to swap edges, we replace two randomly selected edges (ab) and (cd) by (ac) and (bd) if and only if an end node of edge (ab) and an end node of edge (cd) belong to the same community (i.e., if (g(b) = g(c)) or (g(a) = g(d))).

  • Community-preserving shuffling of type B (commB) Like commA, this method preserves the degree of each node and the number of edges within each community and between each pair of communities. In contrast with commA, the commB method preserves the numbers of edges within and across communities for each node, and not only for each community or pairs of communities. Given two selected edges (ab) and (cd), we replace them with (ac) and (bd) if and only if the two new edges connect the same community pairs as before the swapping (i.e., (g(b) = g(c)) and (g(a) = g(d))).

Comparison of the (k)-core decomposition

To assess the similarity between the (k)-core decomposition of the original network, G, and of its shuffled counterpart, (G^prime), we used four indicators: the average (k)-shell index, (langle k_text {s} rangle), the network’s degeneracy, D, the Jaccard score, J, and the generalised Kendall’s tau, (tau _K). The indicator (langle k_text {s} rangle) explicitly depends on all the nodes in the network, whereas D, J and (tau _K) only depend on the nodes belonging to the innermost (k)-shell(s). We use the latter three indicators because, although a majority of nodes tends to belong to outer (k)-shells, it is a difference in the tails of the (k_text {s}) distributions that often affect functions of networks such as the impact of influencers in contagion processes63. The four indicators are defined as follows.The average of the (k)-shell index, (langle k_text {s} rangle), is equal to

$$begin{aligned} langle k_text {s} rangle = frac{1}{N} sum limits _{i=1}^N k_text {s}(i),, end{aligned}$$

(1)

where (k_text {s}(i)) is the (k)-shell index of node i. The degeneracy, D, of a network G is given by64

$$begin{aligned} D = max _{i in G} { k_text {s}(i) } ,. end{aligned}$$

(2)

Rather than using these raw indicators, to compare across the different data sets, we compute their relative difference between the empirical network and its shuffled counterpart given by (Delta X = left|X_G – X_{G^prime } right|/ X_G), where (X in { langle k_text {s} rangle , D }).

To compute J and (tau _K), we need to define a criterion to select nodes belonging to the innermost (k)-shells. We decided to confine the comparison to the nodes whose (k_text {s}) falls within the top (10%) among the N nodes. The horizontal lines in Fig. 1 indicate the threshold values of (k_text {s}^star) such that (P_{ge }(k_text {s}^star ) = 0.1). In the same manner, we define ({k_text {s}^star }^prime) such that (P_{ge }({k_text {s}^star }^prime ) = 0.1) in network (G^prime). To calculate J and (tau _K), we use the nodes belonging to (k)-shells with (k_text {s}ge k_text {s}^star) in G and the nodes belonging to (k)-shells with (k_text {s}ge {k_text {s}^star }^prime) in (G^prime) without duplication of the nodes. There are several remarks. First, it may hold that (k_text {s}^star ne {k_text {s}^star }^prime). Second, the value of ({k_text {s}^star }^prime) varies from one combination of a run of shuffling and community detection to another. Third, as in the case of the Facebook 2 data set, ({k_text {s}^star }^prime) sometimes does not even exist. In such a case, we set ({k_text {s}^star }^prime = D) and select all the nodes belonging to the innermost (k)-shell although they constitute more than 10% of the nodes in the network. Fourth, additional tests using different threshold percentages, 5% and 20%, instead of 10%, did not qualitatively change the results. Fifth, while the Jaccard score simply compares the nodes belonging to two sets, the generalised Kendall’s tau, (tau _K) compares ranked sets. In our case, the node’s rank is equivalent to the (k_text {s}) value.

Given two sets ({mathcal {A}}) and ({mathcal {B}}), the Jaccard score quantifies their overlap and is given by

$$begin{aligned} J({mathcal {A}},{mathcal {B}}) = dfrac{|{mathcal {A}} cap {mathcal {B}}|}{|{mathcal {A}} cup {mathcal {B}}|},. end{aligned}$$

(3)

The Jaccard score ranges between 0 and 1. A value of 1 indicates the complete overlap between the two sets (i.e., the sets are the same), whereas a value of 0 indicates that the sets are completely different.

The generalised Kendall’s tau, (tau _K), measures the consistency between two rankings by assigning penalties to pairs of elements on which the two rankings disagree65,66. Given two sets ({mathcal {A}}) and ({mathcal {B}}) having (m_A) and (m_B) elements, respectively, consider their associated ranking functions ({mathcal {X}}) and ({mathcal {Y}}). We denote with ((z_1, z_2)) an arbitrary pair of elements of ({mathcal {A}} cup {mathcal {B}}). We assign a penalty (K_{z_1,z_2}({mathcal {X}},{mathcal {Y}}) = 1) to ((z_1, z_2)) if (a) the rankings of the two elements within each set are different (i.e., ({mathcal {X}}(z_1) gtrless {mathcal {X}}(z_2)) and ({mathcal {Y}}(z_1) lessgtr {mathcal {Y}}(z_2))), (b) the element with the higher rank in one set is missing in the other set, i.e., ({mathcal {X}}(z_1) > {mathcal {X}}(z_2)) and (z_1 notin {mathcal {B}}) (or ({mathcal {X}}(z_2) > {mathcal {X}}(z_1)) and (z_2 notin {mathcal {B}})), or (c) both elements belong to one set each, which is not the same set, i.e., (z_1 notin {mathcal {B}}) and (z_2 notin {mathcal {A}}) (and vice-versa). In all the other cases (K_{z_1,z_2}({mathcal {X}},{mathcal {Y}}) = 0), such that we do not penalise the ((z_1, z_2)) pair. Finally, we sum the penalties over all the possible pairs of elements and normalise it, thus obtaining the generalised Kendall’s tau:

$$begin{aligned} tau _K({mathcal {X}},{mathcal {Y}}) = 1 – frac{1}{m_A m_B} sum _{z_1, z_2 in {mathcal {A}} cup {mathcal {B}}} K_{z_1,z_2}({mathcal {X}},{mathcal {Y}}). end{aligned}$$

(4)

Index (tau _K) ranges between 0 and 1. If (tau _K = 1), the two rankings are completely coherent. If (tau _K = 0), the two sets ({mathcal {A}}) and ({mathcal {B}}) have no pair of elements on which rankings ({mathcal {X}}) and ({mathcal {Y}}) are coherent. The above formulation of the Kendall’s tau is the so-called optimistic approach65. This means that we do not penalise the case in which a pairs of elements is present in one set and not in the other set.

Community detection methods

We considered two methods for community detection. The first is the Louvain method (Lvn)39, which is a heuristic greedy multiscale method that approximately maximises the modularity function. Given a network with N nodes distributed among (N_text {c}) communities, the modularity, Q, reads

$$begin{aligned} Q = dfrac{1}{2L} sum _{i,j = 1}^N left[ a_{i,j} – dfrac{k_i k_j}{2L} right] delta bigl (g(i),g(j)bigr ) ,, end{aligned}$$

(5)

where (a_{i,j}) is the element of the network’s adjacency matrix A; g(i) is the community to which the i-th node belongs ((1 le g(i) le N_text {c})), and (delta bigl (g(i),g(j)bigr )) is the Kronecker delta. A large value of Q implies a good partitioning. The Louvain method seeks the partitioning that maximises the modularity. Note that we obtain (Q approx 0) for random assignment of nodes to communities and that we obtain (Q approx 1) when the network is made of perfectly disjoint communities.

The other community detection method that we used is the stochastic block model67. It uses the probabilities ({mathcal {P}} = { p_{C_i,C_j} }) with which there exists an edge (ab) connecting an arbitrarily selected node a in community (C_i) (i.e., (g(a) = C_i)) and an arbitrarily selected node b in community (C_j) (i.e., (g(b) = C_j)). Different instances of probabilities ({mathcal {P}}) allow the description of different mixing patterns. When the diagonal entries of ({mathcal {P}}) predominate, we obtain the most usual community structure, whereas other instances yield other structures such as bipartite or core-periphery structure.

To find the optimal partition, one maximises the likelihood function with respect to ({ p_{C_i,C_j} }) corresponding to the partitioning ({mathcal {C}} = { C_i }), where (i,j in 1,ldots ,N_text {c}). The unnormalised log-likelihood, ({mathfrak {L}}), with which a partition of network G into (N_text {c}) communities, ({mathcal {C}}), is reproduced reads

$$begin{aligned} {mathfrak {L}} bigl ( G , bigl vert , {mathcal {C}} bigr . bigr ) = sum _{i,j = 1}^{N_text {c}} e_{ij} , log left( frac{e_{ij}}{m_i , m_j} right) ,, end{aligned}$$

(6)

where (e_{ij}) is the number of edges connecting community (C_i) and community (C_j), and (m_i) is the number of nodes belonging to (C_i).

The above formulation, however, has one major limitation: it assumes that the degrees of the nodes are distributed according to a Poisson-like function. To account for the degrees’ heterogeneity, Karrer et al. have implemented the so-called degree corrected stochastic block model, in which the expected degree of each node is kept constant via the introduction of additional parameters40. Let (e_i) be the sum of the node’s degree over all nodes in community (C_i). Then, the unnormalised log-likelihood for the degree-corrected stochastic block model reads

$$begin{aligned} {mathfrak {L}}_{text {DC}} bigl ( G , bigl vert , {mathcal {C}} bigr . bigr ) = sum _{i,j = 1}^{N_text {c}} e_{ij} , log left( frac{e_{ij}}{e_i , e_j} right) ,. end{aligned}$$

(7)

Equations (6) and (7) depend on the number of communities (N_text {c}). Because the value of (N_text {c}) is not known a priori, it is inferred through the minimisation of a quantity called the description length. The minimum description length principle describes how much a model compresses the data and allows us to find the optimal number of communities while avoiding overfitting68. In the present work we use the degree-corrected stochastic block model and its implementation available in the Python Graph-tool package69, which we refer to as SBM for brevity.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *