### Demographic dynamics of urban systems

We now derive the population size distribution of cities in the most general way, set by their demographic dynamics. The change of the population *N*_{i} in city *i* during unit time Δ*t* necessarily results from the balance of births, deaths, and migration as

$${N}_{i}(t+Delta t)={N}_{i}(t)+Delta tleft[{upsilon }_{i}{N}_{i}(t)+sum_{j = 1,jne i}^{{N}_{{rm{c}}}}({J}_{ji}-{J}_{ij})right],$$

(2)

where *N*_{c} is the total number of cities, *υ*_{i} = *b*_{i} − *d*_{i} is the city’s vital rate, the difference between its per capita average birth rate, *b*_{i}, and the death rate, *d*_{i}. The current *J*_{ij} is the number of people moving from city *i* to city *j* over the time interval Δ*t*, and vice-versa for *J*_{ji}. For simplicity of notation we will work in units where Δ*t* = 1. Eq. (2) only accounts explicitly for migration between different specific cities, which can cross political borders. Migration to city *i* from non-specific places outside the system, for example from non-urban regions or other nations, can be incorporated into the vital rates *υ*_{i}. Note that this model is very general, and naturally accommodates the inclusion of new cities over time when their sizes become non-zero, and indeed the disappearance of others, if their size vanishes.

To proceed we need to explore the dependence of the currents *J*_{ij} on population size. We start by writing these currents as population rates, *J*_{ij} = *m*_{ij}*N*_{i}(*t*), where *m*_{ij} is the probability that someone in city *i* moves to city *j* over the time period. This allows us to write Eq. (2) in matrix form

$${bf{N}}(t)={bf{A}}(t){bf{N}}(t-1),$$

(3)

where ({bf{N}}(t)={[{N}_{1}(t),ldots ,{N}_{i}(t),ldots ,{N}_{{N}_{{rm{c}}}}(t)]}^{T}) is the vector of populations in each city at time *t* and the matrix **A** projects the population at time *t* − 1 to time *t*. This matrix is known in population dynamics as the environment^{39}, with elements

$${A}_{ij}=left{begin{array}{ll}1+{upsilon }_{i}-{m}_{i}^{{rm{out}}},&{rm{if}},i=j\ {m}_{ji},&{rm{if}},ine jend{array}right.,$$

(4)

where ({m}_{i}^{{rm{out}}}=mathop{sum }nolimits_{k = 1,kne i}^{{N}_{{rm{c}}}}{m}_{ik}) is the total probability that a person migrates out of city *i* per unit time. We also define the population structure vector **x** = [*x*_{i}], with *x*_{i} = *N*_{i}/*N*_{T}, which is the probability of finding a person in city *i* out of the total population *N*_{T}. Note that (mathop{sum }nolimits_{i = 1}^{{N}_{{rm{c}}}}{x}_{i}=1), so that the structure vector has *N*_{c} − 1 degrees of freedom (the magnitude of *x* is fixed).

Equation (3) can now be solved by repeated iteration

$${bf{N}}(t)={bf{A}}(t){bf{A}}(t-1)ldots {bf{A}}(1){bf{N}}(0).$$

(5)

### City size distributions in different environments

A number of important ergodic theorems^{40} in population dynamics tell us about the properties of the solutions under different conditions on the environments **A**(*t*). These results rely on some constraints on the properties of **A**; specifically, that the product of matrices in Eq. (5) is positive for sufficiently long times^{39,41}.

First, the weak ergodic theorem guarantees that for a dynamical sequence of environments, the difference between two different initial population structure vectors decays to zero over time. This means that there is typically an asymptotic city size “distribution”, which is a function of environmental dynamics only, independent of initial conditions. When the environment is stochastic but otherwise time independent, the strong stochastic ergodic theorem states that the structure vector becomes a random variable whose probability distribution converges to a fixed stationary distribution, regardless of initial conditions. This is the sense in which most derivations of Zipf’s law apply^{17,33}. For stochastic environments, only probability distributions of structure vectors, not vectors themselves, can be predicted. Finally, in situations where the environment is time dependent and stochastic, the weak stochastic ergodic theorem states that the difference between the probability distributions for the structure vector resulting from any two initial populations, exposed to independent sample paths, decays to zero. Again, in cases where the environment is explicitly dynamic, besides being stochastic in a stationary sense, we cannot say much about the actual probability distribution of city sizes, only that the importance of initial conditions vanishes for sufficiently long times.

To get more intuition on these results we will now show some explicit solutions. The city size distribution can be calculated exactly when the environments **A** are arbitrarily complicated, but static. This is the result of the strong ergodic theorem, which says that in a constant environment **A**, any initial population vector converges to a fixed stable structure given by the leading eigenvector of the environment^{39}. This is guaranteed to exist if **A** is a strongly connected aperiodic graph, meaning that any city can be reached from any other city in a finite and diverse number of intermediate steps along the non-zero migration flows between them. This is a reasonable assumption to make about an urban system, which may even serve as a definition. The leading eigenvector of **A** is the eigenvector centrality of the urban system. It describes the probabilistic location of a random walker over the graph of migration flows^{42}. This is equal to the stationary solution of numerically integrating a network described by environment **A**. Figure 1 shows two numerical solutions with different initial conditions, but the same environment **A**. The cities in the two runs converge to similar sizes, as they experience a common environmental matrix with all entries set to be the same plus a little noise. The insets show how the population structure converges from both initial conditions to the one defined by the leading eigenvector **e**_{0}. Because the structure vector is the probability of finding a person out of the total population in the urban system in a specific city, we use the Kullback-Leibler (KL) divergence^{43}({D}_{{rm{KL}}}(P| {P}_{{{bf{e}}}_{0}})={sum }_{x}P[x]{mathrm{log}},P[x]/{P}_{{{bf{e}}}_{0}}[x]) as a measure of the ‘distance’ between the initial distribution, *P*, and the final, ({P}_{{{bf{e}}}_{0}}). The KL divergence is a foundational quantity in information theory. It measures the information gain in describing a probabilistic state using one distribution, ({P}_{{{bf{e}}}_{0}}), versus another, *P*, in units of information (bits). Note that (Pto {P}_{{{bf{e}}}_{0}}) and thus ({D}_{{rm{KL}}}(P| {P}_{{{bf{e}}}_{0}})to 0) over time, see insets in Fig. 1.

The convergence rate of the city size distribution to the leading eigenvector is given by the ratio of secondary eigenvalues to the dominant one. Explicitly, we can now write this solution as

$${bf{N}}(t)={{bf{A}}}^{t}{bf{N}}(0)to {bf{N}}(t)=sum_{i = 1}^{{N}_{{rm{c}}}}{({lambda }_{i})}^{t}{c}_{i}{{bf{e}}}_{i}.$$

(6)

where **A****e**_{i} = *λ*_{i}**e**_{i}, so that **e**_{i} are the eigenvectors of the matrix **A** and *λ*_{i} the corresponding eigenvalues, so that *λ*_{0} > *R**e*(*λ*_{1}) > *R**e*(*λ*_{2}) > . . . , where *R**e*(*λ*_{i}) is the real part of the eigenvalue. The projection coefficients *c*_{i} are such that ({bf{N}}(0)=mathop{sum }nolimits_{i = 1}^{{N}_{{rm{c}}}}{c}_{i}{{bf{e}}}_{i}), which can be written as **c** = **E**^{−1}**N**(0), where **E** is a matrix whose *j*th column vector is **e**_{j}. The strong ergodic theorem states that the largest eigenvalue *λ*_{0} is positive and real and that the associated eigenvector **e**_{0} is also positive in terms of all its entries. This means that, over time, the dynamics of the urban system approaches that of the growth of its dominant eigenvector, because the projections on all other eigenvectors decay exponentially in relative terms, as

$${left(frac{{lambda }_{i}}{{lambda }_{0}}right)}^{t}={e}^{-mathrm{ln},frac{{lambda }_{0}}{{lambda }_{i}}t},{mathop{-hskip -4.3pt-hskip -4.3ptlongrightarrow}limits_{tto infty}},0$$

(7)

and exemplified in the two insets in Fig. 1. The longest characteristic time is associated with the difference in magnitude between the two largest eigenvalues ({t}_{*}=1/{mathrm{ln}},frac{{lambda }_{0}}{| {lambda }_{1}| }). Consequently, for *t* ≫ ({t}_{*}) we observe an extreme dimensional reduction from an initial condition characterized in general by *N*_{c} degrees of freedom to a final one with just one, set by the environment’s leading eigenvector. Using the values of migration flows in the US urban system over the last decade, we can compute the value of ({t}_{*}). It is rather long—of the order of several centuries—when measured using census data^{44} and a little shorter using data on tax returns^{45}.

These results allow us to consider very general classes of dynamics and initial conditions. We therefore conclude that in general, given a specific set of demographic conditions, the total population can grow exponentially with a common growth rate across cities. In addition, the relative population distribution at late times does not resemble anything like Zipf’s law. As expected from the statements of the ergodic theorems, the population structure is set by the nature of the intercity flows (and vital rates), or equivalently by the environment **A**.

### Stochastic demographic dynamics and symmetry breaking

The step towards a statistical solution for the city size distribution—and obtaining Zipf’s law in particular—requires the additional consideration of statistical fluctuations, which must arise in the vital and migration rates. Just as in models of statistical physics, the introduction of stochasticity implies not just fluctuating quantities but potentially also the restoration of broken symmetries^{46,47}. In the context of urban systems, restoring broken symmetries means removing preferences for population growth in specific cities over others, associated with imbalances in vital rates and migration flows. To make these expectations explicit we write the migration rates *m*_{ij} as

$${J}_{ij}={m}_{ij}{N}_{i}(t)=left(frac{{s}_{ij}+{delta }_{ij}}{2}right)frac{{N}_{j}(t)}{{N}_{{rm{T}}}(t)}{N}_{i}(t).$$

(8)

Eq. (8) splits the migration flows into two parts: *s*_{ij} describes the symmetric flow between two cities (*s*_{ij} = *s*_{ji}), and *δ*_{ij} the anti-symmetric part (*δ*_{ij} = −*δ*_{ji}). Note that any correlations between *N*_{i} and *N*_{j} are preserved in these two quantities. Making the population size of the destination city explicit will allow to diagonalize Eq. (2) and therefore to find self-consistent solutions for the structure vector. This form of the migration currents also makes contact with the gravity law of geography, which is typically written as

$${J}_{ij}={G}_{g}frac{{N}_{i}{N}_{j}}{{d}_{ij}^{gamma }}={J}_{ji}.$$

(9)

In this form, the gravity law is symmetric, corresponding to Eq. (8) when *δ*_{ij} = 0 and *s*_{ij} = *G*_{g}*N*_{T}(*t*)*f*(*d*_{ij}), where *G*_{g} is the ‘gravitational’ constant, (f({d}_{ij})=1/{d}_{ij}^{gamma }) is a decaying function of distance *d*_{ij} and *γ* is the distance exponent. Only the anti-symmetric part contributes to the dynamics of population size change, since we can write Eq. (2) as

$${N}_{i}(t+1)=left[1+{upsilon }_{i}-{overline{delta }}_{i}right]{N}_{i}(t).$$

(10)

with ({overline{delta }}_{i}=mathop{sum }nolimits_{j = 1}^{{N}_{{rm{c}}}}{delta }_{ij}{x}_{j}). We can now appreciate the importance of the bi-linearity of the migration flows on both the origin and destination population as a means to diagonalize the demographic dynamics. We have achieved this however at the cost of introducing a slow non-linearity in each city’s growth rate, via their dependence on the population structure vector **x**. We can establish the existence of a self-consistent solution **x*** such that for long times

$${upsilon }_{i}-{overline{upsilon }}^{* }=sum_{j = 1}^{{N}_{{rm{c}}}}{delta }_{ij}{x}_{j}^{* },$$

(11)

where ({overline{upsilon }}^{* }=mathop{sum }nolimits_{i}^{{N}_{{rm{c}}}}{upsilon }_{i}{x}_{i}^{* }), since (mathop{sum }nolimits_{i,j = 1}^{{N}_{{rm{c}}}}{delta }_{ij}{x}_{i}{x}_{j}=0), by the conservation of the migration currents. Note that we did not assume any explicit form for the migration flows as a function of distance between places, so that our discussion includes many specific models, including different version of the gravity law.

This equation has a clear geometric meaning: The matrix *δ* = [*δ*_{ij}] is anti-symmetric and real, so that it behaves as an infinitesimal generator of rotations: *R*(**x**) = **x** + *δ***x**. We can write the self-consistent solution in terms of these rotation as

$$R({{bf{x}}}^{* })={{bf{x}}}^{* }+({boldsymbol{upsilon }}-overline{upsilon }),$$

(12)

where (overline{upsilon }) is still a function of **x** and **υ** = [*υ*_{i}]. Note that these rotations are associated with very small angles, and that the structure vector is subject to constraints, such as staying positive. The first two terms of Eq. (12) ask for a general solution for **x** that is invariant under rotations. However, the last term, which introduces differences in relative growth rates for different cities, breaks this symmetry explicitly. This specificity of the relative growth rates leads to particular solutions that do not coincide with Zipf’s law. Fig. 2a shows the example of a numerical solution in a non-linear, non-stochastic environment, defined by Eq. (8). The final population structure in this environment is still well defined, as it reaches a steady state. However, compared to the time independent case (see Fig. 1), the final structure takes much longer to unfold and has a stronger urban hierarchy.

### Symmetry restoration and the emergence of Zipf’s law

The stage is now set to consider what it takes to counter the symmetry breaking resulting from the selective structure of the environment. To do that, we must consider the statistics of the demographic rates. We start by rewriting Eq. (10) in terms of the dynamical equations for the structure vectors **x** as

$${x}_{i}(t+1)=left[1+{upsilon }_{i}-overline{upsilon }-{overline{delta }}_{i}right]{x}_{i}(t)=(1+{epsilon }_{i}){x}_{i}(t).$$

(13)

We now specify the relative growth rate ({epsilon }_{i}={upsilon }_{i}-overline{upsilon }-{overline{delta }}_{i}) for each *x*_{i} as a statistical quantity. It is clear by construction that its average over cities vanishes, ({overline{epsilon }}_{i}=0). The only remaining question is how *ϵ*_{i} varies over time. There is strong empirical evidence that on the time scale of years to decades some cities can maintain larger growth rates than others. For example, presently in the US, most cities of the Southwest and South are growing fast, while other cities show slower growth or relative decay, such as many urban areas in the Midwest and Appalachia^{48}. Moreover, it has been true in recent decades that mid-sized cities in the US have been growing faster than either the largest cities or small micropolitan areas, so that even population aggregated growth rates over certain size ranges differ. Going further back in time one could argue that the same cities of the Midwest were once booming as they became the manufacturing centers of the US, especially between the Civil War and the decades post WWII^{49}. Similar arguments may be made for cities in other urban systems, such as in Europe^{50,51}. Thus, it is possible that on very long time scales, *T*—on the order of a century or longer—the growth rates of cities equalize to very similar numbers and therefore the temporal average (langle {epsilon }_{i}rangle =frac{1}{T}mathop{sum }nolimits_{t}^{T}{epsilon }_{i}(t)to 0). This requires that the temporal average of Eq. (11) remains zero, which means that the structure vector is both rotated and dilated ’randomly’ in ways that over sufficiently long times lead to 〈*ϵ*_{i}〉 = 0, for each city. Numerical solutions of the demographic equations under these conditions show a population structure that closely resembles Zipf’s law for long times, see Fig. 2b.

In the following, we show that Zipf’s law is a probability density for this derived dynamics in the long run. To do this, let us characterize the variance in the temporal growth rate fluctuations as ({sigma }_{i}^{2}=langle {epsilon }_{i}^{2}rangle ={sigma }_{{upsilon }_{i}-upsilon }^{2}+{sigma }_{{overline{delta }}_{i}}^{2}+2{rm{COV}}({upsilon }_{i}-upsilon ,{overline{delta }}_{i}) , > , 0). Over the same long time scales, statistical fluctuations in the growth rate may obey a limit theorem, so that they become approximately Gaussian, with variance, ({sigma }_{i}^{2}). Then, Eq. (13) becomes simpler, and acquires a direct correspondence to familiar stochastic growth processes. It can be written in analogy to geometric Brownian motion without drift as

$$d{x}_{i}={epsilon }_{i}{x}_{i}={x}_{i}{sigma }_{i}dW$$

(14)

where *W* is a standard Brownian motion. We now see how demographic dynamics can be simplified through a chain of assumptions into a form consistent with typical stochastic processes leading to Zipf’s law as proposed by Gabaix^{20,52}.

Equation (14) can be expressed as an equation for the probability density *P* ≡ *P*[*x*_{i}, *t*∣*x*_{i}(0), 0] of observing *x*_{i} at time *t*, having started with an initial condition where *x*_{i}(0) was observed at time zero,

$$frac{d}{dt}P=frac{{d}^{2}}{d{x}_{i}^{2}}{sigma }_{i}^{2}{x}_{i}^{2}P=frac{d{J}_{{x}_{i}}}{d{x}_{i}}.$$

(15)

The last term describes a probability current, ({J}_{{x}_{i}}=frac{d}{d{x}_{i}}{sigma }_{i}^{2}{x}_{i}^{2}P({x}_{i})). This equation is exactly solvable, see Methods for technical details. To find the stationary solutions, we ask that the right hand side of Eq. (15) vanishes. The first of the two solutions corresponds to a vanishing current, ({J}_{{x}_{i}}=frac{d}{d{x}_{i}}{sigma }_{i}^{2}{x}_{i}^{2}P({x}_{i})=0) and is (P=frac{c^{prime} }{{sigma }_{i}^{2}{x}_{i}^{2}}), where (c^{prime} ) is a normalization constant. If ({sigma }_{i}^{2}={sigma }^{2}) independent of *x*, as proposed by Gibrat’s law, we can drop the index so this becomes Zipf’s law, (P(N)={P}_{{rm{z}}}({N}_{i}) sim frac{c}{{N}^{2}}). Note for example that if ({sigma }_{i}^{2} sim {N}^{-alpha }) (*α* may be negative), which violates Gibrat’s law in terms of fluctuations of the growth rate, then *P* ~ 1/*N*^{2−α}. Thus, the city size dependence of growth rate fluctuations will change the exponent in Zipf’s law and may destroy scale-invariance altogether if it is not a power law. The second solution applies for constant current, i. e. *J*_{x} = *J*, which leads to (P=frac{c^{primeprime} }{{sigma }^{2}x}), where *c*″ is a normalization constant. This represents a flow of probability across the urban hierarchy. Both solutions are attractors of the stochastic dynamics that emerge for long times, *t* ≫ 1/*σ*^{2}.

We find that in the limit where the relative difference in vital rates vanishes due to fluctuations, the structure vector becomes rotationally invariant on average and the angular symmetry of **x** is effectively restored. This leads to an effective symmetry of the relative city size distribution on the time scale *t*_{r} = 1/2*σ*^{2}, when growth rate fluctuations vanish. The time *t*_{r} can be estimated using US data to be typically very long, on the range of many centuries or even millennia. Under these conditions all cities share statistically identical dynamics and can be interchanged, resulting in Zipf’s law as the time average of population size distributions, see Fig. 3. In the same limit, the anti-symmetric components of intercity flows will average out and the gravity law will emerge in its conventional form. However, the observations of city sizes at each time are samples of this distribution and may reflect decade-long observable positive and negative preferences for certain cities. Figure 4 shows this effect for the US urban system using decennial census data from 1790 to 1990. For time periods of a few decades, the temporal average does not visibly converge to Zipf’s law: The blue line in the inset of Fig. 4 depicts this effect. However, the cumulative temporal average of the size distributions starts to approximate Zipf’s law after about five decades. The red line (inset) shows the KL divergence between the cumulative temporal average and Zipf’s law, starting from 1790 to subsequent census.