Phylogeography and genomic epidemiology of SARS-CoV-2 in Italy and Europe with newly characterized Italian genomes between February-June 2020

A total of 192 SARS-CoV-2-Italian genomes were newly generated for this study. Travel history was available for 137 (71.3%) patients. All of them reported no international travel in the two weeks preceding the onset of symptoms. One case of contact with a traveller from Bangladesh was reported. Main patients’ information is reported in Table 1.

Table 1 Characteristics of the studied populations.

Analysis of the Italian dataset

Genomic diversity on the basis of the lineage/clade classification

The most prevalent lineages were B.1 (n = 222, 47.7%, including 32 lineages derived from B.1 such as B.1.76, B.1.91, B.1.104, B.1.142, B.1.153, B.1.177, B.1.179, B.1.222, B.1.225, B.1.356, B.1.610) and B.1.1 (n = 141, 30.3%, including 19 lineages derived from B.1.1 such as B.1.1.28, B.1.1.61, B.1.1.161, B.1.1.202, B.1.1.232, B.1.1.331 and B.1.1.372) followed by the lineages B (n = 73, 15.7%) and B.1.1.1 (n = 29, 6.2%). The Nextclade classification showed a high prevalence of the clades 20A (n = 207, 44.5%) and 20B (n = 141, 30.3%), followed by 19A (n = 84, 18.1%), and 20D (n = 29, 6.2%). Only 4 strains were clade 20C (0.9%).

The geographical distribution of the SARS-CoV-2 lineages/clades in Italy (Fig. 1) showed several different epidemiological patterns. Some regions mainly in Northern-Central Italy (Friuli Venezia Giulia, Marche, Emilia Romagna, Lombardy, Lazio) showed a high prevalence of B.1/20A (between 70 and 100%). Other regions, mainly in the Central-Southern Italy (Sardinia, Sicily, Abruzzo, Apulia) had the highest prevalence of B.1.1/20B (from 57% to more than 90%). Other regions showed an equal proportion of both lineages (Basilicata, Liguria, Tuscany, Umbria). Two regions had a unique pattern: Veneto, in which the most prevalent lineage was B/19A (66/97, 68%) and Piedmont, showing 73% (27/37) of B.1.1.1/20D lineage.

Figure 1

Spatial distribution of lineages and clades. (a, b) Map of Italy reporting the lineage distribution (a) and the clade assignment (b) in every region.

A change in the prevalence of the SARS-CoV-2 lineages between February and May was observed. The most frequently detected lineages were B/19A and B.1/20A in February and first half of March, representing 88% of all the genomes obtained in that period of time. Subsequently, starting from the second half of March, B.1.1/20B and other lineages (B.1.1.1/20D) became more prevalent (60.7% between 15 and 31 March, 46.2% in April, 51.6% in May).

Genetic distances and mutation analyses

The overall mean p-distance between all the Italian isolates was 3.9 (SE: 0.4) s/10,000 nts corresponding to a mean of 10.1 (SE: 1.01) substitutions per genome. Genetic distance remained small with a mean of 10.23 (SE: 1.09) substitutions, of which 3.13 (SE:0.59) were synonymous and 6.85 (SE:0.79) non-synonymous. A higher heterogeneity was observed in sequences from Piedmont (20.4, SE: 1.6) and Sicily (18.4, SE: 1.2) compared to other regions. Interestingly, an increasing number of differences over time was recorded, from 5.7 (SE: 0.81) in February to 20.1 (SE: 1.1) in May.

Seventeen amino acid substitutions were present in more than 10% of the Italian isolates but only one of them was in the spike protein (D614G). No mutations were observed in the receptor binding domain (RBD) in the whole Italian sequence dataset. Only eleven B lineage sequences in the whole dataset, all from Veneto (clade 19A), carried T1543I in orf1a. Overall, the B sequences showed a distinct mutations pattern from those of other lineages, including mutations L3606F, G251V in orf1a and orf3a, respectively. The B.1.1.1 lineage presented additional substitutions in comparison with B.1 and B.1.1 lineages such as T1246I in orf1a in all isolates. Table 2 shows the most frequent amino acid substitutions stratified by lineage and clade.

Table 2 Aminoacid substitutions found in more than 10% of sequences stratified according to lineage and clade.

Phylogenetic analysis by ML and Bayesian methods

The phylogenetic analysis by Bayesian method assigning each tip to its lineage showed 4 large highly significant clades corresponding to the main circulating lineages in Italy (B, B.1, B.1.1, and B.1.1.1) (Fig. 2). B1, B.1.1 and B.1.1.1 were nested into each other, while B segregated independently. Chinese sequences tended to segregate at the outgroup of the Italian clades within B and B.1 lineages. The estimation of the tMRCAs of the main clades suggested that B lineage spread to Italy in the last week of January 2020, lineage B.1.1 emerged later, in mid-February and B.1.1.1 was the latest, spreading in early March. ML analysis showed similar tMRCAs but with broader confidential intervals (Table 3).

Figure 2
figure 2

SARS-CoV-2 Bayesian phylogeographic tree of 479 strains. Large red and purple circles indicate highest posterior probability ranging from 1 to 0.9. The branches are coloured based on the most probable lineage of the descendent nodes.

Table 3 Time of the Most Recent Common Ancestor (tMRCA) estimates and confidence intervals (CI) of the mains lineages.

Phylogeography in Italy

The phylogeography of SARS-CoV-2 identified China as the location of the tree root (Fig. 3 and Supplementary Fig. 1). Four main large clusters were identified. The earliest clusters were in Lombardy and Veneto, directly linked to China, while later (around the second half of March) other clusters appeared in Abruzzo and Piedmont. Combining the phylogeography with the SARS-CoV-2 lineages, the reconstruction of the ancestral state showed that lineage B and B.1 spread from China to Veneto and Lombardy, respectively. While lineage B apparently remained confined to Veneto (and it was successfully extinguished), lineage B.1 further spread from Lombardy to other Italian regions (Veneto, Emilia Romagna, Abruzzo, Marche, Apulia, Friuli Venezia Giulia and Lazio). Lineage B.1.1 spread from central Italy (Abruzzo) to other Italian regions (Veneto, Lombardy, Apulia, Sardinia). Finally, the lineage B.1.1.1 emerged later and remained apparently localized in Piedmont without further spread to other regions.

Figure 3
figure 3

Ancestral reconstruction of SARS-CoV-2 lineages B.1 using the Italian dataset. The figure shows the compressed visualization produced by PastML using marginal posterior probability approximation (MPPA) with an F81-like model. Different colours correspond to different Italian geographical regions and lineages. Numbers inside (or next to) the circles indicate the number of strains assigned to the specific node.

Analysis of the international data set

Italian clusters

The phylogenetic analysis by ML of the entire dataset including Italian, European and Chinese genomes showed that the majority of the Italian isolates were dispersed in the entire tree. A total of 80 (out of 465, 17.2%) Italian isolates were included in 22 highly supported clusters (Table 4). Of these, 12 (54.5%) were within the lineage B.1, five (22.7%) were B.1.1/20B, three (13.6%) were B.1.1.1/20D and two (9.1%) were B/19A. All but one B.1 clusters were classified as 20A clade. Cluster #19 was the only exception and included four Italian strains classified as clade 20C (all from Rome), showing a mean tMRCA falling in March 2020. Three clusters (13.6%) were singletons (including only single Italian isolates not linked to other Italian sequences), probably corresponding to sporadic introductions followed by limited circulation, while the remaining 19 clusters encompassed at least two Italian isolates, suggesting a local transmission. Thirteen of these (68.4%) included only Italian strains (suggesting a mainly local circulation of this lineage), while 6 (31.6%) included isolates from other European countries, and one of them (B.1) included also one Chinese genome.

Table 4 Main characteristics of the identified clusters.

The estimate of the clusters tMRCA by ML method confirmed that the first transmission events in Italy dated around the second half of January and early February. Eighteen clusters had a common ancestor dating before the introduction of the containment measures in our country. In particular, B.1/20A clusters predominated (10/14) at earlier time points (before March) while in March other clades (20B, 20C and 20D) prevailed (6/8). Moreover, the mixed and singleton clusters were prevalent at the beginning, while pure Italian clusters were the only clusters observed after the lockdown. The earliest cluster (#1), was lineage B.1/20A, dated back to average 20/01/2020 (CI95% 08/01–24/01/2020) and included only four North Italian strains: one from Lodi, two from Milan (the locations where autochthonous COVID-19 cases were firstly identified in Italy) and one from Piacenza. The first B.1.1 cluster dated back to 10/02/2020 (CI95% 28/01/2020–12/03/2020) and included 3 Italian isolates from Abruzzo. Three B.1.1.1/20D clusters dated back to 02 March (CI95% 22/02/2020–02/03/2020). Only two small Italian clusters supported by significant bootstraps were observed within the ML tree including B/19A isolates. In particular a single pure Italian cluster included 11 genomes from Veneto (province of Padua), characterized by the substitution T1543I in orf1a, not detected in any of the other B/19A genomes in our international dataset.

Phylogeographical analysis in Europe

Combining the ancestral state reconstruction for the location with the lineage (Fig. 4 and Supplementary Fig. 2), the analyses showed that B.1 probably originated in China and spread to several European countries reaching Italy several times, forming a large cluster which included initially 59 (around the first week of March) and finally 198 genomes, and 6 further independent introductions mainly corresponding to a group of genomes characterized only by the substitution D614G but lacking other substitutions, in particular the P314L in the RdRp identifying the clade 20A (lineage B.1, clade 19A).

Figure 4
figure 4

Ancestral reconstruction of SARS-CoV-2 lineages B.1 using the European dataset. The figure shows the compressed visualization produced by PastML using marginal posterior probability approximation (MPPA) with an F81-like model. Different colours correspond to different European countries and lineages. Numbers inside (or next to) the circles indicate the number of strains assigned to the specific node. The joint ancestral scenario (Joint) and maximum a posteriori (MAP) predictions are shown for the uncertain nodes (shown as octagonal icons). CN, China; IT, Italy, EU, Europe.

Starting from Italy, B1/20A spread to other European countries being also reintroduced later to China. A second large Italian cluster, including 138 genomes of lineage B.1.1, emerged from the Italian B.1 cluster. Multiple introductions of B.1.1 were observed from Italy to other European countries. A large cluster (n = 203 genomes) corresponding to B.1.1.1 lineage appeared in Europe in the early March and reached Italy only later (second half of March) (Fig. 4). A total of 7 nodes remained undetermined. A separate analysis conducted distinguishing European countries (rather than considering a single generalized group), generally confirmed this scenario and allowed to reconstruct in greater detail the dispersion of the epidemic in the European countries (Supplementary Fig. 3).

The analysis of lineage B showed that only 2 nodes remained undetermined between Europe and China (Supplementary Fig. 4). The visualization (Fig. 5) suggested several introductions from China to Italy starting with the end of February. A single cluster corresponding to the previously described cluster#5 was observed, while the other strains apparently represent multiple independent introductions forming small groups of no more than 2 sequences. Two sporadic introductions from Europe were also observed. Unlike the ancestral reconstruction for B.1 lineage, this scenario was different as the migratory flows seem to stop in Italy without further spread.

Figure 5
figure 5

Ancestral reconstruction of SARS-CoV-2 lineages B using the European dataset. The figure shows the compressed visualization produced by PastML using marginal posterior probability approximation (MPPA) with an F81-like model. Different colours correspond to different European countries and lineages. Numbers inside (or next to) the circles indicate the number of strains assigned to the specific node. The joint ancestral scenario (Joint) and maximum a posteriori (MAP) predictions are shown for the uncertain nodes (shown as octagonal icons). CN, China; IT, Italy, EU, Europe.

The analysis conducted among European countries (Supplementary Fig. 5), highlighted the same ancestral scenario but did not show any introduction from Europe.

https://www.nature.com/articles/s41598-022-09738-0