
Characteristics of patients in the study
Patients receiving GT who were included in the study were treated in the context of clinical trials or early access programmes approved by ethical committee and competent regulatory authorities. The treatment was administered at the Bone Marrow Transplantation Unit at the San Raffaele Scientific Institute in Milan, Italy. We have complied with all the ethical regulations for retrieving biological materials from patients receiving GT. Parents signed informed consent for research protocols approved by the San Raffaele Scientific Institute’s Ethics Committee (TIGET06 and TIGET09). All patients received autologous HSPC transduced with transgene encoding lentiviral vectors under the same transduction protocol, as previously described2,4,5,6. Patients with MLD received full myeloablative conditioning2,3,22 with busulfan at doses ranging from 10 to 14 mg kg−1. Patients with WAS received a reduced-intensity conditioning regimen consisting of the combined administration of a monoclonal antibody against CD20, busulfan (7.6 to 10.1 mg kg−1) and fludarabine (60 mg m−2)4,5,15,17. In patients with β-Thal, the conditioning consisted in thiotepa (6–8 mg kg) and dose- adjusted treosulfan (14 g m−2) administration6. Patients with MLD were enrolled in a phase 1 or 2 clinical trial (NCT01560182) or treated with a hospital exemption programme or compassionate use programme. A total of 29 patients had been treated with the HSPC gene therapy clinical protocol for arylsulfatase A deficiency). Sixteen of the treated patients (Pt16, Pt32, Pt47, Pt02, Pt20, Pt34, Pt31, Pt03, Pt37, Pt33, Pt08, Pt53, Pt23, Pt40, Pt25, Pt28) were affected by late infantile MLD in a presymptomatic stage and have been identified by molecular and biochemical tests in the presence of at least an affected older sibling, whereas 13 patients (Pt38, Pt01, Pt10, Pt43, Pt42, Pt04, Pt30, Pt51, Pt44, Pt18, Pt50, Pt14, Pt07) were affected by early juvenile MLD in a pre- or early-symptomatic stage. Patients were treated with a myeloablative busulfan conditioning regimen administered before reinfusion of the engineered HSPCs3,22. Fourteen male patients with WAS for whom no human leukocyte antigen–identical sibling donor or suitable matched unrelated donor was available underwent lentiviral GT after a reduced conditioning regimen protocol. Patients Pt52, Pt21, Pt48, Pt13, Pt29, Pt11, Pt39 and Pt17 were enrolled in an open-label, non-randomized, phase 1 or 2 clinical study5 registered with ClinicalTrial.gov (number NCT01515462) and EudraCT (number 2009-017346-32). The other patients with WAS were treated under an early access programme, compassionate use programme or hospital exemption. Among patients with β-Thal, three adults and six children with β0 or severe β+ mutations were enrolled in a phase 1 or 2 trial (NCT02453477) for intrabone administration of GLOBE lentiviral vector–modified HSPCs after myeloablative conditioning with treosulfan–thiotepa6.
Sample collection
PB and BM samples were obtained from patients before and after gene therapy at different times. From each patient, 12 ml of blood was collected at time points of 30 days, 60 days (only for PB sample), 90 days, 6 months during the first year and once every year and half year thereafter. BM cells were isolated by aspirate of 12 ml and specific cell lineages purified as previously described2,4,6. Briefly, PB and BM mononuclear cells from patients receiving GT were isolated using Ficoll-Hypaque gradient separation (Lymphoprep, Fresenius). Mature PB lineages and BM progenitors were purified using positive selection with immunomagnetic beads according to the manufacturer’s specifications (Miltenyi Biotec) and genomic DNA was extracted with the QIAamp DNA Blood Mini or Micro Kit (Qiagen). CD34+ cells were isolated using the CD34+ MicroBead Kit UltraPure, human (lyophilized), MILTENYI BIOTEC CD15+ were isolated from granulocytes cells using CD15 Micro Beads Human, MILTENYI BIOTEC. CD13+ (CD13 Antibody, Antihuman, Biotin and Antibiotin Micro Beads UltraPure), CD3+ (CD3 Micro Beads Human, MILTENYI BIOTEC), CD56+ (CD56 Micro Beads Human, MILTENYI BIOTEC), GLYA+ (Glycophorin A CD235a Micro Beads, human, MILTENYI BIOTEC S.R.L), CD19+ (CD19 Micro Beads Human, MILTENYI BIOTEC) were isolated in sequence from mononuclear cells after CD34+ purification. PB mononuclear cells were isolated by aspirate of 12 ml and specific cell lineages were purified. From granulocytes, CD15+ were isolated using CD15 Micro Beads Human, MILTENYI BIOTEC). From peripheral mononuclear cells CD14+ (CD14 Micro Beads Human, MILTENYI BIOTEC), CD19+ (CD19 Micro Beads Human, MILTENYI BIOTEC), CD3+ (CD3 Micro Beads Human, MILTENYI BIOTEC) and CD56+ (CD56 Micro Beads Human, MILTENYI BIOTEC) were isolated in sequence.
Genomic DNA was extracted using the QIAamp DNA blood mini kit (Qiagen) or QIAamp DNA Micro Kit (Qiagen), based on a cell pellet. The QIAamp DNA Micro Kit was used with fewer than 1 × 106 cells, whereas the QIAamp DNA Mini Kit was used with a dry pellet from 0.5 × 106 to 5 × 106 cells.
Retrieval of vector ISs
The genomic DNA from the patients’ cells was extracted and split in three technical replicates that were subjected to custom PCR amplification to retrieve vector ISs, initially the linear amplification-mediated (LAM)-PCR27 and in more recent samples the sonication linker-mediated (SLiM)-PCR13. Briefly, the SLiM-PCR procedure consists of the following steps: (1) fragmentation by sonication of the DNA, (2) ligation of the fragments to a linker cassette and (3) two consecutive rounds of PCR, to specifically amplify vector to cellular-genome junctions, by using primers annealing to the vector genome end (long terminal repeats) and the linker cassette. The list of primers and sequences used for SLiM-PCR procedure is reported in Supplementary Tables 7–9. Primers contain DNA barcodes, which allow univocal barcoding of all the SLiM-PCR replicates, and sequencing adaptors that allow multiplexed sequencing on Illumina sequencers. The list of samples with the details on the applied PCR procedure and the number of reads retrieved is reported in Supplementary Table 2.
The resulting PCR products were sequenced using Illumina platforms (Mi-seq, Hi-seq, Next-seq and Nova-seq), for a total amount of 194 sequencing pools for more than 17,000 PCR reactions and more than 14.5 × 109 raw reads.
Identification of vector ISs
We used VISPA2 (ref. 66) to identify ISs from PCR samples sequenced with Illumina paired-end reads. Briefly, for each Illumina sequencing library, paired-end reads are filtered for quality standards, barcodes are identified for sample demultiplexing, vector sequences are trimmed from each read and the remaining cellular genomic sequence is mapped on the reference Human Genome (Human Genome_GRCh37/hg19 February 2019). For the quantification of the number of genomes representing each clone, we adopted an estimation method based on the number of distinct fragments for each IS, SonicLength67, such that each IS abundance will be proportional to the initial number of contributing cells, allowing to estimate the clonal abundance in the starting sample. The final list of unique ISs is composed of univocally mapped loci annotated with the nearest RefSeq gene.
We then used a new R package, ISAnalytics28 to integrate the output files of VISPA2 and perform downstream analyses of ISs, from quality controls to the analyses of shared ISs among samples. A detailed report and code of ISAnalytics is available in the GitHub repository (https://github.com/calabrialab/Code_HSPCdynamics). We first removed the same IS present in different independent samples, named collisions, using the same approach previously described2. Briefly, given two patients and the list of shared ISs between the two patients, we assign to one patient the IS if it was observed and retrieved for the first time in that patient or if the relative abundance of that read was at least ten times higher than the other patient. The analysis and removal of collisions in independent samples was realized using ISAnalytics. We then performed data quality analysis of sequencing samples by analysing the number of reads of each PCR sample in each sequencing pool, and we removed the samples containing a number of raw reads highly under-represented (threefold less) than the average number of reads of the other samples in the pool. The quality control analyses included (1) the removal of sequencing samples with a number of raw reads threefold less than the average number of reads per sample in the pool, (2) the removal of contaminations identified as identical ISs retrieved in independent samples as previously described2. Through these filtering approaches, 99 samples (0.7%) were removed because they had a number of raw reads below the acceptance threshold (less than threefold difference on raw reads average per pool), whereas the amount of removed contaminant ISs was less than 1%. For common insertion site analysis, we used Grubbs test for outliers68 (implemented in ISAnalytics). Briefly, for each patient, we computed the targeting frequency of each gene using the number of ISs landing in the gene body ±100 kbp and then normalized by the gene length. After the log2 transformation of the gene distribution frequency, we computed the Grubbs test for outliers to identify genes with a targeting frequency significantly higher than the average observed frequency. The analysis of the cumulative ISs over time allowed us to quantify how many new ISs we observed at each collection or sample compared to all previous samplings. To this aim, we used all ISs observed at each time point. We computed the cumulative number of ISs using ISAnalytics with function ‘cumulative_is’.
Clonal population diversity
An ecological system is maintained stable if the populating species are in equilibrium, suggesting a healthy environment. Population dynamics can be characterized in terms of species diversity, using for example a quantitative measure such as the Shannon diversity index (H index). The H index accounts for the number of distinct species (richness) and their relative abundance with the following formula:
$${H}^{{\prime} }=-\mathop{\sum }\limits_{i=1}^{R}{p}_{i}{\rm{ln}}{p}_{i}$$
where i is a clone (an IS), pi is the clonal abundance and R is the set of clones.
Several clonal studies already approached the analysis of heterogeneity and complexity of the vector-marked cells over time, tissues and differentiated lineages, using IS as surrogate of different species and the IS size as species abundance, and then measuring richness and evenness over time to quantify long-term efficacy (when the H index is maintained at high values) or observe malignant occurrences (when the H index drastically decreases over time). The diversity index has been computed using the R package Vegan and is integrated in ISAnalytics. Z-score normalization has been performed on the H index by patient and time point using the R function ‘scale’.
As different PCR methods (LAM-PCR and SLiM-PCR) were used to retrieve ISs, which have different retrieval efficiencies, we performed a comparative analysis of clonal diversity only on samples in which the same technology within the same clinical programme was used. For MLD and β-Thal, we selected only the samples processed by SLiM-PCR whereas for patients with WAS only LAM-PCR samples were available. We used the Bayesian multivariate linear regression algorithm to model and correct biases induced by different technical confounding factors as well as their interactions. These factors included PCR method, amount of DNA used, dose of CD34+ infused per kg, VCN, sequencing depth, patient gender and age.
Estimate of active repopulating HSPCs
Similar to what was done in our previous studies2,4, we used short-living cells as the readout of HSC output. Specifically, from our data, we selected myeloid-derived cells from PB samples (cell markers CD14+ and CD15+) from each time point and we then filtered ISs from these cell lineages by sequence count sum less than three across time. When estimating HSPC from several samples, we used mark-recapture statistics that leverage on the number of recaptured ISs across the different observations to estimate the size of the source population. Population size estimate has been performed in R using the package Rcapture with Chao1 (ref. 69) model considering a closed population. For the estimate of HSPCs over time, we used triplets of consecutive time points using a sliding window from the first month in vivo to the last follow-up month. For the estimate of the size of HSPCs per patient, we selected used all stable time points (from months 9 or 12 after gene therapy). Then, we corrected the number of estimated HSPCs by the VCN in in vivo cells if VCN > 1. Moreover, when we correlated the HSPC size to the cell dose, we corrected the dose by the engraftment using the in vitro VCN (if VCN < 1) as reported16. HSPC estimate has been implemented in ISAnalytics, here run with the following prototype function ‘HSC_population_size_estimate’.
Clonal composition through IS sharing
To analyse the composition of long-lasting clones by tracking the originating time points (that is, the first observed time point of each specific IS) for each patient, sample and tissue, we processed the matrix of ISs to label each clone with the first observed time point (among the previous ones) such that if the clones were recaptured in a subsequent time point we could backtrack from which month it has been initially found. We then computed the percentage of the recaptured clones on the total observed clones at each time point. For this purpose, we used the function ‘iss_source’ in ISAnalytics.
To quantify the composition of long-lasting clones, we isolated ISs retrieved from 36 months (both PB and BM) and averaged the percentages for each source time point. Source time points were then grouped by haematopoietic phases: 1–6 months as early, 12–18 months as mid and 24–30 months as steady. Once collected data from each patient and calculated the percentage in each lineage, we then performed the non-parametric Kruskal–Wallis statistical test on the means between pairs of groups (early versus mid, early versus steady, mid versus steady) using the R function compare_means and returned P value after false discovery rate (FDR) correction.
Sharing ratio and Z-score
A single HSPC clone can proliferate and differentiate in distinct mature lineages. Using ISs, we can account for how many lineages have been retrieved for every single clone, thus studying the output of HSPCs. Specifically, given two distinct lineages (namely A and B), we can calculate the number of shared ISs or not shared ISs (as exclusive of A or B) on the overall number of clones per set (A or B). This ratio (or relative percentage) is called the ‘sharing ratio’.
The sharing ratio for CD34+ BM cells is calculated as the number of shared IS retrieved in both CD34+ BM cells and each mature cell marker on the total number of CD34+ BM IS. From each patient’s matrix of IS, we applied the filter for intra-marker contamination (as reported in previous works2,4, here using ISAnalytics with the function named ‘purity_filter’ with parameters impurity_threshold = 10, min_value = 3) and we then computed the sharing ratio. The parameter ‘input_threshold’ is the fold difference, set to 10 as already configured in past analyses, whether ‘min_value’ is the minimum number of reads to accept and IS as true.
Profiles of the sharing ratio over time were generated using the log-spline function in R. The Z-score was computed within each patient, time point and tissue by the R function ‘scale’. The statistical test to compare means was the non-parametric Kruskal–Wallis test with FDR correction. For data selection for time course analysis, repeated measurements were obtained as the average of ISs between 24 and 60 months, corresponding to the overlapping time points among the three clinical studies, with a minimum number of three observations per patient.
The sharing ratio for whole populations (in BM and PB separately) is calculated as the number of shared IS retrieved in both whole populations and each mature cell marker on the total number of IS retrieved in the whole population. Once computed the sharing ratio, we then compared these values among the lineages to observe any skewing or imbalance within each clinical trial and performed the non-parametric Kruskal–Wallis test (with FDR correction) on the means to quantify the differences. When we compared the different clinical trials, we needed to apply Z-score statistics and transform the data before data comparison.
HSC lineage commitment
If we considered pairs of sets in the sharing ratio, allowing a clone to belong to different mature lineages, to study lineage commitment, we required a method that accounted for sharing across all lineages together, assigning each IS unambiguously to a specific intersection (Fig. 3a). In our study, we defined the following sets to intersect for each patient and time point: CD34+, erythroid, myeloid, B and T cells. A single IS (with a minimum of three reads) exclusively belongs to either a single lineage or to an intersection of several lineages. For each time point, we classified an IS as ‘multi-lineage’ if it was shared among at least two mature lineages (excluding CD34+ cells). Otherwise, it was labelled as ‘uni-lineage’ if the IS was found exclusively within one mature lineage (or shared with CD34+ cells).
Computationally, we addressed this by generating a binary table that included all possible combinations of intersections for each time point (see Supplementary Methods for more details).
If a clone has been found in several time points, then it is considered ‘recurrent’, otherwise it is a ‘singleton’. The union of recurrent and singletons sum to all patient IS. For the analysis of lineage commitment in a percentage, the sharing ratio was then computed for each time point placing at the numerator the number of IS observed in a specific group (among multi-lineage of one of the uni-lineage) and at the denominator the number of clones captured at that time point (belonging to one of the groups, recurrent or singleton, if the ratio was within the group). For single clone lineage commitment, we tracked the classification dynamics of each clone over time and then we selected recurrent clones in early and late phases (less than 24 months and more than 24 months, respectively). For the analyses of the abundance of committed clones over time, we then joined each IS with the relative abundance (quantified with the number of genomes) and averaged the clonal abundance within each assemblage.
Gene Ontology and feature annotation
Gene Ontology has been realized using R packages for Gene Ontology clusterProfiler, the annotation DB org.Hs.eg.db and msigdbr. Semantic similarity has been done with the R package GOSemSim70. Feature annotations have been realized with the R packages ChIPseeker and database TxDb.Hsapiens.UCSC.hg19.knownGene (up-set plot) and the closest genes have been annotated with RefGene table (UCSC database hg19). The circos plot was generated by the R package ‘circlize’.
Statistical analysis
For pair-wise correlation and PCA analysis we used the R packages PerformanceAnalytics and factoextra, respectively. Kruskal–Wallis test performed in the R package ggpubr and stat_compare_means function. The R library ‘stats’ has been used for all P value corrections. The following libraries and software version have been used for IS analysis and statistics: Prism (v.10); R v.4.0.3 (package dependencies linked to this version include ISAnalytics, scales, dplyr, rstatix for wilcox_test, splines, Rcapture, vegan, psych, cluster, DescTools, fpc, pca, factorextra, ggplot). The following packages have been used for Good–Turing and Bayesian regression: R v.4.2.2 (2022-10-31), plyr_1.8.9, tools_4.2.2, jsonlite_1.8.8, grid_4.2.2, tidyselect_1.2.0; Python v.3.8.15, packaged by conda-forge, sklearn v.0.2, joblib v.1.2.0, numpy v.1.24.1, scipy v.1.10.1 and threadpoolctl v.3.1.0.
Multivariate Bayesian linear regression
Accurate determination of species counts and their respective abundances may face challenges due to technical or patient-specific confounders, potentially hindering the identification of true variability within the data. To overcome this, in our analyses focusing on clonal diversity, the assessment of long-lasting clones and HSC lineage commitment, we included these factors as covariates in a multivariate Bayesian linear regression model. The set of covariates includes positive real values and categorical values: time after gene therapy (positive real value), amount of sequenced DNA (positive real value), next-generation sequencing technology (categorical value), PCR methodology (categorical value), average VCN (positive real value), dose of cells (CD34 pro per kg) (positive real value) and patient gender and age (positive real value). Further details are reported in Supplementary Methods.
To model the nonlinearity among the response and the predictors, a second-order spline transformation on the input data X is performed before fitting the model and categorical predictors are included in the model by means of one-hot encoding. The preprocessing has been carried out through the functions SplineTransformer() and OneHotEncoder(), whereas for the model fitting, the ARDRegression() function was used; all functions are implemented in the scikit-learn Python package. We carried out a Bayesian linear regression to predict the Shannon diversity index (H index), the Sharing ratio (S ratio) and the CD34 output, taking into account confounding covariates to reduce their effect on the observed quantity.
Good–Turing estimator
The Good–Turing model has been used to estimate the number of undetected species in an assemblage. This estimation can be extended to evaluate the number of shared species between two assemblages and it is particularly useful in situations in which rare species, often not detected, make up a significant portion of the total. This method has been used in ecology to correct the bias between the observed and true number of species in a given area. By treating each IS as a species and cell markers as distinct assemblages, we can apply this approach to better estimate the richness of ISs within specific cell markers and the number of shared ISs between CD34 and each marker, accounting for undetected species. Two assemblage models exist and have been used: single assemblage and two assemblages. In the single assemblage model, the true number of species includes both observed and undetected species. The undetected species count is estimated based on the rarest observed species by using singletons and doubletons. This model is extended to two assemblages when estimating shared species between two sets. The true count of shared species accounts for those undetected in one or both assemblages. Estimations are based on observed species frequencies and population sizes, with specific formulas for undetected species in each case. A detailed formulation is reported in Supplementary Methods.
We applied the Good–Turing estimators to account for undetected ISs in lineage analyses. In single assemblage cases, we calculated the adjusted richness of CD34+ cells. For two assemblages, we estimated shared ISs between CD34+ cells and mature cell markers. The adjusted sharing ratio was then used in a multivariate Bayesian linear regression model to analyse lineage commitment and patient heterogeneity. We categorized each IS as multi- or uni-lineage and calculated sharing ratios, correcting them for confounding factors using the Good–Turing method and Bayesian analysis.
IS confidence interval by bootstrap
Each IS at each time point has been classified as either multi- or uni-lineage if shared across distinct cell lineages or a single cell lineage, respectively. To test whether the potential subsampling bias for poorly abundant clones could result in misclassification of an IS as uni-lineage instead of multi-lineage, we further evaluated the robustness of the commitment results by performing a parametric bootstrap with incremental subsampling percentages of the number of reads per IS and replicating this procedure ten times (ten is the number of randomizations). By this approach, we can evaluate the robustness and stability of the assigned label class (multi- or uni-lineage) for each IS during early, less than 24 months, versus late, greater than or equal to 24 months, phases of haematopoietic reconstitution by assessing how many times each single IS maintains the same commitment or multi-lineage state (in percent) generating the confidence interval. The standard error of the confidence interval is then computed across all IS.
In more detail, for each experiment i, let Xi be the observed read counts matrix. In this matrix, the rows represent the distinct ISs and the columns represent the distinct time points and cell markers, with the column-wise sums yielding the total number of observed reads for the given sample. We assumed to collect a subsample of the total number of reads, testing the robustness using 50, 70, 80 and 90% of the observed read counts. For each subsampling percentage p, we generated a matrix X(p)i. To generate several stochastic bootstrap replicates, we then sampled each matrix entry from a negative binomial distribution, using as mean and overdispersion the subsampled number of reads X(p)i. For each replicate, we performed the classification of the ISs as multi- or uni-lineage. We then compared each IS’s classifications with those obtained from the observed matrix (without subsampling, 100%) using a measure of accuracy, defined as accuracy = (number of correct predictions)/(number of total predictions).
Somatic mutations with the myeloid panel
DNA was extracted from PB mononuclear cells of 71 MLD (23 patients) and 27 β-Thal samples (nine patients) using the QIAamp DNA Blood Mini Kit (Qiagen). Twenty nanograms of genomic DNA were subjected to the AmpliSeq for Illumina Myeloid Panel (Illumina) procedure, following the manufacturer’s instructions. Special control DNA samples to assess the performance of next-generation sequencing assays for the detection of somatic mutations (Horizon Discovery HD701 and HD729) were processed simultaneously. PCR products were barcoded with AmpliSeq for Illumina CD Indexes Set A and B and pooled in one library that was paired-end sequenced on 1 Nova-Seq SP500 flow cells.
Sequences were aligned to the human reference genome (hg19/GRChg37) using BWA-MEM. Pileup files were generated using Samtools (options -B -q 1), followed by variant calling with VarScan2 (options –min-coverage 100, –min-var-freq 0.01). To filter out false positives, we removed mutations that: (1) appeared in several samples, (2) were in low-covered amplicons (fewer than 200 reads), (3) were germline (49 < VAF < 51 or VAF > 99), (4) occurred in the last 3 bp of reads or (5) were in poly-T or poly-A regions. Most somatic mutations had a VAF of less than 2%. Detected mutations were annotated using gnomAD, dbSNP and ClinVar databases. Further details are reported in Supplementary Methods.
TPO and EPO assays
To measure TPO in patients with WAS, we used the Quantikine ELISA—Human Thrombopoietin Immunoassay kit on the frozen plasma of patients according to the manufacturer’s instructions.
EPO was dosed from each patient’s serum by chemiluminescence with the Immulite 2000Xpi. Normal values are 3–24 mU ml−1.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.