An expanded registry of candidate cis-regulatory elements

The registry of cCREs

Anchoring cCREs

We anchored cCREs on two classes of elements: rDHSs and representative transcription factor clusters (TF rClusters). To annotate rDHSs, we refined and filtered DNase hypersensitivity site calls across biosamples and then iteratively clustered and selected the strongest DHS as previously described9. To annotate TF rClusters, we downloaded peak BED files for all transcription factor ChIP–seq experiments from the ENCODE portal with a FRiP score >0.003. For each experiment, we used the ‘preferred default’ peak file. To match the cCRE size distribution, we resized peaks to 150–350 bp. Using the same iterative clustering and selection process that we used for rDHSs, we identified representative transcription factor peaks for each cluster. We then selected all TF rClusters representing at least five experiments that did not overlap an rDHS. For each anchor (rDHS or TF rCluster), we calculated the z-scores of the log-transformed signal for each DNase experiment within a biosample to normalize across datasets. cCREs were defined as regions with a DNase maximum z-score >1.64 or overlapping a TF rCluster.

Classifying cCREs

We classified cCREs into eight classes by analysing DNase-seq, H3K4me3, H3K27ac and CTCF ChIP–seq signals, and transcription factor binding across biosamples. For each cCRE, we calculated z-scores of the log-transformed signal for each mark within each biosample. The maximum z-score for each mark was determined across all biosamples. Based on combinations of high or low signals for these marks, genomic distance to annotated TSSs and overlap with TF rClusters, cCREs were categorized into promoter-like, enhancer-like (TSS-proximal or distal), CA-H3K4me3, CA-CTCF, CA-TF, CA or TF classes. Classifications were performed both agnostic of cell type and in specific biosamples, following previously published methods.

Gene annotations

Unless otherwise stated, all analyses used GENCODE 40 basic gene annotations for human and GENCODE M25 basic annotations for mouse. ENCODE RNA-seq data were uniformly processed by the ENCODE DCC using GENCODE 29 comprehensive for human and GENCODE M21 comprehensive for mouse.

Overlap with repeat elements

We downloaded repeat element annotations from the UCSC Genome Browser RepeatMasker track51 (https://www.repeatmasker.org; accessed November 2021; n = 5,633,664 elements) and used BEDTools (v.2.19.1)52, to calculate the fraction of cCREs overlapping repeat elements by major class (for example, LINEs, SINEs and LTRs). For cCRE sets that showed significant enrichment, we further analysed overlaps with specific repeat families within each class (for example, L1 and L2 for LINEs). Statistical significance was assessed using a two-sided Fisher’s exact test with FDR correction. This analysis was performed for multiple cCRE subsets, including multi-mapping cCREs and silencer cCREs.

Mammalian conservation

To assess evolutionary conservation patterns across cCREs, we used the 241-way mammalian multiple genome alignment generated by the Zoonomia Consortium53. As described20, we calculated two metrics for each human cCRE: (1) N1, the number of aligned species with ≥90% coverage of the cCRE’s nucleotide positions, and (2) N2, the number of species with ≤10% coverage. We then stratified cCREs into three conservation groups based on their distribution:

  • Group 1 (G1): highly conserved elements (N1 ≥ 120 and N2 ≤ 25)

  • Group 2 (G2): actively evolving elements (20 ≤ N1 ≤ 50 and N2 ≤ 120)

  • Group 3 (G3): primate-specific elements (N1 ≤ 50 and N2 ≥ 180)

We used these groupings to compare conservation profiles across different cCRE subclasses, including silencers, MAFF/MAFK+ elements, and previously annotated regulatory elements.

cCRE sequence features

Variational autoencoder

We analysed cCREs classified as active in three cell lines—K562, HepG2 and HCT116—focusing on promoters, proximal and distal enhancers, and CA-H3K4me3 elements for each respective cell type. Each cCRE sequence was adjusted to a fixed width of 300 bp, and its sequence was represented using one-hot encoding. A cell type-specific variational autoencoder (VAE) was trained for each cell type, utilizing a ten-dimensional latent space. The encoder architecture included a convolutional layer, max-pooling, flattening and a fully connected layer, while the decoder consisted of dense, reshape, upsampling and convolutional layers. Models were trained for 25 epochs using the Adam optimizer, implemented in Keras (v.3.8.0)/TensorFlow (v.2.17.0). After training, the mean latent representations of each cCRE were projected into two dimensions using UMAP (umap-learn v.0.5.7, 100 nearest neighbours).

GC content

For each cCRE, we calculated its GC content by counting the number of G and C nucleotides that appear in its sequence and dividing by the total cCRE length.

Transcription factor motif overlap

We scanned cCREs for HOCOMOCO v.11 motifs54 using FIMO (v.5.1.1)55 with the default parameters and the –text flag.

Functional characterization data

Whole-genome STARR-seq

For peak-based analyses, we downloaded BED peak files from the ENCODE Portal and intersected them with cCREs using BEDTools. For enrichment analysis, we intersected peaks (ENCFF454ZKK) from experiment ENCSR661FOW with cCREs active in K562, all other cCREs and non-cCRE size-matched genomic regions (generated using BEDtools shuffle). For fragment-based analyses, we downloaded BAM files from the ENCODE portal and processed them with our CAPRA pipeline (detailed in ‘Calculating CAPRA STARR scores’).

MPRA

For each MPRA experiment we downloaded ‘element quantification’ files from the ENCODE portal. If element coordinates were on hg19, we lifted coordinates to GRCh38 using UCSC liftOver (v.350)56. We stratified tested regions as to whether they completely overlapped a cCRE, partially overlapped a cCRE, or did not overlap any cCRE. We considered all tested regions with a log2 fold enrichment >1 as active. We calculated activity enrichment for each of the three groups (full overlap, partial overlap and no overlap) against the baseline activity of all tested regions.

For enrichment analysis, we intersected tested regions (ENCFF677CJZ) from experiment ENCSR653LWA with cCREs active in K562, all other cCREs and the non-cCRE size-matched genomic regions.

CRISPR perturbation screens

For CRISPR perturbation experiments, we intersected coordinates of CRISPR guide RNAs with cCREs and calculated per-cCRE counts for input libraries and output measurements. We used DESeq2 (v.1.46.0) to calculate cCREs with significant depletion or overrepresentation depending on the exact assay.

CRISPRi–Flow-FISH

For PrimeFlow readout experiments generated by the Engreitz lab57, we downloaded element quantification files from the ENCODE portal. We lifted region coordinates to the GRCh38 genome using UCSC liftOver and intersected them with cCREs using BEDTools.

For hybridization chain reaction (HCR)–FlowFISH readout experiments generated by the Sabeti laboratory, we downloaded CASA elements5 and intersected them with cCREs using BEDTools. We used these elements for enrichment analysis, where CASA elements were intersected with cCREs active in K562, all other cCREs, and the non-cCRE size-matched genomic regions.

Calculating CAPRA STARR scores

Pipeline

The first step of the CAPRA pipeline converts BAM files into fragment BED files using BEDTools for both the input DNA and output RNA fragments. Next, each fragment BED file is intersected with cCREs. Fragments that overlap a single cCRE in its entirety are counted in the ‘solo’ quantifications. Fragments that overlap two cCREs in their entirety are counted in the ‘double’ quantifications. Fragments that overlap at least one cCRE in its entirety and partial overlap other cCREs are used for paired sweep analyses. Fragments that partially overlap one or more cCREs are excluded. Quantifications for DNA and RNA fragments (including multiple replicates when available) are concatenated into a matrix for each quantification type—solo and double. Quantification matrices are processed using DESeq2 to identify cCREs with higher than expected RNA counts (denoting enhancer activity) or lower than expected RNA counts (denoting silencer activity).

Cell type-specific activity

Using the solo fragment CAPRA STARR scores, we defined K562+ and HepG2 + STARR distal enhancers as follows:

Transcription factor enrichment

Using the HOCOMOCO motif sites identified by FIMO (see above) we calculated the number of K562+ and HepG2+ STARR distal enhancers that overlap a motif for each transcription factor. We considered all transcription factors with greater than 5 TPM in either K562 or HepG2 (n = 234). We compared motif overlap between the two cell types using a two-sided Fisher’s exact test with FDR correction. We visualized the top five most enriched transcription factor motifs in each cell type that appear in at least 10% of the respective K562+ or HepG2+ STARR distal enhancers.

Silencer cCREs

Defining REST+cCREs

Using BEDTools, we intersected cCREs with the summits of ENCODE REST ChIP–seq peaks and selected all cCREs that overlapped the summits of at least five peaks. We then selected all non-promoter cCREs that overlapped at least one high quality REST motif site from FactorBook58.

Testing REST+ regions with transgenic mouse enhancer assays

To supplement previously tested regions in the VISTA database, we performed transgenic mouse enhancer assay experiments for 26 REST+ regions (Supplementary Table 10e). These regions were conserved between mouse and human and had evidence of REST binding (that is, ChIP–seq peaks) in both species. Nine of these regions overlapped REST+ enhancer/silencer cCREs, 15 overlapped REST+ silencer cCREs and 2 overlapped cCREs that overlapped REST ChIP–seq peaks but were not included in our REST+ silencer sets (Supplementary Table 10e).

The transgenic mouse assays were conducted using the FVB/NCrl strain of Mus musculus (Charles River) as previously described59. Candidate regions were amplified via PCR and inserted into a plasmid containing a minimal Hsp68 promoter and a lacZ reporter gene. The resulting constructs were injected into fertilized mouse eggs, which were then implanted into surrogate mothers. Embryos were collected at embryonic day 11.5 (E11.5) and analysed for β-galactosidase activity. Regions were classified as active enhancers if reproducible staining was observed in the same tissue in at least three embryos. Regions were considered inactive if no consistent staining was detected, provided that at least five embryos with transgene insertions were analysed.

Overlap of REST+ cCREs with VISTA enhancers

We downloaded the coordinates of 1,947 tested regions from the VISTA Enhancer Database (available as of 22 April 2023), which included the aforementioned 26 regions, and lifted the coordinates of these regions to the GRCh38 genome using UCSC liftOver56. We then intersected VISTA regions with REST+ cCREs using BEDTools, calculating the fraction of tested cCREs with positive activity. We used a two-sided Fisher’s exact test to calculate statistical significance for each cCRE class. For REST+ enhancer/silencer cCREs, we calculated enrichment of activity in specific tissues using a two-sided Fisher’s exact test by comparing the activity of regions overlapping REST+ distal enhancer cCREs versus distal enhancer cCREs not bound by REST (REST−).

Defining STARR silencer cCREs

Using STARR scores calculated by CAPRA from the K562 STARR-seq experiment ENCSR661FOW, we annotated all cCREs with a negative STARR score and P < 0.01 as stringent STARR silencers and all cCREs with a negative STARR score and P < 0.05 as robust STARR silencers (Supplementary Table 11a).

Enrichment in chromatin signatures

For each biosample with DNase data (n = 1,325) we calculated the fraction of silencer cCREs with high signal (z-score > 1.64). To determine enrichment, we calculated the log2 fold enrichment of the fraction of silencer cCREs over the fraction of all cCREs with high DNase signal. For a subset of silencer cCREs, REST+ silencer cCREs, we compare the enrichment of chromatin accessibility between embryonic and adult biosamples from the same tissue or organ (Supplementary Table 8g). Statistical significance was calculated using a two-sided Wilcoxon test.

We also intersected silencer cCREs with histone mark peaks from K562 cells using BEDTools intersect. We downloaded histone mark peaks directly from the ENCODE portal. For REST+ cCREs, we compared peak overlap with cCREs active in K562 and those with low chromatin accessibility, respectively. For stringent and robust STARR silencers we compared peak overlap with cCREs with positive STARR scores and neutral STARR scores. Statistical significance was calculated using a two-sided Fisher’s exact test with FDR correction.

To evaluate the chromatin state context of silencers, we calculated the overlap between cCRE sets and ChromHMM60 annotations from the ENCODE Project. We used the consolidated 15-state ChromHMM segmentation for K562 (generated by this study, ENCFF106BGJ on the ENCODE portal). For each chromatin state, we used BEDTools to calculate the fraction of cCREs overlapping any genomic region assigned to that state.

Overlap with previous silencer collections

We downloaded K562 silencers from four previous studies and formatted them into BED files. This included elements from Huan et al.13, Jayavelu et al.14, Pang & Snyder15 and Cai et al.16. When necessary, region coordinates were lifted to the GRCh38 genome using UCSC liftOver. We intersected each collection with cCREs using BEDTools. We also compared silencer calls between each study by calculating the overlap coefficient.

Expression of genes near silencers

Using BEDTools we identified the closest GENCODE annotated protein coding gene for each cCRE as measured by linear distance to the nearest TSS. Using gene quantifications in K562 (ENCFF421TJX, available on the ENCODE portal), we extracted TPM quantifications for each gene and stratified by cCRE class or group.

MAFF/MAFK+ cCREs

To define MAFF/MAFK+ cCREs, we first extracted peak summits for MAFF and MAFK from ChIP–seq data in K562 and HepG2 cells. We classified a cCRE as MAFF/MAFK+ if it overlapped the summits of both MAFF and MAFK in the same given cell type. We then selected common MAFF/MAFK+ cCREs that had low chromatin accessibility in both cell types for further characterization.

Dissecting GWAS loci

VESPA pipeline

As previously described9, the VESPA pipeline takes lead variants reported by GWAS and generates matched control variants based on minor allele frequency and genomic context. For both sets of GWAS and control variants, VESPA then retrieves variants in high linkage disequilibrium (LD, R2 > 0.7) based on study population, creating ‘LD blocks’. VESPA next intersects variants with cCREs. For studies with a sufficient number of LD blocks (at least 25 lead variants), we calculated cell type and tissue enrichment by subsetting cCREs based on activity in specific biosamples and comparing overlap between GWAS and control variants, accounting for LD by only counting each LD block once per intersection. For this study, we used H3K27ac signals across 562 biosamples. In total, we curated variants from 3,751 unique trait–study combinations, including 396 with biosample recommendations. Results are available on SCREEN.

Candidate GWAS genes

We identified candidate GWAS genes using four complementary approaches:

  • Closest gene. Using BEDTools closest52, we identified the closest gene (either protein coding or non-coding) to each RBC cCRE as measured by linear distance to the nearest TSS.

  • Proximal TSS. We included all genes with an annotated TSS within 5 kb of an RBC cCRE.

  • 3D chromatin interactions. We intersected RBC cCREs with the anchors of 3D chromatin loops—RNAPII ChIA-PET, CTCF ChIA-PET and Hi-C—from K562 cells. To link a candidate cCRE to a gene, we required one end of the loop to overlap the cCRE and the other end of the loop to fall within 2 kb of an annotated TSS.

  • CRISPRi–Flow-FISH. We intersected the GRCh38 mapped coordinates with RBC cCREs and selected all cCRE–gene pairs with an FDR < 0.05.

For the cross-biosample, cross-gene expression analysis (Fig. 4d) gene quantification files were downloaded directly from the ENCODE portal (experiment and file IDs in Supplementary Table 17d). For KLF1 and PRDX2 gene expression data was downloaded directly from SCREEN (Supplementary Table 17e).

Cell lines

This study used data from more than 200 human and mouse cell lines. A complete list of all cell lines, including biosample accessions, is provided in Supplementary Table 1a,b. Each experiment is linked to a biosample page on the ENCODE portal (https://encodeproject.org), which includes donor, tissue and derivation information. No new cell lines were generated or cultured for this study. All cell lines were obtained by ENCODE data production labs, which performed standard quality control and authentication procedures prior to data release, including verification of species identity as part of ENCODE production guidelines. Information on the provenance and validation of each biosample is available on the corresponding ENCODE portal biosample pages. Cell lines were not tested for mycoplasma contamination. No commonly misidentified cell lines were used.

Human primary cell and tissue samples

This study analysed publicly available human data generated by the ENCODE consortium. All human datasets were collected by ENCODE production laboratories under protocols approved by their respective Institutional Review Boards and in accordance with informed consent and relevant ethical guidelines.

Mouse primary cell and tissue samples

Each biosample on the ENCODE portal, including mouse biosamples, has annotated biological sex information. No sex-specific analyses were performed as part of this work.

Randomization and blinding

Randomization and blinding were not relevant for this study. Most analyses were based on publicly available ENCODE consortium datasets that were uniformly processed and analysed computationally. For the REST transgenic mouse reporter assays performed specifically for this study, embryos were not randomly assigned to treatment or control groups because all embryos were generated and assayed under identical experimental conditions to evaluate reproducible enhancer or silencer activity of the tested sequences. Thus, no random allocation was necessary. The embryos were also processed and imaged under identical conditions, and reporter expression was scored without knowledge of which candidate elements were tested to minimize potential bias in interpretation.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Leave a Comment