Title: A Fast and Light Package to Compute Polygenic Risk Scores
Description: Quickly computes polygenic scores from GWAS summary statistics of either case-control or quantitative traits without parameter tuning. Reales, G., Vigorito, E., Kelemen, M., Wallace, C. (2021) <doi:10.1101/2020.07.24.220392> "RápidoPGS: A rapid polygenic score calculator for summary GWAS data without a test dataset".
Authors: Guillermo Reales [aut, cre]
Maintainer: Guillermo Reales <[email protected]>
License: GPL-3
Version: 2.3.0.9002
Built: 2025-01-29 03:54:15 UTC
Source: https://github.com/grealesm/rapidopgs
create_1000G
Downloads the 1000 Genomes Phase III panel (hg19) in PLINK format and applies quality control so that it can be used to compute PGS with rapidopgs_multi.
Given the size of the files, running this function can take a long time, depending on broadband speed and server status. We also recommend ensuring that at least 60GB of free disk space is available.
create_1000G( directory = "ref-data", remove.related = TRUE, qc.maf = 0.01, qc.hwe = 1e-10, qc.geno = 0, autosomes.only = TRUE )
directory: a string indicating the directory to download the panel to.
remove.related: a logical stating if related individuals should be removed. DEFAULT TRUE.
qc.maf: a numeric to set the MAF threshold for variants to be removed. DEFAULT 0.01.
qc.hwe: a numeric indicating the threshold for the Hardy-Weinberg exact test p-value, below which variants will be removed. DEFAULT 1e-10.
qc.geno: a numeric to set the maximum missing call rate for variants. DEFAULT 0.
autosomes.only: a logical. If FALSE, X and Y chromosomes will be included, too. DEFAULT TRUE.
bed, fam and bim files for each chromosome in the chosen directory.
Guillermo Reales
## Not run: 
create_1000G()
## End(Not run)
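A hedged sketch of a fuller call follows; the directory name is illustrative, and the commented rapidopgs_multi call assumes a sumstats object ss as described in that function's own section.
## Not run: 
# Build the QCed panel once (can take a long time and ~60GB of disk)
create_1000G(directory = "ref-data", remove.related = TRUE, qc.maf = 0.01)
# Later, reuse the same directory as the reference for rapidopgs_multi:
# PGS <- rapidopgs_multi(ss, reference = "ref-data", N = 20000, trait = "cc")
## End(Not run)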
A GRanges object containing the LD blocks for European ancestry, in hg19 build. This dataset was obtained from Berisa and Pickrell (2016), in bed format, then converted to GRanges. See manuscript for more details.
EUR_ld.blocks19
A GRanges object containing 1703 ranges, with:
seqnames: chromosome
ranges: start and stop positions for the block
strand: genomic strand, irrelevant here
https://bitbucket.org/nygcresearch/ldetect-data/src
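A short hedged sketch of inspecting and querying this object with GenomicRanges; it assumes the package attaches as RapidoPGS and that the data object is lazy-loaded, the coordinates are illustrative, and the chromosome name is taken from the object itself since the seqlevels style is not asserted here.
library(GenomicRanges)
library(RapidoPGS)
EUR_ld.blocks19                           # 1703 ranges, hg19
seqlevels(EUR_ld.blocks19)                # check chromosome naming before overlapping
# Which block contains a given position? (illustrative coordinates)
snp <- GRanges(seqlevels(EUR_ld.blocks19)[1], IRanges(2000000, 2000000))
subsetByOverlaps(EUR_ld.blocks19, snp)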
A GRanges object containing the LD blocks for European ancestry, in hg38 build. This dataset was obtained from MacDonald et al. (2022), in bed format, then transformed to GRanges. See manuscript for more details.
EUR_ld.blocks38
A GRanges object containing 1361 ranges, with:
seqnames: chromosome
ranges: start and stop positions for the block
strand: genomic strand, irrelevant here
https://github.com/jmacdon/LDblocks_GRCh38/tree/master/data
find_file_in_ftp
Finding a file in an FTP directory. This is an internal function to help gwascat.download find the right file.
find_file_in_ftp(ftp_address, acc, hm)
ftp_address: a string. An FTP address provided by gwascat.download.
acc: a string containing the accession for the desired study.
hm: a logical. Should it look in the harmonised directory?
a data.table containing the dataset.
Guillermo Reales
gwascat.download
Takes a PMID from the user and downloads the associated summary statistics datasets published in the GWAS Catalog. This function takes a PubMed ID as input, searches the GWAS Catalog for associated harmonised datasets, interactively asks the user to choose if there is more than one, and fetches the dataset.
gwascat.download(ID, harmonised = TRUE)
ID: a numeric. A PubMed ID (PMID) reference number from a GWAS paper.
harmonised: a logical. Should GWAS Catalog harmonised files be pursued? If not available, the function will fall back to non-harmonised files.
If multiple files are available for the same study, R will prompt an interactive dialogue to select a specific file, by number.
a character vector containing the url(s) to the dataset(s).
Guillermo Reales
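A hedged usage sketch; the PMID below is a placeholder rather than a reference to any particular study, and the urls object name is only illustrative.
## Not run: 
urls <- gwascat.download(ID = 12345678, harmonised = TRUE)
## End(Not run)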
logsum
Sums logs without loss of precision. This function is copied verbatim from its namesake in the cupcake package (github.com/ollyburren/cupcake/).
logsum(x)
x: a vector of logs to sum
a scalar
Chris Wallace
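For illustration, a minimal sketch of the log-sum-exp trick that logsum implements (mirroring the cupcake version; the function name here is only for the sketch):
logsum_sketch <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))   # equals log(sum(exp(x))), but avoids underflow/overflow
}
logsum_sketch(c(-1000, -1001, -1002))   # finite, whereas log(sum(exp(x))) would underflow to -Inf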
A data.table containing a subset of the Michailidou et al., 2017 breast cancer summary statistic dataset, in hg19 build. This dataset is freely available in the GWAS Catalog (see link below). We used the "chromosome" and "base_pair_location" columns, removed unnecessary and all-missing columns, and took a random sample of 100,000 SNPs without replacement.
michailidou19
A data.table object containing 100,000 SNPs, with columns:
SNPID: rsids, or SNP ids
CHR: chromosome
BP: base position, in hg19
REF: reference, or non-effect allele
ALT: alternative, or effect allele
ALT_FREQ: effect allele frequency
BETA: beta, log(OR), or effect size
SE: standard error of beta
P: p-value
A data.table containing a subset of Michailidou et al., 2017 breast cancer summary statistic dataset, in hg38 build. This dataset is freely available in GWAS catalog (see link below). We removed unnecessary and all-missing columns, and rows with missing data at hm_beta and hm_effect_allele_frequency, and took a random sample of 100,000 SNPs without replacement.
michailidou38
A data.table object containing 100,000 SNPs
rsids, or SNP ids
chromosome
base position, in hg38
reference, or non-effect allele
alternative, or effect allele
beta, log(OR), or effect size
effect allele frequency
standard error of beta
p-value
rapidopgs_multi
Computes PGS from GWAS summary statistics using Bayesian sum of single-effects (SuSiE) linear regression on z-scores.
rapidopgs_multi( data, reference = NULL, LDmatrices = NULL, N = NULL, build = c("hg19", "hg38"), trait = c("cc", "quant"), ncores = 1, alpha.block = 1e-04, alpha.snp = 0.01, sd.prior = NULL, ancestry = "EUR", LDblocks = NULL )
data: a data.table containing the GWAS summary statistic dataset with all required information.
reference: a string representing the path to the directory containing the reference panel (e.g. "../ref-data").
LDmatrices: a string representing the path to the directory containing the pre-computed LD matrices.
N: a numeric indicating the number of individuals used to generate the input GWAS dataset, or a string indicating the column name containing per-SNP sample size.
build: a string indicating the genome build. 'hg19' and 'hg38' are supported. Note that your LD matrices or reference panel should match this build.
trait: a string indicating if the trait is case-control ("cc") or quantitative ("quant").
ncores: a numeric specifying the number of cores (CPUs) to be used. If using pre-computed LD matrices, one core is enough for best performance.
alpha.block: a numeric threshold for the minimum P-value in an LD block. Blocks whose minimum P is above alpha.block will be skipped.
alpha.snp: a numeric threshold for P-value pruning within an LD block. SNPs with P above alpha.snp will be removed.
sd.prior: the prior specifies that BETA at causal SNPs follows a centred normal distribution with standard deviation sd.prior. If NULL (DEFAULT), it will be automatically estimated (recommended).
ancestry: a string indicating the ancestral population (DEFAULT: "EUR", European). If using an alternative population, bear in mind that your LD matrices or reference panel must be from the same population. You will also need to provide matching LD blocks via the LDblocks argument.
LDblocks: a string indicating the path to an alternative LD block file in .RData format. Only required for non-European PGS.
This function will take a GWAS summary statistic dataset as an input, assign LD blocks to it, and then use user-provided LD matrices or a preset reference panel in PLINK format to compute LD matrices for each block. The SuSiE method is then used to compute posterior probabilities of variants being causal, and PGS weights are generated by multiplying those posteriors by the effect sizes (BETA). Unlike rapidopgs_single, this approach allows for one or more causal variants per block.
The GWAS summary statistics file to compute PGS using our method must contain the following minimum columns, with these exact column names:
CHR: Chromosome
BP: Base position (in GRCh37/hg19).
REF: Reference, or non-effect allele
ALT: Alternative, or effect allele, the one BETA refers to
BETA: Effect size, beta (or log(OR))
SE: Standard error of BETA
P: P-value for the association test
In addition, quantitative traits must have the following extra column:
ALT_FREQ: Minor allele frequency.
Also, for quantitative traits, sample size must be supplied, either as a number or as the name of the column containing per-SNP sample sizes (see below). Other columns are allowed and will be ignored.
The reference panel should be split by chromosome, in PLINK format. Both the reference panel and the summary statistic dataset should be in GRCh37/hg19. For the 1000 Genomes panel, you can use the create_1000G function to set it up automatically.
If you prefer to use LD matrices, you must indicate the path to the directory where they are stored. They must be in RDS format and named LD_chrZ.rds (where Z is the chromosome number, 1-22). If you don't have LD matrices already, we recommend downloading those kindly provided by Privé et al. at https://figshare.com/articles/dataset/European_LD_reference/13034123. These matrices were computed for 1,054,330 HapMap3 variants based on 362,320 European individuals from the UK Biobank.
a data.table containing the sumstats dataset with computed PGS weights.
Guillermo Reales, Chris Wallace
## Not run: 
ss <- data.table(
  CHR = c(4, 20, 14, 2, 4, 6, 6, 21, 13),
  BP = c(1479959, 13000913, 29107209, 203573414, 57331393, 11003529,
         149256398, 25630085, 79166661),
  REF = c("C", "C", "C", "T", "G", "C", "C", "G", "T"),
  ALT = c("A", "T", "T", "A", "A", "A", "T", "A", "C"),
  BETA = c(0.012, 0.0079, 0.0224, 0.0033, 0.0153, 0.058, 0.0742, 0.001, -0.0131),
  SE = c(0.0099, 0.0066, 0.0203, 0.0171, 0.0063, 0.0255, 0.043, 0.0188, 0.0074),
  P = c(0.2237, 0.2316, 0.2682, 0.8477, 0.01473, 0.02298, 0.08472, 0.9573, 0.07535))
PGS <- rapidopgs_multi(ss, reference = "ref-data/", N = 20000,
                       build = "hg19", trait = "cc", ncores = 5)
## End(Not run)
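As an alternative, a hedged sketch of the same call pointing at pre-computed LD matrices instead of a reference panel; the directory name is illustrative and is assumed to contain LD_chr1.rds to LD_chr22.rds matching the build and ancestry.
## Not run: 
PGS <- rapidopgs_multi(ss, LDmatrices = "LD-matrices/", N = 20000,
                       build = "hg19", trait = "cc")
## End(Not run)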
rapidopgs_single
Computes PGS from GWAS summary statistics using posteriors from Wakefield's approximate Bayes factors.
rapidopgs_single( data, N = NULL, trait = c("cc", "quant"), build = "hg19", pi_i = 1e-04, sd.prior = if (trait == "quant") { 0.15 } else { 0.2 }, filt_threshold = NULL, recalc = TRUE, reference = NULL )
data: a data.table containing the GWAS summary statistic dataset with all required information.
N: a scalar representing the sample size of the study, or a string indicating the column name containing it. Required for quantitative traits only.
trait: a string specifying if the dataset corresponds to a case-control ("cc") or a quantitative trait ("quant") GWAS. If trait = "quant", an ALT_FREQ column is required.
build: a string containing the genome build of the dataset, either "hg19" (for hg19/GRCh37) or "hg38" (hg38/GRCh38). DEFAULT "hg19".
pi_i: a scalar representing the prior probability that a variant is causal (DEFAULT: 1e-4).
sd.prior: the prior specifies that BETA at causal SNPs follows a centred normal distribution with standard deviation sd.prior. Sensible and widely used DEFAULTs are 0.2 for case-control traits, and 0.15 * var(trait) for quantitative traits (selected if trait == "quant").
filt_threshold: a scalar indicating the ppi threshold used to filter SNPs after PGS computation (if filt_threshold is larger than one, the top filt_threshold SNPs by absolute weight are selected instead; see Details). If NULL (DEFAULT), no filtering is applied.
recalc: a logical indicating if weights should be recalculated after thresholding. Only relevant if filt_threshold is defined.
reference: a string indicating the path of the reference file SNPs should be filtered and aligned to; see Details.
This function will take a GWAS summary statistic dataset as an input, align it to a reference panel file (if provided), then assign SNPs to LD blocks and compute Wakefield's ppi by LD block, and finally use those posteriors to generate PGS weights by multiplying them by the effect sizes (BETA). Optionally, it will filter SNPs by a custom ppi threshold and then recalculate weights, to improve accuracy. Alternatively, if filt_threshold is larger than one, RapidoPGS will select the top filt_threshold SNPs by absolute weight (note: weights, not ppi).
The GWAS summary statistics file to compute PGS using our method must contain the following minimum columns, with these exact column names:
CHR: Chromosome
BP: Base position (in GRCh37/hg19 or GRCh38/hg38). If using hg38, set build = "hg38".
REF: Reference, or non-effect allele
ALT: Alternative, or effect allele, the one BETA refers to
ALT_FREQ: Minor/ALT allele frequency in the tested population, or in a close population from a reference panel. Required for quantitative traits only.
BETA: Effect size, beta (or log(OR))
SE: Standard error of BETA
If a reference is provided, it should have 5 columns: CHR, BP, SNPID, REF, and ALT. Also, it should be in the same build as the summary statistics. In both files, column order does not matter.
a data.table containing the formatted sumstats dataset with computed PGS weights.
Guillermo Reales, Chris Wallace
sumstats <- data.table(
  SNPID = c("rs139096444", "rs3843766", "rs61977545", "rs544733737", "rs2177641",
            "rs183491817", "rs72995775", "rs78598863", "rs1411315"),
  CHR = c(4, 20, 14, 2, 4, 6, 6, 21, 13),
  BP = c(1479959, 13000913, 29107209, 203573414, 57331393, 11003529,
         149256398, 25630085, 79166661),
  REF = c("C", "C", "C", "T", "G", "C", "C", "G", "T"),
  ALT = c("A", "T", "T", "A", "A", "A", "T", "A", "C"),
  BETA = c(0.012, 0.0079, 0.0224, 0.0033, 0.0153, 0.058, 0.0742, 0.001, -0.0131),
  SE = c(0.0099, 0.0066, 0.0203, 0.0171, 0.0063, 0.0255, 0.043, 0.0188, 0.0074))
PGS <- rapidopgs_single(sumstats, trait = "cc")
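A hedged follow-up sketch using the filtering options described in Details; the threshold value and the PGS_top name are only illustrative.
# Keep the top 5 SNPs by absolute weight and recalculate weights afterwards
PGS_top <- rapidopgs_single(sumstats, trait = "cc", filt_threshold = 5, recalc = TRUE)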
sd.prior.est
This function will take the dataset as an input, along with a heritability (h^2) value obtained from a public repository such as LD Hub (http://ldsc.broadinstitute.org/ldhub/), the sample size, and the number of variants, and will provide an sd.prior estimate that can be used to improve the prediction performance of RapidoPGS functions on quantitative traits.
sd.prior.est(data, h2, N, pi_i = 1e-04)
data: a data.table containing the GWAS summary statistic input dataset. Must contain SNPID and SE columns.
h2: a numeric. Heritability estimate, or h^2 (see Details).
N: a numeric. Sample size of the GWAS input dataset.
pi_i: a numeric. Prior that a given variant is causal. DEFAULT 1e-4.
Guillermo Reales, Elena Vigorito, Chris Wallace
sumstats <- data.table(
  SNPID = c("4:1479959", "20:13000913", "14:29107209", "2:203573414", "4:57331393",
            "6:11003529", "6:149256398", "21:25630085", "13:79166661"),
  REF = c("C", "C", "C", "T", "G", "C", "C", "G", "T"),
  ALT = c("A", "T", "T", "A", "A", "A", "T", "A", "C"),
  ALT_FREQ = c(0.2611, 0.4482, 0.0321, 0.0538, 0.574, 0.0174, 0.0084, 0.0304, 0.7528),
  BETA = c(0.012, 0.0079, 0.0224, 0.0033, 0.0153, 0.058, 0.0742, 0.001, -0.0131),
  SE = c(0.0099, 0.0066, 0.0203, 0.0171, 0.0063, 0.0255, 0.043, 0.0188, 0.0074),
  P = c(0.2237, 0.2316, 0.2682, 0.8477, 0.01473, 0.02298, 0.08472, 0.9573, 0.07535))
sd.prior <- sd.prior.est(sumstats, h2 = 0.2456, N = 45658, pi_i = 1e-4)
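The estimate can then be passed to rapidopgs_single for a quantitative trait; a hedged sketch follows, assuming a sumstats table that also carries the CHR, BP and ALT_FREQ columns rapidopgs_single requires (the PGS_quant name is only illustrative).
## Not run: 
PGS_quant <- rapidopgs_single(sumstats, N = 45658, trait = "quant", sd.prior = sd.prior)
## End(Not run)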
sdY.est
Estimate trait standard deviation given vectors of variance of coefficients, MAF and sample size.
sdY.est(vbeta, maf, n)
vbeta: vector of variance of coefficients
maf: vector of MAF (same length as vbeta)
n: sample size
The estimate is based on var(beta-hat) = var(Y) / (n * var(X)), where var(X) = 2 * maf * (1 - maf), so we can estimate var(Y) by regressing n * var(X) against 1 / var(beta-hat). This function is copied verbatim from its namesake in the coloc package (github.com/chr1swallace/coloc/), by Chris Wallace.
estimated standard deviation of Y
Chris Wallace
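A minimal sketch of the regression described above, mirroring the coloc implementation; treat it as illustrative rather than canonical, and note the function name is ours.
sdY.est_sketch <- function(vbeta, maf, n) {
  oneover <- 1 / vbeta              # 1 / var(beta-hat)
  nvx <- 2 * n * maf * (1 - maf)    # n * var(X)
  m <- lm(nvx ~ oneover - 1)        # regress n*var(X) on 1/var(beta-hat), no intercept
  cf <- coef(m)[["oneover"]]        # slope estimates var(Y)
  if (cf < 0) stop("estimated var(Y) is negative; check the input data")
  sqrt(cf)                          # estimated sd(Y)
}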
wakefield_pp
Computes the posterior probability that a given SNP is causal, under the assumption of a single causal variant. This function was adapted from its namesake in the cupcake package (github.com/ollyburren/cupcake/) to no longer require allele frequencies.
wakefield_pp(beta, se, pi_i = 1e-04, sd.prior = 0.2)
beta: a vector of effect sizes (BETA).
se: a vector of standard errors of the effect sizes (SE).
pi_i: a scalar representing the prior probability that a variant is causal (DEFAULT 1e-4).
sd.prior: a scalar representing our prior expectation of BETA at causal SNPs (DEFAULT 0.2).
a vector of posterior probabilities.
Olly Burren, Chris Wallace, Guillermo Reales
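For reference, a hedged sketch of the Wakefield (2009) approximate Bayes factor calculation this function is based on; the exact normalisation in the package may differ, so treat this as illustrative, and note the function name is ours.
wakefield_pp_sketch <- function(beta, se, pi_i = 1e-4, sd.prior = 0.2) {
  z <- beta / se                                # z-scores
  V <- se^2                                     # variance of beta-hat
  r <- sd.prior^2 / (sd.prior^2 + V)            # shrinkage factor
  lABF <- 0.5 * (log(1 - r) + r * z^2)          # log approximate Bayes factor
  tABF <- c(lABF + log(pi_i), 0)                # append a null term (lABF = 0, prior weight 1 - assumed)
  denom <- max(tABF) + log(sum(exp(tABF - max(tABF))))   # stable log-sum-exp
  exp(lABF + log(pi_i) - denom)                 # posterior probability per SNP
}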
wakefield_pp_quant
Computes the posterior probability that a given SNP is causal for a quantitative trait, under the assumption of a single causal variant.
wakefield_pp_quant(beta, se, sdY, sd.prior = 0.15, pi_i = 1e-04)
beta: a vector of effect sizes (BETA).
se: a vector of standard errors of the effect sizes (SE).
sdY: a scalar of the trait standard deviation, given vectors of variance of coefficients, MAF and sample size. Can be calculated using sdY.est.
sd.prior: a scalar representing our prior expectation of BETA at causal SNPs (DEFAULT 0.15).
pi_i: a scalar representing the prior probability that a variant is causal (DEFAULT 1e-4).
This function was adapted from wakefield_pp in the cupcake package (github.com/ollyburren/cupcake/).
a vector of posterior probabilities.
Guillermo Reales, Chris Wallace