genefilter 1.86.0
The genefilter package can be used to filter (select) genes from
a microarray dataset according to a variety of different filtering mechanisms.
Here, we will consider the example dataset in the sample.ExpressionSet example
from the Biobase package. This experiment has 26 samples, and
there are 500 genes and 3 covariates. The covariates are named sex, type and
score. The first two have two levels and the last one is continuous.
library("Biobase") 
library("genefilter")
data(sample.ExpressionSet)
varLabels(sample.ExpressionSet)## [1] "sex"   "type"  "score"table(sample.ExpressionSet$sex)## 
## Female   Male 
##     11     15table(sample.ExpressionSet$type)## 
##    Case Control 
##      15      11One dichotomy that can be of interest for subsequent analyses is whether the
filter is specific or non-specific. Here, specific means that we are
filtering with reference to sample metadata, for example, type. For example,
if we want to select genes that are differentially expressed in the two groups
defined by type, that is a specific filter. If on the other hand we want to
select genes that are expressed in more than 5 samples, that is an example of a
non–specific filter.
First, let us see how to perform a non–specific filter. Suppose we want to
select genes that have an expression measure above 200 in at least 5 samples. To
do that we use the function kOverA.
There are three steps that must be performed.
Create function(s) implementing the filtering criteria.
Assemble it (them) into a (combined) filtering function.
Apply the filtering function to the expression matrix.
f1 <- kOverA(5, 200)
ffun <- filterfun(f1)
wh1 <- genefilter(exprs(sample.ExpressionSet), ffun)
sum(wh1)## [1] 159Here f1 is a function that implies our “expression measure above 200 in at
least 5 samples” criterion, the function ffun is the filtering function (which
in this case consists of only one criterion), and we apply it using r Biocpkg("genefilter"). There were 159 genes that satisfied the
criterion and passed the filter. As an example for a specific filter, let us
select genes that are differentially expressed in the groups defined by type.
f2 <- ttest(sample.ExpressionSet$type, p=0.1)
wh2 <- genefilter(exprs(sample.ExpressionSet), filterfun(f2))
sum(wh2)## [1] 88Here, ttest is a function from the genefilter package which
provides a suitable wrapper around t.test from package stats.
Now we see that there are 88 genes that satisfy the selection
criterion. Suppose that we want to combine the two filters. We want those genes
for which at least 5 have an expression measure over 200 and which also are
differentially expressed between the groups defined by type
ffun_combined <- filterfun(f1, f2)
wh3 <- genefilter(exprs(sample.ExpressionSet), ffun_combined)
sum(wh3)## [1] 35Now we see that there are only 35 genes that satisfy both conditions.
The function knnCV defined below performs \(k\)–nearest neighbour
classification using leave–one–out cross–validation. At the same time it
aggregates the genes that were selected. The function returns the predicted
classifications as its returned value. However, there is an additional side
effect. The number of times that each gene was used (provided it was at least
one) are recorded and stored in the environment of the aggregator Agg. These
can subsequently be retrieved and used for other purposes.
knnCV <- function(x, selectfun, cov, Agg, pselect = 0.01, scale=FALSE) {
   nc <- ncol(x)
   outvals <- rep(NA, nc)
   for(i in seq_len(nc)) {
      v1 <- x[,i]
      expr <- x[,-i]
      glist <- selectfun(expr, cov[-i], p=pselect)
      expr <- expr[glist,]
      if( scale ) {
        expr <- scale(expr)
        v1 <- as.vector(scale(v1[glist]))
      }
      else
         v1 <- v1[glist]
      out <- paste("iter ",i, " num genes= ", sum(glist), sep="")
      print(out)
      Aggregate(row.names(expr), Agg)
      if( length(v1) == 1)
         outvals[i] <- knn(expr, v1, cov[-i], k=5)
      else
          outvals[i] <- knn(t(expr), v1, cov[-i], k=5)
    }
    return(outvals)
}gfun <- function(expr, cov, p=0.05) {
   f2 <- ttest(cov, p=p)
   ffun <- filterfun(f2)
   which <- genefilter(expr, ffun)
  }Next we show how to use this function on the dataset geneData
library("class")
##scale the genes
##genescale is a slightly more flexible "scale"
##work on a subset -- for speed only
geneData <- genescale(exprs(sample.ExpressionSet)[1:75,], 1) 
Agg <- new("aggregator") 
testcase <- knnCV(geneData, gfun, sample.ExpressionSet$type, 
       Agg,pselect=0.05)  sort(sapply(aggenv(Agg), c), decreasing=TRUE)##          AFFX-MurIL4_at         AFFX-TrpnX-M_at         AFFX-hum_alu_at 
##                      26                      26                      26 
##        AFFX-YEL018w/_at                31308_at          AFFX-PheX-M_at 
##                      20                      15                      15 
##                31312_at          AFFX-BioC-3_st AFFX-HUMRGE/M10098_M_at 
##                       3                       3                       1 
##          AFFX-DapX-5_at         AFFX-TrpnX-5_at         AFFX-BioDn-5_at 
##                       1                       1                       1 
##          AFFX-PheX-5_at 
##                       1The environment Agg contains, for each gene, the number of times it was selected in the cross-validation.
The version number of R and packages loaded for generating the vignette were:
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] class_7.3-22        genefilter_1.86.0   Biobase_2.64.0     
## [4] BiocGenerics_0.50.0 BiocStyle_2.32.0   
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9              lattice_0.22-6          RSQLite_2.3.6          
##  [4] digest_0.6.35           evaluate_0.23           grid_4.4.0             
##  [7] bookdown_0.39           fastmap_1.1.1           blob_1.2.4             
## [10] jsonlite_1.8.8          Matrix_1.7-0            AnnotationDbi_1.66.0   
## [13] GenomeInfoDb_1.40.0     DBI_1.2.2               survival_3.6-4         
## [16] BiocManager_1.30.22     httr_1.4.7              UCSC.utils_1.0.0       
## [19] XML_3.99-0.16.1         Biostrings_2.72.0       jquerylib_0.1.4        
## [22] cli_3.6.2               rlang_1.1.3             crayon_1.5.2           
## [25] XVector_0.44.0          bit64_4.0.5             splines_4.4.0          
## [28] cachem_1.0.8            yaml_2.3.8              tools_4.4.0            
## [31] memoise_2.0.1           annotate_1.82.0         GenomeInfoDbData_1.2.12
## [34] vctrs_0.6.5             R6_2.5.1                png_0.1-8              
## [37] matrixStats_1.3.0       stats4_4.4.0            lifecycle_1.0.4        
## [40] zlibbioc_1.50.0         KEGGREST_1.44.0         S4Vectors_0.42.0       
## [43] IRanges_2.38.0          bit_4.0.5               bslib_0.7.0            
## [46] xfun_0.43               MatrixGenerics_1.16.0   knitr_1.46             
## [49] xtable_1.8-4            htmltools_0.5.8.1       rmarkdown_2.26         
## [52] compiler_4.4.0