GenomicRanges — ゲノム上の範囲やアノテーション

https://bioconductor.org/packages/GenomicRanges

ゲノム上の座標・範囲を表現するための型として S4 class GenomicRanges を提供する。これは整数の範囲を扱う IRanges を拡張したもの。

Installation

install.packages("BiocManager")
BiocManager::install("GenomicRanges")
library(GenomicRanges)

`IRanges`

両端の値を含む閉区間 [start, end] なのが厄介。 end = start + width - 1 の関係。 width = 0 のときだけ例外的に end < start になり、両端を含まない。

ir1 = IRanges(start = 1:4, width = 3:0)
ir2 = IRanges(start = 1:4, end = 3)
ir3 = IRanges(end = 3, width = 3:0)
stopifnot(identical(ir1, ir2), identical(ir1, ir3))
ir1

IRanges object with 4 ranges and 0 metadata columns:
      start end width
  [1]     1   3     3
  [2]     2   3     2
  [3]     3   3     1
  [4]     4   3     0

start(ir1)

[1] 1 2 3 4

end(ir1)

[1] 3 3 3 3

width(ir1)

[1] 3 2 1 0

`GenomicRanges`

gr = GRanges(
  seqnames = c("chr2", "chr1", "chr1"),    # 染色体の名前
  ranges = IRanges(101:103, width = 100),  # 座標
  strand = c("-", "+", "*"),
  score = 51:53,                           # 任意のelementMetadata列
  GC = 0.1 * 5:7,                          # 任意のelementMetadata列
  seqinfo = NULL,
  seqlengths = c(chr1 = 249250621, chr2 = 243199373)
)
gr

GRanges object with 3 ranges and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr2   101-200      - |        51       0.5
  [2]     chr1   102-201      + |        52       0.6
  [3]     chr1   103-202      * |        53       0.7
  -------
  seqinfo: 2 sequences from an unspecified genome

上記のように自分で作る機会はほぼ無くて、 GenomicRanges::makeGRangesFromDataFrame(df), rtracklayer::import("annotation.gff3"), GenomicFeatures::genes(txdb), のような形で読み込むことが多い。

個々の区間の情報へのアクセスはIRangeと同じ start(gr), end(gr), width(gr) に加えて:

seqnames(gr)

factor-Rle of length 3 with 2 runs
  Lengths:    1    2
  Values : chr2 chr1
Levels(2): chr1 chr2

ranges(gr)

IRanges object with 3 ranges and 0 metadata columns:
      start end width
  [1]   101 200   100
  [2]   102 201   100
  [3]   103 202   100

strand(gr)

factor-Rle of length 3 with 3 runs
  Lengths: 1 1 1
  Values : - + *
Levels(3): + - *

mcols(gr)

DataFrame with 3 rows and 2 columns
      score        GC
  <integer> <numeric>
1        51       0.5
2        52       0.6
3        53       0.7

S4Vectors::mcols() で参照・代入するのは区間ごとのメタデータ。

データセット全体のメタデータとして染色体の長さなども扱う:

seqinfo(gr)

Seqinfo object with 2 sequences from an unspecified genome:
  seqnames seqlengths isCircular genome
  chr1      249250621         NA   <NA>
  chr2      243199373         NA   <NA>

seqlevels(gr)

[1] "chr1" "chr2"

seqlengths(gr)

     chr1      chr2 
249250621 243199373

isCircular(gr)

chr1 chr2 
  NA   NA

genome(gr)

chr1 chr2 
  NA   NA

Functions

多くの関数は IRanges と GenomicRanges で同様に動作する。より単純な前者で例を示し、後者固有の話は適宜挟む。

ir = IRanges(c(1, 8, 14, 15, 19, 34, 40), width = c(12, 6, 6, 15, 6, 2, 7))
ir

IRanges object with 7 ranges and 0 metadata columns:
      start end width
  [1]     1  12    12
  [2]     8  13     6
  [3]    14  19     6
  [4]    15  29    15
  [5]    19  24     6
  [6]    34  35     2
  [7]    40  46     7

plot of chunk iranges-orig

Intra-range methods

個々の区間を操作

shift(x, shift = 0L, use.names = TRUE)

ずらす。

narrow(x, start = NA, end = NA, width = NA, use.names = TRUE)

狭める。startには正、endには負の値を与える。

start = 1, end = -1 のとき何もしないというのがわかりにくすぎて怖い。

identical(ir, narrow(ir, 1, -1))

[1] TRUE

threebands(x, start = NA, end = NA, width = NA) は削られる両端部分も含めて $left, $middle, $right のリストで返してくれる亜種。

逆に広げるには flank() を punion() するか、 + 演算子で両側に広げてから片側を narrow() するか？

resize(x, width, fix = "start", use.names = TRUE, ...)

幅を変える。

fix: start, end, center

flank(x, width, start = TRUE, both = FALSE, use.names = TRUE, ...)

start上流もしくはend下流の領域。両方いっぺんには取れない。

both = TRUE は start (or end) を起点に両側という意味であり、範囲の両側ではない。

flank(ir, 1, both = TRUE)

IRanges object with 7 ranges and 0 metadata columns:
      start end width
  [1]     0   1     2
  [2]     7   8     2
  [3]    13  14     2
  [4]    14  15     2
  [5]    18  19     2
  [6]    33  34     2
  [7]    39  40     2

promoters(x, upstream=2000, downstream=200, use.names=TRUE, ...) はstart起点に上下異なる幅で取ってこられる亜種。

reflect(x, bounds, use.names = TRUE)

bounds の裏から見た相対位置にする。

reflect(ir, IRanges(1, 1000))

IRanges object with 7 ranges and 0 metadata columns:
      start  end width
  [1]   989 1000    12
  [2]   988  993     6
  [3]   982  987     6
  [4]   972  986    15
  [5]   977  982     6
  [6]   966  967     2
  [7]   955  961     7

いかにも負のstrandの座標処理に使えそうだがなぜかGenomicRangesには未対応。自分で書くならこんな感じか:

y = gr
bounds = IRanges(start = 1, width = seqlengths(gr)[as.vector(seqnames(gr))])
ranges(y)[strand(gr) == "-"] = reflect(ranges(gr), bounds)[strand(gr) == "-"]
y

GRanges object with 3 ranges and 2 metadata columns:
      seqnames              ranges strand |     score        GC
         <Rle>           <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr2 243199174-243199273      - |        51       0.5
  [2]     chr1             102-201      + |        52       0.6
  [3]     chr1             103-202      * |        53       0.7
  -------
  seqinfo: 2 sequences from an unspecified genome

reverse(x) は reflect(x, range(x)) のショートカット。区間を逆順に並べる rev() とは違う。

restrict(x, start = NA, end = NA, keep.all.ranges = FALSE, use.names = TRUE)

start から end までの範囲のみ残して外を捨てる。境界含む。

end = 14 で15から始まる区間が取れてきちゃうのはバグじゃない？

restrict(ir, 10, 14)

IRanges object with 4 ranges and 0 metadata columns:
      start end width
  [1]    10  12     3
  [2]    10  13     4
  [3]    14  14     1
  [4]    15  14     0

足し算・引き算は両側に伸縮:

IRanges(101:200) + 100

IRanges object with 1 range and 0 metadata columns:
      start end width
  [1]     1 300   300

掛け算はズームイン・ズームアウト:

IRanges(101:200) * 2

IRanges object with 1 range and 0 metadata columns:
      start end width
  [1]   126 175    50

IRanges(101:200) * 0.5

IRanges object with 1 range and 0 metadata columns:
      start end width
  [1]    51 250   200

Inter-range methods

区間の集合を操作

range(x, ..., with.revmap = FALSE, na.rm = FALSE)

端から端まで1つの区間として返す。

range(ir)

IRanges object with 1 range and 0 metadata columns:
      start end width
  [1]     1  46    46

reduce(x, drop.empty.ranges = FALSE, min.gapwidth = 1L, with.revmap = FALSE)

重なっている区間をつなげて平らにする。

with.revmap = TRUE とすると入力した区間がどこに含まれるかをmcolsに保持する。

reduce(ir, with.revmap = TRUE)

IRanges object with 3 ranges and 1 metadata column:
      start end width |    revmap
  [1]     1  29    29 | 1,2,3,...
  [2]    34  35     2 |         6
  [3]    40  46     7 |         7

plot of chunk iranges-reduce

min.gapwidth を大きくすると離れた区間もつなげられる。例えば reduce(ir, min.gapwidth = 10L) で全部つながって range(ir) と同じになる。

gaps(x, start=NA, end=NA, ...)

IRanges::setdiff(IRanges(start, end), x) と同等。 start/end 省略時は range(x) からの差分。

gaps(ir)

IRanges object with 2 ranges and 0 metadata columns:
      start end width
  [1]    30  33     4
  [2]    36  39     4

GRangesに対しては染色体全体からの差分がstrandごとに計算される:

gaps(gr)

GRanges object with 9 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1         1-101      +
  [2]     chr1 202-249250621      +
  [3]     chr1   1-249250621      -
  [4]     chr1         1-102      *
  [5]     chr1 203-249250621      *
  [6]     chr2   1-243199373      +
  [7]     chr2         1-100      -
  [8]     chr2 201-243199373      -
  [9]     chr2   1-243199373      *
  -------
  seqinfo: 2 sequences from an unspecified genome

disjoin(x, ...)

全ての始点・終点を使い、重なり無しで最多の小区間にして返す。

disjoin(ir)

IRanges object with 10 ranges and 0 metadata columns:
       start end width
   [1]     1   7     7
   [2]     8  12     5
   [3]    13  13     1
   [4]    14  14     1
   [5]    15  18     4
   [6]    19  19     1
   [7]    20  24     5
   [8]    25  29     5
   [9]    34  35     2
  [10]    40  46     7

plot of chunk iranges-disjoin

disjointBins(x, ...)

重ならないように描画するときのy座標に使える。

disjointBins(ir)

[1] 1 2 1 2 3 1 1

union(x, y), intersect(x, y), setdiff(x, y)

結果は reduce() 済みの区間。

element-wise の punion(), pintersect(), psetdiff(), pgap() もある。

GenomicRanges::subtract(x, y, minoverlap = 1L, ...)

setdiff() と似ているが元の区間ごとの GRangesList にして返す。

Overlaps

type

any: ちょっとでも重なっているか距離が maxgap 以下ならOK。

start/end/equal: 開始/終了/両方の距離が maxgap 以下ならOK。

within: queryがすっぽり含まれていればOK。 maxgap の意図・挙動は不明。

findOverlaps(query, subject, maxgap = -1L, minoverlap = 0L, type, select, ...)

select: subject側に複数マッチした場合に何を返すか。デフォルトの “all” ならHits型オブジェクト、 “first” や “last” なら整数vectorでインデックスを返す。 “arbitrary” の挙動は謎だが乱数を振ったりするわけではなさそう。

findOverlaps(ir, ir)

Hits object with 15 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1           1
   [2]         1           2
   [3]         2           1
   [4]         2           2
   [5]         3           3
   ...       ...         ...
  [11]         5           3
  [12]         5           4
  [13]         5           5
  [14]         6           6
  [15]         7           7
  -------
  queryLength: 7 / subjectLength: 7

findOverlaps(ir, ir, select = "first")

[1] 1 1 3 3 3 6 7

mergeByOverlaps(query, subject, ...) と findOverlapPairs(query, subject, ...) はHits型とは違う謎の形式で返す亜種。

countOverlaps(query, subject, maxgap = -1L, minoverlap = 0L, type, ...)

いくつsubjectと重なるか、queryと同じ長さの整数vectorを返す。

countOverlaps(ir, ir)

[1] 2 2 3 3 3 1 1

overlapsAny(query, subject, maxgap = -1L, minoverlap = 0L, type, ...)

ひとつでもsubjectと重なるものがあるか、queryと同じ長さの論理vectorを返す。

overlapsAny(ir, ir[4])

[1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

%over%, %within% %outside% という演算子で短く書けるけど可読性低下とconflictが怖いので使わない。

subsetByOverlaps(x, ranges, maxgap = -1L, minoverlap = 0L, type, invert = FALSE, ...)

x[overlapsAny(x, ranges)] のショートカット。

overlapsRanges(query, subject, hits = NULL, ...)

subsetByOverlapsした上、重なっている部分のみ切り詰めて返す。

poverlaps(query, subject, maxgap = 0L, minoverlap = 1L, type, ...)

min() に対する pmin() のように、element-wiseに比較する。

poverlaps(ir, rev(ir))

[1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

coverage(x, shift=0L, width=NULL, weight=1L, method=c("auto", "sort", "hash", "naive"))

重なりの数をRle型で。

coverage(ir, shift = 200, width = 800)

integer-Rle of length 800 with 13 runs
  Lengths: 200   7   5   2   4   1   5   5   4   2   4   7 554
  Values :   0   1   2   1   2   3   2   1   0   1   0   1   0

cvg(x, from = NA, to = NA, weight = 1L, varname = "cvg", collapse = FALSE, ...) も似たような関数だがドキュメント無し。

slice(x, ...) を使うと「カバレッジがこれ以上の区間」みたいなのを簡単に取れる:

coverage(ir) |> IRanges::slice(lower = 2, rangesOnly = TRUE)

IRanges object with 2 ranges and 0 metadata columns:
      start end width
  [1]     8  12     5
  [2]    15  24    10

Neighboring

nearest(x, subject, select = c("arbitrary", "all"))

最も近いもの、あるいはoverlapしているもの。

nearest(ir, ir)

[1] 1 1 2 3 3 6 7

precede(x, subject, select = c("first", "all"))

subjectのうちどれの上流にあるか。重なるものは除外。

precede(ir, ir)

[1]  3  3  6  6  6  7 NA

follow(x, subject, select = c("last", "all"))

subjectのうちどれの下流にあるか。重なるものは除外。

follow(ir, ir)

[1] NA NA  2  2  2  4  6

distanceToNearest(x, subject, select = c("arbitrary", "all"))

最も近いsubjectへの距離。overlapしている場合はゼロ。

distanceToNearest(ir, ir[1:3])

Hits object with 7 hits and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           1 |         0
  [2]         2           1 |         0
  [3]         3           2 |         0
  [4]         4           3 |         0
  [5]         5           3 |         0
  [6]         6           3 |        14
  [7]         7           3 |        20
  -------
  queryLength: 7 / subjectLength: 3

distance(x, y) はelement-wiseに計算。

distance(ir, rev(ir))

[1] 27 20  0  0  0 20 27

`ignore.strand`

GenomicRangesの操作はデフォルトでseqnamesとstrandごとに分けて行われる。 strandを無視するには ignore.strand = TRUE を渡す。

reduce(gr)

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   102-201      +
  [2]     chr1   103-202      *
  [3]     chr2   101-200      -
  -------
  seqinfo: 2 sequences from an unspecified genome

reduce(gr, ignore.strand = TRUE)

GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   102-202      *
  [2]     chr2   101-200      *
  -------
  seqinfo: 2 sequences from an unspecified genome