Database Formats and Multi-Contig Support

library(misha)

Overview

Starting with misha 5.3.0, databases can be stored in two formats:

- Per-chromosome (legacy) format - separate files for each chromosome's sequence and track data
- Indexed format - consolidated, indexed files covering the whole genome

The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).
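
If you are unsure whether your installed misha can use indexed databases, a quick version check with base R may help; this is only a sketch, and the 5.3.0 threshold comes from the statement above:

# Indexed databases require misha >= 5.3.0
if (packageVersion("misha") < "5.3.0") {
    stop("Please upgrade misha to 5.3.0 or later to use indexed databases")
}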

Key Features

- Sequence data consolidated into genome.seq plus a genome.idx index instead of one file per chromosome
- Track data stored in indexed form rather than one file per chromosome
- Better performance and scalability for genomes with many contigs
- Avoids file descriptor limits when working with many-contig genomes
- Full backward compatibility: legacy databases keep working, and both formats can be used in the same session

Database Formats

Per-Chromosome Format (Legacy)

The per-chromosome format uses separate files:

Sequence data:
- seq/chr1.seq, seq/chr2.seq, …
- One file per chromosome

Track data:
- tracks/mytrack.track/chr1.track, chr2.track, …
- One file per chromosome

When to use:
- Compatibility with older misha versions (<5.3.0)
- Small genomes (<25 chromosomes) where the performance difference is negligible
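
To see this layout on disk, you can list the sequence and track files directly. This is only an illustrative sketch; the db_root path is a hypothetical placeholder, and the directory names are those shown above:

db_root <- "/path/to/mydb" # hypothetical path to a legacy database

# One .seq file per chromosome in a legacy database
head(list.files(file.path(db_root, "seq"), pattern = "\\.seq$"))

# One .track file per chromosome inside each track directory
head(list.files(file.path(db_root, "tracks", "mytrack.track"), pattern = "\\.track$"))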

Creating Databases

New Databases (Indexed Format)

By default, new databases use the indexed format:

# Create database from FASTA file
gdb.create("mydb", "/path/to/genome.fa")

# Or download pre-built genome
gdb.create_genome("hg38", path = "/path/to/install")

Force Legacy Format

To create a database in legacy format (for compatibility with older misha):

# Set option before creating database
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")
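
Because options() changes persist for the rest of the session, it can be safer to restore the previous setting once the legacy database has been created. A minimal sketch, reusing the option name from the example above:

# Remember the current setting, force the legacy format, then restore it
old_opt <- getOption("gmulticontig.indexed_format")
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")
options(gmulticontig.indexed_format = old_opt)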

Checking Database Format

Use gdb.info() to check your database format:

gsetroot("/path/to/mydb")
info <- gdb.info()
print(info$format) # "indexed" or "per-chromosome"

Example output:

info <- gdb.info()
# $path
# [1] "/path/to/mydb"
#
# $is_db
# [1] TRUE
#
# $format
# [1] "indexed"
#
# $num_chromosomes
# [1] 24
#
# $genome_size
# [1] 3095693983

Converting Databases

Convert Entire Database

Convert all tracks and sequences to indexed format:

gsetroot("/path/to/mydb")
gdb.convert_to_indexed()

This will:

  1. Convert sequence files (chr*.seq → genome.seq + genome.idx)
  2. Convert all tracks to indexed format
  3. Validate the conversions
  4. Remove old files after successful conversion
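
If conversion runs as part of a script, you may want to skip it when the database is already indexed. The guard below is only a sketch built from the functions shown in this document (gdb.info() and gdb.convert_to_indexed()):

gsetroot("/path/to/mydb")

# Only convert databases that are still in the per-chromosome format
info <- gdb.info()
if (identical(info$format, "per-chromosome")) {
    gdb.convert_to_indexed()
} else {
    message("Database is already in indexed format")
}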

Convert Individual Tracks

Convert specific tracks while keeping others in legacy format:

gtrack.convert_to_indexed("mytrack")

Note that 2D tracks cannot be converted to indexed format yet.
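
When converting many tracks one by one, it can help to wrap each call in tryCatch() so that tracks which cannot be converted yet (for example 2D tracks) are reported and skipped instead of aborting the loop. A sketch, assuming gtrack.ls() lists all the tracks you want to convert:

for (track in gtrack.ls()) {
    tryCatch(
        gtrack.convert_to_indexed(track),
        error = function(e) {
            # Tracks that cannot be converted (e.g. 2D tracks) are reported and skipped
            message("Skipping ", track, ": ", conditionMessage(e))
        }
    )
}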

Convert Intervals

Convert interval sets to indexed format:

# 1D intervals
gintervals.convert_to_indexed("myintervals")

# 2D intervals
gintervals.2d.convert_to_indexed("my2dintervals")
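
For several interval sets, the same calls can be applied in a loop. The interval-set names below are hypothetical placeholders:

# Hypothetical interval-set names; replace with your own
intervals_1d <- c("myintervals", "promoters")
intervals_2d <- c("my2dintervals")

for (ivs in intervals_1d) gintervals.convert_to_indexed(ivs)
for (ivs in intervals_2d) gintervals.2d.convert_to_indexed(ivs)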

Migration Guide

When to Migrate

High priority (significant benefits):
- Genomes with many contigs (>50 chromosomes)
- Large-scale analyses that frequently query 10M+ bp regions
- 2D track workflows
- File descriptor limit issues

Medium priority (moderate benefits):
- Repeated extraction workflows
- Regular analyses on medium-sized regions (1-10M bp)

Low priority (minimal benefits):
- Small genomes (<25 chromosomes)
- One-off analyses
- Simple queries on small regions
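
A rough way to apply these guidelines programmatically is to look at the chromosome count reported by gdb.info(); the >50 threshold comes from the high-priority criteria above, and the rest is only a sketch:

gsetroot("/path/to/mydb")
info <- gdb.info()

if (identical(info$format, "per-chromosome") && info$num_chromosomes > 50) {
    message("Many-contig legacy database - conversion is strongly recommended")
} else {
    message("Conversion is optional; the legacy format remains supported")
}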

Migration Workflow

Step 1: Backup (optional but recommended)

# Create backup of important database
system("cp -r /path/to/mydb /path/to/mydb.backup")

Step 2: Check current format

gsetroot("/path/to/mydb")
info <- gdb.info()
print(paste("Current format:", info$format))

Step 3: Convert

gdb.convert_to_indexed()

Step 4: Verify

# Check format changed
info <- gdb.info()
print(paste("New format:", info$format))

# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))

Step 5: Remove backup (after validation)

# After thorough testing
system("rm -rf /path/to/mydb.backup")

Copying Tracks Between Databases

You can freely copy tracks between databases with different formats.

Method 1: Export and Import

# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
    iterator = "mytrack",
    file = "/tmp/mytrack.txt"
)

# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to target database format!

Method 2: Batch Copy

# Copy multiple tracks
tracks <- c("track1", "track2", "track3")

for (track in tracks) {
    # Export from the source database
    gsetroot("/path/to/source_db")
    info <- gtrack.info(track) # get the description while the source database is active
    file_path <- sprintf("/tmp/%s.txt", track)
    gextract(track, gintervals.all(), iterator = track, file = file_path)

    # Import into the target database
    gsetroot("/path/to/target_db")
    gtrack.import(track, info$description, file_path, binsize = 0)
    unlink(file_path)
}

Performance Comparison

Based on comprehensive benchmarks comparing indexed vs legacy formats:

Operations Faster with Indexed Format

Operations Similar Performance

Summary

Backward Compatibility

Fully Compatible

- Existing legacy databases keep working without any conversion
- Both formats can be used in the same R session (see the example below)
- Tracks can be copied between databases regardless of format

Example: Mixed Environment

# Work with both formats in same session
gsetroot("/path/to/legacy_db")
data1 <- gextract("track1", gintervals(1, 0, 1000))

gsetroot("/path/to/indexed_db")
data2 <- gextract("track2", gintervals(1, 0, 1000))

Troubleshooting

“File descriptor limit reached”

This occurs with many-contig genomes in the legacy format, where each chromosome is stored in its own file and a query touching many chromosomes can exceed the operating system's open-file limit.

Solution: Convert to indexed format

gdb.convert_to_indexed()

“Track not found after copying files”

If track directories were copied into the database by hand, the new tracks may not be visible because misha has not rescanned the database.

Solution: Reload database

gdb.reload()

“Conversion fails with disk space error”

Conversion temporarily requires roughly twice the track's size on disk, because the old per-chromosome files are removed only after the new indexed files are written and validated.

Solution: Free disk space or convert tracks individually

# Convert one track at a time
gtrack.convert_to_indexed("track1")
gtrack.convert_to_indexed("track2")
# etc.
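
To estimate whether enough space is available before converting a single large track, you can sum the size of its per-chromosome files with base R. The tracks/<name>.track layout is taken from the legacy-format description above, and the path is a hypothetical placeholder:

track_dir <- "/path/to/mydb/tracks/track1.track" # hypothetical path
files <- list.files(track_dir, recursive = TRUE, full.names = TRUE)
track_size_gb <- sum(file.size(files)) / 1024^3

# Conversion temporarily needs roughly twice this amount of free space
message(sprintf("track1 occupies about %.2f GB on disk", track_size_gb))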

Best Practices

For New Projects

  1. Use indexed format (default) for new databases
  2. Use gdb.create_genome() for standard genomes
  3. Use gdb.create() with multi-FASTA for custom genomes

For Existing Projects

  1. Check format with gdb.info()
  2. Migrate if beneficial (many contigs, large analyses)
  3. No rush - legacy format remains fully supported

Summary

The indexed format is the default for new databases since misha 5.3.0 and is recommended for genomes with many contigs and for large-scale analyses. Existing legacy databases remain fully supported: check their format with gdb.info() and convert them in place with gdb.convert_to_indexed() when the benefits justify it.