Database Formats and Multi-Contig Support

library(misha)

Overview

Starting with misha 5.3.0, databases can be stored in two formats:

- Per-chromosome (legacy) format - separate files for each chromosome's sequence and track data
- Indexed format - consolidated, indexed files covering the whole genome

The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).
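
If you are unsure whether your installed misha can use indexed databases, a quick version check with base R may help; this is only a sketch, and the 5.3.0 threshold comes from the statement above:

# Indexed databases require misha >= 5.3.0
if (packageVersion("misha") < "5.3.0") {
    stop("Please upgrade misha to 5.3.0 or later to use indexed databases")
}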

Key Features

- Sequence data consolidated into genome.seq plus a genome.idx index instead of one file per chromosome
- Track data stored in indexed form rather than one file per chromosome
- Better performance and scalability for genomes with many contigs
- Avoids file descriptor limits when working with many-contig genomes
- Full backward compatibility: legacy databases keep working, and both formats can be used in the same session

Database Formats

Per-Chromosome Format (Legacy)

The per-chromosome format uses separate files:

Sequence data:
- seq/chr1.seq, seq/chr2.seq, …
- One file per chromosome

Track data:
- tracks/mytrack.track/chr1.track, chr2.track, …
- One file per chromosome

When to use:
- Compatibility with older misha versions (<5.3.0)
- Small genomes (<25 chromosomes) where the performance difference is negligible
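
To see this layout on disk, you can list the sequence and track files directly. This is only an illustrative sketch; the db_root path is a hypothetical placeholder, and the directory names are those shown above:

db_root <- "/path/to/mydb" # hypothetical path to a legacy database

# One .seq file per chromosome in a legacy database
head(list.files(file.path(db_root, "seq"), pattern = "\\.seq$"))

# One .track file per chromosome inside each track directory
head(list.files(file.path(db_root, "tracks", "mytrack.track"), pattern = "\\.track$"))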

Creating Databases

New Databases (Indexed Format)

By default, new databases use the indexed format:

# Create database from FASTA file
gdb.create("mydb", "/path/to/genome.fa")

# Or download pre-built genome
gdb.create_genome("hg38", path = "/path/to/install")

Force Legacy Format

To create a database in legacy format (for compatibility with older misha):

# Set option before creating database
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")
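
Because options() changes persist for the rest of the session, it can be safer to restore the previous setting once the legacy database has been created. A minimal sketch, reusing the option name from the example above:

# Remember the current setting, force the legacy format, then restore it
old_opt <- getOption("gmulticontig.indexed_format")
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")
options(gmulticontig.indexed_format = old_opt)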

Checking Database Format

Use gdb.info() to check your database format:

gsetroot("/path/to/mydb")
info <- gdb.info()
print(info$format) # "indexed" or "per-chromosome"

Example output:

info <- gdb.info()
# $path
# [1] "/path/to/mydb"
#
# $is_db
# [1] TRUE
#
# $format
# [1] "indexed"
#
# $num_chromosomes
# [1] 24
#
# $genome_size
# [1] 3095693983

Converting Databases

Convert Entire Database

Convert all tracks and sequences to indexed format:

gsetroot("/path/to/mydb")
gdb.convert_to_indexed()

This will:

  1. Convert sequence files (chr*.seq → genome.seq + genome.idx)
  2. Convert all tracks to indexed format
  3. Validate the conversions
  4. Remove old files after successful conversion
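
If conversion runs as part of a script, you may want to skip it when the database is already indexed. The guard below is only a sketch built from the functions shown in this document (gdb.info() and gdb.convert_to_indexed()):

gsetroot("/path/to/mydb")

# Only convert databases that are still in the per-chromosome format
info <- gdb.info()
if (identical(info$format, "per-chromosome")) {
    gdb.convert_to_indexed()
} else {
    message("Database is already in indexed format")
}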

Convert Individual Tracks

Convert specific tracks while keeping others in legacy format:

gtrack.convert_to_indexed("mytrack")

Note that 2D tracks cannot be converted to indexed format yet.
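
When converting many tracks one by one, it can help to wrap each call in tryCatch() so that tracks which cannot be converted yet (for example 2D tracks) are reported and skipped instead of aborting the loop. A sketch, assuming gtrack.ls() lists all the tracks you want to convert:

for (track in gtrack.ls()) {
    tryCatch(
        gtrack.convert_to_indexed(track),
        error = function(e) {
            # Tracks that cannot be converted (e.g. 2D tracks) are reported and skipped
            message("Skipping ", track, ": ", conditionMessage(e))
        }
    )
}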

Convert Intervals

Convert interval sets to indexed format:

# 1D intervals
gintervals.convert_to_indexed("myintervals")

# 2D intervals
gintervals.2d.convert_to_indexed("my2dintervals")
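
For several interval sets, the same calls can be applied in a loop. The interval-set names below are hypothetical placeholders:

# Hypothetical interval-set names; replace with your own
intervals_1d <- c("myintervals", "promoters")
intervals_2d <- c("my2dintervals")

for (ivs in intervals_1d) gintervals.convert_to_indexed(ivs)
for (ivs in intervals_2d) gintervals.2d.convert_to_indexed(ivs)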

Migration Guide

When to Migrate

High priority (significant benefits):
- Genomes with many contigs (>50 chromosomes)
- Large-scale analyses that frequently query 10M+ bp regions
- 2D track workflows
- File descriptor limit issues

Medium priority (moderate benefits):
- Repeated extraction workflows
- Regular analyses on medium-sized regions (1-10M bp)

Low priority (minimal benefits):
- Small genomes (<25 chromosomes)
- One-off analyses
- Simple queries on small regions
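
A rough way to apply these guidelines programmatically is to look at the chromosome count reported by gdb.info(); the >50 threshold comes from the high-priority criteria above, and the rest is only a sketch:

gsetroot("/path/to/mydb")
info <- gdb.info()

if (identical(info$format, "per-chromosome") && info$num_chromosomes > 50) {
    message("Many-contig legacy database - conversion is strongly recommended")
} else {
    message("Conversion is optional; the legacy format remains supported")
}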

Migration Workflow

Step 1: Backup (optional but recommended)

# Create backup of important database
system("cp -r /path/to/mydb /path/to/mydb.backup")

Step 2: Check current format

gsetroot("/path/to/mydb")
info <- gdb.info()
print(paste("Current format:", info$format))

Step 3: Convert

gdb.convert_to_indexed()

Step 4: Verify

# Check format changed
info <- gdb.info()
print(paste("New format:", info$format))

# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))

Step 5: Remove backup (after validation)

# After thorough testing
system("rm -rf /path/to/mydb.backup")

Copying Tracks Between Databases

You can freely copy tracks between databases with different formats.

Method 1: Export and Import

# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
    iterator = "mytrack",
    file = "/tmp/mytrack.txt"
)

# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to target database format!

Method 2: Batch Copy

# Copy multiple tracks
tracks <- c("track1", "track2", "track3")

for (track in tracks) {
    # Export from the source database
    gsetroot("/path/to/source_db")
    info <- gtrack.info(track) # get the description while the source database is active
    file_path <- sprintf("/tmp/%s.txt", track)
    gextract(track, gintervals.all(), iterator = track, file = file_path)

    # Import into the target database
    gsetroot("/path/to/target_db")
    gtrack.import(track, info$description, file_path, binsize = 0)
    unlink(file_path)
}

Performance Comparison

Based on comprehensive benchmarks comparing indexed vs legacy formats:

Operations Faster with Indexed Format

Operations Similar Performance

Summary

Backward Compatibility

Fully Compatible

- Existing legacy databases keep working without any conversion
- Both formats can be used in the same R session (see the example below)
- Tracks can be copied between databases regardless of format

Example: Mixed Environment

# Work with both formats in same session
gsetroot("/path/to/legacy_db")
data1 <- gextract("track1", gintervals(1, 0, 1000))

gsetroot("/path/to/indexed_db")
data2 <- gextract("track2", gintervals(1, 0, 1000))

Troubleshooting

“File descriptor limit reached”

This occurs with many-contig genomes in the legacy format, where each chromosome is stored in its own file and a query touching many chromosomes can exceed the operating system's open-file limit.

Solution: Convert to indexed format

gdb.convert_to_indexed()

“Track not found after copying files”

If track directories were copied into the database by hand, the new tracks may not be visible because misha has not rescanned the database.

Solution: Reload database

gdb.reload()

“Conversion fails with disk space error”

Conversion temporarily requires roughly twice the track's size on disk, because the old per-chromosome files are removed only after the new indexed files are written and validated.

Solution: Free disk space or convert tracks individually

# Convert one track at a time
gtrack.convert_to_indexed("track1")
gtrack.convert_to_indexed("track2")
# etc.
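
To estimate whether enough space is available before converting a single large track, you can sum the size of its per-chromosome files with base R. The tracks/<name>.track layout is taken from the legacy-format description above, and the path is a hypothetical placeholder:

track_dir <- "/path/to/mydb/tracks/track1.track" # hypothetical path
files <- list.files(track_dir, recursive = TRUE, full.names = TRUE)
track_size_gb <- sum(file.size(files)) / 1024^3

# Conversion temporarily needs roughly twice this amount of free space
message(sprintf("track1 occupies about %.2f GB on disk", track_size_gb))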

Best Practices

For New Projects

  1. Use indexed format (default) for new databases
  2. Use gdb.create_genome() for standard genomes
  3. Use gdb.create() with multi-FASTA for custom genomes

For Existing Projects

  1. Check format with gdb.info()
  2. Migrate if beneficial (many contigs, large analyses)
  3. No rush - legacy format remains fully supported

Summary

The indexed format is the default for new databases since misha 5.3.0 and is recommended for genomes with many contigs and for large-scale analyses. Existing legacy databases remain fully supported: check their format with gdb.info() and convert them in place with gdb.convert_to_indexed() when the benefits justify it.