Starting with misha 5.3.0, databases can be stored in two formats: an indexed format and a per-chromosome (legacy) format.
The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).
The indexed format uses unified files:
Sequence data:
- seq/genome.seq - All chromosome sequences concatenated
- seq/genome.idx - Index mapping chromosome names to positions

Track data:
- tracks/mytrack.track/track.dat - All chromosome data concatenated
- tracks/mytrack.track/track.idx - Index with offset/length per chromosome
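
To make the offset/length idea concrete, here is a minimal base-R sketch of reading one chromosome's values out of a concatenated file through an index. It is a toy illustration only: the file contents, the `idx` data frame, and the `read_chrom()` helper are hypothetical and do not reproduce misha's actual binary layout for track.dat or track.idx.

```r
# Toy data: per-chromosome numeric vectors (hypothetical values)
chrom_data <- list(chr1 = c(1.5, 2.5, 3.5), chr2 = c(10, 20), chr3 = c(7, 8, 9, 10))

# Write all chromosomes back to back into a single file
dat_file <- tempfile(fileext = ".dat")
con <- file(dat_file, "wb")
for (v in chrom_data) writeBin(v, con, size = 8)
close(con)

# Build the index: byte offset and byte length of each chromosome's block
lens <- unname(vapply(chrom_data, length, integer(1))) * 8L
idx <- data.frame(
    chrom = names(chrom_data),
    offset = cumsum(c(0, head(lens, -1))),
    length = lens,
    stringsAsFactors = FALSE
)

# Read one chromosome by seeking to its offset; a single file handle suffices
read_chrom <- function(chrom, dat_file, idx) {
    row <- idx[idx$chrom == chrom, ]
    con <- file(dat_file, "rb")
    on.exit(close(con))
    seek(con, where = row$offset, origin = "start")
    readBin(con, what = "double", n = row$length / 8, size = 8)
}

chr2_values <- read_chrom("chr2", dat_file, idx)
chr3_values <- read_chrom("chr3", dat_file, idx)
```

However many chromosomes the index lists, only one data file is opened, which is the property the indexed format relies on.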
Advantages:
- Fewer file descriptors (important for genomes with 100+ contigs)
- Better performance for large workloads (14% faster)
- Smaller disk footprint
- Faster track creation and conversion
The per-chromosome format uses separate files:
Sequence data:
- seq/chr1.seq, seq/chr2.seq, … - One file per chromosome

Track data:
- tracks/mytrack.track/chr1.track, chr2.track, … - One file per chromosome
When to use:
- Compatibility with older misha versions (<5.3.0)
- Small genomes (<25 chromosomes) where the performance difference is negligible
By default, new databases use the indexed format:
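
A minimal sketch of creating a custom-genome database follows. The paths are placeholders, and the positional use of `gdb.create()` (database root first, FASTA file second) is an assumption to check against the installed version's help page. The guard keeps the snippet a no-op unless misha is installed and the FASTA file exists.

```r
# Placeholder paths: replace with real locations before running
db_root <- "/path/to/mydb"
fasta <- "/path/to/genome.fa"

# Runs only when misha is installed and the placeholder FASTA exists
if (requireNamespace("misha", quietly = TRUE) && file.exists(fasta)) {
    # Custom genome from a multi-FASTA file; per the text above, a new
    # database is written in the indexed format by default
    misha::gdb.create(db_root, fasta)
    misha::gsetroot(db_root)
}
```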
Use gdb.info() to check your database format:
Example output:
Convert all tracks and sequences to indexed format:
This will:
1. Convert sequence files (chr*.seq → genome.seq + genome.idx)
2. Convert all tracks to indexed format
3. Validate conversions
4. Remove old files after successful conversion
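
The sequence part of the steps above (convert, validate, remove old files) can be illustrated with a toy base-R sketch. It is not misha's implementation: `convert_seq_toy()` is a hypothetical helper, the index is returned as a data frame instead of being written as a binary genome.idx, and validation is reduced to a size check.

```r
convert_seq_toy <- function(seq_dir) {
    # Step 1: concatenate chr*.seq files into genome.seq
    chr_files <- list.files(seq_dir, pattern = "^chr.*\\.seq$", full.names = TRUE)
    stopifnot(length(chr_files) > 0)
    sizes <- file.size(chr_files)
    genome_file <- file.path(seq_dir, "genome.seq")
    out <- file(genome_file, "wb")
    for (i in seq_along(chr_files)) {
        writeBin(readBin(chr_files[i], what = "raw", n = sizes[i]), out)
    }
    close(out)

    # Index: where each chromosome starts and how many bytes it spans
    idx <- data.frame(
        chrom = sub("\\.seq$", "", basename(chr_files)),
        offset = cumsum(c(0, head(sizes, -1))),
        length = sizes,
        stringsAsFactors = FALSE
    )

    # Step 3: validate before touching the originals
    stopifnot(file.size(genome_file) == sum(sizes))

    # Step 4: remove the per-chromosome files only after validation passed
    unlink(chr_files)
    idx
}

# Demo on a throwaway directory with two tiny "chromosomes"
seq_dir <- tempfile("seq_")
dir.create(seq_dir)
writeBin(charToRaw("ACGT"), file.path(seq_dir, "chr1.seq"))
writeBin(charToRaw("GG"), file.path(seq_dir, "chr2.seq"))
idx <- convert_seq_toy(seq_dir)
genome <- readChar(file.path(seq_dir, "genome.seq"), nchars = 6)
```

Validating before deleting is the important design point: if the size check fails, the original files are still in place.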
Convert specific tracks while keeping others in legacy format:
Note that 2D tracks cannot be converted to indexed format yet.
High priority (significant benefits):
- Genomes with many contigs (>50 chromosomes)
- Large-scale analyses (10M+ bp regions frequently)
- 2D track workflows
- File descriptor limit issues

Medium priority (moderate benefits):
- Repeated extraction workflows
- Regular analyses on medium-sized regions (1-10M bp)

Low priority (minimal benefits):
- Small genomes (<25 chromosomes)
- One-off analyses
- Simple queries on small regions
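
As a rough aid, the numeric criteria in these lists can be folded into a small helper. `conversion_priority()` is a hypothetical function, not part of misha. It covers only the contig count, region size, file descriptor, and repeated-extraction criteria, and it treats cases the lists leave unspecified (for example 25 to 50 chromosomes) as low priority unless another criterion applies.

```r
conversion_priority <- function(n_contigs, typical_region_bp,
                                fd_limit_issues = FALSE,
                                repeated_extraction = FALSE) {
    # High: many contigs, very large regions, or file descriptor problems
    if (n_contigs > 50 || typical_region_bp >= 10e6 || fd_limit_issues) {
        return("high")
    }
    # Medium: medium-sized regions or repeated extraction workflows
    if (typical_region_bp >= 1e6 || repeated_extraction) {
        return("medium")
    }
    "low"
}

p_small <- conversion_priority(n_contigs = 24, typical_region_bp = 1e4)
p_medium <- conversion_priority(n_contigs = 24, typical_region_bp = 5e6)
p_many <- conversion_priority(n_contigs = 2000, typical_region_bp = 1e4)
```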
Step 1: Backup (optional but recommended)
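
One way to take the backup from within R is a recursive copy of the database directory. The sketch below uses only base R; `backup_db()` is a hypothetical helper, and the demo runs on a throwaway directory tree standing in for a real database.

```r
backup_db <- function(db_root, backup_root) {
    stopifnot(dir.exists(db_root), !dir.exists(backup_root))
    dir.create(backup_root, recursive = TRUE)
    # Copy everything under db_root (tracks/, seq/, ...) into backup_root
    entries <- list.files(db_root, full.names = TRUE, all.files = TRUE, no.. = TRUE)
    copied <- file.copy(entries, backup_root, recursive = TRUE)
    # Confirm that the same relative file paths exist on both sides
    src <- sort(list.files(db_root, recursive = TRUE, all.files = TRUE))
    dst <- sort(list.files(backup_root, recursive = TRUE, all.files = TRUE))
    all(copied) && identical(src, dst)
}

# Demo on a temporary tree that mimics a per-chromosome database layout
demo_root <- tempfile("db_")
dir.create(file.path(demo_root, "tracks", "mytrack.track"), recursive = TRUE)
dir.create(file.path(demo_root, "seq"))
writeLines("x", file.path(demo_root, "tracks", "mytrack.track", "chr1.track"))
writeLines("ACGT", file.path(demo_root, "seq", "chr1.seq"))

backup_root <- paste0(demo_root, "_backup")
backup_ok <- backup_db(demo_root, backup_root)
```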
Step 2: Check current format
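
A minimal sketch of this check, using `gdb.info()` and its `format` field as they appear elsewhere in this document. The `needs_conversion()` helper is hypothetical, and the string "indexed" as the reported value is an assumption; print `info$format` on your own database to confirm the actual values. The database path is a placeholder.

```r
# Hypothetical helper: TRUE unless the database already reports the indexed format
needs_conversion <- function(info) {
    # Assumption: an indexed database reports format == "indexed"
    !identical(info$format, "indexed")
}

db_root <- "/path/to/mydb" # placeholder path

# Runs only when misha is installed and the placeholder path exists
if (requireNamespace("misha", quietly = TRUE) && dir.exists(db_root)) {
    misha::gsetroot(db_root)
    info <- misha::gdb.info()
    print(paste("Current format:", info$format))
    if (needs_conversion(info)) {
        message("Database is a candidate for conversion")
    }
}
```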
Step 3: Convert
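
A minimal sketch of the conversion call, assuming `gdb.convert_to_indexed()` can be called without arguments as it is named in this document; any options it accepts are not shown here and should be checked in the function's help page. The database path is a placeholder.

```r
db_root <- "/path/to/mydb" # placeholder path

# Runs only when misha is installed and the placeholder path exists
if (requireNamespace("misha", quietly = TRUE) && dir.exists(db_root)) {
    misha::gsetroot(db_root)
    # Converts sequences and tracks of the current database to indexed format
    misha::gdb.convert_to_indexed()
}
```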
Step 4: Verify
# Check format changed
info <- gdb.info()
print(paste("New format:", info$format))
# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))

Step 5: Remove backup (after validation)
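
Removing the backup can be tied to the outcome of the verification step so that it is never deleted by accident. The sketch below is base R only; `remove_backup()` is a hypothetical helper and the demo uses a temporary directory in place of a real backup.

```r
remove_backup <- function(backup_root, validated) {
    # Refuse to delete anything unless validation explicitly succeeded
    if (!isTRUE(validated)) {
        return(FALSE)
    }
    unlink(backup_root, recursive = TRUE)
    !dir.exists(backup_root)
}

# Demo on a temporary directory standing in for the backup
demo_backup <- tempfile("db_backup_")
dir.create(demo_backup)

kept <- remove_backup(demo_backup, validated = FALSE)
still_there <- dir.exists(demo_backup)
removed <- remove_backup(demo_backup, validated = TRUE)
```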
You can freely copy tracks between databases with different formats.
# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
    iterator = "mytrack",
    file = "/tmp/mytrack.txt"
)
# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to target database format!

# Copy multiple tracks
tracks <- c("track1", "track2", "track3")
for (track in tracks) {
    # Export (read the description while the source database is active)
    gsetroot("/path/to/source_db")
    info <- gtrack.info(track) # Get description
    file_path <- sprintf("/tmp/%s.txt", track)
    gextract(track, gintervals.all(), iterator = track, file = file_path)
    # Import
    gsetroot("/path/to/target_db")
    gtrack.import(track, info$description, file_path, binsize = 0)
    unlink(file_path)
}

Based on comprehensive benchmarks comparing indexed vs legacy formats:
This occurs with many-contig genomes in legacy format:
Solution: Convert to indexed format
After manually copying track directories:
Solution: Reload database
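
A minimal sketch of the reload, assuming the database root is already known; the path is a placeholder. `gdb.reload()` rescans the current database so that manually copied track directories become visible to the session.

```r
db_root <- "/path/to/mydb" # placeholder path

# Runs only when misha is installed and the placeholder path exists
if (requireNamespace("misha", quietly = TRUE) && dir.exists(db_root)) {
    misha::gsetroot(db_root)
    # Rescan the database so newly copied tracks are picked up
    misha::gdb.reload()
}
```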
- gdb.create_genome() for standard genomes
- gdb.create() with multi-FASTA for custom genomes
- gdb.info()
- gdb.convert_to_indexed()