Title: | Gene-Cluster Discovery, Annotation and Visualization |
Version: | 1.1 |
Description: | Performs end-to-end analysis of gene clusters—such as photosynthesis, carbon/nitrogen/sulfur cycling, carotenoid, antibiotic, or viral marker genes (e.g., capsid, polymerase, integrase)—from genomes and metagenomes. It parses Basic Local Alignment Search Tool (BLAST) results in tab-delimited format produced by tools like NCBI BLAST+ and Diamond BLASTp, filters Open Reading Frames (ORFs) by length, detects contiguous clusters of reference genes, optionally extracts genomic coordinates, merges functional annotations, and generates publication-ready arrow plots. The package works seamlessly with or without the coding sequences input and skips plotting when no functional groups are found. For more details see Li et al. (2023) <doi:10.1038/s41467-023-42193-7>. |
URL: | https://github.com/LiuyangLee/gclink |
BugReports: | https://github.com/LiuyangLee/gclink/issues |
License: | GPL-3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | dplyr (≥ 1.1.4), gggenes (≥ 0.5.1), ggplot2 (≥ 3.5.2), utils |
Depends: | R (≥ 3.5) |
LazyData: | false |
Config/check/use_internal_tz: | TRUE |
NeedsCompilation: | no |
Packaged: | 2025-08-28 14:44:20 UTC; Administrator |
Author: | Liuyang Li |
Maintainer: | Liuyang Li <cyanobacteria@yeah.net> |
Repository: | CRAN |
Date/Publication: | 2025-09-02 21:10:08 UTC |
KEGG Orthology (KO) Group Classification
Description
A dataset mapping KEGG Orthology (KO) identifiers to their functional groups and gene symbols.
Usage
KO_group
Format
A data frame with multiple rows and 3 variables:
- gene
Character. KEGG Orthology identifier (e.g., "K04035").
- gene_group
Character. Functional group (e.g., "acsF").
- gene_label
Character. Gene symbol (e.g., "acsF").
Photosynthesis Gene Classification Groups
Description
A dataset mapping photosynthesis-related genes to their functional groups and subunits.
Usage
PGC_group
Format
A data frame with multiple rows and 3 variables:
- gene
Character. Gene identifier (e.g., "bchB").
- gene_group
Character. Functional group (e.g., "bch").
- gene_label
Character. Subunit designation (e.g., "B").
BLASTP Results for test Proteins
Description
A dataset containing BLASTP alignment results for proteins from test genomes of bacteria, including alignment metrics.
Usage
blastp_df
Format
A data frame with multiple rows and 12 variables:
- qaccver
Character. Protein query identifier.
- saccver
Character. Subject protein identifier from reference databases.
- pident
Numeric. Percentage identity between query and subject sequences.
- length
Numeric. Length of the aligned sequence region.
- mismatch
Numeric. Number of mismatches in the alignment.
- gapopen
Numeric. Number of gap openings in the alignment.
- qstart
Numeric. Start position in query sequence.
- qend
Numeric. End position in query sequence.
- sstart
Numeric. Start position in subject sequence.
- send
Numeric. End position in subject sequence.
- evalue
Numeric. Expect value for the alignment.
- bitscore
Numeric. Bit score for the alignment.
EggNOG Functional Annotation Results
Description
A dataset containing EggNOG functional annotation results, including taxonomic classification, functional categories, and domain information.
Usage
eggnog_df
Format
A data frame with multiple rows and 21 variables:
- #query
Character. Protein query identifier (e.g., "Houyiibacteriaceae–LLY-WYZ-15_3—k141_356661_13").
- seed_ortholog
Character. Subject protein identifier from EggNOG database (e.g., "1168065.DOK_14784").
- evalue
Numeric. Expect value for the annotation.
- score
Numeric. Annotation bitscore.
- eggNOG_OGs
Character.
- max_annot_lvl
Character.
- COG_category
Character.
- Description
Character. Functional description (e.g., "Bacterial regulatory proteins, tetR family").
- Preferred_name
Character. Gene name if available (e.g., "cobB").
- GOs
Character.
- EC
Character. Enzyme Commission number if applicable (e.g., "2.7.7.49").
- KEGG_ko
Character. KEGG Orthology identifier if available (e.g., "ko:K00986").
- KEGG_Pathway
Character.
- KEGG_Module
Character.
- KEGG_Reaction
Character.
- KEGG_rclass
Character.
- BRITE
Character.
- KEGG_TC
Character.
- CAZy
Character.
- BiGG_Reaction
Character.
- PFAMs
Character.
Complete Gene Clusters by Adding Missing ORFs
Description
Expands gene cluster tables to include all ORFs (annotated and hypothetical) within contigs, normalizing cluster representations for downstream analysis and plotting. Ensures consistent ORF spacing/length across clusters by inserting missing rows.
Usage
gc_add(
Data = sbgc,
Annotation = bin_genes,
orf_before_first = 0,
orf_after_last = 0,
orf_range = "All"
)
Arguments
Data |
A
|
Annotation |
A |
orf_before_first |
Integer. Hypothetical ORFs to insert before the first annotated ORF
in each cluster (bounded by contig start). Default: |
orf_after_last |
Integer. Hypothetical ORFs to append after the last annotated ORF
(bounded by contig end). Default: |
orf_range |
Character. Controls ORF inclusion and annotation merging:
|
Details
-
Hypothetical ORFs are inserted as rows with
gene = NA
. Output is always sorted by
gene_cluster
andorf_position
.Progress messages are printed to console with timestamps.
Contig bounds are respected—insertions never exceed actual ORF positions in
Annotation
.
Value
A data.frame
with one row per ORF (real/hypothetical), sorted by gene_cluster
and
orf_position
. Added columns:
GC_orf_position
Relative position within cluster (1-indexed).
GC_present_length
Count of annotated ORFs in the cluster.
GC_absent_length
Count of inserted hypothetical ORFs.
GC_length
Total ORFs (
GC_present_length + GC_absent_length
).
Identify and Extract Gene Clusters from Scaled BLAST Data
Description
This function screens contigs for regions that contain a
pre-defined set of “reference” genes (e.g., photosynthetic genes, viral genes)
arranged in a continuous block. Contigs are
first coarsely filtered by the minimum number of reference genes
they carry, then finely scanned for clusters that satisfy user-
defined density and contiguity criteria. Each detected cluster
is returned with a unique gene_cluster
identifier.
Usage
gc_cal(
Data = bin_genes,
in_gene_list = photosynthesis_gene_list,
AllGeneNum = 30,
MinConSeq = 15
)
Arguments
Data |
A data frame produced by |
in_gene_list |
A character vector of “reference” gene symbols (e.g.,
|
AllGeneNum |
Integer. Maximum total ORF count (annotated plus hypothetical) that the algorithm is allowed to span when defining a cluster (default: 30). |
MinConSeq |
Integer. Minimum number of reference genes that must be
present and consecutive within the candidate cluster
(default: 15). Must satisfy |
Details
-
Coarse filter: Contigs with fewer than
MinConSeq
reference genes are discarded. -
Fine scan: For each remaining contig, the algorithm slides a window that can encompass up to
AllGeneNum
consecutive ORFs and retains windows that contain at leastMinConSeq
reference genes in uninterrupted order. -
Cluster labelling: Each valid cluster receives a unique ID (
genome_contig---1
,genome_contig---2
, …).
Value
A data frame identical in structure to Data
but filtered to
contain only those rows that belong to valid clusters. An extra
column gene_cluster
(format: genome_contig---N
) is added
to uniquely label every cluster.
Identify Breakpoints of Gene Clusters within a Contig
Description
Internal helper used by gc_cal
.
Given the ordered positions of reference genes on a contig,
this function returns the genomic coordinates that mark the
boundary of each candidate cluster.
A boundary is declared whenever the gap between two successive
reference genes exceeds the maximum spacing allowed by the
cluster definition (AllGeneNum - MinConSeq
).
Usage
gc_cluster(Data = orf_position.tmp, AllGeneNum = 30, MinConSeq = 15)
Arguments
Data |
A numeric vector (ascending order) of ORF positions that carry
one of the reference genes of interest. Usually the vector
|
AllGeneNum |
Integer. Maximum genomic span (in ORF count) that the
algorithm is allowed to cover when defining a single
cluster. Passed unchanged from |
MinConSeq |
Integer. Minimum number of consecutive reference genes
required to form a cluster. Passed unchanged from
|
Details
If the gap between two consecutive reference genes is larger than
AllGeneNum - MinConSeq
, the left-hand gene is recorded as the last member of the preceding cluster.The final reference gene is always appended to the vector as the last boundary, ensuring the rightmost cluster is not lost.
Value
A numeric vector containing every position that marks the end
of a putative gene cluster. These values are subsequently used as
breakpoints to slice the contig into candidate regions in the
downstream functions gc_position()
and gc_range()
.
Plot Scaled Gene Clusters with Arrows
Description
Generates a compact, publication-ready arrow plot of gene
clusters previously processed with gc_scale
.
Each cluster is displayed on its own horizontal track, with
gene directionality, labels, and user-defined colour schemes.
Usage
gc_plot(
data = GC_meta,
color_theme = c("#3BAA51", "#6495ED", "#DD2421", "#EF9320", "#F8EB00", "#FF0683",
"#956548", "grey")
)
Arguments
data |
A data frame produced by |
color_theme |
Character vector of colours, in the same order as the
factor levels of |
Value
A ggplot
object that can be further customised or printed.
The plot uses gggenes::geom_gene_arrow()
and
gggenes::geom_gene_label()
with the following elements:
One facet per
Pgenome
(cluster).Genes coloured by
gene_group
.Arrow direction encoded by
Pdirection
.Optional gene labels inside arrows.
Extract ORF Positions for One Specific Gene Cluster
Description
Internal helper used by gc_cal
.
Given the ordered positions of all reference genes on a contig
(Data
) and the vector of cluster breakpoints produced by
gc_cluster
(Cluster
), this function slices
Data
to return the positions that belong to the
EachCluster
-th cluster.
Usage
gc_position(
Data = orf_position.tmp,
Cluster = cluster.tmp,
EachCluster = eachcluster
)
Arguments
Data |
A numeric vector (ascending order) of ORF positions that carry
one of the reference genes of interest. Usually the vector
|
Cluster |
Numeric vector returned by |
EachCluster |
Integer index specifying which cluster to extract (1 = first cluster, 2 = second cluster, …). |
Value
A numeric vector containing the ORF positions that constitute the requested gene cluster.
Determine ORF Range for a Candidate Gene Cluster
Description
Internal helper used by gc_cal
.
After gc_position
has isolated the ORF positions
belonging to a single cluster, this function validates and
trims that range so that the final span (distance between the
first and last retained ORF) does not exceed AllGeneNum
.
The goal is to retain the largest contiguous block that still
satisfies the user-defined size limit.
Usage
gc_range(Norf_position = Norf_position, AllGeneNum = 20, MinConSeq = 10)
Arguments
Norf_position |
Numeric vector of ORF positions (ascending) that belong
to the current candidate cluster (output from
|
AllGeneNum |
Integer. Maximum allowed genomic span (in ORF count) for the final cluster. |
MinConSeq |
Integer. Minimum number of consecutive reference genes required for the cluster. |
Details
For every reference gene in
Norf_position
, the function evaluates whether a window of at leastMinConSeq
consecutive reference genes centred on that gene can fit withinAllGeneNum
consecutive ORFs.Genes that pass the test are collected in
retain.site
.The minimal and maximal positions in
retain.site
are then used to slice the full ORF range, guaranteeing that the final cluster length <=AllGeneNum
.
Value
A numeric vector containing the final ORF positions that define the validated gene cluster. If no valid block can be produced, the vector will be empty.
Scale Gene-Cluster Coordinates for Visualization
Description
Prepares a gene‐cluster annotation table for downstream plotting by
converting absolute genomic coordinates into relative positions,
ensuring that every cluster starts at 0 and is oriented consistently.
Hypothetical ORFs (originally labeled with NA
) and missing
labels are replaced with placeholders, and factor levels are set
as requested.
Usage
gc_scale(GC_meta = GC_meta, levels_gene_group = levels_gene_group)
Arguments
GC_meta |
A data frame containing gene-cluster information. Must include
the columns |
levels_gene_group |
Character vector specifying the desired factor levels
for |
Details
Absolute
start
/end
values are not modified; scaled values are stored in new columns (Pstart
,Pend
).-
Pgenome
can be swapped for any unique identifier (e.g.,Genome
) downstream if each genome contains only one cluster.
Value
The input data frame with the following new or overwritten columns:
gene_label
Empty string (
""
) if originallyNA
.gene_group
Set to
"hypothetical ORF"
if originallyNA
, then coerced to a factor usinglevels_gene_group
.Pgenome
Factor version of
gene_cluster
; levels follow the order of appearance in the data.Pstart
,Pend
Relative start and end coordinates (numeric) within each cluster, scaled so that the left-most gene starts at 0.
Pdirection
Logical vector:
TRUE
for forward,FALSE
for reverse.
Gene-Cluster Discovery, Annotation and Visualization
Description
gclink()
performs all steps from raw blastp output to a publication-ready
gene-cluster plot in one call. It dynamically adapts its workflow:
If
in_seq_data
is provided: Extracts coordinates/sequences, merges data, and generates plots.If
in_seq_data
isNULL
: Skips sequence-dependent steps, returning only the cluster table.If
in_GC_group
isNULL
: Omits functional-group merging and plotting.Supports custom gene lists (e.g., photosynthesis, viral genes) via
in_gene_list
for universal cluster detection in bacteria, archaea, or phages.
Gene clusters are identified via a density threshold (AllGeneNum
and MinConSeq
) due to
incomplete gene annotation coverage.
Usage
gclink(
in_blastp_df = blastp_df,
in_seq_data = seq_data,
in_gene_list = photosynthesis_gene_list,
in_GC_group = PGC_group,
AllGeneNum = 50,
MinConSeq = 25,
apply_length_filter = FALSE,
down_IQR = 10,
up_IQR = 10,
apply_evalue_filter = FALSE,
min_evalue = 1,
apply_score_filter = FALSE,
min_score = 10,
orf_before_first = 0,
orf_after_last = 0,
orf_range = "All",
levels_gene_group = c("bch", "puh", "puf", "crt", "acsF", "assembly", "regulator",
"hypothetical ORF"),
color_theme = c("#3BAA51", "#6495ED", "#DD2421", "#EF9320", "#F8EB00", "#FF0683",
"#956548", "grey"),
genome_subset = NULL
)
Arguments
in_blastp_df |
A
eggNOG (evolutionary gene genealogy Nonsupervised Orthologous Groups) format results
are supported by renaming annotation columns (e.g., |
in_seq_data |
Example:
seq_data <- Biostrings::readBStringSet("Prodigal.fasta",format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE) %>% data.frame(Sequence = .) %>% tibble::rownames_to_column("SeqName")
|
in_gene_list |
Character vector of reference genes for cluster detection.
Default: |
in_GC_group |
|
AllGeneNum |
Integer; max ORFs per cluster. Default: |
MinConSeq |
Integer; min consecutive reference genes per cluster. Default: |
apply_length_filter |
Logical; whether to apply length filtering based on IQR.
If |
down_IQR |
Numeric; lower-bound scale factor for IQR length
filtering (see |
up_IQR |
Numeric; upper-bound scale factor for IQR length
filtering (see |
apply_evalue_filter |
Logical; whether to filter BLAST hits by e-value cutoff.
Requires column |
min_evalue |
Numeric; maximum e-value threshold for retaining BLAST hits.
Hits with e-value > cutoff will be removed.
Typical values: 1e-3 (loose) to 1e-10 (strict).
Default: |
apply_score_filter |
Logical; whether to filter BLAST hits by bitscore cutoff.
Requires column |
min_score |
Numeric; minimum bitscore threshold for retaining BLAST hits.
Hits with bitscore < cutoff will be removed.
For short proteins (~100 aa), 30-40 is typical;
for long proteins (>300 aa), 50-100 is recommended.
Default: |
orf_before_first |
Integer; number of hypothetical ORFs to insert
before the first gene of each detected cluster.
Useful when the upstream annotation is incomplete or
low-confidence; insertion is bounded by the first ORF
of the contig. Default: |
orf_after_last |
Integer; number of hypothetical ORFs to append
after the last gene of each detected cluster.
Helpful when the downstream annotation is incomplete
or low-confidence; insertion is bounded by the last
ORF of the contig. Default: |
orf_range |
Character. ORF inclusion mode:
|
levels_gene_group |
Character vector; factor levels for gene-group
legends (must include "hypothetical ORF" in case
some genes remain unclassified).
Ignored if |
color_theme |
Character vector; colours for |
genome_subset |
Character vector or |
Value
A named list with:
GC_meta
Annotated cluster table (
data.frame
).GC_seq
FASTA sequences (if
in_seq_data
provided).GC_plot
ggplot object (if
in_seq_data
andin_GC_group
provided).
Examples
# Case 1: Using blastp result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster)
data(blastp_df)
data(seq_data)
data(photosynthesis_gene_list)
data(PGC_group)
gc_list <- gclink(in_blastp_df = blastp_df,
in_seq_data = seq_data,
in_gene_list = photosynthesis_gene_list,
in_GC_group = PGC_group,
AllGeneNum = 50,
MinConSeq = 25,
apply_length_filter = TRUE,
down_IQR = 10,
up_IQR = 10,
orf_before_first = 0,
orf_after_last = 0,
levels_gene_group = c('bch','puh','puf','crt','acsF','assembly','regulator',
'hypothetical ORF'),
color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320','#F8EB00',
'#FF0683','#956548','grey'),
genome_subset = NULL)
gc_meta = gc_list[["GC_meta"]]
gc_seq = gc_list[["GC_seq"]]
gc_plot = gc_list[["GC_plot"]]
head(gc_meta)
head(gc_seq)
print(gc_plot)
# Case 2: Using eggNOG result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster)
data(eggnog_df)
data(seq_data)
data(KO_group)
KOs = c("K02291","K09844","K20611","K13789",
"K09846","K08926","K08927","K08928",
"K08929","K13991","K04035","K04039",
"K11337","K03404","K11336","K04040",
"K03403","K03405","K04037","K03428",
"K04038","K06049","K10960","K11333",
"K11334","K11335","K08226","K08226",
"K09773")
rename_KOs = paste0("ko:", KOs)
eggnog_df$qaccver = eggnog_df$`#query`
eggnog_df$saccver = eggnog_df$KEGG_ko
eggnog_df$evalue = eggnog_df$evalue
eggnog_df$bitscore = eggnog_df$score
eggnog_df$gene = eggnog_df$KEGG_ko
gc_list_2 = gclink(in_blastp_df = eggnog_df,
in_seq_data = seq_data,
in_gene_list = rename_KOs,
in_GC_group = KO_group,
AllGeneNum = 50,
MinConSeq = 25,
apply_evalue_filter = FALSE,
min_evalue = 1,
apply_score_filter = TRUE,
min_score = 10,
orf_before_first = 1,
orf_after_last = 1,
levels_gene_group = c('bch','puh','puf','crt',
'acsF','assembly','hypothetical ORF'),
color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320',
'#F8EB00','#FF0683','grey'))
gc_meta_2 = gc_list_2[["GC_meta"]]
gc_seq_2 = gc_list_2[["GC_seq"]]
gc_plot_2 = gc_list_2[["GC_plot"]]
head(gc_meta_2)
head(gc_seq_2)
print(gc_plot_2)
# SOLUTION FOR "Viewport has zero dimension(s)" ERROR in print(gc_plot)
# --------------------------------------------------
# When you see this error, try these 3 solutions:
# 1. RESIZE RSTUDIO PLOT PANEL (Quickest fix)
# Simply click and drag the right border of the plot panel wider
# 2. LAUNCH IN NEW WINDOW
# dev.new()
# print(gc_plot)
# 3. SAVE DIRECTLY TO FILE
# ggplot2::ggsave('gc_plot.pdf', gc_plot, width=20,height=3)
Remove Length Outliers from BLAST Results
Description
Filters BLAST hits by removing ORFs whose gene (protein) length
is an outlier within the corresponding gene group, as defined
by the inter-quartile range (IQR). Hits whose length falls
outside the interval
[Q1 - down_IQR * IQR, Q3 + up_IQR * IQR]
are discarded.
Usage
length_filter(Data = bin_genes, down_IQR = 1.5, up_IQR = 1.5)
Arguments
Data |
A data frame containing BLAST results. Must include the columns
|
down_IQR |
Numeric multiplier applied to the IQR for the lower bound (default: 1.5). |
up_IQR |
Numeric multiplier applied to the IQR for the upper bound (default: 1.5). |
Details
Filtering is performed within each gene group; outliers are determined independently for every gene symbol.
Progress messages report the number of rows before and after filtering.
Missing values in
length
are ignored when computing quantiles.
Value
The input data frame with outlier rows removed. The returned object is ungrouped regardless of the input grouping.
Extract ORF and Genome Information from BLAST or BLASTP Results
Description
This function parses BLASTP result tables to extract structured genome, contig, ORF, and gene information from the query and subject identifiers. It is designed for downstream analyses requiring explicit separation of genome, contig, and ORF identifiers from concatenated BLAST headers.
Usage
orf_extract(bin_genes = blastp_df)
Arguments
bin_genes |
A data frame containing BLASTP results with at least 2 standard columns:
|
Value
The original data frame with six additional columns:
genome
Genome identifier extracted from
qaccver
.contig
Contig identifier extracted from
qaccver
.orf
Full ORF identifier extracted from
qaccver
.genome_contig
Concatenated genome and contig IDs (genome—contig).
gene
Gene symbol extracted from
saccver
.orf_position
Numeric ORF position extracted from the ORF identifier.
Parse ORF Coordinates from Prodigal FASTA Headers
Description
Extracts ORF identifiers, start/end positions and strand orientation directly from the FASTA headers produced by Prodigal. The resulting table is ready for downstream gene-cluster analyses.
Usage
orf_locate(in_seq_data = seq_data)
Arguments
in_seq_data |
A data frame with two columns:
Example:
seq_data <- Biostrings::readBStringSet("Prodigal.fasta",format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE) %>% data.frame(Sequence = .) %>% tibble::rownames_to_column("SeqName") |
Value
A data frame
Photosynthesis Gene List
Description
A comprehensive list of genes involved in photosynthetic processes.
Usage
photosynthesis_gene_list
Format
A character vector containing gene identifiers:
Each element represents a photosynthesis-related gene symbol (e.g., "acsF", "bchB").
Genomic Sequence Data with Annotations
Description
A dataset containing DNA sequences from test bacteria with detailed annotation metadata. The first column combines multiple annotation elements separated by semicolons.
Usage
seq_data
Format
A data frame with multiple rows and 2 variables:
- SeqName
Character. Combined annotation fields separated by semicolons, containing:
-
ID
: Sequence identifier (e.g., "1_7") -
partial
: Completion status ("00" for complete, "01" for partial) -
start_type
: Translation initiation codon (e.g., "GTG", "ATG") -
rbs_motif
: Ribosome binding site motif (e.g., "GGAG/GAGG") -
rbs_spacer
: RBS spacer length (e.g., "5-10bp") -
gc_cont
: GC content (e.g., "0.673")
-
- Sequence
Character. DNA sequence (when available) in FASTA format