subtools reads and manipulates video subtitle files from
a variety of formats (SubRip .srt, WebVTT
.vtt, SubStation Alpha .ass/.ssa,
SubViewer .sub, MicroDVD .sub) and exposes
them as tidy tibbles ready for text analysis.
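This vignette walks through reading subtitle files, inspecting and cleaning them, combining several files, adjusting timecodes, writing them back to disk, and tokenising the text for analysis with tidytext.
read_subtitles() is the main entry point. It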
auto-detects the file format from the extension and returns a
subtitles object — a tibble with four core
columns: ID, Timecode_in,
Timecode_out, and Text_content.
f_srt <- system.file("extdata", "ex_subrip.srt", package = "subtools")
subs <- read_subtitles(file = f_srt)
subs
#> # A tibble: 6 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 1 00'22.5" 00'24.1" Lorem ipsum dolor sit amet, consectetur adipis…
#> 2 2 00'25.5" 00'27.1" Donec eu nisl commodo, elementum dui ut, gravi…
#> 3 3 00'28.7" 00'29.1" Nulla aliquam,
#> 4 4 00'29.9" 00'31.1" nibh cursus interdum volutpat,
#> 5 5 00'31.9" 00'33.1" dolor lacus hendrerit tellus, vel faucibus jus…
#> 6 6 00'33.9" 00'34.8" Suspendisse potenti.
The same call works for every supported format. Use
format = "auto" (default) or supply the format
explicitly.
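Extension-based detection can be pictured as a simple lookup. The base R sketch below is only an illustration: `ext_formats` and `guess_format()` are hypothetical names, the format labels are placeholders, and none of it is taken from the subtools source. It also shows why an explicit format can be worth supplying, since two of the supported formats share the .sub extension.

```r
# Hypothetical lookup table: extension -> candidate format labels.
# The labels are illustrative and are not taken from the subtools source.
ext_formats <- list(
  srt = "subrip",
  vtt = "webvtt",
  ass = "substation",
  ssa = "substation",
  sub = c("subviewer", "microdvd")  # ambiguous: two formats share .sub
)

guess_format <- function(path) {
  # Compare extensions case-insensitively.
  ext <- tolower(tools::file_ext(path))
  candidates <- ext_formats[[ext]]
  if (is.null(candidates)) {
    stop("Unrecognised subtitle extension: ", ext)
  }
  candidates
}

guess_format("episode.SRT")
guess_format("movie.sub")
```

When the lookup returns more than one candidate, as for .sub, the extension alone cannot settle the format.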
f_vtt <- system.file("extdata", "ex_webvtt.vtt", package = "subtools")
read_subtitles(file = f_vtt, format = "webvtt")
#> # A tibble: 3 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 1 00'01" 00'04" Never drink liquid nitrogen.
#> 2 2 00'05" 00'09" — It will perforate your stomach. — …
#> 3 A dangerous cue 00'11" 00'14" Dès Noël où un zéphyr haï me vêt de …
f_ass <- system.file("extdata", "ex_substation.ass", package = "subtools")
read_subtitles(file = f_ass, format = "substation")
#> # A tibble: 6 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 1 00'22.5" 00'24.1" Lorem ipsum dolor sit amet, consectetur adipis…
#> 2 2 00'25.5" 00'27.1" Donec eu nisl commodo, elementum dui ut, gravi…
#> 3 3 00'28.7" 00'29.1" Nulla aliquam,
#> 4 4 00'29.9" 00'31.1" nibh cursus interdum volutpat,
#> 5 5 00'31.9" 00'33.1" dolor lacus hendrerit tellus, vel faucibus jus…
#> 6 6 00'33.9" 00'34.8" Suspendisse potenti.
Any descriptive information (season, episode, source, language) can
be attached as a one-row tibble via the metadata argument.
The values are repeated for every subtitle line, keeping the tidy
structure intact.
subs_meta <- read_subtitles(
file = f_srt,
metadata = tibble::tibble(Season = 1L, Episode = 3L, Language = "en")
)
subs_meta
#> # A tibble: 6 × 7
#> ID Timecode_in Timecode_out Text_content Season Episode Language
#> <chr> <time> <time> <chr> <int> <int> <chr>
#> 1 1 00'22.5" 00'24.1" Lorem ipsum dolor sit … 1 3 en
#> 2 2 00'25.5" 00'27.1" Donec eu nisl commodo,… 1 3 en
#> 3 3 00'28.7" 00'29.1" Nulla aliquam, 1 3 en
#> 4 4 00'29.9" 00'31.1" nibh cursus interdum v… 1 3 en
#> 5 5 00'31.9" 00'33.1" dolor lacus hendrerit … 1 3 en
#> 6 6 00'33.9" 00'34.8" Suspendisse potenti. 1 3 en
Metadata columns travel with the object through all
subtools operations.
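The recycling of a one-row metadata table across every cue can be reproduced in base R. This is a minimal sketch on a toy data frame, not the code subtools runs internally; `cues`, `meta`, and `tagged` are illustrative names.

```r
# Toy stand-in for a subtitles table (three cues).
cues <- data.frame(
  ID = c("1", "2", "3"),
  Text_content = c("Hello.", "How are you?", "Fine."),
  stringsAsFactors = FALSE
)

# One-row metadata, shaped like the value passed to the metadata argument.
meta <- data.frame(Season = 1L, Episode = 3L, Language = "en",
                   stringsAsFactors = FALSE)

# Repeat the single metadata row once per cue, then bind column-wise.
meta_rep <- meta[rep(1L, nrow(cues)), , drop = FALSE]
rownames(meta_rep) <- NULL
tagged <- cbind(cues, meta_rep)
tagged
```

Because every row carries its own copy of the metadata, later filtering or binding never separates a line from its episode.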
as_subtitle() parses an in-memory character vector,
which is useful when the subtitle text is already loaded or generated
programmatically.
raw <- c(
"1",
"00:00:01,000 --> 00:00:03,500",
"Hello, world.",
"",
"2",
"00:00:04,000 --> 00:00:06,000",
"This is subtools."
)
as_subtitle(x = raw, format = "srt")
#> # A tibble: 2 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 1 00'01" 00'03.5" Hello, world.
#> 2 2 00'04" 00'06.0" This is subtools.
get_subtitles_info() prints a compact summary: line
count, overall duration, and attached metadata fields.
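The three quantities in that summary are easy to derive by hand. The sketch below is an assumption about what such a summary might compute, not the function's actual implementation: it works on a toy data frame with timecodes stored as plain seconds and treats "overall duration" as the span from the first cue's start to the last cue's end.

```r
# Toy cues with timecodes stored as seconds (an assumption for this sketch).
cues <- data.frame(
  Timecode_in  = c(22.5, 25.5, 28.7),
  Timecode_out = c(24.1, 27.1, 29.1),
  Season = 1L
)

# The four core columns of a subtitles object; anything else is metadata.
core_cols <- c("ID", "Timecode_in", "Timecode_out", "Text_content")

summary_info <- list(
  n_lines     = nrow(cues),
  duration_s  = max(cues$Timecode_out) - min(cues$Timecode_in),
  meta_fields = setdiff(names(cues), core_cols)
)
str(summary_info)
```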
get_raw_text() collapses all subtitle lines into a
single character string, useful when passing the whole transcript to
external Natural Language Processing tools.
s <- read_subtitles(file = f_srt)
transcript <- get_raw_text(x = s)
transcript
#> [1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eu nisl commodo, elementum dui ut, gravida orci. Nulla aliquam, nibh cursus interdum volutpat, dolor lacus hendrerit tellus, vel faucibus justo nisi quis felis. Suspendisse potenti."
# One line per subtitle, separated by newlines
cat(get_raw_text(x = s, collapse = "\n"))
#> Lorem ipsum dolor sit amet, consectetur adipiscing elit.
#> Donec eu nisl commodo, elementum dui ut, gravida orci.
#> Nulla aliquam,
#> nibh cursus interdum volutpat,
#> dolor lacus hendrerit tellus, vel faucibus justo nisi quis felis.
#> Suspendisse potenti.
Because a subtitles object is a tibble, all
dplyr verbs work directly:
library(dplyr)
# Lines spoken after the first 30 seconds
s |>
filter(Timecode_in > hms::as_hms("00:00:30"))
#> # A tibble: 2 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 5 00'31.9" 00'33.1" dolor lacus hendrerit tellus, vel faucibus jus…
#> 2 6 00'33.9" 00'34.8" Suspendisse potenti.
# Duration of each subtitle cue (in seconds)
s |>
mutate(duration_s = as.numeric(Timecode_out - Timecode_in)) |>
select(ID, Text_content, duration_s)
#> # A tibble: 6 × 3
#> ID Text_content duration_s
#> <chr> <chr> <dbl>
#> 1 1 Lorem ipsum dolor sit amet, consectetur adipiscing elit. 1.60
#> 2 2 Donec eu nisl commodo, elementum dui ut, gravida orci. 1.60
#> 3 3 Nulla aliquam, 0.400
#> 4 4 nibh cursus interdum volutpat, 1.20
#> 5 5 dolor lacus hendrerit tellus, vel faucibus justo nisi quis f… 1.20
#> 6 6 Suspendisse potenti. 0.900
Subtitle files frequently contain formatting tags, closed-caption descriptions, and other non-speech artefacts that should be removed before text analysis.
clean_captions() removes text enclosed in parentheses or
square brackets — typically sound descriptions and speaker identifiers
used in accessibility captions.
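The effect can be approximated with a single regular expression in base R. This is a rough sketch of the idea on a plain character vector, not the pattern clean_captions() actually uses. The third string is made-up sample text.

```r
text <- c("Shit.", "[SIRENS WAILING IN DISTANCE]", "(sighs) Oh, God.")

# Drop anything between (...) or [...], then trim leftover whitespace.
stripped <- trimws(gsub("\\([^)]*\\)|\\[[^]]*\\]", "", text))

# Discard cues that are empty once the caption is gone.
stripped[nzchar(stripped)]
```

The negated character classes keep each match inside one pair of brackets, so two captions on the same line do not swallow the speech between them.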
bb <- read_subtitles(
file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
clean.tags = FALSE
)
bb$Text_content
#> [1] "Oh, my God. Christ!"
#> [2] "Shit."
#> [3] "[SIRENS WAILING IN DISTANCE]"
#> [4] "Oh, God. Oh, my God."
#> [5] "Oh, my God. Oh, my God. Think, think, think."
clean_captions(x = bb)$Text_content
#> [1] "Oh, my God. Christ!"
#> [2] "Shit."
#> [3] "Oh, God. Oh, my God."
#> [4] "Oh, my God. Oh, my God. Think, think, think."
clean_patterns() accepts any regular expression, giving
full flexibility for project-specific cleaning.
# Remove speaker labels such as "WALTER:" or "JESSE:"
s_labeled <- as_subtitle(
x = c(
"1", "00:00:01,000 --> 00:00:03,000", "WALTER: We need to cook.",
"",
"2", "00:00:04,000 --> 00:00:06,000", "JESSE: Yeah, Mr. White!"
),
format = "srt", clean.tags = FALSE
)
clean_patterns(x = s_labeled, pattern = "^[A-Z]+: ")$Text_content
#> [1] "We need to cook." "Yeah, Mr. White!"
Because each cleaning function returns a subtitles
object, steps can be piped:
s_clean <- read_subtitles(file = f_srt, clean.tags = FALSE) |>
clean_tags() |>
clean_captions() |>
clean_patterns(pattern = "^-\\s*") # remove leading dialogue dashes
s_clean$Text_content
#> [1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
#> [2] "Donec eu nisl commodo, elementum dui ut, gravida orci."
#> [3] "Nulla aliquam,"
#> [4] "nibh cursus interdum volutpat,"
#> [5] "dolor lacus hendrerit tellus, vel faucibus justo nisi quis felis."
#> [6] "Suspendisse potenti."
bind_subtitles() merges any number of
subtitles (or multisubtitles) objects. With
collapse = TRUE (default), timecodes are shifted so that
each file follows the previous one sequentially.
s1 <- read_subtitles(
file = system.file("extdata", "ex_subrip.srt", package = "subtools"),
metadata = tibble::tibble(Episode = 1L)
)
s2 <- read_subtitles(
file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
metadata = tibble::tibble(Episode = 2L)
)
combined <- bind_subtitles(s1, s2)
nrow(combined)
#> [1] 10
range(combined$Timecode_in)
#> Time differences in secs
#> [1] 22.500 1292.851
Set collapse = FALSE to get a
multisubtitles object (a named list of
subtitles objects) when you want to process episodes independently
before merging.
multi <- bind_subtitles(s1, s2, collapse = FALSE)
class(multi)
#> [1] "multisubtitles"
print(multi)
#> A multisubtitles object with 2 elements
#> subtitles object [[1]]
#> # A tibble: 6 × 5
#> ID Timecode_in Timecode_out Text_content Episode
#> <chr> <time> <time> <chr> <int>
#> 1 1 00'22.5" 00'24.1" Lorem ipsum dolor sit amet, consectetu… 1
#> 2 2 00'25.5" 00'27.1" Donec eu nisl commodo, elementum dui u… 1
#> 3 3 00'28.7" 00'29.1" Nulla aliquam, 1
#> 4 4 00'29.9" 00'31.1" nibh cursus interdum volutpat, 1
#> 5 5 00'31.9" 00'33.1" dolor lacus hendrerit tellus, vel fauc… 1
#> 6 6 00'33.9" 00'34.8" Suspendisse potenti. 1
#>
#>
#> subtitles object [[2]]
#> # A tibble: 4 × 5
#> ID Timecode_in Timecode_out Text_content Episode
#> <chr> <time> <time> <chr> <int>
#> 1 180 21'15.769" 21'23.069" Rushmore deserves an aquarium. A first… 2
#> 2 181 21'23.069" 21'25.670" - I don't know. What do you think, Ern… 2
#> 3 182 21'25.746" 21'32.170" - What kind of fish? - Barracudas. Sti… 2
#> 4 183 21'32.851" 21'36.570" - Piranhas? Really? - Yes, I'm talking… 2
get_subtitles_info() also works on
multisubtitles objects.
For TV series organised in a standard directory tree,
subtools provides convenience readers that handle the
hierarchy automatically and extract Season/Episode metadata from folder
and file names.
Series_Collection/
|-- BreakingBad/
| |-- Season_01/
| | |-- S01E01.srt
| | |-- S01E02.srt
| |-- Season_02/
| |-- S02E01.srt
# Read a single season
season1 <- read_subtitles_season(dir = "BreakingBad/Season_01/")
# Read an entire series (all seasons)
bb_all <- read_subtitles_serie(dir = "BreakingBad/")
# Read multiple series at once
collection <- read_subtitles_multiseries(dir = "Series_Collection/")
Each function returns a single collapsed subtitles
object by default (bind = TRUE), with Serie,
Season, and Episode columns populated from the
directory structure. Pass bind = FALSE to get a
multisubtitles list instead.
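How Season and Episode values can be recovered from names like those in the tree above is sketched below in base R. The regular expression and the use of the grandparent folder as the series name are assumptions made for illustration. The actual readers may rely on different rules.

```r
files <- c("BreakingBad/Season_01/S01E01.srt",
           "BreakingBad/Season_01/S01E02.srt",
           "BreakingBad/Season_02/S02E01.srt")

# Capture the digits after "S" and "E" in the file name.
pattern <- "^S([0-9]+)E([0-9]+)\\.srt$"
name <- basename(files)

meta <- data.frame(
  Serie   = basename(dirname(dirname(files))),  # grandparent folder
  Season  = as.integer(sub(pattern, "\\1", name)),
  Episode = as.integer(sub(pattern, "\\2", name))
)
meta
```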
move_subtitles() shifts all timecodes by a fixed number
of seconds. Positive values shift forward; negative values shift
backward. This is useful when the subtitle file is out of sync with the
video.
subs_shifted <- move_subtitles(x = subs, lag = 2.5)
# Compare first cue before and after
subs$Timecode_in[1]
#> 00:00:22.5
subs_shifted$Timecode_in[1]
#> 00:00:25
move_subtitles() also works on
multisubtitles:
multi_shifted <- move_subtitles(x = multi, lag = -1.0)
multi_shifted[[1]]$Timecode_in[1]
#> 00:00:21.5
write_subtitles() serialises a subtitles
object to a SubRip .srt file.
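For reference, the SubRip layout itself is simple: an index line, a timing line in HH:MM:SS,mmm format, the text, and a blank separator. The base R sketch below builds that layout from plain vectors. `srt_time()` and `to_srt()` are hypothetical helpers written for this illustration and say nothing about the arguments or internals of write_subtitles().

```r
# Format seconds as an SRT timecode, HH:MM:SS,mmm.
srt_time <- function(sec) {
  ms <- round(sec * 1000)
  sprintf("%02d:%02d:%02d,%03d",
          as.integer(ms %/% 3600000),
          as.integer((ms %/% 60000) %% 60),
          as.integer((ms %/% 1000) %% 60),
          as.integer(ms %% 1000))
}

# Build one SRT block per cue: index, timing line, text, blank line.
to_srt <- function(id, t_in, t_out, text) {
  as.vector(rbind(id,
                  paste(srt_time(t_in), "-->", srt_time(t_out)),
                  text,
                  ""))
}

lines <- to_srt(c("1", "2"), c(1, 4), c(3.5, 6),
                c("Hello, world.", "This is subtools."))
writeLines(lines)
```

Passing a file path as the second argument of writeLines() would send the same lines to disk instead of the console.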
unnest_tokens() extends
tidytext::unnest_tokens() with subtitle-aware timecode
remapping: each token inherits a proportional slice of the original
cue’s time window, enabling timeline-based analyses.
words <- unnest_tokens(tbl = subs)
words
#> # A tibble: 35 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 1 00'22.5010" 00'22.6702" lorem
#> 2 1 00'22.6712" 00'22.8404" ipsum
#> 3 1 00'22.8414" 00'23.0106" dolor
#> 4 1 00'23.0116" 00'23.1128" sit
#> 5 1 00'23.1138" 00'23.2489" amet
#> 6 1 00'23.2499" 00'23.6234" consectetur
#> 7 1 00'23.6244" 00'23.9638" adipiscing
#> 8 1 00'23.9648" 00'24.1000" elit
#> 9 2 00'25.5010" 00'25.6860" donec
#> 10 2 00'25.6870" 00'25.7605" eu
#> # ℹ 25 more rows
The Timecode_in / Timecode_out columns now
reflect the estimated position of each word within its cue.
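The end times printed above are consistent with splitting each cue in proportion to word length in characters. The base R sketch below reproduces that arithmetic for the first cue. It is an inference from the output rather than the package's code, `split_cue()` is a hypothetical helper, and it ignores the one-millisecond offset visible on the printed start times.

```r
split_cue <- function(words, t_in, t_out) {
  # Share of the cue assigned to each word, by character count.
  share <- nchar(words) / sum(nchar(words))
  # Each word ends where its cumulative share of the cue is used up.
  ends <- t_in + cumsum(share) * (t_out - t_in)
  # Each word starts where the previous one ended.
  starts <- c(t_in, head(ends, -1))
  data.frame(word = words, start = starts, end = ends)
}

cue1 <- c("lorem", "ipsum", "dolor", "sit", "amet",
          "consectetur", "adipiscing", "elit")
res <- split_cue(cue1, t_in = 22.5, t_out = 24.1)
round(res$end, 4)
```

Longer words such as "consectetur" receive a wider slice of the 1.6-second cue than short ones such as "sit".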
# Bigrams
bigrams <- unnest_tokens(tbl = subs, output = Word, input = Text_content,
token = "ngrams", n = 2)
bigrams$Word
#> [1] "lorem ipsum" "ipsum dolor" "dolor sit"
#> [4] "sit amet" "amet consectetur" "consectetur adipiscing"
#> [7] "adipiscing elit" "donec eu" "eu nisl"
#> [10] "nisl commodo" "commodo elementum" "elementum dui"
#> [13] "dui ut" "ut gravida" "gravida orci"
#> [16] "nulla aliquam" "nibh cursus" "cursus interdum"
#> [19] "interdum volutpat" "dolor lacus" "lacus hendrerit"
#> [22] "hendrerit tellus" "tellus vel" "vel faucibus"
#> [25] "faucibus justo" "justo nisi" "nisi quis"
#> [28] "quis felis" "suspendisse potenti"
The metadata columns added at read time make it straightforward to
compare episodes or seasons. The example below simulates a three-episode
corpus and computes per-episode word counts, a pattern that scales
directly to a full series loaded with
read_subtitles_serie().
ep1 <- read_subtitles(
file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
metadata = tibble::tibble(Episode = 1L)
)
ep2 <- read_subtitles(
file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
metadata = tibble::tibble(Episode = 2L)
)
ep3 <- read_subtitles(
file = system.file("extdata", "ex_webvtt.vtt", package = "subtools"),
metadata = tibble::tibble(Episode = 3L)
)
corpus <- bind_subtitles(ep1, ep2, ep3)
token_counts <- unnest_tokens(corpus) |>
count(Episode, Text_content, sort = TRUE)
token_counts |>
slice_max(n, n = 5, by = Episode)
#> # A tibble: 51 × 3
#> Episode Text_content n
#> <int> <chr> <int>
#> 1 1 god 5
#> 2 1 oh 5
#> 3 1 my 4
#> 4 1 think 3
#> 5 1 christ 1
#> 6 1 distance 1
#> 7 1 in 1
#> 8 1 shit 1
#> 9 1 sirens 1
#> 10 1 wailing 1
#> # ℹ 41 more rows
TF-IDF highlights words that are distinctive to each episode compared with the rest of the corpus.
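To make the weighting concrete, the base R sketch below computes the same three quantities by hand on made-up counts: term frequency as a share of the episode's tokens, inverse document frequency as the natural log of episodes over episodes containing the word, and their product. The data frame is a toy example, not the corpus built above.

```r
# Toy counts: "think" occurs in two of three episodes, the others in one.
counts <- data.frame(
  Episode = c(1L, 1L, 2L, 3L),
  word    = c("god", "think", "think", "nitrogen"),
  n       = c(5L, 3L, 1L, 1L)
)

n_docs <- length(unique(counts$Episode))

# tf: share of the episode's tokens accounted for by the word.
counts$tf <- counts$n / ave(counts$n, counts$Episode, FUN = sum)

# idf: natural log of (episodes / episodes containing the word).
doc_freq <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
counts
```

With three episodes, a word found in only one of them gets an idf of log(3), about 1.10, and a word found in two gets log(3/2), about 0.405.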
token_counts |>
tidytext::bind_tf_idf(Text_content, Episode, n) |>
arrange(Episode, desc(tf_idf)) |>
slice_max(tf_idf, n = 5, by = Episode)
#> # A tibble: 48 × 6
#> Episode Text_content n tf idf tf_idf
#> <int> <chr> <int> <dbl> <dbl> <dbl>
#> 1 1 god 5 0.217 1.10 0.239
#> 2 1 oh 5 0.217 1.10 0.239
#> 3 1 my 4 0.174 1.10 0.191
#> 4 1 think 3 0.130 0.405 0.0529
#> 5 1 christ 1 0.0435 1.10 0.0478
#> 6 1 distance 1 0.0435 1.10 0.0478
#> 7 1 shit 1 0.0435 1.10 0.0478
#> 8 1 sirens 1 0.0435 1.10 0.0478
#> 9 1 wailing 1 0.0435 1.10 0.0478
#> 10 2 aquarium 3 0.0612 1.10 0.0673
#> # ℹ 38 more rows
Because timecodes are preserved through unnest_tokens(),
words can be plotted along a timeline, e.g. to visualise how vocabulary
density evolves across a film.
words_ep1 <- unnest_tokens(tbl = ep1) |>
mutate(minute = as.numeric(Timecode_in) / 60)
if (requireNamespace("ggplot2", quietly = TRUE)) {
library(ggplot2)
ggplot(words_ep1, aes(x = minute)) +
geom_histogram(binwidth = 0.5, fill = "steelblue", colour = "white") +
labs(
title = "Word density over time",
x = "Time (minutes)",
y = "Word count"
) +
theme_minimal()
}

| Task | Function |
|---|---|
| Read a subtitle file | read_subtitles() |
| Parse in-memory text | as_subtitle() |
| Read a full season/series | read_subtitles_season() / read_subtitles_serie() / read_subtitles_multiseries() |
| Print a summary | get_subtitles_info() |
| Extract plain text | get_raw_text() |
| Remove HTML/ASS tags | clean_tags() |
| Remove closed captions | clean_captions() |
| Remove custom patterns | clean_patterns() |
| Merge subtitle objects | bind_subtitles() |
| Shift timecodes | move_subtitles() |
| Write to .srt | write_subtitles() |
| Tokenise (words, n-grams, …) | unnest_tokens() |