---
title: "Subtitle Text Analysis with subtools"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Subtitle Text Analysis with subtools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)
library(subtools)
```
## Overview
`subtools` reads and manipulates video subtitle files from a variety of formats
(SubRip `.srt`, WebVTT `.vtt`, SubStation Alpha `.ass`/`.ssa`, SubViewer `.sub`,
MicroDVD `.sub`) and exposes them as tidy tibbles ready for text analysis.
This vignette walks through:

1. Reading subtitle files
2. Exploring subtitle objects
3. Cleaning subtitle text
4. Combining subtitles from multiple files
5. Reading an entire series
6. Adjusting timecodes
7. Writing subtitles back to disk
8. Text analysis with `tidytext`
9. Cross-episode analysis

---
## 1. Reading subtitles
### From a file
`read_subtitles()` is the main entry point. It auto-detects the file format from
the extension and returns a `subtitles` object — a `tibble` with four core
columns: `ID`, `Timecode_in`, `Timecode_out`, and `Text_content`.
```{r read-srt}
f_srt <- system.file("extdata", "ex_subrip.srt", package = "subtools")
subs <- read_subtitles(file = f_srt)
subs
```
The same call works for every supported format. Use `format = "auto"` (default)
or supply the format explicitly.
```{r read-vtt}
f_vtt <- system.file("extdata", "ex_webvtt.vtt", package = "subtools")
read_subtitles(file = f_vtt, format = "webvtt")
```
```{r read-ass}
f_ass <- system.file("extdata", "ex_substation.ass", package = "subtools")
read_subtitles(file = f_ass, format = "substation")
```
### Attaching metadata at read time
Any descriptive information — season, episode, source, language — can be
attached as a one-row tibble via the `metadata` argument. The values are
repeated for every subtitle line, keeping the tidy structure intact.
```{r metadata}
subs_meta <- read_subtitles(
  file = f_srt,
  metadata = tibble::tibble(Season = 1L, Episode = 3L, Language = "en")
)
subs_meta
```
Metadata columns travel with the object through all `subtools` operations.
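As a quick check of this claim (using `clean_tags()` and `move_subtitles()`, both covered later in this vignette), the `Season`, `Episode`, and `Language` columns added above are still present after typical operations:
```{r metadata-persist}
# Metadata columns survive cleaning and timecode shifts
subs_meta |>
  clean_tags() |>
  move_subtitles(lag = 1) |>
  names()
```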
### From a character vector
`as_subtitle()` parses an in-memory character vector, which is useful when the
subtitle text is already loaded or generated programmatically.
```{r as-subtitle}
raw <- c(
  "1",
  "00:00:01,000 --> 00:00:03,500",
  "Hello, world.",
  "",
  "2",
  "00:00:04,000 --> 00:00:06,000",
  "This is subtools."
)
as_subtitle(x = raw, format = "srt")
```
---
## 2. Exploring the subtitles object
### Quick summary
`get_subtitles_info()` prints a compact summary: line count, overall duration,
and attached metadata fields.
```{r info}
s <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools")
)
get_subtitles_info(x = s)
```
### Raw text extraction
`get_raw_text()` collapses all subtitle lines into a single character string,
which is useful when passing the whole transcript to external natural language
processing (NLP) tools.
```{r raw-text}
transcript <- get_raw_text(x = s)
transcript
# One line per subtitle, separated by newlines
cat(get_raw_text(x = s, collapse = "\n"))
```
### Accessing individual columns
Because a `subtitles` object is a tibble, all `dplyr` verbs work directly:
```{r dplyr}
library(dplyr)

# Lines spoken after the first 30 seconds
s |>
  filter(Timecode_in > hms::as_hms("00:00:30"))

# Duration of each subtitle cue (in seconds)
s |>
  mutate(duration_s = as.numeric(Timecode_out - Timecode_in)) |>
  select(ID, Text_content, duration_s)
```
---
## 3. Cleaning subtitles
Subtitle files frequently contain formatting tags, closed-caption descriptions,
and other non-speech artefacts that should be removed before text analysis.
### Remove formatting tags
`clean_tags()` strips HTML-style tags (used in SRT and WebVTT) and curly-brace
override blocks (used in SubStation Alpha).
```{r clean-tags}
tagged <- as_subtitle(
  x = c(
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "This is <b>important</b>.",
    "",
    "2",
    "00:00:04,000 --> 00:00:06,000",
    "<i>Warning!</i>"
  ),
  format = "srt",
  clean.tags = FALSE # keep tags so we can demonstrate cleaning
)
tagged$Text_content
clean_tags(x = tagged)$Text_content
```
### Remove closed captions
`clean_captions()` removes text enclosed in parentheses or square brackets —
typically sound descriptions and speaker identifiers used in accessibility
captions.
```{r clean-captions}
bb <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  clean.tags = FALSE
)
bb$Text_content
clean_captions(x = bb)$Text_content
```
### Remove arbitrary patterns
`clean_patterns()` accepts any regular expression, giving full flexibility for
project-specific cleaning.
```{r clean-patterns}
# Remove speaker labels such as "WALTER:" or "JESSE:"
s_labeled <- as_subtitle(
  x = c(
    "1", "00:00:01,000 --> 00:00:03,000", "WALTER: We need to cook.",
    "",
    "2", "00:00:04,000 --> 00:00:06,000", "JESSE: Yeah, Mr. White!"
  ),
  format = "srt", clean.tags = FALSE
)
clean_patterns(x = s_labeled, pattern = "^[A-Z]+: ")$Text_content
```
### Chaining cleaning steps
Because each cleaning function returns a `subtitles` object, steps can be piped:
```{r clean-chain}
s_clean <- read_subtitles(file = f_srt, clean.tags = FALSE) |>
  clean_tags() |>
  clean_captions() |>
  clean_patterns(pattern = "^-\\s*") # remove leading dialogue dashes
s_clean$Text_content
```
---
## 4. Combining subtitles
### Collapsing multiple objects into one
`bind_subtitles()` merges any number of `subtitles` (or `multisubtitles`)
objects. With `collapse = TRUE` (default), timecodes are shifted so that each
file follows the previous one sequentially.
```{r bind-collapse}
s1 <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
s2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)
combined <- bind_subtitles(s1, s2)
nrow(combined)
range(combined$Timecode_in)
```
### Keeping a list structure
Set `collapse = FALSE` to get a `multisubtitles` object — a named list of
`subtitles` — when you want to process episodes independently before merging.
```{r bind-list}
multi <- bind_subtitles(s1, s2, collapse = FALSE)
class(multi)
print(multi)
```
`get_subtitles_info()` also works on `multisubtitles`:
```{r info-multi}
get_subtitles_info(x = multi)
```
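Because a `multisubtitles` object is an ordinary list underneath, its elements can also be modified independently and then merged. A sketch (not run; it assumes standard list replacement with `[[<-`, which is not a dedicated `subtools` feature):
```{r multi-elementwise, eval=FALSE}
# Resync only the second episode, then merge the list into one object
multi[[2]] <- move_subtitles(x = multi[[2]], lag = 3)
bind_subtitles(multi[[1]], multi[[2]])
```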
---
## 5. Reading an entire series
For TV series organised in a standard directory tree, `subtools` provides
convenience readers that handle the hierarchy automatically and extract
Season/Episode metadata from folder and file names.
```
Series_Collection/
|-- BreakingBad/
|   |-- Season_01/
|   |   |-- S01E01.srt
|   |   |-- S01E02.srt
|   |-- Season_02/
|   |   |-- S02E01.srt
```
```{r read-series-demo, eval=FALSE}
# Read a single season
season1 <- read_subtitles_season(dir = "BreakingBad/Season_01/")
# Read an entire series (all seasons)
bb_all <- read_subtitles_serie(dir = "BreakingBad/")
# Read multiple series at once
collection <- read_subtitles_multiseries(dir = "Series_Collection/")
```
Each function returns a single collapsed `subtitles` object by default
(`bind = TRUE`), with `Serie`, `Season`, and `Episode` columns populated from
the directory structure. Pass `bind = FALSE` to get a `multisubtitles` list
instead.

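Continuing the not-run example above, the list variant lets each episode be inspected one at a time (paths as in the tree shown earlier):
```{r read-series-list, eval=FALSE}
# One `subtitles` object per episode instead of a single bound table
bb_episodes <- read_subtitles_serie(dir = "BreakingBad/", bind = FALSE)
# Summarise each episode independently
lapply(bb_episodes, get_subtitles_info)
```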
---
## 6. Adjusting timecodes
`move_subtitles()` shifts all timecodes by a fixed number of seconds. Positive
values shift forward; negative values shift backward. This is useful when the
subtitle file is out of sync with the video.
```{r move}
subs_shifted <- move_subtitles(x = subs, lag = 2.5)
# Compare first cue before and after
subs$Timecode_in[1]
subs_shifted$Timecode_in[1]
```
`move_subtitles()` also works on `multisubtitles`:
```{r move-multi}
multi_shifted <- move_subtitles(x = multi, lag = -1.0)
multi_shifted[[1]]$Timecode_in[1]
```
---
## 7. Writing subtitles back to disk
`write_subtitles()` serialises a `subtitles` object to a SubRip `.srt` file.
```{r write, eval=FALSE}
write_subtitles(x = subs_shifted, file = "synced_episode.srt")
```
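A quick round-trip check (not run, since it writes to disk): re-reading the written file should reproduce the same cues.
```{r roundtrip, eval=FALSE}
write_subtitles(x = subs_shifted, file = "synced_episode.srt")
# Re-read and compare cue counts
resynced <- read_subtitles(file = "synced_episode.srt")
nrow(resynced) == nrow(subs_shifted)
```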
---
## 8. Text analysis with tidytext
### Tokenising into words
`unnest_tokens()` extends `tidytext::unnest_tokens()` with subtitle-aware
timecode remapping: each token inherits a proportional slice of the original
cue's time window, enabling timeline-based analyses.
```{r unnest-words}
words <- unnest_tokens(tbl = subs)
words
```
The `Timecode_in` / `Timecode_out` columns now reflect the estimated position
of each word within its cue.
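The remapping can be inspected directly: each word's time window should fall inside, and be no wider than, its original cue.
```{r word-windows}
# Width of each word's remapped window, in seconds
words |>
  dplyr::mutate(window_s = as.numeric(Timecode_out - Timecode_in)) |>
  head(5)
```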
### Tokenising into n-grams
```{r unnest-ngrams}
# Bigrams
bigrams <- unnest_tokens(tbl = subs, output = Word, input = Text_content,
                         token = "ngrams", n = 2)
bigrams$Word
```
### Word frequency
```{r word-freq}
library(dplyr)
words |>
  count(Text_content, sort = TRUE) |>
  head(10)
```
---
## 9. Advanced: cross-episode analysis
The metadata columns added at read time make it straightforward to compare
episodes or seasons. The example below simulates a two-episode corpus and
computes per-episode word counts — a pattern that scales directly to a full
series loaded with `read_subtitles_serie()`.
```{r cross-episode}
ep1 <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
ep2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)
ep3 <- read_subtitles(
  file = system.file("extdata", "ex_webvtt.vtt", package = "subtools"),
  metadata = tibble::tibble(Episode = 3L)
)
corpus <- bind_subtitles(ep1, ep2, ep3)

token_counts <- unnest_tokens(corpus) |>
  count(Episode, Text_content, sort = TRUE)

token_counts |>
  slice_max(n, n = 5, by = Episode)
```
### TF-IDF across episodes
TF-IDF highlights words that are distinctive to each episode compared with the
rest of the corpus.
```{r tfidf}
token_counts |>
  tidytext::bind_tf_idf(Text_content, Episode, n) |>
  arrange(Episode, desc(tf_idf)) |>
  slice_max(tf_idf, n = 5, by = Episode)
```
### Dialogue timeline
Because timecodes are preserved through `unnest_tokens()`, words can be plotted
along a timeline, e.g. to visualise how vocabulary density evolves across a
film.
```{r timeline, fig.width = 7, fig.height = 3}
words_ep1 <- unnest_tokens(tbl = ep1) |>
  mutate(minute = as.numeric(Timecode_in) / 60)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(words_ep1, aes(x = minute)) +
    geom_histogram(binwidth = 0.5, fill = "steelblue", colour = "white") +
    labs(
      title = "Word density over time",
      x = "Time (minutes)",
      y = "Word count"
    ) +
    theme_minimal()
}
```
---
## Summary
| Task | Function |
|------|----------|
| Read a subtitle file | `read_subtitles()` |
| Parse in-memory text | `as_subtitle()` |
| Read a full season/series | `read_subtitles_season()` / `read_subtitles_serie()` / `read_subtitles_multiseries()` |
| Print a summary | `get_subtitles_info()` |
| Extract plain text | `get_raw_text()` |
| Remove HTML/ASS tags | `clean_tags()` |
| Remove closed captions | `clean_captions()` |
| Remove custom patterns | `clean_patterns()` |
| Merge subtitle objects | `bind_subtitles()` |
| Shift timecodes | `move_subtitles()` |
| Write to `.srt` | `write_subtitles()` |
| Tokenise (words, n-grams, …) | `unnest_tokens()` |