
Hi! Here you will find some basic information to get started with
subtools. For more details, you can check the package
documentation.
Subtools is an R package to read, write and manipulate subtitles in R.
This allows the full range of tools offered by the R ecosystem to
be used for the analysis of subtitles. Since version 1.0,
subtools follows the main principles of the tidyverse
and integrates directly with tidytext for a tidy approach
to subtitle text mining.
You can install the package directly from CRAN with
install.packages("subtools") or get the latest development
version with:
remotes::install_github(
  repo = "fkeck/subtools@dev",
  build_manual = TRUE,
  build_vignettes = TRUE
)
library(subtools)
The main goal of subtools is to provide a seamless way to import
subtitle files directly into R. This task can be performed with the
function read_subtitles():
rushmore_sub <- read_subtitles("ex_Rushmore.srt")
oss_sub <- read_subtitles("ex_OSS_117.srt")
rushmore_sub
#> # A tibble: 4 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves an aquarium. A first class a…
#> 2 181 20'48.269" 20'50.870" - I don't know. What do you think, Ernie - Aqu…
#> 3 182 20'50.946" 20'57.370" - What kind of fish? - Barracudas. Stingrays. …
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really? - Yes, I'm talking to a gu…
oss_sub
#> # A tibble: 3 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 264 20'22.967" 20'27.427" Si vous voulez. Ça sera surtout l'occasion de …
#> 2 265 20'30.347" 20'32.297" Et non pas le gratin de pommes de terre.
#> 3 266 20'35.587" 20'37.697" Parce que ça ressemble à carotte, cairote.
The function read_subtitles() returns an object of class
subtitles. This is a simple tibble with at
least four columns (“ID”, “Timecode_in”,
“Timecode_out” and “Text_content”).
Metadata are handled by adding extra columns, which can be used
during the analysis. You can add metadata by adding columns manually
(e.g. using mutate()). You can also provide a 1-row
data.frame of metadata to the function
read_subtitles().
bb_meta <- data.frame(Name = "Breaking Bad", Season = 1, Episode = 1)
bb_sub <- read_subtitles("ex_Breaking_Bad.srt", metadata = bb_meta)
bb_sub
#> # A tibble: 5 × 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <chr> <dbl> <dbl>
#> 1 5 01'09.236" 01'12.780" Oh, my God. Christ! Brea… 1 1
#> 2 6 01'15.993" 01'18.661" Shit. Brea… 1 1
#> 3 7 01'18.829" 01'21.205" [SIRENS WAILING IN DISTAN… Brea… 1 1
#> 4 8 01'24.918" 01'27.378" Oh, God. Oh, my God. Brea… 1 1
#> 5 9 01'27.546" 01'30.840" Oh, my God. Oh, my God. T… Brea… 1 1
If you want to analyse subtitles of series with different seasons and
episodes, you will have to import many files at once. The
read_subtitles_season(),
read_subtitles_serie() and
read_subtitles_multiseries() functions can make your life
much easier, by making it possible to automatically import files and
extract metadata from a structured directory. You can check the manual
for more details.
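To give an idea of what extracting metadata from a structured directory can look like, here is a minimal base R sketch. The directory layout (Breaking_Bad/Season_01/Episode_01.srt) is a hypothetical example chosen for illustration, not necessarily the layout expected by the read_subtitles_*() functions, and the code does not use subtools itself.

```r
# Hypothetical path: series folder / season folder / episode file
path <- "Breaking_Bad/Season_01/Episode_01.srt"

# Split the path into its three components
parts <- strsplit(path, "/", fixed = TRUE)[[1]]

# Keep the folder name as the series name and extract the digits
# from the season and episode components
meta <- data.frame(
  Name = parts[1],
  Season = as.numeric(gsub("[^0-9]", "", parts[2])),
  Episode = as.numeric(gsub("[^0-9]", "", parts[3]))
)
meta
```

The resulting 1-row data.frame has the same shape as the bb_meta object passed to the metadata argument of read_subtitles() earlier.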
Finally, if you have a collection of movies in .mkv format, you can
extract the subtitle tracks of MKV files with
read_subtitles_mkv().
Often, the workflow begins with a cleaning step to get rid of
irrelevant information that might be present in text content. Three
functions can be used for this task. First, clean_tags()
cleans formatting tags. By default, this function is automatically
executed by the read_subtitles*() functions, so you
probably don’t need to run it again. Second,
clean_captions() can be used to remove closed captions,
i.e. descriptions of non-speech elements in parentheses or square
brackets. Finally, clean_patterns() is a more general
function to clean subtitles based on regex pattern matching.
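As an illustration of the kind of pattern matching involved, the following base R sketch removes text enclosed in parentheses or square brackets from a character vector and drops the lines left empty. The example lines and the regular expression are assumptions made for this sketch; they are not taken from the subtools source, and the package functions may behave differently in edge cases.

```r
# Hypothetical subtitle lines, two of them containing captions
text <- c("Shit.", "[DOOR SLAMS]", "Oh, God. (sighs) Oh, my God.")

# Remove anything between [ ] or ( )
cleaned <- gsub("\\[[^]]*\\]|\\([^)]*\\)", "", text)

# Squeeze the whitespace left behind and trim both ends
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))

# Drop lines that are now empty
cleaned <- cleaned[nzchar(cleaned)]
cleaned
```

Here the second line disappears entirely, which mirrors what happens to the caption-only subtitle in the bb_sub example that follows.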
bb_sub
#> # A tibble: 5 × 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <chr> <dbl> <dbl>
#> 1 5 01'09.236" 01'12.780" Oh, my God. Christ! Brea… 1 1
#> 2 6 01'15.993" 01'18.661" Shit. Brea… 1 1
#> 3 7 01'18.829" 01'21.205" [SIRENS WAILING IN DISTAN… Brea… 1 1
#> 4 8 01'24.918" 01'27.378" Oh, God. Oh, my God. Brea… 1 1
#> 5 9 01'27.546" 01'30.840" Oh, my God. Oh, my God. T… Brea… 1 1
bb_sub_clean <- clean_captions(bb_sub)
bb_sub_clean
#> # A tibble: 4 × 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <chr> <dbl> <dbl>
#> 1 5 01'09.236" 01'12.780" Oh, my God. Christ! Brea… 1 1
#> 2 6 01'15.993" 01'18.661" Shit. Brea… 1 1
#> 3 8 01'24.918" 01'27.378" Oh, God. Oh, my God. Brea… 1 1
#> 4 9 01'27.546" 01'30.840" Oh, my God. Oh, my God. T… Brea… 1 1
Sometimes you will need to bind several subtitle objects together.
This can be achieved with the function bind_subtitles().
This function is very similar to bind_rows() from
dplyr (they both bind rows of tibbles), but
bind_subtitles() can also recalculate timecodes to follow
the concatenation order (this can be disabled by setting
sequential to FALSE).
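To make the sequential recalculation concrete, here is a minimal base R sketch of the arithmetic. It assumes that each object is shifted by the latest Timecode_out of the objects bound before it. This is consistent with the output printed in this document but is not taken from the package source. The helper to_seconds() is hypothetical.

```r
# Hypothetical helper: convert minutes and seconds to seconds
to_seconds <- function(min, sec) min * 60 + sec

# Last Timecode_out of rushmore_sub and first Timecode_in of oss_sub
rushmore_last_out <- to_seconds(21, 1.770)
oss_first_in <- to_seconds(20, 22.967)

# Shift the second object by the end of the first one
shifted_in <- oss_first_in + rushmore_last_out

# Format back as minutes'seconds"
label <- sprintf("%02d'%06.3f\"", shifted_in %/% 60, shifted_in %% 60)
label
```

The result, 41'24.737", is the Timecode_in shown for the first OSS 117 line in the bound output.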
bind_subtitles(rushmore_sub, oss_sub, bb_sub_clean)
#> # A tibble: 11 × 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <chr> <dbl> <dbl>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves an aqu… <NA> NA NA
#> 2 181 20'48.269" 20'50.870" - I don't know. What do … <NA> NA NA
#> 3 182 20'50.946" 20'57.370" - What kind of fish? - B… <NA> NA NA
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really? - Ye… <NA> NA NA
#> 5 264 41'24.737" 41'29.197" Si vous voulez. Ça sera … <NA> NA NA
#> 6 265 41'32.117" 41'34.067" Et non pas le gratin de … <NA> NA NA
#> 7 266 41'37.357" 41'39.467" Parce que ça ressemble à… <NA> NA NA
#> 8 5 42'48.703" 42'52.247" Oh, my God. Christ! Brea… 1 1
#> 9 6 42'55.460" 42'58.128" Shit. Brea… 1 1
#> 10 8 43'04.385" 43'06.845" Oh, God. Oh, my God. Brea… 1 1
#> 11 9 43'07.013" 43'10.307" Oh, my God. Oh, my God. … Brea… 1 1
Under certain conditions, some functions can also return a list of
subtitle objects (class multisubtitles). The function
bind_subtitles() can also be used on such an object to bind
its elements into a new subtitle object, i.e. something similar to
do.call(rbind, x).
multi_sub <- bind_subtitles(rushmore_sub, bb_sub_clean, collapse = FALSE, sequential = FALSE)
multi_sub
#> A multisubtitles object with 2 elements
#> subtitles object [[1]]
#> # A tibble: 4 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves an aquarium. A first class a…
#> 2 181 20'48.269" 20'50.870" - I don't know. What do you think, Ernie - Aqu…
#> 3 182 20'50.946" 20'57.370" - What kind of fish? - Barracudas. Stingrays. …
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really? - Yes, I'm talking to a gu…
#>
#>
#> subtitles object [[2]]
#> # A tibble: 4 × 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <chr> <dbl> <dbl>
#> 1 5 01'09.236" 01'12.780" Oh, my God. Christ! Brea… 1 1
#> 2 6 01'15.993" 01'18.661" Shit. Brea… 1 1
#> 3 8 01'24.918" 01'27.378" Oh, God. Oh, my God. Brea… 1 1
#> 4 9 01'27.546" 01'30.840" Oh, my God. Oh, my God. T… Brea… 1 1
bind_subtitles(multi_sub)
#> # A tibble: 8 × 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <chr> <dbl> <dbl>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves an aqua… <NA> NA NA
#> 2 181 20'48.269" 20'50.870" - I don't know. What do y… <NA> NA NA
#> 3 182 20'50.946" 20'57.370" - What kind of fish? - Ba… <NA> NA NA
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really? - Yes… <NA> NA NA
#> 5 5 22'11.006" 22'14.550" Oh, my God. Christ! Brea… 1 1
#> 6 6 22'17.763" 22'20.431" Shit. Brea… 1 1
#> 7 8 22'26.688" 22'29.148" Oh, God. Oh, my God. Brea… 1 1
#> 8 9 22'29.316" 22'32.610" Oh, my God. Oh, my God. T… Brea… 1 1
The tidy text
format as defined by Julia Silge and David Robinson is a table with
one-token-per-row, a token being a meaningful unit of text, such as a
word or a sentence. The objects returned by
read_subtitles*() are in some ways already tidy (each row
being a subtitle block associated with a timecode). However, this unit
is not always the most relevant for data analysis. To perform
tokenization, the tidytext package provides the generic
function unnest_tokens(). The package subtools
adds a new method to unnest_tokens() to handle subtitles
objects. The main difference from the data.frame method is
the ability to remap timecodes according to the
tokenisation process.
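One simple way to think about timecode remapping is to split the duration of a subtitle block between its tokens in proportion to their length in characters. The base R sketch below implements that idea only. The function remap_by_length() is hypothetical, and the actual method in subtools may differ in its details (for instance, the printed output in this document shows a small gap between consecutive tokens, which this sketch does not reproduce).

```r
# Hypothetical helper: share a time interval between tokens
# in proportion to their number of characters
remap_by_length <- function(tokens, t_in, t_out) {
  # Cumulative share of characters reached at the end of each token
  share <- cumsum(nchar(tokens)) / sum(nchar(tokens))
  ends <- t_in + share * (t_out - t_in)
  # Each token starts where the previous one ends
  starts <- c(t_in, head(ends, -1))
  data.frame(token = tokens, t_in = starts, t_out = ends)
}

# Three tokens of 2, 6 and 2 characters sharing a 10-second block
remapped <- remap_by_length(c("aa", "bbbbbb", "cc"), t_in = 0, t_out = 10)
remapped
```

With these inputs the longest token receives six of the ten seconds, and the last token ends exactly at the block's Timecode_out.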
rushmore_sub
#> # A tibble: 4 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves an aquarium. A first class a…
#> 2 181 20'48.269" 20'50.870" - I don't know. What do you think, Ernie - Aqu…
#> 3 182 20'50.946" 20'57.370" - What kind of fish? - Barracudas. Stingrays. …
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really? - Yes, I'm talking to a gu…
unnest_tokens(rushmore_sub)
#> # A tibble: 49 × 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 180 20'40.9700" 20'41.4858" rushmore
#> 2 180 20'41.4868" 20'42.0026" deserves
#> 3 180 20'42.0036" 20'42.1318" an
#> 4 180 20'42.1328" 20'42.6486" aquarium
#> 5 180 20'42.6496" 20'42.7132" a
#> 6 180 20'42.7142" 20'43.0363" first
#> 7 180 20'43.0373" 20'43.3593" class
#> 8 180 20'43.3603" 20'43.8761" aquarium
#> 9 180 20'43.8771" 20'44.1991" where
#> 10 180 20'44.2001" 20'44.8451" scientists
#> # ℹ 39 more rows
unnest_tokens(bb_sub_clean, token = "sentences")
#> # A tibble: 8 × 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <chr> <dbl> <dbl>
#> 1 5 01'09.2370" 01'11.4018" oh, my god. Breaking B… 1 1
#> 2 5 01'11.4028" 01'12.7800" christ! Breaking B… 1 1
#> 3 6 01'15.9940" 01'18.6610" shit. Breaking B… 1 1
#> 4 8 01'24.9190" 01'25.9538" oh, god. Breaking B… 1 1
#> 5 8 01'25.9548" 01'27.3780" oh, my god. Breaking B… 1 1
#> 6 9 01'27.5470" 01'28.4087" oh, my god. Breaking B… 1 1
#> 7 9 01'28.4097" 01'29.2714" oh, my god. Breaking B… 1 1
#> 8 9 01'29.2724" 01'30.8400" think, think, think. Breaking B… 1 1
Note that unlike the data.frame method, the
input and output arguments are optional. This
is because here the Text_content column can be assumed to
be the column of interest.
Once your data are ready, you can analyse them. I recommend having a look at Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. This is a great place to get started with text mining in R.
A list of cool projects using subtools.
Note that these projects used the 0.x branch of subtools.
Its API is totally different from that of subtools 1.0.
You beautiful, naïve, sophisticated newborn series by ma_salmon