Package: sentencepiece
Type: Package
Title: Text Tokenization using Byte Pair Encoding and Unigram Modelling
Version: 0.2.3
Authors@R: c(
    person('Jan', 'Wijffels', role = c('aut', 'cre', 'cph'), email = 'jwijffels@bnosac.be', comment = "R wrapper"), 
    person('BNOSAC', role = 'cph', comment = "R wrapper"), 
    person('Google Inc.', role = c('ctb', 'cph'), comment = "Files at src/sentencepiece/src (Apache License, Version 2.0"),
    person('The Abseil Authors', role = c('ctb', 'cph'), comment = "Files at src/third_party/absl (Apache License, Version 2.0"),
    person('Google Inc.', role = c('ctb', 'cph'), comment = "Files at src/third_party/protobuf-lite (BSD-3 License)"),
    person('Kenton Varda (Google Inc.)', role = c('ctb', 'cph'), comment = "Files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h, google/protobuf/stubs/common.h, google/protobuf/stubs/hash.h, google/protobuf/stubs/once.h, google/protobuf/stubs/once.h.org (BSD-3 License)"),
    person('Sanjay Ghemawat (Google Inc.)', role = c('ctb', 'cph'), comment = "Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License)"),
    person('Jeff Dean (Google Inc.)', role = c('ctb', 'cph'), comment = "Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License)"),
    person('Laszlo Csomor (Google Inc.)', role = c('ctb', 'cph'), comment = "Files at src/third_party/protobuf-lite: io_win32.cc, google/protobuf/stubs/io_win32.h  (BSD-3 License)"),
    person('Wink Saville (Google Inc.)', role = c('ctb', 'cph'), comment = "Files at src/third_party/protobuf-lite: message_lite.cc, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h (BSD-3 License)"),
    person('Jim Meehan (Google Inc.)', role = c('ctb', 'cph'), comment = "Files at src/third_party/protobuf-lite: structurally_valid.cc (BSD-3 License)"),
    person('Chris Atenasio (Google Inc.)', role = c('ctb', 'cph'), comment = "Files at src/third_party/protobuf-lite: google/protobuf/wire_format_lite.h (BSD-3 License)"),
    person('Jason Hsueh (Google Inc.)', role = c('ctb', 'cph'), comment = "Files at src/third_party/protobuf-lite: google/protobuf/io/coded_stream_inl.h (BSD-3 License)"),
    person('Anton Carver (Google Inc.)', role = c('ctb', 'cph'), comment = "Files at src/third_party/protobuf-lite: google/protobuf/stubs/map_util.h (BSD-3 License)"),
    person('Maxim Lifantsev (Google Inc.)', role = c('ctb', 'cph'), comment = "Files at src/third_party/protobuf-lite: google/protobuf/stubs/mathlimits.h (BSD-3 License)"),
    person('Susumu Yata', role = c('ctb', 'cph'), comment = "Files at src/third_party/darts_clone (BSD-3 License"),
    person('Daisuke Okanohara', role = c('ctb', 'cph'), comment = "File src/third_party/esaxx/esa.hxx (MIT License)"),
    person('Yuta Mori', role = c('ctb', 'cph'), comment = "File src/third_party/esaxx/sais.hxx (MIT License)"),
    person('Benjamin Heinzerling', role = c('ctb', 'cph'), comment = "Files data/models/nl.wiki.bpe.vs1000.d25.w2v.txt, data/models/nl.wiki.bpe.vs1000.d25.w2v.bin and data/models/nl.wiki.bpe.vs1000.model (MIT License)"))
Maintainer: Jan Wijffels <jwijffels@bnosac.be>
Description: Unsupervised text tokenizer allowing to perform byte pair encoding and unigram modelling. 
    Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece> which provides a language independent tokenizer to split text in words and smaller subword units. 
    The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>.
    Provides as well straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', 
    as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
URL: https://github.com/bnosac/sentencepiece
License: MPL-2.0
Encoding: UTF-8
RoxygenNote: 7.1.2
Depends: R (>= 2.10)
Imports: Rcpp (>= 0.11.5), stats
Suggests: tokenizers.bpe, word2vec (>= 0.2.0)
LinkingTo: Rcpp
NeedsCompilation: yes
Packaged: 2022-11-11 10:45:36 UTC; jwijf
Author: Jan Wijffels [aut, cre, cph] (R wrapper),
  BNOSAC [cph] (R wrapper),
  Google Inc. [ctb, cph] (Files at src/sentencepiece/src (Apache License,
    Version 2.0),
  The Abseil Authors [ctb, cph] (Files at src/third_party/absl (Apache
    License, Version 2.0),
  Google Inc. [ctb, cph] (Files at src/third_party/protobuf-lite (BSD-3
    License)),
  Kenton Varda (Google Inc.) [ctb, cph] (Files at
    src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc,
    generated_message_util.cc, generated_message_util.cc,
    message_lite.cc, repeated_field.cc, wire_format_lite.cc,
    zero_copy_stream.cc, zero_copy_stream_impl_lite.cc,
    google/protobuf/extension_set.h,
    google/protobuf/generated_message_util.h,
    google/protobuf/wire_format_lite.h,
    google/protobuf/wire_format_lite_inl.h,
    google/protobuf/message_lite.h, google/protobuf/repeated_field.h,
    google/protobuf/io/coded_stream.h,
    google/protobuf/io/zero_copy_stream_impl_lite.h,
    google/protobuf/io/zero_copy_stream.h,
    google/protobuf/stubs/common.h, google/protobuf/stubs/hash.h,
    google/protobuf/stubs/once.h, google/protobuf/stubs/once.h.org
    (BSD-3 License)),
  Sanjay Ghemawat (Google Inc.) [ctb, cph] (Design of files at
    src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc,
    generated_message_util.cc, generated_message_util.cc,
    message_lite.cc, repeated_field.cc, wire_format_lite.cc,
    zero_copy_stream.cc, zero_copy_stream_impl_lite.cc,
    google/protobuf/extension_set.h,
    google/protobuf/generated_message_util.h,
    google/protobuf/wire_format_lite.h,
    google/protobuf/wire_format_lite_inl.h,
    google/protobuf/message_lite.h, google/protobuf/repeated_field.h,
    google/protobuf/io/coded_stream.h,
    google/protobuf/io/zero_copy_stream_impl_lite.h,
    google/protobuf/io/zero_copy_stream.h (BSD-3 License)),
  Jeff Dean (Google Inc.) [ctb, cph] (Design of files at
    src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc,
    generated_message_util.cc, generated_message_util.cc,
    message_lite.cc, repeated_field.cc, wire_format_lite.cc,
    zero_copy_stream.cc, zero_copy_stream_impl_lite.cc,
    google/protobuf/extension_set.h,
    google/protobuf/generated_message_util.h,
    google/protobuf/wire_format_lite.h,
    google/protobuf/wire_format_lite_inl.h,
    google/protobuf/message_lite.h, google/protobuf/repeated_field.h,
    google/protobuf/io/coded_stream.h,
    google/protobuf/io/zero_copy_stream_impl_lite.h,
    google/protobuf/io/zero_copy_stream.h (BSD-3 License)),
  Laszlo Csomor (Google Inc.) [ctb, cph] (Files at
    src/third_party/protobuf-lite: io_win32.cc,
    google/protobuf/stubs/io_win32.h (BSD-3 License)),
  Wink Saville (Google Inc.) [ctb, cph] (Files at
    src/third_party/protobuf-lite: message_lite.cc,
    google/protobuf/wire_format_lite.h,
    google/protobuf/wire_format_lite_inl.h,
    google/protobuf/message_lite.h (BSD-3 License)),
  Jim Meehan (Google Inc.) [ctb, cph] (Files at
    src/third_party/protobuf-lite: structurally_valid.cc (BSD-3
    License)),
  Chris Atenasio (Google Inc.) [ctb, cph] (Files at
    src/third_party/protobuf-lite: google/protobuf/wire_format_lite.h
    (BSD-3 License)),
  Jason Hsueh (Google Inc.) [ctb, cph] (Files at
    src/third_party/protobuf-lite:
    google/protobuf/io/coded_stream_inl.h (BSD-3 License)),
  Anton Carver (Google Inc.) [ctb, cph] (Files at
    src/third_party/protobuf-lite: google/protobuf/stubs/map_util.h
    (BSD-3 License)),
  Maxim Lifantsev (Google Inc.) [ctb, cph] (Files at
    src/third_party/protobuf-lite: google/protobuf/stubs/mathlimits.h
    (BSD-3 License)),
  Susumu Yata [ctb, cph] (Files at src/third_party/darts_clone (BSD-3
    License),
  Daisuke Okanohara [ctb, cph] (File src/third_party/esaxx/esa.hxx (MIT
    License)),
  Yuta Mori [ctb, cph] (File src/third_party/esaxx/sais.hxx (MIT
    License)),
  Benjamin Heinzerling [ctb, cph] (Files
    data/models/nl.wiki.bpe.vs1000.d25.w2v.txt,
    data/models/nl.wiki.bpe.vs1000.d25.w2v.bin and
    data/models/nl.wiki.bpe.vs1000.model (MIT License))
Repository: CRAN
Date/Publication: 2022-11-13 09:30:06 UTC
Built: R 4.4.3; x86_64-w64-mingw32; 2025-10-21 12:24:39 UTC; windows
Archs: x64
