Bag-of-n-grams

A bag-of-n-grams model is a way to represent a document, similar to a [bag-of-words][/terms/bag-of-words/] model.

A bag-of-n-grams model represents a text document as an unordered collection of its n-grams.

For example, let’s use the following phrase and divide it into bi-grams (n=2).

James is the best person ever.

becomes

  • <start>James
  • James is
  • is the
  • the best
  • best person
  • person ever.
  • ever.<end>

In a typical bag-of-n-grams model, these 6 bigrams would be a sample from a large number of bigrams observed in a corpus. And then James is the best person ever. would be encoded in a representation showing which of the corpus’s bigrams were observed in the sentence.

A bag-of-n-grams model has the simplicity of the bag-of-words model, but allows the preservation of more word locality information.