Bag-of-words

The phrase bag-of-words typically refers to a way of representing text in natural language processing, although it has been applied to computer vision.

A bag-of-words representation contains how many times each word appears in a document, but disregards the order of the words.

Often, bag-of-words models will only include the $k$ most frequent words in a corpus. This reduces the memory needed to store relatively-infrequent words, and the underlying representation of the document is mostly the same because common words dominate the document.

Bag-of-words models are often highly effective at representing documents in tasks like classification, clustering, or topic modeling. But they can struggle with tasks where word order matters, like sentiment analysis and machine translation. For example, in a bag-of-words model, the phrase “dog bites man” and “man bites dog” have identical representations.

Search Results