n-gram

Short description: Item sequences in computational linguistics

Six n-grams frequently found in titles of publications about Coronavirus disease 2019 (COVID-19), as of 7 May 2020

An n-gram is a sequence of n adjacent symbols in particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus. If Latin numerical prefixes are used, then n-gram of size 1 is called a "unigram", size 2 a "bigram" (or, less commonly, a "digram") etc. If, instead of the Latin ones, the English cardinal numbers are furtherly used, then they are called "four-gram", "five-gram", etc. Similarly, using Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc. are used in computational biology, for polymers or oligomers of a known size, called k-mers. When the items are words, $n$ -grams may also be called shingles.^[1]

In the context of NLP, the use of n-grams allows bag-of-words models to capture information such as word order, which would not be possible in the traditional bag of words setting.

Examples

(Shannon 1951)^[2] discussed n-gram models of English. For example:

3-gram character model: in no ist lat whey cratict froure birs grocid pondenome of demonstures of the retagin is regiactiona of cre
2-gram word model: the head and in frontal attack on an english writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected

Figure 1 n-gram examples from various disciplines
Field	Unit	Sample sequence	1-gram sequence	2-gram sequence	3-gram sequence
Vernacular name			unigram	bigram	trigram
Order of resulting Markov model			0	1	2
Protein sequencing	amino acid	... Cys-Gly-Leu-Ser-Trp ...	..., Cys, Gly, Leu, Ser, Trp, ...	..., Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ...	..., Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ...
DNA sequencing	base pair	...AGCTTCGA...	..., A, G, C, T, T, C, G, A, ...	..., AG, GC, CT, TT, TC, CG, GA, ...	..., AGC, GCT, CTT, TTC, TCG, CGA, ...
Language model	character	...to_be_or_not_to_be...	..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ...	..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ...	..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ...
Word n-gram language model	word	... to be or not to be ...	..., to, be, or, not, to, be, ...	..., to be, be or, or not, not to, to be, ...	..., to be or, be or not, or not to, not to be, ...

Figure 1 shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.

Here are further examples; these are word-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google n-gram corpus.^[3]

3-grams

ceramics collectables collectibles (55)
ceramics collectables fine (130)
ceramics collected by (52)
ceramics collectible pottery (50)
ceramics collectibles cooking (45)

4-grams

serve as the incoming (92)
serve as the incubator (99)
serve as the independent (794)
serve as the index (223)
serve as the indication (72)
serve as the indicator (120)

References

↑ Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey (1997). "Syntactic clustering of the web". Computer Networks and ISDN Systems 29 (8): 1157–1166. doi:10.1016/s0169-7552(97)00031-7.
↑ Shannon, Claude E. "The redundancy of English." Cybernetics; Transactions of the 7th Conference, New York: Josiah Macy, Jr. Foundation. 1951.
↑ Alex Franz and Thorsten Brants (2006). "All Our N-gram are Belong to You". Google Research Blog. http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html.

External links

0.00

(0 votes)

Original source: https://en.wikipedia.org/wiki/N-gram. Read more

[1] Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey (1997). "Syntactic clustering of the web". Computer Networks and ISDN Systems 29 (8): 1157–1166. doi:10.1016/s0169-7552(97)00031-7.

[2] Shannon, Claude E. "The redundancy of English." Cybernetics; Transactions of the 7th Conference, New York: Josiah Macy, Jr. Foundation. 1951.

[3] Alex Franz and Thorsten Brants (2006). "All Our N-gram are Belong to You". Google Research Blog. http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html.

[1]

[2]

[3]

v t e Natural language processing
General terms	Natural language understanding Text corpus Speech corpus Stopwords Bag-of-words AI-complete n-gram (Bigram, Trigram)
Text analysis	Text segmentation Part-of-speech tagging Text chunking Compound term processing Collocation extraction Stemming Lemmatisation Named-entity recognition Coreference resolution Sentiment analysis Concept mining Parsing Word-sense disambiguation Ontology learning Terminology extraction Textual entailment Truecasing
Automatic summarization	Multi-document summarization Sentence extraction Text simplification
Machine translation	Computer-assisted Example-based Rule-based Neural
Automatic identification and data capture	Speech recognition Speech synthesis Optical character recognition Natural language generation
Topic model	Pachinko allocation Latent Dirichlet allocation Latent semantic analysis
Computer-assisted reviewing	Automated essay scoring Concordancer Grammar checker Predictive text Spell checker Syntax guessing
Natural language user interface	Automated online assistant Chatbot Interactive fiction Question answering Voice user interface

Anonymous

Search

n-gram

Namespaces

More

Page actions

Contents

Examples

References

Further reading

See also

External links

Navigation

Navigation

Help

Translate

Wiki tools

Wiki tools

Anonymous

Search

n-gram

Examples

References

Further reading

See also

External links

Navigation

Wiki tools

Page tools

Other projects

Categories