N-grams. Improving internal optimization with the help of competitors

Using N-grams

General use of N-grams

  • clustering series of satellite images of the Earth in order to decide which specific parts of the Earth a given image shows,
  • searching for genetic sequences,
  • in genetics, determining which animal species a DNA sample was collected from,
  • in data compression,
  • indexing audio data.

N-grams are also widely used in natural language processing.

Using N-grams for natural language processing needs

In the field of natural language processing, N-grams are mainly used for prediction based on probabilistic models. An N-gram model calculates the probability of the last word of an N-gram given all the previous ones. When this approach is used to model a language, it is assumed that the appearance of each word depends only on the preceding words.

Another application of N-grams is plagiarism detection. If a text is split into several small fragments represented by n-grams, the fragments can easily be compared with each other, which gives a measure of similarity between the documents under review. N-grams are also used successfully for text and language categorization. In addition, they can be used to derive features for extracting knowledge from text data, and to efficiently find candidate corrections for misspelled words.
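As a simple illustration of this comparison idea, here is a minimal Python sketch (my own, not from any particular library) that measures the similarity of two texts by the overlap of their word-level trigram sets:

```python
def ngrams(tokens, n=3):
    """Return the set of n-grams (as tuples) of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_similarity(text_a, text_b, n=3):
    """Share of n-grams the two texts have in common (0.0 .. 1.0)."""
    a = ngrams(text_a.lower().split(), n)
    b = ngrams(text_b.lower().split(), n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard_similarity("the cat sat on the mat", "the cat sat on a mat"))
```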

Google Research Projects

Google research centers have used N-gram models for a wide range of research and development. These include projects such as statistical translation from one language to another, speech recognition, spelling correction, information extraction, and more. For the purposes of these projects, text corpora containing several trillion words were used.

Google decided to create its own training corpus. The project is called Google teracorpus, and it contains 1,024,908,267,229 words collected from public websites.

Methods for extracting n-grams

Because N-grams are used so often to solve various problems, a reliable and fast algorithm is needed to extract them from text. A suitable n-gram extraction tool should be able to handle texts of unlimited size, work quickly, and make efficient use of available resources. There are several methods for extracting N-grams from text, based on different principles:


Semantic core

To successfully grow a site and increase its visibility today, you need to constantly expand its semantic core. One of the best ways to do this is to collect competitors' keywords.

Getting competitors' semantics today is not difficult, since there are many services for this, both paid and free.

Free services:

— megaindex.ru — the Site Visibility tool

— xtool.ru — a well-known service that also shows the keywords a site ranks for

Paid services:

— spywords.ru — works for both Yandex and Google

— semrush.ru — focused only on Google

— prodvigator.ua — a Ukrainian analogue of spywords.ru

In addition to these services, you can also use a manual method based on splitting titles and descriptions into n-grams, which yields an additional list of phrases.

An N-gram is a sequence of n elements. In practice, an N-gram most often occurs as a series of words. A sequence of two consecutive elements is called a bigram, a sequence of three elements a trigram. Sequences of four or more elements are simply called N-grams, with N replaced by the number of consecutive elements.

Consider this technique step by step:

— Download competitors' titles (descriptions). This can be done with Screaming Frog SEO Spider.

— In a text editor, clean the resulting list of function words, punctuation marks and other garbage. I use the search-and-replace function in the Sublime Text editor (hot key Ctrl+H) with regular expressions:

— Select the desired n-gram size and set the frequency to at least one. The best options are trigrams and 4-grams:

- We get the following result:

The count column shows the number of repetitions of the n-gram, and the frequency column shows the frequency of the n-gram.
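For reference, a minimal Python sketch of this counting step, assuming the collected titles have already been saved to a text file (the file name and stop-word list below are placeholders):

```python
import re
from collections import Counter

STOP_WORDS = {"and", "or", "the", "a", "in", "for"}  # placeholder stop-word list

def extract_ngrams(text, n=3):
    """Split a title into lowercase words and return its n-grams as phrases."""
    words = [w for w in re.findall(r"[a-zа-яё0-9]+", text.lower()) if w not in STOP_WORDS]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

counter = Counter()
with open("competitor_titles.txt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        counter.update(extract_ngrams(line, n=3))

# phrases that occur at least twice, most frequent first
for phrase, count in counter.most_common():
    if count >= 2:
        print(count, phrase)
```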

After we have received a list of phrases, we need to analyze it and select the appropriate keywords to expand the semantic core. More details can be found in the relevant section of our blog.

Grouping queries

It is very important to understand how the semantic core of competitors is grouped, because this helps to correctly distribute key phrases on the pages of the site.

To do this, after generating a complete list of queries, you need to obtain the competitors' relevant pages and positions (you can use the seolib.ru service) and then compare them with your own grouping. If you see that a competitor has good positions while its grouping differs from yours (for example, the competitor's queries are distributed across different pages, while yours are all on one page), you should pay attention to this and revise the landing pages on your site.

Let's consider a small example of comparing the grouping of a conditional site and its competitor.

As you can see from the table, site.ru has one landing page for all keywords. The competitor ranks different pages for the same queries and occupies positions in or near the TOP. From this we can conclude that the grouping on site.ru needs to be revised; in particular, a separate page should be created for key phrases containing the word "facade".

Text quality

The first and most important thing to pay attention to when analyzing competitors' texts is not the quantitative component (number of occurrences, text length, etc.) but the qualitative, semantic one: how useful the information is, what the competitor offers and how they present it.

Let's look at a few examples.

Let's say you're delivering flowers and on the main page in the text you guarantee their freshness. For example, like this:

The flower delivery service site.en guarantees the safety of bouquets even in the cold season.

Here is an example from one of the competitors:

It is profitable to order fragrant compositions from us, because we guarantee a 100% money back guarantee if the freshness of the flowers is in doubt.

A competitor's guarantee is backed by money, which is more significant than an abstract guarantee.

Consider another example - the text on the page of the "ceramic tiles" category of an online store:

This text carries no useful meaning; it is pure filler. Most likely, a person who has come to the site to make a purchase decision wants to know the advantages of the product and the available configurations, but instead gets a meaningless set of characters.

Now let's look at the text of a competitor:

This text is more useful because it succinctly communicates the differences between the tiles and helps you understand how to choose the right one.

Thus, by comparing competitors' texts with your own, you can obtain a lot of useful information that will help copywriters when drawing up the terms of reference (TOR).

Relevance of texts

Continuing the theme of the quality of texts, one cannot help touching on their relevance. Today, in order for the text to be relevant, it is not enough just to include keywords. To increase the relevance of the page and at the same time not make the text spammy, you need to use words related to the topic.

When assessing how relevant a text is to a query, the search engine analyzes not only the presence of keywords but also additional words, and in this way determines the meaning of the text. For example, if we write a text about an elephant, related words might be: "trunk", "tusks", "nature", "zoo". If the text is about the chess piece (in Russian, the bishop is called an "elephant"), these words would be: "piece", "check", "queen", etc.

You can get the most suitable list of words for your needs in the texts of competitors. To do this, you need to do the following steps:

— Copy all the texts from the TOP 10 results for the desired high-frequency query into separate text files.

— Remove function words, punctuation marks and numbers from the texts (as described earlier).

— Put each word on its own line using the search-and-replace function with regular expressions: replace a space with \n.

— Next, bring all word forms to their normal dictionary form (lemma). For this you can use the service https://tools.k50project.ru/lemma/. Enter the list of words from each file separately and click the "lemmatize and output as a csv table" button. The result should be 10 files of lemmatized words.

- In each file, we remove duplicate words.

- Combine words from files into one list.

— Now we need to create a frequency dictionary. To do this, add the resulting list to the service https://tools.k50project.ru/lemma/ and click "build a frequency dictionary in the form of CSV".

- Our list of words is ready:

If the frequency is 10, then this word was used on all 10 sites, if 8, then only on 8, etc. We recommend using the most frequent words, however, interesting solutions can be found among rare words.
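The deduplication and frequency-dictionary steps can also be done locally; a minimal Python sketch, assuming the lemmatized word lists were saved as ten text files with hypothetical names:

```python
from collections import Counter
from pathlib import Path

# Hypothetical file names: one file of lemmatized words per TOP-10 page
files = [Path(f"competitor_{i}.txt") for i in range(1, 11)]

doc_frequency = Counter()
for path in files:
    lemmas = set(path.read_text(encoding="utf-8").split())  # deduplicate within a file
    doc_frequency.update(lemmas)

# A frequency of 10 means the lemma occurs on all 10 sites
for lemma, freq in doc_frequency.most_common(50):
    print(freq, lemma)
```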

In such a simple way, you can get a list of thematic words for compiling TOR for copywriters.

As you can see, competitors are a very important source of information that can help you optimize your sites better. In this article, I have covered far from all aspects, and in the future I will continue to write about what is useful and how you can learn from your competitors.

Fuzzy Search Algorithms Without Indexing (Online)

These algorithms are designed to search in previously unknown text and can be used, for example, in text editors, document viewers or web browsers to search within a page. They do not require any pre-processing of the text and can work with a continuous stream of data.

Linear search

A simple sequential application of a given metric (for example, the Levenshtein metric) to the words of the input text. When a metric with a cutoff is used, this method achieves optimal speed, but the larger k is, the longer the running time grows. Asymptotic time estimate: O(kn).
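A minimal Python sketch of such a linear scan with a Levenshtein distance cut off at k (my own illustration, not the article's code):

```python
def levenshtein_limited(a, b, k):
    """Levenshtein distance between a and b, or k + 1 if it exceeds k."""
    if abs(len(a) - len(b)) > k:
        return k + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        if min(cur) > k:                                 # no cell can recover below k
            return k + 1
        prev = cur
    return prev[-1]

def fuzzy_scan(words, query, k=2):
    """Linear search: return the words within distance k of the query."""
    return [w for w in words if levenshtein_limited(w, query, k) <= k]

print(fuzzy_scan(["vodka", "votka", "water", "vodkas"], "vodka", 1))
```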

Bitap (also known as Shift-Or or Baeza-Yates-Gonnet, and its Wu-Manber modification)

The bitap algorithm and its various modifications are most often used for fuzzy search without indexing. A variation of it is used, for example, in the Unix utility agrep, which performs functions similar to the standard grep but supports errors in the search query and even provides limited support for regular expressions.

The idea of this algorithm was first proposed by Ricardo Baeza-Yates and Gaston Gonnet, who published an article about it in 1992.
The original version of the algorithm handles only character substitutions and in fact computes the Hamming distance. A little later, Sun Wu and Udi Manber proposed a modification of the algorithm for computing the Levenshtein distance, i.e. with support for insertions and deletions, and developed the first version of the agrep utility based on it.






The result vector R is recomputed for each character of the text from its previous value using bit shifts and bitwise AND/OR operations with the character masks, where k is the number of errors, j is the index of the current character, and s_x is the character mask (in the mask, set bits correspond to the positions of the given character in the query).
A match or mismatch with the query is determined by the very last bit of the resulting vector R.

The high speed of this algorithm is ensured by the bit parallelism of the calculations: in one operation, computations can be performed on 32 or more bits simultaneously.
At the same time, a trivial implementation supports searching for words no longer than 32 characters. This limitation is determined by the width of the standard int type (on 32-bit architectures). Wider types can be used as well, but this may slow the algorithm down somewhat.

Although the asymptotic running time of this algorithm, O(kn), is the same as that of the linear method, it is much faster on long queries and when the number of errors k is greater than 2.
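For illustration, a sketch in Python of the Wu-Manber bit-parallel scheme described above (my own implementation of the standard formulation, not the article's original code); it reports the positions in the text where the pattern matches with at most k errors:

```python
def bitap_fuzzy(text, pattern, k):
    """Wu-Manber bitap: yield end positions in text where pattern matches
    with at most k errors (Levenshtein distance)."""
    m = len(pattern)
    if not 0 < m <= 63:
        raise ValueError("this sketch handles patterns of 1..63 characters")
    masks = {}                                   # bit i of masks[c] set iff pattern[i] == c
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    all_bits = (1 << m) - 1

    R = [(1 << d) - 1 for d in range(k + 1)]     # R[d]: prefixes matched with <= d errors
    for j, c in enumerate(text):
        cmask = masks.get(c, 0)
        old_prev = R[0]
        R[0] = ((R[0] << 1) | 1) & cmask
        for d in range(1, k + 1):
            old_cur = R[d]
            R[d] = ((((old_cur << 1) & cmask)    # match
                     | old_prev                  # insertion
                     | (old_prev << 1)           # substitution
                     | (R[d - 1] << 1))          # deletion (uses the already updated R[d-1])
                    | 1) & all_bits
            old_prev = old_cur
        if R[k] & accept:
            yield j                              # an approximate match ends at this position

print(list(bitap_fuzzy("some text with vodka and votka", "vodka", 1)))
```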

Testing

Testing was carried out on a text of 3.2 million words, the average word length was 10.
Exact search
Search time: 3562 ms
Search using the Levenshtein metric
Search time at k=2: 5728 ms
Search time at k=5: 8385 ms
Search using the Bitap algorithm with Wu-Manber modifications
Search time at k=2: 5499 ms
Search time at k=5: 5928 ms

Obviously, a simple search using the metric, unlike the Bitap algorithm, depends strongly on the number of errors k.

However, when it comes to searching large amounts of unchanged text, the search time can be significantly reduced by preprocessing such text, also called indexing.

Fuzzy Search Algorithms with Indexing (Offline)

A feature of all fuzzy search algorithms with indexing is that the index is built according to a dictionary compiled from the source text or a list of records in a database.

These algorithms use different approaches to solve the problem - some of them use reduction to exact search, others use the properties of the metric to build various spatial structures, and so on.

In the first step, a dictionary is built from the source text, containing the words and their positions in the text. Word and phrase frequencies can also be counted to improve the quality of the search results.

It is assumed that the index, like the dictionary, is fully loaded into memory.
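A minimal sketch of this first step (my own illustration; a real implementation would also normalize the alphabet as described below):

```python
from collections import defaultdict
import re

def build_dictionary(text):
    """Map every word of the source text to the list of its positions."""
    positions = defaultdict(list)
    for match in re.finditer(r"[a-zа-яё]+", text.lower()):
        positions[match.group()].append(match.start())
    return positions

dictionary = build_dictionary("to be or not to be")
print(dictionary["be"])   # positions of the word "be" in the text
```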

Characteristics of the dictionary:

  • Source text - 8.2 gigabytes of materials from the Moshkov library (lib.ru), 680 million words;
  • Dictionary size - 65 megabytes;
  • Number of words - 3.2 million;
  • The average word length is 9.5 characters;
  • Root mean square word length (may be useful when evaluating some algorithms) - 10.0 characters;
  • Alphabet: uppercase Russian letters А-Я, without Ё (to simplify some operations). Words containing non-alphabetic characters are not included in the dictionary.
The dependence of the dictionary size on the volume of text is not strictly linear: up to a certain volume a core vocabulary is formed, ranging from 15% at 500 thousand words to 5% at 5 million, after which the dependence approaches linear, slowly decreasing to 0.5% at 680 million words. Subsequent growth is mostly due to rare words.

Sample Expansion Algorithm

This algorithm is often used in spell checking systems (i.e. spell-checkers), where the size of the dictionary is small, or where speed is not the main criterion.
It is based on reducing the fuzzy search problem to the exact search problem.

From the original query, a set of "erroneous" words is built, for each of which an exact search is then performed in the dictionary.

Its running time depends strongly on the number of errors k and on the size of the alphabet |A|; when binary search over the dictionary is used, it is roughly O((m·|A|)^k · log n), where m is the word length.

For example, for k = 1 and a word of length 7 (for example, "Crocodile") over the Russian alphabet, the set of erroneous words will contain about 450 entries, so 450 dictionary queries are needed, which is quite acceptable.
But already at k = 2 the size of such a set exceeds 115 thousand variants, which corresponds to a full scan of a small dictionary (or 1/27 of ours), so the running time becomes quite long. And one should not forget that for each of these words an exact-match lookup in the dictionary is required.

Peculiarities:
The algorithm can easily be modified to generate "erroneous" variants according to arbitrary rules and, moreover, requires neither any preliminary processing of the dictionary nor, accordingly, any additional memory.
Possible improvements:
It is possible to generate not the entire set of "erroneous" words, but only those of them that are most likely to occur in a real situation, for example, words taking into account common spelling or typing errors.
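A minimal Python sketch of the variant-generation step for k = 1 over the Russian alphabet (my own illustration in the spirit of the well-known "edits1" approach; the function names are not from the article):

```python
ALPHABET = "абвгдежзийклмнопрстуфхцчшщъыьэюя"

def edits1(word):
    """All 'erroneous' variants at Levenshtein distance 1 from the word
    (deletions, insertions, substitutions)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    return set(deletes + inserts + replaces)

def lookup(word, dictionary):
    """Exact dictionary lookups for the word itself and all its distance-1 variants."""
    return ({word} | edits1(word)) & dictionary

print(len(edits1("крокодил")))   # several hundred variants for a word of this length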

N-gram Method

This method was invented quite a while ago and is the most widely used, since its implementation is relatively simple and it provides fairly good performance. The algorithm is based on the principle:
"If word A matches word B with several errors taken into account, then with a high degree of probability they will have at least one common substring of length N."
These substrings of length N are called N-grams.
During indexing, a word is split into such N-grams, and the word is then added to the lists for each of these N-grams. During the search, the query is likewise split into N-grams, and for each of them the list of words containing that substring is scanned sequentially.

The most commonly used in practice are trigrams - substrings of length 3. Choosing a larger value of N leads to a restriction on the minimum word length, at which error detection is already possible.
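A minimal Python sketch of such a trigram index (my own illustration):

```python
from collections import defaultdict

def trigrams(word):
    """Substrings of length 3 of the word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(words):
    """Inverted index: trigram -> set of dictionary words containing it."""
    index = defaultdict(set)
    for word in words:
        for gram in trigrams(word):
            index[gram].add(word)
    return index

def candidates(query, index):
    """Union of the word lists of all trigrams of the query;
    each candidate is then verified with the chosen metric."""
    result = set()
    for gram in trigrams(query):
        result |= index.get(gram, set())
    return result

index = build_index(["vodka", "voda", "lodka", "pogoda"])
print(candidates("pagoda", index))
```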

Peculiarities:
The N-gram algorithm does not find all possible misspelled words. Take, for example, the word VOTKA (a misspelling of VODKA) and decompose it into trigrams VOT, OTK, TKA: all of them contain the error T. Thus the word VODKA will not be found, since it does not contain any of these trigrams and therefore does not appear in the corresponding lists. The shorter the word and the more errors it contains, the higher the chance that it will not appear in the lists corresponding to the N-grams of the query and will be missing from the result.

Meanwhile, the N-gram method leaves full scope for using your own metrics with arbitrary properties and complexity, but you have to pay for this - when using it, you still need to sequentially enumerate about 15% of the dictionary, which is quite a lot for large dictionaries.

Possible improvements:
It is possible to split the hash tables of N-grams by word length and by the position of the N-gram in the word (modification 1). Since the length of the word found and of the query cannot differ by more than k, and the positions of an N-gram in a word can differ by no more than k, it is enough to check only the table corresponding to the position of this N-gram in the word, as well as the k tables to the left and the k tables to the right, i.e. 2k+1 adjacent tables in total.

You can further reduce the size of the set that needs to be scanned by splitting the tables by word length and, similarly, looking through only the 2k+1 neighboring tables (modification 2).

Signature Hashing

This algorithm is described in L. M. Boytsov's article "Signature hashing". It is based on a fairly obvious representation of the "structure" of a word as a set of bits, used as a hash (signature) in a hash table.

During indexing, such hashes are calculated for each of the words, and the correspondence of the list of dictionary words to this hash is entered into the table. Then, during the search, a hash is calculated for the query and all neighboring hashes that differ from the original one by no more than k bits are sorted out. For each of these hashes, the list of corresponding words is searched.

The hash is calculated as follows: each bit of the hash is assigned a group of characters from the alphabet. A 1 bit at position i of the hash means that the original word contains a character from the i-th alphabet group. The order of the letters in the word is completely irrelevant.

Removing one character either does not change the hash value (if the word still contains characters from the same alphabet group) or sets the bit corresponding to that group to 0. Similarly, on insertion either one bit is set to 1 or nothing changes. On character replacement things are slightly more complicated: the hash may remain unchanged, or it may change in 1 or 2 positions. On transpositions nothing changes at all, because the order of characters is not taken into account when building the hash, as noted earlier. Thus, to completely cover k errors, at least 2k bits of the hash have to be varied.
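A minimal sketch of this hash calculation, assuming a purely hypothetical grouping of the alphabet (the real grouping and hash width are chosen per implementation):

```python
# Hypothetical grouping of the Russian alphabet into 8 bit groups
GROUPS = ["аоуыэ", "еёиюя", "бвпф", "гкхъ", "дтжш", "злсц", "мнрщ", "чйь"]

def signature(word):
    """Bit i of the hash is set if the word contains a character of group i;
    the order of the letters does not matter."""
    h = 0
    for i, group in enumerate(GROUPS):
        if any(c in group for c in word):
            h |= 1 << i
    return h

def neighbour_hashes(h):
    """All hashes differing from h in at most 2 bits
    (for k = 1 errors, up to 2k = 2 bits of the signature may change)."""
    bits = len(GROUPS)
    yield h
    for i in range(bits):
        yield h ^ (1 << i)
        for j in range(i + 1, bits):
            yield h ^ (1 << i) ^ (1 << j)

# words built from the same letter groups get the same signature
print(bin(signature("вода")), bin(signature("вот")))
```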

The running time, on average, with k "incomplete" (insertions, deletions and transpositions, as well as a small part of replacements) errors:

Peculiarities:
Because replacing one character can change two bits at once, an algorithm that, for example, enumerates distortions of no more than 2 bits at a time will not actually return the full set of results: it will miss a significant share of words with two replacements (the share depends on the ratio of hash size to alphabet size; the larger the hash, the more often a character replacement changes two bits at once and the less complete the result). In addition, this algorithm does not allow prefix search.

BK trees

Burkhard-Keller trees are metric trees; algorithms for constructing such trees are based on the property that the metric satisfies the triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

This property allows metrics to form metric spaces of arbitrary dimension. Such metric spaces are not necessarily Euclidean; for example, the Levenshtein and Damerau-Levenshtein metrics form non-Euclidean spaces. Based on these properties, one can build a data structure for searching in such a metric space: the Burkhard-Keller tree.
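A minimal Python sketch of a BK-tree built over the Levenshtein metric (my own illustration of the idea, not the article's code):

```python
def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree over an arbitrary metric (Levenshtein here)."""
    def __init__(self, metric=levenshtein):
        self.metric = metric
        self.root = None                      # node = [word, {distance: child_node}]

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = self.metric(word, node[0])
            if d in node[1]:
                node = node[1][d]             # descend into the child at that distance
            else:
                node[1][d] = [word, {}]
                return

    def search(self, query, k):
        """All words within distance k of the query."""
        result, stack = [], [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.metric(query, word)
            if d <= k:
                result.append(word)
            # triangle inequality: only children at distance in [d - k, d + k] can match
            stack.extend(child for dist, child in children.items() if d - k <= dist <= d + k)
        return result

tree = BKTree()
for w in ["vodka", "voda", "lodka", "pogoda"]:
    tree.add(w)
print(tree.search("votka", 1))
```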

Improvements:
You can use the ability of some metrics to compute the distance with a cutoff, setting the upper limit equal to the sum of the maximum distance to the vertex's descendants and the distance already obtained, which slightly speeds up the process.

Testing

Testing was carried out on a laptop with Intel Core Duo T2500 (2GHz/667MHz FSB/2MB), 2Gb RAM, OS - Ubuntu 10.10 Desktop i686, JRE - OpenJDK 6 Update 20.

Testing was carried out using the Damerau-Levenshtein distance and the number of errors k = 2. The index size is specified together with the dictionary (65 MB).

Sample expansion algorithm
Index size: 65 MB
Search time: 320ms / 330ms
Completeness of results: 100%

N-grams (original)
Index size: 170 MB
Index creation time: 32 s
Search time: 71ms / 110ms
Completeness of results: 65%
N-grams (modification 1)
Index size: 170 MB
Index creation time: 32 s
Search time: 39ms / 46ms
Completeness of results: 63%
N-grams (modification 2)
Index size: 170 MB
Index creation time: 32 s
Search time: 37ms / 45ms
Completeness of results: 62%

Signature hashing
Index size: 85 MB
Index creation time: 0.6 s
Search time: 55ms
Completeness of results: 56.5%

BK trees
Index size: 150 MB
Index creation time: 120 s
Search time: 540ms
Completeness of results: 63%

Summary

Most indexed fuzzy search algorithms are not truly sublinear (i.e. with an asymptotic running time of O(log n) or lower), and their speed usually depends directly on N. Nevertheless, numerous improvements and refinements make it possible to achieve sufficiently short running times even for very large dictionaries.

There are also many other, less efficient methods based, among other things, on adapting techniques already used elsewhere to this subject area. Among them is the adaptation of prefix trees (tries) to fuzzy search problems, which I left aside because of its low efficiency. There are also algorithms based on original approaches, for example the Maass-Novak algorithm, which, although it has sublinear asymptotic running time, is extremely inefficient because of the huge constants hidden behind that estimate, which manifest themselves as a huge index size.

The practical use of fuzzy search algorithms in real search engines is closely tied to phonetic algorithms, lexical stemming algorithms (extracting the base form of the different word forms of the same word; such functionality is provided, for example, by Snowball and Yandex mystem), ranking based on statistical information, and the use of complex, sophisticated metrics.

  • Levenshtein distance (with clipping and prefix option);
  • Damerau-Levenshtein distance (with clipping and prefix option);
  • Bitap algorithm (Shift-OR / Shift-AND with Wu-Manber modifications);
  • Sample expansion algorithm;
  • N-gram method (original and with modifications);
  • Signature hashing method;
  • BK-trees.
I wanted the code to be easy to understand and at the same time efficient enough for practical use. Squeezing the last drops of performance out of the JVM was not among my goals. Enjoy.

It is worth noting that in the process of studying this topic, I came up with some of my own developments that allow me to reduce the search time by an order of magnitude due to a moderate increase in the size of the index and some limitation in the freedom of choice of metrics. But that's a completely different story.


Contents

  • Definition
  • Example applications
  • Creating an n-gram language model
  • Calculating n-gram probability
  • Eliminating training corpus sparsity: Add-one smoothing, Witten-Bell discounting, Good-Turing discounting, Katz's backoff, deleted interpolation
  • Estimating an n-gram language model using entropy


Definition

An N-gram is a subsequence of N elements of some sequence. Consider sequences of words. Unigrams: cat, dog, horse, ... Bigrams: little cat, big dog, strong horse, ... Trigrams: little cat eats, big dog barks, strong horse runs, ...


Examples of applied tasks

  • Speech recognition. Some words with different spellings are pronounced the same; the task is to choose the correct word in context.
  • Generation of texts on a given subject. Example: Yandex.Abstracts.
  • Search for semantic errors. "He is trying to fine out" is correct in terms of syntax but not in terms of semantics; "He is trying to find out" is right. "Trying to find out" occurs in English texts much more often than "trying to fine out", so if statistics are available this kind of error can be found and eliminated.


Creating an n-gram language model

To solve the listed applied problems, an N-gram language model has to be created. To create a model you need to: 1. Calculate the probabilities of the n-grams in the training corpus. 2. Fix the corpus sparsity problem with one of the smoothing methods. 3. Evaluate the quality of the resulting n-gram language model using entropy.


Calculating the probability of N-grams (1)

In the training corpus, different n-grams occur with different frequencies. For each n-gram we can count how many times it occurs in the corpus. From these data a probabilistic model can be built, which can then be used to estimate the probability of n-grams in some test corpus.


Calculating the probability of N-grams (2)

Consider an example. Let the corpus consist of one sentence: "They picnicked by the pool, then lay back on the grass and looked at the stars". Let's extract the n-grams. Unigrams: They, picnicked, by, ... Bigrams: They picnicked, picnicked by, by the, ... Trigrams: They picnicked by, picnicked by the, by the pool, ...


Calculating the probability of N-grams (3)

Now the n-grams can be counted. All of the extracted bi- and trigrams occur in the corpus once. All unigrams, with the exception of the word "the", also occur once; "the" occurs three times. Now that we know how many times each n-gram occurs, we can build a probabilistic model of n-grams. In the case of unigrams, the probability of a word u can be calculated as P(u) = C(u)/N, where C(u) is the number of occurrences of the word u in the training corpus and N is the total number of words. For example, for the word "the" the probability is 3/16 (there are 16 words in the corpus, 3 of which are "the").


Calculating the probability of N-grams (4)

For n-grams with n > 1 the probability is calculated somewhat differently. Consider the case of bigrams: suppose we need to calculate the probability of the bigram "the pool". If we treat each word of the bigram as an event, the probability of the pair of events is P(the pool) = P(the) · P(pool | the), i.e. the product of the probability of the first word and the conditional probability of the second word given the first.


Calculating the probability of N-grams (5)

Now consider calculating the probability of an arbitrary n-gram (or of a sentence of length n). Extending the bigram case, we obtain the probability formula for n-grams: P(w_1 ... w_n) = P(w_1) · P(w_2 | w_1) · ... · P(w_n | w_1 ... w_{n-1}). Thus, calculating the probability of a sentence reduces to calculating the conditional probabilities of the N-grams that make up the sentence.
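A minimal Python sketch (my own) that reproduces the numbers from the example above, using the sentence with punctuation stripped:

```python
from collections import Counter

corpus = "They picnicked by the pool then lay back on the grass and looked at the stars".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_unigram(w):
    return unigrams[w] / N

def p_bigram(w_prev, w):
    """Conditional probability P(w | w_prev) estimated from counts."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

def p_sentence(words):
    """Bigram-model probability of a word sequence (no smoothing)."""
    p = p_unigram(words[0])
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(prev, w)
    return p

print(p_unigram("the"))                  # 3/16
print(p_bigram("the", "pool"))           # 1/3
print(p_sentence("the pool".split()))    # 3/16 * 1/3
```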




Eliminating corpus sparsity (1)

The problem with a simple (unsmoothed) n-gram language model: for some n-grams the probability can be greatly underestimated (or even zero), although in reality (in the test corpus) these n-grams may occur quite often. Reason: the limited size and specificity of the training corpus. Solution: by lowering the probability of some n-grams, raise the probability of those n-grams that did not occur (or occurred quite rarely) in the training corpus.




Eliminating corpus sparsity (3)

The following concepts are used in sparsity-elimination algorithms: Types are the distinct words (sequences of words) in the text. Tokens are all words (sequences of words) in the text. "They picnicked by the pool, then lay back on the grass and looked at the stars" - 14 types, 16 tokens.





Add-one smoothing (4)

The method introduces a large error into the calculations (for example, on the previous slide it was shown that for the word "Chinese" the bigram count was reduced by a factor of 8). Tests have shown that the unsmoothed model often gives more accurate results, so the method is mainly of theoretical interest.
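For reference, the add-one (Laplace) estimate that the method is based on, in its standard form (not copied from the slides; C denotes a count, V the vocabulary size):

```latex
P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}
```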


Witten-Bell discounting (1)

Based on a simple idea: use data about the n-grams that do occur in the training corpus to estimate the probability of the missing n-grams. The idea of the method comes from compression algorithms: two types of events are considered, encountering a new symbol (type) and encountering a symbol (token). The probability formula for all missing n-grams (that is, the probability of encountering in the test corpus an n-gram that was not in the training corpus) uses N, the number of tokens in the training corpus, and T, the number of types already encountered in the training corpus:
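In its standard Witten-Bell form, the total probability mass reserved for all unseen n-grams is:

```latex
P(\text{all unseen n-grams}) = \frac{T}{N + T}
```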






Witten-Bell discounting (4)




Good-Turing discounting (1)

Idea: for n-grams that occurred c times (in particular, zero times), the estimate is based on the number of n-grams that occurred c + 1 times. Consider an example: suppose 18 fish have been caught, of 6 different species, and three of the species are represented by only one fish each. We need to find the probability that the next fish will belong to a new species. There are 7 possible species in total (6 have already been caught).
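Under the standard Good-Turing estimate (the calculation is mine, but it matches the setup of the example), the mass assigned to unseen species is N_1/N, where N_1 is the number of species seen exactly once, and observed counts c are discounted to c*:

```latex
P(\text{new species}) = \frac{N_1}{N} = \frac{3}{18} = \frac{1}{6},
\qquad
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}}
```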








Katz's backoff (2)

The coefficient α is needed to distribute the residual probability of the N-grams correctly, in accordance with the probability distribution of the (N-1)-grams. If α is not introduced, the estimate will be erroneous, because the probabilities will no longer sum to one. The calculation of α is given at the end of the report.

Evaluating a language model using entropy (1)

Entropy is a measure of uncertainty. Using entropy, one can determine the most suitable N-gram language model for a given applied task. Example: calculate the entropy of a coin-tossing experiment. Answer: 1 bit, provided the outcomes of the experiment are equally likely (either side comes up with probability 1/2). The binary entropy formula is given below.
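The binary entropy formula in its standard form, together with the fair-coin value:

```latex
H(X) = -\sum_{i} p(x_i)\,\log_2 p(x_i),
\qquad
H(\text{fair coin}) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1\ \text{bit}
```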




Evaluating a language model using entropy (3)

Cross-entropy is used to compare different language models: H(p, m) = -∑ p(w) log2 m(w). The closer the cross-entropy H(p, m) is to the real entropy H(p), the better the language model. Here H(p) is the entropy of the test corpus and m(w) is the language model (e.g. an N-gram model).


Evaluating a language model using entropy (4)

There is another way to assess the quality of a language model, based on the so-called perplexity measure. Idea: calculate the probability of the entire test corpus; a better model will assign it a higher probability. Thus, the lower the perplexity, the better the model. Perplexity can be interpreted as the average number of words that can follow a given word (i.e. the higher the perplexity, the higher the ambiguity and therefore the worse the language model). Relationship between perplexity and binary entropy:
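In standard form, for a test corpus W of N words with per-word cross-entropy H(W):

```latex
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = 2^{H(W)}
```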


Evaluating a language model using entropy (5)

As an example, consider the perplexity values for a corpus obtained with trained unigram, bigram and trigram models. Perplexity is lowest for trigrams, because disambiguation is helped by the longest history length of all the models (equal to 2) when calculating the conditional probabilities of the trigrams.


I want to implement some n-gram applications (preferably in PHP).

Which type of n-grams is more suitable for most purposes: word-level or character-level n-grams? How can an n-gram tokenizer be implemented in PHP?

First, I would like to check that I understand what N-grams are. Is this right? This is how I understand n-grams:

Suggestion: "I live in New York."

word-level birams (2 for n): "# I", "I live", "live in", "in New York", "NY #"

character level birams (2 for n): "#I", "I #", "#l", "li", "iv", "ve", "e #", "#i", "in", "n#", "#N", "NY", "Y#"

Once you have this array of n-gram parts, you drop the duplicates and add a counter for each part giving its frequency:

word level bigrams:

character level bigrams:

Is it correct?
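A minimal sketch of this kind of bigram counting in Python (not PHP, purely for illustration), using the "#" padding from the example above:

```python
from collections import Counter

def word_bigrams(sentence):
    """Word-level bigrams with '#' as sentence boundary markers."""
    tokens = ["#"] + sentence.split() + ["#"]
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

def char_bigrams(sentence):
    """Character-level bigrams with '#' as word boundary markers."""
    grams = Counter()
    for word in sentence.split():
        padded = "#" + word + "#"
        grams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams

print(word_bigrams("I live in NY"))
print(char_bigrams("I live in NY"))
```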

Also, I would like to know more about what you can do with n-grams:

  • How can I detect the language of a text using n-grams?
  • Is it possible to do machine translation using n-grams even if you don't have a bilingual corpus?
  • How to create a spam filter (spam, ham)? Combine n-grams with Bayesian filter?
  • How can I find a topic? For example: is there a text about basketball or dogs? My approach (do the following with the Wikipedia article for "dogs" and "basketball"): plot the n-gram vectors for both documents, normalize them, calculate the Manhattan/Euclid distance, the closer the result is to 1, the higher the similarity will be

How do you feel about my application, especially the last one?

Hope you can help me. Thanks in advance!

2 answers

Word n-grams will generally be more useful for most of the text-analysis applications you mention, with the possible exception of language detection, where something like character trigrams might give better results. Effectively, you would create an n-gram vector for a corpus of text in each language you are interested in, and then compare the trigram frequencies in each corpus with those in the document you are classifying. For example, the trigram "the" probably appears much more frequently in English than in German and would provide some level of statistical correlation. Once you have your documents in n-gram form, you have a choice of many algorithms for further analysis: Bayesian filters, N nearest neighbors, support vector machines, etc.

Of the applications you mentioned, machine translation is probably the most far-fetched since n-grams alone won't get you very far down the road. Converting the input file to n-gram representation is just a way to put the data into a format for further feature parsing, but as you lose a lot of contextual information it might not be useful for translation.

One thing to note is that it is not enough to create a vector for one document and a vector for another if the dimensions do not match. That is, the first entry in the vector cannot be "the" in one document and "is" in another, or the algorithms will not work. You will end up with very sparse vectors, since most documents will not contain most of the n-grams you are interested in. This "padding" also requires that you determine in advance which n-grams you will include in your analysis. Often this is implemented as a two-pass algorithm: first decide the statistical significance of the various n-grams in order to choose what to keep. Google "feature selection" for more information.

Word-based n-grams plus support vector machines are a great way to identify a topic, but to train a classifier you need a large corpus of text pre-classified into "on-topic" and "off-topic". You will find a large number of research papers explaining various approaches to this problem on a site such as citeseerx. I would not recommend the Euclidean distance approach for this problem, since it does not weight individual n-grams by statistical significance, so two documents that both include "the", "a", "is" and "of" would be considered a better match than two documents that both include "Bayesian". Removing stop words from your n-grams of interest would improve this somewhat.

You are right about the definition of n-grams.

You can use word level n-grams for search type applications. Character level n-grams can be used more to parse the text itself. For example, to identify the language of a text, I would use the letter frequencies against the established language frequencies. That is, the text should approximately correspond to the frequency of occurrence of letters in this language.

An n-gram tokenizer for words in PHP can be built using strtok:

For characters, use split:

You can then just split the array however you like into any number of n-grams.

Bayesian filters need to be trained before they can be used as spam filters, and they can be combined with n-grams. However, you need to give them plenty of input for them to learn.

Your last approach sounds decent as it learns the context of the page... it's still quite tricky to do, however, but n-grams seem like a good starting point for this.