A cross-language news topic similarity comparison method based on a parallel corpus
Technical field
This patent proposes a cross-language news topic similarity comparison method based on a parallel corpus. The method can automatically filter out foreign-language articles on a specific topic without translation. Given a bilingual parallel corpus, the method extends the LDA topic model to a bilingual LDA topic model and implements a parallelized version on a parallel computing framework, so that automatic screening of multilingual news reports on the same event can be carried out quickly and efficiently. The method relates to the fields of corpora, word-frequency analysis, and similarity computation.
Background art
How to compare the topic similarity of news reports on the same event across languages automatically, without human translation, and thereby to screen multilingual news reports on the same topic automatically, is a problem to be solved: solving it would cut translation cost and make it possible to track foreign media coverage and public opinion promptly and accurately.
Machine translation has made great progress in recent years. Machine translation uses a computer to convert text in one natural language into another natural language. Since statistical methods were applied to machine translation, the accuracy of machine-translation results has steadily improved. In practical applications today, however, machine translation and human translation are usually combined: an article is first translated by machine into the other language and then revised and calibrated manually before a complete, accurate article is obtained. The human-translation step consumes manpower and time, lowers efficiency, and raises cost, so traditional machine translation is too slow and too inaccurate to cope with massive multilingual news reporting. This patent therefore proposes a corpus-based similarity comparison method for screening news articles, used as a preprocessing step for news topic identification before machine translation, in order to reduce the noise of irrelevant reports, save manpower and material resources, and improve efficiency.
In view of the above problems, this patent proposes a cross-language news topic similarity comparison method based on a parallel corpus. A corpus is a large-scale electronic text collection: real language data from practical use, stored on a computer, sampled scientifically, and turned into a practically useful electronic text library through reasonable analysis and processing. A parallel corpus is a set of texts in which each text is paired with translations in one or more other languages; the simplest parallel corpus is a bilingual one consisting of source texts and their translations. Based on a parallel corpus, this patent analyzes the word segmentation rules and word-frequency distributions of the two languages. First, Chinese news reports about a given topic or event are selected and, based on a general Chinese corpus, the LDA algorithm generates a topic model of the topic articles. Then foreign-language articles to be screened are chosen and their text features are extracted based on a general foreign-language corpus. Finally, the text features obtained from a foreign-language news article are compared with the topic model obtained from the Chinese topic articles; if they are similar, the foreign-language article is judged to be an article about that topic.
On the basis of the LDA topic model, this patent extends it to a bilingual LDA topic model. Unlike the traditional LDA topic model, in which each document has an independent topic distribution, the bilingual LDA model exploits the parallel corpus: the two languages describe the same topics and share a common topic distribution, while the word-frequency distributions of the two languages describing the same content may differ. The method builds the bilingual LDA topic model on a generalized topic space on the basis of the parallel corpus; when new language material arrives, a new LDA model is generated for it and compared with the bilingual LDA model to determine the topic class of the new material.
This patent samples the distributions with Gibbs sampling, but in view of massive training samples, and in order to improve the efficiency of LDA model generation, a parallelized LDA algorithm is implemented on the parallel computing framework Spark. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent is improved to run in parallel and adds a timestamp feature; carrying out the sampling process in a distributed environment improves both the efficiency and the accuracy of the whole process.
Summary of the invention
The present invention, based on a parallel corpus, gives a method for filtering out foreign-language articles about a specific topic or event. The precondition is a general parallel corpus of two languages, Chinese and a foreign language F. Suppose the target news topic is T and the foreign language is F. Without translation, the articles of language F about topic T are filtered out from a large number of F-language articles of unknown topic:
1) In the parallel corpus each document pair has an independent topic distribution, and the two languages describe the same topics and share that topic distribution. First, the collection of Chinese articles about topic T is retrieved and, based on the general Chinese corpus in the parallel corpus, the LDA topic-model algorithm yields the Chinese LDA topic model of the article collection;
2) Then the Chinese LDA topic model of topic T is mapped into the generalized topic-model space to obtain the topic-T LDA topic model shared by Chinese and language F. Using the LDA algorithm, the F-language articles of unknown topic to be screened, together with the F-language part of the parallel corpus, yield an F-language LDA topic model;
3) The generalized-space LDA topic model and the F-language LDA topic model are compared; if they are similar, the article to be screened is judged to be an article about topic T.
4) A cross-language news topic similarity comparison method based on a parallel corpus, characterized in that the method is divided into three steps:
The precondition is a general parallel corpus of two languages, Chinese and a foreign language F. Suppose the target news topic is T and the foreign language is F; without translation, the articles of language F about topic T are filtered out from the F-language articles of unknown topic:
(1) In the parallel corpus each document pair has an independent topic distribution, and the two languages describe the same topics and share that topic distribution. First, the collection of Chinese articles about topic T is retrieved and, based on the general Chinese corpus in the parallel corpus, the LDA topic-model algorithm yields the Chinese LDA topic model of the article collection;
(2) Then the Chinese LDA topic model of topic T is mapped into the generalized topic-model space to obtain the topic-T LDA topic model shared by Chinese and language F. Using the LDA algorithm, the F-language articles of unknown topic to be screened, together with the F-language part of the parallel corpus, yield an F-language LDA topic model;
(3) The generalized-space LDA topic model and the F-language LDA topic model are compared; if they are similar, the article to be screened is judged to be an article about topic T.
The same topic carries similar semantic information in different languages, so texts in different languages can be represented in the same generalized topic space. Once the training data has been labeled in one language, that is, once the topic model of a set of Chinese news articles has been generated, it is mapped into the generalized topic space; using this generalized topic class, a topic model is generated for the foreign-language news articles of unknown topic and compared with the topic model in the generalized space to obtain the topic result. The steps are as follows:
For topic k:
(1) sample the word probability distribution of the general Chinese corpus, φ_k^C ~ Dirichlet(β^C), and the word probability distribution of the general corpus of foreign language F, φ_k^F ~ Dirichlet(β^F);
(2) for the m-th Chinese–F document pair in the parallel corpus, m ∈ [1, M], sample the topic probability distribution θ_m ~ Dirichlet(α); then
1. for the n_C-th term of the Chinese document, n_C ∈ [1, N_m^C], select a latent topic z^C ~ Multinomial(θ_m) and generate a term ω^C ~ Multinomial(φ_{z^C}^C);
2. for the n_F-th term of the foreign-language document, n_F ∈ [1, N_m^F], select a latent topic z^F ~ Multinomial(θ_m) and generate a term ω^F ~ Multinomial(φ_{z^F}^F).
Here C denotes Chinese and F the foreign language; θ_m denotes the topic probability distribution of the m-th bilingual parallel document pair; φ_k^C and φ_k^F denote the word distributions of topic z_k in Chinese and in the foreign language; z^C and z^F denote the latent topics of the n-th term of the source-language and target-language documents of the m-th pair; ω^C and ω^F denote the n-th terms of the Chinese and foreign-language documents of the m-th pair; M denotes the total number of document pairs; N_m^C and N_m^F denote the numbers of terms of the Chinese and foreign-language documents of the m-th pair. θ_m obeys a Dirichlet distribution whose prior parameter α is used to generate topics; φ_k^C and φ_k^F obey Dirichlet distributions whose prior parameters β^C and β^F are used to generate terms. The hyperparameters α, β^C, and β^F govern the "document–topic" and "topic–term" probability distributions respectively and are estimated by maximum likelihood: the probability of the entire corpus is taken as the optimization objective and maximized to obtain the values of α, β^C, and β^F, which yields the LDA model.
Parameter estimation of the LDA topic model estimates the document–topic distributions and the topic–word distributions. The concrete procedure is:
Suppose the news corpus contains m documents and the number of threads is p. The algorithm first initializes the sampling parameters; then the training corpus is divided into p data sets, samples with high similarity being assigned to the same data set. Gibbs sampling is then applied to the distributions above, each thread performing the Gibbs sampling process on its assigned data set in parallel. Since the terms w are observed data, the only latent variable in this distribution is the topic z, so only the distribution P(z | w) needs to be sampled. By the parameter estimates of the Dirichlet posteriors under the Bayesian framework, assuming the observed word is ω_i = t, the parameter estimates are:

θ̂_{m,k} = (n_m^(k) + α_k) / Σ_{k'=1}^{K} (n_m^(k') + α_{k'})
φ̂_{k,t} = (n_k^(t) + β_t) / Σ_{t'=1}^{V} (n_k^(t') + β_{t'})
From this, the Gibbs sampling formula of the LDA model is obtained:

P(z_i = k | z_¬i, w) ∝ (n_{m,¬i}^(k) + α_k) / Σ_{k'} (n_{m,¬i}^(k') + α_{k'}) · (n_{k,¬i}^(t) + β_t) / Σ_{t'} (n_{k,¬i}^(t') + β_{t'})

where the first factor is θ̂_{m,k}, the estimate under the Bayesian framework of the Dirichlet probability of topic k for the m-th bilingual parallel document pair, with α_k its prior parameter representing the probability of the k-th topic; w_{m,n} denotes the n-th word of document m under topic k; the second factor is φ̂_{k,t}, the estimate under the Bayesian framework of the Dirichlet probability of term t in topic k, with β_t its prior representing the word distribution of the t-th topic (the subscript ¬i excludes the word currently being sampled). α_k and β_t are estimated by maximum likelihood: the probability of the entire corpus is taken as the optimization objective and maximized to obtain their values.
The right-hand side of this formula is the product P(topic | document) · P(word | topic), which is exactly the path probability of document → topic → term; since the number K of topics is chosen manually, Gibbs sampling samples over these K paths.
After all threads have completed sampling, the corresponding document–topic matrices and topic–word matrices are obtained. The global parameters are computed by averaging all local parameters, and the global parameters are then distributed to each parallel LDA instance as the initial sampling parameters of the next iteration round. Sampling is restarted, and this process is repeated until the parameters converge.
Regarding the parallel sampling method: to ensure that the topics corresponding to the different data sets are mutually independent, this patent adopts a partitioning strategy based on a time element. Because the parallel LDA algorithm cannot update the global sampling parameters promptly during each sampling iteration, the precision of the final model suffers a certain loss compared with the traditional LDA algorithm. This is caused mainly by the uniform random initial partitioning: the whole data set is simply divided evenly into p parts without taking the correlation between documents within each part into account, so the topics within each data set tend to average out.
The text similarity measure is based on the TF-IDF algorithm, a statistical method for assessing how important a word is to a document within a document collection or corpus. First the TF-IDF values of every document are computed; then a time scale is attached to each document according to its publication time, and based on this time scale the similarity of two documents is computed by the formula:

s_{a,b} = (W_a · W_b) / (|W_a| · |W_b| · (|t_a − t_b| + 1))

where t_a and t_b denote the time scales of documents a and b, the +1 in the denominator prevents the denominator from being 0, and W_a · W_b denotes the inner product of the TF-IDF word vectors of documents a and b.

Effects and benefits of this patent:
1. The invented bilingual LDA topic model can automatically screen out the articles of a specific topic without any translation;
2. This patent serves as a preprocessing step for news topic identification and screening before machine translation, reducing the noise of irrelevant reports, saving manpower and material resources, and improving efficiency;
3. When processing the data, this patent implements parallelized processing on a parallel computing framework, so screening can be carried out quickly and at high speed;
4. On the basis of the parallel LDA algorithm, this patent partitions the data blocks with a strategy based on a time element, which improves the precision of the final result.
Description of the drawings
Fig. 1 Description of the foreign-language article screening method
Fig. 2 Processing of the Chinese articles of topic news
Fig. 3 Dictionary-based word segmentation process
Fig. 4 Graph model of the bilingual LDA model
Fig. 5 Flow chart of the parallel LDA algorithm
Fig. 6 Probability computation paths
Fig. 7 List of algorithm parameters
Fig. 8 Pseudocode of the initialization algorithm
Fig. 9 Pseudocode of data-set partitioning
Fig. 10 Pseudocode of Gibbs sampling
Fig. 11 F-language LDA topic model
The technical solution of the present invention is described in detail below with reference to the attached drawings.
1. Fig. 2 shows the processing of the Chinese articles of a given topic. First a target topic T is selected; Chinese articles about topic T are retrieved and filtered from the major Chinese portal websites to serve as the Chinese language material. These Chinese texts are preprocessed by word segmentation and stop-word removal, in the following two steps:
(1) Word segmentation. Since this patent is corpus-based, a dictionary-based segmentation method is used: character strings of the Chinese news article are matched against the words in the segmentation dictionary according to the strategy defined by the corpus; if a string is found in the dictionary, the match succeeds and the string is split off. The segmentation process is shown in Fig. 3.
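A minimal sketch of the dictionary-based matching idea, here in the common forward-maximum-matching form (the toy dictionary and the maximum word length are illustrative assumptions; the patent's actual strategy is defined by its corpus dictionary):

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, split off the
    longest dictionary word starting there, falling back to one character."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens

# Toy dictionary (illustrative only)
dictionary = {"新闻", "报道", "主题", "新闻报道"}
print(forward_max_match("新闻报道主题", dictionary))
```

The greedy longest-match rule is what makes "新闻报道" come out as one token rather than two.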
(2) Stop-word removal. Stop words are words that occur frequently but carry little textual information.
2. The T-topic Chinese material and the general Chinese corpus are used to generate the bilingual LDA model of topic T on the basis of the LDA model. This patent assumes that the same topic carries similar semantic information in different languages, so texts in different languages can be represented in the same generalized topic space. Once the training data has been labeled in one language, i.e. once the topic model of a set of Chinese news articles has been generated, it can be mapped into the generalized topic space; using this generalized topic class, a topic model is generated for the foreign-language news articles of unknown topic and compared with the topic model of the generalized space to obtain the topic result. The steps are described as follows:
Fig. 4 is the graph-model representation of the bilingual LDA topic model; the graph model is based on the LDA topic model and illustrates the document generation process. C denotes Chinese and F the foreign language; θ_m denotes the topic probability distribution of the m-th bilingual parallel document pair; φ_k^C and φ_k^F denote the word distributions of topic z_k in Chinese and the foreign language; z^C and z^F denote the latent topics of the n-th term of the source-language and target-language documents of the m-th pair; ω^C and ω^F denote the n-th terms of the Chinese and foreign-language documents of the m-th pair; M denotes the total number of document pairs; N_m^C and N_m^F denote the numbers of terms of the Chinese and foreign-language documents of the m-th pair. θ_m obeys a Dirichlet distribution, with α the Dirichlet prior parameter of the per-document topic multinomial; φ_k^C and φ_k^F obey Dirichlet distributions, with prior parameters β^C and β^F used to generate terms.
α reflects the relative relation between the topics latent in the document collection, i.e. the "document–topic" probability distribution, and β reflects the "topic–feature word" probability distribution. The probability of the entire corpus is taken as the optimization objective; maximizing this objective yields estimates of α, β^C, and β^F, and hence the LDA model. Suppose the material contains M documents; all terms and their corresponding topics are written as:

W = (ω_1, …, ω_M)
Z = (z_1, …, z_M)

where ω_m denotes the words in document m, z_m the topic assignments of those words, and n_m^(k) the number of words generated by the k-th topic in document m.
2.1 The generative process of the LDA model. For topic k:
(1) sample the word probability distribution of the general Chinese corpus, φ_k^C ~ Dirichlet(β^C), and the word probability distribution of the general corpus of foreign language F, φ_k^F ~ Dirichlet(β^F);
(2) for the m-th Chinese–F document pair in the parallel corpus, m ∈ [1, M], sample the topic probability distribution θ_m ~ Dirichlet(α); then
1. for the n_C-th term of the Chinese document, n_C ∈ [1, N_m^C], select a latent topic z^C ~ Multinomial(θ_m) and generate a term ω^C ~ Multinomial(φ_{z^C}^C);
2. for the n_F-th term of the foreign-language document, n_F ∈ [1, N_m^F], select a latent topic z^F ~ Multinomial(θ_m) and generate a term ω^F ~ Multinomial(φ_{z^F}^F).
The generating probability of the topics in the whole material:

P(Z | α) = Π_{m=1}^{M} P(z_m | α)

The generating probability of the terms in the whole material:

P(W | Z, β) = Π_{m=1}^{M} Π_{n=1}^{N_m} P(ω_{m,n} | φ_{z_{m,n}})

Merging the two formulas gives:

P(W, Z | α, β) = P(Z | α) · P(W | Z, β)
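The bilingual generative process above can be sketched with numpy. All sizes, priors, and the random corpus are illustrative assumptions, not the patent's data; the point is that each document pair draws one shared θ_m for both languages:

```python
import numpy as np

rng = np.random.default_rng(0)

K, M = 3, 5              # number of topics, number of document pairs (illustrative)
V_C, V_F = 20, 25        # vocabulary sizes of Chinese and language F (illustrative)
alpha = np.full(K, 0.1)  # Dirichlet prior over per-pair topic distributions
beta_C = np.full(V_C, 0.01)
beta_F = np.full(V_F, 0.01)

# Per-topic word distributions: phi_k^C ~ Dir(beta^C), phi_k^F ~ Dir(beta^F)
phi_C = rng.dirichlet(beta_C, size=K)
phi_F = rng.dirichlet(beta_F, size=K)

docs_C, docs_F = [], []
for m in range(M):
    theta_m = rng.dirichlet(alpha)  # one topic distribution SHARED by both sides
    # Chinese side: topic z^C ~ Multinomial(theta_m), term ~ Multinomial(phi_C[z])
    z_C = rng.choice(K, size=10, p=theta_m)
    docs_C.append([int(rng.choice(V_C, p=phi_C[z])) for z in z_C])
    # F side draws its topics from the SAME theta_m (the bilingual coupling)
    z_F = rng.choice(K, size=12, p=theta_m)
    docs_F.append([int(rng.choice(V_F, p=phi_F[z])) for z in z_F])

print(len(docs_C), len(docs_F))
```

Only θ_m is shared; the topic–word distributions φ^C and φ^F stay separate, which is how the model tolerates different word-frequency behavior in the two languages.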
2.2 Gibbs sampling within MCMC is the common method for obtaining LDA topic models. This patent samples the distributions above with Gibbs sampling, but in view of massive training samples, and in order to improve the efficiency of LDA model generation, a parallelized LDA algorithm is implemented on the parallel computing framework. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent is improved to run in parallel. The concrete procedure is:
Suppose the news corpus contains m documents and the number of threads is p. The algorithm first initializes the sampling parameters; then the training corpus is divided into p data sets, samples with high similarity being assigned to the same data set (the similarity measure is discussed in detail below). Gibbs sampling is then applied to the distributions above, each thread performing the Gibbs sampling process on its assigned data set in parallel. Since the words w are observed data, the only latent variable in this distribution is the topic z, so only the distribution P(z | w) needs to be sampled. By the parameter estimates of the Dirichlet posteriors under the Bayesian framework, assuming the observed word is ω_i = t, the parameter estimates are:

θ̂_{m,k} = (n_m^(k) + α_k) / Σ_{k'=1}^{K} (n_m^(k') + α_{k'})
φ̂_{k,t} = (n_k^(t) + β_t) / Σ_{t'=1}^{V} (n_k^(t') + β_{t'})
From this, the Gibbs sampling formula of the LDA model is obtained:

P(z_i = k | z_¬i, w) ∝ (n_{m,¬i}^(k) + α_k) / Σ_{k'} (n_{m,¬i}^(k') + α_{k'}) · (n_{k,¬i}^(t) + β_t) / Σ_{t'} (n_{k,¬i}^(t') + β_{t'})

where the first factor is θ̂_{m,k}, the estimate under the Bayesian framework of the Dirichlet probability of topic k for the m-th bilingual parallel document pair, with α_k its prior parameter representing the probability of the k-th topic; w_{m,n} denotes the n-th word of document m under topic k; the second factor is φ̂_{k,t}, the estimate under the Bayesian framework of the Dirichlet probability of term t in topic k, with β_t its prior representing the word distribution of the t-th topic. α_k and β_t are estimated by maximum likelihood: the probability of the entire corpus is taken as the optimization objective and maximized to obtain their values.
After all threads have completed sampling, the corresponding document–topic matrices and topic–word matrices are obtained. The global parameters are computed by averaging all local parameters and are then distributed to each parallel LDA instance as the initial sampling parameters of the next iteration round; sampling is restarted, and the process is repeated until the parameters converge. The flow chart of the whole process is shown in Fig. 5.
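A reduced sketch of the iterate-and-merge scheme above: each "thread" runs collapsed Gibbs sampling on its partition against a stale local copy of the global topic–word counts, and the local changes are then folded back into the global counts. The toy corpus, sizes, and priors are illustrative; the patent describes the merge as averaging the local parameters, and the delta-sum merge used here is a closely related variant that keeps the counts exactly consistent:

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, p = 2, 6, 2          # topics, vocabulary size, partitions (illustrative)
alpha, beta = 0.5, 0.1

# Toy corpus: documents as word-id lists, pre-split into p data sets
partitions = [
    [[0, 0, 1, 2], [0, 1, 1, 2]],
    [[3, 4, 4, 5], [3, 3, 5, 5]],
]

# Random topic initialization and global topic-word counts n_k^(t)
z = [[list(rng.integers(K, size=len(doc))) for doc in part] for part in partitions]
nkt = np.zeros((K, V))
for part, zp in zip(partitions, z):
    for doc, zd in zip(part, zp):
        for w, k in zip(doc, zd):
            nkt[k, w] += 1

for _ in range(20):
    # Each "thread" samples against a stale local copy of the global counts
    local = [nkt.copy() for _ in range(p)]
    for i, (part, zp) in enumerate(zip(partitions, z)):
        for doc, zd in zip(part, zp):
            ndk = np.bincount(zd, minlength=K).astype(float)  # n_m^(k)
            for n, w in enumerate(doc):
                k_old = zd[n]
                local[i][k_old, w] -= 1
                ndk[k_old] -= 1
                # Collapsed Gibbs: P(z=k|.) ∝ (n_m^(k)+α)(n_k^(t)+β)/(Σ_t n_k^(t)+Vβ)
                prob = (ndk + alpha) * (local[i][:, w] + beta) \
                       / (local[i].sum(axis=1) + V * beta)
                k_new = int(rng.choice(K, p=prob / prob.sum()))
                zd[n] = k_new
                local[i][k_new, w] += 1
                ndk[k_new] += 1
    # Reconcile: fold every thread's local changes back into the global counts,
    # which then seed the next iteration round
    nkt = sum(local) - (p - 1) * nkt

print(nkt.sum())  # total token count is preserved across iterations
```

Because each partition samples against counts frozen at the start of the round, the result deviates slightly from serial Gibbs sampling, which is exactly the precision loss the time-based partitioning in Section 2.3 is meant to reduce.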
2.3 Time-element-based partitioning strategy. On the basis of the above parallel LDA algorithm, this patent improves the data partitioning in order to raise the precision of the final result. Because the parallel LDA algorithm cannot update the global sampling parameters promptly during each sampling iteration, the precision of the final model suffers a certain loss compared with the traditional LDA algorithm. This is caused mainly by the uniform random initial partitioning: the whole data set is simply divided evenly into p parts, the correlation between the documents within each part is not taken into account, and the topics within each data set therefore tend to average out.
Comparing the parallel LDA algorithm with the traditional serial LDA algorithm shows that the topic–word matrix of the parallel algorithm cannot update the Gibbs sampling parameters promptly, which introduces a certain deviation into the computation of the document topic probabilities. If the algorithm makes the topics of each data set independent and unaffected by the other data sets, this deviation can be reduced.
For text similarity computation, this patent introduces a partitioning strategy based on a time element, built on the TF-IDF algorithm. TF-IDF is a statistical method for assessing how important a word is to a document within a document collection or corpus: the importance of a word is proportional to the number of times it appears in the document, but inversely proportional to its frequency of occurrence in the whole corpus.
TF-IDF consists of two parts: term frequency (TF) and inverse document frequency (IDF). TF is the number of times a given word occurs in a given document; on top of the raw count it is normalized, in order to prevent bias toward long texts.
tf_{i,j} denotes the term frequency of the i-th word of the corpus in the j-th document, computed as:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} denotes the number of occurrences of the word in document j, and the denominator Σ_k n_{k,j} denotes the total number of occurrences of all words in document j.
The inverse document frequency idf_i measures the importance of a word for the whole corpus, computed as:

idf_i = log( |D| / (d(w_i) + 1) )

where |D| denotes the total number of documents in the corpus and d(w_i) the number of documents containing word i, with minimum value 1; the +1 in the denominator prevents the denominator from being 0.
With TF and IDF in hand, the TF-IDF value is simply their product:

tfidf_{i,j} = tf_{i,j} × idf_i
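A compact sketch of these TF and IDF definitions (the toy documents are illustrative):

```python
import math
from collections import Counter

def tf(word, doc):
    # tf_{i,j} = n_{i,j} / sum_k n_{k,j}
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    # idf_i = log(|D| / (d(w_i) + 1)); the +1 guards against a zero denominator
    d = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (d + 1))

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [["topic", "news", "news"], ["news", "model", "spark"],
        ["topic", "model", "model"]]
print(tfidf("spark", docs[1], docs))  # unique to one document -> positive weight
```

Note that with this +1 smoothing a word appearing in two of three documents already gets idf = log(3/3) = 0, so only relatively rare words carry weight.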
Document similarity is then computed from the TF-IDF values. This patent measures the similarity of two documents by the cosine of their vectors, each document being represented as the vector W formed by the TF-IDF values of all its words. To incorporate the time element, the life cycle of a news topic is assumed to be one month; according to the time information of each document in the corpus, a time scale t is attached to each news item, e.g. the i-th news item published in the earliest month has time scale t_i = 1. The similarity is weighted by the time attribute, so that news documents published within a close life cycle have higher similarity; conversely, the longer the interval between the publication times of two documents, the lower their similarity. The similarity s_{a,b} of documents a and b is thus expressed as:

s_{a,b} = (W_a · W_b) / (|W_a| · |W_b| · (|t_a − t_b| + 1))

where t_a and t_b denote the time scales of documents a and b, the +1 in the denominator prevents the denominator from being 0, and W_a · W_b denotes the inner product of the TF-IDF word vectors of documents a and b.
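The time-weighted similarity can be sketched as follows (the vectors and time scales are illustrative; the cosine of the TF-IDF vectors is damped by the time-scale gap):

```python
import math

def time_weighted_similarity(Wa, Wb, ta, tb):
    """Cosine similarity of two TF-IDF vectors, damped by the time-scale
    gap: s_{a,b} = cos(Wa, Wb) / (|ta - tb| + 1)."""
    dot = sum(x * y for x, y in zip(Wa, Wb))
    na = math.sqrt(sum(x * x for x in Wa))
    nb = math.sqrt(sum(x * x for x in Wb))
    return (dot / (na * nb)) / (abs(ta - tb) + 1)

# Identical vectors: a 2-month gap scales the similarity down by a factor of 3
s_same = time_weighted_similarity([1.0, 2.0], [1.0, 2.0], 1, 1)
s_far = time_weighted_similarity([1.0, 2.0], [1.0, 2.0], 1, 3)
print(s_same, s_far)
```

The +1 in the damping factor keeps documents from the same month at full cosine similarity instead of dividing by zero.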
After the similarity matrix between the documents has been computed as above, the whole corpus is divided into p parts according to the pairwise similarities, documents with higher similarity being assigned to the same data set as far as possible.
2.4 Implementation of the parallel LDA algorithm. On the basis of the partitioned data sets, the implementation of the parallel LDA algorithm and the corresponding pseudocode are given next. The implementation consists of three parts: parameter initialization, data-set partitioning, and Gibbs sampling.
Parameter — Description
X — the corpus
X_i — the i-th small data set after partitioning
p — number of data partitions
K — number of topics
M — number of documents
V — number of words in the vocabulary
N_m — number of words in document m
N — total number of words in all documents
n_m — count array: total number of words contained in document m of X
n_k — count array: total number of words contained in topic k of X
n_m^(k) — count matrix: number of words in document m of X belonging to topic k
n_k^(t) — count matrix: number of occurrences of word t among all words belonging to topic k in X
n_{k,i}^(t) — count matrix: number of occurrences of word t among all words belonging to topic k in X_i
n_{m,i}^(k) — count matrix: number of words of topic k in document m of X_i
z_{m,n} — topic assignment of the n-th word of document m
w_{m,n} — the n-th word of document m
t_a, t_b — time scales of documents a and b
Sim(i, j) — similarity of the i-th and j-th documents

The table above lists the parameters of the parallel LDA algorithm; the parts of the algorithm are described in detail below.
First comes the initialization of the parameters: each word w_{m,n} in each document is randomly assigned a topic, after which the count matrices n_m^(k), n_k^(t) and the count arrays n_m, n_k are updated. The pseudocode of this step is shown in Fig. 8.
The second part is data-set partitioning; the partitioning principle was discussed in detail in Section 2.3, so only the implementation pseudocode is given here, shown in Fig. 9.
Last comes the Gibbs sampling process, which is divided into two parts: sampling and updating. During sampling each data set X_i is sampled; first the count matrices n_{k,i}^(t) and n_{m,i}^(k) among the sampling parameters are initialized. To avoid access conflicts between different data sets, these matrices are stored separately under each local thread, so that during sampling only the local count matrices need to be updated. After one iteration is completed, all threads write the contents of their count matrices back to the global count matrix n_k^(t), which then serves as the initial count matrix of the next iteration. This iterative process is repeated until the Gibbs sampling of the whole data set converges, at which point iteration terminates; the obtained count matrices n_m^(k), n_k^(t) and count arrays n_m, n_k are used to compute the word probability distribution under each topic and the topic probability distribution of every document. The pseudocode of the concrete implementation is shown in Fig. 10.
2.5 Model training. With the Gibbs sampling formula obtained above, the LDA model is trained on the material. The training process samples (z, w) pairs from the material by Gibbs sampling; all parameters of the model can then be estimated from the final samples. The training process is as follows:
(1) Random initialization: each word w in every news document of the Chinese material is randomly assigned a topic number z;
(2) The corpus is rescanned: for each word w, a topic is resampled according to the Gibbs sampling formula and updated in the material;
(3) The resampling of the corpus is repeated until the Gibbs sampling converges;
(4) The topic–term co-occurrence frequency matrix of the corpus is computed; this matrix is the LDA model.
3. Fig. 11 shows the processing of the F-language articles of unknown topic: the F-language LDA topic model of the articles is obtained by the same LDA topic-model generation method.
4. Similarity comparison.
The similarity between the Chinese LDA model of topic T and the LDA model of the F-language articles of unknown topic is compared using the KL distance method. The formula is:

D_KL(p, q) = Σ_j p_j log(p_j / q_j)
For all j, D_KL(p, q) = 0 when p_j = q_j. In view of the asymmetry of the KL distance formula, there is a symmetrized KL formula:

D_λ(p, q) = λ·D_KL(p, λp + (1−λ)q) + (1−λ)·D_KL(q, λp + (1−λ)q)

When λ = 0.5, the KL distance formula becomes the JS distance formula:

D_JS(p, q) = (1/2)·D_KL(p, (p+q)/2) + (1/2)·D_KL(q, (p+q)/2)
This patent selects the JS distance formula as the similarity measure.
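The KL and JS distances can be sketched as follows (the distributions are illustrative):

```python
import math

def kl(p, q):
    # D_KL(p, q) = sum_j p_j * log(p_j / q_j); terms with p_j = 0 contribute 0
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def js(p, q):
    # JS distance: the lambda = 0.5 case of the symmetrized KL formula
    m = [(pj + qj) / 2 for pj, qj in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.3, 0.2]
q = [0.5, 0.3, 0.2]
print(js(p, q))  # identical distributions -> 0.0
```

Unlike raw KL, js(p, q) == js(q, p) and the value is bounded by log 2, which makes it convenient as a similarity score between two topic distributions.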
5. Evaluation criterion. This patent uses the F-measure as evaluation criterion. The F-measure is a balanced index in information retrieval that combines the precision and recall indices. It is expressed by the following formulas:

P = N_a / N_b
R = N_a / N_c
F = 2PR / (P + R)

where N_a denotes the total number of individuals correctly identified, N_b the total number of individuals identified, and N_c the total number of individuals present in the test set; P denotes the precision and R the recall.
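A minimal sketch of the P, R, F computation (the counts are illustrative):

```python
def f_measure(n_correct, n_identified, n_in_test_set):
    """F = 2PR / (P + R), with P = Na/Nb and R = Na/Nc."""
    p = n_correct / n_identified      # precision
    r = n_correct / n_in_test_set     # recall
    return 2 * p * r / (p + r)

# 80 correctly identified, out of 100 identified and 160 present in the test set
print(f_measure(80, 100, 160))  # P = 0.8, R = 0.5 -> F = 8/13
```

The harmonic mean penalizes an imbalance between precision and recall, which is why it is preferred over a plain average for screening tasks like this one.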