CN106547822A - A kind of text relevant determines method and device - Google Patents
A kind of text relevant determines method and device Download PDFInfo
- Publication number
- CN106547822A CN106547822A CN201610865596.0A CN201610865596A CN106547822A CN 106547822 A CN106547822 A CN 106547822A CN 201610865596 A CN201610865596 A CN 201610865596A CN 106547822 A CN106547822 A CN 106547822A
- Authority
- CN
- China
- Prior art keywords
- text
- probability
- feature words
- uncorrelated
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention discloses a kind of text relevant determines method and device, the Feature Words of each text in the samples of text for the high similarity of target domain and low similarity are extracted in advance, and uncorrelated likelihood probability of each Feature Words to the associated likelihood probability of the target domain and with the target domain is calculated, methods described includes:Extract the Feature Words of pending target text;According to the corresponding associated likelihood probability of calculated each Feature Words and uncorrelated likelihood probability, it is determined that the corresponding associated likelihood probability of each Feature Words of the described pending target text for being extracted and uncorrelated likelihood probability;According to determined by, the corresponding associated likelihood probability of each Feature Words of the pending target text and uncorrelated likelihood probability, determine the correlation of the pending target text and the target domain.Using the embodiment of the present invention, the accuracy rate of target text and target domain correlation prediction is improve.
Description
Technical field
The present invention relates to technical field of internet application, more particularly to a kind of text relevant determines method and device.
Background technology
With the continuous development of Web technologies, based on the machine learning of big data have been applied in medical treatment, education, traffic,
The various fields such as amusement.And text is modal data type, usually from the Email in network, note, microblogging, opinion
Model of altar etc..The correlation prediction of target text and target domain, is common text data processing mode.
Base unit for identifying content of text is feature or characteristic item, and the process of the process of this paper is generally needed
Participle is carried out to text, so, for representing that the word of the feature or characteristic item of text is text feature word.One text can
With comprising multiple Feature Words, generally carried out between text with the Feature Words of pending target text or text and target domain it
Between correlation differentiation.In prior art, using the Feature Words of the correlated samples for extracting target domain, target text is then calculated
Feature Words and the Feature Words of samples of text between the degree of correlation, so as to judge the correlation of target text and target domain.By
Just directly judge with target domain whether related to the similarity of the Feature Words of target text in only calculating, target text can be caused
It is relatively low with the accuracy rate of target domain correlation prediction.
The content of the invention
The purpose of the embodiment of the present invention is to provide a kind of text relevant to determine method and device, to improve target text
With the accuracy rate of target domain correlation prediction.
To reach above-mentioned purpose, the embodiment of the invention discloses a kind of text relevant determines method, extract in advance and be directed to
The Feature Words of each text in the samples of text of the high similarity of target domain and low similarity, and calculate each Feature Words
Uncorrelated likelihood probability to the associated likelihood probability of the target domain and with the target domain, methods described include:
Extract the Feature Words of pending target text;
According to the corresponding associated likelihood probability of calculated each Feature Words and uncorrelated likelihood probability, it is determined that being carried
The corresponding associated likelihood probability of each Feature Words of the described pending target text for taking and uncorrelated likelihood probability;
The corresponding associated likelihood probability of each Feature Words of the pending target text and uncorrelated according to determined by
Likelihood probability, determines the correlation of the pending target text and the target domain.
Preferably, the Feature Words of each text in the samples of text for target domain are extracted, including:
For each text in the samples of text, the technology for data mining is utilized, extract the spy of the text
Levy word;
The Feature Words for extracting pending target text, including:
For the pending target text, the technology for data mining is utilized, extract the Feature Words of the text.
Preferably, the technology for data mining, including:
TF-IDF technologies, or word embedded technology.
Preferably, the associated likelihood probability for calculating each Feature Words and the target domain and with the target
The uncorrelated likelihood probability in field, including:
Obtain the Feature Words of each text in the samples of text prior probability related to the target domain and with
The uncorrelated prior probability of the target domain;
According to acquired related prior probability and uncorrelated prior probability, determine that the related of each Feature Words expects frequency
The secondary and uncorrelated expectation frequency;
It is related according to determined by expect the frequency and the uncorrelated expectation frequency, calculate each Feature Words and lead with the target
The associated likelihood probability in domain and the uncorrelated likelihood probability to the target domain.
Preferably, the corresponding associated likelihood of each Feature Words of the pending target text determined by the basis is general
Rate and uncorrelated likelihood probability, determine the correlation of the pending target text and the target domain, including:
The corresponding associated likelihood probability of each Feature Words of the pending target text and uncorrelated according to determined by
Likelihood probability, calculates the product of probability of the associated likelihood probability of the corresponding Feature Words of the pending target text and uncorrelated likelihood
The product of probability of probability;
Judge whether the product of probability of the associated likelihood probability is more than the product of probability of the uncorrelated likelihood probability;
If it is, determining that the pending target text is related to the target domain;
If not, determining that the pending target text is uncorrelated to the target domain.
To reach above-mentioned purpose, the embodiment of the invention discloses a kind of text relevant determining device, described device includes:
First extraction module, for being extracted in the samples of text for the high similarity of target domain and low similarity in advance
The Feature Words of each text, and calculate each Feature Words and the target domain associated likelihood probability and with the mesh
The uncorrelated likelihood probability in mark field;
Second extraction module, for extracting the Feature Words of pending target text;
First determining module, for according to the corresponding associated likelihood probability of calculated each Feature Words and uncorrelated
Likelihood probability, it is determined that the corresponding associated likelihood probability of each Feature Words of the described pending target text for being extracted and uncorrelated
Likelihood probability;
Second determining module, for the corresponding correlation of each Feature Words of the pending target text determined by basis
Likelihood probability and uncorrelated likelihood probability, determine the correlation of the pending target text and the target domain.
Preferably, the Feature Words of each text in the samples of text for target domain are extracted, including:
For each text in the samples of text, the technology for data mining is utilized, extract the spy of the text
Levy word;
The Feature Words for extracting pending target text, including:
For the pending target text, the technology for data mining is utilized, extract the Feature Words of the text.
Preferably, the technology for data mining, including:
TF-IDF technologies, or word embedded technology.
Preferably, the associated likelihood probability for calculating each Feature Words and the target domain and with the target
The uncorrelated likelihood probability in field, including:
Obtain the Feature Words of each text in the samples of text prior probability related to the target domain and with
The uncorrelated prior probability of the target domain;
According to acquired related prior probability and uncorrelated prior probability, determine this feature word it is related expect the frequency and
The uncorrelated expectation frequency;
It is related according to determined by expect the frequency and the uncorrelated expectation frequency, calculate each Feature Words and lead with the target
The associated likelihood probability in domain and the uncorrelated likelihood probability to the target domain.
Preferably, second determining module, including:
Calculating sub module, for the pending target text determined by basis each Feature Words it is corresponding it is related seemingly
So probability and uncorrelated likelihood probability, calculate the probability of the associated likelihood probability of the corresponding Feature Words of the pending target text
The product of probability of product and uncorrelated likelihood probability;
Judging submodule, for judging the product of probability of the associated likelihood probability whether more than the uncorrelated likelihood probability
Product of probability;
Determination sub-module, in the case of being to be in the judging submodule judged result, determines the pending mesh
Mark text is related to the target domain;In the case where the judging submodule judged result is no, determine described pending
Target text is uncorrelated to the target domain.
As seen from the above technical solutions, a kind of text relevant provided in an embodiment of the present invention determines method and device,
Extract in advance the Feature Words of each text in the samples of text for target domain, and calculate each Feature Words with it is described
The associated likelihood probability of target domain and the uncorrelated likelihood probability to the target domain, methods described include:Extraction is treated
Process the Feature Words of target text;According to the corresponding associated likelihood probability of calculated each Feature Words and uncorrelated likelihood
Probability, it is determined that the corresponding associated likelihood probability of each Feature Words of the described pending target text for being extracted and uncorrelated likelihood
Probability;The corresponding associated likelihood probability of each Feature Words of the pending target text and uncorrelated likelihood according to determined by
Probability, determines the correlation of the pending target text and the target domain.
Using technical scheme provided in an embodiment of the present invention, according to the high similarity of target domain and the text sample of low similarity
The associated likelihood probability and uncorrelated likelihood probability of this corresponding Feature Words, obtains each spy that pending target text is extracted
The associated likelihood probability and uncorrelated likelihood probability of word are levied, it is general further according to the corresponding associated likelihood of all Feature Words of target text
Rate and uncorrelated likelihood probability determine its correlation with target domain, compared to existing technology in only by calculating pending target
Whether the Feature Words of text determine target text to the degree of correlation of the Feature Words of samples of text related with target domain, increased
The comparison of the irrelevance of Feature Words Feature Words corresponding with samples of text, improve Feature Words and target domain correlation and
It is comprehensive that irrelevance judges.This improves the accuracy rate of target text and target domain correlation prediction.
Certainly, the arbitrary product or method for implementing the present invention must be not necessarily required to while reaching all the above excellent
Point.
Description of the drawings
In order to be illustrated more clearly that the embodiment of the present invention or technical scheme of the prior art, below will be to embodiment or existing
Accompanying drawing to be used needed for having technology description is briefly described, it should be apparent that, drawings in the following description are only this
Some embodiments of invention, for those of ordinary skill in the art, on the premise of not paying creative work, can be with
Other accompanying drawings are obtained according to these accompanying drawings.
Fig. 1 is the schematic flow sheet that a kind of text relevant provided in an embodiment of the present invention determines method;
Fig. 2 is a kind of structural representation of text relevant determining device provided in an embodiment of the present invention.
Specific embodiment
Below in conjunction with the accompanying drawing in the embodiment of the present invention, the technical scheme in the embodiment of the present invention is carried out clear, complete
Site preparation is described, it is clear that described embodiment is only a part of embodiment of the invention, rather than the embodiment of whole.It is based on
Embodiment in the present invention, it is every other that those of ordinary skill in the art are obtained under the premise of creative work is not made
Embodiment, belongs to the scope of protection of the invention.
To solve prior art problem, embodiments provide a kind of text relevant and determine method and device, with
It is lower to be described in detail respectively.
It should be noted that according to target domain, substantial amounts of samples of text can be obtained ahead of time, samples of text is target neck
The high similarity text and low similarity text in domain.Therefore, the feature according to each Text Feature Extraction in samples of text out
Word is also different from the degree of correlation of target domain, and related degree can pass through the related prior probability of Feature Words and target domain
Showed with incoherent prior probability.The Feature Words for extracting text are prior art, and embodiment of the present invention here is not done superfluous
State.
The feature of each text in the samples of text for the high similarity of target domain and low similarity is extracted in advance
Word, and calculate each Feature Words to the associated likelihood probability of the target domain and with the target domain it is uncorrelated seemingly
Right probability.
Specifically, the Feature Words of each text in the samples of text for target domain are extracted, can be directed to described
Each text in samples of text, utilizes the technology for data mining, extracts the Feature Words of the text.
Specifically, the technology for data mining, can include:TF-IDF technologies, or word embedded technology.
Specifically, the associated likelihood probability for calculating each Feature Words and the target domain and with the target
The uncorrelated likelihood probability in field, can obtain each text in the samples of text Feature Words with the target domain
Related prior probability and the uncorrelated prior probability to the target domain;According to acquired related prior probability and uncorrelated
Prior probability, determines that the related of each Feature Words expects the frequency and the uncorrelated expectation frequency;It is related according to determined by expect
The frequency and the uncorrelated expectation frequency, calculate each Feature Words and the target domain associated likelihood probability and with the mesh
The uncorrelated likelihood probability in mark field.
It will be appreciated by persons skilled in the art that entering for the samples of text of the high similarity of target domain and low similarity
The extraction of style of writing eigen word, can adopt TF-IDF technologies, or word embedded technology, and according to the feature of all samples of text
Word, statistics obtain the frequency that each Feature Words occurs, but this feature word is in the related text and non-phase text of target domain
In distribution need further to obtain.For each Feature Words, can obtain rule of thumb or by way of directly giving
To a prior probability related to target domain and an incoherent prior probability, i.e., related prior probability and uncorrelated elder generation
Test probability.According to the related prior probability and uncorrelated prior probability and the frequency for occurring of Feature Words, by EM
(Expectation-maximization, expectation maximization) algorithm, obtains the related of each Feature Words and expects the frequency and not
It is related to expect the frequency;When a text is given, the dependent probability and uncorrelated probability under the text can also be obtained.And calculate
Under the present conditions, for each Feature Words, can be according to this feature word with the associated likelihood probability of target domain
It is related to expect that the frequency ratio for expecting frequency sum related to all Feature Words is obtained, in the same manner, this feature word and mesh is obtained
The uncorrelated likelihood probability in mark field.EM algorithms are prior art, and the embodiment of the present invention do not repeated to which.
Exemplary, with film《Mermaid》Film review be target domain, the Feature Words of acquisition are A1、A2、A3、A4、A5、
A6、A7、A8, according to statistics, obtain the frequency shown in table 1 and obtain its related prior probability and uncorrelated prior probability.Root
The related expectation frequency and the uncorrelated expectation frequency and corresponding associated likelihood probability are obtained according to EM algorithms and uncorrelated likelihood is general
Rate.With Feature Words A1As a example by, calculating its associated likelihood probability is:8/ (8+4+8+10+5+10+8+10)=13%;It is uncorrelated seemingly
So probability is:2/ (2+1+6+4+25+15+2+35)=2%, calculates A successively2、A3、A4、A5、A6、A7、A8Associated likelihood probability
With uncorrelated likelihood probability.What the prior probability and the frequency of embodiment of the present invention Feature Words was merely exemplary, do not constitute to this
The restriction of invention.
In practical application, target domain includes substantial amounts of Feature Words, obtains the related expectation frequency and non-phase according to EM algorithms
Close and expect the frequency and then after obtaining associated likelihood probability and uncorrelated likelihood probability, further according to associated likelihood probability and it is uncorrelated seemingly
So probability redefines the correlation classification of samples of text and target domain, and the expectation frequency of Feature Words is updated according to classification results
It is secondary, until target domain reaches convergence state in related and incoherent two classifications, updated according to the result for redefining
The related of Feature Words expects the frequency and the uncorrelated expectation frequency.So repeatedly by the correlation models training of EM, model
Update and assess until meeting permissible accuracy, the iteration for carrying out EM algorithms updates.
It will be appreciated by persons skilled in the art that the embodiment of the present invention is using samples of text (high related text and low
Related text) Feature Words that extract, and calculate probability of these Feature Words in the correlation and irrelevance of target domain, root
The probability distribution of Feature Words is corrected again can according to these probability, according to revised probability distribution from new calculating samples of text
Degree of correlation.Through iterating, thus it is possible to vary the original probability distribution of Feature Words, so as to both contain correlated characteristic word to those
Text containing uncorrelated features word is effectively distinguished again, improves the precision of classification.
Table 1
Fig. 1 is the schematic flow sheet that a kind of text relevant provided in an embodiment of the present invention determines method, including following step
Suddenly:
S101, extracts the Feature Words of pending target text.
Specifically, the Feature Words for extracting pending target text, can be directed to the pending target text, utilize
For the technology of data mining, the Feature Words of the text are extracted.
Specifically, the technology for data mining, can include:TF-IDF technologies, or word embedded technology.
S102, according to the corresponding associated likelihood probability of calculated each Feature Words and uncorrelated likelihood probability, really
The corresponding associated likelihood probability of each Feature Words and uncorrelated likelihood probability of the fixed described pending target text for being extracted.
S103, the corresponding associated likelihood probability of each Feature Words of the pending target text and not according to determined by
Associated likelihood probability, determines the correlation of the pending target text and the target domain.
Specifically, the corresponding associated likelihood of each Feature Words of the pending target text determined by the basis is general
Rate and uncorrelated likelihood probability, determine the correlation of the pending target text and the target domain, can according to really
The corresponding associated likelihood probability of each Feature Words of fixed described pending target text and uncorrelated likelihood probability, calculate described
The product of probability and the product of probability of uncorrelated likelihood probability of the associated likelihood probability of the corresponding Feature Words of pending target text;Judge
Whether the product of probability of the associated likelihood probability is more than the product of probability of the uncorrelated likelihood probability;If it is, treating described in determining
Process target text related to the target domain;If not, determining the pending target text with the target domain not
It is related.
Exemplary, judge whether pending text " children's story of mermaid, very touching " belongs to film《Beauty
Fish》Film review this target domain, extract the Feature Words of pending target text first, the technology of employing is for data mining
TF-IDF technologies or word embedded technology.The Feature Words that hypothesis is extracted are A1、A2、A5And A8, A is searched from table 11、A2、A5
And A8Corresponding associated likelihood probability and uncorrelated likelihood probability, respectively:13%th, 6%, 8%, 16% and 2%, 1%,
28%th, 39%.What the quantity of the Feature Words of Text Feature Extraction was merely exemplary, do not constitute the restriction to the embodiment of the present invention.
It will be appreciated by persons skilled in the art that all Feature Words of pending target text are obtained through S102 steps
Corresponding associated likelihood probability and uncorrelated likelihood probability, and then judge whether target text is related to target domain.Generally, may be used
To be judged by the way of product of probability is compared, Ke Yiwei:According to the product of probability of the associated likelihood probability of all Feature Words with
The size of the product of probability of uncorrelated likelihood probability is determined;Can also be using the product of probability or non-phase with associated likelihood probability
The product of probability and the threshold value of setting for closing likelihood probability is compared, so that it is determined that whether target text is related to target domain.
When the product of probability of the associated likelihood probability according to Feature Words is carried out to the size of the product of probability of uncorrelated likelihood probability
When determining, for all Feature Words of pending target text, the product of probability of associated likelihood probability is:13%*6%*
8%*16%=0.0101%;The product of probability of uncorrelated likelihood probability is:2%*1%*28%*39%=0.0027%, because
0.0101%>0.0027%, so determining related text of the pending target text for target domain.In practical application, according to
The determination result of target text carry out text and character pair word distribution renewal, e.g., can using text as target domain height
The samples of text of similarity or low similarity, or carry out the amendment of prior probability of the frequency of Feature Words, Feature Words etc.,
In follow-up content of text judges, to improve the accuracy rate of feature word judgment.Wherein, for being not logged in Feature Words in text,
Present invention probability of the setting unregistered word in associated class is much smaller than its probability in uncorrelated class, so as to strengthen the solution of TF-IDF
Release ability.
It can be seen that, using the embodiment shown in Fig. 1 of the present invention, according to the high similarity of target domain and the text sample of low similarity
The associated likelihood probability and uncorrelated likelihood probability of this corresponding Feature Words, obtains each spy that pending target text is extracted
The associated likelihood probability and uncorrelated likelihood probability of word are levied, it is general further according to the corresponding associated likelihood of all Feature Words of target text
Rate and uncorrelated likelihood probability determine its correlation with target domain, compared to existing technology in only by calculating pending target
Whether the Feature Words of text determine target text to the degree of correlation of the Feature Words of samples of text related with target domain, increased
The comparison of the irrelevance of Feature Words Feature Words corresponding with samples of text, improve Feature Words and target domain correlation and
It is comprehensive that irrelevance judges.This improves the accuracy rate of target text and target domain correlation prediction.
Fig. 2 is a kind of structural representation of text relevant determining device provided in an embodiment of the present invention, can be included:The
One extraction module 201, the second extraction module 202, the first determining module 203 and the second determining module 204.
First extraction module 201, for extracting the samples of text for the high similarity of target domain and low similarity in advance
In each text Feature Words, and calculate each Feature Words and the target domain associated likelihood probability and with institute
State the uncorrelated likelihood probability of target domain.
Specifically, the Feature Words of each text in the samples of text for target domain in practical application, are extracted, can
For each text in the samples of text, to utilize the technology for data mining, the Feature Words of the text are extracted.
Specifically, in practical application, the technology for data mining can include:TF-IDF technologies, or word is embedding
Enter technology.
Specifically, in practical application, the associated likelihood probability for calculating each Feature Words and the target domain with
And the uncorrelated likelihood probability to the target domain, can obtain each text in the samples of text Feature Words with institute
State the related prior probability and the uncorrelated prior probability to the target domain of target domain;According to acquired related priori
Probability and uncorrelated prior probability, determine that the related of this feature word expects the frequency and the uncorrelated expectation frequency;According to determined by
It is related to expect the frequency and the uncorrelated expectation frequency, calculate each Feature Words and the target domain associated likelihood probability and
To the uncorrelated likelihood probability of the target domain.
Second extraction module 202, for extracting the Feature Words of pending target text.
Specifically, in practical application, the Feature Words for extracting pending target text can be directed to the pending mesh
Mark text, utilizes the technology for data mining, extracts the Feature Words of the text.
Specifically, in practical application, the technology for data mining can include:TF-IDF technologies, or word is embedding
Enter technology.
First determining module 203, for according to the corresponding associated likelihood probability of calculated each Feature Words and not
Associated likelihood probability, it is determined that the corresponding associated likelihood probability of each Feature Words of the described pending target text for being extracted and not
Associated likelihood probability.
Second determining module 204, each Feature Words for the pending target text determined by basis are corresponding
Associated likelihood probability and uncorrelated likelihood probability, determine the correlation of the pending target text and the target domain.
Specifically, in practical application, second determining module 204 can include:Calculating sub module, judging submodule,
Determination sub-module (not shown);Wherein,
Calculating sub module, for the pending target text determined by basis each Feature Words it is corresponding it is related seemingly
So probability and uncorrelated likelihood probability, calculate the probability of the associated likelihood probability of the corresponding Feature Words of the pending target text
The product of probability of product and uncorrelated likelihood probability;
Judging submodule, for judging the product of probability of the associated likelihood probability whether more than the uncorrelated likelihood probability
Product of probability;
Determination sub-module, in the case of being to be in the judging submodule judged result, determines the pending mesh
Mark text is related to the target domain;In the case where the judging submodule judged result is no, determine described pending
Target text is uncorrelated to the target domain.
It can be seen that, using the embodiment shown in Fig. 2 of the present invention, according to the high similarity of target domain and the text sample of low similarity
The associated likelihood probability and uncorrelated likelihood probability of this corresponding Feature Words, obtains each spy that pending target text is extracted
The associated likelihood probability and uncorrelated likelihood probability of word are levied, it is general further according to the corresponding associated likelihood of all Feature Words of target text
Rate and uncorrelated likelihood probability determine its correlation with target domain, compared to existing technology in only by calculating pending target
Whether the Feature Words of text determine target text to the degree of correlation of the Feature Words of samples of text related with target domain, increased
The comparison of the irrelevance of Feature Words Feature Words corresponding with samples of text, improve Feature Words and target domain correlation and
It is comprehensive that irrelevance judges.This improves the accuracy rate of target text and target domain correlation prediction.
It should be noted that herein, such as first and second or the like relational terms are used merely to a reality
Body or operation are made a distinction with another entity or operation, and are not necessarily required or implied these entities or deposit between operating
In any this actual relation or order.And, term " including ", "comprising" or its any other variant are intended to
Nonexcludability is included, so that a series of process, method, article or equipment including key elements not only will including those
Element, but also including other key elements being not expressly set out, or also include for this process, method, article or equipment
Intrinsic key element.In the absence of more restrictions, the key element for being limited by sentence "including a ...", it is not excluded that
Also there is other identical element in process, method, article or equipment including the key element.
Each embodiment in this specification is described by the way of correlation, identical similar portion between each embodiment
Divide mutually referring to what each embodiment was stressed is the difference with other embodiment.Especially for device reality
For applying example, as which is substantially similar to embodiment of the method, so description is fairly simple, related part is referring to embodiment of the method
Part explanation.
One of ordinary skill in the art will appreciate that all or part of step in realizing said method embodiment is can
Instruct related hardware to complete with by program, described program can be stored in computer read/write memory medium,
The storage medium for obtaining designated herein, such as:ROM/RAM, magnetic disc, CD etc..
Presently preferred embodiments of the present invention is the foregoing is only, protection scope of the present invention is not intended to limit.It is all
Any modification, equivalent substitution and improvements made within the spirit and principles in the present invention etc., are all contained in protection scope of the present invention
It is interior.
Claims (10)
1. a kind of text relevant determines method, it is characterised in that extract in advance for the high similarity of target domain and low phase seemingly
The Feature Words of each text in the samples of text of degree, and calculate the associated likelihood of each Feature Words and the target domain
Probability and the uncorrelated likelihood probability to the target domain, methods described include:
Extract the Feature Words of pending target text;
According to the corresponding associated likelihood probability of calculated each Feature Words and uncorrelated likelihood probability, it is determined that extracted
The corresponding associated likelihood probability of each Feature Words of the pending target text and uncorrelated likelihood probability;
The corresponding associated likelihood probability of each Feature Words of the pending target text and uncorrelated likelihood according to determined by
Probability, determines the correlation of the pending target text and the target domain.
2. method according to claim 1, it is characterised in that extract each in the samples of text for target domain
The Feature Words of text, including:
For each text in the samples of text, the technology for data mining is utilized, extract the Feature Words of the text;
The Feature Words for extracting pending target text, including:
For the pending target text, the technology for data mining is utilized, extract the Feature Words of the text.
3. method according to claim 2, it is characterised in that the technology for data mining, including:
TF-IDF technologies, or word embedded technology.
4. method according to claim 1, it is characterised in that each Feature Words of the calculating and the target domain
Associated likelihood probability and the uncorrelated likelihood probability to the target domain, including:
Obtain the Feature Words of each text in the samples of text prior probability related to the target domain and with it is described
The uncorrelated prior probability of target domain;
According to acquired related prior probability and uncorrelated prior probability, determine each Feature Words it is related expect the frequency and
The uncorrelated expectation frequency;
It is related according to determined by expect the frequency and the uncorrelated expectation frequency, each Feature Words is calculated with the target domain
Associated likelihood probability and the uncorrelated likelihood probability to the target domain.
5. method according to claim 1, it is characterised in that the pending target text determined by the basis
Each corresponding associated likelihood probability of Feature Words and uncorrelated likelihood probability, determine the pending target text and the target
The correlation in field, including:
The corresponding associated likelihood probability of each Feature Words of the pending target text and uncorrelated likelihood according to determined by
Probability, calculates the product of probability and uncorrelated likelihood probability of the associated likelihood probability of the corresponding Feature Words of the pending target text
Product of probability;
Judge whether the product of probability of the associated likelihood probability is more than the product of probability of the uncorrelated likelihood probability;
If it is, determining that the pending target text is related to the target domain;
If not, determining that the pending target text is uncorrelated to the target domain.
6. a kind of text relevant determining device, it is characterised in that described device includes:
First extraction module, it is each in the samples of text for the high similarity of target domain and low similarity for extracting in advance
The Feature Words of individual text, and calculate each Feature Words with the associated likelihood probability of the target domain and lead with the target
The uncorrelated likelihood probability in domain;
Second extraction module, for extracting the Feature Words of pending target text;
First determining module, for according to the corresponding associated likelihood probability of calculated each Feature Words and uncorrelated likelihood
Probability, it is determined that the corresponding associated likelihood probability of each Feature Words of the described pending target text for being extracted and uncorrelated likelihood
Probability;
Second determining module, for the corresponding associated likelihood of each Feature Words of the pending target text determined by basis
Probability and uncorrelated likelihood probability, determine the correlation of the pending target text and the target domain.
7. device according to claim 6, it is characterised in that extract each in the samples of text for target domain
The Feature Words of text, including:
For each text in the samples of text, the technology for data mining is utilized, extract the Feature Words of the text;
The Feature Words for extracting pending target text, including:
For the pending target text, the technology for data mining is utilized, extract the Feature Words of the text.
8. device according to claim 7, it is characterised in that the technology for data mining, including:
TF-IDF technologies, or word embedded technology.
9. device according to claim 6, it is characterised in that each Feature Words of the calculating and the target domain
Associated likelihood probability and the uncorrelated likelihood probability to the target domain, including:
Obtain the Feature Words of each text in the samples of text prior probability related to the target domain and with it is described
The uncorrelated prior probability of target domain;
According to acquired related prior probability and uncorrelated prior probability, determine that the related of this feature word expects the frequency and non-phase
Close and expect the frequency;
It is related according to determined by expect the frequency and the uncorrelated expectation frequency, each Feature Words is calculated with the target domain
Associated likelihood probability and the uncorrelated likelihood probability to the target domain.
10. device according to claim 6, it is characterised in that second determining module, including:
Calculating sub module, the corresponding associated likelihood of each Feature Words for the pending target text determined by basis are general
Rate and uncorrelated likelihood probability, calculate the associated likelihood probability of the corresponding Feature Words of the pending target text product of probability and
The product of probability of uncorrelated likelihood probability;
Whether judging submodule, the product of probability for judging the associated likelihood probability are general more than the uncorrelated likelihood probability
Rate is accumulated;
Determination sub-module, for literary in the case of being, to determine the pending target in the judging submodule judged result
This is related to the target domain;In the case where the judging submodule judged result is no, the pending target is determined
Text is uncorrelated to the target domain.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610865596.0A CN106547822A (en) | 2016-09-29 | 2016-09-29 | A kind of text relevant determines method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610865596.0A CN106547822A (en) | 2016-09-29 | 2016-09-29 | A kind of text relevant determines method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106547822A true CN106547822A (en) | 2017-03-29 |
Family
ID=58368401
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610865596.0A Pending CN106547822A (en) | 2016-09-29 | 2016-09-29 | A kind of text relevant determines method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106547822A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107402984A (en) * | 2017-07-11 | 2017-11-28 | 北京金堤科技有限公司 | A kind of sorting technique and device based on theme |
CN108932525A (en) * | 2018-06-07 | 2018-12-04 | 阿里巴巴集团控股有限公司 | A kind of behavior prediction method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049470A (en) * | 2012-09-12 | 2013-04-17 | 北京航空航天大学 | Opinion retrieval method based on emotional relevancy |
CN103150371A (en) * | 2013-03-08 | 2013-06-12 | 北京理工大学 | Confusion removal text retrieval method based on positive and negative training |
CN105760474A (en) * | 2016-02-14 | 2016-07-13 | Tcl集团股份有限公司 | Document collection feature word extracting method and system based on position information |
-
2016
- 2016-09-29 CN CN201610865596.0A patent/CN106547822A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049470A (en) * | 2012-09-12 | 2013-04-17 | 北京航空航天大学 | Opinion retrieval method based on emotional relevancy |
CN103150371A (en) * | 2013-03-08 | 2013-06-12 | 北京理工大学 | Confusion removal text retrieval method based on positive and negative training |
CN105760474A (en) * | 2016-02-14 | 2016-07-13 | Tcl集团股份有限公司 | Document collection feature word extracting method and system based on position information |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107402984A (en) * | 2017-07-11 | 2017-11-28 | 北京金堤科技有限公司 | A kind of sorting technique and device based on theme |
CN108932525A (en) * | 2018-06-07 | 2018-12-04 | 阿里巴巴集团控股有限公司 | A kind of behavior prediction method and device |
CN108932525B (en) * | 2018-06-07 | 2022-04-29 | 创新先进技术有限公司 | Behavior prediction method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11227118B2 (en) | Methods, devices, and systems for constructing intelligent knowledge base | |
CN105045812B (en) | The classification method and system of text subject | |
CN110532353B (en) | Text entity matching method, system and device based on deep learning | |
CN109284397A (en) | A kind of construction method of domain lexicon, device, equipment and storage medium | |
CN107526800A (en) | Device, method and the computer-readable recording medium of information recommendation | |
CN109117474B (en) | Statement similarity calculation method and device and storage medium | |
CN109492217B (en) | Word segmentation method based on machine learning and terminal equipment | |
CN105095222B (en) | Uniterm replacement method, searching method and device | |
CN111488732B (en) | Method, system and related equipment for detecting deformed keywords | |
WO2014022172A2 (en) | Information classification based on product recognition | |
CN104866478A (en) | Detection recognition method and device of malicious text | |
CN110188357B (en) | Industry identification method and device for objects | |
CN111291177A (en) | Information processing method and device and computer storage medium | |
CN111881671A (en) | Attribute word extraction method | |
CN105912525A (en) | Sentiment classification method for semi-supervised learning based on theme characteristics | |
CN107861945A (en) | Finance data analysis method, application server and computer-readable recording medium | |
CN115456043A (en) | Classification model processing method, intent recognition method, device and computer equipment | |
CN113704393A (en) | Keyword extraction method, device, equipment and medium | |
CN113408282B (en) | Method, device, equipment and storage medium for topic model training and topic prediction | |
CN106547822A (en) | A kind of text relevant determines method and device | |
CN110019556B (en) | Topic news acquisition method, device and equipment thereof | |
CN107291686B (en) | Method and system for identifying emotion identification | |
CN107688594A (en) | The identifying system and method for risk case based on social information | |
CN116561298A (en) | Title generation method, device, equipment and storage medium based on artificial intelligence | |
CN109359274A (en) | The method, device and equipment that the character string of a kind of pair of Mass production is identified |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170329 |
|
RJ01 | Rejection of invention patent application after publication |