CN106547822A - Text relevance determination method and device - Google Patents

Text relevance determination method and device

Info

Publication number
CN106547822A
Authority
CN
China
Prior art keywords
text
probability
feature words
uncorrelated
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610865596.0A
Other languages
Chinese (zh)
Inventor
鲍昕平
沈一
蔡龙军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201610865596.0A priority Critical patent/CN106547822A/en
Publication of CN106547822A publication Critical patent/CN106547822A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention discloses a text relevance determination method and device. Feature words of each text in a set of text samples of high similarity and low similarity to a target domain are extracted in advance, and a related likelihood probability and an unrelated likelihood probability with respect to the target domain are calculated for each feature word. The method includes: extracting feature words of a target text to be processed; determining, according to the calculated related and unrelated likelihood probabilities of each feature word, the related likelihood probability and the unrelated likelihood probability corresponding to each extracted feature word of the target text to be processed; and determining the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of its feature words. Embodiments of the present invention improve the accuracy of predicting the relevance between a target text and a target domain.

Description

Text relevance determination method and device
Technical field
The present invention relates to the technical field of Internet applications, and in particular to a text relevance determination method and device.
Background art
With the continuous development of Web technologies, machine learning based on big data has been applied in various fields such as medical care, education, transportation and entertainment. Text is the most common type of data, typically coming from e-mails, short messages, microblogs, forum posts and the like on the network. Predicting the relevance between a target text and a target domain is a common way of processing text data.
The basic unit used to identify the content of a text is a feature, or feature item, and processing a text generally requires word segmentation; the words representing the features or feature items of a text are therefore called its feature words. A text may contain multiple feature words, and relevance between texts, or between a target text to be processed and a target domain, is usually judged by means of these feature words. In the prior art, the feature words of related samples of the target domain are extracted, and the degree of correlation between the feature words of the target text and the feature words of the text samples is then calculated in order to judge the relevance of the target text to the target domain. Because only the similarity of the feature words of the target text is calculated and used to judge directly whether the text is related to the target domain, the accuracy of predicting the relevance between the target text and the target domain is relatively low.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a text relevance determination method and device, so as to improve the accuracy of predicting the relevance between a target text and a target domain.
To achieve the above purpose, an embodiment of the present invention discloses a text relevance determination method, in which the feature words of each text in a set of text samples of high similarity and low similarity to a target domain are extracted in advance, and a related likelihood probability and an unrelated likelihood probability with respect to the target domain are calculated for each feature word. The method includes:
extracting feature words of a target text to be processed;
determining, according to the calculated related and unrelated likelihood probabilities of each feature word, the related likelihood probability and the unrelated likelihood probability corresponding to each extracted feature word of the target text to be processed;
determining the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of each of its feature words.
Preferably, extracting the feature words of each text in the text samples for the target domain includes:
for each text in the text samples, extracting the feature words of that text using a data mining technique;
and extracting the feature words of the target text to be processed includes:
for the target text to be processed, extracting the feature words of that text using a data mining technique.
Preferably, the data mining technique includes:
a TF-IDF technique, or a word embedding technique.
Preferably, calculating the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain includes:
obtaining, for the feature words of each text in the text samples, a prior probability of being related to the target domain and a prior probability of being unrelated to the target domain;
determining, according to the obtained related and unrelated prior probabilities, the related expected frequency and the unrelated expected frequency of each feature word;
calculating, according to the determined related and unrelated expected frequencies, the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain.
Preferably, determining the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of each of its feature words includes:
calculating, according to the determined related and unrelated likelihood probabilities of each feature word of the target text to be processed, the product of the related likelihood probabilities and the product of the unrelated likelihood probabilities of its feature words;
judging whether the product of the related likelihood probabilities is greater than the product of the unrelated likelihood probabilities;
if so, determining that the target text to be processed is related to the target domain;
if not, determining that the target text to be processed is unrelated to the target domain.
To achieve the above purpose, an embodiment of the present invention also discloses a text relevance determination device, which includes:
a first extraction module, configured to extract in advance the feature words of each text in a set of text samples of high similarity and low similarity to a target domain, and to calculate a related likelihood probability and an unrelated likelihood probability of each feature word with respect to the target domain;
a second extraction module, configured to extract feature words of a target text to be processed;
a first determining module, configured to determine, according to the calculated related and unrelated likelihood probabilities of each feature word, the related likelihood probability and the unrelated likelihood probability corresponding to each extracted feature word of the target text to be processed;
a second determining module, configured to determine the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of each of its feature words.
Preferably, extracting the feature words of each text in the text samples for the target domain includes:
for each text in the text samples, extracting the feature words of that text using a data mining technique;
and extracting the feature words of the target text to be processed includes:
for the target text to be processed, extracting the feature words of that text using a data mining technique.
Preferably, the data mining technique includes:
a TF-IDF technique, or a word embedding technique.
Preferably, calculating the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain includes:
obtaining, for the feature words of each text in the text samples, a prior probability of being related to the target domain and a prior probability of being unrelated to the target domain;
determining, according to the obtained related and unrelated prior probabilities, the related expected frequency and the unrelated expected frequency of that feature word;
calculating, according to the determined related and unrelated expected frequencies, the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain.
Preferably, the second determining module includes:
a calculating submodule, configured to calculate, according to the determined related and unrelated likelihood probabilities of each feature word of the target text to be processed, the product of the related likelihood probabilities and the product of the unrelated likelihood probabilities of its feature words;
a judging submodule, configured to judge whether the product of the related likelihood probabilities is greater than the product of the unrelated likelihood probabilities;
a determining submodule, configured to determine that the target text to be processed is related to the target domain when the judgment result of the judging submodule is yes, and to determine that the target text to be processed is unrelated to the target domain when the judgment result of the judging submodule is no.
As can be seen from the above technical solutions, in the text relevance determination method and device provided by the embodiments of the present invention, the feature words of each text in a set of text samples for a target domain are extracted in advance, and a related likelihood probability and an unrelated likelihood probability with respect to the target domain are calculated for each feature word. The method includes: extracting feature words of a target text to be processed; determining, according to the calculated related and unrelated likelihood probabilities of each feature word, the related likelihood probability and the unrelated likelihood probability corresponding to each extracted feature word of the target text to be processed; and determining the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of its feature words.
With the technical solution provided by the embodiments of the present invention, the related and unrelated likelihood probabilities of each feature word extracted from the target text to be processed are obtained from the related and unrelated likelihood probabilities of the feature words of the text samples of high similarity and low similarity to the target domain, and the relevance of the target text to the target domain is then determined from the related and unrelated likelihood probabilities of all of its feature words. Compared with the prior art, in which whether the target text is related to the target domain is determined only by calculating the degree of correlation between the feature words of the target text to be processed and the feature words of the text samples, this adds a comparison of the degree of non-relevance of the feature words with the feature words of the text samples, and makes the judgment of the relevance and non-relevance between the feature words and the target domain more comprehensive. The accuracy of predicting the relevance between the target text and the target domain is thereby improved.
Of course, any product or method implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.
Description of the drawings
In order to describe the technical solutions of the embodiments of the present invention and of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a schematic flow chart of a text relevance determination method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a text relevance determination device provided by an embodiment of the present invention.
Specific embodiments
The technical solutions of the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them; all other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To solve the problems of the prior art, embodiments of the present invention provide a text relevance determination method and device, which are described in detail below.
It should be noted that a large number of text samples can be obtained in advance for the target domain; the text samples are texts of high similarity and of low similarity to the target domain. The feature words extracted from the texts in the samples therefore differ in their degree of correlation with the target domain, and this degree of correlation can be expressed by the prior probabilities of a feature word being related and being unrelated to the target domain. Extracting the feature words of a text is prior art and is not described in detail in the embodiments of the present invention.
The feature words of each text in the text samples of high similarity and low similarity to the target domain are extracted in advance, and a related likelihood probability and an unrelated likelihood probability with respect to the target domain are calculated for each feature word.
Specifically, to extract the feature words of each text in the text samples for the target domain, the feature words of each text in the text samples can be extracted using a data mining technique.
Specifically, the data mining technique can include a TF-IDF technique or a word embedding technique.
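By way of non-limiting illustration only (the embodiment names TF-IDF and word embedding techniques but prescribes no particular implementation), the following Python sketch selects each text's top-weighted TF-IDF terms as its feature words. The use of scikit-learn, the assumption that texts are already word-segmented into space-separated tokens, and the top-k cutoff are implementation choices of this sketch, not part of the disclosure.

```python
# Illustrative sketch: extract feature words per text with TF-IDF.
# Assumes the texts are already word-segmented into space-separated tokens.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_feature_words(texts, top_k=5):
    """Return, for each text, its top_k terms ranked by TF-IDF weight."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(texts)        # shape: (n_texts, n_terms)
    terms = vectorizer.get_feature_names_out()
    feature_words = []
    for row in tfidf.toarray():
        ranked = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)
        feature_words.append([term for term, weight in ranked[:top_k] if weight > 0])
    return feature_words

# Toy usage on hypothetical, already-segmented sample texts:
samples = ["美人鱼 的 童话 故事 很 感人", "股票 市场 今天 大涨"]
print(extract_feature_words(samples, top_k=3))
```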
Specifically, calculating the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain can include: obtaining, for the feature words of each text in the text samples, a prior probability of being related to the target domain and a prior probability of being unrelated to the target domain; determining, according to the obtained related and unrelated prior probabilities, the related expected frequency and the unrelated expected frequency of each feature word; and calculating, according to the determined related and unrelated expected frequencies, the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain.
Those skilled in the art will understand that the feature words of the text samples of high similarity and low similarity to the target domain can be extracted with a TF-IDF technique or a word embedding technique, and that the frequency of occurrence of each feature word can be obtained by statistics over the feature words of all text samples; the distribution of each feature word over texts related and unrelated to the target domain, however, still needs to be obtained. For each feature word, a prior probability of being related to the target domain and a prior probability of being unrelated to it, i.e. a related prior probability and an unrelated prior probability, can be obtained empirically or given directly. From these related and unrelated prior probabilities and the frequencies of occurrence of the feature words, the related expected frequency and the unrelated expected frequency of each feature word are obtained by the EM (Expectation-Maximization) algorithm; given a text, the related and unrelated probabilities for that text can also be obtained. Under these conditions, the related likelihood probability of a feature word with respect to the target domain can be obtained as the ratio of its related expected frequency to the sum of the related expected frequencies of all feature words, and its unrelated likelihood probability with respect to the target domain is obtained in the same way. The EM algorithm is prior art and is not described in detail in the embodiments of the present invention.
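To make the last step above concrete, the following sketch (not part of the patent text) normalizes each feature word's related and unrelated expected frequencies by the corresponding totals over all feature words. The expected-frequency values are the illustrative ones from the worked example that follows, so the sketch reproduces its 13% and 2% figures for feature word A1.

```python
# Sketch: likelihood probabilities from expected frequencies (illustrative values only).
related_freq   = {"A1": 8, "A2": 4, "A3": 8, "A4": 10, "A5": 5,  "A6": 10, "A7": 8, "A8": 10}
unrelated_freq = {"A1": 2, "A2": 1, "A3": 6, "A4": 4,  "A5": 25, "A6": 15, "A7": 2, "A8": 35}

def likelihoods(expected_freq):
    """Normalize each word's expected frequency by the total over all feature words."""
    total = sum(expected_freq.values())
    return {word: freq / total for word, freq in expected_freq.items()}

related_lh = likelihoods(related_freq)      # related_lh["A1"]   = 8 / 63 ≈ 0.13
unrelated_lh = likelihoods(unrelated_freq)  # unrelated_lh["A1"] = 2 / 90 ≈ 0.02
```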
As an example, take the reviews of the film "The Mermaid" as the target domain. The feature words obtained are A1, A2, A3, A4, A5, A6, A7 and A8; their frequencies, shown in Table 1, are obtained by statistics, and their related and unrelated prior probabilities are obtained. The related and unrelated expected frequencies, and the corresponding related and unrelated likelihood probabilities, are then obtained with the EM algorithm. Taking feature word A1 as an example, its related likelihood probability is calculated as 8/(8+4+8+10+5+10+8+10) = 13%, and its unrelated likelihood probability as 2/(2+1+6+4+25+15+2+35) = 2%; the related and unrelated likelihood probabilities of A2, A3, A4, A5, A6, A7 and A8 are calculated in the same way. The prior probabilities and frequencies of the feature words in this embodiment are merely exemplary and do not limit the present invention.
In practical applications, the target domain contains a large number of feature words. After the related and unrelated expected frequencies, and thus the related and unrelated likelihood probabilities, have been obtained with the EM algorithm, the relevance classification of the text samples with respect to the target domain is re-determined from the related and unrelated likelihood probabilities, and the expected frequencies of the feature words are updated according to the classification results, until the two classes (related and unrelated to the target domain) reach a convergent state; the related and unrelated expected frequencies of the feature words are updated according to the re-determined results. The EM-based relevance model is trained, updated and evaluated repeatedly in this way, iterating the EM algorithm until the required precision is met.
Those skilled in the art will understand that the embodiments of the present invention use the feature words extracted from the text samples (highly related texts and lowly related texts), calculate the probabilities of these feature words on the related and unrelated sides of the target domain, correct the probability distribution of the feature words according to these probabilities, and re-calculate the degree of correlation of the text samples according to the corrected distribution. Through repeated iteration, the original probability distribution of the feature words can be changed, so that texts containing related feature words and texts containing unrelated feature words are both effectively distinguished, improving the precision of the classification.
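The iterative re-estimation described in the two preceding paragraphs is not spelled out step by step, so the following sketch should be read as one plausible reading only: a two-class (related/unrelated) expectation-maximization loop in which each sample text's posterior relevance is computed from the current word likelihoods and the expected frequencies are then re-split accordingly. The initialization from word-level priors (with the unrelated prior taken as the complement), the fixed iteration count and the smoothing constant are assumptions of the sketch.

```python
# Assumption-laden sketch of the EM-style re-estimation of expected frequencies.
from collections import Counter

def em_expected_frequencies(sample_texts, related_prior, n_iter=20, eps=1e-9):
    """sample_texts: list of lists of feature words (one list per sample text).
    related_prior: word -> initial probability that the word signals relevance;
    the unrelated prior is taken as its complement for simplicity."""
    vocab = sorted({w for text in sample_texts for w in text})
    counts = Counter(w for text in sample_texts for w in text)
    # Initial expected frequencies: split each word's raw count by its prior.
    rel = {w: counts[w] * related_prior.get(w, 0.5) for w in vocab}
    unrel = {w: counts[w] * (1.0 - related_prior.get(w, 0.5)) for w in vocab}
    for _ in range(n_iter):
        rel_total, unrel_total = sum(rel.values()), sum(unrel.values())
        rel_lh = {w: (rel[w] + eps) / (rel_total + eps * len(vocab)) for w in vocab}
        unrel_lh = {w: (unrel[w] + eps) / (unrel_total + eps * len(vocab)) for w in vocab}
        new_rel = dict.fromkeys(vocab, 0.0)
        new_unrel = dict.fromkeys(vocab, 0.0)
        for text in sample_texts:
            # E-step: posterior probability that this text is related,
            # from the products of the current word likelihoods.
            p_rel, p_unrel = 1.0, 1.0
            for w in text:
                p_rel *= rel_lh[w]
                p_unrel *= unrel_lh[w]
            posterior = p_rel / (p_rel + p_unrel)
            # M-step contribution: re-split each word occurrence by the posterior.
            for w, c in Counter(text).items():
                new_rel[w] += posterior * c
                new_unrel[w] += (1.0 - posterior) * c
        rel, unrel = new_rel, new_unrel
    return rel, unrel   # normalize as in the previous sketch to get likelihoods
```

In practice, log-probabilities would typically be used to avoid underflow on long texts, and the loop would stop once the expected frequencies change by less than a tolerance, i.e. once the required precision mentioned above is reached.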
Table 1
Fig. 1 is a schematic flow chart of a text relevance determination method provided by an embodiment of the present invention, which includes the following steps:
S101: extracting feature words of a target text to be processed.
Specifically, the feature words of the target text to be processed can be extracted using a data mining technique.
Specifically, the data mining technique can include a TF-IDF technique or a word embedding technique.
S102: determining, according to the calculated related and unrelated likelihood probabilities of each feature word, the related likelihood probability and the unrelated likelihood probability corresponding to each extracted feature word of the target text to be processed.
S103: determining the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of each of its feature words.
Specifically, determining the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of its feature words can include: calculating, according to the determined related and unrelated likelihood probabilities of each feature word of the target text to be processed, the product of the related likelihood probabilities and the product of the unrelated likelihood probabilities of its feature words; judging whether the product of the related likelihood probabilities is greater than the product of the unrelated likelihood probabilities; if so, determining that the target text to be processed is related to the target domain; if not, determining that the target text to be processed is unrelated to the target domain.
As an example, consider judging whether the text to be processed, "the children's story of the mermaid, very touching", belongs to the target domain of reviews of the film "The Mermaid". The feature words of the target text are first extracted, using a data mining technique such as TF-IDF or word embedding. Suppose the extracted feature words are A1, A2, A5 and A8; their related and unrelated likelihood probabilities are looked up in Table 1 and are, respectively, 13%, 6%, 8%, 16% and 2%, 1%, 28%, 39%. The number of feature words extracted from the text is merely exemplary and does not limit the embodiments of the present invention.
Those skilled in the art will understand that step S102 yields the related and unrelated likelihood probabilities of all feature words of the target text to be processed, from which it is judged whether the target text is related to the target domain. Generally, the judgment can be made by comparing products of probabilities: it can be based on the relative size of the product of the related likelihood probabilities and the product of the unrelated likelihood probabilities of all feature words; alternatively, the product of the related likelihood probabilities, or the product of the unrelated likelihood probabilities, can be compared with a set threshold, so as to determine whether the target text is related to the target domain.
When the determination is made by comparing the product of the related likelihood probabilities of the feature words with the product of the unrelated likelihood probabilities, then for all feature words of the target text to be processed the product of the related likelihood probabilities is 13% * 6% * 8% * 16% = 0.0101%, and the product of the unrelated likelihood probabilities is 2% * 1% * 28% * 39% = 0.0027%. Since 0.0101% > 0.0027%, the target text to be processed is determined to be a related text of the target domain. In practical applications, the distribution of the text and of the corresponding feature words is updated according to the determination result for the target text: for example, the text can be added to the text samples of high similarity or low similarity to the target domain, or the frequencies and prior probabilities of the feature words can be corrected, so as to improve the accuracy of the feature-word judgment in subsequent judgments of text content. For feature words that do not appear in the samples, the present invention sets the probability of such an unseen word in the related class to be much smaller than its probability in the unrelated class, so as to strengthen the explanatory power of TF-IDF.
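The decision in this example is a comparison of two products of likelihood probabilities. The sketch below only reproduces that arithmetic and illustrates the kind of floor probability just described for feature words that never appeared in the samples; the concrete floor values are assumptions, and the small differences from the 0.0101% and 0.0027% figures above come from using the rounded percentages as inputs.

```python
# Sketch: relevance decision for the example text, reproducing the arithmetic above.
related_lh   = {"A1": 0.13, "A2": 0.06, "A5": 0.08, "A8": 0.16}
unrelated_lh = {"A1": 0.02, "A2": 0.01, "A5": 0.28, "A8": 0.39}

# Floors for feature words absent from the samples: much smaller on the related side
# than on the unrelated side, as described above (the specific values are assumptions).
UNSEEN_RELATED, UNSEEN_UNRELATED = 1e-6, 1e-3

def is_related(feature_words):
    p_rel, p_unrel = 1.0, 1.0
    for w in feature_words:
        p_rel *= related_lh.get(w, UNSEEN_RELATED)
        p_unrel *= unrelated_lh.get(w, UNSEEN_UNRELATED)
    return p_rel > p_unrel, p_rel, p_unrel

related, p_rel, p_unrel = is_related(["A1", "A2", "A5", "A8"])
# p_rel ≈ 9.98e-05 and p_unrel ≈ 2.18e-05, so p_rel > p_unrel and the
# target text is determined to be related to the target domain.
```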
It can be seen that, with the embodiment shown in Fig. 1, the related and unrelated likelihood probabilities of each feature word extracted from the target text to be processed are obtained from the related and unrelated likelihood probabilities of the feature words of the text samples of high similarity and low similarity to the target domain, and the relevance of the target text to the target domain is then determined from the related and unrelated likelihood probabilities of all of its feature words. Compared with the prior art, in which whether the target text is related to the target domain is determined only by calculating the degree of correlation between the feature words of the target text to be processed and the feature words of the text samples, this adds a comparison of the degree of non-relevance of the feature words with the feature words of the text samples, and makes the judgment of the relevance and non-relevance between the feature words and the target domain more comprehensive. The accuracy of predicting the relevance between the target text and the target domain is thereby improved.
Fig. 2 is a schematic structural diagram of a text relevance determination device provided by an embodiment of the present invention, which can include: a first extraction module 201, a second extraction module 202, a first determining module 203 and a second determining module 204.
The first extraction module 201 is configured to extract in advance the feature words of each text in the text samples of high similarity and low similarity to the target domain, and to calculate the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain.
Specifically, in practical applications, the feature words of each text in the text samples for the target domain can be extracted using a data mining technique.
Specifically, in practical applications, the data mining technique can include a TF-IDF technique or a word embedding technique.
Specifically, in practical applications, calculating the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain can include: obtaining, for the feature words of each text in the text samples, a prior probability of being related to the target domain and a prior probability of being unrelated to the target domain; determining, according to the obtained related and unrelated prior probabilities, the related expected frequency and the unrelated expected frequency of each feature word; and calculating, according to the determined related and unrelated expected frequencies, the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain.
The second extraction module 202 is configured to extract the feature words of a target text to be processed.
Specifically, in practical applications, the feature words of the target text to be processed can be extracted using a data mining technique.
Specifically, in practical applications, the data mining technique can include a TF-IDF technique or a word embedding technique.
The first determining module 203 is configured to determine, according to the calculated related and unrelated likelihood probabilities of each feature word, the related likelihood probability and the unrelated likelihood probability corresponding to each extracted feature word of the target text to be processed.
The second determining module 204 is configured to determine the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of each of its feature words.
Specifically, in practical applications, the second determining module 204 can include a calculating submodule, a judging submodule and a determining submodule (not shown in the figure), wherein:
the calculating submodule is configured to calculate, according to the determined related and unrelated likelihood probabilities of each feature word of the target text to be processed, the product of the related likelihood probabilities and the product of the unrelated likelihood probabilities of its feature words;
the judging submodule is configured to judge whether the product of the related likelihood probabilities is greater than the product of the unrelated likelihood probabilities;
and the determining submodule is configured to determine that the target text to be processed is related to the target domain when the judgment result of the judging submodule is yes, and to determine that the target text to be processed is unrelated to the target domain when the judgment result of the judging submodule is no.
It can be seen that, with the embodiment shown in Fig. 2, the related and unrelated likelihood probabilities of each feature word extracted from the target text to be processed are obtained from the related and unrelated likelihood probabilities of the feature words of the text samples of high similarity and low similarity to the target domain, and the relevance of the target text to the target domain is then determined from the related and unrelated likelihood probabilities of all of its feature words. Compared with the prior art, in which whether the target text is related to the target domain is determined only by calculating the degree of correlation between the feature words of the target text to be processed and the feature words of the text samples, this adds a comparison of the degree of non-relevance of the feature words with the feature words of the text samples, and makes the judgment of the relevance and non-relevance between the feature words and the target domain more comprehensive. The accuracy of predicting the relevance between the target text and the target domain is thereby improved.
It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relation or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device embodiment is described relatively briefly because it is substantially similar to the method embodiment; for the relevant parts, reference may be made to the description of the method embodiment.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by a program instructing the related hardware, and the program can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A text relevance determination method, characterized in that feature words of each text in a set of text samples of high similarity and low similarity to a target domain are extracted in advance, and a related likelihood probability and an unrelated likelihood probability with respect to the target domain are calculated for each feature word, the method comprising:
extracting feature words of a target text to be processed;
determining, according to the calculated related and unrelated likelihood probabilities of each feature word, the related likelihood probability and the unrelated likelihood probability corresponding to each extracted feature word of the target text to be processed;
determining the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of each of its feature words.
2. The method according to claim 1, characterized in that extracting the feature words of each text in the text samples for the target domain comprises:
for each text in the text samples, extracting the feature words of that text using a data mining technique;
and extracting the feature words of the target text to be processed comprises:
for the target text to be processed, extracting the feature words of that text using a data mining technique.
3. The method according to claim 2, characterized in that the data mining technique comprises:
a TF-IDF technique, or a word embedding technique.
4. The method according to claim 1, characterized in that calculating the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain comprises:
obtaining, for the feature words of each text in the text samples, a prior probability of being related to the target domain and a prior probability of being unrelated to the target domain;
determining, according to the obtained related and unrelated prior probabilities, the related expected frequency and the unrelated expected frequency of each feature word;
calculating, according to the determined related and unrelated expected frequencies, the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain.
5. The method according to claim 1, characterized in that determining the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of each of its feature words comprises:
calculating, according to the determined related and unrelated likelihood probabilities of each feature word of the target text to be processed, the product of the related likelihood probabilities and the product of the unrelated likelihood probabilities of its feature words;
judging whether the product of the related likelihood probabilities is greater than the product of the unrelated likelihood probabilities;
if so, determining that the target text to be processed is related to the target domain;
if not, determining that the target text to be processed is unrelated to the target domain.
6. A text relevance determination device, characterized in that the device comprises:
a first extraction module, configured to extract in advance the feature words of each text in a set of text samples of high similarity and low similarity to a target domain, and to calculate a related likelihood probability and an unrelated likelihood probability of each feature word with respect to the target domain;
a second extraction module, configured to extract feature words of a target text to be processed;
a first determining module, configured to determine, according to the calculated related and unrelated likelihood probabilities of each feature word, the related likelihood probability and the unrelated likelihood probability corresponding to each extracted feature word of the target text to be processed;
a second determining module, configured to determine the relevance of the target text to be processed to the target domain according to the determined related and unrelated likelihood probabilities of each of its feature words.
7. The device according to claim 6, characterized in that extracting the feature words of each text in the text samples for the target domain comprises:
for each text in the text samples, extracting the feature words of that text using a data mining technique;
and extracting the feature words of the target text to be processed comprises:
for the target text to be processed, extracting the feature words of that text using a data mining technique.
8. The device according to claim 7, characterized in that the data mining technique comprises:
a TF-IDF technique, or a word embedding technique.
9. The device according to claim 6, characterized in that calculating the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain comprises:
obtaining, for the feature words of each text in the text samples, a prior probability of being related to the target domain and a prior probability of being unrelated to the target domain;
determining, according to the obtained related and unrelated prior probabilities, the related expected frequency and the unrelated expected frequency of that feature word;
calculating, according to the determined related and unrelated expected frequencies, the related likelihood probability and the unrelated likelihood probability of each feature word with respect to the target domain.
10. The device according to claim 6, characterized in that the second determining module comprises:
a calculating submodule, configured to calculate, according to the determined related and unrelated likelihood probabilities of each feature word of the target text to be processed, the product of the related likelihood probabilities and the product of the unrelated likelihood probabilities of its feature words;
a judging submodule, configured to judge whether the product of the related likelihood probabilities is greater than the product of the unrelated likelihood probabilities;
a determining submodule, configured to determine that the target text to be processed is related to the target domain when the judgment result of the judging submodule is yes, and to determine that the target text to be processed is unrelated to the target domain when the judgment result of the judging submodule is no.
CN201610865596.0A 2016-09-29 2016-09-29 Text relevance determination method and device Pending CN106547822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610865596.0A CN106547822A (en) 2016-09-29 2016-09-29 Text relevance determination method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610865596.0A CN106547822A (en) 2016-09-29 2016-09-29 Text relevance determination method and device

Publications (1)

Publication Number Publication Date
CN106547822A true CN106547822A (en) 2017-03-29

Family

ID=58368401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610865596.0A Pending CN106547822A (en) Text relevance determination method and device

Country Status (1)

Country Link
CN (1) CN106547822A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402984A (en) * 2017-07-11 2017-11-28 北京金堤科技有限公司 A kind of sorting technique and device based on theme
CN108932525A (en) * 2018-06-07 2018-12-04 阿里巴巴集团控股有限公司 A kind of behavior prediction method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049470A (en) * 2012-09-12 2013-04-17 北京航空航天大学 Opinion retrieval method based on emotional relevancy
CN103150371A (en) * 2013-03-08 2013-06-12 北京理工大学 Confusion removal text retrieval method based on positive and negative training
CN105760474A (en) * 2016-02-14 2016-07-13 Tcl集团股份有限公司 Document collection feature word extracting method and system based on position information

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049470A (en) * 2012-09-12 2013-04-17 北京航空航天大学 Opinion retrieval method based on emotional relevancy
CN103150371A (en) * 2013-03-08 2013-06-12 北京理工大学 Confusion removal text retrieval method based on positive and negative training
CN105760474A (en) * 2016-02-14 2016-07-13 Tcl集团股份有限公司 Document collection feature word extracting method and system based on position information

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402984A (en) * 2017-07-11 2017-11-28 北京金堤科技有限公司 A kind of sorting technique and device based on theme
CN108932525A (en) * 2018-06-07 2018-12-04 阿里巴巴集团控股有限公司 A kind of behavior prediction method and device
CN108932525B (en) * 2018-06-07 2022-04-29 创新先进技术有限公司 Behavior prediction method and device

Similar Documents

Publication Publication Date Title
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN105045812B (en) The classification method and system of text subject
CN110532353B (en) Text entity matching method, system and device based on deep learning
CN109284397A (en) A kind of construction method of domain lexicon, device, equipment and storage medium
CN107526800A (en) Device, method and the computer-readable recording medium of information recommendation
CN109117474B (en) Statement similarity calculation method and device and storage medium
CN109492217B (en) Word segmentation method based on machine learning and terminal equipment
CN105095222B (en) Uniterm replacement method, searching method and device
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
WO2014022172A2 (en) Information classification based on product recognition
CN104866478A (en) Detection recognition method and device of malicious text
CN110188357B (en) Industry identification method and device for objects
CN111291177A (en) Information processing method and device and computer storage medium
CN111881671A (en) Attribute word extraction method
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN107861945A (en) Finance data analysis method, application server and computer-readable recording medium
CN115456043A (en) Classification model processing method, intent recognition method, device and computer equipment
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN106547822A (en) Text relevance determination method and device
CN110019556B (en) Topic news acquisition method, device and equipment thereof
CN107291686B (en) Method and system for identifying emotion identification
CN107688594A (en) The identifying system and method for risk case based on social information
CN116561298A (en) Title generation method, device, equipment and storage medium based on artificial intelligence
CN109359274A (en) The method, device and equipment that the character string of a kind of pair of Mass production is identified

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170329

RJ01 Rejection of invention patent application after publication