CN106033445B - The method and apparatus for obtaining article degree of association data - Google Patents
- Publication number: CN106033445B
- Application number: CN201510114670.0A
- Authority: CN (China)
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a method and apparatus for obtaining article degree of association data. The method includes: obtaining a predetermined keyword and multiple texts to be analyzed, where each text to be analyzed corresponds to multiple first text labels; determining, by statistics over the multiple texts to be analyzed, the derivative keywords of the predetermined keyword; determining the first count, i.e., the number of times the predetermined keyword appears in the multiple texts to be analyzed, and the second count, i.e., the number of times each derivative keyword appears in the multiple texts to be analyzed; determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword; and calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of that text, the first count of the predetermined keyword, and the second count of the derivative keywords. The invention solves the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately, and achieves accurate determination of that degree of association.
Description
Technical field
The present invention relates to the Internet field, and in particular to a method and apparatus for obtaining article degree of association data.
Background art
In the prior art, methods that find the articles associated with a keyword and rank them rely simply on whether the keyword appears in the article, where it appears, and how many times it appears. This is similar to the way search engines such as Baidu or Google search for a keyword. For example, if the keyword is "big data", then when searching for articles associated with "big data", the degree of association between an article and "big data" is determined by whether "big data" occurs in the article, how many times it occurs, and where it occurs, and the associated articles are ranked from high to low by that degree of association. The weight of an occurrence depends on its position in the article: for example, the headline has the highest weight, the body text comes second, and advertisements have the lowest. However, determining and ranking associated articles by keyword in this way does not accurately reflect the relationship between the theme of an article and the keyword.
For example, when searching for articles associated with the keyword "big data", a news item reporting that the author of some big data article attended a dinner party may mention "big data" many times. The theme of the news item is the author's attendance at the dinner party, not "big data". Yet because "big data" occurs many times, the prior art treats the news item as an article associated with the keyword "big data". The prior art therefore cannot accurately reflect the relationship between the theme of an article and the keyword.
As another example, when the keyword is "big data", an article may discuss big data throughout under a different wording without ever mentioning the term "big data" itself. Such an article is not picked up by the search and does not take part in the ranking.
As a further example, with the keyword still "big data", suppose article A mentions "big data" once and a further term, also written "big data" here, five times, while article B mentions "big data" twice and "data mining" four times. According to the prior art described above, article B has a higher degree of association with the keyword "big data" than article A, which is clearly not the case.
No effective solution has yet been proposed for the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately.
Summary of the invention
The main purpose of the present invention is to provide a method and apparatus for obtaining article degree of association data, so as to solve the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately.
To achieve the above purpose, according to one aspect of the embodiments of the present invention, a method for obtaining article degree of association data is provided. The method includes: obtaining a predetermined keyword and multiple texts to be analyzed; determining, by statistics over the multiple texts to be analyzed, the derivative keywords of the predetermined keyword, where a derivative keyword is a keyword that appears in the same text to be analyzed as the predetermined keyword; determining the first count, i.e., the number of times the predetermined keyword appears in the multiple texts to be analyzed, and the second count, i.e., the number of times each derivative keyword appears in the multiple texts to be analyzed; determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, where a first text label identifies the theme of a text to be analyzed; and calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of that text, the first count of the predetermined keyword, and the second count of the derivative keywords.
Further, determining the derivative keywords of the predetermined keyword over the multiple texts to be analyzed includes: performing word segmentation on the multiple texts to be analyzed to obtain a word set; obtaining the first quantity of each first word in the word set, where the first quantity is greater than a first preset threshold; obtaining the second quantity of each second word in the word set, where the second quantity is the total number of times a second word appears together with the predetermined keyword across the texts to be analyzed, and the second quantity is greater than a second preset threshold; comparing the second words with the first words, where if a second word is identical to a first word, the ratio of its second quantity to the first quantity is taken as the frequency of occurrence of the second word, and if a second word differs from the first words, its second quantity is taken as its frequency of occurrence; and taking the second words whose frequency of occurrence is greater than a third preset threshold as derivative keywords.
Further, before determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, the method further includes: obtaining preset text labels and the associated words of the preset text labels, where the preset text labels include the first text labels and each preset text label corresponds to at least one associated word; traversing the multiple texts to be analyzed to obtain the associated words contained in each text to be analyzed; and looking up the preset text labels corresponding to each associated word contained in each text to be analyzed, and taking them as the multiple first text labels.
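The label lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the label table `PRESET_LABELS`, the sample texts, and the use of plain substring matching to detect associated words are all assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical preset text labels, each with at least one associated word.
PRESET_LABELS: Dict[str, List[str]] = {
    "big data": ["data mining", "Hadoop"],
    "finance": ["stock", "bank"],
}


def first_text_labels(text: str, preset_labels: Dict[str, List[str]]) -> Set[str]:
    """Return every preset label that has at least one associated word in the text."""
    labels = set()
    for label, associated_words in preset_labels.items():
        # Assumption: an associated word "appears" if it is a substring of the text.
        if any(word in text for word in associated_words):
            labels.add(label)
    return labels


# Hypothetical texts to be analyzed.
texts = {
    "A": "A bank applies data mining to stock prices.",
    "B": "The author attended a dinner party.",
}
labels_by_text = {name: first_text_labels(body, PRESET_LABELS) for name, body in texts.items()}
```

Here text A receives both labels and text B receives none, so only text A could later yield a second text label matching a keyword such as "big data".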
Further, before calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first count of the predetermined keyword, and the second count of the derivative keywords, the method further includes: calculating the label index data A of each first text label according to a first formula (the formula itself is not reproduced in this text), where n is the number of associated words corresponding to the first text label, B_i is the number of times the i-th associated word of the first text label appears in one text to be analyzed, and b_i is the preset weight of the first text label corresponding to the i-th associated word.
Further, calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first count of the predetermined keyword, and the second count of the derivative keywords includes: taking the associated words of the second text label that are identical to a derivative keyword as third words; and calculating the degree of association data G of each text to be analyzed according to a second formula (the formula itself is not reproduced in this text), where K is the first count of the predetermined keyword, C is the label index data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the preset weight of the second text label, m is the number of third words, k_j is the second count of the derivative keyword corresponding to the j-th third word, E_j is the third count, i.e., the number of times the j-th third word appears in one text to be analyzed, and e_j is the preset weight of the second text label corresponding to the j-th third word.
Further, after calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first count of the predetermined keyword, and the second count of the derivative keywords, the method further includes: sorting the degree of association data of the texts to be analyzed from high to low to obtain a degree of association ranking table; and displaying the top N degree of association data in the ranking table together with the corresponding texts to be analyzed, where N is a natural number.
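The ranking step can be sketched in a few lines. The scores below are hypothetical values chosen for illustration, and `top_n_by_association` is a name introduced here, not one used in the source.

```python
from typing import Dict, List, Tuple


def top_n_by_association(scores: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Sort texts by degree of association, highest first, and keep the first n."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical degree of association data for three texts to be analyzed.
scores = {"text A": 0.82, "text B": 0.35, "text C": 0.67}
top_two = top_n_by_association(scores, 2)
```

With these values, `top_two` holds text A followed by text C.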
To achieve the above purpose, according to another aspect of the embodiments of the present invention, an apparatus for obtaining article degree of association data is provided. The apparatus includes: a first obtaining module, configured to obtain a predetermined keyword and multiple texts to be analyzed; a statistics module, configured to determine, by statistics over the multiple texts to be analyzed, the derivative keywords of the predetermined keyword, where a derivative keyword is a keyword that appears in the same text to be analyzed as the predetermined keyword; a first determining module, configured to determine the first count, i.e., the number of times the predetermined keyword appears in the multiple texts to be analyzed, and the second count, i.e., the number of times each derivative keyword appears in the multiple texts to be analyzed; a second determining module, configured to determine, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, where a first text label identifies the theme of a text to be analyzed; and a first computing module, configured to calculate the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of that text, the first count of the predetermined keyword, and the second count of the derivative keywords.
Further, the statistics module includes: a word segmentation module, configured to perform word segmentation on the multiple texts to be analyzed to obtain a word set; a second obtaining module, configured to obtain the first quantity of each first word in the word set, where the first quantity is greater than a first preset threshold; a third obtaining module, configured to obtain the second quantity of each second word in the word set, where the second quantity is the total number of times a second word appears together with the predetermined keyword across the texts to be analyzed, and the second quantity is greater than a second preset threshold; a comparison module, configured to compare the second words with the first words, where if a second word is identical to a first word, the ratio of its second quantity to the first quantity is taken as the frequency of occurrence of the second word, and if a second word differs from the first words, its second quantity is taken as its frequency of occurrence; and a third determining module, configured to take the second words whose frequency of occurrence is greater than a third preset threshold as derivative keywords.
Further, the apparatus further includes: a fourth obtaining module, configured to obtain, before the second text label matching the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, preset text labels and the associated words of the preset text labels, where the preset text labels include the first text labels and each preset text label corresponds to at least one associated word; a traversal module, configured to traverse the multiple texts to be analyzed to obtain the associated words contained in each text to be analyzed; and a lookup module, configured to look up the preset text labels corresponding to each associated word contained in each text to be analyzed and take them as the multiple first text labels.
Further, the apparatus further includes: a second computing module, configured to calculate, before the degree of association data between the predetermined keyword and each text to be analyzed is calculated based on the label index data of the second text label of each text to be analyzed, the first count of the predetermined keyword, and the second count of the derivative keywords, the label index data A of each first text label according to a first formula (the formula itself is not reproduced in this text), where n is the number of associated words corresponding to the first text label, B_i is the number of times the i-th associated word of the first text label appears in one text to be analyzed, and b_i is the preset weight of the first text label corresponding to the i-th associated word.
Further, the first computing module includes: a fourth determining module, configured to take the associated words of the second text label that are identical to a derivative keyword as third words; and a computing submodule, configured to calculate the degree of association data G of each text to be analyzed according to a second formula (the formula itself is not reproduced in this text), where K is the first count of the predetermined keyword, C is the label index data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the preset weight of the second text label, m is the number of third words, k_j is the second count of the derivative keyword corresponding to the j-th third word, E_j is the third count, i.e., the number of times the j-th third word appears in one text to be analyzed, and e_j is the preset weight of the second text label corresponding to the j-th third word.
Further, the apparatus further includes: a sorting module, configured to sort, after the degree of association data between the predetermined keyword and each text to be analyzed has been calculated based on the label index data of the second text label of each text to be analyzed, the first count of the predetermined keyword, and the second count of the derivative keywords, the degree of association data of the texts to be analyzed from high to low to obtain a degree of association ranking table; and a display module, configured to display the top N degree of association data in the ranking table together with the corresponding texts to be analyzed, where N is a natural number.
With the embodiments of the present invention, after the predetermined keyword and the multiple texts to be analyzed are obtained, the derivative keywords of the predetermined keyword are determined by statistics over the multiple texts to be analyzed, and the first count of the predetermined keyword and the second count of the derivative keywords in the multiple texts to be analyzed are determined. After the second text label matching the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, the degree of association data between the predetermined keyword and each text to be analyzed is calculated based on the label index data of the second text label of that text, the first count of the predetermined keyword, and the second count of the derivative keywords. The degree of association between a text to be analyzed and the predetermined keyword is thus calculated from both the predetermined keyword and the text labels of the text. Because the text labels identify the theme of the text to be analyzed, the degree of association between the predetermined keyword and the text can be determined accurately. The embodiments of the present invention solve the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately, and achieve accurate determination of that degree of association.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for obtaining article degree of association data according to an embodiment of the present invention;
Fig. 2 is a flowchart of an optional method for obtaining article degree of association data according to an embodiment of the present invention; and
Fig. 3 is a schematic diagram of an apparatus for obtaining article degree of association data according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in combination with the embodiments.
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not describe a particular order or sequence. Data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
An embodiment of the present invention provides a method for obtaining article degree of association data.
Fig. 1 is a flowchart of a method for obtaining article degree of association data according to an embodiment of the present invention. As shown in Fig. 1, the method may include the following steps:
Step S102: obtain a predetermined keyword and multiple texts to be analyzed.
Step S104: determine, by statistics over the multiple texts to be analyzed, the derivative keywords of the predetermined keyword.
Here, a derivative keyword is a keyword that appears in the same text to be analyzed as the predetermined keyword.
Step S106: determine the first count, i.e., the number of times the predetermined keyword appears in the multiple texts to be analyzed, and the second count, i.e., the number of times each derivative keyword appears in the multiple texts to be analyzed.
Step S108: determine, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword.
Here, a first text label identifies the theme of a text to be analyzed.
Step S110: calculate the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of that text, the first count of the predetermined keyword, and the second count of the derivative keywords.
With the embodiments of the present invention, after the predetermined keyword and the multiple texts to be analyzed are obtained, the derivative keywords of the predetermined keyword are determined by statistics over the multiple texts to be analyzed, and the first count of the predetermined keyword and the second count of the derivative keywords in the multiple texts to be analyzed are determined. After the second text label matching the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, the degree of association data between the predetermined keyword and each text to be analyzed is calculated based on the label index data of the second text label of that text, the first count of the predetermined keyword, and the second count of the derivative keywords. The degree of association between a text to be analyzed and the predetermined keyword is thus calculated from both the predetermined keyword and the text labels of the text. Because the text labels identify the theme of the text to be analyzed, the degree of association between the predetermined keyword and the text can be determined accurately. The embodiments of the present invention solve the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately, and achieve accurate determination of that degree of association.
In the above embodiment, the texts to be analyzed may be web articles crawled from the Internet by a web crawler. Optionally, articles may be crawled according to a list of URLs of the pages to be crawled, or according to page level. For example, the web crawler may be configured to crawl the content of the first-level pages of a website (e.g., Sina, NetEase, or Tencent), such as the content of the Sina homepage, and then the content of the second-level pages of the website, such as the content reached by opening the links on the Sina homepage, and so on.
Here, URL stands for Uniform Resource Locator. It is a concise representation of the location of a resource available on the Internet and of the method for accessing it, and it is the standard address of a resource on the Internet.
In the above embodiment, the crawled articles (i.e., the texts to be analyzed) may be stored in a database.
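Crawling by page level amounts to a breadth-first traversal with a depth limit. The sketch below illustrates that idea only: `crawl_by_level`, the `fetch_links` callback, and the in-memory `SITE` map are hypothetical stand-ins, and a real crawler would download pages and extract their links instead.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl_by_level(start_url: str,
                   fetch_links: Callable[[str], Iterable[str]],
                   max_level: int) -> List[str]:
    """Visit pages breadth-first up to max_level; the start page is level 1."""
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 1)])
    while queue:
        url, level = queue.popleft()
        if level >= max_level:
            continue  # do not follow links out of the deepest allowed level
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, level + 1))
    return visited


# Hypothetical site: each page maps to the pages it links to.
SITE: Dict[str, List[str]] = {
    "home": ["news", "tech"],
    "news": ["story-1"],
    "tech": ["story-2", "home"],
}
first_two_levels = crawl_by_level("home", lambda url: SITE.get(url, []), max_level=2)
```

With `max_level=2` the traversal stops at the pages linked from the homepage. Raising the limit to 3 also reaches the story pages.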
According to the above embodiment of the present invention, determining the derivative keywords of the predetermined keyword over the multiple texts to be analyzed may include: performing word segmentation on the multiple texts to be analyzed to obtain a word set; obtaining the first quantity of each first word in the word set, where the first quantity is greater than a first preset threshold; obtaining the second quantity of each second word in the word set, where the second quantity is the total number of times a second word appears together with the predetermined keyword across the texts to be analyzed, and the second quantity is greater than a second preset threshold; comparing the second words with the first words, where if a second word is identical to a first word, the ratio of its second quantity to the first quantity is taken as the frequency of occurrence of the second word, and if a second word differs from the first words, its second quantity is taken as its frequency of occurrence; and taking the second words whose frequency of occurrence is greater than a third preset threshold as derivative keywords.
Specifically, the multiple texts to be analyzed may be segmented according to the preset words in a preset dictionary to obtain a word set containing multiple words. For example, if a text to be analyzed is "big data refers to not using shortcuts such as random analysis methods, but using all the data for analysis and processing", the word set obtained after word segmentation may include the following words: "big data", "refers to", "not using", "random analysis method", "such", "shortcut", "but", "using", "all", "data", "carrying out", and "analysis and processing".
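The source does not say which dictionary-based segmentation algorithm is used. One common choice is forward maximum matching, sketched below as an assumption. The tiny dictionary and the sample string (大数据 "big data", 分析 "analysis", 处理 "processing") are hypothetical.

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical preset dictionary and text to be analyzed.
dictionary = {"大数据", "数据", "分析", "处理"}
words = segment("大数据分析处理", dictionary)
```

Because the longest match wins, the text starts with the single word 大数据 and is not split into 大 followed by 数据.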
Further, after the word set is obtained, obtaining the first quantity of each first word in the word set may include: counting the number of occurrences of each word in the word set. For example, "scientific and technical article" occurs 100 times in the word set (i.e., its word quantity is 100), "China" occurs 90 times, "big data" occurs 30 times, "finance" occurs 25 times, "data mining" occurs 20 times, and a further term, also written "big data" here, occurs 15 times. The words whose word quantity is greater than the first preset threshold (e.g., 50) are taken as the first words. Alternatively, the Y words with the highest quantity (e.g., Y = 2) may be taken as the first words, where Y is a natural number. The quantity of each first word is recorded as its first quantity. In the above example, the first words are "scientific and technical article" and "China", with first quantities of 100 and 90 respectively.
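The two ways of selecting first words, by threshold or by taking the top Y, can be sketched with the counts from the example above. Since the text gives two different terms as "big data", the key `"big data (second term)"` is a placeholder introduced here to keep them apart. `first_words` is likewise a hypothetical helper name.

```python
from collections import Counter
from typing import Dict


def first_words(word_counts: Dict[str, int], threshold: int) -> Dict[str, int]:
    """Keep the words whose total count exceeds the first preset threshold."""
    return {word: count for word, count in word_counts.items() if count > threshold}


# Word quantities from the example; the last key is a placeholder name.
word_counts = Counter({
    "scientific and technical article": 100,
    "China": 90,
    "big data": 30,
    "finance": 25,
    "data mining": 20,
    "big data (second term)": 15,
})

by_threshold = first_words(word_counts, threshold=50)  # first preset threshold of 50
by_top_y = dict(word_counts.most_common(2))            # alternative: top Y words, Y = 2
```

Both selections yield "scientific and technical article" and "China" with first quantities 100 and 90, matching the example.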
In the above embodiment, after the word set is obtained, obtaining the second quantity of each second word in the word set may include: for each word obtained by segmenting the texts to be analyzed, summing the number of times it appears together with the predetermined keyword. As shown in Table 1, suppose the predetermined keyword is "big data" and "big data" appears in three texts to be analyzed (text A, text B, and text C in Table 1). The words that appear together with "big data" in each text, and their numbers of occurrences, are listed in Table 1. From Table 1, the number of times each word appears together with the predetermined keyword is: "scientific and technical article", 10 + 2 = 12; "China", 6 + 7 = 13; "big data", 5 + 7 = 12; "finance", 2 + 1 = 3; a further term, also written "big data" here, 10 + 10 = 20; and "data mining", 5 + 5 + 3 = 13. The X words with the highest number of occurrences (e.g., X = 5) are taken as the second words, where X is a natural number. Alternatively, the words whose number is greater than the second preset threshold (e.g., 10) are taken as the second words. The number for each second word is recorded as its second quantity. In the above example, the second words are "scientific and technical article" (second quantity 12), "China" (second quantity 13), "big data" (second quantity 20), a further term also written "big data" (second quantity 12), and "data mining" (second quantity 13).
Table 1
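The co-occurrence totals can be sketched as below. The per-text split of the counts is an assumption, because the body of Table 1 is not reproduced here. Only the sums for "scientific and technical article" (10 + 2), "China" (6 + 7), "finance" (2 + 1), and "data mining" (5 + 5 + 3) come from the text. The two "big data" entries are left out because the text does not distinguish them. Text D and the counts of the keyword itself are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def co_occurrence_totals(texts: List[Dict[str, int]], keyword: str) -> Counter:
    """Sum each word's count over the texts that also contain the keyword."""
    totals = Counter()
    for counts in texts:
        if counts.get(keyword, 0) == 0:
            continue  # the keyword is absent, so nothing in this text co-occurs with it
        for word, count in counts.items():
            if word != keyword:
                totals[word] += count
    return totals


# Assumed per-text word counts for texts A, B, C, plus a text D without the keyword.
texts = [
    {"big data": 4, "scientific and technical article": 10, "China": 6, "data mining": 5, "finance": 2},
    {"big data": 3, "scientific and technical article": 2, "China": 7, "data mining": 5},
    {"big data": 1, "finance": 1, "data mining": 3},
    {"China": 50},
]
totals = co_occurrence_totals(texts, "big data")
second_words = {word: n for word, n in totals.items() if n > 10}  # second preset threshold of 10
```

"finance" totals 3 and falls below the threshold, so it is not a second word. The 50 occurrences of "China" in text D are ignored because the keyword is absent there.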
After the multiple first words and multiple second words are determined, each second word is compared with each first word. If a second word is identical to a first word, the ratio of its second quantity to the first quantity is taken as the frequency of occurrence of the second word. If a second word is not identical to any first word, its second quantity is taken as its frequency of occurrence. The second words whose frequency of occurrence is greater than the third preset threshold are taken as derivative keywords, or the Z second words with the highest frequency of occurrence are taken as derivative keywords.
With the above embodiment of the present invention, the derivative keywords of the predetermined keyword can be determined automatically from the multiple texts to be analyzed, without derivative keywords having to be added manually for the predetermined keyword, which improves the accuracy of the derivative keywords. When the degree of association data between the predetermined keyword and a text to be analyzed is calculated, the influence of the derivative keywords can also be taken into account, which improves the accuracy of the calculated degree of association data.
In an optional embodiment, the articles crawled from the Internet (i.e., the texts to be analyzed) may be stored in a database and segmented into words. The Y words that occur most often across all articles stored in the database are taken as standalone words (the first words in the above embodiment), and the number of occurrences a of each standalone word (the first quantity above) is recorded. For the predetermined keyword, the X words that most often appear together with it in the database are taken as co-occurrence words (the second words above), and the number b for each co-occurrence word (the second quantity above) is recorded. The standalone words and co-occurrence words are then combined: if a co-occurrence word is identical to a standalone word, its frequency of occurrence is recorded as b/a, and if a co-occurrence word differs from the standalone words, its frequency of occurrence remains b. The Z words with the highest frequency of occurrence are then taken as the derivative keywords of the predetermined keyword.
In this embodiment of the present invention, a, b and Z are natural numbers.
The present invention is described in further detail with the above example. In that example, the first words with their first quantities and the second words with their second quantities are shown in Table 2. The second words "big data", the further term also written "big data", and "data mining" are not identical to the first words (in this embodiment, "scientific and technical article" and "China"), so their frequencies of occurrence are 20, 12 and 13 respectively. The second words "scientific and technical article" and "China" are identical to first words, so the frequency of occurrence of "scientific and technical article" is 12/100 = 0.12 and that of "China" is 13/90 ≈ 0.14. If the third preset threshold is 10, or Z = 3, then the derivative keywords in this embodiment are "big data", the further term also written "big data", and "data mining", and their second counts are 20, 12 and 13 in turn.
Table 2
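The correction and selection of derivative keywords in this example can be sketched as follows. The quantities are those given in the text. The key `"big data (second term)"` is a placeholder for the second term that the text also gives as "big data", and `derivative_keywords` is a hypothetical helper name.

```python
from typing import Dict


def derivative_keywords(first: Dict[str, int],
                        second: Dict[str, int],
                        threshold: float) -> Dict[str, float]:
    """Correct each second quantity b by the first quantity a, then apply the threshold."""
    frequencies = {}
    for word, b in second.items():
        a = first.get(word)
        # A second word that is also a first word is scaled down to b / a.
        frequencies[word] = b / a if a else float(b)
    return {word: f for word, f in frequencies.items() if f > threshold}


first = {"scientific and technical article": 100, "China": 90}
second = {
    "scientific and technical article": 12,
    "China": 13,
    "big data": 20,
    "big data (second term)": 12,  # placeholder name
    "data mining": 13,
}
derived = derivative_keywords(first, second, threshold=10)  # third preset threshold of 10
```

Dividing by the first quantity pushes the generally frequent words "scientific and technical article" (0.12) and "China" (about 0.14) far below the threshold, so only the three topic-specific words remain.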
In the above embodiment of the present invention, words without special significance can be removed from the words that occur together with the predetermined keyword. For example, an article may be about applications of big data yet repeatedly mention "scientific and technical article" or "China", words that have no particular relevance to "big data" (for instance, the article repeatedly states that it was published in a certain Chinese scientific and technical publication). By comparing the first words, which occur most frequently across the articles, with the second words, which co-occur most frequently with the predetermined keyword, and correcting the frequency of occurrence of each second word using the first quantity of the matching first word, words of no particular relevance are removed from the second words.
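The correction described above can be illustrated with a minimal sketch. The function name `derivative_keywords` and its parameters are hypothetical and do not appear in the embodiment; the counts are those of the Table 2 example, and because the translated text lists "big data" twice among the second words, the second occurrence is replaced here by the placeholder `"term-2"`.

```python
def derivative_keywords(first_quantities, second_quantities, z):
    """Rank co-occurring words after correcting for overall frequency.

    first_quantities:  {word: a} for the most frequent words overall (first words).
    second_quantities: {word: b} for words co-occurring with the keyword (second words).
    Returns the z highest-scoring (word, score) pairs.
    """
    scores = {}
    for word, b in second_quantities.items():
        a = first_quantities.get(word)
        # A second word that is also a first word is scored b / a;
        # any other second word keeps its raw count b.
        scores[word] = b / a if a else b
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:z]


first = {"scientific and technical article": 100, "China": 90}
second = {
    "big data": 20,
    "term-2": 12,  # placeholder for the second term rendered as "big data"
    "data mining": 13,
    "scientific and technical article": 12,
    "China": 13,
}
print(derivative_keywords(first, second, 3))
# [('big data', 20), ('data mining', 13), ('term-2', 12)]
```

With Z = 3, the two generic words score 0.12 and about 0.14 and therefore fall out of the result, which is the intended effect of the correction.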
Further, determining the first number of times the predetermined keyword appears in the multiple texts to be analyzed and the second number of times the derivative keyword appears in the multiple texts to be analyzed may include: taking the word quantity of the predetermined keyword in the above embodiment as the first number of the predetermined keyword; and taking the frequency of occurrence of the derivative keyword in the above embodiment as the second number of the derivative keyword.
Specifically, after the word quantity of each word in the set of words has been counted, the word quantity corresponding to the predetermined keyword is taken as the first number of the predetermined keyword; in the above example, if the predetermined keyword is "big data", its first number is 30. After the derivative keywords are determined, the frequency of occurrence corresponding to each derivative keyword is recorded as its second number; in the above example, the second number of the second derivative keyword "big data" is 12.
In the above embodiment of the present invention, before determining the second text label that matches the predetermined keyword among the multiple first text labels of each text to be analyzed, the method may further include: obtaining preset text labels and the conjunctive words of the preset text labels, wherein the preset text labels include the first text labels and each preset text label corresponds to at least one conjunctive word; traversing the multiple texts to be analyzed to obtain the conjunctive words contained in each text to be analyzed; and looking up the preset text labels corresponding to the conjunctive words contained in each text to be analyzed, as the multiple first text labels.
Specifically, the preset text labels and their conjunctive words can be obtained from a preset tag library, in which the preset text labels and their corresponding conjunctive words are stored and each preset text label corresponds to at least one conjunctive word. The multiple texts to be analyzed are traversed to obtain the conjunctive words contained in each text to be analyzed. For each text to be analyzed, the preset text labels corresponding to the conjunctive words it contains are looked up and set as the first text labels of that text. In this embodiment, since one preset text label can correspond to multiple conjunctive words, the number of first text labels determined for a text to be analyzed is no greater than the number of conjunctive words that text contains.
It should be further noted that a conjunctive word can correspond to only one preset text label.
For example, the preset tag library contains the preset text labels "big data" and "finance", and the conjunctive words corresponding to each preset text label, together with their default weights, are as shown in Table 3. If a text to be analyzed contains "big data", "data mining", "data application" and "Wall Street", then, as described above, "big data", "data mining" and "data application" correspond to the preset text label "big data" and "Wall Street" corresponds to the preset text label "finance". The text to be analyzed therefore has two first text labels, "big data" and "finance", which indicates that it has two themes: "big data" and "finance".
Table 3
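The label lookup of this example can be sketched as follows. This is only an illustration under stated assumptions: the names `TAG_LIBRARY` and `first_text_labels` are hypothetical, and the library holds only the conjunctive words named in the text (for "finance" only "Wall Street" is known, so any further entries of Table 3 are omitted).

```python
# Hypothetical tag library: preset text label -> its conjunctive words.
TAG_LIBRARY = {
    "big data": ["big data", "data mining", "data application", "data processing"],
    "finance": ["Wall Street"],
}


def first_text_labels(text_words, tag_library):
    """Return the preset text labels whose conjunctive words occur in the text."""
    # Invert the library; each conjunctive word maps to exactly one label.
    word_to_label = {
        word: label for label, words in tag_library.items() for word in words
    }
    labels = []
    for word in text_words:
        label = word_to_label.get(word)
        if label is not None and label not in labels:
            labels.append(label)  # keep first-seen order, no duplicates
    return labels


words_in_text = ["big data", "data mining", "data application", "Wall Street"]
print(first_text_labels(words_in_text, TAG_LIBRARY))
# ['big data', 'finance']
```

Four conjunctive words yield two labels, consistent with the statement that the number of first text labels does not exceed the number of conjunctive words found in the text.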
Through the above embodiment of the present invention, each text to be analyzed is matched against the preset text labels and conjunctive words stored in the preset tag library to determine the multiple first text labels of that text. Each first text label identifies one theme of the text to be analyzed, so the multiple first text labels accurately identify every theme the text involves. When the degree of association between the predetermined keyword and a text to be analyzed is calculated, the second text label that matches the predetermined keyword is first determined among the first text labels, and the degree of association data between the predetermined keyword and the text to be analyzed (e.g., an article on the internet) is then calculated from this second text label, which reflects the theme of the text. This avoids the poor accuracy of the prior art, in which the degree of association between an article and a keyword is determined from whether the keyword appears in the article, where it appears, and how often it appears, and it improves the accuracy of the calculated degree of association data.
According to the above embodiment of the present invention, before the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method may further include: calculating the label achievement data A of each first text label according to a first formula, wherein the first formula is $A = \sum_{i=1}^{n} B_i \times b_i$, n is the number of conjunctive words corresponding to the first text label, $B_i$ is the number of times the i-th conjunctive word of the first text label appears in a text to be analyzed, and $b_i$ is the default weight of the i-th conjunctive word with respect to the first text label.
Specifically, before the degree of association data of the predetermined keyword and each text to be analyzed is calculated, the label achievement data of each first text label is calculated. For a first text label, the default weight of each of its conjunctive words is read from the preset tag library, the number of times each conjunctive word appears in the text to be analyzed is counted, the product of each conjunctive word's number of appearances and its default weight is calculated, and the products are added to obtain the label achievement data of that first text label.
For example, with reference to Table 3, for the first text label "big data", if its conjunctive words "big data", "big data", "data mining", "data application" and "data processing" appear in an article (i.e., a text to be analyzed in the above embodiment) 4, 3, 5, 2 and 1 times respectively, then according to the above embodiment of the present invention the label achievement data of the first text label "big data" is 5 × 4 + 5 × 3 + 3 × 5 + 2 × 2 + 1 × 1 = 55.
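The weighted sum of this example can be reproduced with a short sketch. The function name `label_achievement` is hypothetical; the default weights 5, 5, 3, 2 and 1 are read off the arithmetic above rather than from Table 3 itself, and the conjunctive words are passed as (count, weight) pairs because the translated text names two of them identically.

```python
def label_achievement(pairs):
    """Sum count * default_weight over the conjunctive words of one label.

    pairs: iterable of (B_i, b_i), where B_i is the number of appearances of the
    i-th conjunctive word in the text and b_i is its default weight.
    """
    return sum(count * weight for count, weight in pairs)


# (appearances in the article, default weight) for the five conjunctive words
big_data_pairs = [(4, 5), (3, 5), (5, 3), (2, 2), (1, 1)]
print(label_achievement(big_data_pairs))  # 55
```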
Through the above embodiment of the present invention, the label achievement data of each first text label of each text to be analyzed is calculated, and this data reflects how relevant each theme is to the text to be analyzed (e.g., an article on the internet): the larger the label achievement data of a first text label, the more relevant the theme corresponding to that label is to the text.
In the above embodiment of the present invention, calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword may include: taking each conjunctive word of the second text label that is identical to a derivative keyword as a third word; and calculating the degree of association data G of each text to be analyzed according to a second formula, wherein the second formula is $G = K \times C + K \times D \times d + \sum_{j=1}^{m} \left( k_j \times C + k_j \times E_j \times e_j \right)$, K is the first number of the predetermined keyword, C is the label achievement data of the second text label, D is the number of times the second text label appears in a text to be analyzed, d is the default weight of the second text label, m is the number of third words, $k_j$ is the second number of the derivative keyword corresponding to the j-th third word, $E_j$ is the third number, i.e., the number of times the j-th third word appears in a text to be analyzed, and $e_j$ is the default weight of the j-th third word with respect to the second text label.
Specifically, after the second text label identical to the predetermined keyword is determined among the multiple first text labels, the label achievement data of the second text label is taken from the calculated label achievement data of the multiple first text labels, and each conjunctive word of the second text label that is identical to a derivative keyword is taken as a third word. The degree of association data of the predetermined keyword and the text to be analyzed is then calculated according to the second formula from the label achievement data of the second text label, the first number of the predetermined keyword and the second number of the derivative keyword.
With reference to the above example, of the two first text labels "big data" and "finance" of the text to be analyzed, the one identical to the predetermined keyword "big data" is the first text label "big data", so "big data" is determined to be the second text label. The conjunctive words of the second text label "big data" are "big data", "big data", "data mining", "data application" and "data processing", and the derivative keywords of the predetermined keyword "big data" include "big data", so the third word is "big data". In this embodiment, the first number K of the predetermined keyword "big data" is 30; the label achievement data C of the second text label "big data" is 55; the number of times D that the second text label "big data" appears in the text to be analyzed is 4; the default weight d of the second text label "big data" is 5; the number m of third words is 1; the second number $k_j$ of the derivative keyword corresponding to the j-th third word (i.e., the above "big data") is 10; the third number $E_j$ of times the j-th third word appears in the text to be analyzed is 3; and the default weight $e_j$ of the j-th third word with respect to the second text label is 5. The degree of association data G of the text to be analyzed and the predetermined keyword "big data" is therefore 30 × 55 + 30 × 4 × 5 + (10 × 55 + 10 × 3 × 5) = 2950.
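The arithmetic of this example can be checked with a minimal sketch. The form G = K·C + K·D·d + Σ_j (k_j·C + k_j·E_j·e_j) used below is inferred from the worked numbers, since the formula image is not reproduced in this text; the function name `association_degree` and its parameter layout are hypothetical.

```python
def association_degree(K, C, D, d, third_words):
    """Degree of association G for one text to be analyzed.

    K: first number of the predetermined keyword
    C: label achievement data of the second text label
    D: appearances of the second text label in the text
    d: default weight of the second text label
    third_words: iterable of (k_j, E_j, e_j) tuples, one per third word
    """
    keyword_part = K * C + K * D * d
    derivative_part = sum(k * C + k * E * e for k, E, e in third_words)
    return keyword_part + derivative_part


# Values from the worked example: one third word with k_j=10, E_j=3, e_j=5.
print(association_degree(K=30, C=55, D=4, d=5, third_words=[(10, 3, 5)]))  # 2950
```

When a text has no third word, the sum is empty and G reduces to the keyword part K·C + K·D·d.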
Through the above embodiment of the present invention, the calculation of the degree of association data of the predetermined keyword and a text to be analyzed takes into account both the derivative keywords obtained for the predetermined keyword from the texts to be analyzed and the theme of the text to be analyzed. This avoids the poor accuracy of the prior art, in which the degree of association between an article and a keyword is determined from whether the keyword appears in the article, where it appears, and how often it appears, and it improves the accuracy of the calculated degree of association data.
According to the above embodiment of the present invention, after the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method may further include: sorting the degree of association data of the texts to be analyzed from high to low to obtain a degree of association ranking table; and displaying the top N degree of association data in the ranking table together with the corresponding texts to be analyzed, wherein N is a natural number.
Specifically, after the degree of association data of each text to be analyzed is calculated, the degree of association data are sorted from high to low to obtain the degree of association ranking table, and the top N (e.g., the top 3) degree of association data in the table are displayed together with their corresponding texts to be analyzed.
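The ranking step can be sketched as below. The function name `top_n` and the article identifiers are hypothetical, and apart from 2950 the scores are made-up values used only to show the ordering.

```python
def top_n(degrees, n):
    """Sort texts by degree of association, high to low, and keep the first n.

    degrees: {text_id: degree_of_association}
    """
    ranking = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
    return ranking[:n]


degrees = {"article A": 2950, "article B": 1200, "article C": 3100, "article D": 40}
print(top_n(degrees, 3))
# [('article C', 3100), ('article A', 2950), ('article B', 1200)]
```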
Through the above embodiment of the present invention, higher degree of association data indicates a stronger association between the predetermined keyword and the text to be analyzed. Displaying the N highest degree of association data and their corresponding texts to be analyzed lets people find the most relevant articles in the field of knowledge or technology to which the predetermined keyword belongs.
The above embodiment of the present invention is described in detail below with reference to Fig. 2. As shown in Fig. 2, the method may include the following steps:
Step S202, a robot crawler crawls articles on the internet from the server 80 and stores the crawled articles in a database.
In this step, the robot crawler is the web crawler of the above embodiment of the present invention and works on the same principle, which is not repeated here.
Step S204, the Y words with the highest frequency of occurrence in the database are counted as stand-alone words (i.e., the first words in the above embodiment).
The above steps S202 and S204 can be implemented by the crawler unit 20.
Step S206, the keyword to be searched for is set.
Here, the keyword is the predetermined keyword of the above embodiment of the present invention.
Step S208, according to the keyword, the X words that most frequently occur together with the keyword in the database storing the articles are obtained as co-occurrence words.
Step S210, Z extension words are calculated from the stand-alone words and the co-occurrence words, and the extension word weight of each of the Z extension words is obtained.
In this step, the extension words are the derivative keywords of the above embodiment of the present invention, and the extension word weight is the second number of the derivative keyword in the above embodiment.
The above steps S206 to S210 can be implemented by the keyword setting unit 40.
Step S212, labels and the characteristic words corresponding to each label are set.
In this embodiment, the labels are the preset text labels of the above embodiment of the present invention, and the characteristic words are the conjunctive words of those preset text labels.
Step S214, a characteristic word weight is set for each characteristic word.
Here, the characteristic word weight is the default weight, in the above embodiment of the present invention, of a conjunctive word with respect to its preset text label.
Step S216, the label score of each article on a label is calculated from the number of times each characteristic word of that label appears in the article and the characteristic word weights.
In this embodiment, the label score is the label achievement data of the above embodiment of the present invention; the label score of an article on a label can be calculated according to the first formula above, which is not repeated here.
Step S218, the degree of association data of the keyword and each article is calculated from the extension words of the keyword, their extension word weights and the label score of the article on the label corresponding to the keyword, and the degree of association data are sorted.
Specifically, step S218 is implemented in the same way as step S110, which is not repeated here.
The above steps S212 to S218 can be implemented by the label setting unit 60.
In this embodiment, X, Y and Z are natural numbers.
It should be further noted that, after the calculated degree of association data are sorted, the top N degree of association data and their corresponding articles are displayed.
Through the above embodiment of the present invention, the extension words of the keyword are obtained automatically from the crawled articles, without manual addition, and a different weight (i.e., the second number in the above embodiment of the present invention) is defined for each extension word; labels and their corresponding characteristic words are set, and a different weight (i.e., the above characteristic word weight) is defined for each characteristic word; the degree of association data of each article and the keyword is then calculated from the keyword and the label corresponding to the article (i.e., the second text label in the above embodiment of the present invention) and sorted, and the top N degree of association data and their corresponding articles can be displayed, so that people can readily find the content of the articles most relevant to the keyword.
It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in an order different from the one given here.
An embodiment of the present invention further provides a device for obtaining article degree of association data. The device can perform its functions by means of the method for obtaining article degree of association data in the above embodiment of the present invention.
Fig. 3 is a schematic diagram of a device for obtaining article degree of association data according to an embodiment of the present invention. As shown in Fig. 3, the device may include: a first obtaining module 10, a statistical module 30, a first determining module 50, a second determining module 70 and a first computing module 90.
The first obtaining module 10 is configured to obtain a predetermined keyword and multiple texts to be analyzed. The statistical module 30 is configured to count the derivative keywords of the predetermined keyword with respect to the multiple texts to be analyzed, wherein a derivative keyword is a keyword that appears in a text to be analyzed together with the predetermined keyword. The first determining module 50 is configured to determine the first number of times the predetermined keyword appears in the multiple texts to be analyzed and the second number of times the derivative keyword appears in the multiple texts to be analyzed. The second determining module 70 is configured to determine the second text label that matches the predetermined keyword among the multiple first text labels of each text to be analyzed, wherein the first text labels identify the themes of the text to be analyzed. The first computing module 90 is configured to calculate the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword.
With the embodiment of the present invention, after the predetermined keyword and the multiple texts to be analyzed are obtained, the derivative keywords of the predetermined keyword with respect to the multiple texts to be analyzed are counted, and the first number of times the predetermined keyword appears in the multiple texts to be analyzed and the second number of times the derivative keyword appears in them are determined. After the second text label that matches the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of that text, the first number of the predetermined keyword and the second number of the derivative keyword. In the embodiment of the present invention, the degree of association between a text to be analyzed and the predetermined keyword is calculated from the predetermined keyword and the text labels corresponding to the text; because the text labels identify the themes of the text to be analyzed, the degree of association between the predetermined keyword and the text can be determined accurately. The embodiment of the present invention thus solves the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately, and achieves the effect of determining it accurately.
In the above embodiment, the texts to be analyzed can be web articles crawled from the internet by a web crawler. Optionally, articles can be crawled from the internet according to a list of URLs of the pages to be crawled, or according to page level. For example, the web crawler can be configured to first crawl the content of the first-level pages of a website such as Sina, NetEase or Tencent (e.g., the content of the Sina home page), then crawl the content of the second-level pages of the website (e.g., the content reached by opening the links on the Sina home page), and so on.
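Crawling by page level can be sketched as a breadth-first traversal with a level limit. This is an illustration only, not the crawler of the embodiment: `crawl_by_level` and `fetch_links` are hypothetical names, and `fetch_links` stands in for whatever routine downloads a page and extracts its links, so the sketch runs against an in-memory link table instead of the network.

```python
from collections import deque


def crawl_by_level(start_url, fetch_links, max_level):
    """Visit pages level by level: level 1 is the start page, level 2 its links."""
    seen = {start_url}
    queue = deque([(start_url, 1)])
    visited = []
    while queue:
        url, level = queue.popleft()
        visited.append((url, level))
        if level == max_level:
            continue  # do not follow links beyond the deepest allowed level
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited


# Hypothetical link table standing in for a real website.
site = {"home": ["a", "b"], "a": ["c", "home"], "b": ["a"], "c": []}
print(crawl_by_level("home", lambda url: site.get(url, []), max_level=2))
# [('home', 1), ('a', 2), ('b', 2)]
```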
Here, URL stands for Uniform Resource Locator; it is a concise representation of the location of, and method of access to, a resource available on the internet, and is the standard address of a resource on the internet.
In the above embodiment, the crawled articles (i.e., the above texts to be analyzed) can be stored in a database.
According to the above embodiment of the present invention, the statistical module may include: a word segmentation module, configured to perform word segmentation on the multiple texts to be analyzed to obtain a set of words; a second obtaining module, configured to obtain the first quantity of each first word in the set of words, wherein the first quantity is greater than a first preset threshold; a third obtaining module, configured to obtain the second quantity of each second word in the set of words, wherein the second quantity is the total number of times the second word appears together with the predetermined keyword in the texts to be analyzed, and the second quantity is greater than a second preset threshold; a comparison module, configured to compare the second word with the first word, to take the ratio of the second quantity to the first quantity as the frequency of occurrence of the second word if the second word is identical to the first word, and to take the second quantity as the frequency of occurrence of the second word if the second word differs from the first word; and a third determining module, configured to determine the second words whose frequency of occurrence is greater than a third preset threshold as derivative keywords.
Specifically, the multiple texts to be analyzed can be segmented according to the preset words in a preset dictionary to obtain a set of words containing multiple first words. For example, if the text to be analyzed is "big data refers to not using a shortcut such as the random analysis method, and instead using all data to carry out analysis and processing", the set of words obtained after word segmentation may include the following words: "big data", "refers to", "not using", "random analysis method", "such", "shortcut", "and", "using", "all", "data", "carry out" and "analysis and processing".
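One common way to segment text against a preset dictionary is forward maximum matching; the embodiment does not say which segmentation algorithm is used, so the sketch below is an assumption chosen for illustration. The function name `segment` is hypothetical, and a string without spaces stands in for Chinese text, which has no word delimiters.

```python
def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


preset_dictionary = {"big", "bigdata", "data", "mining"}
print(segment("bigdatamining", preset_dictionary))
# ['bigdata', 'mining']
```

The longest match wins, so "bigdata" is preferred over "big" followed by "data"; characters not covered by the dictionary are emitted one at a time.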
Further, after the set of words is obtained, counting the first quantity of each first word in the set of words may include: counting the word quantity of each word in the set of words.
In the above embodiment, after the set of words is obtained, obtaining the second quantity of each second word in the set of words may include: counting, for each word obtained by segmenting the texts to be analyzed, the total number of times it occurs together with the predetermined keyword; taking the X words with the highest number of co-occurrences as the second words, wherein X is a natural number, or taking the words whose number of co-occurrences is greater than the second preset threshold as the second words; and recording the number corresponding to each second word as its second quantity.
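The two counts can be sketched with `collections.Counter`. The text does not define precisely how a co-occurrence is counted, so the sketch assumes that every occurrence of a word in a text that also contains the predetermined keyword counts once; the function names and the sample texts are hypothetical.

```python
from collections import Counter


def word_quantities(segmented_texts):
    """Total occurrences of every word across all segmented texts."""
    total = Counter()
    for words in segmented_texts:
        total.update(words)
    return total


def cooccurrence_quantities(segmented_texts, keyword):
    """Occurrences of each word, summed over texts that contain the keyword."""
    total = Counter()
    for words in segmented_texts:
        if keyword in words:
            total.update(word for word in words if word != keyword)
    return total


texts = [
    ["big data", "data mining", "China", "data mining"],
    ["China", "finance"],
    ["big data", "China"],
]
print(word_quantities(texts).most_common(1))          # [('China', 3)]
print(cooccurrence_quantities(texts, "big data")["data mining"])  # 2
```

Calling `most_common(X)` on the second counter gives the X most frequent co-occurring words, corresponding to the first of the two selection options described above.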
After the multiple first words and the multiple second words are determined, each second word is compared with each first word. If a second word is identical to a first word, the ratio of the second quantity to the first quantity is taken as the frequency of occurrence of the second word; if a second word differs from the first words, the second quantity is taken as its frequency of occurrence. The second words whose frequency of occurrence is greater than the third preset threshold, or the Z second words with the highest frequency of occurrence, are taken as the derivative keywords.
Through the above embodiment of the present invention, the derivative keywords of the predetermined keyword can be determined automatically from the multiple texts to be analyzed, without derivative keywords being added manually for the predetermined keyword, which improves the accuracy of the determined derivative keywords. Because the influence of the derivative keywords is also taken into account when the degree of association data of the predetermined keyword and a text to be analyzed is calculated, the accuracy of the calculated degree of association data is improved.
In an optional embodiment, the articles crawled from the internet (i.e., the texts to be analyzed) can be stored in a database and segmented into words. The Y words with the highest frequency of occurrence across all articles stored in the database are calculated as stand-alone words (i.e., the first words in the above embodiment), and the number of occurrences a of each stand-alone word (i.e., the above first quantity) is recorded. According to the predetermined keyword, the X words that most frequently occur together with the predetermined keyword are retrieved from the database as co-occurrence words (i.e., the above second words), and the count b of each co-occurrence word (i.e., the above second quantity) is recorded. The stand-alone words and the co-occurrence words are then compared: if a co-occurrence word is identical to one of the stand-alone words, its frequency of occurrence is recorded as b/a; if it differs from every stand-alone word, its frequency of occurrence remains b. The Z words with the highest frequency of occurrence are then taken as the derivative keywords of the predetermined keyword.
In embodiments of the present invention, a, b and Z are natural numbers.
In the above embodiment of the present invention, words without special significance can be removed from the words that occur together with the predetermined keyword. For example, an article may be about applications of big data yet repeatedly mention "scientific and technical article" or "China", words that have no particular relevance to "big data" (for instance, the article repeatedly states that it was published in a certain Chinese scientific and technical publication). By comparing the first words, which occur most frequently across the articles, with the second words, which co-occur most frequently with the predetermined keyword, and correcting the frequency of occurrence of each second word using the first quantity of the matching first word, words of no particular relevance are removed from the second words.
Further, the first determining module 50 may be configured to: take the word quantity of the predetermined keyword in the above embodiment as the first number of the predetermined keyword; and take the frequency of occurrence of the derivative keyword in the above embodiment as the second number of the derivative keyword.
Specifically, after the word quantity of each word in the set of words has been counted, the word quantity corresponding to the predetermined keyword is taken as the first number of the predetermined keyword; after the derivative keywords are determined, the frequency of occurrence corresponding to each derivative keyword is recorded as the second number of that derivative keyword.
In the above embodiment of the present invention, the device may further include: a fourth obtaining module, configured to obtain, before the second text label that matches the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, preset text labels and the conjunctive words of the preset text labels, wherein the preset text labels include the first text labels and each preset text label corresponds to at least one conjunctive word; a traversal module, configured to traverse the multiple texts to be analyzed to obtain the conjunctive words contained in each text to be analyzed; and a lookup module, configured to look up the preset text labels corresponding to the conjunctive words contained in each text to be analyzed, as the multiple first text labels.
Specifically, the preset text labels and their conjunctive words can be obtained from a preset tag library, in which the preset text labels and their corresponding conjunctive words are stored and each preset text label corresponds to at least one conjunctive word. The multiple texts to be analyzed are traversed to obtain the conjunctive words contained in each text to be analyzed. For each text to be analyzed, the preset text labels corresponding to the conjunctive words it contains are looked up and set as the first text labels of that text. In this embodiment, since one preset text label can correspond to multiple conjunctive words, the number of first text labels determined for a text to be analyzed is no greater than the number of conjunctive words that text contains.
It should be further noted that a conjunctive word can correspond to only one preset text label.
Through the above embodiment of the present invention, each text to be analyzed is matched against the preset text labels and conjunctive words stored in the preset tag library to determine the multiple first text labels of that text. Each first text label identifies one theme of the text to be analyzed, so the multiple first text labels accurately identify every theme the text involves. When the degree of association between the predetermined keyword and a text to be analyzed is calculated, the second text label that matches the predetermined keyword is first determined among the first text labels, and the degree of association data between the predetermined keyword and the text to be analyzed (e.g., an article on the internet) is then calculated from this second text label, which reflects the theme of the text. This avoids the poor accuracy of the prior art, in which the degree of association between an article and a keyword is determined from whether the keyword appears in the article, where it appears, and how often it appears, and it improves the accuracy of the calculated degree of association data.
According to the above embodiment of the present invention, the device may further include: a second computing module, configured to calculate, before the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the label achievement data A of each first text label according to a first formula, wherein the first formula is $A = \sum_{i=1}^{n} B_i \times b_i$, n is the number of conjunctive words corresponding to the first text label, $B_i$ is the number of times the i-th conjunctive word of the first text label appears in a text to be analyzed, and $b_i$ is the default weight of the i-th conjunctive word with respect to the first text label.
Specifically, before the degree of association data of the predetermined keyword and each text to be analyzed is calculated, the label achievement data of each first text label is calculated. For a first text label, the default weight of each of its conjunctive words is read from the default tag library, and the number of times each conjunctive word occurs in the text to be analyzed is counted. The frequency of occurrence of each conjunctive word is multiplied by its default weight, and the products are summed to obtain the label achievement data of that first text label.
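As a rough sketch of this weighted sum, the following Python function multiplies each conjunctive word's occurrence count by its default weight and adds the products. The function name, the example weights, and the substring-based counting are assumptions for illustration only.

```python
from typing import Dict


def label_achievement(text: str, weights: Dict[str, float]) -> float:
    """Sum, over the label's conjunctive words, of occurrences times default weight."""
    # text.count is a plain non-overlapping substring count (an assumption;
    # a real system would count segmented words).
    return sum(text.count(word) * weight for word, weight in weights.items())


# Hypothetical default weights of the conjunctive words of one first text label.
automobile_weights = {"engine": 0.5, "sedan": 2.0, "coupe": 1.0}
score = label_achievement("engine engine sedan", automobile_weights)
```

Here "engine" occurs twice with weight 0.5, "sedan" once with weight 2.0, and "coupe" not at all, so the label achievement data is 2 × 0.5 + 1 × 2.0 + 0 × 1.0 = 3.0.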
Through the above embodiment of the present invention, the label achievement data of each first text label of each text to be analyzed is calculated. The label achievement data reflects how relevant a theme is to the text to be analyzed (e.g., an article on the internet): the larger the label achievement data of a first text label, the more relevant the theme corresponding to that label is to the text.
In the above embodiment of the present invention, the first computing module may include: a fourth determining module, configured to take, as third words, the conjunctive words of the second text label that are identical to derivative keywords; and a computational submodule, configured to calculate the degree of association data G of each text to be analyzed according to the second formula. The second formula is [formula not reproduced], where K is the first number of the predetermined keyword, C is the label achievement data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the default weight of the second text label, m is the number of third words, $k_j$ is the second number of the derivative keyword corresponding to the j-th third word, $E_j$ is the third number, that is, the number of times the j-th third word appears in one text to be analyzed, and $e_j$ is the default weight of the j-th third word with respect to the second text label.
Specifically, after the second text label identical to the predetermined keyword is determined among the multiple first text labels, the label achievement data of the second text label is taken from the calculated label achievement data of the multiple first text labels, and each conjunctive word of the second text label that is identical to a derivative keyword is taken as a third word. The degree of association data between the predetermined keyword and the text to be analyzed is then calculated according to the second formula from the label achievement data of the second text label, the first number of the predetermined keyword, and the second number of the derivative keyword.
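The selection of third words can be sketched in Python as below. The sketch only gathers the per-word quantities $k_j$, $E_j$ and $e_j$ that the second formula takes as inputs; it does not compute G, since the second formula itself is not available in this text. The function name, the example data, and the substring-based counting are assumptions.

```python
from typing import Dict, List, Tuple


def third_word_inputs(
    text: str,
    label_weights: Dict[str, float],
    derivative_counts: Dict[str, int],
) -> List[Tuple[str, int, int, float]]:
    """Return (third word, k_j, E_j, e_j) for each conjunctive word of the
    second text label that is identical to a derivative keyword."""
    rows = []
    for word, weight in label_weights.items():
        if word in derivative_counts:  # conjunctive word equals a derivative keyword
            k_j = derivative_counts[word]  # second number of that derivative keyword
            e_count = text.count(word)     # E_j: occurrences in this one text
            rows.append((word, k_j, e_count, weight))
    return rows


# Hypothetical data: weights of the second text label's conjunctive words, and
# the second numbers of the derivative keywords across all texts.
rows = third_word_inputs(
    "engine engine sedan",
    {"engine": 0.5, "sedan": 2.0, "coupe": 1.0},
    {"engine": 7, "tyre": 3},
)
```

In this example only "engine" is both a conjunctive word of the label and a derivative keyword, so m, the number of third words, is 1.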
Through the above embodiment of the present invention, the calculation of the degree of association data between the predetermined keyword and a text to be analyzed takes into account both the derivative keywords of the predetermined keyword with respect to the texts to be analyzed and the theme of the text to be analyzed. This avoids the poor accuracy of the prior art, which determines the degree of association between an article and a keyword from whether the keyword occurs in the article, where it occurs, and how often it occurs, and it improves the accuracy of the calculated degree of association data.
According to the above embodiment of the present invention, the device may further include: a sorting module, configured to sort the degree of association data of the texts to be analyzed in descending order to obtain a degree of association ranking table, after the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword; and a display module, configured to display the top N degree of association data in the ranking table and the corresponding texts to be analyzed, where N is a natural number.
Specifically, after the degree of association data of each text to be analyzed is calculated, the degree of association data are sorted in descending order to obtain a degree of association ranking table, and the top N (e.g., the top 3) degree of association data in the table are displayed together with their corresponding texts to be analyzed.
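The sorting and display step amounts to a descending sort followed by a slice. A minimal Python sketch, assuming the degree of association data is held in a dictionary keyed by a hypothetical text identifier:

```python
from typing import Dict, List, Tuple


def top_n(association: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Sort texts by degree of association, highest first, and keep the first n."""
    ranked = sorted(association.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical degree of association data for four texts to be analyzed.
scores = {"text_a": 0.2, "text_b": 0.9, "text_c": 0.5, "text_d": 0.1}
top_three = top_n(scores, 3)
```

With these made-up scores and N = 3, the ranking table starts with text_b, then text_c, then text_a.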
Through the above embodiment of the present invention, higher degree of association data indicates a greater degree of association between the predetermined keyword and the text to be analyzed. Displaying the N highest degree of association data and their corresponding texts to be analyzed lets people find the most relevant articles in the knowledge or technical field corresponding to the predetermined keyword.
The modules provided in this embodiment are used in the same way as the corresponding steps of the method embodiment, and their application scenarios may also be the same. It should of course be noted that the schemes involving the above modules are not limited to the content and scenarios of the above embodiment, and that the above modules may run on a terminal or a mobile terminal and may be implemented in software or hardware.
It can be seen from the above description that the present invention achieves the following technical effects:
With the embodiments of the present invention, after the predetermined keyword and the multiple texts to be analyzed are obtained, the derivative keywords of the predetermined keyword with respect to the multiple texts to be analyzed are counted, and the first number of times the predetermined keyword appears in the multiple texts to be analyzed and the second number of times the derivative keyword appears in them are determined. After the second text label matching the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text, the first number of the predetermined keyword, and the second number of the derivative keyword. The embodiments thus combine the predetermined keyword with the text labels of the text to be analyzed when calculating the degree of association. Because the text labels identify the themes of the text to be analyzed, the degree of association between the predetermined keyword and the text can be determined accurately. This solves the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately, and achieves the effect of determining it accurately.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented with a general-purpose computing device. They may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. Alternatively, they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. For those skilled in the art, the invention may be modified and varied in various ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (8)
1. A method for obtaining article degree of association data, characterized by comprising:
obtaining a predetermined keyword and multiple texts to be analyzed;
counting the derivative keywords of the predetermined keyword with respect to the multiple texts to be analyzed, wherein a derivative keyword is a keyword that appears in one text to be analyzed together with the predetermined keyword;
determining a first number of times the predetermined keyword appears in the multiple texts to be analyzed and a second number of times the derivative keyword appears in the multiple texts to be analyzed;
determining, among multiple first text labels of each text to be analyzed, a second text label that matches the predetermined keyword, wherein the first text labels are used to identify the themes of the text to be analyzed;
calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword;
wherein, before calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword, the method further comprises:
calculating the label achievement data A of each first text label according to a first formula, wherein
the first formula is: $A = \sum_{i=1}^{n} B_i \times b_i$,
n is the number of conjunctive words corresponding to the first text label, $B_i$ is the number of times the i-th conjunctive word corresponding to the first text label appears in one text to be analyzed, and $b_i$ is the default weight of the i-th conjunctive word with respect to the first text label;
wherein calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword comprises:
determining the label achievement data of the second text label from the calculated label achievement data of the multiple first text labels;
taking, as third words, the conjunctive words of the second text label that are identical to the derivative keywords;
calculating the degree of association data G of each text to be analyzed according to a second formula, wherein
the second formula is [formula not reproduced],
K is the first number of the predetermined keyword, C is the label achievement data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the default weight of the second text label, m is the number of third words, $k_j$ is the second number of the derivative keyword corresponding to the j-th third word, $E_j$ is the third number of times the j-th third word appears in one text to be analyzed, and $e_j$ is the default weight of the j-th third word with respect to the second text label;
wherein the label achievement data is used to reflect the relevance of the theme of the text to be analyzed.
2. The method according to claim 1, characterized in that counting the derivative keywords of the predetermined keyword with respect to the multiple texts to be analyzed comprises:
performing word segmentation on the multiple texts to be analyzed to obtain a set of words;
obtaining a first quantity of each first word in the set of words, wherein the first quantity is greater than a first preset threshold;
obtaining a second quantity of each second word in the set of words, wherein the second quantity is the total number of times one second word and the predetermined keyword appear together in each text to be analyzed, and the second quantity is greater than a second preset threshold;
comparing the second word with the first word; if the second word is identical to the first word, taking the ratio of the second quantity to the first quantity as the frequency of occurrence of the second word; if the second word differs from the first word, taking the second quantity as the frequency of occurrence of the second word;
taking each second word whose frequency of occurrence is greater than a third preset threshold as a derivative keyword.
3. The method according to claim 1, characterized in that, before determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, the method further comprises:
obtaining pre-set text labels and the conjunctive words of the pre-set text labels, wherein the pre-set text labels include the first text labels, and each pre-set text label corresponds to at least one conjunctive word;
traversing the multiple texts to be analyzed to obtain the multiple conjunctive words that each text to be analyzed contains;
looking up the multiple pre-set text labels corresponding to the conjunctive words that each text to be analyzed contains, and taking them as the multiple first text labels.
4. The method according to claim 1, characterized in that, after calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword, the method further comprises:
sorting the degree of association data of the texts to be analyzed in descending order to obtain a degree of association ranking table;
displaying the top N degree of association data in the degree of association ranking table and the corresponding texts to be analyzed, wherein N is a natural number.
5. A device for obtaining article degree of association data, characterized by comprising:
a first obtaining module, configured to obtain a predetermined keyword and multiple texts to be analyzed;
a statistical module, configured to count the derivative keywords of the predetermined keyword with respect to the multiple texts to be analyzed, wherein a derivative keyword is a keyword that appears in one text to be analyzed together with the predetermined keyword;
a first determining module, configured to determine a first number of times the predetermined keyword appears in the multiple texts to be analyzed and a second number of times the derivative keyword appears in the multiple texts to be analyzed;
a second determining module, configured to determine, among multiple first text labels of each text to be analyzed, a second text label that matches the predetermined keyword, wherein the first text labels are used to identify the themes of the text to be analyzed;
a first computing module, configured to calculate the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword;
wherein the device further comprises:
a second computing module, configured to calculate the label achievement data A of each first text label according to a first formula, before the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword, wherein
the first formula is: $A = \sum_{i=1}^{n} B_i \times b_i$,
n is the number of conjunctive words corresponding to the first text label, $B_i$ is the number of times the i-th conjunctive word corresponding to the first text label appears in one text to be analyzed, and $b_i$ is the default weight of the i-th conjunctive word with respect to the first text label;
wherein the first computing module comprises:
a fourth determining module, configured to determine the label achievement data of the second text label from the calculated label achievement data of the multiple first text labels, and to take, as third words, the conjunctive words of the second text label that are identical to the derivative keywords;
a computational submodule, configured to calculate the degree of association data G of each text to be analyzed according to a second formula, wherein
the second formula is [formula not reproduced],
K is the first number of the predetermined keyword, C is the label achievement data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the default weight of the second text label, m is the number of third words, $k_j$ is the second number of the derivative keyword corresponding to the j-th third word, $E_j$ is the third number of times the j-th third word appears in one text to be analyzed, and $e_j$ is the default weight of the j-th third word with respect to the second text label;
wherein the label achievement data is used to reflect the relevance of the theme of the text to be analyzed.
6. The device according to claim 5, characterized in that the statistical module comprises:
a word segmentation module, configured to perform word segmentation on the multiple texts to be analyzed to obtain a set of words;
a second obtaining module, configured to obtain a first quantity of each first word in the set of words, wherein the first quantity is greater than a first preset threshold;
a third obtaining module, configured to obtain a second quantity of each second word in the set of words, wherein the second quantity is the total number of times one second word and the predetermined keyword appear together in each text to be analyzed, and the second quantity is greater than a second preset threshold;
a comparison module, configured to compare the second word with the first word; if the second word is identical to the first word, to take the ratio of the second quantity to the first quantity as the frequency of occurrence of the second word; and if the second word differs from the first word, to take the second quantity as the frequency of occurrence of the second word;
a third determining module, configured to take each second word whose frequency of occurrence is greater than a third preset threshold as a derivative keyword.
7. The device according to claim 5, characterized in that the device further comprises:
a fourth obtaining module, configured to obtain pre-set text labels and the conjunctive words of the pre-set text labels before the second text label that matches the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, wherein the pre-set text labels include the first text labels, and each pre-set text label corresponds to at least one conjunctive word;
a traversal module, configured to traverse the multiple texts to be analyzed to obtain the multiple conjunctive words that each text to be analyzed contains;
a searching module, configured to look up the multiple pre-set text labels corresponding to the conjunctive words that each text to be analyzed contains, and to take them as the multiple first text labels.
8. The device according to any one of claims 5 to 7, characterized in that the device further comprises:
a sorting module, configured to sort the degree of association data of the texts to be analyzed in descending order to obtain a degree of association ranking table, after the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword;
a display module, configured to display the top N degree of association data in the degree of association ranking table and the corresponding texts to be analyzed, wherein N is a natural number.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510114670.0A CN106033445B (en) | 2015-03-16 | 2015-03-16 | The method and apparatus for obtaining article degree of association data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510114670.0A CN106033445B (en) | 2015-03-16 | 2015-03-16 | The method and apparatus for obtaining article degree of association data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106033445A CN106033445A (en) | 2016-10-19 |
CN106033445B true CN106033445B (en) | 2019-10-25 |
Family
ID=57150156
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510114670.0A Active CN106033445B (en) | 2015-03-16 | 2015-03-16 | The method and apparatus for obtaining article degree of association data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106033445B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108241986B (en) * | 2016-12-23 | 2021-12-24 | 北京国双科技有限公司 | Data processing method and terminal |
CN106934004A (en) * | 2017-03-07 | 2017-07-07 | 广州优视网络科技有限公司 | A kind of method and apparatus for recommending article to user based on regional feature |
CN106919711B (en) * | 2017-03-13 | 2020-10-02 | 北京百度网讯科技有限公司 | Method and device for labeling information based on artificial intelligence |
CN110020029B (en) * | 2017-09-30 | 2021-09-07 | 北京国双科技有限公司 | Method and device for acquiring correlation between document and query term |
CN110309312B (en) * | 2018-03-09 | 2022-02-11 | 北京国双科技有限公司 | Associated event acquisition method and device |
CN108959431B (en) * | 2018-06-11 | 2022-07-05 | 中国科学院上海高等研究院 | Automatic label generation method, system, computer readable storage medium and equipment |
CN110032639B (en) * | 2018-12-27 | 2023-10-31 | 中国银联股份有限公司 | Method, device and storage medium for matching semantic text data with tag |
CN111275327A (en) * | 2020-01-19 | 2020-06-12 | 深圳前海微众银行股份有限公司 | Resource allocation method, device, equipment and storage medium |
CN111598526B (en) * | 2020-04-21 | 2023-02-03 | 奇计(江苏)科技服务有限公司 | Intelligent comparison review method for describing scientific and technological innovation content |
CN114282092A (en) * | 2021-12-07 | 2022-04-05 | 咪咕音乐有限公司 | Information processing method, device, equipment and computer readable storage medium |
CN116228265A (en) * | 2023-03-24 | 2023-06-06 | 北京中诺链捷数字科技有限公司 | Invoice risk identification method, device and equipment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101295319A (en) * | 2008-06-24 | 2008-10-29 | 北京搜狗科技发展有限公司 | Method and device for expanding query, search engine system |
CN103235773A (en) * | 2013-04-26 | 2013-08-07 | 百度在线网络技术(北京)有限公司 | Method and device for extracting text labels based on keywords |
CN103631874A (en) * | 2013-11-07 | 2014-03-12 | 微梦创科网络科技(中国)有限公司 | UGC label classification determining method and device for social platform |
CN103761249A (en) * | 2013-12-24 | 2014-04-30 | 北京恒华伟业科技股份有限公司 | Data importing method and system based on data matching |
CN103793387A (en) * | 2012-10-29 | 2014-05-14 | 腾讯科技(深圳)有限公司 | Thematic word relevance processing method and system and thematic word recommendation method and system |
CN104216995A (en) * | 2014-09-10 | 2014-12-17 | 北京金山安全软件有限公司 | Information processing method and device |
CN104375989A (en) * | 2014-12-01 | 2015-02-25 | 国家电网公司 | Natural language text keyword association network construction system |
Also Published As
Publication number | Publication date |
---|---|
CN106033445A (en) | 2016-10-19 |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | |