CN106033445B

CN106033445B - Method and apparatus for obtaining article degree of association data

Info

Publication number: CN106033445B
Application number: CN201510114670.0A
Authority: CN
Inventors: 陈俊宏; 余德乐; 杨韬; 赵冬玲
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2015-03-16
Filing date: 2015-03-16
Publication date: 2019-10-25
Anticipated expiration: 2035-03-16
Also published as: CN106033445A

Abstract

The invention discloses a method and apparatus for obtaining article degree of association data. The method comprises: obtaining a predetermined keyword and multiple texts to be analyzed, wherein each text to be analyzed corresponds to multiple first text labels; counting the derivative keywords of the predetermined keyword over the multiple texts to be analyzed; determining a first number of times the predetermined keyword appears in the multiple texts to be analyzed and a second number of times each derivative keyword appears in the multiple texts to be analyzed; determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword; and calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords. The invention solves the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately, and achieves the effect of determining that degree of association accurately.

Description

Method and apparatus for obtaining article degree of association data

Technical field

The present invention relates to the internet field, and in particular to a method and apparatus for obtaining article degree of association data.

Background

In the prior art, methods that search for associated articles by keyword and sort them rely simply on whether the keyword appears in an article, where it appears, and how many times it appears. Such methods resemble keyword search in search engines such as Baidu or Google. For example, if the keyword is "big data", then when searching for articles associated with "big data", the degree of association between an article and "big data" is determined from whether "big data" appears in the article, the number of times it appears, and the positions at which it appears, and the associated articles are sorted from high to low by degree of association. The weight differs with the position at which the keyword appears: for example, the headline has the highest weight, the body text the next highest, and advertisements the lowest. However, this prior-art method of determining and sorting associated articles by keyword cannot accurately reflect the relevance between the theme of an article and the keyword.

For example, when searching for articles associated with the keyword "big data", consider a news item reporting that the author of a big data article attended a dinner party. The word "big data" appears many times, but the theme of the news item is the author attending the dinner party, not "big data". Under the prior art, because "big data" appears frequently, the item would nevertheless be determined to be an article associated with the keyword "big data". The prior art therefore cannot accurately reflect the relevance between an article's theme and the keyword.

As another example, when the keyword is "big data", an article may discuss "big data" throughout without ever mentioning the word "big data" itself; such an article cannot be retrieved and included in the sorting.

As a further example, with the keyword still "big data", article A mentions "big data" 1 time and "big data" 5 times, while article B mentions "big data" 2 times and "data mining" 4 times. According to the above prior art, the degree of association between article B and the keyword "big data" should be higher than that of article A, but this is obviously not the case.

No effective solution has yet been proposed for the problem that the degree of association between an article and a keyword cannot be determined accurately in the prior art.

Summary of the invention

The main purpose of the present invention is to provide a method and apparatus for obtaining article degree of association data, so as to solve the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately.

To achieve the above purpose, according to one aspect of the embodiments of the present invention, a method for obtaining article degree of association data is provided. The method comprises: obtaining a predetermined keyword and multiple texts to be analyzed; counting the derivative keywords of the predetermined keyword over the multiple texts to be analyzed, wherein a derivative keyword is a keyword that appears in the same text to be analyzed as the predetermined keyword; determining a first number of times the predetermined keyword appears in the multiple texts to be analyzed and a second number of times each derivative keyword appears in the multiple texts to be analyzed; determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, wherein the first text labels identify the themes of the text to be analyzed; and calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords.

Further, counting the derivative keywords of the predetermined keyword over the multiple texts to be analyzed comprises: performing word segmentation on the multiple texts to be analyzed to obtain a set of words; obtaining a first quantity for each first word in the set of words, wherein the first quantity is greater than a first preset threshold; obtaining a second quantity for each second word in the set of words, wherein the second quantity is the total number of times a second word appears together with the predetermined keyword across the texts to be analyzed, and the second quantity is greater than a second preset threshold; comparing the second words with the first words and, if a second word is identical to a first word, taking the ratio of its second quantity to that first quantity as the frequency of occurrence of the second word, or, if the second word differs from the first words, taking its second quantity as the frequency of occurrence of the second word; and taking the second words whose frequency of occurrence is greater than a third preset threshold as derivative keywords.

Further, before determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, the method further comprises: obtaining pre-set text labels and the conjunctive words of the pre-set text labels, wherein the pre-set text labels include the first text labels and each pre-set text label corresponds to at least one conjunctive word; traversing the multiple texts to be analyzed to obtain the conjunctive words that each text to be analyzed contains; and looking up the pre-set text labels corresponding to the conjunctive words that each text to be analyzed contains, as the multiple first text labels of that text.

Further, before calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords, the method further comprises: calculating the label achievement data A of each first text label according to a first formula, wherein, in the first formula, n is the number of conjunctive words corresponding to the first text label, B_i is the number of times the i-th conjunctive word of the first text label appears in one text to be analyzed, and b_i is the default weight of the i-th conjunctive word within its corresponding first text label.
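The first formula itself is not reproduced in this text. One reading that is consistent with the definitions of n, B_i and b_i is a weighted sum over the label's conjunctive words, A = B_1·b_1 + ... + B_n·b_n. The sketch below implements that reading only; the weighted-sum form, the function name `label_index`, and the weights are assumptions for illustration, not the confirmed formula.

```python
def label_index(counts, weights):
    """Assumed form of the label achievement data A: sum of B_i * b_i."""
    # counts[word]  -> B_i: occurrences of the conjunctive word in one text
    # weights[word] -> b_i: default weight of that conjunctive word for the label
    return sum(counts.get(word, 0) * weight for word, weight in weights.items())


# Made-up weights and counts, purely for illustration.
weights = {"data mining": 0.5, "data application": 0.25}
counts = {"data mining": 4, "data application": 2}
print(label_index(counts, weights))  # 4 * 0.5 + 2 * 0.25
```

A conjunctive word that does not occur in the text simply contributes zero under this reading.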

Further, calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords comprises: taking the conjunctive words of the second text label that are identical to derivative keywords as third words; and calculating the degree of association data G of each text to be analyzed according to a second formula, wherein, in the second formula, K is the first number of the predetermined keyword, C is the label achievement data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the default weight of the second text label, m is the number of third words, k_j is the second number of the derivative keyword corresponding to the j-th third word, E_j is the third number, that is, the number of times the j-th third word appears in one text to be analyzed, and e_j is the default weight of the j-th third word within its corresponding second text label.

Further, after calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords, the method further comprises: sorting the degree of association data of the texts to be analyzed from high to low to obtain a degree of association ranking table; and displaying the top N degree of association data in the ranking table together with the corresponding texts to be analyzed, wherein N is a natural number.
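The sorting and display step can be sketched as follows. The function name `top_n` and the score values are hypothetical; the scores stand in for degree of association data that has already been computed.

```python
def top_n(scores, n):
    """Sort texts by degree of association, high to low, and keep the first n."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical degree of association data for three texts to be analyzed.
scores = {"text A": 3.2, "text B": 7.5, "text C": 5.1}
for name, score in top_n(scores, 2):
    print(name, score)
```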

To achieve the above purpose, according to another aspect of the embodiments of the present invention, an apparatus for obtaining article degree of association data is provided. The apparatus comprises: a first obtaining module, for obtaining a predetermined keyword and multiple texts to be analyzed; a statistical module, for counting the derivative keywords of the predetermined keyword over the multiple texts to be analyzed, wherein a derivative keyword is a keyword that appears in the same text to be analyzed as the predetermined keyword; a first determining module, for determining a first number of times the predetermined keyword appears in the multiple texts to be analyzed and a second number of times each derivative keyword appears in the multiple texts to be analyzed; a second determining module, for determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, wherein the first text labels identify the themes of the text to be analyzed; and a first computing module, for calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords.

Further, the statistical module comprises: a word segmentation module, for performing word segmentation on the multiple texts to be analyzed to obtain a set of words; a second obtaining module, for obtaining a first quantity for each first word in the set of words, wherein the first quantity is greater than a first preset threshold; a third obtaining module, for obtaining a second quantity for each second word in the set of words, wherein the second quantity is the total number of times a second word appears together with the predetermined keyword across the texts to be analyzed, and the second quantity is greater than a second preset threshold; a comparison module, for comparing the second words with the first words and, if a second word is identical to a first word, taking the ratio of its second quantity to that first quantity as the frequency of occurrence of the second word, or, if the second word differs from the first words, taking its second quantity as the frequency of occurrence of the second word; and a third determining module, for taking the second words whose frequency of occurrence is greater than a third preset threshold as derivative keywords.

Further, the apparatus further comprises: a fourth obtaining module, for obtaining, before the second text label that matches the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, pre-set text labels and the conjunctive words of the pre-set text labels, wherein the pre-set text labels include the first text labels and each pre-set text label corresponds to at least one conjunctive word; a traversal module, for traversing the multiple texts to be analyzed to obtain the conjunctive words that each text to be analyzed contains; and a lookup module, for looking up the pre-set text labels corresponding to the conjunctive words that each text to be analyzed contains, as the multiple first text labels of that text.

Further, the apparatus further comprises: a second computing module, for calculating, before the degree of association data between the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords, the label achievement data A of each first text label according to a first formula, wherein, in the first formula, n is the number of conjunctive words corresponding to the first text label, B_i is the number of times the i-th conjunctive word of the first text label appears in one text to be analyzed, and b_i is the default weight of the i-th conjunctive word within its corresponding first text label.

Further, the first computing module comprises: a fourth determining module, for taking the conjunctive words of the second text label that are identical to derivative keywords as third words; and a computing submodule, for calculating the degree of association data G of each text to be analyzed according to a second formula, wherein, in the second formula, K is the first number of the predetermined keyword, C is the label achievement data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the default weight of the second text label, m is the number of third words, k_j is the second number of the derivative keyword corresponding to the j-th third word, E_j is the third number, that is, the number of times the j-th third word appears in one text to be analyzed, and e_j is the default weight of the j-th third word within its corresponding second text label.

Further, the apparatus further comprises: a sorting module, for sorting, after the degree of association data between the predetermined keyword and each text to be analyzed has been calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords, the degree of association data of the texts to be analyzed from high to low to obtain a degree of association ranking table; and a display module, for displaying the top N degree of association data in the ranking table together with the corresponding texts to be analyzed, wherein N is a natural number.

With the embodiments of the present invention, after the predetermined keyword and the multiple texts to be analyzed are obtained, the derivative keywords of the predetermined keyword over the multiple texts to be analyzed are counted, and the first number of times the predetermined keyword appears in the multiple texts to be analyzed and the second number of times each derivative keyword appears in the multiple texts to be analyzed are determined. After the second text label that matches the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, the degree of association data between the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords. The degree of association between a text to be analyzed and the predetermined keyword is thus calculated by combining the predetermined keyword with the text labels of the text to be analyzed. Because the text labels identify the themes of the text to be analyzed, the degree of association between the predetermined keyword and the text to be analyzed can be determined accurately. The embodiments of the present invention therefore solve the prior-art problem that the degree of association between an article and a keyword cannot be determined accurately, and achieve the effect of determining that degree of association accurately.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flow chart of the method for obtaining article degree of association data according to an embodiment of the present invention;

Fig. 2 is a flow chart of an optional method for obtaining article degree of association data according to an embodiment of the present invention; and

Fig. 3 is a schematic diagram of the apparatus for obtaining article degree of association data according to an embodiment of the present invention.

Detailed description of the embodiments

It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments can be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and the embodiments.

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative work shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present invention described herein can be implemented. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.

An embodiment of the present invention provides a method for obtaining article degree of association data.

Fig. 1 is a flow chart of the method for obtaining article degree of association data according to an embodiment of the present invention. As shown in Fig. 1, the method may include the following steps:

Step S102: obtain a predetermined keyword and multiple texts to be analyzed.

Step S104: count the derivative keywords of the predetermined keyword over the multiple texts to be analyzed.

Here, a derivative keyword is a keyword that appears in the same text to be analyzed as the predetermined keyword.

Step S106: determine a first number of times the predetermined keyword appears in the multiple texts to be analyzed and a second number of times each derivative keyword appears in the multiple texts to be analyzed.

Step S108: determine, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword.

Here, the first text labels identify the themes of the text to be analyzed.

Step S110: calculate the degree of association data between the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keywords.

In the above embodiment, the texts to be analyzed can be web articles crawled from the internet by a web crawler. Optionally, articles can be crawled from the internet according to a list of URLs of the pages to be crawled, or according to page level. For example, the web crawler can be configured to first crawl the content on the first-level pages of a website (e.g., Sina, Netease or Tencent), such as the content on the Sina homepage, and then crawl the content on the second-level pages of the website, such as the content reached by opening the links on the Sina homepage, and so on.

Here, URL stands for Uniform Resource Locator. It is a concise representation of the location of a resource available on the internet and of the method for accessing it, and it is the standard address of a resource on the internet.

In the above embodiment, the crawled articles (i.e., the above texts to be analyzed) can be stored in a database.
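Crawling by page level amounts to a breadth-first traversal that stops at a maximum depth. The sketch below illustrates this on an in-memory link graph rather than on live pages; the function name `crawl_by_level`, the page names and the `links` mapping are all hypothetical, and fetching and parsing real pages is deliberately left out.

```python
from collections import deque


def crawl_by_level(start, links, max_level):
    """Map each page reachable within max_level to its level (1 = start page)."""
    levels = {start: 1}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if levels[page] == max_level:
            continue  # do not follow links beyond the deepest allowed level
        for target in links.get(page, []):
            if target not in levels:  # visit every page only once
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels


# Hypothetical site structure: a homepage linking to two second-level pages.
links = {"home": ["news", "sports"], "news": ["article-1"], "sports": []}
print(crawl_by_level("home", links, 2))
```

With `max_level=2` the third-level page `article-1` is not visited, which mirrors crawling only the homepage and the pages it links to.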

According to the above embodiment of the present invention, counting the derivative keywords of the predetermined keyword over the multiple texts to be analyzed can include: performing word segmentation on the multiple texts to be analyzed to obtain a set of words; obtaining a first quantity for each first word in the set of words, wherein the first quantity is greater than a first preset threshold; obtaining a second quantity for each second word in the set of words, wherein the second quantity is the total number of times a second word appears together with the predetermined keyword across the texts to be analyzed, and the second quantity is greater than a second preset threshold; comparing the second words with the first words and, if a second word is identical to a first word, taking the ratio of its second quantity to that first quantity as the frequency of occurrence of the second word, or, if the second word differs from the first words, taking its second quantity as the frequency of occurrence of the second word; and taking the second words whose frequency of occurrence is greater than a third preset threshold as derivative keywords.

Specifically, the multiple texts to be analyzed can be segmented according to the preset words in a preset dictionary to obtain a set of words that includes multiple first words. For example, if the text to be analyzed is "big data refers to not using a shortcut such as the random analysis method, but using all the data for analysis and processing", the set of words obtained after word segmentation may include the following words: "big data", "refers to", "not using", "random analysis method", "such", "shortcut", "but", "using", "all", "data", "performing" and "analysis and processing".
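The text states only that segmentation uses the preset words in a preset dictionary; it does not name an algorithm. One common dictionary-based approach is forward maximum matching, sketched below as an assumption. The function name `segment`, the dictionary contents and the unspaced sample string are illustrative placeholders.

```python
def segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position."""
    max_len = max(len(word) for word in dictionary)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                # A single unmatched character is emitted as its own word.
                words.append(piece)
                i += size
                break
    return words


# Hypothetical preset dictionary and an unspaced input string.
dictionary = {"big", "bigdata", "data", "analysis"}
print(segment("bigdataanalysis", dictionary))
```

Longest-match-first is why `bigdata` is preferred over the shorter dictionary entry `big` in this sample.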

Further, after the set of words is obtained, obtaining the first quantity of each first word in the set of words can include: counting the word quantity of each word in the set of words. For example, "scientific and technical article" occurs 100 times in the set of words (i.e., its word quantity is 100), "China" 90 times (word quantity 90), "big data" 30 times (word quantity 30), "finance" 25 times (word quantity 25), "data mining" 20 times (word quantity 20) and "big data" 15 times (word quantity 15). The words whose word quantity is greater than the first preset threshold (e.g., 50) are taken as first words; alternatively, the Y words with the highest word quantities (e.g., Y=2) are taken as first words, where Y is a natural number. The quantity corresponding to each first word is recorded as its first quantity. In the above example, the first words can be "scientific and technical article" and "China", with the first quantity of "scientific and technical article" being 100 and the first quantity of "China" being 90.
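Both ways of selecting first words, by threshold or by top Y, can be sketched with a word counter. The function name `first_words` is hypothetical, and the counts are a subset of the word quantities in the example above.

```python
from collections import Counter


def first_words(word_counts, threshold=None, top_y=None):
    """Select first words either by a count threshold or as the top_y most frequent."""
    if threshold is not None:
        return {w: c for w, c in word_counts.items() if c > threshold}
    return dict(Counter(word_counts).most_common(top_y))


# Word quantities taken from the example (one "big data" entry omitted).
word_counts = {
    "scientific and technical article": 100,
    "China": 90,
    "big data": 30,
    "finance": 25,
    "data mining": 20,
}
print(first_words(word_counts, threshold=50))
print(first_words(word_counts, top_y=2))
```

With these counts, a threshold of 50 and Y=2 select the same two first words.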

In the above embodiment, after the set of words is obtained, obtaining the second quantity of each second word in the set of words can include: for each word obtained by segmenting the texts to be analyzed, counting the total number of times it appears together with the predetermined keyword. As shown in Table 1, if the predetermined keyword is "big data" and "big data" appears in three texts to be analyzed (text A, text B and text C in Table 1), the words that appear together with "big data" in each text and their numbers of occurrences are as listed in Table 1. From Table 1, the number of times each word appears together with the predetermined keyword is: "scientific and technical article", 10+2=12; "China", 6+7=13; "big data", 5+7=12; "finance", 2+1=3; "big data", 10+10=20; "data mining", 5+5+3=13. The X words with the highest numbers of occurrences (e.g., X=5) are taken as second words, where X is a natural number; alternatively, the words whose number is greater than the second preset threshold (e.g., 10) are taken as second words. The number corresponding to each second word is recorded as its second quantity. In the above example, the second words are "scientific and technical article" (second quantity 12), "China" (second quantity 13), "big data" (second quantity 20), "big data" (second quantity 12) and "data mining" (second quantity 13).

Table 1

After the multiple first words and multiple second words are determined, each second word is compared with each first word. If a second word is identical to a first word, the ratio of its second quantity to that first quantity is taken as the frequency of occurrence of the second word; if the second word is not identical to any first word, its second quantity is taken as its frequency of occurrence. The second words whose frequency of occurrence is greater than the third preset threshold are taken as derivative keywords, or the Z second words with the highest frequency of occurrence are taken as derivative keywords.

With the above embodiment of the present invention, the derivative keywords of the predetermined keyword can be determined automatically from the multiple texts to be analyzed, without manually adding derivative keywords for the predetermined keyword, which improves the accuracy of determining derivative keywords. When the degree of association data between the predetermined keyword and a text to be analyzed is calculated, the influence of the derivative keywords can also be taken into account, which improves the accuracy of the calculated degree of association data.

In an optional embodiment, the articles crawled from the internet (i.e., the texts to be analyzed) can be stored in a database and segmented into words. The Y words that occur most often across all articles stored in the database are taken as solo words (i.e., the first words in the above embodiment), and the number of occurrences a of each solo word is recorded (i.e., the above first quantity). According to the predetermined keyword, the X words that most often appear together with the predetermined keyword are retrieved from the database as co-occurrence words (i.e., the above second words), and the number b of each co-occurrence word is recorded (i.e., the above second quantity). The solo words and co-occurrence words are then combined: if a co-occurrence word is identical to one of the solo words, its frequency of occurrence is recorded as b/a; if a co-occurrence word differs from the solo words, its frequency of occurrence remains b. The Z words with the highest frequency of occurrence are then taken as the derivative keywords of the predetermined keyword.

In the embodiments of the present invention, a, b and Z are natural numbers.
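The optional embodiment above can be sketched end to end on already-segmented documents. Everything here is a minimal illustration under assumptions: the function name `derivative_keywords`, the parameter names `top_y`, `top_x` and `top_z` (standing for Y, X and Z), and the sample documents are hypothetical, and co-occurrence is read as "occurrences of a word inside documents that contain the keyword".

```python
from collections import Counter


def derivative_keywords(docs, keyword, top_y, top_x, top_z):
    """Pick derivative keywords from tokenized docs using the b/a correction."""
    # Solo words: the top_y most frequent words over all documents (count a).
    totals = Counter(word for doc in docs for word in doc)
    solo = dict(totals.most_common(top_y))
    # Co-occurrence words: counted only in documents containing the keyword (count b).
    co = Counter()
    for doc in docs:
        if keyword in doc:
            co.update(word for word in doc if word != keyword)
    # Frequency is b / a when the word is also a solo word, otherwise b.
    adjusted = {}
    for word, b in co.most_common(top_x):
        adjusted[word] = b / solo[word] if word in solo else b
    ranked = sorted(adjusted, key=adjusted.get, reverse=True)
    return ranked[:top_z]


docs = [
    ["big data", "China", "China", "mining", "mining", "mining"],
    ["big data", "China", "mining", "mining"],
    ["China", "China", "China", "sports"],
]
print(derivative_keywords(docs, "big data", top_y=1, top_x=2, top_z=1))
```

In this sample "China" is the single solo word, so its co-occurrence count of 3 is scaled down to 3/6, while "mining" keeps its count of 5 and is selected.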

The present invention is described in further detail with the above example. The first words and their first quantities, and the second words and their second quantities, are shown in Table 2. Because the second words "big data", "big data" and "data mining" are not identical to any first word (in this embodiment, "scientific and technical article" and "China"), the frequency of occurrence of "big data" is 20, that of "big data" is 12, and that of "data mining" is 13. Because the second words "scientific and technical article" and "China" are identical to the first words "scientific and technical article" and "China" respectively, the frequency of occurrence of the second word "scientific and technical article" is 12/100=0.12, and that of the second word "China" is 13/90≈0.14. If the third preset threshold is 10, or Z=3, then in this embodiment the derivative keywords are "big data", "big data" and "data mining", and the second numbers of these derivative keywords are 20, 12 and 13 respectively.
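The arithmetic of this example can be checked with a short script. The quantities are the ones given in the text; the suffixes "(a)" and "(b)" are labels added here only to keep the two second words that this text prints identically as "big data" apart as dictionary keys, and the function name `corrected_frequency` is hypothetical.

```python
def corrected_frequency(second_quantity, first_quantity):
    """Divide by the first quantity when the second word is also a first word."""
    return {
        word: (b / first_quantity[word] if word in first_quantity else b)
        for word, b in second_quantity.items()
    }


first_quantity = {"scientific and technical article": 100, "China": 90}
second_quantity = {
    "scientific and technical article": 12,
    "China": 13,
    "big data (a)": 20,
    "big data (b)": 12,
    "data mining": 13,
}

frequency = corrected_frequency(second_quantity, first_quantity)
# Keep the second words whose frequency exceeds the third preset threshold of 10.
derived = [word for word, f in frequency.items() if f > 10]
print(derived)
```

The two generic words fall far below the threshold after correction, so only the three topical words remain.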

Table 2

In the above embodiment of the present invention, words that have no special significance can be removed from the words that appear together with the predetermined keyword. For example, an article is about applications of big data but repeatedly mentions words such as "scientific and technical article" or "China" that have no particular relevance to "big data" (for instance, the article repeatedly states that it was published in a certain Chinese scientific and technical publication). By comparing the first words, which occur most often across the articles, with the second words, which most often appear together with the predetermined keyword, and correcting the frequency of occurrence of each second word by the first quantity of the matching first word, the second words that have no particular relevance are removed.

Further, it is determined that predetermined keyword appears in first number in multiple texts to be analyzed and derives keyword Second number in present multiple texts to be analyzed may include: to make the word quantity of the predetermined keyword in above-described embodiment For first number of predetermined keyword；Using the frequency of occurrence of the derivative keyword in above-described embodiment as the of derivative keyword Two numbers.

Specifically, in counting set of words after the word quantity of each word, by the corresponding word of predetermined keyword First number of the language quantity as predetermined keyword, e.g., in the above example, if predetermined keyword is " big data ", then First number of predetermined keyword is 30；After determining derivative keyword, the corresponding frequency of occurrence of derivative keyword is recorded As second number of derivative keyword, as in above-mentioned example, second number of second derivative keyword " big data " is 12.

In the above embodiment of the present invention, before determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, the method may further include: obtaining pre-set text labels and the conjunctive words of the pre-set text labels, wherein the pre-set text labels include the first text labels and each pre-set text label corresponds to at least one conjunctive word; traversing the multiple texts to be analyzed to obtain the multiple conjunctive words that each text to be analyzed contains; and searching for the multiple pre-set text labels corresponding to the conjunctive words that each text to be analyzed contains, as the multiple first text labels.

Specifically, the pre-set text labels and their conjunctive words can be obtained from a preset tag library, in which the pre-set text labels and their corresponding conjunctive words are stored, each pre-set text label corresponding to at least one conjunctive word. The multiple texts to be analyzed are traversed to obtain the conjunctive words that each text to be analyzed contains. For each text to be analyzed, the pre-set text labels corresponding to the conjunctive words it contains are looked up, and these pre-set text labels are set as the first text labels of that text. In this embodiment, since one pre-set text label may correspond to multiple conjunctive words, the number of first text labels determined for a text to be analyzed does not exceed the number of conjunctive words found in that text.

It should be further noted that one conjunctive word can correspond to only one pre-set text label.

For example, the preset tag library contains the pre-set text labels "big data" and "finance", and the conjunctive words of each pre-set text label and their preset weights are as shown in Table 3. If a text to be analyzed contains "big data", "data mining", "data application" and "Wall Street", then according to the above description "big data", "data mining" and "data application" correspond to the pre-set text label "big data", and "Wall Street" corresponds to the pre-set text label "finance". The text to be analyzed therefore has two first text labels, "big data" and "finance", which indicates that the text has two themes: "big data" and "finance".
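The label lookup in this example can be sketched as below. Several things here are assumptions for illustration only: Table 3 is not reproduced in this text, so the weights for the "big data" label are taken from the worked calculation later in the description, the weight of "Wall Street" is invented, and plain substring matching stands in for the word segmentation the embodiment actually uses. The names `PRESET_TAG_LIBRARY`, `build_word_index` and `first_text_labels` are hypothetical.

```python
# Hypothetical preset tag library: label -> {conjunctive word: preset weight}.
# The "Wall Street" weight is an assumed value (Table 3 is not available).
PRESET_TAG_LIBRARY = {
    "big data": {
        "big data": 5,
        "data mining": 3,
        "data application": 2,
        "data processing": 1,
    },
    "finance": {"Wall Street": 1},
}


def build_word_index(tag_library):
    """Map each conjunctive word to its single pre-set text label."""
    index = {}
    for label, words in tag_library.items():
        for word in words:
            if word in index:
                # One conjunctive word may correspond to only one label.
                raise ValueError(f"{word!r} is assigned to two labels")
            index[word] = label
    return index


def first_text_labels(text, tag_library):
    """Return the labels whose conjunctive words occur in the text."""
    labels = []
    for word, label in build_word_index(tag_library).items():
        if word in text and label not in labels:
            labels.append(label)
    return labels


text = "big data and data mining drive data application on Wall Street"
labels = first_text_labels(text, PRESET_TAG_LIBRARY)
print(labels)  # ['big data', 'finance']
```

Three conjunctive words map to the same label here, which is why the number of first text labels (2) is smaller than the number of conjunctive words found (4).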

Table 3

Through the above embodiment of the present invention, each text to be analyzed is matched against the pre-set text labels and their conjunctive words stored in the preset tag library, and the multiple first text labels of each text to be analyzed are determined. Each first text label identifies one theme of the text to be analyzed, so the themes that the text involves can be identified accurately through the multiple first text labels. When the degree of association between the predetermined keyword and a text to be analyzed is calculated, the second text label that matches the predetermined keyword is first determined among the first text labels, and the degree of association data between the predetermined keyword and the text to be analyzed (e.g., an article on the Internet) is then calculated based on this second text label, which reflects a theme of the text. This avoids the poor accuracy of the prior art, which determines the degree of association between an article and a keyword according to whether the keyword appears in the article, where it appears and how many times it appears, and improves the accuracy of the calculated degree of association data.

According to the above embodiment of the present invention, before the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method may further include: calculating the label achievement data A of each first text label according to a first formula, wherein the first formula is $A = \sum_{i=1}^{n} B_i \times b_i$, n is the number of conjunctive words corresponding to the first text label, $B_i$ is the number of times the i-th conjunctive word corresponding to the first text label appears in one text to be analyzed, and $b_i$ is the preset weight of the i-th conjunctive word for the first text label.

Specifically, before the degree of association data of the predetermined keyword and each text to be analyzed is calculated, the label achievement data of each first text label is calculated. For each first text label, the preset weight of each of its conjunctive words is read from the preset tag library, the number of times each conjunctive word appears in the text to be analyzed is counted, the product of each conjunctive word's number of occurrences and its preset weight is calculated, and the products are added to obtain the label achievement data of that first text label.

For example, in conjunction with Table 3, for the first text label "big data", if its conjunctive words "big data", "big data", "data mining", "data application" and "data processing" appear in an article (i.e., a text to be analyzed in the above embodiment) 4, 3, 5, 2 and 1 times respectively, then according to the above embodiment of the present invention the label achievement data of the first text label "big data" is: 5 × 4 + 5 × 3 + 3 × 5 + 2 × 2 + 1 × 1 = 55.

Through the above embodiment of the present invention, the label achievement data of each first text label of each text to be analyzed is calculated, and the label achievement data reflects how relevant a theme is to the text to be analyzed (e.g., an article on the Internet). That is, the larger the label achievement data of a first text label, the more relevant the theme corresponding to that first text label is to the text to be analyzed.
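The weighted sum in this example can be checked with a few lines of Python. This is a minimal sketch of the first formula as it is described in the text; the function name `label_achievement` is hypothetical, and the conjunctive words are passed as (occurrence count, preset weight) pairs because two of them share the same rendering in this translation.

```python
def label_achievement(counts_and_weights):
    """First formula: sum of B_i * b_i over the label's conjunctive words.

    counts_and_weights is an iterable of (B_i, b_i) pairs, where B_i is the
    number of occurrences in one text and b_i is the preset weight.
    """
    return sum(count * weight for count, weight in counts_and_weights)


# Occurrence counts 4, 3, 5, 2, 1 with preset weights 5, 5, 3, 2, 1,
# as given in the worked example for the label "big data".
pairs = [(4, 5), (3, 5), (5, 3), (2, 2), (1, 1)]
A = label_achievement(pairs)
print(A)  # 55
```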

In the above embodiment of the present invention, calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword may include: taking, as third words, the conjunctive words corresponding to the second text label that are identical to the derivative keywords; and calculating the degree of association data G of each text to be analyzed according to a second formula, wherein the second formula is $G = K \times C + K \times D \times d + \sum_{j=1}^{m} \left( k_j \times C + k_j \times E_j \times e_j \right)$, K is the first number of the predetermined keyword, C is the label achievement data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the preset weight of the second text label, m is the number of third words, $k_j$ is the second number of the derivative keyword corresponding to the j-th third word, $E_j$ is the third number of times the j-th third word appears in one text to be analyzed, and $e_j$ is the preset weight of the j-th third word for the second text label.

Specifically, after the second text label identical to the predetermined keyword is determined among the multiple first text labels, the label achievement data of the second text label is determined from the calculated label achievement data of the multiple first text labels, and the conjunctive words of the second text label that are identical to the derivative keywords are taken as third words. The degree of association data of the predetermined keyword and the text to be analyzed is then calculated according to the second formula from the label achievement data of the second text label, the first number of the predetermined keyword and the second number of each derivative keyword.

In conjunction with the above example, of the two first text labels "big data" and "finance" of the text to be analyzed, the one identical to the predetermined keyword "big data" is the first text label "big data", so "big data" is determined as the second text label. Since the conjunctive words of the second text label "big data" are "big data", "big data", "data mining", "data application" and "data processing", and the derivative keywords of the predetermined keyword "big data" include "big data", the third word is "big data". In this embodiment, the first number K of the predetermined keyword "big data" is 30; the label achievement data C of the second text label "big data" is 55; the number of times D that the second text label "big data" appears in one text to be analyzed is 4; the preset weight d of the second text label "big data" is 5; the number m of third words is 1; the second number k_j of the derivative keyword corresponding to the j-th third word (i.e., the above "big data") is 10; the third number E_j of times the j-th third word appears in one text to be analyzed is 3; and the preset weight e_j of the j-th third word for the second text label is 5. The degree of association data G of the text to be analyzed and the predetermined keyword "big data" is therefore 30 × 55 + 30 × 4 × 5 + (10 × 55 + 10 × 3 × 5) = 2950.

Through the above embodiment of the present invention, the calculation of the degree of association data of the predetermined keyword and a text to be analyzed takes into account both the derivative keywords of the predetermined keyword obtained from the texts to be analyzed and the theme of the text to be analyzed. This avoids the poor accuracy of the prior art, which determines the degree of association between an article and a keyword according to whether the keyword appears in the article, where it appears and how many times it appears, and improves the accuracy of the calculated degree of association data.
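The arithmetic of this example can be reproduced with the short Python sketch below. It assumes the second formula has the form implied by the worked calculation, G = K·C + K·D·d + Σ (k_j·C + k_j·E_j·e_j); the function name `association_degree` and the tuple layout of `third_words` are illustrative choices, not part of the original description.

```python
def association_degree(K, C, D, d, third_words):
    """Second formula, as implied by the worked example.

    K: first number of the predetermined keyword
    C: label achievement data of the second text label
    D: occurrences of the second text label in one text
    d: preset weight of the second text label
    third_words: iterable of (k_j, E_j, e_j) tuples, one per third word
    """
    keyword_part = K * C + K * D * d
    derivative_part = sum(k_j * C + k_j * E_j * e_j for k_j, E_j, e_j in third_words)
    return keyword_part + derivative_part


# Values from the example: one third word with k_j = 10, E_j = 3, e_j = 5.
G = association_degree(K=30, C=55, D=4, d=5, third_words=[(10, 3, 5)])
print(G)  # 2950
```

When no conjunctive word of the second text label is a derivative keyword, the sum is empty and only the predetermined keyword's own contribution remains.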

According to the above embodiment of the present invention, after the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method may further include: sorting the degree of association data of the texts to be analyzed in descending order to obtain a relational degree sorting table; and displaying the top N degree of association data in the relational degree sorting table and the corresponding texts to be analyzed, wherein N is a natural number.

Specifically, after the degree of association data of each text to be analyzed is calculated, the degree of association data is sorted in descending order to obtain the relational degree sorting table, and the top N (e.g., the top 3) degree of association data in the table and the corresponding texts to be analyzed are displayed.

Through the above embodiment of the present invention, the higher the degree of association data, the greater the degree of association between the predetermined keyword and the text to be analyzed. Displaying the top N degree of association data and the corresponding texts to be analyzed allows people to find the most relevant articles in the field of knowledge or technology corresponding to the predetermined keyword.
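The sorting and display step can be sketched as follows. The article identifiers and all scores other than 2950 are made-up illustration values, and `top_n` is a hypothetical helper name.

```python
def top_n(association_data, n):
    """Sort (text, degree of association) pairs in descending order and keep n."""
    ranked = sorted(association_data.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical scores for four texts to be analyzed.
association_data = {
    "article_1": 2950,
    "article_2": 1200,
    "article_3": 3100,
    "article_4": 800,
}

ranking = top_n(association_data, n=3)
print(ranking)  # [('article_3', 3100), ('article_1', 2950), ('article_2', 1200)]
```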

The above embodiment of the present invention is described in detail below with reference to Fig. 2. As shown in Fig. 2, the method may include the following steps:

Step S202: a crawler robot crawls articles on the Internet from the server 80 and stores the crawled articles in a database.

In this step, the crawler robot is the web crawler of the above embodiment of the present invention, and its working principle is consistent with that of the web crawler in the above embodiment; details are not repeated here.

Step S204: count the Y words with the highest frequency of occurrence in the database as solo-occurrence words.

The above steps S202 and S204 can be implemented by the crawler unit 20.

Step S206: set the keyword to be searched for.

Here, the keyword is the predetermined keyword in the above embodiment of the present invention.

Step S208: according to the keyword, obtain from the database storing the articles the X words that most frequently appear together with the keyword, as co-occurrence words.

Step S210: calculate Z extension words from the solo-occurrence words and the co-occurrence words, and obtain the extension word weight of each of the Z extension words.

In this step, the extension words are the derivative keywords in the above embodiment of the present invention, and the extension word weight is the second number of the derivative keyword in the above embodiment.

Above-mentioned step S206 to step S210 can be realized by keyword setting unit 40.

Step S212: set labels and the characteristic words corresponding to each label.

In this embodiment, the labels are the pre-set text labels in the above embodiment of the present invention, and the characteristic words are the conjunctive words of the pre-set text labels.

Step S214: set a characteristic word weight for each characteristic word.

Here, the characteristic word weight is the preset weight, in the above embodiment of the present invention, of a conjunctive word for its pre-set text label.

Step S216: calculate the label score of each article for each label according to the number of times the characteristic words of that label appear in the article and their characteristic word weights.

In this embodiment, the label score is the label achievement data in the above embodiment of the present invention, and the label score of an article for a label can be calculated according to the first formula above; details are not repeated here.

Step S218: calculate the degree of association data of the keyword and each article according to the extension words (derivative keywords) of the keyword, their extension word weights and the label score of the article for the label corresponding to the keyword, and sort the degree of association data.

Specifically, step S218 is implemented in the same way as step S110; details are not repeated here.

Above-mentioned step S212 to step S218 can be realized by label setting unit 60.

In this embodiment, X, Y and Z are natural numbers.

It should be further noted that, after the calculated degree of association data is sorted, the top N degree of association data and the corresponding articles are displayed.

Through the above embodiment of the present invention, the extension words (derivative keywords) of the keyword are obtained automatically from the crawled articles without manual addition, and a different weight (i.e., the second number in the above embodiment of the present invention) is defined for each extension word. Labels and the characteristic words corresponding to each label are set, and a different weight (i.e., the above characteristic word weight) is defined for each characteristic word. The degree of association data of each article and the keyword is then calculated from the keyword and the label of the article corresponding to it (i.e., the second text label in the above embodiment of the present invention) and sorted, and the top N degree of association data and the corresponding articles can be displayed, making it easy for people to read the articles most relevant to the keyword.

It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in an order different from the one given here.

An embodiment of the present invention also provides a device for obtaining article degree of association data. The device can realize its functions through the method for obtaining article degree of association data in the above embodiment of the present invention.

Fig. 3 is the schematic diagram of the device according to an embodiment of the present invention for obtaining article degree of association data.As shown in figure 3, should Device may include: the first acquisition module 10, statistical module 30, the first determining module 50, the second determining module 70 and first Computing module 90.

The first acquisition module 10 is configured to obtain a predetermined keyword and multiple texts to be analyzed. The statistical module 30 is configured to count the derivative keywords of the predetermined keyword corresponding to the multiple texts to be analyzed, wherein a derivative keyword is a keyword that appears in one text to be analyzed together with the predetermined keyword. The first determining module 50 is configured to determine the first number of times the predetermined keyword appears in the multiple texts to be analyzed and the second number of times the derivative keyword appears in the multiple texts to be analyzed. The second determining module 70 is configured to determine, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, wherein the first text labels are used to identify the themes of the text to be analyzed. The first computing module 90 is configured to calculate the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword.

According to the above embodiment of the present invention, the statistical module may include: a word segmentation module, configured to perform word segmentation on the multiple texts to be analyzed to obtain a set of words; a second acquisition module, configured to obtain the first quantity of each first word in the set of words, wherein the first quantity is greater than a first preset threshold; a third acquisition module, configured to obtain the second quantity of each second word in the set of words, wherein the second quantity is the aggregate value of the number of times a second word and the predetermined keyword appear together in each text to be analyzed, and the second quantity is greater than a second preset threshold; a comparison module, configured to compare the second words with the first words, wherein if a second word is identical to a first word, the ratio of the second quantity to the first quantity is taken as the frequency of occurrence of the second word, and if a second word differs from the first words, the second quantity is taken as the frequency of occurrence of the second word; and a third determining module, configured to determine the second words whose frequency of occurrence is greater than a third preset threshold as the derivative keywords.

Further, after the set of words is obtained, obtaining the first quantity of each first word in the set of words may include: counting the word quantity of each word in the set of words.

In the above embodiment, after the set of words is obtained, obtaining the second quantity of each second word in the set of words may include: counting, for each word obtained by segmenting each text to be analyzed, the aggregate value of the number of times the word appears together with the predetermined keyword; taking the X words with the highest number as the second words, wherein X is a natural number, or taking the words whose number is greater than the second preset threshold as the second words; and recording the number corresponding to each second word as the second quantity.
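One possible reading of this co-occurrence count is sketched below. It assumes that a word and the predetermined keyword "appear together" when both occur in the same text to be analyzed, and that each such text contributes 1 to the aggregate value; the description does not pin this down, so a per-occurrence count would be an equally plausible reading. The function name `second_quantities` and the pre-tokenized sample texts are hypothetical.

```python
from collections import Counter


def second_quantities(tokenized_texts, keyword, top_x):
    """Count, per word, the texts in which it co-occurs with the keyword.

    Returns the top_x (word, count) pairs, i.e. the candidate second words
    together with their second quantities.
    """
    counts = Counter()
    for tokens in tokenized_texts:
        unique_words = set(tokens)
        if keyword in unique_words:
            # Each co-occurring word is counted once per text (assumption).
            counts.update(unique_words - {keyword})
    return counts.most_common(top_x)


# Hypothetical, already segmented texts to be analyzed.
tokenized_texts = [
    ["big data", "data mining", "China"],
    ["big data", "data mining"],
    ["finance", "China"],
]

second_words = second_quantities(tokenized_texts, keyword="big data", top_x=1)
print(second_words)  # [('data mining', 2)]
```

Selecting by a second preset threshold instead of the top X would only change the last line of the function to a filter on the counts.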

In embodiments of the present invention, a, b and Z are natural numbers.

Further, the first determining module 50 may be configured to: take the word quantity of the predetermined keyword in the above embodiment as the first number of the predetermined keyword; and take the frequency of occurrence of the derivative keyword in the above embodiment as the second number of the derivative keyword.

Specifically, after the word quantity of each word in the set of words is counted, the word quantity corresponding to the predetermined keyword is taken as the first number of the predetermined keyword; after the derivative keywords are determined, the frequency of occurrence corresponding to each derivative keyword is recorded as the second number of that derivative keyword.

In the above embodiment of the present invention, the device may further include: a fourth acquisition module, configured to obtain pre-set text labels and the conjunctive words of the pre-set text labels before the second text label that matches the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, wherein the pre-set text labels include the first text labels and each pre-set text label corresponds to at least one conjunctive word; a traversal module, configured to traverse the multiple texts to be analyzed to obtain the multiple conjunctive words that each text to be analyzed contains; and a searching module, configured to search for the multiple pre-set text labels corresponding to the conjunctive words that each text to be analyzed contains, as the multiple first text labels.

According to the above embodiment of the present invention, the device may further include: a second computing module, configured to calculate the label achievement data A of each first text label according to a first formula before the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, wherein the first formula is $A = \sum_{i=1}^{n} B_i \times b_i$, n is the number of conjunctive words corresponding to the first text label, $B_i$ is the number of times the i-th conjunctive word corresponding to the first text label appears in one text to be analyzed, and $b_i$ is the preset weight of the i-th conjunctive word for the first text label.

In the above embodiment of the present invention, the first computing module may include: a fourth determining module, configured to take, as third words, the conjunctive words corresponding to the second text label that are identical to the derivative keywords; and a computational submodule, configured to calculate the degree of association data G of each text to be analyzed according to a second formula, wherein the second formula is $G = K \times C + K \times D \times d + \sum_{j=1}^{m} \left( k_j \times C + k_j \times E_j \times e_j \right)$, K is the first number of the predetermined keyword, C is the label achievement data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the preset weight of the second text label, m is the number of third words, $k_j$ is the second number of the derivative keyword corresponding to the j-th third word, $E_j$ is the third number of times the j-th third word appears in one text to be analyzed, and $e_j$ is the preset weight of the j-th third word for the second text label.

According to the above embodiment of the present invention, the device may further include: a sorting module, configured to sort the degree of association data of the texts to be analyzed in descending order to obtain a relational degree sorting table, after the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword; and a display module, configured to display the top N degree of association data in the relational degree sorting table and the corresponding texts to be analyzed, wherein N is a natural number.

The modules provided in this embodiment are used in the same way as the corresponding steps of the method embodiment, and their application scenarios may also be the same. Of course, it should be noted that the solutions involving the above modules are not limited to the content and scenarios of the above embodiments, and that the above modules may run on a computer terminal or a mobile terminal and may be implemented in software or hardware.

It can be seen from the above description that the present invention realizes following technical effect:

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be fabricated as an individual integrated circuit module, or multiple modules or steps among them can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims

1. A method for obtaining article degree of association data, characterized by comprising:

obtaining a predetermined keyword and multiple texts to be analyzed;

counting the derivative keywords of the predetermined keyword corresponding to the multiple texts to be analyzed, wherein a derivative keyword is a keyword that appears in one text to be analyzed together with the predetermined keyword;

determining a first number of times the predetermined keyword appears in the multiple texts to be analyzed and a second number of times the derivative keyword appears in the multiple texts to be analyzed;

determining, among multiple first text labels of each text to be analyzed, a second text label that matches the predetermined keyword, wherein the first text labels are used to identify the themes of the text to be analyzed;

calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword;

wherein, before calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method further comprises:

calculating the label achievement data A of each first text label according to a first formula, wherein

the first formula is: $A = \sum_{i=1}^{n} B_i \times b_i$

n is the number of conjunctive words corresponding to the first text label, $B_i$ is the number of times the i-th conjunctive word corresponding to the first text label appears in one text to be analyzed, and $b_i$ is the preset weight of the i-th conjunctive word for the first text label;

calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword comprises:

determining the label achievement data of the second text label from the calculated label achievement data of the multiple first text labels;

taking, as third words, the conjunctive words corresponding to the second text label that are identical to the derivative keywords;

calculating the degree of association data G of each text to be analyzed according to a second formula, wherein

the second formula is: $G = K \times C + K \times D \times d + \sum_{j=1}^{m} \left( k_j \times C + k_j \times E_j \times e_j \right)$

K is the first number of the predetermined keyword, C is the label achievement data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the preset weight of the second text label, m is the number of third words, $k_j$ is the second number of the derivative keyword corresponding to the j-th third word, $E_j$ is the third number of times the j-th third word appears in one text to be analyzed, and $e_j$ is the preset weight of the j-th third word for the second text label;

wherein the label achievement data is used to reflect the relevance of a theme to the text to be analyzed.

2. The method according to claim 1, characterized in that counting the derivative keywords of the predetermined keyword corresponding to the multiple texts to be analyzed comprises:

performing word segmentation on the multiple texts to be analyzed to obtain a set of words;

obtaining the first quantity of each first word in the set of words, wherein the first quantity is greater than a first preset threshold;

obtaining the second quantity of each second word in the set of words, wherein the second quantity is the aggregate value of the number of times a second word and the predetermined keyword appear together in each text to be analyzed, and the second quantity is greater than a second preset threshold;

comparing the second words with the first words, wherein if a second word is identical to a first word, the ratio of the second quantity to the first quantity is taken as the frequency of occurrence of the second word, and if a second word differs from the first words, the second quantity is taken as the frequency of occurrence of the second word;

taking the second words whose frequency of occurrence is greater than a third preset threshold as the derivative keywords.

3. The method according to claim 1, characterized in that, before determining, among the multiple first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, the method further comprises:

obtaining pre-set text labels and the conjunctive words of the pre-set text labels, wherein the pre-set text labels include the first text labels, and each pre-set text label corresponds to at least one conjunctive word;

traversing the multiple texts to be analyzed to obtain the multiple conjunctive words that each text to be analyzed contains;

searching for the multiple pre-set text labels corresponding to the conjunctive words that each text to be analyzed contains, as the multiple first text labels.

4. The method according to claim 1, characterized in that, after calculating the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method further comprises:

sorting the degree of association data of the texts to be analyzed in descending order to obtain a relational degree sorting table;

displaying the top N degree of association data in the relational degree sorting table and the corresponding texts to be analyzed, wherein N is a natural number.

5. A device for obtaining article degree of association data, comprising:

A first obtaining module, configured to obtain a predetermined keyword and multiple texts to be analyzed;

A statistical module, configured to count the derivative keywords of the predetermined keyword across the multiple texts to be analyzed, wherein a derivative keyword is a keyword that appears together with the predetermined keyword in one text to be analyzed;

A first determining module, configured to determine a first number of times the predetermined keyword appears in the multiple texts to be analyzed and a second number of times the derivative keyword appears in the multiple texts to be analyzed;

A second determining module, configured to determine, among the multiple first text labels of each text to be analyzed, a second text label that matches the predetermined keyword, wherein the first text labels identify the theme of the text to be analyzed;

A first computing module, configured to calculate the degree of association data of the predetermined keyword and each text to be analyzed based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword;

Wherein the device further comprises:

A second computing module, configured to calculate the label achievement data A of each first text label according to a first formula, before the degree of association data of the predetermined keyword and each text to be analyzed is calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword, wherein

The first formula is:

Where n is the number of conjunctive words corresponding to the first text label, B_i is the number of times the i-th conjunctive word corresponding to the first text label appears in one text to be analyzed, and b_i is the preset weight of the i-th conjunctive word corresponding to the first text label with respect to that first text label;

Wherein the first computing module comprises:

A fourth determining module, configured to determine the label achievement data of the second text label from the calculated label achievement data of the multiple first text labels, and to determine, as a third word, each conjunctive word corresponding to the second text label that is identical to a derivative keyword;

A calculation submodule, configured to calculate the degree of association data G of each text to be analyzed according to a second formula, wherein

The second formula is:
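Neither formula is reproduced in this text, so the sketch below does not implement the claimed formulas. It only illustrates two pieces that the surrounding definitions describe. The first function assumes, purely as one reading consistent with the definitions of n, B_i and b_i, that the label achievement data A is a weighted sum of conjunctive-word counts; the second selects the third words as described for the fourth determining module. All function names and sample values are hypothetical, and the second formula for G is deliberately left out.

```python
def label_achievement(counts, weights):
    """ASSUMED reading of the first formula: A = sum of B_i * b_i over i = 1..n.

    counts  -- word -> number of times it appears in one text (B_i)
    weights -- conjunctive word of the label -> its preset weight (b_i)
    """
    return sum(counts.get(word, 0) * weight for word, weight in weights.items())


def third_words(label_conjunctive_words, derivative_keywords):
    """Conjunctive words of the second text label that are also derivative keywords."""
    derived = set(derivative_keywords)
    return [word for word in label_conjunctive_words if word in derived]


a_value = label_achievement(
    {"Hadoop": 3, "Spark": 1, "cluster": 5},  # counts in one text
    {"Hadoop": 0.5, "Spark": 0.25},           # weights for one label
)
shared = third_words(["Hadoop", "Spark"], ["Spark", "cloud"])
```

Words that are not conjunctive words of the label ("cluster" above) contribute nothing to A under this assumed reading.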

6. The device according to claim 5, wherein the statistical module comprises:

A word segmentation module, configured to perform word segmentation on the multiple texts to be analyzed to obtain a word set;

A second obtaining module, configured to obtain a first quantity of each first word in the word set, wherein the first quantity is greater than a first preset threshold;

A third obtaining module, configured to obtain a second quantity of each second word in the word set, wherein the second quantity is the total, over the texts to be analyzed, of the number of times the second word appears together with the predetermined keyword in each text to be analyzed, and the second quantity is greater than a second preset threshold;

A comparison module, configured to compare the second word with the first word; if the second word is identical to the first word, the ratio of the second quantity to the first quantity is taken as the frequency of occurrence of the second word; if the second word differs from the first word, the second quantity is taken as the frequency of occurrence of the second word;

A third determining module, configured to determine each second word whose frequency of occurrence is greater than a third preset threshold as a derivative keyword.
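The statistics in claim 6 can be sketched as below. Several points are assumptions rather than statements of the claim: the texts are taken as already segmented into word lists, the first quantity is read as a word's total number of occurrences across all texts, the second quantity is read as its occurrences summed over the texts that also contain the predetermined keyword, and the keyword itself is excluded from the candidates. The function name and thresholds are hypothetical.

```python
from collections import Counter


def derivative_keywords(texts, keyword, t1, t2, t3):
    """Return {derivative keyword: frequency of occurrence} under the stated assumptions.

    texts   -- list of word lists (assumed output of word segmentation)
    keyword -- the predetermined keyword
    t1..t3  -- first, second and third preset thresholds
    """
    first = Counter()   # assumed: total occurrences across all texts
    second = Counter()  # assumed: occurrences in texts that contain the keyword
    for tokens in texts:
        counts = Counter(tokens)
        first.update(counts)
        if keyword in counts:
            second.update(counts)

    first_words = {w: c for w, c in first.items() if c > t1}
    second_words = {w: c for w, c in second.items() if c > t2 and w != keyword}

    result = {}
    for word, second_quantity in second_words.items():
        if word in first_words:
            # Second word is also a first word: use the ratio of the quantities.
            frequency = second_quantity / first_words[word]
        else:
            # Otherwise the second quantity itself is the frequency.
            frequency = second_quantity
        if frequency > t3:
            result[word] = frequency
    return result


texts = [
    ["big data", "Hadoop", "Hadoop"],
    ["big data", "Spark"],
    ["Hadoop", "football"],
    ["football", "football", "football"],
]
derived = derivative_keywords(texts, "big data", t1=2, t2=0, t3=0.5)
```

In this sample "Hadoop" is frequent overall, so its frequency is the ratio 2/3, while "Spark" is not a first word and keeps its raw co-occurrence count of 1; "football" never co-occurs with the keyword and is dropped.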

7. The device according to claim 5, wherein the device further comprises:

A fourth obtaining module, configured to obtain preset text labels and the conjunctive words of the preset text labels before the second text label that matches the predetermined keyword is determined among the multiple first text labels of each text to be analyzed, wherein the preset text labels include the first text labels, and each preset text label corresponds to at least one conjunctive word;

A traversal module, configured to traverse the multiple texts to be analyzed to obtain the multiple conjunctive words contained in each text to be analyzed;

A lookup module, configured to look up the multiple preset text labels corresponding to the conjunctive words contained in each text to be analyzed, and to take them as the multiple first text labels.

8. The device according to any one of claims 5 to 7, wherein the device further comprises:

A sorting module, configured to sort the degree of association data of each text to be analyzed in descending order to obtain a degree-of-association ranking table, after the degree of association data of the predetermined keyword and each text to be analyzed has been calculated based on the label achievement data of the second text label of each text to be analyzed, the first number of the predetermined keyword, and the second number of the derivative keyword;

A display module, configured to display the top N items of degree of association data in the degree-of-association ranking table together with the corresponding texts to be analyzed, wherein N is a natural number.