CN106033445A

CN106033445A - Method and device for obtaining article association degree data

Info

Publication number: CN106033445A
Application number: CN201510114670.0A
Authority: CN
Inventors: 陈俊宏; 余德乐; 杨韬; 赵冬玲
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2015-03-16
Filing date: 2015-03-16
Publication date: 2016-10-19
Anticipated expiration: 2035-03-16
Also published as: CN106033445B

Abstract

The invention discloses a method and a device for obtaining article association degree data. The method comprises: obtaining preset keywords and a plurality of to-be-analyzed texts, wherein each to-be-analyzed text is corresponding to a plurality of first text labels; counting derivative keywords of the preset keywords, corresponding to the plurality of to-be-analyzed texts; determining first number of times that the preset keywords appear in the plurality of to-be-analyzed texts and second number of times that the derivative keywords appear in the plurality of to-be-analyzed texts; determining second text labels which are matched with the preset keywords in the plurality of first text labels in each to-be-analyzed text; based on the label index data of the second text labels of each to-be-analyzed text, the first number of times of the preset keywords, and the second number of times of the derivative keywords, calculating association degree data of the preset keywords and each to-be-analyzed text. The method solves problems in the prior art that association degree of an article and keywords cannot be accurately determined, and realizes effect of accurately determining association degree of an article and keywords.

Description

Method and apparatus for obtaining article association degree data

Technical field

The present invention relates to the field of the Internet, and in particular to a method and apparatus for obtaining article association degree data.

Background art

In the prior art, articles associated with a keyword are found and ranked simply according to whether the keyword appears in an article, how many times it appears, and where in the article it appears. This approach resembles keyword search in a search engine such as Baidu or Google. For example, if the keyword is "big data", then when searching for articles associated with "big data", the association degree between an article and "big data" is determined from whether "big data" appears in the article, the number of times it appears, and the positions at which it appears, and the associated articles are ranked from high to low by association degree. The weight given to a keyword differs with its position in the article: the headline has the highest weight, the body text the next highest, and advertisements the lowest. However, this prior-art method of determining and ranking associated articles by keyword cannot accurately reflect the relevance between the theme of an article and the keyword.

For example, when searching for articles associated with the keyword "big data", consider a news report about the author of a certain big data article attending a dinner party. The term "big data" appears in the report many times, yet its theme is the author attending the dinner party, not "big data". Under the prior art, because "big data" appears many times, the report is determined to be an article associated with the keyword "big data". The prior art therefore cannot accurately reflect the relevance between the theme of an article and the keyword.

As another example, when the keyword is "big data", an article may discuss big data throughout under a different term and never mention the exact keyword "big data"; such an article is not retrieved and does not take part in the ranking.

As a further example, the keyword is still "big data". Article A mentions "big data" once and a closely related term (also rendered "big data" here) 5 times; article B mentions "big data" twice and "data mining" 4 times. According to the prior art above, the association degree between article B and the keyword "big data" should be higher than that of article A, but this is clearly not the case.

No effective solution has yet been proposed for the problem in the prior art that the association degree between an article and a keyword cannot be accurately determined.

Summary of the invention

The present invention provides a method and apparatus for obtaining article association degree data, so as to solve the problem in the prior art that the association degree between an article and a keyword cannot be accurately determined.

To achieve the above object, according to one aspect of the embodiments of the present invention, a method for obtaining article association degree data is provided. The method includes: obtaining a preset keyword and a plurality of texts to be analyzed; counting derivative keywords of the preset keyword that correspond to the plurality of texts to be analyzed, wherein a derivative keyword is a keyword that appears in the same text to be analyzed as the preset keyword; determining a first number of times that the preset keyword appears in the plurality of texts to be analyzed and a second number of times that each derivative keyword appears in the plurality of texts to be analyzed; determining, among a plurality of first text labels of each text to be analyzed, a second text label that matches the preset keyword, wherein the first text labels identify the theme of the text to be analyzed; and calculating association degree data between the preset keyword and each text to be analyzed based on the label index data of the second text label of that text, the first number of times of the preset keyword, and the second number of times of the derivative keywords.

Further, counting the derivative keywords of the preset keyword that correspond to the plurality of texts to be analyzed includes: performing word segmentation on the plurality of texts to be analyzed to obtain a word set; obtaining a first quantity of each first word in the word set, wherein the first quantity is greater than a first preset threshold; obtaining a second quantity of each second word in the word set, wherein the second quantity is the sum, over the texts to be analyzed, of the number of times the second word appears together with the preset keyword, and the second quantity is greater than a second preset threshold; comparing the second words with the first words: if a second word is identical to a first word, taking the ratio of its second quantity to the first quantity as the occurrence count of the second word; if a second word differs from every first word, taking its second quantity as the occurrence count of the second word; and taking the second words whose occurrence count is greater than a third preset threshold as the derivative keywords.

Further, before determining, among the plurality of first text labels of each text to be analyzed, the second text label that matches the preset keyword, the method further includes: obtaining preset text labels and the associated words of the preset text labels, wherein the preset text labels include the first text labels and each preset text label corresponds to at least one associated word; traversing the plurality of texts to be analyzed to obtain the associated words contained in each text to be analyzed; and looking up the preset text labels corresponding to the associated words contained in each text to be analyzed, as the plurality of first text labels.

Further, before calculating the association degree data between the preset keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of times of the preset keyword, and the second number of times of the derivative keywords, the method further includes: calculating the label index data A of each first text label according to a first formula, wherein the first formula is

A = Σ_{i = 1}^{n} (B_{i} * b_{i}),

n is the number of associated words corresponding to the first text label, B_i is the number of times the i-th associated word corresponding to the first text label appears in one text to be analyzed, and b_i is the preset weight, with respect to the first text label, of the i-th associated word corresponding to the first text label.
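A minimal Python sketch of the label index data may help here. It assumes the first formula is the weighted sum of B_i * b_i over the n associated words of a label, which is what the definitions of n, B_i and b_i suggest. The function name `label_index` and the sample weights and counts are hypothetical.

```python
from typing import Mapping


def label_index(counts: Mapping[str, int], weights: Mapping[str, float]) -> float:
    """Assumed first formula: A = sum of B_i * b_i over one label's associated words.

    counts  -- B_i: occurrences of each word in one text to be analyzed
    weights -- b_i: preset weight of each associated word for this label
    """
    # Associated words that do not occur in the text contribute zero.
    return sum(counts.get(word, 0) * w for word, w in weights.items())


# Hypothetical weights and counts, for illustration only.
weights = {"big data": 1.0, "data mining": 0.5}
counts = {"big data": 3, "data mining": 4, "dinner": 7}
A = label_index(counts, weights)  # 3 * 1.0 + 4 * 0.5
```

Words in the text that are not associated words of the label, such as "dinner" above, do not affect A.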

Further, calculating the association degree data between the preset keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of times of the preset keyword, and the second number of times of the derivative keywords includes: taking the associated words of the second text label that are identical to derivative keywords as third words; and calculating the association degree data G of each text to be analyzed according to a second formula, wherein the second formula is

G = K * C + K * D * d + Σ_{j = 1}^{m} (k_{j} * C + k_{j} * E_{j} * e_{j}),

K is the first number of times of the preset keyword, C is the label index data of the second text label, D is the number of times the second text label appears in one text to be analyzed, d is the preset weight of the second text label, m is the number of third words, k_j is the second number of times of the derivative keyword corresponding to the j-th third word, E_j is the third number of times, i.e. the number of times the j-th third word appears in one text to be analyzed, and e_j is the preset weight, with respect to the second text label, of the j-th third word.
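The second formula can be evaluated directly once its inputs are known. The Python sketch below is illustrative only: the names `ThirdWord` and `association_degree` are hypothetical, K = 30 and k = 13 are taken from the "big data" and "data mining" counts used in the description, and the remaining values (C, D, d, E, e) are invented for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThirdWord:
    k: float  # k_j: second number of times of the matching derivative keyword
    E: int    # E_j: occurrences of the third word in the text to be analyzed
    e: float  # e_j: preset weight of the third word for the second text label


def association_degree(K: float, C: float, D: int, d: float,
                       third_words: Sequence[ThirdWord]) -> float:
    """G = K*C + K*D*d + sum over j of (k_j*C + k_j*E_j*e_j)."""
    keyword_part = K * C + K * D * d
    derivative_part = sum(t.k * C + t.k * t.E * t.e for t in third_words)
    return keyword_part + derivative_part


G = association_degree(K=30, C=5.0, D=2, d=1.0,
                       third_words=[ThirdWord(k=13, E=4, e=0.5)])
```

With no third words the sum is empty and G reduces to K*C + K*D*d, so a text whose label matches the preset keyword still receives a score.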

Further, after calculating the association degree data between the preset keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of times of the preset keyword, and the second number of times of the derivative keywords, the method further includes: sorting the association degree data of the texts to be analyzed in descending order to obtain an association degree ranking table; and displaying the top N association degree data in the ranking table together with the corresponding texts to be analyzed, wherein N is a natural number.
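The ranking step amounts to a descending sort followed by a slice. A short Python sketch follows, with a hypothetical function name `top_n` and invented association degree values:

```python
from typing import Dict, List, Tuple


def top_n(degrees: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Rank texts by association degree, highest first, and keep the top n."""
    ranked = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Invented association degree data for three texts to be analyzed.
degrees = {"text A": 301.0, "text B": 120.5, "text C": 410.0}
ranking = top_n(degrees, 2)
```

Here `ranking` holds the two highest-scoring texts with their association degree data, which is what would be displayed.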

To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for obtaining article association degree data is provided. The apparatus includes: a first acquisition module, configured to obtain a preset keyword and a plurality of texts to be analyzed; a statistics module, configured to count derivative keywords of the preset keyword that correspond to the plurality of texts to be analyzed, wherein a derivative keyword is a keyword that appears in the same text to be analyzed as the preset keyword; a first determination module, configured to determine a first number of times that the preset keyword appears in the plurality of texts to be analyzed and a second number of times that each derivative keyword appears in the plurality of texts to be analyzed; a second determination module, configured to determine, among a plurality of first text labels of each text to be analyzed, a second text label that matches the preset keyword, wherein the first text labels identify the theme of the text to be analyzed; and a first calculation module, configured to calculate association degree data between the preset keyword and each text to be analyzed based on the label index data of the second text label of that text, the first number of times of the preset keyword, and the second number of times of the derivative keywords.

Further, the statistics module includes: a word segmentation module, configured to perform word segmentation on the plurality of texts to be analyzed to obtain a word set; a second acquisition module, configured to obtain a first quantity of each first word in the word set, wherein the first quantity is greater than a first preset threshold; a third acquisition module, configured to obtain a second quantity of each second word in the word set, wherein the second quantity is the sum, over the texts to be analyzed, of the number of times the second word appears together with the preset keyword, and the second quantity is greater than a second preset threshold; a comparison module, configured to compare the second words with the first words, to take the ratio of the second quantity to the first quantity as the occurrence count of a second word if that second word is identical to a first word, and to take the second quantity as the occurrence count of a second word if that second word differs from every first word; and a third determination module, configured to take the second words whose occurrence count is greater than a third preset threshold as the derivative keywords.

Further, the apparatus also includes: a fourth acquisition module, configured to obtain, before the second text label that matches the preset keyword is determined among the plurality of first text labels of each text to be analyzed, preset text labels and the associated words of the preset text labels, wherein the preset text labels include the first text labels and each preset text label corresponds to at least one associated word; a traversal module, configured to traverse the plurality of texts to be analyzed to obtain the associated words contained in each text to be analyzed; and a lookup module, configured to look up the preset text labels corresponding to the associated words contained in each text to be analyzed, as the plurality of first text labels.

Further, the apparatus also includes: a second calculation module, configured to calculate, before the association degree data between the preset keyword and each text to be analyzed is calculated based on the label index data of the second text label of each text to be analyzed, the first number of times of the preset keyword, and the second number of times of the derivative keywords, the label index data A of each first text label according to a first formula, wherein the first formula is

A = Σ_{i = 1}^{n} (B_{i} * b_{i}),

n is the number of associated words corresponding to the first text label, B_i is the number of times the i-th associated word corresponding to the first text label appears in one text to be analyzed, and b_i is the preset weight, with respect to the first text label, of the i-th associated word corresponding to the first text label.

Further, the first calculation module includes: a fourth determination module, configured to take the associated words of the second text label that are identical to derivative keywords as third words; and a calculation submodule, configured to calculate the association degree data G of each text to be analyzed according to a second formula, wherein the second formula is

G = K * C + K * D * d + Σ_{j = 1}^{m} (k_{j} * C + k_{j} * E_{j} * e_{j}),

Further, the apparatus also includes: a sorting module, configured to sort, after the association degree data between the preset keyword and each text to be analyzed is calculated based on the label index data of the second text label of each text to be analyzed, the first number of times of the preset keyword, and the second number of times of the derivative keywords, the association degree data of the texts to be analyzed in descending order to obtain an association degree ranking table; and a display module, configured to display the top N association degree data in the ranking table together with the corresponding texts to be analyzed, wherein N is a natural number.

With the embodiments of the present invention, after the preset keyword and the plurality of texts to be analyzed are obtained, the derivative keywords of the preset keyword that correspond to the plurality of texts to be analyzed are counted, and the first number of times that the preset keyword appears and the second number of times that each derivative keyword appears in the plurality of texts to be analyzed are determined. After the second text label that matches the preset keyword is determined among the plurality of first text labels of each text to be analyzed, the association degree data between the preset keyword and each text to be analyzed is calculated based on the label index data of the second text label of that text, the first number of times of the preset keyword, and the second number of times of the derivative keywords. The association degree is thus calculated by combining the preset keyword with the text labels of the text to be analyzed. Because the text labels identify the theme of the text to be analyzed, the association degree between the preset keyword and the text can be determined accurately. The embodiments of the present invention thereby solve the problem in the prior art that the association degree between an article and a keyword cannot be accurately determined, and achieve the effect of accurately determining that association degree.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a flowchart of a method for obtaining article association degree data according to an embodiment of the present invention;

Fig. 2 is a flowchart of an optional method for obtaining article association degree data according to an embodiment of the present invention; and

Fig. 3 is a schematic diagram of an apparatus for obtaining article association degree data according to an embodiment of the present invention.

Detailed description

It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented. In addition, the terms "include" and "have" and any variants of them are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

An embodiment of the present invention provides a method for obtaining article association degree data.

Fig. 1 is a flowchart of a method for obtaining article association degree data according to an embodiment of the present invention. As shown in Fig. 1, the method may include the following steps:

Step S102: obtain a preset keyword and a plurality of texts to be analyzed.

Step S104: count the derivative keywords of the preset keyword that correspond to the plurality of texts to be analyzed.

Here, a derivative keyword is a keyword that appears in the same text to be analyzed as the preset keyword.

Step S106: determine a first number of times that the preset keyword appears in the plurality of texts to be analyzed and a second number of times that each derivative keyword appears in the plurality of texts to be analyzed.

Step S108: determine, among a plurality of first text labels of each text to be analyzed, a second text label that matches the preset keyword.

Here, the first text labels identify the theme of the text to be analyzed.

Step S110: calculate the association degree data between the preset keyword and each text to be analyzed based on the label index data of the second text label of that text, the first number of times of the preset keyword, and the second number of times of the derivative keywords.

In the above embodiment, the texts to be analyzed may be web documents crawled from the Internet by a web crawler. Optionally, articles may be crawled from the Internet according to a list of URLs of the pages to be crawled, or according to page level. For example, the web crawler may be configured to crawl the content of the first-level pages of a website such as Sina, NetEase or Tencent (e.g. the content of the Sina homepage), and then the content of the second-level pages of that website (e.g. the content reached by opening the links on the Sina homepage), and so on.
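Crawling by page level can be sketched as a breadth-first traversal with a depth limit. The Python below is an illustration under stated assumptions, not the crawler of the embodiment: `crawl_by_level` and `get_links` are hypothetical names, and an in-memory dictionary stands in for fetching real pages over HTTP.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_by_level(start_url: str,
                   get_links: Callable[[str], Iterable[str]],
                   max_level: int) -> List[str]:
    """Visit pages breadth-first, from level 1 (the start page) down to max_level."""
    seen = {start_url}
    visited = []
    queue = deque([(start_url, 1)])
    while queue:
        url, level = queue.popleft()
        visited.append(url)
        if level == max_level:
            continue  # do not follow links below the deepest requested level
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited


# Toy site: page name -> pages it links to (stands in for real HTTP fetching).
site = {"home": ["news", "tech"], "news": ["article-1"], "tech": ["home"]}
pages = crawl_by_level("home", lambda u: site.get(u, []), max_level=2)
```

The `seen` set keeps a page from being visited twice when pages link back to each other, as "tech" does to "home" here.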

Here, URL stands for Uniform Resource Locator. It is a concise representation of the location of, and the method of accessing, a resource available on the Internet, and is the address of a standard resource on the Internet.

In the above embodiment, the crawled articles (i.e. the texts to be analyzed) may be stored in a database.

According to the above embodiment of the present invention, counting the derivative keywords of the preset keyword that correspond to the plurality of texts to be analyzed may include: performing word segmentation on the plurality of texts to be analyzed to obtain a word set; obtaining a first quantity of each first word in the word set, wherein the first quantity is greater than a first preset threshold; obtaining a second quantity of each second word in the word set, wherein the second quantity is the sum, over the texts to be analyzed, of the number of times the second word appears together with the preset keyword, and the second quantity is greater than a second preset threshold; comparing the second words with the first words: if a second word is identical to a first word, taking the ratio of its second quantity to the first quantity as the occurrence count of the second word; if a second word differs from every first word, taking its second quantity as the occurrence count of the second word; and taking the second words whose occurrence count is greater than a third preset threshold as the derivative keywords.

Specifically, the plurality of texts to be analyzed may be segmented according to the preset words in a preset dictionary, yielding a word set that includes a plurality of first words. For example, if the text to be analyzed is "big data means not taking a shortcut such as random analysis, but analyzing and processing all the data", the word set obtained after word segmentation may include the following words: "big data", "means", "not using", "random analysis", "such", "shortcut", "but", "using", "all", "data", "performing" and "analyzing and processing".
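The text does not say which dictionary-based segmentation algorithm is used. Forward maximum matching is one common choice, and the Python sketch below uses it purely as an assumption. The function name `segment`, the dictionary and the unspaced sample string are all hypothetical.

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                # A single character is emitted as-is when nothing longer matches.
                words.append(piece)
                i += size
                break
    return words


# Hypothetical preset dictionary and an unspaced input string.
dictionary = {"big", "bigdata", "data", "analysis"}
words = segment("bigdataanalysis", dictionary)
```

Preferring the longest match is what keeps "bigdata" as one word instead of splitting it into "big" and "data".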

Further, after the word set is obtained, counting the first quantity of each first word in the word set may include: counting the number of occurrences of each word in the word set. For example, in the word set "scientific article" occurs 100 times (i.e. its word quantity is 100), "China" 90 times, "big data" 30 times, "financial" 25 times, "data mining" 20 times and "big data" 15 times. The words whose word quantity is greater than the first preset threshold (e.g. 50) are taken as the first words; alternatively, the Y words with the highest quantity (e.g. Y=2) are taken as the first words, where Y is a natural number. The quantity of each first word is recorded as its first quantity. In the above example, the first words are "scientific article" and "China", with first quantities 100 and 90 respectively.
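Both ways of selecting first words, by threshold or by rank, can be sketched with Python's `collections.Counter`. The function name `first_words` is hypothetical. The counts are those of the example above, except that the two entries that both read "big data" in this text are kept apart with the hypothetical suffixes "(a)" and "(b)".

```python
from collections import Counter
from typing import Dict, Optional


def first_words(word_counts: Counter,
                threshold: Optional[int] = None,
                top_y: Optional[int] = None) -> Dict[str, int]:
    """Return the first words with their first quantities.

    Uses the count threshold when given, otherwise the top_y most frequent words.
    """
    if threshold is not None:
        return {w: c for w, c in word_counts.items() if c > threshold}
    return dict(word_counts.most_common(top_y))


word_counts = Counter({
    "scientific article": 100, "China": 90, "big data (a)": 30,
    "financial": 25, "data mining": 20, "big data (b)": 15,
})
by_threshold = first_words(word_counts, threshold=50)
by_rank = first_words(word_counts, top_y=2)
```

With a threshold of 50 or with Y=2, both calls select "scientific article" and "China", matching the example.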

In the above embodiment, after the word set is obtained, obtaining the second quantity of each second word in the word set may include: for each word produced by segmentation, summing the number of times it appears together with the preset keyword in each text to be analyzed. As shown in Table 1, suppose the preset keyword is "big data" and "big data" appears in three texts to be analyzed (text A, text B and text C in Table 1); the words that appear together with "big data" in each text, and their occurrence counts, are listed in Table 1. From Table 1, the number of times each word appears together with the preset keyword is: "scientific article", 10+2=12; "China", 6+7=13; "big data", 5+7=12; "financial", 2+1=3; "big data", 10+10=20; "data mining", 5+5+3=13. The X words with the highest counts (e.g. X=5) are taken as the second words, where X is a natural number; alternatively, the words whose count is greater than the second preset threshold (e.g. 10) are taken as the second words. The count of each second word is recorded as its second quantity. In this example, the second words are "scientific article" (second quantity 12), "China" (second quantity 13), "big data" (second quantity 20), "big data" (second quantity 12) and "data mining" (second quantity 13).

Table 1
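The aggregation of co-occurrence counts can be sketched in Python as below. The per-word sums are those given in the description, but which text each addend belongs to is not recoverable here, so the split across texts A, B and C is an assumption. The function name `second_quantities` and the suffixes "(a)" and "(b)", which separate the two entries that both read "big data", are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def second_quantities(per_text: Iterable[Mapping[str, int]],
                      threshold: int) -> Dict[str, int]:
    """Sum each word's co-occurrence counts over all texts, then apply the threshold."""
    totals = Counter()
    for counts in per_text:
        totals.update(counts)  # Counter.update adds counts rather than replacing them
    return {w: c for w, c in totals.items() if c > threshold}


# Assumed assignment of the addends to texts A, B and C.
text_a = {"scientific article": 10, "China": 6, "big data (a)": 5,
          "financial": 2, "big data (b)": 10, "data mining": 5}
text_b = {"scientific article": 2, "China": 7, "big data (a)": 7,
          "financial": 1, "big data (b)": 10, "data mining": 5}
text_c = {"data mining": 3}
second = second_quantities([text_a, text_b, text_c], threshold=10)
```

"financial" totals 3 and falls below the threshold of 10, so it is not kept as a second word.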

After the plurality of first words and the plurality of second words are determined, each second word is compared with each first word. If a second word is identical to a first word, the ratio of its second quantity to the first quantity is taken as the occurrence count of the second word; if a second word differs from every first word, its second quantity is taken as its occurrence count. The second words whose occurrence count is greater than the third preset threshold are then taken as the derivative keywords; alternatively, the Z second words with the highest occurrence counts are taken as the derivative keywords.

With the above embodiment of the present invention, the derivative keywords of the preset keyword are determined automatically from the plurality of texts to be analyzed, without manually adding derivative keywords for the preset keyword, which improves the accuracy of the derivative keywords determined. When the association degree data between the preset keyword and a text to be analyzed is calculated, the influence of the derivative keywords is also taken into account, which improves the accuracy of the calculated association degree data.

In an optional embodiment, the articles (i.e. the texts to be analyzed) crawled from the Internet may be stored in a database and segmented into words. The Y words with the highest occurrence counts across all articles stored in the database are taken as independent-occurrence words (i.e. the first words in the above embodiment), and the occurrence count a of each independent-occurrence word (i.e. the first quantity) is recorded. According to the preset keyword, the X words that appear together with the preset keyword most often in the database are taken as co-occurrence words (i.e. the second words), and the count b of each co-occurrence word (i.e. the second quantity) is recorded. The independent-occurrence words and the co-occurrence words are then compared: if a co-occurrence word is identical to an independent-occurrence word, its occurrence count is recorded as b/a; if a co-occurrence word differs from every independent-occurrence word, its occurrence count remains b. The Z words with the highest occurrence counts are then taken as the derivative keywords of the preset keyword.

In the embodiments of the present invention, a, b and Z are natural numbers.

The present invention is further described in detail with the above example. In the example, the first words and their first quantities, and the second words and their second quantities, are shown in Table 2. Because the second words "big data", "big data" and "data mining" differ from the first words (in this embodiment, "scientific article" and "China"), the occurrence count of "big data" is 20, the occurrence count of "big data" is 12, and the occurrence count of "data mining" is 13. Because the second words "scientific article" and "China" are identical to the first words "scientific article" and "China" respectively, the occurrence count of the second word "scientific article" is 12/100=0.12 and the occurrence count of the second word "China" is 13/90≈0.14. If the third preset threshold is 10, or Z=3, then in this example the derivative keywords are "big data", "big data" and "data mining", and their second numbers of times are 20, 12 and 13 respectively.

Table 2
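The correction of occurrence counts and the selection of derivative keywords can be sketched in Python using the numbers of this example. The function name `derivative_keywords` is hypothetical, as are the suffixes "(a)" and "(b)" that keep apart the two second words that both read "big data" in this text.

```python
from typing import Dict, Mapping


def derivative_keywords(first: Mapping[str, int],
                        second: Mapping[str, int],
                        threshold: float) -> Dict[str, float]:
    """Divide b by a for second words that are also first words, then threshold."""
    adjusted = {}
    for word, b in second.items():
        a = first.get(word)
        # Identical to a first word: use the ratio b / a; otherwise keep b unchanged.
        adjusted[word] = b / a if a else b
    return {w: n for w, n in adjusted.items() if n > threshold}


first = {"scientific article": 100, "China": 90}
second = {"scientific article": 12, "China": 13, "big data (a)": 12,
          "big data (b)": 20, "data mining": 13}
derived = derivative_keywords(first, second, threshold=10)
```

Dividing by the first quantity pushes frequent but unspecific words such as "scientific article" (0.12) far below the threshold, so only the topic-specific words survive.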

In the above embodiment of the present invention, words that appear together with the preset keyword but carry no special significance can be removed. For example, an article is about applications of big data but repeatedly mentions words such as "scientific article" or "China" that are not specific to "big data" (for instance, the article repeatedly states that it was published in a certain Chinese scientific publication). By comparing the first words, which occur most often across the articles, with the second words, which appear most often together with the preset keyword, and using the first quantity of a first word to correct the occurrence count of the matching second word, the second words that have no specificity are removed.

Further, determining the first number of times that the preset keyword appears in the plurality of texts to be analyzed and the second number of times that each derivative keyword appears in the plurality of texts to be analyzed may include: taking the word quantity of the preset keyword in the above embodiment as the first number of times of the preset keyword; and taking the occurrence count of each derivative keyword in the above embodiment as the second number of times of that derivative keyword.

Specifically, after the word quantity of each word in the word set is counted, the word quantity of the preset keyword is taken as the first number of times of the preset keyword. In the above example, if the preset keyword is "big data", the first number of times of the preset keyword is 30. After the derivative keywords are determined, the occurrence count recorded for each derivative keyword is taken as its second number of times. In the above example, the second number of times of the second derivative keyword "big data" is 12.

In the above embodiment of the present invention, before determining the second text label that matches the predetermined keyword among the plurality of first text labels of each text to be analyzed, the method may further include: obtaining preset text labels and the conjunctive words of the preset text labels, wherein the preset text labels include the first text labels and each preset text label corresponds to at least one conjunctive word; traversing the plurality of texts to be analyzed to obtain the conjunctive words that each text to be analyzed contains; and looking up the preset text labels corresponding to the conjunctive words contained in each text to be analyzed, as the plurality of first text labels.

Specifically, the preset text labels and their conjunctive words may be obtained from a preset tag library, which stores the preset text labels and their corresponding conjunctive words, each preset text label corresponding to at least one conjunctive word. The plurality of texts to be analyzed are traversed to obtain the conjunctive words that each text to be analyzed contains. For each text to be analyzed, the preset text labels corresponding to the conjunctive words it contains are looked up and set as the first text labels of that text. In this embodiment, because one preset text label may correspond to several conjunctive words, the number of first text labels determined for a text to be analyzed is no greater than the number of conjunctive words found in that text.

It should be further noted that a conjunctive word can correspond to only one preset text label.

For example, the preset tag library contains the preset text labels "big data" and "financial", and the conjunctive words of each preset text label, with their preset weights, are as shown in Table 3. If a text to be analyzed contains "big data", "data mining", "market demand" and "Wall Street", then, as described above, "big data", "data mining" and "market demand" correspond to the preset text label "big data", and "Wall Street" corresponds to the preset text label "financial". The text to be analyzed therefore has two first text labels, "big data" and "financial", which indicates that the text has two themes: "big data" and "financial".

Table 3

Through the above embodiment of the present invention, each text to be analyzed is matched against the preset text labels and conjunctive words stored in the preset tag library, which determines the plurality of first text labels of each text to be analyzed. Each first text label identifies one theme of the text, so the plurality of first text labels can accurately identify every theme that the text involves. When the degree of association between the predetermined keyword and a text to be analyzed is calculated, the second text label that matches the predetermined keyword is first determined among the first text labels, and the degree of association data between the predetermined keyword and the text (e.g., an article on the Internet) is then calculated on the basis of this second text label, which reflects a theme of the text. This avoids the poor accuracy of the prior art, which determines the degree of association between an article and a keyword from whether the keyword appears in the article, where it appears and how often it appears, and it improves the accuracy of the calculated degree of association data.
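The label lookup in the example above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `first_text_labels` is hypothetical, and the `tag_library` dictionary is only a partial stand-in for Table 3, built from the words named in the surrounding text.

```python
def first_text_labels(text_words, tag_library):
    """Return the preset labels whose conjunctive words appear in the text.

    tag_library maps each preset text label to its conjunctive words;
    a conjunctive word belongs to exactly one label.
    """
    # Invert the library: conjunctive word -> its single preset text label.
    word_to_label = {
        word: label
        for label, words in tag_library.items()
        for word in words
    }
    labels = []
    for word in text_words:
        label = word_to_label.get(word)
        # Several conjunctive words may map to the same label; keep it once.
        if label is not None and label not in labels:
            labels.append(label)
    return labels


# Partial stand-in for Table 3 (assumption: only words named in the text).
tag_library = {
    "big data": ["big data", "data mining", "market demand", "data process"],
    "financial": ["Wall Street"],
}
text_words = ["big data", "data mining", "market demand", "Wall Street"]

labels = first_text_labels(text_words, tag_library)
print(labels)  # ['big data', 'financial']
```

Four conjunctive words yield two first text labels, which matches the remark that the number of first text labels never exceeds the number of conjunctive words found in the text.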

According to the above embodiment of the present invention, before calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method may further include: calculating the label index data A of each first text label according to a first formula, wherein the first formula is:

A = Σ_{i = 1}^{n} B_{i} * b_{i},

where n is the number of conjunctive words corresponding to the first text label, B_i is the number of times the i-th conjunctive word of the first text label appears in a text to be analyzed, and b_i is the preset weight of the i-th conjunctive word with respect to the first text label.

Specifically, before the degree of association data between the predetermined keyword and each text to be analyzed is calculated, the label index data of each first text label is calculated. For a first text label, the preset weight of each of its conjunctive words is read from the preset tag library, and the number of times each conjunctive word appears in the text to be analyzed is counted. The occurrence count of each conjunctive word is multiplied by its preset weight, and the products are summed to give the label index data of the first text label.

For example, with reference to Table 3, for the first text label "big data", if its conjunctive words "big data", "big data", "data mining", "market demand" and "data process" appear in an article (i.e., a text to be analyzed in the above embodiment) 4, 3, 5, 2 and 1 times respectively, then according to the above embodiment of the present invention the label index data of the first text label "big data" is 5 × 4 + 5 × 3 + 3 × 5 + 2 × 2 + 1 × 1 = 55.
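The arithmetic of this example can be checked with a short sketch of the first formula. The function name `label_index` and the list-of-pairs layout are assumptions made for illustration; the occurrence counts and weights are those given in the example.

```python
def label_index(conjunctive_words):
    """Compute A = sum of B_i * b_i over (occurrences, weight) pairs."""
    return sum(occurrences * weight for occurrences, weight in conjunctive_words)


# (B_i, b_i) pairs for the five conjunctive words of the label "big data",
# in the order in which the example lists them.
big_data_words = [(4, 5), (3, 5), (5, 3), (2, 2), (1, 1)]

A = label_index(big_data_words)
print(A)  # 55
```

A list of pairs is used rather than a dictionary keyed by word because two of the conjunctive words read identically in this text.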

Through the above embodiment of the present invention, the label index data of each first text label of each text to be analyzed is calculated. The label index data reflects how relevant a theme is to the text to be analyzed (e.g., an article on the Internet): the larger the label index data of a first text label, the more relevant the theme corresponding to that label is to the text.

In the above embodiment of the present invention, calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword may include: taking the conjunctive words of the second text label that are identical to a derivative keyword as third words; and calculating the degree of association data G of each text to be analyzed according to a second formula, wherein the second formula is

G = K * C + K * D * d + Σ_{j = 1}^{m} (k_{j} * C + k_{j} * E_{j} * e_{j}),

Specifically, after the second text label identical to the predetermined keyword has been determined among the plurality of first text labels, the label index data of the second text label is taken from the calculated label index data of the first text labels, and those conjunctive words of the second text label that are identical to a derivative keyword are taken as third words. The degree of association data between the predetermined keyword and the text to be analyzed is then calculated according to the second formula from the label index data of the second text label, the first number of the predetermined keyword and the second number of the derivative keyword.

With reference to the above example, of the two first text labels "big data" and "financial" of the text to be analyzed, the one identical to the predetermined keyword "big data" is the first text label "big data", so "big data" is determined to be the second text label. The conjunctive words of the second text label "big data" are "big data", "big data", "data mining", "market demand" and "data process", and the derivative keywords of the predetermined keyword "big data" include "big data", so the third word is "big data". In this embodiment, the first number K of the predetermined keyword "big data" is 30; the label index data C of the second text label "big data" is 55; the number of times D that the second text label "big data" appears in the text to be analyzed is 4; the preset weight d of the second text label "big data" is 5; the number m of third words is 1; the second number k_j of the derivative keyword corresponding to the j-th third word (the above "big data") is 10; the third number E_j of times that the j-th third word (the above "big data") appears in the text to be analyzed is 3; and the preset weight e_j of the j-th third word (the above "big data") with respect to the second text label is 5. The degree of association data G between this text to be analyzed and the predetermined keyword "big data" is therefore 30 × 55 + 30 × 4 × 5 + (10 × 55 + 10 × 3 × 5) = 2950.
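The second formula can be sketched in the same way. The function name `association_degree` and the tuple layout for the third words are hypothetical; the symbols K, C, D, d, k_j, E_j and e_j and their values are those of the example above.

```python
def association_degree(K, C, D, d, third_words):
    """G = K*C + K*D*d + sum over third words of (k_j*C + k_j*E_j*e_j).

    third_words is a list of (k_j, E_j, e_j) tuples, one per third word.
    """
    g = K * C + K * D * d
    for k_j, E_j, e_j in third_words:
        g += k_j * C + k_j * E_j * e_j
    return g


# Values from the example: one third word with k_j = 10, E_j = 3, e_j = 5.
G = association_degree(K=30, C=55, D=4, d=5, third_words=[(10, 3, 5)])
print(G)  # 2950
```

The first two terms weight the text's theme by the predetermined keyword itself, and each third word adds an analogous pair of terms weighted by the second number of its derivative keyword.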

Through the above embodiment of the present invention, the calculation of the degree of association data between the predetermined keyword and a text to be analyzed takes into account both the derivative keywords that the predetermined keyword has in the texts to be analyzed and the theme of the text to be analyzed. This avoids the poor accuracy of the prior art, which determines the degree of association between an article and a keyword from whether the keyword appears in the article, where it appears and how often it appears, and it improves the accuracy of the calculated degree of association data.

According to the above embodiment of the present invention, after calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method may further include: sorting the degree of association data of the texts to be analyzed from high to low to obtain a degree of association ranking table; and displaying the top N degree of association data in the ranking table and their corresponding texts to be analyzed, where N is a natural number.

Specifically, after the degree of association data of each text to be analyzed has been calculated, the degree of association data are sorted from high to low to obtain the degree of association ranking table, and the top N (e.g., the top 3) degree of association data in the table are displayed together with their corresponding texts to be analyzed.

Through the above embodiment of the present invention, the higher the degree of association data, the more closely the predetermined keyword is associated with the text to be analyzed. Displaying the top N degree of association data and their corresponding texts to be analyzed lets people find the most relevant articles in the field of knowledge or technology to which the predetermined keyword belongs.
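The ranking step can be sketched as follows. The function name `top_n` is hypothetical, and the article names and all scores other than 2950 are invented purely for illustration.

```python
def top_n(association, n):
    """Sort {text id: G} from high to low and keep the first n entries."""
    ranking = sorted(association.items(), key=lambda item: item[1], reverse=True)
    return ranking[:n]


# Hypothetical degree of association data for four texts to be analyzed.
scores = {"article A": 2950, "article B": 1200, "article C": 3100, "article D": 450}

ranking = top_n(scores, 3)
print(ranking)  # [('article C', 3100), ('article A', 2950), ('article B', 1200)]
```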

The above embodiment of the present invention is described in detail below with reference to Fig. 2. As shown in Fig. 2, the method may include the following steps:

Step S202, a crawler robot crawls articles on the Internet from the server 80 and stores the crawled articles in a database.

In this step, the crawler robot is the web crawler of the above embodiment of the present invention and works on the same principle, which is not repeated here.

Step S204, the Y words with the highest occurrence counts in the database are counted as standalone words.

The above steps S202 and S204 may be implemented by the crawler unit 20.
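Counting the Y most frequent words of step S204 can be sketched with the Python standard library. The function name `top_words` is hypothetical, and whitespace splitting is only a stand-in for the word segmentation processing that the embodiment applies to the stored articles.

```python
from collections import Counter


def top_words(articles, y):
    """Return the y most frequent words over all articles, with their counts."""
    counts = Counter()
    for article in articles:
        # Assumption: whitespace splitting stands in for word segmentation.
        counts.update(article.split())
    return counts.most_common(y)


articles = ["data mining data", "data application", "mining tools"]

frequent = top_words(articles, 2)
print(frequent)  # [('data', 3), ('mining', 2)]
```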

Step S206, a keyword to be searched for is set.

This keyword is the predetermined keyword in the above embodiment of the present invention.

Step S208, according to the keyword, the X words that co-occur most often with the keyword in the database storing the articles are obtained as co-occurrence words.

Step S210, Z extension words are calculated from the standalone words (the Y words obtained in step S204) and the co-occurrence words, and the extension-word weight of each of the Z extension words is obtained.

In this step, the extension words are the derivative keywords in the above embodiment of the present invention, and the extension-word weight is the second number of the derivative keyword in the above embodiment of the present invention.

The above steps S206 to S210 may be implemented by the keyword setting unit 40.
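Step S208 can be sketched as below. The text does not spell out exactly how co-occurrences are tallied, so this sketch assumes that a word's count is the total number of times it occurs in articles that also contain the keyword; the function name `co_occurrence_words` and the pre-segmented sample articles are hypothetical.

```python
from collections import Counter


def co_occurrence_words(articles, keyword, x):
    """Return the x words that most often appear in articles containing keyword.

    articles is a list of already segmented articles (lists of words).
    """
    counts = Counter()
    for words in articles:
        if keyword in words:
            # Count every other word of an article that contains the keyword.
            counts.update(w for w in words if w != keyword)
    return counts.most_common(x)


articles = [
    ["bigdata", "mining", "cloud", "mining"],
    ["bigdata", "mining", "finance", "cloud"],
    ["finance", "cloud", "cloud"],  # no keyword, so it is ignored
]

co_words = co_occurrence_words(articles, "bigdata", 2)
print(co_words)  # [('mining', 3), ('cloud', 2)]
```

The counts obtained here would then be corrected against the standalone-word counts, as in the ratio step of the embodiment, before the Z extension words are chosen.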

Step S212, labels and the characteristic words corresponding to each label are set.

In this embodiment, the labels are the preset text labels in the above embodiment of the present invention, and the characteristic words are the conjunctive words of the preset text labels.

Step S214, a characteristic-word weight is set for each characteristic word.

The characteristic-word weight is the preset weight, in the above embodiment of the present invention, of a conjunctive word of a preset text label with respect to that preset text label.

Step S216, the label score of each article for each label is calculated from the occurrence counts of the label's characteristic words in the article and their characteristic-word weights.

In this embodiment, the label score is the label index data in the above embodiment of the present invention. The label score of an article for a label can be calculated according to the first formula above, which is not repeated here.

Step S218, the degree of association data between the keyword and each article is calculated from the extension words of the keyword, their extension-word weights, and the label score of the article for the label corresponding to the keyword, and the degree of association data are sorted.

Specifically, step S218 is implemented in the same way as step S110, which is not repeated here.

The above steps S212 to S218 may be implemented by the label setting unit 60.

In this embodiment, X, Y and Z are natural numbers.

It should be further noted that, after the calculated degree of association data have been sorted, the top N degree of association data and their corresponding articles are displayed.

Through the above embodiment of the present invention, the extension words of the keyword are obtained automatically from the crawled articles, without manual addition, and a different weight (i.e., the second number in the above embodiment of the present invention) is defined for each extension word. Labels and the characteristic words corresponding to each label are set, and a different weight (i.e., the characteristic-word weight above) is defined for each characteristic word. The degree of association data between each article and the keyword is then calculated in combination with the label to which both the keyword and the article correspond (i.e., the second text label in the above embodiment of the present invention) and sorted, and the top N degree of association data and their corresponding articles can be displayed, so that people can conveniently read the articles most relevant to the keyword.

It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.

An embodiment of the present invention further provides a device for obtaining article degree of association data. The device can realize its functions through the method for obtaining article degree of association data in the above embodiment of the present invention.

Fig. 3 is a schematic diagram of a device for obtaining article degree of association data according to an embodiment of the present invention. As shown in Fig. 3, the device may include: a first acquisition module 10, a statistical module 30, a first determining module 50, a second determining module 70 and a first computing module 90.

The first acquisition module 10 is configured to obtain a predetermined keyword and a plurality of texts to be analyzed. The statistical module 30 is configured to count the derivative keywords of the predetermined keyword corresponding to the plurality of texts to be analyzed, wherein a derivative keyword is a keyword that appears in a text to be analyzed together with the predetermined keyword. The first determining module 50 is configured to determine the first number of times that the predetermined keyword appears in the plurality of texts to be analyzed and the second number of times that the derivative keyword appears in the plurality of texts to be analyzed. The second determining module 70 is configured to determine, among the plurality of first text labels of each text to be analyzed, the second text label that matches the predetermined keyword, wherein the first text labels are used to identify the themes of the text to be analyzed. The first computing module 90 is configured to calculate the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword.

According to the above embodiment of the present invention, the statistical module may include: a word segmentation module, configured to perform word segmentation processing on the plurality of texts to be analyzed to obtain a set of words; a second acquisition module, configured to obtain the first quantity of each first word in the set of words, wherein the first quantity is greater than a first predetermined threshold; a third acquisition module, configured to obtain the second quantity of each second word in the set of words, wherein the second quantity is the total number of times that a second word appears together with the predetermined keyword in the texts to be analyzed, and the second quantity is greater than a second predetermined threshold; a comparison module, configured to compare the second words with the first words, to take the ratio of the second quantity to the first quantity as the occurrence count of a second word if the second word is identical to a first word, and to take the second quantity as the occurrence count of the second word if the second word differs from the first words; and a third determining module, configured to determine the second words whose occurrence counts are greater than a third predetermined threshold as the derivative keywords.

Further, after the set of words is obtained, counting the first quantity of each first word in the set of words may include: counting the word quantity of each word in the set of words.

In the above embodiment, after the set of words is obtained, obtaining the second quantity of each second word in the set of words may include: for each word obtained by word segmentation, counting the total number of times that it appears together with the predetermined keyword in the texts to be analyzed; taking the X words with the highest counts as the second words, where X is a natural number, or taking the words whose counts are greater than the second predetermined threshold as the second words; and recording the count corresponding to each second word as its second quantity.

In the embodiments of the present invention, a, b and Z are natural numbers.

Further, the first determining module 50 may be configured to: take the word quantity of the predetermined keyword in the above embodiment as the first number of the predetermined keyword; and take the occurrence count of the derivative keyword in the above embodiment as the second number of the derivative keyword.

Specifically, after the word quantity of each word in the set of words has been counted, the word quantity corresponding to the predetermined keyword is taken as the first number of the predetermined keyword; after the derivative keywords have been determined, the occurrence count recorded for each derivative keyword is taken as its second number.

In the above embodiment of the present invention, the device may further include: a fourth acquisition module, configured to obtain, before the second text label that matches the predetermined keyword is determined among the plurality of first text labels of each text to be analyzed, preset text labels and the conjunctive words of the preset text labels, wherein the preset text labels include the first text labels and each preset text label corresponds to at least one conjunctive word; a traversal module, configured to traverse the plurality of texts to be analyzed to obtain the conjunctive words that each text to be analyzed contains; and a lookup module, configured to look up the preset text labels corresponding to the conjunctive words contained in each text to be analyzed, as the plurality of first text labels.

According to the above embodiment of the present invention, the device may further include: a second computing module, configured to calculate, before the degree of association data between the predetermined keyword and each text to be analyzed is calculated based on the label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the label index data A of each first text label according to a first formula, wherein the first formula is:

A = Σ_{i = 1}^{n} B_{i} * b_{i},

where n is the number of conjunctive words corresponding to the first text label, B_i is the number of times the i-th conjunctive word of the first text label appears in a text to be analyzed, and b_i is the preset weight of the i-th conjunctive word with respect to the first text label.

In the above embodiment of the present invention, the first computing module may include: a fourth determining module, configured to take the conjunctive words of the second text label that are identical to a derivative keyword as third words; and a calculating sub-module, configured to calculate the degree of association data G of each text to be analyzed according to a second formula, wherein the second formula is

G = K * C + K * D * d + Σ_{j = 1}^{m} (k_{j} * C + k_{j} * E_{j} * e_{j}),

According to the above embodiment of the present invention, the device may further include: a sorting module, configured to sort, after the degree of association data between the predetermined keyword and each text to be analyzed has been calculated based on the label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the degree of association data of the texts to be analyzed from high to low to obtain a degree of association ranking table; and a display module, configured to display the top N degree of association data in the ranking table and their corresponding texts to be analyzed, where N is a natural number.

The modules provided in this embodiment are used in the same way as the corresponding steps of the method embodiment, and their application scenarios may also be the same. It should of course be noted that the schemes involving the above modules are not limited to the content and scenarios of the above embodiments, and that the above modules may run on a computer terminal or a mobile terminal and may be implemented by software or hardware.

As can be seen from the above description, the present invention achieves the following technical effects:

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed over a network formed by a plurality of computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they may each be made into a separate integrated circuit module; or several of the modules or steps may be made into a single integrated circuit module. The present invention is thus not restricted to any specific combination of hardware and software.

The foregoing is only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for obtaining article degree of association data, characterized by comprising:

obtaining a predetermined keyword and a plurality of texts to be analyzed;

counting derivative keywords of the predetermined keyword corresponding to the plurality of texts to be analyzed, wherein a derivative keyword is a keyword that appears in one of the texts to be analyzed together with the predetermined keyword;

determining a first number of times that the predetermined keyword appears in the plurality of texts to be analyzed and a second number of times that the derivative keyword appears in the plurality of texts to be analyzed;

determining, among a plurality of first text labels of each text to be analyzed, a second text label that matches the predetermined keyword, wherein the first text labels are used to identify themes of the text to be analyzed; and

calculating degree of association data between the predetermined keyword and each text to be analyzed based on label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword.

2. The method according to claim 1, characterized in that counting the derivative keywords of the predetermined keyword corresponding to the plurality of texts to be analyzed comprises:

performing word segmentation processing on the plurality of texts to be analyzed to obtain a set of words;

obtaining a first quantity of each first word in the set of words, wherein the first quantity is greater than a first predetermined threshold;

obtaining a second quantity of each second word in the set of words, wherein the second quantity is the total number of times that a second word appears together with the predetermined keyword in the texts to be analyzed, and the second quantity is greater than a second predetermined threshold;

comparing the second word with the first word; if the second word is identical to the first word, taking the ratio of the second quantity to the first quantity as the occurrence count of the second word; if the second word differs from the first word, taking the second quantity as the occurrence count of the second word; and

taking the second words whose occurrence counts are greater than a third predetermined threshold as the derivative keywords.

3. The method according to claim 1, characterized in that, before determining the second text label that matches the predetermined keyword among the plurality of first text labels of each text to be analyzed, the method further comprises:

obtaining preset text labels and conjunctive words of the preset text labels, wherein the preset text labels include the first text labels, and each preset text label corresponds to at least one conjunctive word;

traversing the plurality of texts to be analyzed to obtain the conjunctive words that each text to be analyzed contains; and

looking up the preset text labels corresponding to the conjunctive words contained in each text to be analyzed, as the plurality of first text labels.

4. The method according to claim 3, characterized in that, before calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method further comprises:

calculating the label index data A of each first text label according to a first formula, wherein

the first formula is:

A = Σ_{i = 1}^{n} B_{i} * b_{i},

n is the number of conjunctive words corresponding to the first text label, B_i is the number of times the i-th conjunctive word of the first text label appears in one of the texts to be analyzed, and b_i is the preset weight of the i-th conjunctive word with respect to the first text label.

5. The method according to claim 4, characterized in that calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword comprises:

taking the conjunctive words of the second text label that are identical to the derivative keyword as third words; and

calculating the degree of association data G of each text to be analyzed according to a second formula, wherein

the second formula is

G = K * C + K * D * d + Σ_{j = 1}^{m} (k_{j} * C + k_{j} * E_{j} * e_{j}),

K is the first number of the predetermined keyword, C is the label index data of the second text label, D is the number of times that the second text label appears in one of the texts to be analyzed, d is the preset weight of the second text label, m is the number of third words, k_j is the second number of the derivative keyword corresponding to the j-th third word, E_j is the third number of times that the j-th third word appears in one of the texts to be analyzed, and e_j is the preset weight of the j-th third word with respect to the second text label.

6. The method according to any one of claims 1 to 5, characterized in that, after calculating the degree of association data between the predetermined keyword and each text to be analyzed based on the label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword, the method further comprises:

sorting the degree of association data of the texts to be analyzed from high to low to obtain a degree of association ranking table; and

displaying the top N degree of association data in the ranking table and their corresponding texts to be analyzed, wherein N is a natural number.

7. A device for obtaining article degree of association data, characterized by comprising:

a first acquisition module, configured to obtain a predetermined keyword and a plurality of texts to be analyzed;

a statistical module, configured to count derivative keywords of the predetermined keyword corresponding to the plurality of texts to be analyzed, wherein a derivative keyword is a keyword that appears in one of the texts to be analyzed together with the predetermined keyword;

a first determining module, configured to determine a first number of times that the predetermined keyword appears in the plurality of texts to be analyzed and a second number of times that the derivative keyword appears in the plurality of texts to be analyzed;

a second determining module, configured to determine, among a plurality of first text labels of each text to be analyzed, a second text label that matches the predetermined keyword, wherein the first text labels are used to identify themes of the text to be analyzed; and

a first computing module, configured to calculate degree of association data between the predetermined keyword and each text to be analyzed based on label index data of the second text label of each text to be analyzed, the first number of the predetermined keyword and the second number of the derivative keyword.

8. The device according to claim 7, characterized in that the statistics module comprises:

a word segmentation module, configured to perform word segmentation on the plurality of texts to be analyzed to obtain a word set;

a second acquisition module, configured to obtain a first quantity of each first word in the word set, wherein the first quantity is greater than a first preset threshold;

a third acquisition module, configured to obtain a second quantity of each second word in the word set, wherein the second quantity is the total number of times that the second word and the preset keyword appear together in each text to be analyzed, and the second quantity is greater than a second preset threshold;

a comparison module, configured to compare the second word with the first word, wherein if the second word is identical to the first word, the ratio of the second quantity to the first quantity is taken as the occurrence count of the second word, and if the second word differs from the first word, the second quantity is taken as the occurrence count of the second word; and

a third determination module, configured to determine each second word whose occurrence count is greater than a third preset threshold as a derivative keyword.
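The statistics steps above can be sketched roughly as follows. Several points are assumptions rather than statements of the claimed method: word segmentation is replaced by whitespace splitting (Chinese text would need a real segmenter), the first quantity is taken as a word's total count over all texts, and the second quantity is taken as the word's count summed over the texts that contain the preset keyword. The function name `derivative_keywords` and the thresholds `t1`, `t2`, `t3` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def derivative_keywords(texts: List[str], keyword: str,
                        t1: float, t2: float, t3: float) -> Dict[str, float]:
    """Return candidate derivative keywords mapped to their occurrence count."""
    # Assumption: whitespace splitting stands in for word segmentation.
    tokenized = [text.split() for text in texts]

    # First words: words whose total count (first quantity) exceeds t1.
    totals = Counter(word for tokens in tokenized for word in tokens)
    first = {w: c for w, c in totals.items() if c > t1}

    # Second words: words co-occurring with the keyword, count above t2.
    co_counts: Counter = Counter()
    for tokens in tokenized:
        if keyword in tokens:
            co_counts.update(w for w in tokens if w != keyword)
    second = {w: c for w, c in co_counts.items() if c > t2}

    result: Dict[str, float] = {}
    for word, second_qty in second.items():
        # Same word in both sets: use the ratio; otherwise use the second quantity.
        occurrence = second_qty / first[word] if word in first else float(second_qty)
        if occurrence > t3:
            result[word] = occurrence
    return result


texts = ["big data cloud cloud", "big data cloud", "cloud weather weather weather"]
found = derivative_keywords(texts, "data", t1=2, t2=1, t3=0.5)
print(found)  # {'big': 2.0, 'cloud': 0.75}
```

In this toy input, "cloud" is frequent everywhere, so its co-occurrence count of 3 is divided by its total count of 4, while "big" is not a first word and keeps its raw co-occurrence count of 2.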

9. The device according to claim 7, characterized in that the device further comprises:

a fourth acquisition module, configured to obtain, before the second text label that matches the preset keyword is determined among the plurality of first text labels of each text to be analyzed, preset text labels and the associated words of the preset text labels, wherein the preset text labels include the first text labels, and each preset text label corresponds to at least one associated word;

a traversal module, configured to traverse the plurality of texts to be analyzed to obtain the plurality of associated words included in each text to be analyzed; and

a lookup module, configured to look up the plurality of preset text labels corresponding to the associated words included in each text to be analyzed, and to take them as the plurality of first text labels.
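One possible way to look up the first text labels of a text from its associated words is an inverted index from associated word to preset text label, as in the sketch below. The function name `first_text_labels`, the sample labels, and the assumption that the text is already segmented into tokens are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def first_text_labels(tokens: Iterable[str],
                      label_to_words: Dict[str, List[str]]) -> Set[str]:
    """Return the preset text labels whose associated words occur in the text."""
    # Build an inverted index: associated word -> preset text labels using it.
    word_to_labels: Dict[str, Set[str]] = {}
    for label, words in label_to_words.items():
        for word in words:
            word_to_labels.setdefault(word, set()).add(label)

    # Traverse the text and collect the labels of every associated word found.
    labels: Set[str] = set()
    for token in tokens:
        labels |= word_to_labels.get(token, set())
    return labels


# Made-up preset text labels and associated words.
preset = {"technology": ["cloud", "data"], "finance": ["stock"]}
print(sorted(first_text_labels(["big", "data", "stock", "data"], preset)))
```

An associated word may belong to more than one preset text label, which is why the index maps each word to a set of labels.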

10. The device according to claim 9, characterized in that the device further comprises:

a second calculation module, configured to calculate the label index data A of each first text label according to a first formula, before the association degree data of the preset keyword and each text to be analyzed is calculated based on the label index data of the second text label of each text to be analyzed, the first number of times of the preset keyword, and the second number of times of the derivative keyword, wherein

the first formula is:

A = Σ_{i = 1}^{n} B_{i} * b_{i},

wherein n is the number of associated words corresponding to the first text label, B_i is the number of times that the i-th associated word corresponding to the first text label appears in one text to be analyzed, and b_i is the preset weight of the first text label corresponding to the i-th associated word of the first text label.
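For illustration, the first formula A = Σ_{i = 1}^{n} B_i * b_i can be computed for one first text label and one text to be analyzed as in the following sketch. It assumes the text is already segmented into tokens and that the preset weights b_i are supplied per associated word; the function name `label_index` and the sample weights are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def label_index(tokens: Iterable[str], associated_weights: Dict[str, float]) -> float:
    """Evaluate A = sum_i B_i * b_i for one first text label and one text."""
    counts = Counter(tokens)  # counts[w] is B_i for associated word w (0 if absent)
    return sum(counts[word] * weight for word, weight in associated_weights.items())


# Made-up associated words and preset weights for a single label.
weights = {"cloud": 0.5, "data": 1.0, "server": 2.0}
A = label_index(["cloud", "data", "cloud"], weights)
print(A)  # 2*0.5 + 1*1.0 + 0*2.0 = 2.0
```

Associated words that do not occur in the text contribute zero, since `Counter` returns 0 for missing keys.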

11. The device according to claim 10, characterized in that the first calculation module comprises:

a fourth determination module, configured to determine, as a third word, an associated word corresponding to the second text label that is identical to the derivative keyword; and

a calculation submodule, configured to calculate the association degree data G of each text to be analyzed according to a second formula, wherein

the second formula is:

G = K * C + K * D * d + Σ_{j = 1}^{m} (k_{j} * C + k_{j} * E_{j} * e_{j}),

12. The device according to any one of claims 7 to 11, characterized in that the device further comprises:

a sorting module, configured to sort the association degree data of each text to be analyzed in descending order to obtain an association degree ranking table, after the association degree data of the preset keyword and each text to be analyzed is calculated based on the label index data of the second text label of each text to be analyzed, the first number of times of the preset keyword, and the second number of times of the derivative keyword; and

a display module, configured to display the top N association degree data in the association degree ranking table and the corresponding texts to be analyzed, wherein N is a natural number.