Topic word extraction method, and method and device for obtaining related digital resources using the same
Technical field
The present invention relates to the field of digital resource processing, and in particular to a topic word extraction method and to a method and device for obtaining related digital resources.
Background technology
With the rapid development of the Internet, digital newspapers have become increasingly popular, which greatly enhances the interactivity between users and newspapers and makes the personalized organization and generation of newspaper special topics possible. In addition, news reports are added every day across the country, mostly covering the newest events and containing a large number of neologisms. "Neologisms" mainly refers to words whose content or form is new, words that did not previously exist in the lexical system, or existing words that now carry a brand-new meaning.
In order to better describe these digital resources and facilitate subsequent processing such as the recommendation and retrieval of related special topics, topic words need to be extracted from them. The prior art generally segments the digital resource, merges the resulting vocabulary, and takes the most frequently occurring words as topic words. However, a single word may carry several different meanings, while different words may express the same meaning (for example, "mobile phone" and "handset" mean the same thing), which interferes with topic word extraction. In addition, existing topic word extraction methods generally require manually edited feature words or a list of candidate topic words, and determine topic word candidates with named entity techniques, relying on a vector space model and named entity recognition. Such procedures are complex and require a large amount of data computation.
The topic words extracted above can be used for the organization and generation of digital resources such as news special topics. Organizing and generating a news special topic means assembling related news items into a single topic. For example, when a newspaper reader is interested in a certain media event, he or she wishes to conveniently and efficiently obtain more related reports from the massive news coverage of multiple newspapers, improving both the efficiency of information acquisition and the personalization of reading. For example, while a user is reading a foreign-press report on the "March 1 Kunming railway station terrorist attack" and wants to quickly view other foreign-press reports on the same event, the system first takes the news item the user has chosen to read and extracts its topic words by analysis; it then compares the keywords of the remaining news items with those topic words, and organizes the highly correlated items together to form a special topic. At present, special topics are mainly extracted in advance from a newspaper resource library using techniques such as vector space models, named entity recognition and text clustering, and are pushed to users for browsing. Such methods depend strongly on the selection of feature words and on named entity recognition, and are therefore less effective on newspaper texts in which neologisms occur frequently. They also fail to fully account for the semantic information of the news and for the interference that polysemous words and synonyms cause to the topic word vector, and they cannot organize and generate a personalized special topic according to the report the user is currently interested in.
The content of the invention
Therefore, the technical problem to be solved by the present invention is to overcome the defects of prior-art topic word extraction, which is subject to interference from polysemous words and synonyms and which requires manually edited feature words or a candidate topic word list, together with named entity techniques, to determine topic word candidates, thereby providing a topic word extraction method and device.
Another technical problem the invention solves is to overcome the defects of prior-art special topic generation, which requires a vector space model and named entity recognition and has poor robustness, thereby providing a method and device for obtaining related digital resources.
The present invention provides a topic word extraction method, including the following steps:
segmenting the text of a digital resource;
obtaining meaning words from the segmentation result;
for each topic, obtaining a probability distribution of the meaning words, the probability distribution including the meaning words and their corresponding weights;
obtaining each sense of the meaning words, and merging meaning words having the same sense together with their corresponding weights;
determining topic words according to the merged meaning words and their weights.
In addition, the present invention provides a method for obtaining related digital resources, including the following steps:
extracting the topic words of a first digital resource;
obtaining the keywords of a second digital resource and their weights;
obtaining the text similarity between the first digital resource and the second digital resource;
obtaining the semantic distribution density of the topic words in the second digital resource;
judging whether the text similarity exceeds a text similarity threshold and whether the semantic distribution density exceeds a semantic distribution density threshold, and if both do, taking the second digital resource as a related digital resource of the first digital resource.
In addition, the present invention also provides a topic word extraction device, including:
a segmentation unit, which segments the text of a digital resource;
a segmentation result processing unit, which obtains meaning words from the segmentation result;
a probability distribution unit, which, for each topic, obtains a probability distribution of the meaning words, the probability distribution including the meaning words and their corresponding weights;
a merging unit, which obtains each sense of the meaning words and merges meaning words having the same sense together with their corresponding weights;
a topic word determination unit, which determines topic words according to the merged meaning words and their weights.
In addition, the present invention also provides a device for obtaining related digital resources, including:
a topic word extraction unit, which extracts the topic words of a first digital resource;
a keyword determination unit, which obtains the keywords of a second digital resource and their weights;
a text similarity acquiring unit, which obtains the text similarity between the first digital resource and the second digital resource;
a semantic distribution density acquiring unit, which obtains the semantic distribution density of the topic words in the second digital resource;
a related resource determination unit, which judges whether the text similarity exceeds a text similarity threshold and whether the semantic distribution density exceeds a semantic distribution density threshold, and if both do, takes the second digital resource as a related digital resource of the first digital resource.
The technical solutions of the present invention have the following advantages:
1. The present invention provides a topic word extraction method. First, the text of a digital resource is segmented, and meaning words are obtained from the segmentation result. For each topic, a probability distribution of the meaning words is obtained, including the meaning words and their corresponding weights. Each sense of the meaning words is obtained, meaning words with the same sense are merged together with their corresponding weights, and topic words are determined according to the merged meaning words and their weights. By merging words with the same sense, this scheme approaches the problem from the angle of word senses, avoiding the interference that polysemous words and synonyms cause to prior-art topic word extraction and improving extraction accuracy. In addition, the scheme needs neither manually edited feature words nor a candidate topic word list, nor named entity techniques for determining topic word candidates. Local feature words are selected by filtering out functional words, without using a vector space model or named entity recognition, which enhances the robustness of the topic word extraction method.
2. In the topic word extraction method of the present invention, the mapping between words and word senses is established in advance. Through this correspondence the senses of each meaning word are obtained, meaning words with the same sense are merged and their weights accumulated, the merged meaning words are sorted by descending weight, and a preset number of the top-ranked meaning words are selected as topic words, for example the top 20% as keywords. Merging meaning words with the same sense improves keyword accuracy, and selecting the top 20% of meaning words essentially covers the important information of the digital resource while reducing the subsequent data processing load.
3. The present invention also provides a method for obtaining related digital resources. First, the topic words of a first digital resource are extracted; then the keywords of a second digital resource and their weights are obtained, the text similarity between the first and second digital resources is obtained, and the semantic distribution density of the topic words in the second digital resource is obtained. When the text similarity exceeds a text similarity threshold and the semantic distribution density exceeds a semantic distribution density threshold, the second digital resource is taken as a related digital resource of the first. In this scheme, the relatedness of two digital resources is measured from two aspects: the text similarity indicates the degree to which the two texts describe the same subject, while the semantic distribution density represents how evenly the topic words of the first digital resource are distributed in the second. These two values quantify the degree of correlation between digital resources, so that accurately related digital resources are obtained.
Description of the drawings
To explain the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the topic word extraction method in Embodiment 1 of the present invention;
Fig. 2 is a flow chart of the method for obtaining related digital resources in Embodiment 2 of the present invention;
Fig. 3 is a flow chart of the special topic generation method in Embodiment 3 of the present invention;
Fig. 4 is a flow chart of generating the topic term vector of a special topic in Embodiment 4 of the present invention;
Fig. 5 is a flow chart of generating a special topic in Embodiment 4 of the present invention;
Fig. 6 is a schematic diagram of a special topic list in Embodiment 4 of the present invention;
Fig. 7 is a schematic diagram of the topic word extraction device in Embodiment 5 of the present invention;
Fig. 8 is a schematic diagram of the device for obtaining related digital resources in Embodiment 6 of the present invention;
Fig. 9 is a schematic diagram of the special topic generating device in Embodiment 7 of the present invention.
Specific embodiment
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second" and "third" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, the technical features of the different embodiments of the invention described below can be combined with each other as long as they do not conflict.
Embodiment 1
This embodiment provides a topic word extraction method for extracting the topic words of a digital resource. The digital resource here can be one file or several files; after a digital resource is selected, topic words are extracted for it. The flow chart of the method is shown in Fig. 1 and includes the following steps:
S11, segmenting the text of the digital resource.
After the digital resource is selected, the set of selected digital resources is defined as D={d1,d2,…,dm}, where di, i=1,…,m, represents the i-th news text, and m may be 1. A user dictionary is loaded and each news text is segmented. The user dictionary is a word set composed of idioms, abbreviations and neologisms; its role is to add domain-specific jargon, such as idioms, abbreviations and neologisms, so as to improve the precision of the segmenter. It is defined as userLib={e1,e2,…,er}, where ei, i=1,…,r, represents a word or phrase.
In this step, segmentation can be completed with a mature prior-art segmenter; the user dictionary helps segment the text reasonably and improves segmentation precision. Through segmentation, the digital resource is divided into a series of phrases and words.
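As an illustration of step S11, the following is a minimal sketch of dictionary-driven segmentation using forward maximum matching. The function name, the sample userLib contents and the matching strategy are assumptions made for this example; in practice a mature prior-art segmenter loaded with the user dictionary would be used.

```python
def segment(text, user_dict):
    """Greedy forward-maximum-match segmentation over a user dictionary.

    Tries the longest dictionary entry starting at the current position
    first, and falls back to a single character when nothing matches.
    """
    max_len = max(len(w) for w in user_dict)
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in user_dict:
                tokens.append(piece)
                i += length
                break
    return tokens

# Illustrative userLib: a neologism and a common term.
user_lib = {"数字报", "新闻"}
print(segment("数字报新闻", user_lib))
```

Adding "数字报" (digital newspaper) to the dictionary keeps it from being split character by character, which is exactly the precision gain the user dictionary provides.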
S12, obtaining meaning words from the segmentation result.
The segmentation result contains all the words in the digital resource, some of which carry no specific meaning, such as modal particles and auxiliary words, as well as punctuation and other words carrying no specific information; all of these need to be removed. A stop word list is pre-established and stop parts of speech are set. The stop word list is a word set composed of punctuation marks and words that are meaningless in journalistic style, defined as stopWords={w1,w2,…,ws}, where wi, i=1,…,s, represents a word, punctuation mark or phrase. The stop parts of speech form a set of functional parts of speech, defined as stopSpeeches={s1,s2,…,st}, where si, i=1,…,t, represents a functional part of speech, such as modal particles or auxiliary words. Here, local feature words are selected by filtering functional words with stopWords and stopSpeeches, without using a vector space model or named entity recognition, which enhances the robustness of the topic word extraction method. The step includes the following process:
First, the segmentation result is denoised with the stop word list and stop parts of speech to obtain a word sequence. The punctuation and meaningless words in the stop word list are removed from the segmentation result, as are functional words, yielding a series of words. The generated word sequence is defined as seqTerms={term1,term2,…,termo}, where termi, i=1,…,o, represents the i-th meaning word. In this word sequence, the words are arranged in textual order, and repeated words are retained in the sequence in their order of appearance.
Then, identical words in the word sequence are merged, and the resulting words are taken as meaning words. For the word sequence of the previous step, identical elements in seqTerms are merged to form the meaning word set; the meaning words of all resources in D form the meaning word set of D, defined as V={v1,v2,…,vn}, where vi, i=1,…,n, represents the i-th meaning word in V.
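The two stages of S12 can be sketched as below. The function name and the sample stop sets are illustrative assumptions; the input is taken to be part-of-speech-tagged tokens, since filtering by stopSpeeches requires part-of-speech information from the segmenter.

```python
def meaning_words(tagged_tokens, stop_words, stop_speeches):
    """Step S12 sketch: denoise tagged tokens into seqTerms, then merge
    duplicates (keeping first occurrence) to form the meaning word set V."""
    # Stage 1: drop stop words/punctuation and functional parts of speech.
    seq_terms = [w for w, pos in tagged_tokens
                 if w not in stop_words and pos not in stop_speeches]
    # Stage 2: merge identical words while preserving textual order.
    seen, v = set(), []
    for term in seq_terms:
        if term not in seen:
            seen.add(term)
            v.append(term)
    return seq_terms, v

tagged = [("violence", "n"), ("，", "w"), ("case", "n"), ("了", "y"), ("case", "n")]
seq, v = meaning_words(tagged, stop_words={"，"}, stop_speeches={"y"})
```

Note that seqTerms keeps the repetition of "case" in textual order, while V merges it away, matching the description above.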
S13, for each topic, obtaining the probability distribution of the meaning words, the probability distribution including the meaning words and their corresponding weights.
A document topic model is used to calculate the topic probability distribution of the meaning words in V. Each digital resource can belong to several different topics, but its probability distribution differs from topic to topic; here, a document topic model is used to calculate the probability distribution of all the meaning words in V for the selected topic.
The document topic model is realized with a prior-art scheme. For example, LDA (Latent Dirichlet Allocation) is a document topic model, also called a three-layer Bayesian probability model, with a three-layer structure of words, topics and documents. In such a generative model, each word of an article is considered to be obtained through the process of "selecting some topic with a certain probability, then selecting some word from that topic with a certain probability". Both the document-to-topic and topic-to-word distributions are multinomial. LDA is an unsupervised machine learning technique that can be used to identify latent topic information in a large-scale document collection or corpus. It adopts the bag-of-words approach, which treats each document as a word frequency vector, thereby converting text information into numerical information that is easy to model. The bag-of-words approach does not consider word order, which simplifies the problem while also leaving room for improving the model. Each document represents a probability distribution over some topics, and each topic represents a probability distribution over many words.
Therefore, the document topic model can calculate the probability distribution with which the meaning words in V belong to the selected topic. These probabilities are arranged in descending order, and the descending topic term vector of a given topic is termFreq=(fterm1,fterm2,…,ftermp), where ftermi, i=1,…,p, represents the meaning word with the i-th highest probability, each meaning word corresponding to one probability weight.
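Once a topic model such as LDA has produced per-word weights for a topic, building the descending termFreq vector is a simple sort and normalization, sketched below. The function name is an assumption, and the word weights are supplied directly rather than inferred by an actual LDA run, which is beyond the scope of this sketch.

```python
def topic_term_vector(word_weights):
    """Arrange a topic's meaning-word weights as the descending termFreq
    vector of step S13: (word, probability) pairs, highest probability
    first; ties are broken alphabetically for determinism."""
    total = sum(word_weights.values())
    probs = {w: c / total for w, c in word_weights.items()}
    return sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))

# Illustrative per-topic weights, e.g. expected word counts from a topic model.
term_freq = topic_term_vector({"event": 2, "station": 1, "report": 1})
```

In practice these weights would come from the topic-word distribution estimated by the document topic model for the selected topic.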
S14, obtaining each sense of the meaning words, and merging meaning words having the same sense together with their corresponding weights. The process is as follows:
First, the mapping between words and word senses is established. Let W={wi, i=1,…,u} be the set of polysemous words and M={mj, j=1,…,v} be the set of sense codes. The word-to-sense mapping generated from a Chinese thesaurus is defined as synonymyMap: for a word x with several senses, the corresponding sense set is Y, each element of Y corresponding to one sense of the word x. For example, for the word "mobile phone", the corresponding sense set is {mobile phone, handset, telephone}.
Second, the senses of each meaning word are obtained. For each meaning word in termFreq, its corresponding sense set is obtained.
Third, meaning words with the same sense are found. The sense sets are compared to see whether the sense sets of two meaning words contain an identical sense code. If an identical sense code exists, the two meaning words share a sense, and the next step is performed; otherwise no operation is performed.
Fourth, the meaning words with the same sense are merged into one meaning word. The merged meaning word can be chosen as the highest-weight member among the meaning words sharing the sense.
Fifth, the weights of the meaning words with the same sense are accumulated as the weight of the merged meaning word.
Through the above process, the merged meaning words and their corresponding weights are obtained.
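The five sub-steps of S14 can be sketched as one pass over termFreq. The function name, the cluster representation and the sample sense codes are assumptions for this example; synonymyMap is modeled as a dict from word to its set of sense codes.

```python
def merge_senses(term_freq, sense_map):
    """Step S14 sketch: merge meaning words that share any sense code.

    The representative of a merged group is its highest-weight member,
    and the weights of all members are accumulated.
    """
    # Each cluster: [representative, sense-code set, total weight, rep weight].
    merged = []
    for word, weight in term_freq:
        senses = set(sense_map.get(word, {word}))  # unknown word keeps itself
        for cluster in merged:
            if cluster[1] & senses:          # shares a sense code -> merge
                cluster[1] |= senses
                cluster[2] += weight
                if weight > cluster[3]:      # keep highest-weight member
                    cluster[0], cluster[3] = word, weight
                break
        else:
            merged.append([word, senses, weight, weight])
    return [(c[0], c[2]) for c in merged]

# "mobile phone" and "handset" share the (hypothetical) sense code "Bo21".
term_freq = [("mobile phone", 4), ("handset", 3), ("station", 2)]
sense_map = {"mobile phone": {"Bo21"}, "handset": {"Bo21"}, "station": {"Dm04"}}
```

Because termFreq is already sorted by descending weight, the first member of each group is its representative, but the sketch handles unsorted input as well.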
S15, determining topic words according to the merged meaning words and their weights.
The merged meaning words are sorted by descending weight, and a preset number of the top-ranked meaning words are selected as topic words. Generally, the preset number is 10%-30% of the total; preferably, the preset number is 20% of the total. The top 20% of meaning words essentially cover the subject of the digital resource while reducing the subsequent computation load. The topic term vector obtained by applying synonymyMap to termFreq for semantic deduplication and selecting the top θ meaning words is defined as topicWords=(tterm1,tterm2,…,ttermq), where ttermi, i=1,…,q (q<p), represents the topic word with the i-th highest semantic weight, and its corresponding distribution probability is defined as pi.
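Selecting the top fraction θ in step S15 can be sketched as follows; the function name and the rounding choice (ceiling, with at least one word kept) are assumptions, since the disclosure only fixes the fraction.

```python
import math

def select_topic_words(merged_terms, theta=0.2):
    """Step S15 sketch: sort merged meaning words by descending weight and
    keep the top theta fraction (preferably 20%) as topicWords."""
    ranked = sorted(merged_terms, key=lambda kv: -kv[1])
    k = max(1, math.ceil(len(ranked) * theta))  # keep at least one word
    return ranked[:k]

terms = [("a", 1), ("b", 5), ("c", 3), ("d", 2), ("e", 4)]
```

With five merged terms, θ=0.2 keeps only the single heaviest word, and θ=0.4 keeps the top two.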
In the above scheme of this embodiment, words with the same sense are merged from the angle of word senses, which avoids the interference of polysemous words and synonyms in prior-art topic word extraction and improves extraction accuracy. In addition, the scheme needs neither manually edited feature words nor a candidate topic word list, nor named entity techniques for determining topic word candidates. Local feature words are selected by filtering functional words with stopWords and stopSpeeches, without using a vector space model or named entity recognition, which enhances the robustness of the topic word extraction method.
In a further embodiment, the mapping between words and word senses is established in advance. Through this correspondence the several senses of each meaning word are obtained, meaning words containing the same sense are merged and their weights accumulated, the merged meaning words are sorted by descending weight, and a preset number of the top-ranked meaning words are selected as topic words, for example the top 20% as keywords. Merging meaning words with the same sense improves keyword accuracy, and selecting the top 20% of meaning words essentially covers the important information of the digital resource while reducing the subsequent data processing load.
Embodiment 2
This embodiment provides a method for obtaining, from a mass of digital resources, the digital resources related to a selected digital resource. First, a first digital resource is selected; it may be a single digital resource or several digital resources belonging to one topic. The purpose of this embodiment is to find the other digital resources related to the first digital resource. The flow chart of the method is shown in Fig. 2 and includes the following steps:
S21, extracting the topic words of the first digital resource using the method of Embodiment 1. After the first digital resource is selected, its topic words are extracted with the method of Embodiment 1, which is not repeated here. That method yields the topic term vector of the first digital resource, topicWords=(tterm1,tterm2,…,ttermq), where ttermi, i=1,…,q (q<p), represents the topic word with the i-th highest semantic weight, and its corresponding distribution probability is defined as pi.
S22, obtaining the keywords of the second digital resource and their weights. The second digital resource is a digital resource to be judged for relatedness to the first digital resource; it can be any digital resource other than the first. The keywords of the second digital resource and their weights are obtained as follows:
First, the text of the second digital resource is segmented, in the same way as in Embodiment 1, which is not repeated here.
Second, the segmentation result is denoised to obtain a word sequence. As in Embodiment 1, the stop word list and stop parts of speech are used to denoise the segmentation result into the word sequence seqTerms, in which the words are arranged in textual order and repeated words are retained in their order of appearance.
Third, the words in the word sequence are arranged in descending order by the TF-IDF method.
TF-IDF is a prior-art statistical method for evaluating the importance of a word to one file in a file set or corpus. The importance of a word increases in proportion to the number of times it appears in the file, but decreases in inverse proportion to its frequency in the corpus. The main idea of TF-IDF is: if a word or phrase has a high frequency TF in one article but seldom appears in other articles, the word or phrase is considered to have good class discrimination ability and to be suitable for classification. A high term frequency in a given file combined with a low document frequency across the whole file set produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
Processing by the TF-IDF method yields the important words and their weights, arranged in descending order of weight.
Fourth, each sense of the words retained in the previous step is obtained, words with the same sense are merged, and the merged words are taken as keywords.
Merging words with the same sense is the same as in Embodiment 1, deduplicating through the synonymyMap set. The keyword vector obtained by arranging the meaning words in seqTerms in descending TF-IDF order and deduplicating with synonymyMap is keyWords=(kterm1,kterm2,…,ktermQ), where ktermi, i=1,…,Q, represents the i-th most important keyword and Q represents the total number of keywords. The weight of ktermi is set to wi.
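The TF-IDF ranking of the third sub-step can be sketched as below. The function name and the particular smoothed IDF variant are assumptions; the filing describes the TF-IDF idea without fixing one formula.

```python
import math

def tfidf_keywords(doc_terms, corpus):
    """Step S22 sketch: rank one document's terms by TF-IDF, descending.

    doc_terms is the denoised word sequence of the second digital resource;
    corpus is the list of all documents' word sequences.
    """
    n_docs = len(corpus)
    scores = {}
    for term in set(doc_terms):
        tf = doc_terms.count(term) / len(doc_terms)          # term frequency
        df = sum(1 for doc in corpus if term in doc)         # document frequency
        idf = math.log(n_docs / (1 + df)) + 1                # smoothed IDF (assumed variant)
        scores[term] = tf * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

corpus = [["event", "event", "case"], ["case", "report"], ["report"]]
ranked = tfidf_keywords(corpus[0], corpus)
```

"event" appears twice in the document and nowhere else in the corpus, so it outranks "case", which also occurs in a second document; this is the "high TF, low DF" behavior described above.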
S23, obtaining the text similarity between the first digital resource and the second digital resource.
The text similarity is computed as s = Σ_{i=1}^{M} wi·pi, where M is the total number of non-duplicate semantic words contained in both the keywords of the second digital resource and the topic words of the first digital resource, wi represents the weight of the i-th non-duplicate semantic word in the second digital resource, and pi represents the distribution probability of the i-th non-duplicate semantic word in the topic words of the first digital resource.
Although the prior art offers many ways to calculate text similarity, the above method of this embodiment achieves a better effect.
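Under the formula as reconstructed above (a weighted dot product over the shared semantic words), step S23 can be sketched as follows; the function name is an assumption, and the inputs are the topicWords and keyWords vectors as (word, weight) pairs.

```python
def text_similarity(topic_words, key_words):
    """Step S23 sketch: s = sum of w_i * p_i over the semantic words shared
    by the first resource's topicWords (probabilities p_i) and the second
    resource's keyWords (weights w_i)."""
    p = dict(topic_words)
    return sum(w * p[term] for term, w in key_words if term in p)

s = text_similarity(
    topic_words=[("event", 0.5), ("case", 0.25)],   # (tterm_i, p_i)
    key_words=[("event", 0.4), ("mayor", 0.1)],      # (kterm_i, w_i)
)
```

Only "event" is shared, so s = 0.4 × 0.5 = 0.2, which would then be compared against the threshold ξ of step S25.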
S24, obtaining the semantic distribution density of the topic words in the second digital resource.
The semantic distribution density ρ is computed as follows:
First, the non-duplicate words contained in both the topic words of the first digital resource and the keywords of the second digital resource are chosen.
Second, these words are sorted from high to low by their weight in the topic words of the first digital resource.
Third, a preset number of the top-ranked words are selected as density focus words. Three words can be selected here; other quantities can also be selected as needed.
Fourth, the same-sense words of the density focus words are obtained. Each selected density focus word corresponds to several same-sense words with the same or similar meaning; in the same way as above, the same-sense words of each density focus word are obtained.
Fifth, the position of the first-appearing same-sense word in the second digital resource is obtained; that is, among the several same-sense words, the position of the one that appears earliest is taken as the earliest position.
Sixth, the position of the last-appearing same-sense word in the second digital resource is obtained; that is, among the several same-sense words, the position of the one that appears last is taken as the latest position.
Seventh, the distance between the first-appearing and last-appearing same-sense words is obtained; it can be counted in characters or in words.
Eighth, the ratio of this distance to the length of the second digital resource is taken as the semantic distribution density. The length of the second digital resource is also counted in characters or in words. This ratio represents how evenly the topic words of the first digital resource are distributed in the second digital resource, and the two values together quantify the degree of correlation between the digital resources.
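The eight sub-steps of S24 can be sketched as below. The function name is an assumption; positions are counted in words (token indices), one of the two counting options the disclosure allows, and sense_equiv stands in for the same-sense lookup of the fourth sub-step.

```python
def semantic_density(topic_words, doc_tokens, sense_equiv, n_focus=3):
    """Step S24 sketch: rho = span between the first and last occurrence of
    any sense-equivalent of the density focus words, over document length.

    topic_words: (word, weight) pairs shared with the second resource's
    keywords; doc_tokens: the second resource as a token list.
    """
    # Sub-steps 2-3: top n_focus shared words by topic-word weight.
    focus = [w for w, _ in sorted(topic_words, key=lambda kv: -kv[1])[:n_focus]]
    # Sub-step 4: expand each focus word to its same-sense words.
    targets = set()
    for w in focus:
        targets |= sense_equiv.get(w, {w})
    # Sub-steps 5-7: earliest and latest occurrence positions and distance.
    positions = [i for i, tok in enumerate(doc_tokens) if tok in targets]
    if not positions:
        return 0.0
    # Sub-step 8: ratio of the distance to the document length.
    return (positions[-1] - positions[0]) / len(doc_tokens)

doc = ["x", "event", "x", "x", "case", "x"]
rho = semantic_density([("event", 3), ("case", 2)], doc, sense_equiv={})
```

Here the focus words first appear at position 1 and last at position 4 in a six-token document, so ρ = 3/6 = 0.5, right at the preferred threshold δ of step S25.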
S25, judging whether the text similarity exceeds the text similarity threshold and whether the semantic distribution density exceeds the semantic distribution density threshold; if both do, the second digital resource is taken as a related digital resource of the first digital resource.
Usually, the text similarity threshold is set to 0.2-0.4 and the semantic distribution density threshold to 0.4-0.6. Preferably, the text similarity threshold is set to ξ=0.3 and the semantic distribution density threshold to δ=0.5; when s>ξ and ρ>δ, the second digital resource is taken as a related digital resource of the first digital resource.
In the scheme of this embodiment, the relatedness of two digital resources is measured from two aspects, their text similarity and the semantic distribution density. The text similarity indicates the degree to which the two texts describe the same subject, and the semantic distribution density represents how evenly the topic words of the first digital resource are distributed in the second. These two values quantify the degree of correlation between digital resources, so that accurately related digital resources are obtained; this can be used in fields such as the recommendation of related digital resources and the building of special topic libraries.
Embodiment 3
This embodiment provides a topic generation method: according to the files the user has read and is interested in, files in the resource library that belong to the same topic as those the user has read are obtained, and these topics are pushed to the user, improving the user experience. The flow of the topic generation method is shown in Fig. 3 and comprises the following steps:
S31: selecting a first digital resource. A digital resource the user is interested in or follows, or one the user has read, may be selected. This step selects the reference information: the first digital resource serves as the reference for subsequent processing.
S32: choosing candidate digital resources in turn as the second digital resource. A digital resource is selected from the candidate resource library as the second digital resource for subsequent processing.
S33: obtaining the second digital resources correlated with the first digital resource using the method described in Embodiment 2. If s>ξ and ρ>δ are satisfied, the second digital resource is taken as a correlated digital resource of the first digital resource; otherwise it is not considered correlated. By traversing all the second digital resources chosen in turn in S32, all second digital resources in the candidate resource library that are correlated with the first digital resource are obtained as the digital resources in the topic.
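The S31-S33 traversal can be sketched as a single loop. Here `correlate` stands in for the Embodiment 2 computation and is assumed to return the pair (s, ρ); the thresholds are the preferred values ξ=0.3 and δ=0.5:

```python
def build_topic(first_resource, candidates, correlate, xi=0.3, delta=0.5):
    """Traverse the candidate library (S32) and keep every second resource
    judged correlated with the first (S33)."""
    topic = []
    for second in candidates:
        s, rho = correlate(first_resource, second)
        if s > xi and rho > delta:
            topic.append(second)
    return topic
```

With a real corpus, `correlate` would wrap the text similarity and density computations of Embodiment 2.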
The scheme of this embodiment can be used to obtain digital resources of interest according to what the user is currently reading, for example by semantically extracting the topic word vector of the user's stories of interest from the news text and then using topic relevance to organize and generate a personalized topic from the digital newspaper resource library. Starting from the report the user is currently reading, text processing yields a semantically extracted topic word vector of the stories of interest; correlated reports are then extracted from the digital newspaper resource library according to this topic word vector, and a personalized newspaper topic is organized from the strength of correlation and the distribution of the topic words, allowing the user to quickly obtain the stories of interest. The scheme eliminates the prior-art dependence on feature word selection and named entity recognition, and weakens the interference of polysemous words and synonyms on the topic word vector: neither manually edited feature words or topic word candidate lists, nor named entity techniques for determining topic word candidates, are required, realizing user-oriented personalized topic organization and generation.
In a further embodiment, the method further includes obtaining the priority of the second digital resources correlated with the first digital resource and sorting the second digital resources by priority. That is, the second digital resources in the topic library differ in their degree of correlation with the first digital resource: the larger s and ρ are, the higher the priority of that digital resource. The priority of a digital resource within the topic, calculated from s and ρ, is denoted prior. The priority here may be calculated with prior-art schemes, such as a weighted sum, the purpose being to sort the resources. The resulting topic is defined as specialTopic={news1, news2, …, newsT}, where newsi, i=1, …, T, denotes the digital resource ranked i-th by priority.
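A sketch of the priority ordering, using the weighted-sum scheme the text mentions as one prior-art option. The equal weights 0.5/0.5 are an illustrative assumption, not specified by the source:

```python
def order_by_priority(resources, w_s=0.5, w_rho=0.5):
    """resources: list of (name, s, rho) triples.
    prior(r) = w_s * s + w_rho * rho; sort descending to get specialTopic."""
    def prior(r):
        return w_s * r[1] + w_rho * r[2]
    return sorted(resources, key=prior, reverse=True)
```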
In addition, on the basis of the above, for digital resources of the same priority, in order to avoid duplicated digital resources, the text similarity between two second digital resources of the same priority may further be calculated. If that text similarity exceeds a preset threshold, e.g. 0.8, the two digital resources are marked as duplicates and one of them is removed. The text similarity here may be calculated with a prior-art scheme, for example by word matching. The text similarity calculation of Embodiment 2 above could also be used, but since that method is relatively complex, a simpler prior-art text similarity calculation achieves good results here.
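The deduplication step can be sketched as pairwise comparisons within one priority level. Any prior-art similarity plugs in; the word-overlap (Jaccard) measure used in the test below is one simple choice, not the source's method:

```python
from itertools import combinations

def mark_duplicates(same_priority_texts, similarity, eta=0.8):
    """Flag the later member of every pair whose similarity exceeds eta,
    keeping the first occurrence. Returns the set of flagged indices."""
    duplicates = set()
    for i, j in combinations(range(len(same_priority_texts)), 2):
        if similarity(same_priority_texts[i], same_priority_texts[j]) > eta:
            duplicates.add(j)
    return duplicates
```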
Embodiment 4
This embodiment provides a specific application example. User-oriented newspaper topic organization and generation mainly comprises two steps.
In the first step, the topic word vector of the topic is generated semantically from the set of news the user is interested in. The input of this step is the set D of news texts the user is interested in, and the output is the topic word vector topicWords of the topic; the detailed flow is shown in Fig. 1. After the user dictionary is loaded into the segmenter, coarse-grained word segmentation is performed on the news text set D. The semantics-based document topic model is LDA (Latent Dirichlet Allocation). After semantic deduplication with synonymyMap, the top 20% of topic words ranked by distribution probability are taken as the final topic words of the topic, as shown in Fig. 4.
Specifically, suppose the user selects a piece of news about the search and rescue work of the March 8 Malaysia Airlines incident, and a topic is organized and generated for it.
In the first step, the topic word vector of the topic is generated. After the user dictionary is loaded into the segmenter, coarse-grained word segmentation is performed on the news. The segmentation result is filtered with stopWords and stopSpeeches. The meaning words obtained after filtering are used to train the LDA model and compute the topic word probability distribution, yielding {sea area=0.0432, aircraft=0.0305, passenger plane=0.0029, Malaysia=0.0208, rescue=0.0203, navy=0.0183, search=0.0168, warship=0.0163, Malaysia Airlines=0.0158, …}. In synonymyMap, "Malaysia Airlines" and "Malaysia" share one semantic code, and "warship", "naval vessel", "naval vessels", etc., share another. After semantic deduplication with synonymyMap, the probability distribution becomes {sea area=0.0468, aircraft=0.0336, warship=0.0318, rescue=0.0289, search=0.0275, passenger plane=0.0029, ship=0.0224, Malaysia=0.0208, Malaysia Airlines=0.0204, …}, and the top 20% of topic words by distribution probability are taken as the topic words of "Malaysia Airlines search and rescue".
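The semantic deduplication above can be sketched as follows, assuming synonymyMap is a mapping from words to semantic codes: words sharing a code are merged, their weights are summed, and the highest-weight member represents the group (matching the combining unit of Embodiment 5). The warship figures in the test reproduce the merge 0.0163+0.0155=0.0318 as one plausible decomposition:

```python
def semantic_dedup(prob_dist, synonymy_map):
    """Merge words sharing a semantic code; unmapped words keep their own code."""
    groups = {}
    for word, weight in prob_dist.items():
        code = synonymy_map.get(word, word)
        groups.setdefault(code, []).append((word, weight))
    merged = {}
    for members in groups.values():
        representative = max(members, key=lambda m: m[1])[0]  # highest-weight word
        merged[representative] = sum(w for _, w in members)
    return merged
```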
In the second step, the topic is organized and generated by computing the similarity between each candidate news text in the digital newspaper resource library and the topic words. The input of this step is the digital newspaper resource library and the topic word vector topicWords of the topic, and the output is the topic of interest to the user. After the topic candidate set is chosen using the publication time of the news the user is interested in and the newspaper priority, each news item in the candidate set is traversed, computing its similarity s with the topic words and the topic word density ρ within its text; when s>ξ and ρ>δ, the news item is added to specialTopic. prior is calculated from s and ρ, and the news items are organized in descending order of prior. Pairwise similarity is then computed among the news texts sharing the same prior in specialTopic, and any two news items whose similarity exceeds η are marked as duplicated news, as shown in Fig. 5.
Continuing the above specific example, a topic is organized and generated for the news about the search and rescue work of the March 8 Malaysia Airlines incident selected by the user. In this step, the topic is organized and generated by computing the similarity between the news texts in the digital newspaper library and the topic words. According to the publication date of the news selected by the user, "March 10, 2014", all news of the important newspapers within a certain period before and after that date in the digital newspaper library is taken as the topic candidate set. For each news item in the candidate set, its similarity s with the topic words obtained in the first step is computed; for news with similarity greater than 0.3, the distribution density ρ of the topic words in its text is further computed, and when the distribution density is greater than 0.5 the news item is added to the "Malaysia Airlines search and rescue" topic. The news items in the topic are sorted in descending order of the prior calculated from s and ρ, and within the same prior any news items with similarity greater than 0.8 are marked. The resulting "Malaysia Airlines search and rescue" topic is shown in Fig. 6, where news items of the same priority shown in the same group indicate items marked as duplicates.
In this embodiment, the input is the set of news texts the user is interested in, and the output is a user-oriented personalized topic; this outperforms keyword-based retrieval, particularly for news topics that are difficult to describe with several keywords. Local feature words are selected by filtering function words with stopWords and stopSpeeches, without using a vector space model or named entity recognition, which enhances the robustness of the method. The topic word vector of the news topic is extracted by combining LDA with synonymyMap, which fully exploits the semantic information of the news and reduces the interference of polysemous words and synonyms on the topic word vector. The customized similarity calculation method allows the thresholds of different topics to be unified without establishing a global vector space model, meeting the personalized and diversified demands of user-oriented newspaper topics.
Embodiment 5
This embodiment provides a topic word extraction device, as shown in Fig. 7, comprising:
a word segmentation unit 11, which segments the text of the digital resource;
a segmentation result processing unit 12, which obtains meaning words from the segmentation result;
a probability distribution unit 13, which obtains, for each topic, the probability distribution of the meaning words, the probability distribution comprising the meaning words and their corresponding weights;
a combining unit 14, which obtains the sense of each meaning word and merges the meaning words having the same sense together with their corresponding weights;
a topic word determination unit 15, which determines the topic words from the merged meaning words and their weights. The merged meaning words are sorted by weight, and a preset number of the foremost meaning words are selected as topic words. The preset number is 10%-30% of the total, preferably 20%.
The combining unit 14 comprises:
a mapping subelement, which establishes the mapping between words and senses;
a sense obtaining subelement, which obtains the sense corresponding to each meaning word;
a sense lookup subelement, which finds the meaning words having the same sense;
a meaning word merging subelement, which merges the meaning words having the same sense into one meaning word, selecting the meaning word with the highest corresponding weight as the merged meaning word;
a weight calculation subelement, which accumulates the weights of the meaning words having the same sense as the weight of the merged meaning word.
The segmentation result processing unit 12 comprises:
a denoising subelement, which denoises the segmentation result using a stop word list and parts of speech to obtain a word sequence;
a word merging subelement, which merges identical words in the word sequence and takes the resulting words as meaning words.
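The two subelements of the segmentation result processing unit 12 can be sketched in a few lines. This assumes part-of-speech tags are supplied alongside the tokens (the source does not specify the tagger or tag set):

```python
def meaning_words(tokens, pos_tags, stop_words, stop_speeches):
    """Denoising subelement: drop stop words and stopped parts of speech.
    Word merging subelement: collapse repeated words, preserving first order."""
    kept = [t for t, pos in zip(tokens, pos_tags)
            if t not in stop_words and pos not in stop_speeches]
    seen, merged = set(), []
    for t in kept:
        if t not in seen:
            seen.add(t)
            merged.append(t)
    return merged
```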
Embodiment 6
In addition, this embodiment also provides a device for obtaining correlated digital resources, as shown in Fig. 8, comprising:
a topic word extraction unit 21, which extracts the topic words of the first digital resource;
a keyword determination unit 22, which obtains the keywords of the second digital resource and their weights;
a text similarity obtaining unit 23, which obtains the text similarity between the first digital resource and the second digital resource;
a semantic distribution density obtaining unit 24, which obtains the semantic distribution density of the topic words in the second digital resource;
a correlated resource determination unit 25, which judges whether the text similarity is greater than a text similarity threshold and whether the semantic distribution density is greater than a semantic distribution density threshold, and if both are, takes the second digital resource as a correlated digital resource of the first digital resource. The text similarity threshold is set to 0.2-0.4; and/or the semantic distribution density threshold is set to 0.4-0.6. Preferably, the text similarity threshold is set to 0.3; and/or the semantic distribution density threshold is set to 0.5.
The keyword determination unit 22 comprises:
a text segmentation subelement, which segments the text of the second digital resource;
a segmentation result denoising subelement, which denoises the segmentation result to obtain a word sequence;
a descending arrangement subelement, which arranges the words in the word sequence in descending order by the TF-IDF method;
a keyword obtaining subelement, which obtains the sense of each word, merges the words having the same sense, and takes the merged words as keywords.
The keyword vector is keyWords=(kterm1, kterm2, …, ktermQ), where ktermi, i=1, …, Q, denotes the i-th most important keyword and Q denotes the total number of keywords; the weight of ktermi is set to:
The text similarity obtaining unit 23 uses the text similarity calculation formula s = Σ_{i=1}^{M} w_i·p_i, where M is the total number of non-duplicate semantic words contained in the keywords of the second digital resource and the topic words of the first digital resource, w_i denotes the weight of the i-th non-duplicate semantic word in the second digital resource, and p_i denotes the distribution probability of the i-th non-duplicate semantic word in the topic words of the first digital resource.
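A minimal sketch of this similarity, under the assumption that it is the weighted dot product over the shared non-duplicate words, so that a word absent from either the keywords or the topic words contributes zero:

```python
def text_similarity(keyword_weights, topic_probs):
    """s = sum of w_i * p_i over words present on both sides:
    keyword_weights maps words of the second resource to weights w_i,
    topic_probs maps topic words of the first resource to probabilities p_i."""
    shared = keyword_weights.keys() & topic_probs.keys()
    return sum(keyword_weights[w] * topic_probs[w] for w in shared)
```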
The semantic distribution density obtaining unit 24 comprises:
a non-duplicate word determination subelement, which chooses the non-duplicate words jointly contained in the topic words of the first digital resource and the keywords of the second digital resource;
a weight sorting subelement, which sorts the words from high to low by their weights in the topic words of the first digital resource;
a choosing subelement, which selects a preset number of the foremost words as density concern words;
a same-semantic word obtaining subelement, which obtains the words having the same semantics as the density concern words;
a first occurrence position obtaining subelement, which obtains the position of the first occurring semantic word among the same-semantic words in the second digital resource;
a last occurrence position obtaining subelement, which obtains the position of the last occurring semantic word among the same-semantic words in the second digital resource;
a distance obtaining subelement, which obtains the distance between the first occurring semantic word and the last occurring semantic word;
a semantic distribution density calculation subelement, which takes the ratio of the distance to the length of the second digital resource as the semantic distribution density.
Embodiment 7
This embodiment provides a topic generation device, as shown in Fig. 9, comprising:
a first digital resource selection unit 31, which selects the first digital resource;
a second digital resource selection unit 32, which chooses candidate digital resources in turn as the second digital resource;
a topic generation unit 33, which obtains the second digital resources correlated with the first digital resource, traversing all second digital resources and taking those correlated with the first digital resource as the digital resources in the topic.
The device further comprises a priority calculation unit, which obtains the priority of the second digital resources correlated with the first digital resource and sorts the second digital resources by priority.
The device further comprises a deduplication unit, which calculates the text similarity between two second digital resources of the same priority and, if the text similarity exceeds a preset threshold, marks the two digital resources as duplicates and removes one of them.
It should be understood by those skilled in the art that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
Obviously, the above embodiments are merely examples given for clarity of description and are not intended to limit the embodiments. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments here, and the obvious variations or changes derived therefrom remain within the protection scope of the present invention.