CN105224521A - Key phrase extraction method, and method and device for obtaining correlated digital resources using the same - Google Patents


Info

Publication number
CN105224521A
Authority
CN
China
Prior art keywords
word
meaning
digital resource
semantic
descriptor
Prior art date
Legal status
Granted
Application number
CN201510627961.XA
Other languages
Chinese (zh)
Other versions
CN105224521B (en)
Inventor
许茜
叶茂
任彩红
徐剑波
汤帜
Current Assignee
New Founder Holdings Development Co ltd
Peking University
Founder Apabi Technology Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd and Beijing Founder Apabi Technology Co Ltd
Priority to CN201510627961.XA
Publication of CN105224521A
Application granted
Publication of CN105224521B
Legal status: Expired - Fee Related


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a key phrase extraction method, together with a method and a device that use it to obtain correlated digital resources. The key phrase extraction method comprises: first, performing word segmentation on the text of a digital resource and obtaining meaning words from the segmentation result; for each theme, obtaining the probability distribution of the meaning words, the probability distribution comprising each meaning word and its corresponding weight; obtaining the word senses of the meaning words and merging meaning words that have the same word sense together with their corresponding weights; and determining descriptors from the merged meaning words and their weights. Because words with the same word sense are merged from the perspective of word sense, the scheme avoids the interference that polysemous words and synonyms cause in prior-art key phrase extraction and improves extraction accuracy; it removes the prior-art dependence on the selection of feature words and on named entity recognition, weakens the interference that polysemous words and synonyms bring to the descriptor vector, and at the same time enables user-oriented, personalized special topic organization and generation.

Description

Key phrase extraction method, and method and device for obtaining correlated digital resources using the same
Technical field
The present invention relates to the field of digital resource processing, and in particular to a key phrase extraction method and to a method and device for obtaining correlated digital resources.
Background art
With the rapid development of the Internet, digital newspapers have become increasingly widespread, greatly enhancing the interaction between users and newspapers and periodicals and making personalized organization and generation of newspaper special topics possible. In addition, news reports are added across the country every day, most of them covering newly occurring events and carrying a large number of new words. Such "new words" are mainly words whose content and form are new - words that either did not exist in the lexical system before or whose sense is brand new.
To better describe these digital resources and to facilitate subsequent processing such as recommendation and retrieval of related special topics, descriptors need to be extracted from them. The prior art generally extracts the words of a digital resource after word segmentation and takes the most frequent words, after merging, as descriptors. However, a single word may carry several different senses, while different words may express the same meaning - for example, "mobile phone" and "cell phone" express the same meaning - which interferes with descriptor extraction. In addition, existing key phrase extraction methods generally require manually edited feature words or theme candidate word lists, determine descriptor candidate words with named entity techniques, and rely on a vector space model and named entity recognition. Such schemes are complex and require a large amount of computation.
The descriptors extracted above can be used for organizing and generating digital resources, for example news special topics. Organizing a news special topic means grouping related news items together to form a special topic. For example, when a newspaper reader is interested in a certain news event, he or she wishes to obtain more related reports quickly and conveniently from the massive news reports of many newspapers, improving the efficiency of information acquisition and the personalization of reading. For instance, when a user reads a report on foreign press views of the "March 1 Kunming Railway Station violent terrorist attack" and wants to quickly check other reports on foreign press views of the same event, the news item the user is reading is first selected, its descriptors are obtained by analysis, the keywords of the remaining news items are then compared with those descriptors, and the news items with high relevance are organized together to form a special topic. At present, special topics are mainly extracted in advance from the newspaper resource library using techniques such as vector space models, named entity recognition and text clustering, and are pushed to users for browsing. These methods depend strongly on the selection of feature words and on named entity recognition, so they perform poorly on newspaper texts in which new words occur frequently; they do not fully consider the semantic information of the news or the interference that polysemous words and synonyms bring to the descriptor vector, and they cannot organize and generate personalized special topics according to the reports the user is currently interested in.
Summary of the invention
Therefore, the technical problem to be solved by the present invention is to overcome the defects of prior-art key phrase extraction, namely that it is subject to interference from polysemous words and synonyms, requires manually edited feature words or descriptor candidate lists, and determines descriptor candidate words with named entity techniques, and thus to provide a key phrase extraction method and device.
Another technical problem to be solved by the present invention is to overcome the defect that prior-art special topic generation needs a vector space model and named entity recognition and has poor robustness, and thus to provide a method and device for obtaining correlated digital resources.
The invention provides a key phrase extraction method, comprising the following steps:
performing word segmentation on the text of a digital resource;
obtaining meaning words from the word segmentation result;
for each theme, obtaining the probability distribution of the meaning words, the probability distribution comprising each meaning word and its corresponding weight;
obtaining the word senses of the meaning words, and merging meaning words that have the same word sense together with their corresponding weights;
determining descriptors from the merged meaning words and their weights.
In addition, the invention provides a method for obtaining correlated digital resources, comprising the following steps:
extracting the descriptors of a first digital resource;
obtaining the keywords of a second digital resource and their weights;
obtaining the text similarity between the first digital resource and the second digital resource;
obtaining the semantic distribution density of the descriptors in the second digital resource;
judging whether the text similarity is greater than a text similarity threshold and whether the semantic distribution density is greater than a semantic distribution density threshold, and if both are, taking the second digital resource as a correlated digital resource of the first digital resource.
In addition, the present invention also provides a key phrase extraction device, comprising:
a word segmentation unit, which performs word segmentation on the text of a digital resource;
a word segmentation result processing unit, which obtains meaning words from the word segmentation result;
a probability distribution unit, which, for each theme, obtains the probability distribution of the meaning words, the probability distribution comprising each meaning word and its corresponding weight;
a merging unit, which obtains the word senses of the meaning words and merges meaning words that have the same word sense together with their corresponding weights;
a descriptor determining unit, which determines descriptors from the merged meaning words and their weights.
In addition, the present invention also provides a device for obtaining correlated digital resources, comprising:
a key phrase extraction unit, which extracts the descriptors of a first digital resource;
a keyword determining unit, which obtains the keywords of a second digital resource and their weights;
a text similarity acquiring unit, which obtains the text similarity between the first digital resource and the second digital resource;
a semantic distribution density acquiring unit, which obtains the semantic distribution density of the descriptors in the second digital resource;
a related resource determining unit, which judges whether the text similarity is greater than a text similarity threshold and whether the semantic distribution density is greater than a semantic distribution density threshold, and if both are, takes the second digital resource as a correlated digital resource of the first digital resource.
The technical solution of the present invention has the following advantages:
1. The invention provides a key phrase extraction method: first, word segmentation is performed on the text of a digital resource and meaning words are obtained from the segmentation result; for each theme, the probability distribution of the meaning words is obtained, comprising each meaning word and its corresponding weight; the word senses of the meaning words are obtained and meaning words with the same word sense are merged together with their weights; and descriptors are determined from the merged meaning words and their weights. By merging words with the same word sense from the perspective of word sense, the scheme avoids the interference of polysemous words and synonyms in prior-art key phrase extraction and improves its accuracy. In addition, the scheme needs neither manually edited feature words or descriptor candidate lists nor named entity techniques for determining descriptor candidate words. Local feature words are selected by filtering out function words, and no vector space model or named entity recognition is used, which enhances the robustness of the key phrase extraction method.
2. In the key phrase extraction method of the present invention, a mapping between words and word senses is established in advance; the word sense corresponding to each meaning word can be obtained from this mapping, meaning words with the same word sense are merged and their weights are accumulated, the merged meaning words are sorted in descending order of weight, and a predetermined number of the top-ranked meaning words - for example the top 20% - are selected as descriptors. Merging meaning words with the same word sense improves the accuracy of the keywords, and selecting the top 20% of the meaning words essentially covers the important information of the digital resource while reducing the subsequent data processing load.
3. The present invention also provides a method for obtaining correlated digital resources: first, the descriptors of a first digital resource are extracted; then the keywords of a second digital resource and their weights are obtained, the text similarity between the first and second digital resources is obtained, and the semantic distribution density of the descriptors in the second digital resource is obtained; when the text similarity is greater than a text similarity threshold and the semantic distribution density is greater than a semantic distribution density threshold, the second digital resource is taken as a correlated digital resource of the first digital resource. In this scheme, whether two digital resources are related is measured from two aspects, text similarity and semantic distribution density: the text similarity indicates the degree to which the two texts describe the same subject, and the semantic distribution density represents how evenly the descriptors of the first digital resource are distributed in the second digital resource. The correlation between digital resources can thus be quantified by these two values, and correlated digital resources can be obtained accurately.
Brief description of the drawings
In order to describe the specific embodiments of the present invention or the prior-art technical solutions more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the key phrase extraction method in Embodiment 1 of the present invention;
Fig. 2 is a flow chart of the method for obtaining correlated digital resources in Embodiment 2 of the present invention;
Fig. 3 is a flow chart of the special topic generation method in Embodiment 3 of the present invention;
Fig. 4 is a flow chart of generating the descriptor vector of a special topic in Embodiment 4 of the present invention;
Fig. 5 is a flow chart of generating a special topic in Embodiment 4 of the present invention;
Fig. 6 is a schematic diagram of the special topic list in Embodiment 4 of the present invention;
Fig. 7 is a schematic diagram of the key phrase extraction device in Embodiment 5 of the present invention;
Fig. 8 is a schematic diagram of the device for obtaining correlated digital resources in Embodiment 6 of the present invention;
Fig. 9 is a schematic diagram of the special topic generating device in Embodiment 7 of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second" and "third" are used only for descriptive purposes and are not to be understood as indicating or implying relative importance. In addition, the technical features involved in the different embodiments described below can be combined with one another as long as they do not conflict.
Embodiment 1
This embodiment provides a key phrase extraction method for extracting the descriptors of a digital resource. The digital resource here may be a single document or several documents; after the digital resource is selected, descriptors are extracted for it. As shown in Fig. 1, the method comprises the following steps:
S11: performing word segmentation on the text of the digital resource.
After the digital resource is selected, the set of selected digital resources is denoted D = {d_1, d_2, ..., d_m}, where d_i (i = 1, ..., m) denotes the i-th news text and m may be 1. A user dictionary is loaded and word segmentation is performed on each news text. The user dictionary is a set of words made up of idioms, abbreviations and new words; its purpose is to add domain-specific terms, such as idioms, abbreviations and new words, so as to improve the precision of the segmenter. It is defined as userLib = {e_1, e_2, ..., e_r}, where e_i (i = 1, ..., r) denotes a word or phrase.
In this step, segmentation can be completed by a mature off-the-shelf segmenter; the user dictionary helps the segmenter cut the text reasonably and improves segmentation precision. Through word segmentation the digital resource is divided into a sequence of phrases and words.
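A minimal Python sketch of step S11, assuming the open-source jieba segmenter (the embodiment does not prescribe a particular segmenter) and a plain-text user dictionary standing in for userLib:

    import jieba
    import jieba.posseg as pseg

    jieba.load_userdict("userdict.txt")   # idioms, abbreviations and new words, one per line

    def segment(text):
        # return (word, part-of-speech) pairs in text order, duplicates preserved
        return [(pair.word, pair.flag) for pair in pseg.lcut(text)]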
S12: obtaining meaning words from the word segmentation result.
The segmentation result contains all the words in the digital resource. Some of them, such as modal particles and auxiliary words, have no concrete meaning, and there are also punctuation marks and other words that carry no specific information; all of these need to be removed. A stop-word list and a stop part-of-speech set are established in advance. The stop-word list is the set of punctuation marks and of meaningless words typical of journalistic writing, defined as stopWords = {w_1, w_2, ..., w_s}, where w_i (i = 1, ..., s) denotes a word, punctuation mark or phrase. The stop part-of-speech set consists of function parts of speech, defined as stopSpeeches = {s_1, s_2, ..., s_t}, where s_i (i = 1, ..., t) denotes a function part of speech such as modal particles or auxiliary words. Here, local feature words are selected by filtering function words with stopWords and stopSpeeches; no vector space model or named entity recognition is used, which strengthens the robustness of the key phrase extraction method. This step comprises the following processing:
First, the stop-word list and stop parts of speech are used to denoise the segmentation result and obtain a term sequence. Punctuation and meaningless words in the stop-word list are removed from the segmentation result, as are function words, leaving a sequence of words defined as seqTerms = {term_1, term_2, ..., term_o}, where term_i (i = 1, ..., o) denotes the i-th meaning word. In this term sequence the words are arranged in text order, and repeated words are retained in order of appearance.
Then, identical words in the term sequence are merged, and the resulting words are the meaning words. For the term sequences obtained in the previous step, identical words across all of D are merged to form the meaning-word set of D, defined as V = {v_1, v_2, ..., v_n}, where v_i (i = 1, ..., n) denotes the i-th meaning word in V.
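The denoising and merging of step S12 can be sketched as follows; stop_words and stop_pos are illustrative stand-ins for stopWords and stopSpeeches (the part-of-speech tags shown are jieba-style and only an assumption):

    stop_words = {"的", "了", "，", "。", "…"}
    stop_pos = {"u", "y", "x", "w"}   # auxiliaries, modal particles, non-words, punctuation

    def meaning_words(tagged_tokens):
        # seqTerms: meaning words in text order, duplicates kept
        seq_terms = [w for w, pos in tagged_tokens
                     if w not in stop_words and pos not in stop_pos]
        # V: the set obtained by merging identical words (insertion order preserved)
        vocabulary = list(dict.fromkeys(seq_terms))
        return seq_terms, vocabulary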
S13: for each theme, obtaining the probability distribution of the meaning words, the probability distribution comprising each meaning word and its corresponding weight.
A document topic generation model is used to compute the theme probability distribution of the meaning words in V. Each digital resource may belong to several different themes, with a different probability distribution for each; what is computed here is the probability distribution of the meaning words in V for one selected theme.
The document topic generation model can be any prior-art scheme. LDA (Latent Dirichlet Allocation), for example, is a document topic generation model, also called a three-layer Bayesian probability model, with a word-theme-document structure. In a generation model, each word of an article is considered to be produced by the process "choose a theme with a certain probability, then choose a word from that theme with a certain probability": documents follow a multinomial distribution over themes, and themes follow a multinomial distribution over words. LDA is an unsupervised machine learning technique that can identify hidden topic information in large document collections or corpora. It uses the bag-of-words approach, treating each document as a word-frequency vector and thereby turning text into numerical information that is easy to model. The bag-of-words approach ignores word order, which simplifies the problem and also leaves room for improving the model. Each document represents a probability distribution over several themes, and each theme represents a probability distribution over many words.
Therefore, the document topic generation model yields, for the selected theme, the probability distribution of the meaning words in V. Sorting these probabilities in descending order gives the theme's descending-probability term vector termFreq = (fterm_1, fterm_2, ..., fterm_p), where fterm_i (i = 1, ..., p) denotes the meaning word with the i-th highest probability, each meaning word having a corresponding probability weight.
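As an illustration of step S13, the topic-word distribution can be obtained from any off-the-shelf LDA implementation; the sketch below uses scikit-learn's LatentDirichletAllocation purely as an example, which the patent does not prescribe:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def term_freq_for_topic(docs, topic_id, n_topics=10):
        # docs: one space-joined string of meaning words per document in D
        vec = CountVectorizer(analyzer=lambda s: s.split())
        counts = vec.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
        weights = lda.components_[topic_id]
        weights = weights / weights.sum()              # normalise to a probability distribution
        pairs = zip(vec.get_feature_names_out(), weights)
        return sorted(pairs, key=lambda p: -p[1])      # termFreq: meaning words by descending probability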
S14: obtaining the word senses of the meaning words and merging meaning words that have the same word sense together with their corresponding weights. The process is as follows (an illustrative code sketch follows these steps):
First, a mapping between words and word senses is established. Let W = {w_i, i = 1, ..., u} be the set of polysemous words and M = {m_j, j = 1, ..., v} the set of word-sense codes. The word-to-sense mapping generated from a Chinese synonym thesaurus is defined as synonymyMap = {(x, Y) | x ∈ W, Y ⊆ M}, meaning that a word x corresponds to one or more senses: its corresponding sense set is Y, and each element of Y is one sense of x. For example, for the word "mobile phone", the corresponding sense set is {mobile phone, cell phone}.
Second, the word senses corresponding to the meaning words are obtained. For every meaning word in termFreq, its corresponding sense set can be looked up.
Third, meaning words with the same word sense are found. The sense sets of two meaning words are compared to see whether they contain the same sense code; if they do, the two meaning words share a sense and the next step is executed; otherwise nothing is done.
Fourth, meaning words with the same word sense are merged into a single meaning word. The merged meaning word can be chosen as the one with the highest weight among the meaning words that share the sense.
Fifth, the weights of the meaning words that share the sense are accumulated and taken as the weight of the merged meaning word.
Through this process the merged meaning words and their corresponding weights are obtained.
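A sketch of the merging process of step S14, assuming synonymy_map maps each word to its set of sense codes (the synonymyMap above) and term_freq is the (meaning word, weight) list sorted by descending weight, so the first word kept for a sense is automatically the highest-weight representative:

    def merge_by_sense(term_freq, synonymy_map):
        merged = {}   # representative meaning word -> [accumulated weight, union of sense codes]
        for word, weight in term_freq:
            senses = set(synonymy_map.get(word, {word}))
            rep = next((r for r, (_, codes) in merged.items() if codes & senses), None)
            if rep is None:
                merged[word] = [weight, senses]    # new sense: this word becomes the representative
            else:
                merged[rep][0] += weight           # same sense: accumulate the weight
                merged[rep][1] |= senses
        return {r: w for r, (w, _) in merged.items()}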
S15: determining the descriptors from the merged meaning words and their weights.
The merged meaning words are sorted in descending order of weight, and a predetermined number of the top-ranked meaning words are selected as descriptors. Generally the predetermined number is 10%-30% of the total, preferably 20%. The top 20% of the meaning words essentially cover the subject of the digital resource while reducing the subsequent computation. The descriptor vector obtained by semantically deduplicating termFreq with synonymyMap and taking the top θ meaning words is defined as topicWords = (tterm_1, tterm_2, ..., tterm_q), where tterm_i (i = 1, ..., q; q < p) denotes the descriptor with the i-th highest semantic weight, and its corresponding distribution probability is defined as p_i.
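Step S15 then reduces to ranking the merged weights and keeping the top share (20% by default in this sketch):

    def top_descriptors(merged_weights, share=0.2):
        ranked = sorted(merged_weights.items(), key=lambda kv: -kv[1])
        q = max(1, int(len(ranked) * share))
        return ranked[:q]        # topicWords: (descriptor, weight) pairs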
In the above scheme of this embodiment, words with the same word sense are merged from the perspective of word sense, which avoids the interference of polysemous words and synonyms in prior-art key phrase extraction and improves its accuracy. In addition, the scheme needs neither manually edited feature words or descriptor candidate lists nor named entity techniques for determining descriptor candidate words. Local feature words are selected by filtering function words with stopWords and stopSpeeches, and no vector space model or named entity recognition is used, which strengthens the robustness of the key phrase extraction method.
In a further scheme, the mapping between words and word senses is established in advance; the senses corresponding to a meaning word can be obtained from this mapping, meaning words containing the same word sense are merged and their weights accumulated, the merged meaning words are sorted in descending order of weight, and a predetermined number of the top-ranked meaning words - for example the top 20% - are selected as descriptors. Merging meaning words with the same word sense improves the accuracy of the keywords, and selecting the top 20% of the meaning words essentially covers the important information of the digital resource and reduces the subsequent data processing load.
Embodiment 2
This embodiment provides a method for obtaining correlated digital resources, used to find, among massive digital resources, those related to a selected digital resource. First a first digital resource is selected; it may be a single digital resource or several digital resources belonging to one theme, and the purpose of this embodiment is to find the other digital resources related to it. As shown in Fig. 2, the method comprises the following steps:
S21: extracting the descriptors of the first digital resource using the method of Embodiment 1. After the first digital resource is selected, its descriptors are extracted with the method of Embodiment 1, which is not repeated here; that method yields the descriptor vector of the first digital resource, topicWords = (tterm_1, tterm_2, ..., tterm_q), where tterm_i (i = 1, ..., q; q < p) denotes the descriptor with the i-th highest semantic weight, and its corresponding distribution probability is defined as p_i.
S22: obtaining the keywords of the second digital resource and their weights. The second digital resource is the digital resource to be judged for relevance to the first digital resource; it may be any digital resource other than the first digital resource. The keywords of the second digital resource and their weights are obtained as follows (a code sketch follows these steps):
First, word segmentation is performed on the text of the second digital resource, in the same way as in Embodiment 1.
Second, the segmentation result is denoised to obtain a term sequence. As in Embodiment 1, the stop-word list and stop parts of speech are used to denoise the segmentation result and obtain the term sequence seqTerms, in which words appear in text order and repeated words are retained in order of appearance.
Third, the words in the term sequence are sorted in descending order by the TF-IDF method.
TF-IDF is a prior-art statistical method for assessing how important a word is to a document within a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. The main idea of TF-IDF is that if a word or phrase appears frequently in one article but rarely in other articles, it has good discriminative power and is suitable for classification: a high term frequency in a specific document combined with a low document frequency across the whole collection produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
Processing with the TF-IDF method yields the important words and their weights, and these words are sorted in descending order of weight.
Fourth, the word senses of the retained words are obtained, words with the same word sense are merged, and the merged words are taken as the keywords.
Words with the same word sense are merged in the same way as in Embodiment 1, by deduplication through the synonymyMap set. The keyword vector obtained by sorting the meaning words in seqTerms in descending TF-IDF order and deduplicating with synonymyMap is keyWords = (kterm_1, kterm_2, ..., kterm_Q), where kterm_i (i = 1, ..., Q) denotes the i-th most important keyword and Q is the total number of keywords. The weight of kterm_i is set to w_i = 1 - (i - 1)/(2Q).
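A Python sketch of step S22, using scikit-learn's TfidfVectorizer as one possible TF-IDF implementation (not prescribed by the embodiment) and reusing the merge_by_sense sketch from step S14:

    from sklearn.feature_extraction.text import TfidfVectorizer

    def keywords_with_weights(seq_terms, corpus, synonymy_map):
        # corpus: space-joined meaning-word strings, one per document, used only to fit IDF
        vec = TfidfVectorizer(analyzer=lambda s: s.split())
        vec.fit(corpus)
        doc_vec = vec.transform([" ".join(seq_terms)]).toarray()[0]
        scores = dict(zip(vec.get_feature_names_out(), doc_vec))
        ranked = sorted({w for w in seq_terms if scores.get(w, 0) > 0},
                        key=lambda w: -scores[w])
        merged = merge_by_sense([(w, scores[w]) for w in ranked], synonymy_map)
        key_words = sorted(merged, key=lambda w: -merged[w])
        Q = max(1, len(key_words))
        # 1-based weight formula from the text: w_i = 1 - (i - 1) / (2 * Q)
        return [(w, 1 - i / (2 * Q)) for i, w in enumerate(key_words)]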
S23: obtaining the text similarity between the first digital resource and the second digital resource.
The text similarity s is computed from the M non-duplicate semantic words contained in both the keywords of the second digital resource and the descriptors of the first digital resource, where w_i denotes the weight of the i-th non-duplicate semantic word in the second digital resource and p_i denotes its distribution probability among the descriptors of the first digital resource.
Although the prior art offers various ways of computing text similarity, the method of this embodiment gives better results.
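The similarity formula itself appears only as an embedded image in the source text. One plausible reading, consistent with the quantities defined above, is the weighted sum s = Σ w_i · p_i over the M shared non-duplicate semantic words; the sketch below implements that assumed form:

    def text_similarity(shared_words, keyword_weights, descriptor_probs):
        # shared_words: the M non-duplicate semantic words contained in both the second
        # resource's keywords and the first resource's descriptors
        return sum(keyword_weights[w] * descriptor_probs[w] for w in shared_words)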
S24: obtaining the semantic distribution density of the descriptors in the second digital resource.
The semantic distribution density ρ is computed as follows (an illustrative code sketch follows these steps):
Step 1: choose the non-duplicate words contained in both the descriptors of the first digital resource and the keywords of the second digital resource.
Step 2: sort them from high to low by the weight of each word in the descriptors of the first digital resource.
Step 3: select a predetermined number of the top-ranked words as density focus words; three words may be selected here, or another number as required.
Step 4: obtain the same-sense words of the density focus words. Each selected density focus word corresponds to several same-sense words with the same or similar sense, which are obtained in the same way as in the embodiment above.
Step 5: obtain the position, in the second digital resource, of the same-sense word that appears first; that is, find the earliest occurrence among all the same-sense words.
Step 6: obtain the position, in the second digital resource, of the same-sense word that appears last; that is, find the latest occurrence among all the same-sense words.
Step 7: obtain the distance between the first-appearing and last-appearing same-sense words, counted in characters or in words.
Step 8: take the ratio of this distance to the length of the second digital resource (also counted in characters or words) as the semantic distribution density. This ratio represents how evenly the descriptors of the first digital resource are distributed in the second digital resource, and together with the text similarity it quantifies the correlation between the digital resources.
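The eight steps above can be sketched as follows; positions are counted in characters, and density_focus_words are the top-ranked shared words (e.g. three) chosen in steps 1-3:

    def semantic_density(second_text, density_focus_words, synonymy_map):
        # gather every same-sense variant of every density focus word
        variants = set(density_focus_words)
        for w in density_focus_words:
            w_codes = set(synonymy_map.get(w, {w}))
            variants |= {v for v, codes in synonymy_map.items() if set(codes) & w_codes}
        hits = [v for v in variants if v in second_text]
        if not hits:
            return 0.0
        first = min(second_text.find(v) for v in hits)    # earliest occurrence
        last = max(second_text.rfind(v) for v in hits)    # latest occurrence
        # ratio of the spanned distance to the full text length
        return (last - first) / max(1, len(second_text))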
S25: judging whether the text similarity is greater than the text similarity threshold and whether the semantic distribution density is greater than the semantic distribution density threshold; if both are, the second digital resource is taken as a correlated digital resource of the first digital resource.
Usually the text similarity threshold is set to 0.2-0.4 and the semantic distribution density threshold to 0.4-0.6. Preferably the text similarity threshold is ξ = 0.3 and the semantic distribution density threshold is δ = 0.5; when s > ξ and ρ > δ, the second digital resource is taken as a correlated digital resource of the first digital resource.
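The decision of step S25, with the preferred thresholds ξ = 0.3 and δ = 0.5, is then a simple comparison:

    def is_related(similarity, density, xi=0.3, delta=0.5):
        return similarity > xi and density > delta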
In the scheme of this embodiment, whether two digital resources are related is measured from two aspects, text similarity and semantic distribution density: the text similarity indicates the degree to which the two texts describe the same subject, and the semantic distribution density represents how evenly the descriptors of the first digital resource are distributed in the second digital resource. The correlation between digital resources is quantified by these two values, so correlated digital resources are obtained accurately; this can be used for recommending correlated digital resources, building special topic libraries and similar applications.
Embodiment 3
This embodiment provides a special topic generation method which, given a document the user has read and is interested in, finds in a resource library the documents that belong to the same special topic and pushes the special topic to the user, improving the user experience. As shown in Fig. 3, the method comprises the following steps:
S31: selecting the first digital resource. A digital resource the user is interested in or pays attention to may be selected here, or digital resources the user has already read. This step selects the reference information: the first digital resource is the reference for the subsequent processing.
S32: choosing one candidate digital resource at a time as the second digital resource. One digital resource is selected from the candidate resource library as the second digital resource for the subsequent processing.
S33: using the method of Embodiment 2 to decide whether the second digital resource is related to the first digital resource: if s > ξ and ρ > δ, the second digital resource is taken as a correlated digital resource of the first digital resource; otherwise it is not. By traversing all the second digital resources selected in S32 in turn, all second digital resources in the candidate library that are related to the first digital resource are obtained as the digital resources of the special topic.
The scheme of this embodiment can be used to obtain, from the content the user is currently reading, the digital resources the user cares about: the descriptor vector of the news the user is interested in is extracted semantically from the news text, and topic relevance is used to organize and generate a personalized special topic from the digital newspaper resource library. Starting from the report the user is currently reading, text processing extracts the descriptor vector of the report of interest based on semantics, related reports are then retrieved from the digital newspaper resource library according to the descriptor vector, and the strength and distribution of the descriptors and their relevance are used to organize them into a personalized newspaper special topic, making it convenient for the user to quickly obtain the reports of interest. The scheme removes the prior-art dependence on the selection of feature words and on named entity recognition, weakens the interference that polysemous words and synonyms bring to the descriptor vector, needs neither manually edited feature words or descriptor candidate lists nor named entity techniques for determining descriptor candidate words, and realizes user-oriented personalized special topic organization and generation.
In a further embodiment, the method also obtains the priority of each second digital resource related to the first digital resource and sorts the second digital resources by priority. That is, the second digital resources in the special topic differ in their degree of correlation with the first digital resource: the larger s and ρ are, the higher the priority of the digital resource. The priority of a digital resource within the special topic, computed from s and ρ, is denoted prior; it can be computed with a prior-art scheme, such as a weighted sum, and its purpose is to sort the resources. The resulting special topic is defined as specialTopic = {news1, news2, ..., newsT}, where newsi (i = 1, ..., T) denotes the digital resource with the i-th highest priority.
In addition, on this basis, to avoid duplicated digital resources among those with the same priority, the text similarity between every two second digital resources with the same priority can be computed further; if it is greater than a predetermined threshold, for example 0.8, the two digital resources are marked as duplicates and one of them is removed. The text similarity here can be computed with a prior-art scheme, for example by word matching. The text similarity method of Embodiment 2 could also be used, but it is relatively complex, and a simple prior-art similarity computation already gives good results here.
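Putting Embodiment 3 together, a sketch of special topic assembly might look as follows; the priority formula is not fixed by the patent, so an equally weighted sum of s and ρ is used purely as an illustration, and for brevity duplicates are filtered against every kept item rather than only within the same priority:

    def build_special_topic(first_resource, candidates, relevance_fn, simple_similarity,
                            xi=0.3, delta=0.5, dup_threshold=0.8):
        # relevance_fn returns (s, rho) for a candidate, as in Embodiment 2;
        # simple_similarity is any prior-art pairwise text similarity used for deduplication
        topic = []
        for cand in candidates:
            s, rho = relevance_fn(first_resource, cand)
            if s > xi and rho > delta:
                prior = 0.5 * s + 0.5 * rho        # illustrative weighted-sum priority
                topic.append((prior, cand))
        topic.sort(key=lambda t: -t[0])
        kept = []
        for prior, cand in topic:
            if all(simple_similarity(cand, other) <= dup_threshold for _, other in kept):
                kept.append((prior, cand))
        return kept                                 # specialTopic, ordered by prior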
Embodiment 4
This embodiment gives a concrete application example. User-oriented newspaper special topic organization and generation mainly comprises two steps.
In the first step, the set of news the user is interested in is used to generate the semantic descriptor vector of the special topic. The input of this step is the set D of news texts the user is interested in, and the output is the special topic's descriptor vector topicWords; the detailed flow is shown in Fig. 1. After the user dictionary is loaded into the segmenter, coarse-grained word segmentation is performed on the news text set D. The semantic document topic model adopted is LDA (Latent Dirichlet Allocation). After semantic deduplication with synonymyMap, the top 20% of the distribution probability ranking is taken as the final special topic descriptors, as shown in Fig. 4.
Concretely, take as an example a news item selected by the user about the search-and-rescue work of the "March 8 Malaysia Airlines incident", for which a special topic is to be organized and generated.
In this first step the special topic's descriptor vector is generated. After the user dictionary is loaded into the segmenter, coarse-grained word segmentation is performed on this news item, and the segmentation result is filtered with stopWords and stopSpeeches. The meaning words obtained after filtering are used to train the LDA model and compute the descriptor probability distribution, giving {sea area = 0.0432, aircraft = 0.0305, passenger plane = 0.0029, Malaysia = 0.0208, rescue = 0.0203, navy = 0.0183, search = 0.0168, warship = 0.0163, Malaysia Airlines = 0.0158, ...}. In synonymyMap, "Malaysia Airlines" and "Malaysia", and "warship", "naval vessel", "vessels" and the like, have the same semantic codes respectively, so after semantic deduplication with synonymyMap the probability distribution becomes {sea area = 0.0468, aircraft = 0.0336, warship = 0.0318, rescue = 0.0289, search = 0.0275, passenger plane = 0.0029, ship = 0.0224, Malaysia = 0.0208, Malaysia Airlines = 0.0204, ...}, and the top 20% of the distribution probability ranking is taken as the descriptors of "Malaysia Airlines search and rescue".
In the second step, the special topic is organized and generated by computing the similarity between each candidate news text in the digital newspaper resource library and the descriptors. The input of this step is the digital newspaper resource library and the special topic's descriptor vector topicWords, and the output is the special topic the user is interested in. After the publication time of the news the user is interested in and the newspaper priority are used to choose the special topic candidate set, the candidate set is traversed, computing for each news item the similarity s to the special topic descriptors and the descriptor density ρ in the news text; when s > ξ and ρ > δ, the news item is added to specialTopic. The priority prior is computed from s and ρ, and the news items are organized in descending order of prior. Pairwise similarity is then computed among the news texts with the same prior in specialTopic, and two news items whose similarity exceeds η are marked as duplicates, as shown in Fig. 5.
Continuing the concrete example, a special topic is organized and generated for the news item the user selected about the search-and-rescue work of the "March 8 Malaysia Airlines incident". In this step the special topic is organized and generated by computing the similarity between the news texts in the digital newspaper library and the descriptors. Based on the publication date of the news the user selected, "March 10, 2014", all the news of the major newspapers within a certain period before and after that date in the digital newspaper library is taken as the special topic candidate set. For each news item in the candidate set, its similarity s to the descriptors obtained in the first step is computed; for news with similarity greater than 0.3, the distribution density ρ of the descriptors in its text is further computed, and when the distribution density is greater than 0.5 the news item is added to the "Malaysia Airlines search and rescue" special topic. The news items in the special topic are sorted from high to low by the prior computed from s and ρ, and news items within the same prior whose pairwise similarity exceeds 0.8 are marked. The resulting "Malaysia Airlines search and rescue" special topic is shown in Fig. 6, where the news items grouped under the same priority include those marked as duplicates.
In this embodiment the input is the set of news texts the user is interested in, so the special topic organization and generation is user-oriented and personalized, which is better than keyword-based retrieval, particularly when the news topic is hard to describe with a few keywords. Local feature words are selected by filtering function words with stopWords and stopSpeeches, and no vector space model or named entity recognition is used, which enhances the robustness of the method. Extracting the news special topic descriptor vector with LDA combined with synonymyMap takes full account of the semantic information of the news and reduces the interference that polysemous words and synonyms bring to the descriptor vector. The self-defined similarity computation allows a unified threshold across different special topics without building a global vector space model, and meets the personalized and diversified special topic needs of newspaper users.
Embodiment 5
This embodiment provides a key phrase extraction device, as shown in Fig. 7, comprising:
a word segmentation unit 11, which performs word segmentation on the text of a digital resource;
a word segmentation result processing unit 12, which obtains meaning words from the word segmentation result;
a probability distribution unit 13, which, for each theme, obtains the probability distribution of the meaning words, the probability distribution comprising each meaning word and its corresponding weight;
a merging unit 14, which obtains the word senses of the meaning words and merges meaning words that have the same word sense together with their corresponding weights;
a descriptor determining unit 15, which determines descriptors from the merged meaning words and their weights: the merged meaning words are sorted by weight and a predetermined number of the top-ranked meaning words are selected as descriptors; the predetermined number is 10%-30% of the total, preferably 20%.
The merging unit 14 comprises:
a mapping subunit, which establishes the mapping between words and word senses;
a word sense obtaining subunit, which obtains the word senses corresponding to the meaning words;
a word sense searching subunit, which finds meaning words that have the same word sense;
a meaning word merging subunit, which merges meaning words with the same word sense into a single meaning word, selecting the meaning word with the highest weight as the merged meaning word;
a weight calculation subunit, which accumulates the weights of the meaning words with the same word sense as the weight of the merged meaning word.
The word segmentation result processing unit 12 comprises:
a denoising subunit, which uses the stop-word list and stop parts of speech to denoise the word segmentation result and obtain a term sequence;
a word merging subunit, which merges identical words in the term sequence and takes the resulting words as meaning words.
Embodiment 6
In addition, this embodiment also provides a device for obtaining correlated digital resources, as shown in Fig. 8, comprising:
a key phrase extraction unit 21, which extracts the descriptors of a first digital resource;
a keyword determining unit 22, which obtains the keywords of a second digital resource and their weights;
a text similarity acquiring unit 23, which obtains the text similarity between the first digital resource and the second digital resource;
a semantic distribution density acquiring unit 24, which obtains the semantic distribution density of the descriptors in the second digital resource;
a related resource determining unit 25, which judges whether the text similarity is greater than a text similarity threshold and whether the semantic distribution density is greater than a semantic distribution density threshold, and if both are, takes the second digital resource as a correlated digital resource of the first digital resource. The text similarity threshold is set to 0.2-0.4 and/or the semantic distribution density threshold to 0.4-0.6; preferably the text similarity threshold is set to 0.3 and/or the semantic distribution density threshold to 0.5.
The keyword determining unit 22 comprises:
a text word segmentation subunit, which performs word segmentation on the text of the second digital resource;
a word segmentation result denoising subunit, which denoises the word segmentation result to obtain a term sequence;
a descending sorting subunit, which sorts the words in the term sequence in descending order by the TF-IDF method;
a keyword obtaining subunit, which obtains the word senses of the words, merges words with the same word sense, and takes the merged words as keywords.
The keyword vector is keyWords = (kterm_1, kterm_2, ..., kterm_Q), where kterm_i (i = 1, ..., Q) denotes the i-th most important keyword and Q is the total number of keywords; the weight of kterm_i is set to w_i = 1 - (i - 1)/(2Q).
The text similarity acquiring unit 23 computes the text similarity from the M non-duplicate semantic words contained in both the keywords of the second digital resource and the descriptors of the first digital resource, where w_i denotes the weight of the i-th non-duplicate semantic word in the second digital resource and p_i denotes its distribution probability among the descriptors of the first digital resource.
The semantic distribution density acquiring unit 24 comprises:
a non-duplicate word determining subunit, which chooses the non-duplicate words contained in both the descriptors of the first digital resource and the keywords of the second digital resource;
a weight sorting subunit, which sorts them from high to low by the weight of each word in the descriptors of the first digital resource;
a choosing subunit, which selects a predetermined number of the top-ranked words as density focus words;
a same-sense word obtaining subunit, which obtains the same-sense words of the density focus words;
a first-occurrence position acquiring subunit, which obtains the position in the second digital resource of the same-sense word that appears first;
a last-occurrence position acquiring subunit, which obtains the position in the second digital resource of the same-sense word that appears last;
a distance obtaining subunit, which obtains the distance between the first-appearing and last-appearing same-sense words;
a semantic distribution density computing subunit, which takes the ratio of this distance to the length of the second digital resource as the semantic distribution density.
Embodiment 7
This embodiment provides a special topic generating device, as shown in Fig. 9, comprising:
a first digital resource selecting unit 31, which selects the first digital resource;
a second digital resource selecting unit 32, which chooses one candidate digital resource at a time as the second digital resource;
a special topic generation unit 33, which obtains the second digital resources related to the first digital resource, traverses all second digital resources, and takes those related to the first digital resource as the digital resources of the special topic.
The device may further comprise a priority calculation unit, which obtains the priority of each second digital resource related to the first digital resource and sorts the second digital resources by priority;
and a deduplication unit, which computes the text similarity between every two second digital resources with the same priority and, if the similarity is greater than a predetermined threshold, marks the two digital resources as duplicates and removes one of them.
Those skilled in the art will appreciate that the embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
Obviously, the above embodiments are merely examples given for clarity of description and are not a limitation on the embodiments. Those of ordinary skill in the art can make other changes in different forms on the basis of the above description; it is unnecessary and impossible to exhaust all embodiments here. Any obvious change or variation derived therefrom remains within the protection scope of the present invention.

Claims (15)

1. A key phrase extraction method, characterized by comprising the steps of:
performing word segmentation on the text of a digital resource;
obtaining meaning words from the word segmentation result;
for each theme, obtaining the probability distribution of the meaning words, the probability distribution comprising each meaning word and its corresponding weight;
obtaining the word senses of the meaning words, and merging meaning words that have the same word sense together with their corresponding weights;
determining descriptors from the merged meaning words and their weights.
2. The method according to claim 1, characterized in that the step of obtaining the word senses of the meaning words and merging meaning words that have the same word sense together with their corresponding weights comprises:
establishing a mapping between words and word senses;
obtaining the word senses corresponding to the meaning words;
finding meaning words that have the same word sense;
merging the meaning words that have the same word sense into a single meaning word;
accumulating the weights of the meaning words that have the same word sense as the weight of the merged meaning word.
3. The method according to claim 2, characterized in that, in the step of merging the meaning words that have the same word sense into a single meaning word, the meaning word with the highest weight among the meaning words that have the same word sense is selected as the merged meaning word.
4. The method according to claim 1, 2 or 3, characterized in that the step of determining descriptors from the merged meaning words and their weights comprises:
sorting the merged meaning words by weight and selecting a predetermined number of the top-ranked meaning words as descriptors.
5. The method according to claim 4, characterized in that the predetermined number is 10%-30% of the total.
6. The method according to claim 5, characterized in that the predetermined number is 20% of the total.
7. The method according to any one of claims 1-4, characterized in that the step of obtaining meaning words from the word segmentation result comprises:
using a stop-word list and stop parts of speech to denoise the word segmentation result and obtain a term sequence;
merging identical words in the term sequence and taking the resulting words as meaning words.
8. The method according to any one of claims 1-5, characterized in that, in the step of obtaining, for each theme, the probability distribution of the meaning words, a document topic generation model is used to compute the probability distribution of the meaning words.
9. A method for obtaining related digital resources, characterized in that it comprises the steps of:
extracting topic words of a first digital resource using the method of any one of claims 1-8;
obtaining keywords of a second digital resource and their weights;
obtaining the text similarity between the first digital resource and the second digital resource;
obtaining the semantic distribution density of the topic words in the second digital resource;
determining whether the text similarity is greater than a text similarity threshold and whether the semantic distribution density is greater than a semantic distribution density threshold, and, if both are, taking the second digital resource as a related digital resource of the first digital resource.
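The decision rule of claim 9 can be sketched as a simple double-threshold filter over candidate resources; the threshold values and the injected helper callables below are illustrative assumptions, since the claim fixes neither.

```python
def find_related(first, candidates, extract_topic_words, extract_keywords,
                 text_similarity, semantic_density,
                 sim_threshold=0.5, density_threshold=0.3):
    """Hypothetical sketch of claim 9: both measures must exceed their thresholds."""
    topics = extract_topic_words(first)
    related = []
    for second in candidates:
        keywords = extract_keywords(second)              # keywords and weights
        sim = text_similarity(topics, keywords)
        density = semantic_density(topics, second)
        if sim > sim_threshold and density > density_threshold:
            related.append(second)                       # claim-9 decision rule
    return related
```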
10. The method according to claim 9, characterized in that the step of obtaining keywords of the second digital resource and their weights comprises:
performing word segmentation on the text of the second digital resource;
denoising the word segmentation result to obtain a word sequence;
sorting the words in the word sequence in descending order of their TF-IDF values;
obtaining each word sense of the words, merging words having the same word sense, and using the merged words as keywords.
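A minimal sketch of claim 10 is given below: TF-IDF is computed directly from term counts against a reference corpus, same-sense words are merged, and the top-ranked entries are kept as keywords. The smoothing of the IDF term, the word_to_sense dictionary, and the cut-off top_n are illustrative assumptions.

```python
# Hypothetical sketch of claim 10: rank denoised words by TF-IDF, then
# merge same-sense words to obtain the keywords of the second resource.
import math
from collections import Counter

def tfidf_keywords(doc_words, corpus_docs, word_to_sense, top_n=20):
    tf = Counter(doc_words)
    n_docs = len(corpus_docs)
    keywords = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus_docs if word in doc)
        idf = math.log((n_docs + 1) / (df + 1)) + 1           # smoothed IDF
        sense = word_to_sense.get(word, word)                 # merge same-sense words
        keywords[sense] = keywords.get(sense, 0.0) + (count / len(doc_words)) * idf
    ranked = sorted(keywords.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```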
11. The method according to claim 9 or 10, characterized in that, in the step of obtaining keywords of the second digital resource and their weights,
the keyword vector is keyWords = (kterm_1, kterm_2, ..., kterm_Q), where kterm_i, i = 1, ..., Q, denotes the i-th most important keyword and Q is the total number of keywords;
the weight of kterm_i is set to
12. The method according to any one of claims 9-11, characterized in that the step of obtaining the text similarity between the first digital resource and the second digital resource comprises:
computing the text similarity from M, w_i and p_i, where M is the total number of non-duplicate semantic words contained in the keywords of the second digital resource and the topic words of the first digital resource, w_i is the weight of the i-th non-duplicate semantic word in the second digital resource, and p_i is the distribution probability of the i-th non-duplicate semantic word among the topic words of the first digital resource.
13. The method according to any one of claims 9-11, characterized in that the step of obtaining the semantic distribution density of the topic words in the second digital resource comprises:
selecting the non-duplicate words contained in both the topic words of the first digital resource and the keywords of the second digital resource;
sorting them from high to low by their weights in the topic words of the first digital resource;
selecting a predetermined number of top-ranked words as density focus words;
obtaining the same-semantic words of the density focus words;
obtaining the position, in the second digital resource, of the first-occurring word among the same-semantic words;
obtaining the position, in the second digital resource, of the last-occurring word among the same-semantic words;
obtaining the distance between the first-occurring and last-occurring same-semantic words;
taking the ratio of that distance to the length of the second digital resource as the semantic distribution density.
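A short sketch of claim 13 follows: the density is the span between the first and last occurrence of any same-semantic word of the density focus words, divided by the document length. Token-level positions, the word_to_sense dictionary, and the top_n cut-off are assumptions made for illustration.

```python
# Hypothetical sketch of claim 13: span of the density focus words' same-semantic
# words in the second resource, relative to the document length.
def semantic_distribution_density(common_words, topic_weights, word_to_sense,
                                  second_doc_tokens, top_n=5):
    # top-weighted shared words become the density focus words
    focus = sorted(common_words, key=lambda w: topic_weights.get(w, 0.0),
                   reverse=True)[:top_n]
    focus_senses = {word_to_sense.get(w, w) for w in focus}
    positions = [i for i, tok in enumerate(second_doc_tokens)
                 if word_to_sense.get(tok, tok) in focus_senses]
    if not positions or not second_doc_tokens:
        return 0.0
    distance = max(positions) - min(positions)     # first to last occurrence
    return distance / len(second_doc_tokens)       # ratio to document length
```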
14. A topic word extraction device, characterized in that it comprises:
a word segmentation unit, which performs word segmentation on the text of a digital resource;
a word segmentation result processing unit, which obtains sense words from the word segmentation result;
a probability distribution unit, which, for each topic, obtains a probability distribution of the sense words, the probability distribution comprising the sense words and their corresponding weights;
a merging unit, which obtains each word sense of the sense words and merges sense words having the same word sense together with their corresponding weights;
a topic word determining unit, which determines topic words according to the merged sense words and their weights.
15. A device for obtaining related digital resources, characterized in that it comprises:
a topic word extraction unit, which extracts topic words of a first digital resource;
a keyword determining unit, which obtains keywords of a second digital resource and their weights;
a text similarity obtaining unit, which obtains the text similarity between the first digital resource and the second digital resource;
a semantic distribution density obtaining unit, which obtains the semantic distribution density of the topic words in the second digital resource;
a related resource determining unit, which determines whether the text similarity is greater than a text similarity threshold and whether the semantic distribution density is greater than a semantic distribution density threshold, and, if both are, takes the second digital resource as a related digital resource of the first digital resource.
CN201510627961.XA 2015-09-28 2015-09-28 Key phrases extraction method and the method and device using its acquisition correlated digital resource Expired - Fee Related CN105224521B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510627961.XA CN105224521B (en) 2015-09-28 2015-09-28 Key phrases extraction method and the method and device using its acquisition correlated digital resource

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510627961.XA CN105224521B (en) 2015-09-28 2015-09-28 Key phrases extraction method and the method and device using its acquisition correlated digital resource

Publications (2)

Publication Number Publication Date
CN105224521A true CN105224521A (en) 2016-01-06
CN105224521B CN105224521B (en) 2018-05-25

Family

ID=54993499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510627961.XA Expired - Fee Related CN105224521B (en) 2015-09-28 2015-09-28 Key phrases extraction method and the method and device using its acquisition correlated digital resource

Country Status (1)

Country Link
CN (1) CN105224521B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0364179A2 (en) * 1988-10-11 1990-04-18 NeXT COMPUTER, INC. Method and apparatus for extracting keywords from text
US6470307B1 (en) * 1997-06-23 2002-10-22 National Research Council Of Canada Method and apparatus for automatically identifying keywords within a document
CN101067808A (en) * 2007-05-24 2007-11-07 上海大学 Text key word extracting method
CN101727487A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Network criticism oriented viewpoint subject identifying method and system
CN102033919A (en) * 2010-12-07 2011-04-27 北京新媒传信科技有限公司 Method and system for extracting text key words
CN103164454A (en) * 2011-12-15 2013-06-19 百度在线网络技术(北京)有限公司 Keyword grouping method and keyword grouping system
CN103425710A (en) * 2012-05-25 2013-12-04 北京百度网讯科技有限公司 Subject-based searching method and device
CN103455487A (en) * 2012-05-29 2013-12-18 腾讯科技(深圳)有限公司 Extracting method and device for search term
CN104615593A (en) * 2013-11-01 2015-05-13 北大方正集团有限公司 Method and device for automatic detection of microblog hot topics
CN103559310A (en) * 2013-11-18 2014-02-05 广东利为网络科技有限公司 Method for extracting key word from article
CN103744837A (en) * 2014-01-23 2014-04-23 北京优捷信达信息科技有限公司 Multi-text comparison method based on keyword extraction

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488033B (en) * 2016-01-26 2018-01-02 中国人民解放军国防科学技术大学 Associate the preprocess method and device calculated
CN105488033A (en) * 2016-01-26 2016-04-13 中国人民解放军国防科学技术大学 Preprocessing method and device for correlation calculation
CN106202056B (en) * 2016-07-26 2019-01-04 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106202056A (en) * 2016-07-26 2016-12-07 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106446109A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 Acquiring method and device for audio file abstract
CN108021546A (en) * 2016-11-03 2018-05-11 北京嘀嘀无限科技发展有限公司 A kind of short essay eigen extended method, device and server
CN106708803A (en) * 2016-12-21 2017-05-24 东软集团股份有限公司 Feature extraction method and device
CN106649783A (en) * 2016-12-28 2017-05-10 上海智臻智能网络科技股份有限公司 Synonym mining method and apparatus
CN107247701A (en) * 2017-05-04 2017-10-13 厦门快商通科技股份有限公司 Subject Clustering model construction system and its construction method for corpus
CN109685085A (en) * 2017-10-18 2019-04-26 阿里巴巴集团控股有限公司 A kind of master map extracting method and device
CN109685085B (en) * 2017-10-18 2023-09-26 阿里巴巴集团控股有限公司 Main graph extraction method and device
CN108052520A (en) * 2017-11-01 2018-05-18 平安科技(深圳)有限公司 Conjunctive word analysis method, electronic device and storage medium based on topic model
CN110020153B (en) * 2017-11-30 2022-02-25 北京搜狗科技发展有限公司 Searching method and device
CN110020153A (en) * 2017-11-30 2019-07-16 北京搜狗科技发展有限公司 A kind of searching method and device
CN110555127A (en) * 2018-03-30 2019-12-10 优酷网络技术(北京)有限公司 Multimedia content generation method and device
CN108829799A (en) * 2018-06-05 2018-11-16 中国人民公安大学 Based on the Text similarity computing method and system for improving LDA topic model
CN109214000A (en) * 2018-08-23 2019-01-15 昆明理工大学 A kind of neural network card language entity recognition method based on topic model term vector
CN110705275A (en) * 2019-09-18 2020-01-17 东软集团股份有限公司 Theme word extraction method and device, storage medium and electronic equipment
CN112000817A (en) * 2020-08-21 2020-11-27 北京达佳互联信息技术有限公司 Multimedia resource processing method and device, electronic equipment and storage medium
CN112000817B (en) * 2020-08-21 2023-12-29 北京达佳互联信息技术有限公司 Multimedia resource processing method and device, electronic equipment and storage medium
CN112036485A (en) * 2020-08-31 2020-12-04 平安科技(深圳)有限公司 Method and device for topic classification and computer equipment
CN112036485B (en) * 2020-08-31 2023-10-24 平安科技(深圳)有限公司 Method, device and computer equipment for classifying topics
CN112100500A (en) * 2020-09-23 2020-12-18 高小翎 Example learning-driven content-associated website discovery method
CN113220999A (en) * 2021-05-14 2021-08-06 北京百度网讯科技有限公司 User feature generation method and device, electronic equipment and storage medium
CN113033200A (en) * 2021-05-27 2021-06-25 北京世纪好未来教育科技有限公司 Data processing method, text recognition model generation method and text recognition method

Also Published As

Publication number Publication date
CN105224521B (en) 2018-05-25

Similar Documents

Publication Publication Date Title
CN105224521A (en) Key phrases extraction method and use its method obtaining correlated digital resource and device
Devika et al. Sentiment analysis: a comparative study on different approaches
US11573996B2 (en) System and method for hierarchically organizing documents based on document portions
Ljubešić et al. hrWaC and slWaC: Compiling web corpora for Croatian and Slovene
CN106021272B (en) The keyword extraction method calculated based on distributed expression term vector
CN108536677A (en) A kind of patent text similarity calculating method
Ratinov et al. Learning-based multi-sieve co-reference resolution with knowledge
KR101040119B1 (en) Apparatus and Method for Search of Contents
CN107562919B (en) Multi-index integrated software component retrieval method and system based on information retrieval
CN105354182A (en) Method for obtaining related digital resources and method and apparatus for generating special topic by using method
CN108875065B (en) Indonesia news webpage recommendation method based on content
Kurniawan et al. Indonesian twitter sentiment analysis using Word2Vec
CN110728135B (en) Text theme indexing method and device, electronic equipment and computer storage medium
Dawar et al. Comparing topic modeling and named entity recognition techniques for the semantic indexing of a landscape architecture textbook
Mouratidis et al. Domain-specific term extraction: a case study on Greek Maritime legal texts
Sundermann et al. Exploration of word embedding model to improve context-aware recommender systems
Phan et al. Applying skip-gram word estimation and SVM-based classification for opinion mining Vietnamese food places text reviews
Ali et al. Arabic keyphrases extraction using a hybrid of statistical and machine learning methods
Zhao et al. Expanding approach to information retrieval using semantic similarity analysis based on WordNet and Wikipedia
Zheng et al. An adaptive LDA optimal topic number selection method in news topic identification
Sabty et al. Techniques for named entity recognition on arabic-english code-mixed data
CN103150371A (en) Confusion removal text retrieval method based on positive and negative training
CN113239201A (en) Scientific and technological literature classification method based on knowledge graph
Kruspe A simple method for domain adaptation of sentence embeddings
Belliardo et al. Leave no Place Behind: Improved Geolocation in Humanitarian Documents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220624

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: FOUNDER APABI TECHNOLOGY Ltd.

Patentee after: Peking University

Address before: 100871, Beijing, Haidian District Cheng Fu Road 298, founder building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: FOUNDER APABI TECHNOLOGY Ltd.

Patentee before: Peking University

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180525