Topic word extraction method, and method and device for obtaining related digital resources using the same
Technical field
The present invention relates to the field of digital resource processing, and in particular to a topic word extraction method and to a method and device for obtaining related digital resources.
Background technology
With the rapid development of the Internet, digital newspapers have become increasingly popular, which greatly enhances the interactivity between users and newspapers and makes the personalized organization and generation of newspaper special topics possible. In addition, news reports are added every day across the country, mostly covering the newest events and containing a large number of neologisms. "Neologisms" mainly refers to words whose content or form is new, words that did not previously exist in the lexical system, or existing words that now carry a brand-new meaning.
In order to better describe these digital resources and facilitate subsequent processing such as the recommendation and retrieval of related special topics, topic words need to be extracted from them. The prior art generally segments the digital resource, merges the resulting vocabulary, and takes the most frequently occurring words as topic words. However, a single word may carry several different meanings, while different words may express the same meaning (for example, "mobile phone" and "handset" mean the same thing), which interferes with topic word extraction. In addition, existing topic word extraction methods generally require manually edited feature words or a list of candidate topic words, and determine topic word candidates with named entity techniques, relying on a vector space model and named entity recognition. Such procedures are complex and require a large amount of data computation.
The topic words extracted above can be used for the organization and generation of digital resources such as news special topics. Organizing and generating a news special topic means assembling related news items into a single topic. For example, when a newspaper reader is interested in a certain media event, he or she wishes to conveniently and efficiently obtain more related reports from the massive news coverage of multiple newspapers, improving both the efficiency of information acquisition and the personalization of reading. For example, while a user is reading a foreign-press report on the "March 1 Kunming railway station terrorist attack" and wants to quickly view other foreign-press reports on the same event, the system first takes the news item the user has chosen to read and extracts its topic words by analysis; it then compares the keywords of the remaining news items with those topic words, and organizes the highly correlated items together to form a special topic. At present, special topics are mainly extracted in advance from a newspaper resource library using techniques such as vector space models, named entity recognition and text clustering, and are pushed to users for browsing. Such methods depend strongly on the selection of feature words and on named entity recognition, and are therefore less effective on newspaper texts in which neologisms occur frequently. They also fail to fully account for the semantic information of the news and for the interference that polysemous words and synonyms cause to the topic word vector, and they cannot organize and generate a personalized special topic according to the report the user is currently interested in.
The content of the invention
Therefore, the technical problem to be solved by the present invention is to overcome the defects of prior-art topic word extraction, which is subject to interference from polysemous words and synonyms and which requires manually edited feature words or a candidate topic word list, together with named entity techniques, to determine topic word candidates, thereby providing a topic word extraction method and device.
Another technical problem the invention solves is to overcome the defects of prior-art special topic generation, which requires a vector space model and named entity recognition and has poor robustness, thereby providing a method and device for obtaining related digital resources.
The present invention provides a topic word extraction method, including the following steps:
segmenting the text of a digital resource;
obtaining meaning words from the segmentation result;
for each topic, obtaining a probability distribution of the meaning words, the probability distribution including the meaning words and their corresponding weights;
obtaining each sense of the meaning words, and merging meaning words having the same sense together with their corresponding weights;
determining topic words according to the merged meaning words and their weights.
In addition, the present invention provides a method for obtaining related digital resources, including the following steps:
extracting the topic words of a first digital resource;
obtaining the keywords of a second digital resource and their weights;
obtaining the text similarity between the first digital resource and the second digital resource;
obtaining the semantic distribution density of the topic words in the second digital resource;
judging whether the text similarity exceeds a text similarity threshold and whether the semantic distribution density exceeds a semantic distribution density threshold, and if both do, taking the second digital resource as a related digital resource of the first digital resource.
In addition, the present invention also provides a topic word extraction device, including:
a segmentation unit, which segments the text of a digital resource;
a segmentation result processing unit, which obtains meaning words from the segmentation result;
a probability distribution unit, which, for each topic, obtains a probability distribution of the meaning words, the probability distribution including the meaning words and their corresponding weights;
a merging unit, which obtains each sense of the meaning words and merges meaning words having the same sense together with their corresponding weights;
a topic word determination unit, which determines topic words according to the merged meaning words and their weights.
In addition, the present invention also provides a device for obtaining related digital resources, including:
a topic word extraction unit, which extracts the topic words of a first digital resource;
a keyword determination unit, which obtains the keywords of a second digital resource and their weights;
a text similarity acquiring unit, which obtains the text similarity between the first digital resource and the second digital resource;
a semantic distribution density acquiring unit, which obtains the semantic distribution density of the topic words in the second digital resource;
a related resource determination unit, which judges whether the text similarity exceeds a text similarity threshold and whether the semantic distribution density exceeds a semantic distribution density threshold, and if both do, takes the second digital resource as a related digital resource of the first digital resource.
The technical solutions of the present invention have the following advantages:
1. The present invention provides a topic word extraction method. First, the text of a digital resource is segmented, and meaning words are obtained from the segmentation result. For each topic, a probability distribution of the meaning words is obtained, including the meaning words and their corresponding weights. Each sense of the meaning words is obtained, meaning words with the same sense are merged together with their corresponding weights, and topic words are determined according to the merged meaning words and their weights. By merging words with the same sense, this scheme approaches the problem from the angle of word senses, avoiding the interference that polysemous words and synonyms cause to prior-art topic word extraction and improving extraction accuracy. In addition, the scheme needs neither manually edited feature words nor a candidate topic word list, nor named entity techniques for determining topic word candidates. Local feature words are selected by filtering out functional words, without using a vector space model or named entity recognition, which enhances the robustness of the topic word extraction method.
2. In the topic word extraction method of the present invention, the mapping between words and word senses is established in advance. Through this correspondence the senses of each meaning word are obtained, meaning words with the same sense are merged and their weights accumulated, the merged meaning words are sorted by descending weight, and a preset number of the top-ranked meaning words are selected as topic words, for example the top 20% as keywords. Merging meaning words with the same sense improves keyword accuracy, and selecting the top 20% of meaning words essentially covers the important information of the digital resource while reducing the subsequent data processing load.
3. The present invention also provides a method for obtaining related digital resources. First, the topic words of a first digital resource are extracted; then the keywords of a second digital resource and their weights are obtained, the text similarity between the first and second digital resources is obtained, and the semantic distribution density of the topic words in the second digital resource is obtained. When the text similarity exceeds a text similarity threshold and the semantic distribution density exceeds a semantic distribution density threshold, the second digital resource is taken as a related digital resource of the first. In this scheme, the relatedness of two digital resources is measured from two aspects: the text similarity indicates the degree to which the two texts describe the same subject, while the semantic distribution density represents how evenly the topic words of the first digital resource are distributed in the second. These two values quantify the degree of correlation between digital resources, so that accurately related digital resources are obtained.
Description of the drawings
To explain the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the topic word extraction method in Embodiment 1 of the present invention;
Fig. 2 is a flow chart of the method for obtaining related digital resources in Embodiment 2 of the present invention;
Fig. 3 is a flow chart of the special topic generation method in Embodiment 3 of the present invention;
Fig. 4 is a flow chart of generating the topic term vector of a special topic in Embodiment 4 of the present invention;
Fig. 5 is a flow chart of generating a special topic in Embodiment 4 of the present invention;
Fig. 6 is a schematic diagram of a special topic list in Embodiment 4 of the present invention;
Fig. 7 is a schematic diagram of the topic word extraction device in Embodiment 5 of the present invention;
Fig. 8 is a schematic diagram of the device for obtaining related digital resources in Embodiment 6 of the present invention;
Fig. 9 is a schematic diagram of the special topic generating device in Embodiment 7 of the present invention.
Specific embodiment
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second" and "third" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, the technical features of the different embodiments of the invention described below can be combined with each other as long as they do not conflict.
Embodiment 1
This embodiment provides a topic word extraction method for extracting the topic words of a digital resource. The digital resource here can be one file or several files; after a digital resource is selected, topic words are extracted for it. The flow chart of the method is shown in Fig. 1 and includes the following steps:
S11, segmenting the text of the digital resource.
After the digital resource is selected, the set of selected digital resources is defined as D={d1,d2,…,dm}, where di, i=1,…,m, represents the i-th news text, and m may be 1. A user dictionary is loaded and each news text is segmented. The user dictionary is a word set composed of idioms, abbreviations and neologisms; its role is to add domain-specific jargon, such as idioms, abbreviations and neologisms, so as to improve the precision of the segmenter. It is defined as userLib={e1,e2,…,er}, where ei, i=1,…,r, represents a word or phrase.
In this step, segmentation can be completed with a mature prior-art segmenter; the user dictionary helps segment the text reasonably and improves segmentation precision. Through segmentation, the digital resource is divided into a series of phrases and words.
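As an illustration of step S11, the following is a minimal sketch of dictionary-driven segmentation using forward maximum matching. The function name, the sample userLib contents and the matching strategy are assumptions made for this example; in practice a mature prior-art segmenter loaded with the user dictionary would be used.

```python
def segment(text, user_dict):
    """Greedy forward-maximum-match segmentation over a user dictionary.

    Tries the longest dictionary entry starting at the current position
    first, and falls back to a single character when nothing matches.
    """
    max_len = max(len(w) for w in user_dict)
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in user_dict:
                tokens.append(piece)
                i += length
                break
    return tokens

# Illustrative userLib: a neologism and a common term.
user_lib = {"数字报", "新闻"}
print(segment("数字报新闻", user_lib))
```

Adding "数字报" (digital newspaper) to the dictionary keeps it from being split character by character, which is exactly the precision gain the user dictionary provides.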
S12, obtaining meaning words from the segmentation result.
The segmentation result contains all the words in the digital resource, some of which carry no specific meaning, such as modal particles and auxiliary words, as well as punctuation and other words carrying no specific information; all of these need to be removed. A stop word list is pre-established and stop parts of speech are set. The stop word list is a word set composed of punctuation marks and words that are meaningless in journalistic style, defined as stopWords={w1,w2,…,ws}, where wi, i=1,…,s, represents a word, punctuation mark or phrase. The stop parts of speech form a set of functional parts of speech, defined as stopSpeeches={s1,s2,…,st}, where si, i=1,…,t, represents a functional part of speech, such as modal particles or auxiliary words. Here, local feature words are selected by filtering functional words with stopWords and stopSpeeches, without using a vector space model or named entity recognition, which enhances the robustness of the topic word extraction method. The step includes the following process:
First, the segmentation result is denoised with the stop word list and stop parts of speech to obtain a word sequence. The punctuation and meaningless words in the stop word list are removed from the segmentation result, as are functional words, yielding a series of words. The generated word sequence is defined as seqTerms={term1,term2,…,termo}, where termi, i=1,…,o, represents the i-th meaning word. In this word sequence, the words are arranged in textual order, and repeated words are retained in the sequence in their order of appearance.
Then, identical words in the word sequence are merged, and the resulting words are taken as meaning words. For the word sequence of the previous step, identical elements in seqTerms are merged to form the meaning word set; the meaning words of all resources in D form the meaning word set of D, defined as V={v1,v2,…,vn}, where vi, i=1,…,n, represents the i-th meaning word in V.
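The two stages of S12 can be sketched as below. The function name and the sample stop sets are illustrative assumptions; the input is taken to be part-of-speech-tagged tokens, since filtering by stopSpeeches requires part-of-speech information from the segmenter.

```python
def meaning_words(tagged_tokens, stop_words, stop_speeches):
    """Step S12 sketch: denoise tagged tokens into seqTerms, then merge
    duplicates (keeping first occurrence) to form the meaning word set V."""
    # Stage 1: drop stop words/punctuation and functional parts of speech.
    seq_terms = [w for w, pos in tagged_tokens
                 if w not in stop_words and pos not in stop_speeches]
    # Stage 2: merge identical words while preserving textual order.
    seen, v = set(), []
    for term in seq_terms:
        if term not in seen:
            seen.add(term)
            v.append(term)
    return seq_terms, v

tagged = [("violence", "n"), ("，", "w"), ("case", "n"), ("了", "y"), ("case", "n")]
seq, v = meaning_words(tagged, stop_words={"，"}, stop_speeches={"y"})
```

Note that seqTerms keeps the repetition of "case" in textual order, while V merges it away, matching the description above.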
S13, for each topic, obtaining the probability distribution of the meaning words, the probability distribution including the meaning words and their corresponding weights.
A document topic model is used to calculate the topic probability distribution of the meaning words in V. Each digital resource can belong to several different topics, but its probability distribution differs from topic to topic; here, a document topic model is used to calculate the probability distribution of all the meaning words in V for the selected topic.
The document topic model is realized with a prior-art scheme. For example, LDA (Latent Dirichlet Allocation) is a document topic model, also called a three-layer Bayesian probability model, with a three-layer structure of words, topics and documents. In such a generative model, each word of an article is considered to be obtained through the process of "selecting some topic with a certain probability, then selecting some word from that topic with a certain probability". Both the document-to-topic and topic-to-word distributions are multinomial. LDA is an unsupervised machine learning technique that can be used to identify latent topic information in a large-scale document collection or corpus. It adopts the bag-of-words approach, which treats each document as a word frequency vector, thereby converting text information into numerical information that is easy to model. The bag-of-words approach does not consider word order, which simplifies the problem while also leaving room for improving the model. Each document represents a probability distribution over some topics, and each topic represents a probability distribution over many words.
Therefore, the document topic model can calculate the probability distribution with which the meaning words in V belong to the selected topic. These probabilities are arranged in descending order, and the descending topic term vector of a given topic is termFreq=(fterm1,fterm2,…,ftermp), where ftermi, i=1,…,p, represents the meaning word with the i-th highest probability, each meaning word corresponding to one probability weight.
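Once a topic model such as LDA has produced per-word weights for a topic, building the descending termFreq vector is a simple sort and normalization, sketched below. The function name is an assumption, and the word weights are supplied directly rather than inferred by an actual LDA run, which is beyond the scope of this sketch.

```python
def topic_term_vector(word_weights):
    """Arrange a topic's meaning-word weights as the descending termFreq
    vector of step S13: (word, probability) pairs, highest probability
    first; ties are broken alphabetically for determinism."""
    total = sum(word_weights.values())
    probs = {w: c / total for w, c in word_weights.items()}
    return sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))

# Illustrative per-topic weights, e.g. expected word counts from a topic model.
term_freq = topic_term_vector({"event": 2, "station": 1, "report": 1})
```

In practice these weights would come from the topic-word distribution estimated by the document topic model for the selected topic.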
S14, obtaining each sense of the meaning words, and merging meaning words having the same sense together with their corresponding weights. The process is as follows:
First, the mapping between words and word senses is established. Let W={wi, i=1,…,u} be the set of polysemous words and M={mj, j=1,…,v} be the set of sense codes. The word-to-sense mapping generated from a Chinese thesaurus is defined as synonymyMap: for a word x with several senses, the corresponding sense set is Y, each element of Y corresponding to one sense of the word x. For example, for the word "mobile phone", the corresponding sense set is {mobile phone, handset, telephone}.
Second, the senses of each meaning word are obtained. For each meaning word in termFreq, its corresponding sense set is obtained.
Third, meaning words with the same sense are found. The sense sets are compared to see whether the sense sets of two meaning words contain an identical sense code. If an identical sense code exists, the two meaning words share a sense, and the next step is performed; otherwise no operation is performed.
Fourth, the meaning words with the same sense are merged into one meaning word. The merged meaning word can be chosen as the highest-weight member among the meaning words sharing the sense.
Fifth, the weights of the meaning words with the same sense are accumulated as the weight of the merged meaning word.
Through the above process, the merged meaning words and their corresponding weights are obtained.
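The five sub-steps of S14 can be sketched as one pass over termFreq. The function name, the cluster representation and the sample sense codes are assumptions for this example; synonymyMap is modeled as a dict from word to its set of sense codes.

```python
def merge_senses(term_freq, sense_map):
    """Step S14 sketch: merge meaning words that share any sense code.

    The representative of a merged group is its highest-weight member,
    and the weights of all members are accumulated.
    """
    # Each cluster: [representative, sense-code set, total weight, rep weight].
    merged = []
    for word, weight in term_freq:
        senses = set(sense_map.get(word, {word}))  # unknown word keeps itself
        for cluster in merged:
            if cluster[1] & senses:          # shares a sense code -> merge
                cluster[1] |= senses
                cluster[2] += weight
                if weight > cluster[3]:      # keep highest-weight member
                    cluster[0], cluster[3] = word, weight
                break
        else:
            merged.append([word, senses, weight, weight])
    return [(c[0], c[2]) for c in merged]

# "mobile phone" and "handset" share the (hypothetical) sense code "Bo21".
term_freq = [("mobile phone", 4), ("handset", 3), ("station", 2)]
sense_map = {"mobile phone": {"Bo21"}, "handset": {"Bo21"}, "station": {"Dm04"}}
```

Because termFreq is already sorted by descending weight, the first member of each group is its representative, but the sketch handles unsorted input as well.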
S15, determining topic words according to the merged meaning words and their weights.
The merged meaning words are sorted by descending weight, and a preset number of the top-ranked meaning words are selected as topic words. Generally, the preset number is 10%-30% of the total; preferably, the preset number is 20% of the total. The top 20% of meaning words essentially cover the subject of the digital resource while reducing the subsequent computation load. The topic term vector obtained by applying synonymyMap to termFreq for semantic deduplication and selecting the top θ meaning words is defined as topicWords=(tterm1,tterm2,…,ttermq), where ttermi, i=1,…,q (q<p), represents the topic word with the i-th highest semantic weight, and its corresponding distribution probability is defined as pi.
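Selecting the top fraction θ in step S15 can be sketched as follows; the function name and the rounding choice (ceiling, with at least one word kept) are assumptions, since the disclosure only fixes the fraction.

```python
import math

def select_topic_words(merged_terms, theta=0.2):
    """Step S15 sketch: sort merged meaning words by descending weight and
    keep the top theta fraction (preferably 20%) as topicWords."""
    ranked = sorted(merged_terms, key=lambda kv: -kv[1])
    k = max(1, math.ceil(len(ranked) * theta))  # keep at least one word
    return ranked[:k]

terms = [("a", 1), ("b", 5), ("c", 3), ("d", 2), ("e", 4)]
```

With five merged terms, θ=0.2 keeps only the single heaviest word, and θ=0.4 keeps the top two.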
In the above scheme of this embodiment, words with the same sense are merged from the angle of word senses, which avoids the interference of polysemous words and synonyms in prior-art topic word extraction and improves extraction accuracy. In addition, the scheme needs neither manually edited feature words nor a candidate topic word list, nor named entity techniques for determining topic word candidates. Local feature words are selected by filtering functional words with stopWords and stopSpeeches, without using a vector space model or named entity recognition, which enhances the robustness of the topic word extraction method.
In a further embodiment, the mapping between words and word senses is established in advance. Through this correspondence the several senses of each meaning word are obtained, meaning words containing the same sense are merged and their weights accumulated, the merged meaning words are sorted by descending weight, and a preset number of the top-ranked meaning words are selected as topic words, for example the top 20% as keywords. Merging meaning words with the same sense improves keyword accuracy, and selecting the top 20% of meaning words essentially covers the important information of the digital resource while reducing the subsequent data processing load.
Embodiment 2
This embodiment provides a method for obtaining, from a mass of digital resources, the digital resources related to a selected digital resource. First, a first digital resource is selected; it may be a single digital resource or several digital resources belonging to one topic. The purpose of this embodiment is to find the other digital resources related to the first digital resource. The flow chart of the method is shown in Fig. 2 and includes the following steps:
S21, extracting the topic words of the first digital resource using the method of Embodiment 1. After the first digital resource is selected, its topic words are extracted with the method of Embodiment 1, which is not repeated here. That method yields the topic term vector of the first digital resource, topicWords=(tterm1,tterm2,…,ttermq), where ttermi, i=1,…,q (q<p), represents the topic word with the i-th highest semantic weight, and its corresponding distribution probability is defined as pi.
S22, obtaining the keywords of the second digital resource and their weights. The second digital resource is a digital resource to be judged for relatedness to the first digital resource; it can be any digital resource other than the first. The keywords of the second digital resource and their weights are obtained as follows:
First, the text of the second digital resource is segmented, in the same way as in Embodiment 1, which is not repeated here.
Second, the segmentation result is denoised to obtain a word sequence. As in Embodiment 1, the stop word list and stop parts of speech are used to denoise the segmentation result into the word sequence seqTerms, in which the words are arranged in textual order and repeated words are retained in their order of appearance.
Third, the words in the word sequence are arranged in descending order by the TF-IDF method.
TF-IDF is a prior-art statistical method for evaluating the importance of a word to one file in a file set or corpus. The importance of a word increases in proportion to the number of times it appears in the file, but decreases in inverse proportion to its frequency in the corpus. The main idea of TF-IDF is: if a word or phrase has a high frequency TF in one article but seldom appears in other articles, the word or phrase is considered to have good class discrimination ability and to be suitable for classification. A high term frequency in a given file combined with a low document frequency across the whole file set produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
Processing by the TF-IDF method yields the important words and their weights, arranged in descending order of weight.
Fourth, each sense of the words retained in the previous step is obtained, words with the same sense are merged, and the merged words are taken as keywords.
Merging words with the same sense is the same as in Embodiment 1, deduplicating through the synonymyMap set. The keyword vector obtained by arranging the meaning words in seqTerms in descending TF-IDF order and deduplicating with synonymyMap is keyWords=(kterm1,kterm2,…,ktermQ), where ktermi, i=1,…,Q, represents the i-th most important keyword and Q represents the total number of keywords. The weight of ktermi is set to wi.
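The TF-IDF ranking of the third sub-step can be sketched as below. The function name and the particular smoothed IDF variant are assumptions; the filing describes the TF-IDF idea without fixing one formula.

```python
import math

def tfidf_keywords(doc_terms, corpus):
    """Step S22 sketch: rank one document's terms by TF-IDF, descending.

    doc_terms is the denoised word sequence of the second digital resource;
    corpus is the list of all documents' word sequences.
    """
    n_docs = len(corpus)
    scores = {}
    for term in set(doc_terms):
        tf = doc_terms.count(term) / len(doc_terms)          # term frequency
        df = sum(1 for doc in corpus if term in doc)         # document frequency
        idf = math.log(n_docs / (1 + df)) + 1                # smoothed IDF (assumed variant)
        scores[term] = tf * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

corpus = [["event", "event", "case"], ["case", "report"], ["report"]]
ranked = tfidf_keywords(corpus[0], corpus)
```

"event" appears twice in the document and nowhere else in the corpus, so it outranks "case", which also occurs in a second document; this is the "high TF, low DF" behavior described above.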
S23, obtaining the text similarity between the first digital resource and the second digital resource.
The text similarity is computed as s = Σ_{i=1}^{M} wi·pi, where M is the total number of non-duplicate semantic words contained in both the keywords of the second digital resource and the topic words of the first digital resource, wi represents the weight of the i-th non-duplicate semantic word in the second digital resource, and pi represents the distribution probability of the i-th non-duplicate semantic word in the topic words of the first digital resource.
Although the prior art offers many ways to calculate text similarity, the above method of this embodiment achieves a better effect.
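Under the formula as reconstructed above (a weighted dot product over the shared semantic words), step S23 can be sketched as follows; the function name is an assumption, and the inputs are the topicWords and keyWords vectors as (word, weight) pairs.

```python
def text_similarity(topic_words, key_words):
    """Step S23 sketch: s = sum of w_i * p_i over the semantic words shared
    by the first resource's topicWords (probabilities p_i) and the second
    resource's keyWords (weights w_i)."""
    p = dict(topic_words)
    return sum(w * p[term] for term, w in key_words if term in p)

s = text_similarity(
    topic_words=[("event", 0.5), ("case", 0.25)],   # (tterm_i, p_i)
    key_words=[("event", 0.4), ("mayor", 0.1)],      # (kterm_i, w_i)
)
```

Only "event" is shared, so s = 0.4 × 0.5 = 0.2, which would then be compared against the threshold ξ of step S25.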
S24, obtaining the semantic distribution density of the topic words in the second digital resource.
The semantic distribution density ρ is computed as follows:
First, the non-duplicate words contained in both the topic words of the first digital resource and the keywords of the second digital resource are chosen.
Second, these words are sorted from high to low by their weight in the topic words of the first digital resource.
Third, a preset number of the top-ranked words are selected as density focus words. Three words can be selected here; other quantities can also be selected as needed.
Fourth, the same-sense words of the density focus words are obtained. Each selected density focus word corresponds to several same-sense words with the same or similar meaning; in the same way as above, the same-sense words of each density focus word are obtained.
Fifth, the position of the first-appearing same-sense word in the second digital resource is obtained; that is, among the several same-sense words, the position of the one that appears earliest is taken as the earliest position.
Sixth, the position of the last-appearing same-sense word in the second digital resource is obtained; that is, among the several same-sense words, the position of the one that appears last is taken as the latest position.
Seventh, the distance between the first-appearing and last-appearing same-sense words is obtained; it can be counted in characters or in words.
Eighth, the ratio of this distance to the length of the second digital resource is taken as the semantic distribution density. The length of the second digital resource is also counted in characters or in words. This ratio represents how evenly the topic words of the first digital resource are distributed in the second digital resource, and the two values together quantify the degree of correlation between the digital resources.
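The eight sub-steps of S24 can be sketched as below. The function name is an assumption; positions are counted in words (token indices), one of the two counting options the disclosure allows, and sense_equiv stands in for the same-sense lookup of the fourth sub-step.

```python
def semantic_density(topic_words, doc_tokens, sense_equiv, n_focus=3):
    """Step S24 sketch: rho = span between the first and last occurrence of
    any sense-equivalent of the density focus words, over document length.

    topic_words: (word, weight) pairs shared with the second resource's
    keywords; doc_tokens: the second resource as a token list.
    """
    # Sub-steps 2-3: top n_focus shared words by topic-word weight.
    focus = [w for w, _ in sorted(topic_words, key=lambda kv: -kv[1])[:n_focus]]
    # Sub-step 4: expand each focus word to its same-sense words.
    targets = set()
    for w in focus:
        targets |= sense_equiv.get(w, {w})
    # Sub-steps 5-7: earliest and latest occurrence positions and distance.
    positions = [i for i, tok in enumerate(doc_tokens) if tok in targets]
    if not positions:
        return 0.0
    # Sub-step 8: ratio of the distance to the document length.
    return (positions[-1] - positions[0]) / len(doc_tokens)

doc = ["x", "event", "x", "x", "case", "x"]
rho = semantic_density([("event", 3), ("case", 2)], doc, sense_equiv={})
```

Here the focus words first appear at position 1 and last at position 4 in a six-token document, so ρ = 3/6 = 0.5, right at the preferred threshold δ of step S25.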
S25, judging whether the text similarity exceeds the text similarity threshold and whether the semantic distribution density exceeds the semantic distribution density threshold; if both do, the second digital resource is taken as a related digital resource of the first digital resource.
Usually, the text similarity threshold is set to 0.2-0.4 and the semantic distribution density threshold to 0.4-0.6. Preferably, the text similarity threshold is set to ξ=0.3 and the semantic distribution density threshold to δ=0.5; when s>ξ and ρ>δ, the second digital resource is taken as a related digital resource of the first digital resource.
In the scheme of this embodiment, the relatedness of two digital resources is measured from two aspects, their text similarity and the semantic distribution density. The text similarity indicates the degree to which the two texts describe the same subject, and the semantic distribution density represents how evenly the topic words of the first digital resource are distributed in the second. These two values quantify the degree of correlation between digital resources, so that accurately related digital resources are obtained; this can be used in fields such as the recommendation of related digital resources and the building of special topic libraries.
Embodiment 3
This embodiment provides a topic generation method: according to the files the user has read and is interested in, files in the resource library that belong to the same topic as those the user has read are obtained, and these topics are pushed to the user, improving the user experience. The flow of the topic generation method is shown in Fig. 3 and comprises the following steps:
S31: selecting a first digital resource. A digital resource the user is interested in or follows, or one the user has read, may be selected. This step selects the reference information: the first digital resource serves as the reference for subsequent processing.
S32: choosing candidate digital resources in turn as the second digital resource. A digital resource is selected from the candidate resource library as the second digital resource for subsequent processing.
S33: obtaining the second digital resources correlated with the first digital resource using the method described in Embodiment 2. If s>ξ and ρ>δ are satisfied, the second digital resource is taken as a correlated digital resource of the first digital resource; otherwise it is not considered correlated. By traversing all the second digital resources chosen in turn in S32, all second digital resources in the candidate resource library that are correlated with the first digital resource are obtained as the digital resources in the topic.
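The S31-S33 traversal can be sketched as a single loop. Here `correlate` stands in for the Embodiment 2 computation and is assumed to return the pair (s, ρ); the thresholds are the preferred values ξ=0.3 and δ=0.5:

```python
def build_topic(first_resource, candidates, correlate, xi=0.3, delta=0.5):
    """Traverse the candidate library (S32) and keep every second resource
    judged correlated with the first (S33)."""
    topic = []
    for second in candidates:
        s, rho = correlate(first_resource, second)
        if s > xi and rho > delta:
            topic.append(second)
    return topic
```

With a real corpus, `correlate` would wrap the text similarity and density computations of Embodiment 2.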
The scheme of this embodiment can be used to obtain digital resources of interest according to what the user is currently reading, for example by semantically extracting the topic word vector of the user's stories of interest from the news text and then using topic relevance to organize and generate a personalized topic from the digital newspaper resource library. Starting from the report the user is currently reading, text processing yields a semantically extracted topic word vector of the stories of interest; correlated reports are then extracted from the digital newspaper resource library according to this topic word vector, and a personalized newspaper topic is organized from the strength of correlation and the distribution of the topic words, allowing the user to quickly obtain the stories of interest. The scheme eliminates the prior-art dependence on feature word selection and named entity recognition, and weakens the interference of polysemous words and synonyms on the topic word vector: neither manually edited feature words or topic word candidate lists, nor named entity techniques for determining topic word candidates, are required, realizing user-oriented personalized topic organization and generation.
In a further embodiment, the method further includes obtaining the priority of the second digital resources correlated with the first digital resource and sorting the second digital resources by priority. That is, the second digital resources in the topic library differ in their degree of correlation with the first digital resource: the larger s and ρ are, the higher the priority of that digital resource. The priority of a digital resource within the topic, calculated from s and ρ, is denoted prior. The priority here may be calculated with prior-art schemes, such as a weighted sum, the purpose being to sort the resources. The resulting topic is defined as specialTopic={news1, news2, …, newsT}, where newsi, i=1, …, T, denotes the digital resource ranked i-th by priority.
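A sketch of the priority ordering, using the weighted-sum scheme the text mentions as one prior-art option. The equal weights 0.5/0.5 are an illustrative assumption, not specified by the source:

```python
def order_by_priority(resources, w_s=0.5, w_rho=0.5):
    """resources: list of (name, s, rho) triples.
    prior(r) = w_s * s + w_rho * rho; sort descending to get specialTopic."""
    def prior(r):
        return w_s * r[1] + w_rho * r[2]
    return sorted(resources, key=prior, reverse=True)
```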
In addition, on the basis of the above, for digital resources of the same priority, in order to avoid duplicated digital resources, the text similarity between two second digital resources of the same priority may further be calculated. If that text similarity exceeds a preset threshold, e.g. 0.8, the two digital resources are marked as duplicates and one of them is removed. The text similarity here may be calculated with a prior-art scheme, for example by word matching. The text similarity calculation of Embodiment 2 above could also be used, but since that method is relatively complex, a simpler prior-art text similarity calculation achieves good results here.
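The deduplication step can be sketched as pairwise comparisons within one priority level. Any prior-art similarity plugs in; the word-overlap (Jaccard) measure used in the test below is one simple choice, not the source's method:

```python
from itertools import combinations

def mark_duplicates(same_priority_texts, similarity, eta=0.8):
    """Flag the later member of every pair whose similarity exceeds eta,
    keeping the first occurrence. Returns the set of flagged indices."""
    duplicates = set()
    for i, j in combinations(range(len(same_priority_texts)), 2):
        if similarity(same_priority_texts[i], same_priority_texts[j]) > eta:
            duplicates.add(j)
    return duplicates
```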
Embodiment 4
This embodiment provides a specific application example. User-oriented newspaper topic organization and generation mainly comprises two steps.
In the first step, the topic word vector of the topic is generated semantically from the set of news the user is interested in. The input of this step is the set D of news texts the user is interested in, and the output is the topic word vector topicWords of the topic; the detailed flow is shown in Fig. 1. After the user dictionary is loaded into the segmenter, coarse-grained word segmentation is performed on the news text set D. The semantics-based document topic model is LDA (Latent Dirichlet Allocation). After semantic deduplication with synonymyMap, the top 20% of topic words ranked by distribution probability are taken as the final topic words of the topic, as shown in Fig. 4.
Specifically, suppose the user selects a piece of news about the search and rescue work of the March 8 Malaysia Airlines incident, and a topic is organized and generated for it.
In the first step, the topic word vector of the topic is generated. After the user dictionary is loaded into the segmenter, coarse-grained word segmentation is performed on the news. The segmentation result is filtered with stopWords and stopSpeeches. The meaning words obtained after filtering are used to train the LDA model and compute the topic word probability distribution, yielding {sea area=0.0432, aircraft=0.0305, passenger plane=0.0029, Malaysia=0.0208, rescue=0.0203, navy=0.0183, search=0.0168, warship=0.0163, Malaysia Airlines=0.0158, …}. In synonymyMap, "Malaysia Airlines" and "Malaysia" share one semantic code, and "warship", "naval vessel", "naval vessels", etc., share another. After semantic deduplication with synonymyMap, the probability distribution becomes {sea area=0.0468, aircraft=0.0336, warship=0.0318, rescue=0.0289, search=0.0275, passenger plane=0.0029, ship=0.0224, Malaysia=0.0208, Malaysia Airlines=0.0204, …}, and the top 20% of topic words by distribution probability are taken as the topic words of "Malaysia Airlines search and rescue".
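The semantic deduplication above can be sketched as follows, assuming synonymyMap is a mapping from words to semantic codes: words sharing a code are merged, their weights are summed, and the highest-weight member represents the group (matching the combining unit of Embodiment 5). The warship figures in the test reproduce the merge 0.0163+0.0155=0.0318 as one plausible decomposition:

```python
def semantic_dedup(prob_dist, synonymy_map):
    """Merge words sharing a semantic code; unmapped words keep their own code."""
    groups = {}
    for word, weight in prob_dist.items():
        code = synonymy_map.get(word, word)
        groups.setdefault(code, []).append((word, weight))
    merged = {}
    for members in groups.values():
        representative = max(members, key=lambda m: m[1])[0]  # highest-weight word
        merged[representative] = sum(w for _, w in members)
    return merged
```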
In the second step, the topic is organized and generated by computing the similarity between each candidate news text in the digital newspaper resource library and the topic words. The input of this step is the digital newspaper resource library and the topic word vector topicWords of the topic, and the output is the topic of interest to the user. After the topic candidate set is chosen using the publication time of the news the user is interested in and the newspaper priority, each news item in the candidate set is traversed, computing its similarity s with the topic words and the topic word density ρ within its text; when s>ξ and ρ>δ, the news item is added to specialTopic. prior is calculated from s and ρ, and the news items are organized in descending order of prior. Pairwise similarity is then computed among the news texts sharing the same prior in specialTopic, and any two news items whose similarity exceeds η are marked as duplicated news, as shown in Fig. 5.
Continuing the above specific example, a topic is organized and generated for the news about the search and rescue work of the March 8 Malaysia Airlines incident selected by the user. In this step, the topic is organized and generated by computing the similarity between the news texts in the digital newspaper library and the topic words. According to the publication date of the news selected by the user, "March 10, 2014", all news of the important newspapers within a certain period before and after that date in the digital newspaper library is taken as the topic candidate set. For each news item in the candidate set, its similarity s with the topic words obtained in the first step is computed; for news with similarity greater than 0.3, the distribution density ρ of the topic words in its text is further computed, and when the distribution density is greater than 0.5 the news item is added to the "Malaysia Airlines search and rescue" topic. The news items in the topic are sorted in descending order of the prior calculated from s and ρ, and within the same prior any news items with similarity greater than 0.8 are marked. The resulting "Malaysia Airlines search and rescue" topic is shown in Fig. 6, where news items of the same priority shown in the same group indicate items marked as duplicates.
In this embodiment, the input is the set of news texts the user is interested in, and the output is a user-oriented personalized topic; this outperforms keyword-based retrieval, particularly for news topics that are difficult to describe with several keywords. Local feature words are selected by filtering function words with stopWords and stopSpeeches, without using a vector space model or named entity recognition, which enhances the robustness of the method. The topic word vector of the news topic is extracted by combining LDA with synonymyMap, which fully exploits the semantic information of the news and reduces the interference of polysemous words and synonyms on the topic word vector. The customized similarity calculation method allows the thresholds of different topics to be unified without establishing a global vector space model, meeting the personalized and diversified demands of user-oriented newspaper topics.
Embodiment 5
This embodiment provides a topic word extraction device, as shown in Fig. 7, comprising:
a word segmentation unit 11, which segments the text of the digital resource;
a segmentation result processing unit 12, which obtains meaning words from the segmentation result;
a probability distribution unit 13, which obtains, for each topic, the probability distribution of the meaning words, the probability distribution comprising the meaning words and their corresponding weights;
a combining unit 14, which obtains the sense of each meaning word and merges the meaning words having the same sense together with their corresponding weights;
a topic word determination unit 15, which determines the topic words from the merged meaning words and their weights. The merged meaning words are sorted by weight, and a preset number of the foremost meaning words are selected as topic words. The preset number is 10%-30% of the total, preferably 20%.
The combining unit 14 comprises:
a mapping subelement, which establishes the mapping between words and senses;
a sense obtaining subelement, which obtains the sense corresponding to each meaning word;
a sense lookup subelement, which finds the meaning words having the same sense;
a meaning word merging subelement, which merges the meaning words having the same sense into one meaning word, selecting the meaning word with the highest corresponding weight as the merged meaning word;
a weight calculation subelement, which accumulates the weights of the meaning words having the same sense as the weight of the merged meaning word.
The segmentation result processing unit 12 comprises:
a denoising subelement, which denoises the segmentation result using a stop word list and parts of speech to obtain a word sequence;
a word merging subelement, which merges identical words in the word sequence and takes the resulting words as meaning words.
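The two subelements of the segmentation result processing unit 12 can be sketched in a few lines. This assumes part-of-speech tags are supplied alongside the tokens (the source does not specify the tagger or tag set):

```python
def meaning_words(tokens, pos_tags, stop_words, stop_speeches):
    """Denoising subelement: drop stop words and stopped parts of speech.
    Word merging subelement: collapse repeated words, preserving first order."""
    kept = [t for t, pos in zip(tokens, pos_tags)
            if t not in stop_words and pos not in stop_speeches]
    seen, merged = set(), []
    for t in kept:
        if t not in seen:
            seen.add(t)
            merged.append(t)
    return merged
```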
Embodiment 6
In addition, this embodiment also provides a device for obtaining correlated digital resources, as shown in Fig. 8, comprising:
a topic word extraction unit 21, which extracts the topic words of the first digital resource;
a keyword determination unit 22, which obtains the keywords of the second digital resource and their weights;
a text similarity obtaining unit 23, which obtains the text similarity between the first digital resource and the second digital resource;
a semantic distribution density obtaining unit 24, which obtains the semantic distribution density of the topic words in the second digital resource;
a correlated resource determination unit 25, which judges whether the text similarity is greater than a text similarity threshold and whether the semantic distribution density is greater than a semantic distribution density threshold, and if both are, takes the second digital resource as a correlated digital resource of the first digital resource. The text similarity threshold is set to 0.2-0.4; and/or the semantic distribution density threshold is set to 0.4-0.6. Preferably, the text similarity threshold is set to 0.3; and/or the semantic distribution density threshold is set to 0.5.
The keyword determination unit 22 comprises:
a text segmentation subelement, which segments the text of the second digital resource;
a segmentation result denoising subelement, which denoises the segmentation result to obtain a word sequence;
a descending arrangement subelement, which arranges the words in the word sequence in descending order by the TF-IDF method;
a keyword obtaining subelement, which obtains the sense of each word, merges the words having the same sense, and takes the merged words as keywords.
The keyword vector is keyWords=(kterm1, kterm2, …, ktermQ), where ktermi, i=1, …, Q, denotes the i-th most important keyword and Q denotes the total number of keywords; the weight of ktermi is set to:
The text similarity obtaining unit 23 uses the text similarity calculation formula s = Σ_{i=1}^{M} w_i·p_i, where M is the total number of non-duplicate semantic words contained in the keywords of the second digital resource and the topic words of the first digital resource, w_i denotes the weight of the i-th non-duplicate semantic word in the second digital resource, and p_i denotes the distribution probability of the i-th non-duplicate semantic word in the topic words of the first digital resource.
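A minimal sketch of this similarity, under the assumption that it is the weighted dot product over the shared non-duplicate words, so that a word absent from either the keywords or the topic words contributes zero:

```python
def text_similarity(keyword_weights, topic_probs):
    """s = sum of w_i * p_i over words present on both sides:
    keyword_weights maps words of the second resource to weights w_i,
    topic_probs maps topic words of the first resource to probabilities p_i."""
    shared = keyword_weights.keys() & topic_probs.keys()
    return sum(keyword_weights[w] * topic_probs[w] for w in shared)
```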
The semantic distribution density obtaining unit 24 comprises:
a non-duplicate word determination subelement, which chooses the non-duplicate words jointly contained in the topic words of the first digital resource and the keywords of the second digital resource;
a weight sorting subelement, which sorts the words from high to low by their weights in the topic words of the first digital resource;
a choosing subelement, which selects a preset number of the foremost words as density concern words;
a same-semantic word obtaining subelement, which obtains the words having the same semantics as the density concern words;
a first occurrence position obtaining subelement, which obtains the position of the first occurring semantic word among the same-semantic words in the second digital resource;
a last occurrence position obtaining subelement, which obtains the position of the last occurring semantic word among the same-semantic words in the second digital resource;
a distance obtaining subelement, which obtains the distance between the first occurring semantic word and the last occurring semantic word;
a semantic distribution density calculation subelement, which takes the ratio of the distance to the length of the second digital resource as the semantic distribution density.
Embodiment 7
This embodiment provides a topic generation device, as shown in Fig. 9, comprising:
a first digital resource selection unit 31, which selects the first digital resource;
a second digital resource selection unit 32, which chooses candidate digital resources in turn as the second digital resource;
a topic generation unit 33, which obtains the second digital resources correlated with the first digital resource, traversing all second digital resources and taking those correlated with the first digital resource as the digital resources in the topic.
The device further comprises a priority calculation unit, which obtains the priority of the second digital resources correlated with the first digital resource and sorts the second digital resources by priority.
The device further comprises a deduplication unit, which calculates the text similarity between two second digital resources of the same priority and, if the text similarity exceeds a preset threshold, marks the two digital resources as duplicates and removes one of them.
It should be understood by those skilled in the art that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
Obviously, the above embodiments are merely examples given for clarity of description and are not intended to limit the embodiments. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments here, and the obvious variations or changes derived therefrom remain within the protection scope of the present invention.