CN106469187B - Keyword extraction method and device - Google Patents

Keyword extraction method and device

Info

Publication number
CN106469187B
Authority
CN
China
Prior art keywords
word
target text
theme
predicate
institute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610751325.2A
Other languages
Chinese (zh)
Other versions
CN106469187A (en
Inventor
张明亮
齐勇
王明强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201610751325.2A priority Critical patent/CN106469187B/en
Publication of CN106469187A publication Critical patent/CN106469187A/en
Publication of CN106469187B publication Critical patent/CN106469187B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a keyword extraction method and device, relating to the technical field of data processing, which address the low intelligence and efficiency of existing keyword extraction. The main technical solution is as follows: obtain the topic influence vector of each word in a target text, where the topic influence vector of a word represents the influence of that word on the topics in the target text; calculate the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words, where the importance represents the degree of association between the word and the target text; and select, from the target text, the words that satisfy a preset importance as the keywords of the target text. The invention is mainly used for extracting keywords.

Description

Keyword extraction method and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a keyword extraction method and device.
Background
Keyword extraction means extracting, from a given text, the words or phrases that reflect the gist of the text. It plays an important role in automatic summarization, text mining, and information retrieval, and is in particular a key method for automatic annotation. Depending on whether a labeled training corpus is required, keyword extraction methods fall into two major classes: supervised keyword extraction and unsupervised keyword extraction.
Unsupervised keyword extraction based on a word graph builds the nodes of the word graph from the distribution of words in a document, then calculates the influence transmitted between neighboring words as a weighted combination of three aspects (the coverage influence, position influence, and frequency influence of the word), that is, it calculates the edge weights between nodes in the word graph, and extracts keywords from the document according to those edge weights.
However, in word-graph-based keyword extraction, the word frequency of the text expresses the frequency influence, and the word co-occurrence relationship expresses the position influence and coverage influence. The keywords extracted by such methods are therefore the words that score highly on word frequency and co-occurrence in the text, and these words are often unrelated to the topic of the text. To make the extracted keywords fit the topic of the text and achieve good results, manual experience is usually needed: when measuring the importance of words, a relatively simple empirical assignment is often used, for example assigning a higher weight to words that occur in the topic. Existing word-graph-based keyword extraction methods therefore require manual intervention, and their intelligence and efficiency are low.
Summary of the invention
In view of this, the present invention provides a keyword extraction method and device, whose main purpose is to solve the problem that existing keyword extraction has low intelligence and efficiency.
According to one aspect of the present invention, a keyword extraction method is provided, comprising:
obtaining the topic influence vector of each word in a target text, where the topic influence vector of a word represents the influence of that word on the topics in the target text;
calculating the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words, where the importance represents the degree of association between the word and the target text;
selecting, from the target text, the words that satisfy a preset importance as the keywords of the target text.
Specifically, obtaining the topic influence vector of each word in the target text includes:
calculating, by the document topic generation model LDA, the probability that each topic occurs in the target text and the probability that each word occurs in each topic;
performing a dot-product calculation between the probability that each topic occurs in the target text and the probability that each word occurs in each topic, to obtain the topic influence vector of each word in the target text.
Further, before calculating the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words, the method also includes:
constructing the word graph of the target text by taking the words in the target text as the nodes of the word graph and the positional adjacency of the words in the target text as the edges between the nodes.
Specifically, calculating the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words includes:
calculating the similarity between the words in the target text from the topic influence vectors of the words;
calculating the importance of each word in the target text according to the word graph of the target text and the similarity between the words.
Specifically, calculating the similarity between the words in the target text from the topic influence vectors of the words includes:
obtaining the words corresponding to two nodes that share an edge in the word graph of the target text;
determining the similarity between the words by calculating the cosine similarity of the topic influence vectors of the words corresponding to the two nodes that share an edge.
Specifically, calculating the importance of each word in the target text according to the word graph of the target text and the similarity between the words includes:
taking the similarity between two words as the weight of the edge connecting the corresponding nodes in the word graph of the target text;
accumulating the weights of all edges connected to a node in the word graph of the target text to obtain the importance of the corresponding word.
Specifically, calculating the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words includes:
setting the topic influence vector of each word as the weight of the corresponding node in the word graph of the target text;
calculating the importance of each word in the target text according to TextRank, the keyword extraction algorithm based on a word graph model, and the weights of the nodes.
Specifically, selecting, from the target text, the words that satisfy a preset importance as the keywords of the target text includes:
selecting the word with the highest importance from the target text as the keyword of the target text.
According to another aspect of the present invention, a keyword extraction device is provided, comprising:
an acquiring unit, configured to obtain the topic influence vector of each word in a target text, where the topic influence vector of a word represents the influence of that word on the topics in the target text;
a computing unit, configured to calculate the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words, where the importance represents the degree of association between the word and the target text;
a selection unit, configured to select, from the target text, the words that satisfy a preset importance as the keywords of the target text.
Specifically, the acquiring unit includes:
a computing module, configured to calculate, by the document topic generation model LDA, the probability that each topic occurs in the target text and the probability that each word occurs in each topic;
a dot-product module, configured to perform a dot-product calculation between the probability that each topic occurs in the target text and the probability that each word occurs in each topic, to obtain the topic influence vector of each word in the target text.
Further, the device also includes:
a construction unit, configured to construct the word graph of the target text by taking the words in the target text as the nodes of the word graph and the positional adjacency of the words in the target text as the edges between the nodes.
Specifically, the computing unit includes:
a first computing module, configured to calculate the similarity between the words in the target text from the topic influence vectors of the words;
a second computing module, configured to calculate the importance of each word in the target text according to the word graph of the target text and the similarity between the words.
Specifically, the first computing module includes:
an acquisition submodule, configured to obtain the words corresponding to two nodes that share an edge in the word graph of the target text;
a determination submodule, configured to determine the similarity between the words by calculating the cosine similarity of the topic influence vectors of the words corresponding to the two nodes that share an edge.
Specifically, the second computing module includes:
a configuration submodule, configured to take the similarity between two words as the weight of the edge connecting the corresponding nodes in the word graph of the target text;
an accumulation submodule, configured to accumulate the weights of all edges connected to a node in the word graph of the target text to obtain the importance of the corresponding word.
Specifically, the computing unit further includes:
a setting module, configured to set the topic influence vector of each word as the weight of the corresponding node in the word graph of the target text;
a third computing module, configured to calculate the importance of each word in the target text according to TextRank, the keyword extraction algorithm based on a word graph model, and the weights of the nodes.
The selection unit is specifically configured to select the word with the highest importance from the target text as the keyword of the target text.
Through the above technical solutions, the technical solutions provided in the embodiments of the present invention have at least the following advantages:
In the keyword extraction method and device provided in the embodiments of the present invention, the topic influence vector of each word in the target text is first obtained, where the topic influence vector of a word represents the influence of that word on the topics in the target text; the importance of each word in the target text is then calculated according to the word graph of the target text and the topic influence vectors of the words, where the importance represents the degree of association between the word and the target text; finally, the words that satisfy a preset importance are selected from the target text as its keywords. Compared with the current practice of extracting keywords by manually setting, from experience, the importance of words to the topic of the text, the embodiments of the present invention use a document topic generation model to calculate the topic influence vector of each word in the target text and then use that vector to measure the importance of the word to the topics in the target text. The embodiments therefore no longer need manual experience to set the importance of words to the text topic, and the topic influence vector obtained from the document topic generation model accurately represents the influence of a word on the topics in the target text. Keywords can thus be extracted from the target text according to its word graph and the topic influence vectors of its words, which improves both the efficiency and the intelligence of keyword extraction.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numbers refer to the same parts. In the drawings:
Fig. 1 shows a flow chart of a keyword extraction method provided in an embodiment of the present invention;
Fig. 2 shows a flow chart of another keyword extraction method provided in an embodiment of the present invention;
Fig. 3 shows a structural block diagram of a keyword extraction device provided in an embodiment of the present invention;
Fig. 4 shows a structural block diagram of another keyword extraction device provided in an embodiment of the present invention;
Fig. 5 shows a schematic diagram of the word graph of a target text provided in an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
An embodiment of the present invention provides a keyword extraction method. As shown in Fig. 1, the method includes:
101. Obtain the topic influence vector of each word in the target text.
Here, the topic influence vector of a word represents the influence of that word on the topics in the target text, that is, the influence of the word on all topics in the target text. Note that, for a word w in a target text d, let F denote the topic influence vector of w in d. The greater the probability that w occurs in a topic z, the greater the influence of w on z; and the greater the probability that topic z occurs in d, the greater the influence of z on d. The influence of w on topic z in d can therefore be determined by the product of the probability that z occurs in d and the probability that w occurs in z. That product, however, gives only the influence of w on the single topic z, not the influence of w on all topics in d. Because d may contain multiple topics and w may occur in several of them at once, the topic influence vector F of w in d must be determined from the result of the dot product between the probability that each topic occurs in d and the probability that w occurs in each topic.
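One way to write this construction in symbols is given below. The notation is introduced here for illustration and does not appear in the original text: K is the number of topics, P(z_k | d) is the probability that topic z_k occurs in d, and P(w | z_k) is the probability that w occurs in z_k. Because the result is described as a vector, the sketch assumes that the "dot product" is taken component by component, one component per topic:

```latex
F_w = \bigl( P(z_1 \mid d)\,P(w \mid z_1),\; P(z_2 \mid d)\,P(w \mid z_2),\; \dots,\; P(z_K \mid d)\,P(w \mid z_K) \bigr)
```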
Based on the above analysis, the embodiment of the present invention can obtain the topic influence vector of each word in the target text by means of LDA (Latent Dirichlet Allocation, a document topic generation model). Specifically, the process can be as follows: first segment the target text into words; then use LDA to calculate the probability that each topic occurs in the target text and the probability that each word occurs in each topic; then take the result of the dot product between the probability that each topic occurs in the target text and the probability that each word occurs in each topic as the topic influence vector of each word in the target text.
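The last step of this process can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes the two LDA probability tables are already available, it reads the "dot product" as a per-topic product so that each word receives one component per topic, and the function name `topic_influence_vectors` and all probability values are hypothetical.

```python
def topic_influence_vectors(doc_topic_probs, topic_word_probs):
    """Return {word: [P(z|d) * P(w|z) for each topic z]}.

    doc_topic_probs:  list of P(z | d), one entry per topic.
    topic_word_probs: list of dicts, one per topic, mapping word -> P(w | z).
    """
    vectors = {}
    for z, word_probs in enumerate(topic_word_probs):
        for word, p_w_given_z in word_probs.items():
            # A word absent from a topic keeps 0.0 in that component.
            vec = vectors.setdefault(word, [0.0] * len(doc_topic_probs))
            vec[z] = doc_topic_probs[z] * p_w_given_z
    return vectors


# Hypothetical LDA output for one document with two topics.
doc_topic_probs = [0.75, 0.25]          # P(z | d)
topic_word_probs = [                    # P(w | z)
    {"graph": 0.5, "keyword": 0.5},
    {"graph": 0.25, "topic": 0.75},
]
F = topic_influence_vectors(doc_topic_probs, topic_word_probs)
```

Each entry of `F` is a vector with one component per topic, which is what the later similarity step compares.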
102. Calculate the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words.
Here, the importance represents the degree of association between the word and the target text: the greater the importance of a word, the stronger its association with the target text; the smaller the importance, the weaker the association. Note that the word graph of the target text is constructed on the basis of the TextRank algorithm (a keyword extraction algorithm based on a word graph model): the words in the target text serve as the nodes of the word graph, and the positional adjacency of the words in the target text serves as the edges between the nodes.
In the embodiment of the present invention, the topic influence vector of a word can be used as the weight of the corresponding node in the word graph of the target text; the node weights are then substituted into the TextRank formula, and the importance of each word in the target text is calculated from the node weights and the influence transmitted between neighboring words in the word graph. Alternatively, the topic influence vectors of the words can be used to calculate the weight of the edge between each pair of connected nodes in the word graph; the weights of the edges connected to the same node are then accumulated, and the accumulated result is taken as the importance of each word in the target text.
For example, suppose the word graph of the target text includes nodes A, B, C, D, and E, which correspond to words A, B, C, D, and E respectively, and node A is connected to nodes B, C, and D, that is, there are edges between node A and each of nodes B, C, and D. Let the topic influence vectors of nodes A, B, C, D, and E be a, b, c, d, and e respectively. Then a, b, c, d, and e can be used as the weights of nodes A, B, C, D, and E, and the TextRank algorithm can be applied to the node weights and the node relationships in the word graph to obtain the importance of each node, that is, the importance of each word in the target text. Alternatively, the importance of each word can be obtained from the edge weights in the word graph: first calculate the weight ab of the edge between nodes A and B from the topic influence vectors of words A and B, the weight ac of the edge between nodes A and C from the topic influence vectors of words A and C, and the weight ad of the edge between nodes A and D from the topic influence vectors of words A and D; then accumulate the weights ab, ac, and ad of the edges connected to node A to obtain the importance of node A, that is, the degree of association between word A and the target text.
103. Select, from the target text, the words that satisfy a preset importance as the keywords of the target text.
Here, the preset importance can be configured according to actual needs. For example, the word with the highest importance can be selected from the target text as its keyword, or the words whose importance exceeds a preset value can be selected as its keywords; the embodiment of the present invention does not specifically limit this. Note that the larger the preset value, the fewer keywords are extracted from the target text; the smaller the preset value, the more keywords are extracted.
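The two selection rules mentioned here (highest importance, or importance above a preset value) can be sketched as below. The function name `select_keywords`, the `threshold` parameter, and the importance scores are hypothetical and serve only as an illustration.

```python
def select_keywords(importance, threshold=None):
    """Pick keywords from a {word: importance} mapping.

    With no threshold, return the word(s) with the highest importance.
    With a threshold, return every word whose importance exceeds it.
    """
    if threshold is None:
        best = max(importance.values())
        return [w for w, score in importance.items() if score == best]
    return [w for w, score in importance.items() if score > threshold]


# Hypothetical importance scores.
importance = {"A": 1.5, "B": 3.0, "C": 2.0}
top_only = select_keywords(importance)
above_preset = select_keywords(importance, threshold=1.75)
```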
In the keyword extraction method provided in the embodiment of the present invention, the topic influence vector of each word in the target text is first calculated with a document topic generation model and used to measure the importance of the word to the topics in the target text; the importance of each word in the target text is then calculated according to the word graph of the target text and the topic influence vectors of the words; finally, the words that satisfy a preset importance are selected from the target text as its keywords. Because no manual experience is needed to set the importance of words to the text topic while obtaining the keywords, and the topic influence vector obtained from the document topic generation model accurately represents the influence of a word on the topics in the target text, the embodiment can improve the efficiency and intelligence of keyword extraction.
An embodiment of the present invention provides another keyword extraction method. As shown in Fig. 2, the method includes:
201. Obtain the topic influence vector of each word in the target text.
Here, the topic influence vector of a word represents the influence of that word on the topics in the target text, that is, the influence of the word on all topics in the target text. In the embodiment of the present invention, obtaining the topic influence vector of each word in the target text includes: calculating, by the document topic generation model LDA, the probability that each topic occurs in the target text and the probability that each word occurs in each topic; and performing a dot-product calculation between the probability that each topic occurs in the target text and the probability that each word occurs in each topic, to obtain the topic influence vector of each word in the target text. For the description of obtaining the topic influence vector of each word in the target text, refer to the corresponding description of Fig. 1; it is not repeated here.
202. Construct the word graph of the target text by taking the words in the target text as the nodes of the word graph and the positional adjacency of the words in the target text as the edges between the nodes.
Here, the positional adjacency of words refers to the order in which the words occur in the sentences of the target text, and the edges between nodes are undirected. For example, suppose the target text contains words A, B, C, D, and E, and they occur in the target text in the order ABCDBEA. The word graph constructed from this order is shown in Fig. 5: node B is adjacent in position to nodes A, C, D, and E, so there are edges between node B and each of nodes A, C, D, and E; node E is adjacent in position to node A, so there is an edge between node E and node A.
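The construction in this example can be sketched as follows, assuming that "adjacent" means immediately consecutive in the word sequence. The function name `build_word_graph` and the adjacency-set representation are choices made for this illustration.

```python
def build_word_graph(tokens):
    """Build an undirected word graph as {word: set of neighbouring words}.

    Two words are linked when they occur next to each other in `tokens`.
    """
    graph = {word: set() for word in tokens}
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops from immediate repetitions
            graph[left].add(right)
            graph[right].add(left)
    return graph


# The word order from the example above.
graph = build_word_graph(list("ABCDBEA"))
```

For the sequence ABCDBEA this yields six undirected edges, with node B linked to A, C, D, and E, and node E linked to A and B.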
203. Calculate the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words.
Here, the importance represents the degree of association between the word and the target text: the greater the importance of a word, the stronger its association with the target text; the smaller the importance, the weaker the association.
In the embodiment of the present invention, step 203 includes: calculating the similarity between the words in the target text from the topic influence vectors of the words; and calculating the importance of each word in the target text according to the word graph of the target text and the similarity between the words. The similarity between words can be calculated by algorithms such as Euclidean distance or cosine similarity; the embodiment of the present invention does not specifically limit this. Specifically, the similarity between two words can be obtained by calculating the Euclidean distance or cosine similarity between their topic influence vectors. For example, if the topic influence vector of word A is a and that of word B is b, the similarity of words A and B can be obtained by calculating the cosine similarity of a and b, and this similarity is then taken as the weight of the edge between nodes A and B in the word graph of the target text.
Specifically, calculating the similarity between the words in the target text from the topic influence vectors of the words includes: obtaining the words corresponding to two nodes that share an edge in the word graph of the target text; and determining the similarity between the words by calculating the cosine similarity of the topic influence vectors of the words corresponding to the two nodes that share an edge. For example, in the word graph of the target text in Fig. 5, nodes A, B, C, D, and E correspond to words A, B, C, D, and E respectively, and there are edges between node B and each of nodes A, C, D, and E. The cosine similarity of the topic influence vectors of words B and A is then taken as the similarity of words B and A, the cosine similarity of the topic influence vectors of words B and C as the similarity of words B and C, the cosine similarity of the topic influence vectors of words B and D as the similarity of words B and D, and the cosine similarity of the topic influence vectors of words B and E as the similarity of words B and E.
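The cosine similarity used for each connected pair of words can be computed as in the following sketch. The function name `cosine_similarity` is a choice made here, and returning 0.0 for an all-zero vector is an assumption, since the text does not say how that case is handled.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: treat a zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0


# Hypothetical topic influence vectors for two connected words.
similarity = cosine_similarity([0.375, 0.0625], [0.375, 0.0])
```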
Specifically, calculating the importance of each word in the target text according to the word graph of the target text and the similarity between the words includes: taking the similarity between two words as the weight of the edge connecting the corresponding nodes in the word graph of the target text; and accumulating the weights of all edges connected to a node in the word graph to obtain the importance of the corresponding word. For example, in the word graph of the target text in Fig. 5, nodes A, B, C, D, and E correspond to words A, B, C, D, and E respectively, and there are edges between node B and each of nodes A, C, D, and E. The similarity of words B and A is taken as the weight ba of the edge between nodes B and A, the similarity of words B and C as the weight bc of the edge between nodes B and C, the similarity of words B and D as the weight bd of the edge between nodes B and D, and the similarity of words B and E as the weight be of the edge between nodes B and E. The importance of word B in the target text is obtained by accumulating the weights of the edges connected to node B, that is, from the sum ba+bc+bd+be.
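The accumulation step can be sketched as follows: each undirected edge contributes its weight to both of its endpoints, so the importance of node B is the sum of the weights of its four edges. The function name `importance_by_edge_sum` and the similarity values are hypothetical; the edge set is the one implied by the word order ABCDBEA.

```python
def importance_by_edge_sum(edge_weights):
    """Sum, for every node, the weights of the edges attached to it.

    edge_weights maps an undirected edge (u, v) to its similarity value.
    """
    importance = {}
    for (u, v), weight in edge_weights.items():
        importance[u] = importance.get(u, 0.0) + weight
        importance[v] = importance.get(v, 0.0) + weight
    return importance


# Hypothetical similarity values for the edges of the example graph.
edge_weights = {
    ("A", "B"): 0.5, ("B", "C"): 0.25, ("C", "D"): 0.75,
    ("B", "D"): 0.125, ("B", "E"): 0.5, ("A", "E"): 0.25,
}
importance = importance_by_edge_sum(edge_weights)
keyword = max(importance, key=importance.get)
```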
In the embodiment of the present invention, step 203 also includes: setting the topic influence vector of each word as the weight of the corresponding node in the word graph of the target text; and calculating the importance of each word in the target text according to TextRank, the keyword extraction algorithm based on a word graph model, and the weights of the nodes. Setting the topic influence vector of a word as the weight of the corresponding node means that the importance of the word in the target text is measured by its topic influence vector. This removes the step of manually assigning values to the words in the target text from experience and thereby improves the iterative word-importance formula of the TextRank algorithm. Calculating the importance of each word from the TextRank algorithm and the node weights can therefore improve the efficiency and intelligence of keyword extraction.
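The text does not give the modified TextRank formula, so the following is only one possible reading, offered as a sketch under stated assumptions. It assumes each node weight has been reduced to a single non-negative number (for example, by summing the components of the topic influence vector) and uses the normalized weights in place of the uniform restart term of a PageRank-style iteration. The function name `weighted_textrank`, the damping factor 0.85, the iteration count, and the example weights are all assumptions.

```python
def weighted_textrank(graph, node_weight, damping=0.85, iterations=50):
    """PageRank-style scores on an undirected graph with a weighted restart.

    graph:       {node: set of neighbouring nodes}, every node has a neighbour.
    node_weight: {node: non-negative scalar weight} (assumed reduction of the
                 topic influence vector to one number per word).
    """
    total = sum(node_weight[v] for v in graph)
    prior = {v: node_weight[v] / total for v in graph}
    score = dict(prior)
    for _ in range(iterations):
        # Each node shares its current score equally among its neighbours.
        score = {
            v: (1 - damping) * prior[v]
            + damping * sum(score[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
    return score


# Word graph for the order ABCDBEA, with hypothetical scalar node weights.
graph = {
    "A": {"B", "E"}, "B": {"A", "C", "D", "E"},
    "C": {"B", "D"}, "D": {"B", "C"}, "E": {"A", "B"},
}
node_weight = {"A": 0.2, "B": 0.1, "C": 0.3, "D": 0.2, "E": 0.2}
scores = weighted_textrank(graph, node_weight)
```

With this form, words with larger weights receive a larger share of the restart probability, while the graph structure still spreads importance between neighbouring words.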
204, keyword of the highest word of different degree as the target text is chosen from the target text.
The extracting method of another kind keyword provided in an embodiment of the present invention, due to the structure composition and target of target text The subject information contained between text is the important evidence of keyword abstraction, therefore the embodiment of the present invention is based on LDA theme mould Type can obtain the theme disturbance degree vector of each word in target text, then according to the word figure of target text and each word it Between similarity calculation described in target text each word different degree, finally using the highest word of different degree in target text as The keyword of target text.I.e. the embodiment of the present invention extracts keyword by LDA topic model and TextRank algorithm, Word is measured to the importance of theme in target text since the theme disturbance degree vector with word can be used as, and raw according to document subject matter The primary influences degree vector of the word obtained at model can accurately indicate that word to the disturbance degree of theme in target text, therefore passes through The embodiment of the present invention can be improved the extraction efficiency of keyword and extract intelligence.
Further, an embodiment of the present invention provides a keyword extraction device. As shown in Fig. 3, the device includes an acquiring unit 31, a computing unit 32, and a selection unit 33.
The acquiring unit 31 is configured to obtain the topic influence vector of each word in the target text. The topic influence vector of a word indicates the word's influence on the topics in the target text, that is, its influence on all topics in the target text.
It should be noted that, for a word w in a target text d, let F denote the topic influence vector of word w in target text d. The greater the probability that word w appears in a topic z, the greater the word's influence on topic z; and the greater the probability that topic z appears in target text d, the greater the influence of topic z on target text d. The influence of word w on topic z in target text d can therefore be determined by the product of the probability that topic z appears in target text d and the probability that word w appears in topic z. However, this product gives only the influence of word w on the single topic z, not its influence on all topics in target text d. Because target text d may contain multiple topics and word w may appear in several of them at once, the topic influence vector F of word w in target text d must be determined from the result of the dot-product calculation between the probability that each topic appears in target text d and the probability that word w appears in each topic.
Based on the above analysis, the embodiment of the present invention can obtain the topic influence vector of each word in the target text by means of LDA (Latent Dirichlet Allocation, a document topic generation model). Specifically, the target text is first segmented into words; LDA is then used to calculate the probability that each topic appears in the target text and the probability that each word appears in each topic; and the result of the dot-product calculation between the probability of each topic in the target text and the probability of each word in each topic is taken as the topic influence vector of each word in the target text.
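The calculation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the two LDA distributions are taken as already computed, the function name topic_influence_vectors and all probability values are hypothetical, and the "dot-product calculation" is read here as a per-topic product P(z|d) * P(w|z), since the stated result is a vector with one component per topic.

```python
def topic_influence_vectors(doc_topic, topic_word):
    """Return {word: [P(z|d) * P(w|z) for each topic z]}.

    doc_topic:  list of P(z|d), one entry per topic of the target text.
    topic_word: list with one dict per topic, mapping word -> P(w|z).
    """
    vocabulary = set()
    for distribution in topic_word:
        vocabulary.update(distribution)
    return {
        word: [
            p_topic * distribution.get(word, 0.0)  # 0.0 if the word is absent from topic z
            for p_topic, distribution in zip(doc_topic, topic_word)
        ]
        for word in sorted(vocabulary)
    }


# Hypothetical LDA output for a target text with two topics.
doc_topic = [0.75, 0.25]
topic_word = [
    {"keyword": 0.5, "graph": 0.5},
    {"graph": 0.25, "topic": 0.75},
]
F = topic_influence_vectors(doc_topic, topic_word)
```

Each word thus receives one vector whose components describe its influence on the individual topics, which is the form needed for the later similarity calculation.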
The computing unit 32 is configured to calculate the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words. The importance indicates the degree of association between a word and the target text.
The greater the importance of a word, the stronger its association with the target text; the smaller the importance, the weaker the association. It should be noted that the word graph of the target text is constructed according to the TextRank algorithm (a keyword extraction algorithm based on a word graph model): the words in the target text are used as the nodes of the word graph, and the positional adjacency of words in the target text is used as the edges between the nodes.
In the embodiment of the present invention, the topic influence vector of a word can be used as the weight of the corresponding node in the word graph of the target text; the node weights are then substituted into the TextRank formula, and the importance of each word in the target text is calculated from the node weights and the influence passed on by neighboring words in the word graph. Alternatively, the topic influence vectors can be used to calculate the edge value of the edge between each pair of adjacent nodes in the word graph; the edge values of the edges connected to the same node are then accumulated, and the accumulated result is taken as the importance of each word in the target text.
For example, suppose the word graph of the target text includes nodes A, B, C, D, and E, where node A corresponds to word A, node B to word B, node C to word C, node D to word D, and node E to word E, and node A is connected to nodes B, C, and D, that is, there are edges between node A and nodes B, C, and D. Let the topic influence vectors of nodes A, B, C, D, and E be a, b, c, d, and e, respectively. Then a can be used as the weight of node A, b as the weight of node B, c as the weight of node C, d as the weight of node D, and e as the weight of node E, and the TextRank algorithm is applied to the node weights and the node relations in the word graph to obtain the importance of each node, that is, the importance of each word in the target text. Alternatively, the importance of each word can be obtained from the edge values in the word graph: the edge value ab of the edge between nodes A and B is first calculated from the topic influence vectors of words A and B, the edge value ac of the edge between nodes A and C from those of words A and C, and the edge value ad of the edge between nodes A and D from those of words A and D; the edge values ab, ac, and ad of the edges connected to node A are then accumulated to obtain the importance of node A, that is, the degree of association between word A and the target text.
The selection unit 33 is configured to select, from the target text, the words that meet a preset importance as the keywords of the target text.
The preset importance can be set according to actual needs. For example, the word with the highest importance can be selected from the target text as its keyword, or the words whose importance exceeds a preset value can be selected as its keywords; the embodiment of the present invention does not specifically limit this. It should be noted that the larger the preset value, the fewer keywords are extracted from the target text; the smaller the preset value, the more keywords are extracted.
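The two selection rules mentioned above can be sketched in Python as follows. The function name select_keywords, its threshold parameter, and the importance values are hypothetical and serve only to illustrate the rules.

```python
def select_keywords(importance, threshold=None):
    """Select keywords from a {word: importance} mapping.

    With no threshold, the word(s) with the highest importance are returned;
    otherwise every word whose importance exceeds the threshold is returned.
    """
    if not importance:
        return []
    if threshold is None:
        best = max(importance.values())
        return sorted(w for w, score in importance.items() if score == best)
    return sorted(w for w, score in importance.items() if score > threshold)


# Hypothetical importance values.
importance = {"A": 0.75, "B": 1.5, "C": 0.375}
top = select_keywords(importance)
above_half = select_keywords(importance, threshold=0.5)
```

Raising the threshold can only shrink the returned list, which matches the relation between the preset value and the number of extracted keywords.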
It should be noted that, for other descriptions of the functional units of the keyword extraction device provided by the embodiment of the present invention, reference may be made to the corresponding description of the method shown in Fig. 2, which is not repeated here. It should be understood that the device in this embodiment can implement all of the content of the foregoing method embodiment.
The keyword extraction device provided by the embodiment of the present invention first calculates the topic influence vector of each word in the target text by means of the document topic generation model, then uses the topic influence vector of a word as the measure of its importance to the topics of the target text, calculates the importance of each word according to the word graph of the target text and the topic influence vectors, and finally selects the words that meet the preset importance as the keywords of the target text. Because the embodiment does not require the importance of words to the text topics to be set by human experience when obtaining the keywords, and the topic influence vector obtained from the document topic generation model can accurately indicate a word's influence on the topics in the target text, the embodiment can improve the efficiency and intelligence of keyword extraction.
Further, an embodiment of the present invention provides another keyword extraction device. As shown in Fig. 4, the device includes an acquiring unit 41, a computing unit 42, and a selection unit 43.
The acquiring unit 41 is configured to obtain the topic influence vector of each word in the target text, the topic influence vector of a word indicating the word's influence on the topics in the target text;
the computing unit 42 is configured to calculate the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words, the importance indicating the degree of association between a word and the target text;
the selection unit 43 is configured to select, from the target text, the words that meet a preset importance as the keywords of the target text.
Specifically, the acquiring unit 41 includes:
a computing module 411, configured to calculate, by means of the document topic generation model LDA, the probability that each topic appears in the target text and the probability that each word appears in each topic;
a dot-product module 412, configured to perform a dot-product calculation between the probability that each topic appears in the target text and the probability that each word appears in each topic, to obtain the topic influence vector of each word in the target text.
Further, the device also includes:
a construction unit 44, configured to construct the word graph of the target text by using the words in the target text as the nodes of the word graph and the positional adjacency of words in the target text as the edges between the nodes.
The positional adjacency of words is determined by the order in which the words appear in the target text, and the edges between nodes are undirected. For example, suppose the target text contains the words A, B, C, D, and E, and they appear in the order ABCDBEA. The word graph constructed from this order is shown in Fig. 5: node B is adjacent to nodes A, C, D, and E, so there are edges between node B and nodes A, C, D, and E; node E is adjacent to node A, so there is an edge between node E and node A.
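The construction of the word graph from the order ABCDBEA can be sketched in Python as follows. The function name build_word_graph and the adjacency-set representation are assumptions made for illustration; the embodiment does not prescribe a data structure.

```python
def build_word_graph(words):
    """Build an undirected word graph from a sequence of words.

    Each distinct word becomes a node, and two words that appear next to
    each other in the sequence are joined by an undirected edge.
    """
    graph = {word: set() for word in words}
    for left, right in zip(words, words[1:]):
        if left != right:  # ignore a word that immediately repeats itself
            graph[left].add(right)
            graph[right].add(left)
    return graph


# The example order of words in the target text.
graph = build_word_graph(list("ABCDBEA"))
```

For this order, node B is linked to A, C, D, and E, and node E is linked to A and B, in line with the description of Fig. 5.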
Specifically, the computing unit 42 includes:
a first computing module 421, configured to calculate the similarity between the words in the target text from the topic influence vectors of the words;
a second computing module 422, configured to calculate the importance of each word in the target text according to the word graph of the target text and the similarity between the words.
The embodiment of the present invention can calculate the similarity between words using an algorithm such as Euclidean distance or cosine similarity; the embodiment does not specifically limit this. Specifically, the similarity between two words can be obtained by calculating the Euclidean distance or cosine similarity between their topic influence vectors. For example, if the topic influence vector of word A is a and that of word B is b, the similarity between word A and word B can be obtained by calculating the cosine similarity of a and b, and this similarity is then used as the edge value of the edge between node A and node B in the word graph of the target text.
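The cosine similarity between two topic influence vectors can be sketched in Python as follows. The function name cosine_similarity, the handling of all-zero vectors, and the example vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a word with an all-zero vector is treated as dissimilar to every word.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical topic influence vectors of word A and word B.
a = [0.375, 0.0]
b = [0.375, 0.0625]
edge_value_ab = cosine_similarity(a, b)
```

The returned value can then be stored as the edge value of the edge between node A and node B.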
Specifically, the first computing module 421 includes:
an acquisition submodule 4211, configured to obtain the words corresponding to each pair of nodes joined by an edge in the word graph of the target text;
a determination submodule 4212, configured to determine the similarity between the words by calculating the cosine similarity of the topic influence vectors of the words corresponding to each pair of nodes joined by an edge.
Specifically, the second computing module 422 includes:
a configuration submodule 4221, configured to use the similarity between two words as the edge value of the edge connecting the corresponding nodes in the word graph of the target text;
an accumulation submodule 4222, configured to accumulate the edge values of all edges connected to a node in the word graph of the target text to obtain the importance of the corresponding word.
For example, in the word graph of the target text shown in Fig. 5, node A corresponds to word A, node B to word B, node C to word C, node D to word D, and node E to word E, and node B has edges to nodes A, C, D, and E. The similarity between word B and word A is then used as the edge value ba of the edge between nodes B and A, the similarity between word B and word C as the edge value bc of the edge between nodes B and C, the similarity between word B and word D as the edge value bd of the edge between nodes B and D, and the similarity between word B and word E as the edge value be of the edge between nodes B and E. The importance of word B in the target text is obtained by accumulating the edge values of the edges connected to node B, that is, from the sum ba+bc+bd+be.
Specifically, the computing unit 42 further includes:
a setup module 423, configured to set the topic influence vector of each word as the weight of the corresponding node in the word graph of the target text;
a third computing module 424, configured to calculate the importance of each word in the target text according to TextRank, a keyword extraction algorithm based on a word graph model, and the node weights.
In the embodiment of the present invention, the topic influence vector of a word is set as the weight of the corresponding node in the word graph of the target text, so that the importance of the word in the target text is measured by its topic influence vector. The step of assigning values to the words of the target text by human experience is thereby omitted, and the iterative word-importance formula of the TextRank algorithm is improved. Calculating the importance of each word according to the TextRank algorithm and the node weights can therefore improve the efficiency and intelligence of keyword extraction.
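One possible reading of a TextRank iteration with node weights can be sketched in Python as follows. This is a sketch under stated assumptions and not the formula of the embodiment, which is not reproduced in this passage: each topic influence vector is assumed to have been reduced to a single non-negative scalar weight (for instance by summing its components), the normalized weight replaces the uniform term of the usual update, and the function name weighted_textrank, the damping factor 0.85, and the iteration count are hypothetical.

```python
def weighted_textrank(graph, node_weight, damping=0.85, iterations=50):
    """Iterate a TextRank-style score over an undirected word graph.

    graph:       {word: set of neighboring words}, with no isolated nodes.
    node_weight: {word: non-negative scalar weight}, with a positive total.
    """
    total = sum(node_weight[v] for v in graph)
    bias = {v: node_weight[v] / total for v in graph}  # normalized node weights
    score = dict(bias)
    for _ in range(iterations):
        score = {
            v: (1 - damping) * bias[v]
            # Each neighbor u passes on its score, split evenly over its edges.
            + damping * sum(score[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
    return score


# Word graph of the ABCDBEA example with hypothetical, equal node weights.
graph = {
    "A": {"B", "E"},
    "B": {"A", "C", "D", "E"},
    "C": {"B", "D"},
    "D": {"B", "C"},
    "E": {"A", "B"},
}
scores = weighted_textrank(graph, {word: 1.0 for word in graph})
```

With equal weights the scores depend only on the graph structure; assigning a larger weight to a node increases its share of the bias term, which is how a topic-derived weight would influence the ranking in this sketch.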
The selection unit 43 is specifically configured to select the word with the highest importance from the target text as the keyword of the target text.
It should be noted that, for other descriptions of the functional units of the keyword extraction device provided by the embodiment of the present invention, reference may be made to the corresponding description of the method shown in Fig. 2, which is not repeated here. It should be understood that the device in this embodiment can implement all of the content of the foregoing method embodiment.
In this other keyword extraction device provided by the embodiment of the present invention, because the structure of the target text and the topic information it contains are important bases for keyword extraction, the embodiment obtains the topic influence vector of each word in the target text based on the LDA topic model, then calculates the importance of each word according to the word graph of the target text and the similarity between the words, and finally takes the word with the highest importance as the keyword of the target text. In other words, the embodiment extracts keywords by combining the LDA topic model with the TextRank algorithm. Because the topic influence vector of a word can serve as a measure of the word's importance to the topics of the target text, and the vector obtained from the document topic generation model can accurately indicate the word's influence on those topics, the embodiment can improve the efficiency and intelligence of keyword extraction.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
It can be understood that related features in the above method and device may refer to one another. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not indicate that any embodiment is superior or inferior.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, device, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in various programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided here. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment can be combined into one module, unit, or component, and can also be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) can be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
The various component embodiments of the invention can be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components of the keyword extraction method and device according to the embodiments of the invention. The invention can also be implemented as a device or device program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention can be stored on a computer-readable medium or can take the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words can be interpreted as names.

Claims (10)

1. A keyword extraction method, characterized by comprising:
obtaining a topic influence vector of each word in a target text, the topic influence vector of a word indicating the word's influence on the topics in the target text;
wherein obtaining the topic influence vector of each word in the target text comprises:
calculating, by means of the document topic generation model LDA, the probability that each topic appears in the target text and the probability that each word appears in each topic;
performing a dot-product calculation between the probability that each topic appears in the target text and the probability that each word appears in each topic, to obtain the topic influence vector of each word in the target text;
calculating the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words, the importance indicating the degree of association between a word and the target text;
wherein calculating the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words comprises:
calculating the similarity between the words in the target text from the topic influence vectors of the words;
calculating the importance of each word in the target text according to the word graph of the target text and the similarity between the words;
wherein calculating the importance of each word in the target text according to the word graph of the target text and the similarity between the words comprises:
using the similarity between two words as the edge value of the edge connecting the corresponding nodes in the word graph of the target text;
accumulating the edge values of all edges connected to a node in the word graph of the target text to obtain the importance of the corresponding word;
selecting, from the target text, the words that meet a preset importance as the keywords of the target text.
2. The method according to claim 1, characterized in that, before calculating the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words, the method further comprises:
constructing the word graph of the target text by using the words in the target text as the nodes of the word graph and the positional adjacency of words in the target text as the edges between the nodes.
3. The method according to claim 1, characterized in that calculating the similarity between the words in the target text from the topic influence vectors of the words comprises:
obtaining the words corresponding to each pair of nodes joined by an edge in the word graph of the target text;
determining the similarity between the words by calculating the cosine similarity of the topic influence vectors of the words corresponding to each pair of nodes joined by an edge.
4. The method according to claim 2, characterized in that calculating the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words comprises:
setting the topic influence vector of each word as the weight of the corresponding node in the word graph of the target text;
calculating the importance of each word in the target text according to the keyword extraction algorithm TextRank, which is based on a word graph model, and the node weights.
5. The method according to any one of claims 1 to 4, characterized in that selecting, from the target text, the words that meet a preset importance as the keywords of the target text comprises:
selecting the word with the highest importance from the target text as the keyword of the target text.
6. A keyword extraction device, characterized by comprising:
an acquiring unit, configured to obtain a topic influence vector of each word in a target text, the topic influence vector of a word indicating the word's influence on the topics in the target text;
wherein the acquiring unit comprises:
a computing module, configured to calculate, by means of the document topic generation model LDA, the probability that each topic appears in the target text and the probability that each word appears in each topic;
a dot-product module, configured to perform a dot-product calculation between the probability that each topic appears in the target text and the probability that each word appears in each topic, to obtain the topic influence vector of each word in the target text;
a computing unit, configured to calculate the importance of each word in the target text according to the word graph of the target text and the topic influence vectors of the words, the importance indicating the degree of association between a word and the target text;
wherein the computing unit comprises:
a first computing module, configured to calculate the similarity between the words in the target text from the topic influence vectors of the words;
a second computing module, configured to calculate the importance of each word in the target text according to the word graph of the target text and the similarity between the words;
wherein the second computing module comprises:
a configuration submodule, configured to use the similarity between two words as the edge value of the edge connecting the corresponding nodes in the word graph of the target text;
an accumulation submodule, configured to accumulate the edge values of all edges connected to a node in the word graph of the target text to obtain the importance of the corresponding word;
a selection unit, configured to select, from the target text, the words that meet a preset importance as the keywords of the target text.
7. The device according to claim 6, characterized in that the device further comprises:
a construction unit, configured to construct the word graph of the target text by using the words in the target text as the nodes of the word graph and the positional adjacency of words in the target text as the edges between the nodes.
8. The device according to claim 6, characterized in that the first computing module comprises:
an acquisition submodule, configured to obtain the words corresponding to each pair of nodes joined by an edge in the word graph of the target text;
a determination submodule, configured to determine the similarity between the words by calculating the cosine similarity of the topic influence vectors of the words corresponding to each pair of nodes joined by an edge.
9. The device according to claim 7, characterized in that the computing unit further comprises:
a setup module, configured to set the topic influence vector of each word as the weight of the corresponding node in the word graph of the target text;
a third computing module, configured to calculate the importance of each word in the target text according to the keyword extraction algorithm TextRank, which is based on a word graph model, and the node weights.
10. The device according to any one of claims 6 to 9, characterized in that the selection unit is specifically configured to select the word with the highest importance from the target text as the keyword of the target text.
CN201610751325.2A 2016-08-29 2016-08-29 The extracting method and device of keyword Active CN106469187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610751325.2A CN106469187B (en) 2016-08-29 2016-08-29 The extracting method and device of keyword

Publications (2)

Publication Number Publication Date
CN106469187A CN106469187A (en) 2017-03-01
CN106469187B true CN106469187B (en) 2019-12-03

Family

ID=58229950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610751325.2A Active CN106469187B (en) 2016-08-29 2016-08-29 The extracting method and device of keyword

Country Status (1)

Country Link
CN (1) CN106469187B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106997382B (en) * 2017-03-22 2020-12-01 山东大学 Innovative creative tag automatic labeling method and system based on big data
CN107220232B (en) * 2017-04-06 2021-06-11 北京百度网讯科技有限公司 Keyword extraction method and device based on artificial intelligence, equipment and readable medium
CN107193973B (en) * 2017-05-25 2021-07-20 百度在线网络技术(北京)有限公司 Method, device and equipment for identifying field of semantic analysis information and readable medium
CN107193803B (en) * 2017-05-26 2020-07-10 北京东方科诺科技发展有限公司 Semantic-based specific task text keyword extraction method
CN108304377B (en) * 2017-12-28 2021-08-06 东软集团股份有限公司 Extraction method of long-tail words and related device
CN108846023A (en) * 2018-05-24 2018-11-20 普强信息技术(北京)有限公司 The unconventional characteristic method for digging and device of text
CN110705282A (en) * 2019-09-04 2020-01-17 东软集团股份有限公司 Keyword extraction method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092950A (en) * 2013-01-15 2013-05-08 重庆邮电大学 Online public opinion geographical location real time monitoring system and method
CN105488023A (en) * 2015-03-20 2016-04-13 广州爱九游信息技术有限公司 Text similarity assessment method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
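Topic and keyword re-ranking for LDA-based topic modeling; Yangqiu Song et al.; 《Proceedings of the 18th ACM conference on Information and knowledge management》; 20091130; pp. 1757-1760 *
Keyword extraction algorithm based on graphs and the LDA topic model (基于图和LDA主题模型的关键词抽取算法); Liu Xiaojian et al.; 《情报学报》; 20160630; Vol. 35, No. 6; pp. 664-672, main text sections 3.2, 3.4 and 5, Fig. 4 *
Research on keyword extraction combining LDA and TextRank (融合LDA与TextRank的关键词抽取研究); Gu Yijun et al.; 《现代图书情报技术》; 20141231 (No. 248/249); pp. 41-47, main text sections 3 and 4, Fig. 1 *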

Also Published As

Publication number Publication date
CN106469187A (en) 2017-03-01

Similar Documents

Publication Publication Date Title
CN106469187B (en) The extracting method and device of keyword
CN103440335B (en) Video recommendation method and device
CN106611052B (en) The determination method and device of text label
CN105893478B (en) A kind of tag extraction method and apparatus
CN106844314B (en) A kind of duplicate checking method and device of article
CN107729322B (en) Word segmentation method and device and sentence vector generation model establishment method and device
CN105550170B (en) A kind of Chinese word cutting method and device
CN109710948A (en) MT engine recommended method and device
Fernandez-Viagas et al. A new set of high-performing heuristics to minimise flowtime in permutation flowshops
Vrard et al. Helium signature in red giant oscillation patterns observed by Kepler
CN108563703A (en) A kind of determination method of charge, device and computer equipment, storage medium
CN106649288A (en) Translation method and device based on artificial intelligence
CN104462554B (en) Question and answer page relevant issues recommended method and device
CN106528755A (en) Hot topic generation method and device
CN107193806B (en) A kind of automatic prediction method and device that vocabulary justice is former
CN110515838A (en) Method and system for detecting software defects based on topic model
CN109117475B (en) Text rewriting method and related equipment
CN109948140A (en) A kind of term vector embedding grammar and device
CN108153730A (en) A kind of polysemant term vector training method and device
CN105589976B (en) Method and device is determined based on the target entity of semantic relevancy
CN102298618B (en) Method for obtaining matching degree to execute corresponding operations and device and equipment
CN103870563B (en) It is determined that the method and apparatus of the theme distribution of given text
CN110019806A (en) A kind of document clustering method and equipment
CN110489744A (en) A kind of processing method of corpus, device, electronic equipment and storage medium
Fayolle et al. p-Laplace diffusion for distance function estimation, optimal transport approximation, and image enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant