CN106970910B - Keyword extraction method and device based on graph model - Google Patents

Keyword extraction method and device based on graph model

Info

Publication number
CN106970910B
Authority
CN
China
Prior art keywords
word
candidate
value
text
vector
Prior art date
Legal status
Active
Application number
CN201710208956.4A
Other languages
Chinese (zh)
Other versions
CN106970910A (en)
Inventor
王亮
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201710208956.4A
Publication of CN106970910A
Application granted
Publication of CN106970910B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G06F16/9024 - Graphs; Linked lists
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a keyword extraction method and device based on a graph model. The method comprises the following steps: acquiring a text to be processed and performing word segmentation on it to obtain candidate keywords corresponding to the text to be processed; searching a word vector model for the word vector corresponding to each candidate keyword, wherein the word vector model comprises the word vectors of the candidate keywords; constructing a word similarity matrix of the candidate keywords according to the word vectors; and ranking the candidate keywords according to the word similarity matrix of the candidate keywords and extracting the keywords of the text to be processed. By applying the embodiment of the invention, the accuracy of keyword extraction is effectively improved.

Description

Keyword extraction method and device based on graph model
Technical Field
The invention relates to the technical field of keyword extraction, in particular to a keyword extraction method and device based on a graph model.
Background
Keywords, as representative words of a piece of text, have been widely applied to information retrieval, text classification, and the like. Graph-model-based keyword extraction has been widely applied to search ranking, citation analysis, social networks, natural language processing (such as keyword extraction and extraction of topic sentences of articles), and so on. The graph model is a general term for techniques that represent a probability distribution with a graph; a text can be mapped into a network graph that takes words as nodes and the association relations between words as edges. The graph-model-based keyword extraction method rests on two basic assumptions: 1. the quantity assumption: the more links a node has to other nodes, the more important that node is; 2. the quality assumption: the nodes linked to node A differ in quality, and a high-quality node passes more weight to other nodes through its links, so the more high-quality nodes link to node A, the more important node A is. The key of the graph-model-based keyword extraction method is therefore the calculation of link weights, and the link weight between two nodes is the word-to-word similarity.
The existing graph-model-based keyword extraction method divides a text into several composition units (words and sentences), builds a graph model, ranks the composition units with a voting mechanism, and then selects the top-ranked composition units as keywords. Specifically, a given text is first split into complete sentences; each sentence is then segmented and part-of-speech tagged to obtain words and their part-of-speech tags; stop words such as prepositions, auxiliary words, conjunctions, and exclamations are filtered out according to the words and their tags, words of specified parts of speech such as nouns, verbs, and adjectives are retained, and these words are taken as candidate keywords. A candidate keyword graph model is then constructed from the candidate keywords: the candidate keywords serve as the nodes of the graph model, and the association relations between candidate keywords serve as its edges, where the association relations are obtained by calculating the similarity between candidate keywords. In this graph-model-based method the similarity between words is built in a windowing manner: the words in each window vote for the adjacent windows, the voting weight depends on the number of votes a word receives, and because words co-occur between each window and its adjacent window, the similarity between words can only be obtained from word co-occurrence. Finally, the votes of the candidate keywords are iterated on the graph to obtain the vote ranking of the candidate keywords, and the candidate keywords with the most votes are selected as the keywords.
However, in the existing graph-model-based keyword extraction method, the similarity between words can only be obtained from word co-occurrence, so frequently repeated words are weighted too heavily: words such as "content", "calculation", "processing", and "resolution" should not become keywords, yet because they are repeated many times they receive the highest ranking, which lowers the accuracy of keyword extraction. In addition, the extraction result is sensitive to the window size, which must be set manually. For example, if a sentence consists of the words w1, w2, w3, w4, w5, …, wn in order and the window size is set to k, then (w1, w2, …, wk), (w2, w3, …, w(k+1)), (w3, w4, …, w(k+2)), and so on are all windows, and an undirected, unweighted edge exists between the nodes corresponding to any two words in a window. Choosing different window sizes can therefore lead to distinctly different results and low keyword extraction accuracy.
Disclosure of Invention
The embodiment of the invention aims to provide a keyword extraction method and device based on a graph model, so that the keyword extraction accuracy is improved. The specific technical scheme is as follows:
the embodiment of the invention discloses a keyword extraction method based on a graph model, which comprises the following steps:
acquiring a text to be processed, and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed;
searching a word vector corresponding to the candidate keyword in a word vector model, wherein the word vector model comprises word vectors of the candidate keyword;
constructing a word similarity matrix of the candidate keywords according to the word vectors;
and ranking the candidate keywords according to the word similarity matrix of the candidate keywords, and extracting the keywords of the text to be processed.
Optionally, the constructing a word similarity matrix of the candidate keyword according to the word vector includes:
according to the formula:
cos θ = ( Σ_{k=1}^{n} x_{1k}·x_{2k} ) / ( √(Σ_{k=1}^{n} x_{1k}²) · √(Σ_{k=1}^{n} x_{2k}²) )
calculating the cosine of the angle between the word vectors of the candidate keywords, wherein θ denotes the angle between the vectors of two candidate keywords, x_{1k} denotes the k-th component of the vector of one candidate keyword in n-dimensional space, x_{2k} denotes the k-th component of the vector of the other candidate keyword, and n denotes the dimension of the vector space;
and constructing the similarity matrix of the candidate keywords according to the cosine value of the included angle of the word vectors.
Optionally, the sorting the candidate keywords according to the word similarity matrix of the candidate keywords includes:
calculating a word similarity matrix of the candidate keywords according to a PageRank algorithm to obtain corresponding PageRank values of the candidate keywords;
sorting the candidate keywords according to the PageRank value to obtain the importance degree of the candidate keywords;
and extracting the keywords of the text to be processed according to the importance degree.
Optionally, the calculating a word similarity matrix of the candidate keyword according to the PageRank algorithm includes:
determining an initial value of the PageRank algorithm according to the order number of the word similarity matrix;
calculating an initial characteristic vector value of the candidate keyword according to the initial value and the word similarity matrix;
according to the formula:
p_t = M^T · p_{t-1}
calculating the feature vector value of the candidate keywords, wherein when t = 1, p_1 denotes the initial feature vector value and p_0 denotes the initial value; p_t denotes the feature vector value of the word similarity matrix at step t, p_{t-1} denotes the feature vector value of the word similarity matrix at step t-1, M denotes the word similarity matrix of the candidate keywords, M^T denotes the transpose of the word similarity matrix, and t denotes the number of calculation steps, with t ≥ 1;
and when the norm of the difference between the feature vector value at step t and the feature vector value at step t-1 is smaller than the error tolerance of the PageRank algorithm, the feature vector value at step t is the PageRank value corresponding to the candidate keywords.
Optionally, the obtaining a text to be processed and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed includes:
obtaining a text to be processed, and segmenting the text to be processed to obtain stop words and words with specified parts of speech, wherein the stop words at least comprise prepositions, auxiliary words, conjunctions and exclamation words, and the words with the specified parts of speech at least comprise nouns, verbs and adjectives;
and filtering the stop words to obtain the words with the specified part of speech, wherein the words with the specified part of speech are candidate keywords corresponding to the text to be processed.
Optionally, the word vector is obtained by word2vec training.
The embodiment of the invention also discloses a keyword extraction device based on the graph model, which comprises:
the acquisition module is used for acquiring a text to be processed and segmenting the text to be processed to obtain candidate keywords corresponding to the text to be processed;
the searching module is used for searching word vectors corresponding to the candidate keywords in a word vector model, and the word vector model comprises the word vectors of the candidate keywords;
the processing module is used for constructing a word similarity matrix of the candidate keywords according to the word vectors;
and the extraction module is used for ranking the candidate keywords according to the word similarity matrix of the candidate keywords and extracting the keywords of the text to be processed.
Optionally, the processing module includes:
a first calculation unit for, according to the formula:
cos θ = ( Σ_{k=1}^{n} x_{1k}·x_{2k} ) / ( √(Σ_{k=1}^{n} x_{1k}²) · √(Σ_{k=1}^{n} x_{2k}²) )
calculating the cosine of the angle between the word vectors of the candidate keywords, wherein θ denotes the angle between the vectors of two candidate keywords, x_{1k} denotes the k-th component of the vector of one candidate keyword in n-dimensional space, x_{2k} denotes the k-th component of the vector of the other candidate keyword, and n denotes the dimension of the vector space;
and the constructing unit is used for constructing the candidate keyword similarity matrix according to the cosine value of the word vector included angle.
Optionally, the extracting module includes:
the second calculation unit is used for calculating a word similarity matrix of the candidate keywords according to a PageRank algorithm to obtain corresponding PageRank values of the candidate keywords;
the sorting unit is used for sorting the candidate keywords according to the PageRank value to obtain the importance degree of the candidate keywords;
and the extraction unit is used for extracting the keywords of the text to be processed according to the importance degree.
Optionally, the second computing unit includes:
the first determining subunit is used for determining an initial value of the PageRank algorithm according to the order number of the word similarity matrix;
the first calculating subunit is used for calculating an initial characteristic vector value of the candidate keyword according to the initial value and the word similarity matrix;
a second calculation subunit configured to, according to the formula:
p_t = M^T · p_{t-1}
calculating the feature vector value of the candidate keywords, wherein when t = 1, p_1 denotes the initial feature vector value and p_0 denotes the initial value; p_t denotes the feature vector value of the word similarity matrix at step t, p_{t-1} denotes the feature vector value of the word similarity matrix at step t-1, M denotes the word similarity matrix of the candidate keywords, M^T denotes the transpose of the word similarity matrix, and t denotes the number of calculation steps, with t ≥ 1;
and a second determining subunit, configured to determine, when the norm of the difference between the feature vector value at step t and the feature vector value at step t-1 is smaller than the error tolerance of the PageRank algorithm, that the feature vector value at step t is the PageRank value corresponding to the candidate keywords.
Optionally, the obtaining module includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a text to be processed and segmenting the text to be processed to obtain stop words and words with specified parts of speech, the stop words at least comprise prepositions, auxiliary words, conjunctions and exclamation words, and the words with the specified parts of speech at least comprise nouns, verbs and adjectives;
and the processing unit is used for filtering the stop words to obtain the words with the specified parts of speech, wherein the words with the specified parts of speech are the candidate keywords corresponding to the text to be processed.
Optionally, the word vector is obtained by word2vec training.
According to the graph-model-based keyword extraction method and device provided by the embodiment of the invention, the similarity between words in the text is calculated from word vectors and a similarity matrix is constructed, so that the extracted keywords reflect, to a certain extent, their semantic importance in the current text. When the similarity matrix is constructed, the similarity between words is computed from word vectors rather than from word co-occurrence. This avoids the over-weighting of repeated words caused by co-occurrence during keyword extraction, removes the need to set a window size manually, and selects, through semantic similarity, keywords that better match the topic of the document, thereby improving the accuracy of keyword extraction. Of course, it is not necessary for any product or method embodying the invention to achieve all of the above advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic structural diagram of a graph model in a conventional graph model-based keyword extraction method;
fig. 2 is a flowchart of a keyword extraction method based on a graph model according to an embodiment of the present invention;
fig. 3 is a structural diagram of a keyword extraction apparatus based on a graph model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The graph-model-based keyword extraction method is an effective way to extract keywords. The graph model is a general term for techniques that represent a probability distribution with a graph; a text can be mapped into a network graph that takes words as nodes and the association relations between words as edges. As shown in fig. 1, fig. 1 is a schematic structural diagram of a graph model in a conventional graph-model-based keyword extraction method, where w1, w2, w3 … w10 and w11 are candidate keywords and also the nodes of the graph model; the edges formed by the lines between nodes represent the association relations between the candidate keywords, and a thicker line indicates a larger edge weight, i.e., a stronger association between the two keywords connected by that edge. The present invention extracts keywords on the basis of such a graph model.
Referring to fig. 2, fig. 2 is a flowchart of a keyword extraction method based on a graph model according to an embodiment of the present invention, including the following steps:
s201, obtaining a text to be processed, and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed.
Specifically, a text to be processed is obtained and segmented into words; the purpose of word segmentation is to split the text to be processed according to certain rules so that candidate keywords can be extracted. Chinese is expressed in characters, words, phrases, idioms, and so on, so Chinese word segmentation carries considerable uncertainty. The main word segmentation approaches at present are: string-matching (mechanical) segmentation, whose algorithms are mature and widely used and which segments by matching the text against dictionary entries, its effectiveness depending on the completeness of the dictionary; understanding-based segmentation, i.e., artificial-intelligence methods, which achieve high segmentation accuracy but use complex algorithms; and statistics-based segmentation, which has the advantage of recognizing unknown words and proper nouns but requires a large amount of training text. These segmentation methods offer high accuracy and fast segmentation systems. Here, an existing word segmentation method is used to segment the text to be processed; stop words such as prepositions, auxiliary words, conjunctions, and interjections are automatically filtered out, words of specified parts of speech such as nouns, verbs, and adjectives are retained, and the words of the specified parts of speech are taken as candidate keywords. In this way, the candidate keywords corresponding to the text to be processed are obtained.
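As an illustration of step S201, the minimal sketch below segments a Chinese text and keeps only nouns, verbs, and adjectives as candidate keywords. The patent does not name a particular segmenter, so the use of the jieba part-of-speech tagger and the exact tag prefixes are assumptions made for this example.

```python
import jieba.posseg as pseg

# POS-tag prefixes kept as candidate keywords: nouns (n*), verbs (v*), adjectives (a*)
KEEP_POS_PREFIXES = ("n", "v", "a")

def extract_candidates(text):
    """Segment the text and keep only words of the specified parts of speech."""
    candidates = []
    for token in pseg.cut(text):          # token.word is the word, token.flag its POS tag
        if token.flag.startswith(KEEP_POS_PREFIXES):
            candidates.append(token.word)
    return candidates
```

Because only the specified parts of speech are kept, stop words such as prepositions, auxiliary words, conjunctions, and interjections are filtered out implicitly.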
S202, searching word vectors corresponding to the candidate keywords in the word vector model, wherein the word vector model comprises the word vectors of the candidate keywords.
Generally, a neural network takes a word from a vocabulary as input, outputs a low-dimensional vector to represent the word, and then continuously optimizes its parameters by back-propagation. The output low-dimensional vectors are the parameters of the first layer of the neural network. Neural network models that produce word vectors fall into two types: one is a model such as word2vec or GloVe (Global Vectors for Word Representation) whose very purpose is to generate word vectors; the other generates word vectors as a by-product of some other task. One difference between the two types is the amount of computation involved. Another difference is the training objective: word2vec and GloVe aim to train word vectors that capture semantic relations and can be reused in subsequent tasks; if a subsequent task does not need semantic relations, word vectors generated in this way are of little use. The other type of model trains word vectors according to the requirements of a specific task. Of course, if the specific task is language modelling, the word vectors produced by the two types of model are very similar.
Specifically, a way is first needed to turn the natural-language understanding problem into a machine-learning problem mathematically. The word vector has good semantic properties and is a common way of representing word features. A word vector is a multi-dimensional real-valued vector that encodes the semantic and grammatical relations of natural language; the value of each dimension represents a feature with some semantic or grammatical interpretation, so each dimension of the word vector may be called a word feature. Word vectors use the Distributed Representation, a low-dimensional real-valued vector representation, and word vector computation maps the words of a vocabulary to fixed-length vectors by training. The Distributed Representation is a dense, low-dimensional real-valued vector in which each dimension represents a latent feature of the word; it captures useful syntactic and semantic features, and its characteristic is that the different syntactic and semantic features of a word are distributed over all of its dimensions. This low-dimensional representation avoids the curse of dimensionality, mines the association attributes between words, allows the similarity between two words to be obtained by computing the distance between their word vectors, and improves the accuracy of the vector semantics.
The word vector model contains the word vectors corresponding to the candidate keywords. The word vector corresponding to each candidate keyword is looked up in the word vector model mainly in order to calculate the distances between candidate keywords and thus obtain the similarity between them. Word vectors are introduced into the existing graph-model-based keyword extraction method, and the similarity between candidate keywords is calculated from the word vectors; this avoids the problems of the existing method, in which the similarity between words is built in a windowing manner and the window size must be set manually, leading to low extraction accuracy for candidate keywords.
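A minimal sketch of the lookup in step S202, assuming the word vectors were trained with word2vec and stored with gensim; the file name and the choice of the gensim KeyedVectors API are illustrative, not prescribed by the patent.

```python
from gensim.models import KeyedVectors

# hypothetical file produced by a word2vec training run (see the training sketch later)
word_vectors = KeyedVectors.load("word2vec_candidates.kv")

def lookup_vectors(candidates, kv):
    """Return the word vector for each candidate keyword present in the model."""
    return {w: kv[w] for w in candidates if w in kv}
```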
S203, constructing a word similarity matrix of the candidate keywords according to the word vectors.
Specifically, the cosine distance between word vectors expresses how close the relation between two words is; that is, the similarity between candidate keywords is obtained by calculating the cosine distance between their word vectors. The similarities obtained between candidate keywords are expressed as numerical values, and these values form the elements of the word similarity matrix, which is an N-th order matrix. As shown in table 1, A, B, C, D, E, F, G, H in the table denote the word vectors of the candidate keywords, and the values in the table are the cosine distances between the word vectors, i.e., the similarities between the candidate keywords.
TABLE 1
[Table 1 in the original lists the pairwise cosine distances between the word vectors A-H; the numerical values are shown only in the figure and are not reproduced here.]
Then according to the similarity between these candidate keywords, a similarity matrix of the candidate keywords is constructed, which is denoted by M, that is
M = (m_ij), i, j = 1, 2, …, N, where each element m_ij is the cosine distance (similarity) between the word vectors of the i-th and j-th candidate keywords.
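A short sketch of step S203 under the assumption that NumPy is used: each word vector is length-normalised, and the matrix product of the normalised vectors then gives the pairwise cosine similarities, i.e., the word similarity matrix M described above.

```python
import numpy as np

def build_similarity_matrix(vectors):
    """vectors: a list of 1-D numpy arrays, one word vector per candidate keyword."""
    X = np.vstack(vectors).astype(float)
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # row-normalise each word vector
    return X_unit @ X_unit.T                               # M[i, j] = cos(theta_ij)
```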
And S204, ranking the candidate keywords according to the word similarity matrix of the candidate keywords, and extracting the keywords of the text to be processed.
Specifically, the word similarity matrix of the candidate keywords is computed by the keyword ranking algorithm of the graph-model-based keyword extraction method to obtain a ranking value for each candidate keyword. The candidate keywords are then sorted by their ranking values. Finally, the top-ranked candidate keywords are selected as the keywords of the text to be processed, where the number of top-ranked candidate keywords to select is chosen according to actual needs.
Therefore, according to the graph-model-based keyword extraction method provided by the embodiment of the invention, the similarity between words in the text is calculated from word vectors and a similarity matrix is constructed, so that the extracted keywords reflect, to a certain extent, their semantic importance in the current text. When the similarity matrix is constructed, the similarity between words is computed from word vectors rather than from word co-occurrence. This avoids the over-weighting of repeated words caused by co-occurrence during keyword extraction, removes the need to set a window size manually, and selects, through semantic similarity, keywords that better match the topic of the document, thereby improving the accuracy of keyword extraction.
In an optional embodiment of the present invention, constructing a word similarity matrix of the candidate keyword according to the word vector includes:
according to the formula:
cos θ = ( Σ_{k=1}^{n} x_{1k}·x_{2k} ) / ( √(Σ_{k=1}^{n} x_{1k}²) · √(Σ_{k=1}^{n} x_{2k}²) )
calculating the cosine of the angle between the word vectors of the candidate keywords, wherein θ denotes the angle between the vectors of two candidate keywords, x_{1k} denotes the k-th component of the vector of one candidate keyword in n-dimensional space, x_{2k} denotes the k-th component of the vector of the other candidate keyword, and n denotes the dimension of the vector space.
And constructing a candidate keyword similarity matrix according to the cosine value of the word vector included angle.
Specifically, the similarity between words is obtained by calculating the distance between word vectors, and that distance is measured by the cosine of the angle between the word vectors. The method therefore first calculates the cosine of the angle between the word vectors corresponding to the candidate keywords and then constructs the similarity matrix of the candidate keywords from these cosine values.
The cosine of the angle between the word vectors of two candidate keywords is given by the cosine formula for the angle between two n-dimensional vectors. In n-dimensional space, for two vectors A = (x_{11}, x_{12}, …, x_{1n}) and B = (x_{21}, x_{22}, …, x_{2n}), the cosine of the angle between A and B is calculated as:
cos θ = ( Σ_{k=1}^{n} x_{1k}·x_{2k} ) / ( √(Σ_{k=1}^{n} x_{1k}²) · √(Σ_{k=1}^{n} x_{2k}²) )
where θ denotes the angle between vector A and vector B, x_{1k} denotes the k-th component of vector A, x_{2k} denotes the k-th component of vector B, and n denotes the dimension of the vector space.
For example, in two-dimensional space, for two vectors A = (x_{11}, x_{12}) and B = (x_{21}, x_{22}), the cosine of the angle between A and B is calculated as:
cos θ = ( x_{11}·x_{21} + x_{12}·x_{22} ) / ( √(x_{11}² + x_{12}²) · √(x_{21}² + x_{22}²) )
where θ denotes the angle between vector A and vector B, x_{11} and x_{12} denote the components of vector A, and x_{21} and x_{22} denote the components of vector B.
In three-dimensional space, for two vectors A = (x_{11}, x_{12}, x_{13}) and B = (x_{21}, x_{22}, x_{23}), the cosine of the angle between A and B is calculated as:
cos θ = ( x_{11}·x_{21} + x_{12}·x_{22} + x_{13}·x_{23} ) / ( √(x_{11}² + x_{12}² + x_{13}²) · √(x_{21}² + x_{22}² + x_{23}²) )
where θ denotes the angle between vector A and vector B, x_{11}, x_{12} and x_{13} denote the components of vector A, and x_{21}, x_{22} and x_{23} denote the components of vector B.
The cosine of the angle between two vectors in higher-dimensional spaces is not listed case by case; every computation that follows the cosine formula for the angle between n-dimensional vectors falls within the protection scope of the invention.
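For reference, the n-dimensional cosine formula above can be written directly as a small function; this is only an illustrative restatement of the formula, equivalent to the normalised dot product used when building the similarity matrix.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two n-dimensional vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# two-dimensional example: cosine([1.0, 0.0], [1.0, 1.0]) ≈ 0.7071
```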
In the embodiment of the present invention, ranking the candidate keywords according to the word similarity matrix of the candidate keywords includes:
calculating a word similarity matrix of the candidate keywords according to a PageRank algorithm to obtain corresponding PageRank values of the candidate keywords;
specifically, the Pagerank algorithm is part of the Google ranking algorithm (ranking formula), is a method used by Google to identify the rank/importance of a web page, and is the only standard used by Google to measure the quality of a web site. The present invention ranks keywords by the principles of the Pagerank algorithm. And calculating a word similarity matrix of the candidate keywords through a PageRank algorithm, and finally obtaining the corresponding PageRank values of the candidate keywords through the iterative regression algorithm.
Sorting the candidate keywords according to the PageRank value to obtain the importance degree of the candidate keywords;
here, the maximum Pagerank value of the candidate keyword indicates that, when the user searches for a keyword, the keyword is the keyword in which the user is most interested, and the other keywords are sequentially decreased. For example, the obtained candidate keywords are ranked in order of B: 1.47, H: 1.41, E: 1.39, A: 1.30, F: 1.14, G: 1.12, D: 1.09, C: 1.08, the most important candidate keyword B is shown, and the importance degrees of other candidate keywords are sequentially reduced according to the ranking.
And extracting the keywords of the text to be processed according to the importance degree.
Here, the top ranked (top N) candidate keywords are extracted as keywords of the text to be processed, as actually required.
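Once a PageRank value has been computed for every candidate keyword, the ranking and extraction step reduces to a sort; the sketch below assumes the values are held in a dictionary and that the number of keywords to keep (top_n) is chosen according to actual needs.

```python
def top_keywords(pagerank_values, top_n=5):
    """pagerank_values: dict mapping each candidate keyword to its PageRank value."""
    ranked = sorted(pagerank_values.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]
```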
In the embodiment of the invention, the word similarity matrix of the candidate keywords is calculated according to the PageRank algorithm, and the method comprises the following steps:
determining an initial value of a PageRank algorithm according to the order number of the word similarity matrix;
specifically, the initial value of the PageRank algorithm is determined according to the size N of the matrix, namely
Figure BDA0001260551400000121
p0Represents the initial value of the PageRank algorithm. Here, since the PageRank algorithm assumes that the probability of each web page is equalThus, it is assumed according to the PageRank algorithm that the probability of occurrence of each candidate keyword is equal, i.e.
Figure BDA0001260551400000122
And will be
Figure BDA0001260551400000123
As an initial value for the PageRank algorithm. Calculating an initial characteristic vector value of the candidate keyword according to the initial value and the word similarity matrix;
in particular, according to the formula
p1=MTp0
Calculating initial feature vector values of the candidate keywords, wherein p1Representing the initial feature vector value, p, of the PageRank algorithm0Representing an initial value of the PageRank algorithm, M representing a word similarity matrix of candidate keywords, MTRepresenting the transpose of the word similarity matrix.
According to the formula:
p_t = M^T · p_{t-1}
the feature vector value of the candidate keywords is calculated, where when t = 1, p_1 denotes the initial feature vector value and p_0 denotes the initial value; p_t denotes the feature vector value of the word similarity matrix at step t, p_{t-1} denotes the feature vector value of the word similarity matrix at step t-1, M denotes the word similarity matrix of the candidate keywords, M^T denotes the transpose of the word similarity matrix, and t denotes the number of calculation steps, with t ≥ 1;
specifically, the PageRank algorithm is an iterative regression algorithm, and the final PageRank value corresponding to the candidate keyword is obtained by repeatedly and iteratively calculating the word similarity matrix of the candidate keyword, so that the accuracy of the extracted key is more accurate.
When the norm of the difference between the feature vector value at step t and the feature vector value at step t-1 is smaller than the error tolerance of the PageRank algorithm, the feature vector value at step t is the PageRank value corresponding to the candidate keywords.
Here, because errors exist in the vector calculation process, an error tolerance e is preset in the PageRank algorithm. When the norm of the difference between the feature vector value at step t and the feature vector value at step t-1 is smaller than the error tolerance, the PageRank values obtained for the candidate keywords are more accurate, which helps improve the accuracy of keyword extraction. The specific algorithm is as follows:
[The original figure presents this procedure as pseudocode; its steps 1 to 8 are described below.]
the specific process comprises the following steps:
first, the PageRank algorithm is implemented by inputting a random, irreducible, aperiodic matrix M, the size of the matrix N, and the tolerance of the error. Here, the matrix M is constructed by a word vector, i.e., a word similarity matrix in the present invention, and the size N of the matrix is the order number of the matrix. In addition, because the calculation process of the vector has errors, the PageRank algorithm presets an error tolerance e.
Then, the PageRank algorithm calculates feature vector values for the candidate keywords by:
Step 1, determine the initial value of the PageRank algorithm from the size N of the matrix, namely
p_0 = (1/N, 1/N, …, 1/N)^T
where p_0 denotes the initial value of the PageRank algorithm. Since the PageRank algorithm assumes that every web page is equally probable, it is likewise assumed that every candidate keyword appears with equal probability 1/N, and the vector (1/N, 1/N, …, 1/N)^T is taken as the initial value of the PageRank algorithm.
Step 2, set t = 0, where t denotes the number of steps computed by the PageRank algorithm; t = 0 means that the similarity matrix M has not yet been iterated over.
Steps 3 and 4, repeat the calculation, setting t = t + 1 at each iteration.
Step 5, according to the formula
p_t = M^T · p_{t-1}
calculate the feature vector value of the word similarity matrix, where p_t denotes the feature vector value of the word similarity matrix at step t, p_{t-1} denotes the feature vector value of the word similarity matrix at step t-1, M denotes the word similarity matrix of the candidate keywords, and t denotes the number of calculation steps. Since the PageRank algorithm is iterative, an accurate feature vector value of the word similarity matrix can only be obtained by repeatedly iterating over the word similarity matrix M.
Step 6, compute δ = ||p_t - p_{t-1}||.
Step 7, until δ < e: the calculation does not stop until the norm of the difference between the feature vector value of the word similarity matrix at step t and that at step t-1 is smaller than the error tolerance e.
Step 8, return p_t, which is the final feature vector value of the word similarity matrix.
Finally, the feature vector p, i.e., the final feature vector value p_t of the word similarity matrix, is output.
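A compact power-iteration sketch of steps 1 to 8 above, assuming M is the word similarity matrix built earlier. As in the patent text, no damping factor or column normalisation is applied, and the max_steps guard is an added safety measure not present in the original description.

```python
import numpy as np

def pagerank_values(M, tol=1e-6, max_steps=1000):
    """M: word similarity matrix of the candidate keywords (N x N numpy array)."""
    N = M.shape[0]
    p = np.full(N, 1.0 / N)            # step 1: p_0 = (1/N, ..., 1/N)^T
    for _ in range(max_steps):         # steps 3-5: iterate p_t = M^T p_{t-1}
        p_next = M.T @ p
        delta = np.linalg.norm(p_next - p)
        if delta < tol:                # steps 6-7: stop once delta < e
            return p_next              # step 8: return p_t
        p = p_next
    return p
```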
In the embodiment of the present invention, obtaining a text to be processed, and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed, includes:
the method comprises the steps of obtaining a text to be processed, carrying out word segmentation on the text to be processed to obtain stop words and words with appointed parts of speech, wherein the stop words at least comprise prepositions, auxiliary words, conjunctions and exclamation words, and the words with the appointed parts of speech at least comprise nouns, verbs and adjectives.
Specifically, the words obtained after segmenting the text to be processed fall into two classes: stop words and words of the specified parts of speech. In information retrieval, in order to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after processing natural-language data (or text); these characters or words are called stop words. The stop words are filtered out, leaving the words of the specified parts of speech, which are the candidate keywords corresponding to the text to be processed. Stop words appear in large numbers in a text but contribute little to its characteristics, for example function words such as "I", "then", "yes", and "additionally". To filter stop words, a stop word list is constructed, which mainly contains the adverbs, conjunctions, prepositions, and auxiliary words mentioned above. Therefore, after Chinese word segmentation, stop words must be filtered out; this effectively increases the density of keywords, greatly reduces the dimensionality of the text, and avoids the curse of dimensionality.
In the embodiment of the invention, the word vectors are obtained by word2vec training, which represents words in vector form.
Specifically, word2vec is an efficient tool open-sourced by Google in 2013 for representing words as real-valued vectors. Drawing on ideas from deep learning, it reduces the processing of text content, through training, to vector operations in a K-dimensional vector space, and similarity in that vector space can be used to represent similarity in text semantics. Word2vec uses the Distributed Representation of word vectors, first proposed by Hinton in 1986. The basic idea is to map each word, through training, to a K-dimensional real-valued vector (K is generally a hyper-parameter of the model) and to judge the semantic similarity between words by the distance between their vectors (such as cosine similarity or Euclidean distance). It uses a three-layer neural network: input layer, hidden layer, and output layer. Huffman coding is applied according to word frequency, so that words with similar frequencies produce essentially the same hidden-layer activations and more frequent words activate fewer hidden-layer units, which effectively reduces computational complexity. The word2vec algorithm is based on deep learning and, through model training, reduces the processing of text content to vector operations in a K-dimensional vector space; similarity in the vector space can represent the semantic similarity of text, words can be converted into vectors, and synonyms can be found.
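A hedged example of obtaining such word vectors with the gensim implementation of word2vec; the toy corpus, the vector size, and the output file name are placeholders rather than values prescribed by the patent.

```python
from gensim.models import Word2Vec

# toy tokenised corpus; a real corpus would contain many segmented sentences
sentences = [["关键词", "提取", "方法"], ["图", "模型", "排序", "算法"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
model.wv.save("word2vec_candidates.kv")   # same hypothetical file name used in the lookup sketch
```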
Compared with the existing keyword extraction method, the graph-model-based keyword extraction method provided by the invention achieves a better result. Table 2 compares the keyword rankings obtained by the method proposed in the present invention with those obtained by the existing keyword extraction method.
TABLE 2
[Table 2 in the original compares, for three sample texts, the keyword ranking produced by the existing extraction method with that produced by the method of the invention; the table contents are shown only in the figure and are not reproduced here.]
As can be seen from table 2, texts 1 and 2 are short texts in which every candidate keyword appears only once, so each candidate keyword has the same probability of being extracted as a keyword; the existing keyword extraction method therefore cannot accurately extract keywords from texts 1 and 2, whereas the method provided by the present invention can still rank the candidate keywords and thus extract keywords. Text 3 is a long text in which the candidate keywords also appear repeatedly. The results show that, in the keyword ranking obtained by the existing method, some words are accepted as keywords merely because they are repeated many times in the text, even though they have no practical significance; the keyword ranking obtained by the method provided by the present invention yields higher keyword extraction accuracy.
Referring to fig. 3, fig. 3 is a structural diagram of a keyword extraction apparatus based on a graph model according to an embodiment of the present invention, where the apparatus includes the following modules:
the acquiring module 301 is configured to acquire a text to be processed, and perform word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed;
a searching module 302, configured to search a word vector corresponding to the candidate keyword in a word vector model, where the word vector model includes word vectors of the candidate keyword;
the processing module 303 is configured to construct a word similarity matrix of the candidate keyword according to the word vector;
and the extraction module 304 is configured to sort the candidate keywords according to the word similarity matrix of the candidate keywords, and extract keywords of the text to be processed.
Further, the processing module 303 includes:
a first calculation unit for, according to the formula:
cos θ = ( Σ_{k=1}^{n} x_{1k}·x_{2k} ) / ( √(Σ_{k=1}^{n} x_{1k}²) · √(Σ_{k=1}^{n} x_{2k}²) )
calculating the cosine of the angle between the word vectors of the candidate keywords, wherein θ denotes the angle between the vectors of two candidate keywords, x_{1k} denotes the k-th component of the vector of one candidate keyword in n-dimensional space, x_{2k} denotes the k-th component of the vector of the other candidate keyword, and n denotes the dimension of the vector space;
and the construction unit is used for constructing a candidate keyword similarity matrix according to the cosine value of the word vector included angle.
Further, the extraction module 304 includes:
the second calculation unit is used for calculating a word similarity matrix of the candidate keywords according to a PageRank algorithm to obtain corresponding PageRank values of the candidate keywords;
the sorting unit is used for sorting the candidate keywords according to the PageRank value to obtain the importance degree of the candidate keywords;
and the extraction unit is used for extracting the keywords of the text to be processed according to the importance degree.
Further, the second calculation unit includes:
the first determining subunit is used for determining an initial value of the PageRank algorithm according to the order number of the word similarity matrix;
the first calculation subunit is used for calculating an initial characteristic vector value of the candidate keyword according to the initial value and the word similarity matrix;
a second calculation subunit configured to, according to the formula:
p_t = M^T · p_{t-1}
calculating the feature vector value of the candidate keywords, wherein when t = 1, p_1 denotes the initial feature vector value and p_0 denotes the initial value; p_t denotes the feature vector value of the word similarity matrix at step t, p_{t-1} denotes the feature vector value of the word similarity matrix at step t-1, M denotes the word similarity matrix of the candidate keywords, M^T denotes the transpose of the word similarity matrix, and t denotes the number of calculation steps, with t ≥ 1;
and the second determining subunit is used for determining, when the norm of the difference between the feature vector value at step t and the feature vector value at step t-1 is smaller than the error tolerance of the PageRank algorithm, that the feature vector value at step t is the PageRank value corresponding to the candidate keywords.
Further, the obtaining module 301 includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a text to be processed and segmenting the text to be processed to obtain stop words and words with appointed parts of speech, the stop words at least comprise prepositions, auxiliary words, conjunctions and exclamation words, and the words with appointed parts of speech at least comprise nouns, verbs and adjectives;
and the processing unit is used for filtering stop words to obtain words with the appointed part of speech, and the words with the appointed part of speech are candidate keywords corresponding to the text to be processed.
Further, the word vector is obtained by word2vec training.
Therefore, according to the graph-model-based keyword extraction device provided by the embodiment of the invention, the processing module calculates the similarity between words in the text from word vectors and constructs a similarity matrix, so that the extracted keywords reflect, to a certain extent, their semantic importance in the current text. When the similarity matrix is constructed, the similarity between words is computed from word vectors rather than from word co-occurrence. This avoids the over-weighting of repeated words caused by co-occurrence during keyword extraction, removes the need to set a window size manually, and selects, through semantic similarity, keywords that better match the topic of the document, thereby improving the accuracy of keyword extraction.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A keyword extraction method based on a graph model is characterized by comprising the following steps:
acquiring a text to be processed, and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed;
searching a word vector corresponding to the candidate keyword in a word vector model, wherein the word vector model comprises word vectors of the candidate keyword;
constructing a word similarity matrix of the candidate keywords according to the word vectors;
sorting the candidate keywords according to the word similarity matrix of the candidate keywords, and extracting the keywords of the text to be processed;
the sorting the candidate keywords according to the word similarity matrix of the candidate keywords comprises:
calculating a word similarity matrix of the candidate keywords according to a PageRank algorithm to obtain corresponding PageRank values of the candidate keywords;
sorting the candidate keywords according to the PageRank value to obtain the importance degree of the candidate keywords;
extracting keywords of the text to be processed according to the importance degree;
the calculating of the word similarity matrix of the candidate keywords according to the PageRank algorithm includes:
determining an initial value of the PageRank algorithm according to the order number of the word similarity matrix;
calculating an initial characteristic vector value of the candidate keyword according to the initial value and the word similarity matrix;
according to the formula:
p_t = M^T · p_{t-1}
calculating the feature vector value of the candidate keywords, wherein when t = 1, p_1 denotes the initial feature vector value and p_0 denotes the initial value; p_t denotes the feature vector value of the word similarity matrix at step t, p_{t-1} denotes the feature vector value of the word similarity matrix at step t-1, M denotes the word similarity matrix of the candidate keywords, M^T denotes the transpose of the word similarity matrix, and t denotes the number of calculation steps, with t ≥ 1;
and when the norm of the difference between the feature vector value at step t and the feature vector value at step t-1 is smaller than the error tolerance of the PageRank algorithm, the feature vector value at step t is the PageRank value corresponding to the candidate keywords.
2. The method of claim 1, wherein said constructing a word similarity matrix for the candidate keywords from the word vectors comprises:
according to the formula:
cos θ = ( Σ_{k=1}^{n} x_{1k}·x_{2k} ) / ( √(Σ_{k=1}^{n} x_{1k}²) · √(Σ_{k=1}^{n} x_{2k}²) )
calculating the cosine of the angle between the word vectors of the candidate keywords, wherein θ denotes the angle between the vectors of two candidate keywords, x_{1k} denotes the k-th component of the vector of one candidate keyword in n-dimensional space, x_{2k} denotes the k-th component of the vector of the other candidate keyword, and n denotes the dimension of the vector space;
and constructing the similarity matrix of the candidate keywords according to the cosine value of the included angle of the word vectors.
3. The method according to any one of claims 1 to 2, wherein the obtaining of the text to be processed and the word segmentation of the text to be processed to obtain the candidate keywords corresponding to the text to be processed comprises:
obtaining a text to be processed, and segmenting the text to be processed to obtain stop words and words with specified parts of speech, wherein the stop words at least comprise prepositions, auxiliary words, conjunctions and exclamation words, and the words with the specified parts of speech at least comprise nouns, verbs and adjectives;
and filtering the stop words to obtain the words with the specified part of speech, wherein the words with the specified part of speech are candidate keywords corresponding to the text to be processed.
4. The method of any of claims 1-2, wherein the word vector is obtained by word2vec training.
5. A keyword extraction apparatus based on a graph model, the apparatus comprising:
the acquisition module is used for acquiring a text to be processed and segmenting the text to be processed to obtain candidate keywords corresponding to the text to be processed;
the searching module is used for searching word vectors corresponding to the candidate keywords in a word vector model, and the word vector model comprises the word vectors of the candidate keywords;
the processing module is used for constructing a word similarity matrix of the candidate keywords according to the word vectors;
the extraction module is used for ranking the candidate keywords according to the word similarity matrix of the candidate keywords and extracting the keywords of the text to be processed;
the extraction module comprises:
the second calculation unit is used for calculating a word similarity matrix of the candidate keywords according to a PageRank algorithm to obtain corresponding PageRank values of the candidate keywords;
the sorting unit is used for sorting the candidate keywords according to the PageRank value to obtain the importance degree of the candidate keywords;
the extraction unit is used for extracting the keywords of the text to be processed according to the importance degree;
the second calculation unit includes:
the first determining subunit is used for determining an initial value of the PageRank algorithm according to the order of the word similarity matrix;
the first calculating subunit is used for calculating an initial feature vector value of the candidate keywords according to the initial value and the word similarity matrix;
a second calculation subunit configured to, according to the formula:
p_t = M^T · p_{t-1}
calculate a feature vector value of the candidate keywords, wherein, when t = 1, p_1 represents the initial feature vector value and p_0 represents the initial value; p_t represents the feature vector value of the word similarity matrix at step t, p_{t-1} represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, M^T represents the transpose of the word similarity matrix, and t represents the number of calculation steps, with t greater than or equal to 1;
and a second determining subunit, configured to determine, when the norm of the difference between the feature vector value at step t and the feature vector value at step t-1 is smaller than the error tolerance of the PageRank algorithm, the feature vector value at step t as the PageRank value corresponding to the candidate keywords.
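Purely as an illustration of how the modules of claim 5 fit together, the sketches above can be composed; the class name, the top_k parameter and the reuse of candidate_keywords, build_similarity_matrix and pagerank_scores are assumptions for this example:

    class GraphKeywordExtractor:
        def __init__(self, word_vectors, top_k=5):
            self.wv = word_vectors            # word vector model, e.g. gensim KeyedVectors
            self.top_k = top_k

        def extract(self, text):
            # acquisition module + searching module: segment, then keep words that have a known vector
            words = [w for w in dict.fromkeys(candidate_keywords(text)) if w in self.wv]
            # processing module: word similarity matrix built from the word vectors
            M = build_similarity_matrix([self.wv[w] for w in words])
            # extraction module: PageRank values, sorting by importance, top-k keywords
            scores = pagerank_scores(M)
            ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
            return [w for w, _ in ranked[: self.top_k]]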
6. The apparatus of claim 5, wherein the processing module comprises:
a first calculation unit for, according to the formula:
cos θ = ( Σ_{k=1}^{n} x_{1k} · x_{2k} ) / ( √( Σ_{k=1}^{n} x_{1k}² ) · √( Σ_{k=1}^{n} x_{2k}² ) )
calculating cosine values of the included angles between the word vectors corresponding to the candidate keywords, wherein θ represents the angle between the vectors of two candidate keywords, x_{1k} represents the characteristic value (k-th component) of the vector corresponding to one candidate keyword in the n-dimensional space, x_{2k} represents the characteristic value of the vector corresponding to the other candidate keyword in the n-dimensional space, and n represents the dimension of the vector space;
and the constructing unit is used for constructing the candidate keyword similarity matrix according to the cosine value of the word vector included angle.
7. The apparatus of any one of claims 5 to 6, wherein the obtaining module comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a text to be processed and segmenting the text to be processed to obtain stop words and words with specified parts of speech, the stop words at least comprise prepositions, auxiliary words, conjunctions and exclamation words, and the words with the specified parts of speech at least comprise nouns, verbs and adjectives;
and the processing unit is used for filtering the stop words to obtain the words with the specified parts of speech, wherein the words with the specified parts of speech are the candidate keywords corresponding to the text to be processed.
8. The apparatus of any of claims 5 to 6, wherein the word vector is obtained by word2vec training.
CN201710208956.4A 2017-03-31 2017-03-31 Keyword extraction method and device based on graph model Active CN106970910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710208956.4A CN106970910B (en) 2017-03-31 2017-03-31 Keyword extraction method and device based on graph model


Publications (2)

Publication Number Publication Date
CN106970910A CN106970910A (en) 2017-07-21
CN106970910B true CN106970910B (en) 2020-03-27

Family

ID=59336925


Country Status (1)

Country Link
CN (1) CN106970910B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325509B (en) * 2017-07-31 2023-01-17 北京国双科技有限公司 Similarity determination method and device
CN107704503A (en) * 2017-08-29 2018-02-16 平安科技(深圳)有限公司 User's keyword extracting device, method and computer-readable recording medium
CN107992470A (en) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 A kind of text duplicate checking method and system based on similarity
CN108009149A (en) * 2017-11-23 2018-05-08 东软集团股份有限公司 A kind of keyword extracting method, extraction element, medium and electronic equipment
CN108804423B (en) * 2018-05-30 2023-09-08 深圳平安医疗健康科技服务有限公司 Medical text feature extraction and automatic matching method and system
CN109325178A (en) * 2018-09-14 2019-02-12 北京字节跳动网络技术有限公司 Method and apparatus for handling information
CN109408826A (en) * 2018-11-07 2019-03-01 北京锐安科技有限公司 A kind of text information extracting method, device, server and storage medium
CN110032632A (en) * 2019-04-04 2019-07-19 平安科技(深圳)有限公司 Intelligent customer service answering method, device and storage medium based on text similarity
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN110489758B (en) * 2019-09-10 2023-04-18 深圳市和讯华谷信息技术有限公司 Value view calculation method and device for application program
CN110580290B (en) * 2019-09-12 2022-12-13 北京小米智能科技有限公司 Method and device for optimizing training set for text classification
CN110852100B (en) * 2019-10-30 2023-07-21 北京大米科技有限公司 Keyword extraction method and device, electronic equipment and medium
CN112818091A (en) * 2019-11-15 2021-05-18 北京京东尚科信息技术有限公司 Object query method, device, medium and equipment based on keyword extraction
CN111353301B (en) * 2020-02-24 2023-07-21 成都网安科技发展有限公司 Auxiliary secret determination method and device
CN113742602A (en) * 2020-05-29 2021-12-03 中国电信股份有限公司 Method, apparatus, and computer-readable storage medium for sample optimization
CN112434188B (en) * 2020-10-23 2023-09-05 杭州未名信科科技有限公司 Data integration method, device and storage medium of heterogeneous database

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
CN104281653A (en) * 2014-09-16 2015-01-14 南京弘数信息科技有限公司 Viewpoint mining method for ten million microblog texts
CN105740229A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Keyword extraction method and device
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9075846B2 (en) * 2012-12-12 2015-07-07 King Fahd University Of Petroleum And Minerals Method for retrieval of arabic historical manuscripts
US20150046152A1 (en) * 2013-08-08 2015-02-12 Quryon, Inc. Determining concept blocks based on context



Similar Documents

Publication Publication Date Title
CN106970910B (en) Keyword extraction method and device based on graph model
CN107122413B (en) Keyword extraction method and device based on graph model
CN106776562B (en) Keyword extraction method and extraction system
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN101079026B (en) Text similarity, acceptation similarity calculating method and system and application system
CN104834747B (en) Short text classification method based on convolutional neural networks
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN109960786A (en) Chinese Measurement of word similarity based on convergence strategy
CN107423282B (en) Method for concurrently extracting semantic consistency subject and word vector in text based on mixed features
CN106844632B (en) Product comment emotion classification method and device based on improved support vector machine
CN106997382A (en) Innovation intention label automatic marking method and system based on big data
CN112256939B (en) Text entity relation extraction method for chemical field
CN108874896B (en) Humor identification method based on neural network and humor characteristics
CN103473280A (en) Method and device for mining comparable network language materials
CN112417155B (en) Court trial query generation method, device and medium based on pointer-generation Seq2Seq model
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN114048305A (en) Plan recommendation method for administrative penalty documents based on graph convolution neural network
CN112507109A (en) Retrieval method and device based on semantic analysis and keyword recognition
CN111694927A (en) Automatic document review method based on improved word-shifting distance algorithm
CN110929022A (en) Text abstract generation method and system
CN111984782A (en) Method and system for generating text abstract of Tibetan language
CN114265936A (en) Method for realizing text mining of science and technology project
CN112949293A (en) Similar text generation method, similar text generation device and intelligent equipment
Kore et al. Legal document summarization using nlp and ml techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant