CN108304364A - Keyword extraction method and device - Google Patents
- Publication number
- CN108304364A CN108304364A CN201710101012.7A CN201710101012A CN108304364A CN 108304364 A CN108304364 A CN 108304364A CN 201710101012 A CN201710101012 A CN 201710101012A CN 108304364 A CN108304364 A CN 108304364A
- Authority
- CN
- China
- Prior art keywords
- word
- target word
- text
- word vector
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
Abstract
The present invention provides a keyword extraction method and device. The method includes: obtaining a text to be processed; preprocessing the text to be processed to obtain the words in it and determining the target words in the text; obtaining the word vector of each word in the text; inputting, for each target word, the word vector of the target word and the word vectors of the words after it, in reverse order, into a trained first recurrent neural network, and obtaining the first result of each target word output by the first recurrent neural network; calculating the probability value of each target word based on the output result of the target word, the output result including the first result; and determining the keywords in the text to be processed according to the probability value of each target word and a preset threshold. The method can accurately determine whether a target word is a keyword, improving the accuracy of keyword extraction.
Description
Technical field
The present invention relates to the field of information technology, and in particular to a keyword extraction method and device.
Background technology
A keyword is a distillation of the topic information of a text. It summarizes the main content of the text at a high level, helps users quickly grasp the gist of the text, and makes it easy for users to judge whether the text is the content they need, thereby improving the efficiency of information access and information retrieval. Moreover, because keywords are refined and concise, they can be used to compute text relevance at low complexity, enabling efficient processing such as text classification, text clustering, and information retrieval.
Some common deep learning methods have gradually been applied to the field of keyword extraction; with deep learning, a machine can learn the features of keywords automatically. A common approach performs deep learning with a recurrent neural network: the entire text to be processed is input into a trained recurrent neural network, and the output gives the keywords of the text. A traditional recurrent neural network includes an input layer, a hidden layer, and an output layer, with the hidden units in the hidden layer doing the most important work. The feedback of the hidden layer not only enters the output of the current time step but also enters the input of the hidden layer at the next time step. A traditional recurrent neural network can therefore take historical information into account. However, when a long text to be processed is input into the trained recurrent neural network in its entirety, the historical information that bears on the semantics of each individual word cannot be considered, so the semantic information of each word is ignored.
Thus, the accuracy of traditional keyword extraction using deep learning is low.
Summary of the invention
Accordingly, it is necessary to provide a keyword extraction method and device with high accuracy.
To achieve the above objective, the embodiments of the present invention adopt the following technical solutions:
A keyword extraction method, comprising:
Obtaining a text to be processed;
Preprocessing the text to be processed to obtain the words in it and determining the target words in the text;
Obtaining the word vector of each word in the text to be processed;
Inputting, for each target word in the text, the word vector of the target word and the word vectors of the words after it, in reverse order, into a trained first recurrent neural network, and obtaining the first result of each target word output by the first recurrent neural network;
Calculating the probability value of each target word based on the output result of the target word, the output result including the first result;
Determining the keywords in the text to be processed according to the probability value of each target word and a preset threshold.
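The claimed steps can be sketched end to end as follows. This is a minimal illustration only: the whitespace split stands in for Chinese word segmentation, and the scoring function is a hypothetical stand-in for the patent's trained recurrent neural network with softmax output.

```python
# Minimal end-to-end sketch of the claimed steps; all names are hypothetical.

def preprocess(text, stop_words):
    """Obtain the words of the text and pick the target words."""
    words = text.split()                    # stands in for Chinese segmentation
    targets = set(w for w in words if w not in stop_words)
    return words, targets

def score(target, following_reversed):
    """Stand-in scorer: longer words get a higher pseudo-probability."""
    return min(1.0, len(target) / 10.0)

def extract_keywords(text, stop_words, threshold):
    words, targets = preprocess(text, stop_words)
    keywords = []
    for i, w in enumerate(words):
        if w not in targets:
            continue
        following_reversed = list(reversed(words[i + 1:]))  # reverse-order input
        if score(w, following_reversed) > threshold:
            keywords.append(w)
    return keywords

print(extract_keywords("the transformer architecture uses attention", {"the", "uses"}, 0.5))
# → ['transformer', 'architecture', 'attention']
```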
A keyword extraction device, comprising:
An acquisition module, configured to obtain a text to be processed;
A preprocessing module, configured to preprocess the text to be processed to obtain the words in it and to determine the target words in the text;
A conversion module, configured to obtain the word vector of each word in the text to be processed;
A first recurrent neural network processing module, configured to obtain the first result of each target word from the word vector of each target word in the text and the word vectors of the words after it, input in reverse order;
A computing module, configured to calculate the probability value of each target word based on the output result of the target word, the output result including the first result;
A keyword determining module, configured to determine the keywords in the text to be processed according to the probability value of each target word and a preset threshold.
With the above keyword extraction method and device, during keyword extraction the target words in the text to be processed are determined, and for each target word the word vector of the target word and the word vectors of the words after it are input into a recurrent neural network in reverse order to obtain the output result of the target word. The probability value of each target word is calculated from its output result, and the keywords in the text are determined according to the probability value of each target word and a preset threshold. Because the target words are determined from the text to be processed, and the input to the recurrent neural network is the word vector of each target word and the word vectors of the words after it in reverse order, the recurrent neural network is applied once per target word, and the output result takes the following-context historical information of each target word, that is, its semantic information, into account. Whether a target word is a keyword can therefore be determined accurately, improving the accuracy of keyword extraction.
Description of the drawings
Fig. 1 is a schematic diagram of the application environment of the keyword extraction method and device of one embodiment;
Fig. 2 is a schematic diagram of the internal structure of the server of one embodiment;
Fig. 3 is a flowchart of the keyword extraction method of one embodiment;
Fig. 4 is a structural diagram of the LSTM unit of one embodiment;
Fig. 5 is a flowchart of the keyword extraction method of another embodiment;
Fig. 6 is a schematic structural diagram of the model corresponding to the keyword extraction method of one embodiment;
Fig. 7 is a schematic structural diagram of the model corresponding to the keyword extraction method of another embodiment;
Fig. 8 is a block diagram of the keyword extraction device of one embodiment;
Fig. 9 is a block diagram of the keyword extraction device of another embodiment;
Fig. 10 is a block diagram of the keyword extraction device of a further embodiment;
Fig. 11 is a block diagram of the keyword extraction device of yet another embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention and do not limit its scope of protection.
Fig. 1 is a schematic diagram of the application environment of the keyword extraction method and device provided by one embodiment. As shown in Fig. 1, the application environment includes a user terminal 101 and a server 102, which are communicatively connected. The user terminal 101 runs a search engine or a question answering system. A user enters text through the user terminal 101; the input text is sent to the server 102 over a communication network; the server 102 processes the input text and extracts its keywords to provide the user with search results or question-answering results. Alternatively, the user enters text through the user terminal 101, the user terminal 101 processes the input text and extracts its keywords, the keywords are sent to the server 102 over the communication network, and the server 102 provides the user with search results or question-answering results.
Fig. 2 is a schematic diagram of the internal structure of the server in one embodiment. As shown in Fig. 2, the server includes a processor, a storage medium, a memory, and a network interface connected through a system bus. The storage medium of the server stores an operating system and a keyword extraction device, and the keyword extraction device implements a keyword extraction method. The processor provides computing and control capability and supports the operation of the entire server. The memory in the server provides an environment for the keyword extraction device in the storage medium to run. The network interface performs network communication with the user terminal: it receives the input text sent by the user terminal, and sends the search results or question-answering results found according to the keywords in the input text back to the user terminal. Those skilled in the art will understand that the structure shown in Fig. 2 is only a block diagram of the part of the structure relevant to the present solution and does not limit the server to which the present solution is applied; a specific server may include more or fewer components than shown, combine certain components, or arrange the components differently.
Referring to Fig. 3, in one embodiment a keyword extraction method is provided. The method runs in the server 102 shown in Fig. 1 and includes the following steps:
S302: Obtain the text to be processed.
A user enters text through the user terminal, and the server obtains the text entered by the user over the communication network as the text to be processed.
S304: Preprocess the text to be processed to obtain the words in it and determine the target words in the text.
The text to be processed usually consists of individual characters. Compared with individual characters, words express semantics better and carry more practical meaning. The target words are the potential keywords determined from the words of the text. Specifically, the text can be segmented to obtain its words. Word segmentation is the process of recombining the consecutive characters of the text into a word sequence according to certain conventions. In one embodiment, every word in the text may be used as a target word. In another embodiment, the words with practical meaning may be extracted from the words of the text as target words.
In a specific embodiment, step S304 includes the following steps 1 and 2:
Step 1: Perform word segmentation on the text to be processed to obtain the words in it.
Step 2: Identify the stop words in the text and determine the words in the text other than the stop words as the target words.
A stop word is a word that is discarded as soon as it is encountered during text processing. Stop words mainly include English characters, numbers, punctuation marks, and Chinese characters with an extremely high frequency of use; they are usually function words without substantive meaning. Specifically, the stop words in a stop-word dictionary can be compared with the words in the text to determine the stop words in it. For example, common stop words such as '什么' ('what') and similar high-frequency function words can never serve as keywords. In this embodiment, the words in the text other than the stop words are determined as the target words. The words other than stop words are usually content words; taking content words rather than stop words as target words, and not inputting stop words into the recurrent neural network, on the one hand prevents the output results of stop words from lowering the accuracy of keyword extraction, and on the other hand improves the speed of keyword extraction.
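Step 2 above can be sketched as a simple set-membership filter. The stop-word set shown is a tiny hypothetical sample; a real system compares against a full stop-word dictionary.

```python
# Step 2 as a set-membership filter; the stop-word set is a hypothetical sample.
STOP_WORDS = {"的", "了", "什么", ",", "。"}

def target_words(words):
    """Keep the words other than stop words as target words (content words)."""
    return [w for w in words if w not in STOP_WORDS]

segmented = ["宁波", "的", "特产", "是", "什么"]
print(target_words(segmented))  # → ['宁波', '特产', '是']
```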
S306: Obtain the word vector of each word in the text to be processed.
A word vector is the vector representation of a word, a way of mathematizing the words of a natural language. Word vectors can be obtained by training a language model on large-scale data. A common language model is Word2vec, which uses ideas from deep learning to reduce the processing of text content to vector operations in a K-dimensional vector space. In a specific embodiment, the word vector of each word is obtained by training Word2vec on large-scale text data, and the word vector of each word in the text to be processed is then obtained by lookup.
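The lookup described here can be sketched with a plain table. The three-dimensional vectors below are made-up toy values (real Word2vec embeddings are trained offline and typically have hundreds of dimensions), and mapping unknown words to a zero vector is one common convention, assumed here.

```python
# Lookup sketch for S306: pretrained vectors sit in a table keyed by word.
# The 3-dimensional values are made up for illustration.
EMBEDDINGS = {
    "宁波":   [0.11, -0.42, 0.30],
    "特产":   [0.25, 0.08, -0.17],
    "世博会": [-0.05, 0.33, 0.21],
}
DIM = 3

def vectors_for(words):
    """Look up each word's vector; unknown words map to a zero vector."""
    return [EMBEDDINGS.get(w, [0.0] * DIM) for w in words]

print(vectors_for(["宁波", "未登录词"])[1])  # → [0.0, 0.0, 0.0]
```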
S308: For each target word in the text to be processed, input the word vector of the target word and the word vectors of the words after it, in reverse order, into the trained first recurrent neural network, and obtain the first result of each target word output by the first recurrent neural network.
The recurrent neural network structure in this embodiment may use a plain RNN (recurrent neural network) model, a long short-term memory (LSTM) model, or a GRU (gated recurrent unit) model. An RNN includes an input layer, a hidden layer, and an output layer, with the hidden units in the hidden layer doing the most important work. The feedback of the hidden layer not only enters the output of the current time step but also enters the input of the hidden layer at the next time step, so an RNN structure can take historical information into account.
LSTM builds on the RNN by replacing the hidden layer of the RNN with LSTM units. The structure of one kind of LSTM unit is shown in Fig. 4. A memory cell stores the historical information, and the update and use of the historical information are controlled by three gates: the input gate, the forget gate, and the output gate.
This embodiment is described taking an LSTM network as the first recurrent neural network. In this embodiment, centered on each target word, the word vector of the target word and the word vectors of the words after it are input, in reverse order, into the trained first LSTM network; the first LSTM network is thus applied once per target word, yielding the first result of each target word. Because each target word is the input of the last LSTM unit, the output result of each target word takes the following-context historical information of the target word, that is, its semantic information, into account. The first result of each target word output by the first LSTM network is the output of the last hidden layer (LSTM unit) of the first LSTM network.
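The reverse-order input for a single target word can be sketched as follows: the words after the target enter first, in reverse, and the target word itself enters last, so the final LSTM unit receives the target word. The helper name and the segmented word list are illustrative.

```python
# Building the reverse-order input for one target word (helper name hypothetical).
def reversed_input(words, i):
    """Sequence fed to the first (reverse-order) network for target index i."""
    return list(reversed(words[i + 1:])) + [words[i]]

words = ["宁波", "特产", "上海", "世博会", "占有", "一席之地"]
print(reversed_input(words, 3))  # → ['一席之地', '占有', '世博会']
```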
S310: Calculate the probability value of each target word based on the output result of the target word, the output result including the first result.
In this embodiment, the output result of each target word includes the first result of the target word output by the first recurrent neural network.
The input of the first LSTM network is word vectors, and the output result is also a vector. To map the output result of each target word into the range 0-1 as the probability of the target word, a softmax function or a sigmoid function may be used. Softmax is a common multi-class regression model. Judging whether a target word is a keyword is a two-class problem, so the corresponding softmax output has two dimensions: the first dimension represents the probability that the word is a keyword, and the second dimension the probability that it is not.
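A two-dimensional softmax of the kind described can be sketched as follows; the logit values are illustrative.

```python
import math

# Two-dimensional softmax: maps the two output logits into the 0-1 range.
# Dimension 0 is read as P(keyword), dimension 1 as P(not keyword).
def softmax(logits):
    m = max(logits)                      # shift by the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

p_keyword, p_not = softmax([2.0, 0.5])
print(round(p_keyword, 3))  # → 0.818
```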
S312: Determine the keywords in the text to be processed according to the probability value of each target word and a preset threshold.
The probability that each target word in the text is a keyword is compared with the preset threshold, and the target words whose probability values are greater than the threshold are determined as the keywords of the text.
The threshold setting depends on the specific requirements: if the threshold is set high, the precision is high but the recall drops accordingly; if the threshold is set low, the precision is low but the recall is high. Users can set the threshold as required.
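The comparison in S312 can be sketched as follows; the probability values and the 0.5 threshold are illustrative, and raising the threshold trades recall for precision as described above.

```python
# Threshold comparison of S312; probabilities and threshold are illustrative.
def keywords_above(probabilities, threshold):
    """Target words whose keyword probability exceeds the threshold."""
    return [w for w, p in probabilities.items() if p > threshold]

probs = {"世博会": 0.91, "占有": 0.34, "特产": 0.66}
print(keywords_above(probs, 0.5))  # → ['世博会', '特产']
```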
With the above keyword extraction method, during keyword extraction the target words in the text to be processed are determined, and for each target word the word vector of the target word and the word vectors of the words after it are input, in reverse order, into a recurrent neural network to obtain the output result of the target word. The probability value of each target word is calculated from its output result, and the keywords in the text are determined according to the probability value of each target word and a preset threshold. Because the target words are determined from the text, and the input to the recurrent neural network is the word vector of each target word and the word vectors of the words after it in reverse order, the recurrent neural network is applied once per target word, and the output result takes the following-context historical information of each target word, that is, its semantic information, into account. Whether a target word is a keyword can therefore be determined accurately, improving the accuracy of keyword extraction.
As shown in Fig. 5, in yet another embodiment a keyword extraction method is provided. The method runs in the server 102 shown in Fig. 1 and includes the following steps:
S502: Obtain the text to be processed.
A user enters text through the user terminal, and the server obtains the text entered by the user over the communication network as the text to be processed.
S504: Preprocess the text to be processed to obtain the words in it and determine the target words in the text.
The text to be processed usually consists of individual characters. Compared with individual characters, words express semantics better and carry more practical meaning. Specifically, the words of the text can be obtained by segmenting the text. Word segmentation is the process of recombining the consecutive characters of the text into a word sequence according to certain conventions. In one embodiment, every word in the text may be used as a target word. In another embodiment, the words with practical meaning may be extracted from the words of the text as target words.
S506: Obtain the word vector of each word in the text to be processed.
A word vector is the vector representation of a word, a way of mathematizing the words of a natural language. Word vectors can be obtained by training a language model on large-scale data. A common language model is Word2vec, which uses ideas from deep learning to reduce the processing of text content to vector operations in a K-dimensional vector space. In a specific embodiment, the word vector of each word is obtained by training Word2vec on large-scale text data, and the word vector of each word in the text to be processed is then obtained by lookup.
S508: For each target word in the text to be processed, input the word vector of the target word and the word vectors of the words after it, in reverse order, into the trained first recurrent neural network, and obtain the first result of each target word output by the first recurrent neural network.
In this embodiment, centered on each target word, the word vector of the target word and the word vectors of the words after it are input, in reverse order, into the trained first LSTM network; the first LSTM network is thus applied once per target word, yielding the first result of each target word. Because each target word is the input of the last LSTM unit, the output result of each target word takes the following-context historical information of the target word, that is, its semantic information, into account. The first result of each target word output by the first LSTM network is the output of the last hidden layer (LSTM unit) of the first LSTM network.
S510: For each target word in the text to be processed, input the word vectors of the words before the target word, followed by the word vector of the target word, in order, into the trained second recurrent neural network, and obtain the second result of each target word output by the second recurrent neural network.
In this embodiment, centered on each target word, the word vectors of the words before the target word and the word vector of the target word are input, in order, into the trained second LSTM network; the second LSTM network is thus applied once per target word, yielding the second result of each target word. Because each target word is the input of the last unit of the second LSTM, the output result of each target word takes the preceding-context historical information of the target word, that is, its semantic information, into account. The second result of each target word output by the second LSTM network is the output of the last hidden layer (LSTM unit) of the second LSTM network.
S512: Calculate the probability value of each target word based on the output result of the target word, the output result including the first result and the second result. Specifically, the probability that each target word is a keyword is calculated from its output result.
In this embodiment, the output result of each target word is obtained from the first result of the target word output by the first recurrent neural network and the second result of the target word output by the second recurrent neural network. The first result takes the following-context historical information of the target word into account, and the second result takes the preceding-context historical information into account, so the output result of each target word carries the information of both the preceding and the following direction of the target word. Since the output of a recurrent neural network is a vector, the output result of each target word models the information in both directions in a single vector, which represents the importance of the target word better and predicts the probability that it is a keyword.
The output result of each target word can be obtained by connecting the first result and the second result of the target word, or by adding, multiplying, averaging, taking the dot product of, or taking the maximum of the two results.
In a specific embodiment, the first result and the second result of each target word are connected to obtain the output result of the target word. Connecting means splicing the first result and the second result end to end. Specifically, the output of the hidden layer (LSTM unit) at the last time step of the first LSTM network is connected with the output of the hidden layer (LSTM unit) at the last time step of the second LSTM network to obtain the output result of each target word.
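The combination modes listed above can be sketched on plain Python lists, where `h_fwd` and `h_bwd` stand for the last hidden outputs of the first and second LSTM networks; the vectors are toy values.

```python
# The listed combination modes; h_fwd/h_bwd are toy stand-ins for the two
# networks' last hidden outputs.
def combine(h_fwd, h_bwd, mode="connect"):
    if mode == "connect":                      # splice end to end
        return h_fwd + h_bwd
    if mode == "add":
        return [a + b for a, b in zip(h_fwd, h_bwd)]
    if mode == "multiply":
        return [a * b for a, b in zip(h_fwd, h_bwd)]
    if mode == "average":
        return [(a + b) / 2 for a, b in zip(h_fwd, h_bwd)]
    if mode == "dot":                          # scalar similarity
        return sum(a * b for a, b in zip(h_fwd, h_bwd))
    if mode == "max":
        return [max(a, b) for a, b in zip(h_fwd, h_bwd)]
    raise ValueError(mode)

print(combine([0.1, 0.2], [0.3, 0.4]))         # → [0.1, 0.2, 0.3, 0.4]
print(combine([0.1, 0.2], [0.3, 0.4], "max"))  # → [0.3, 0.4]
```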
To map the output result of each target word into the range 0-1 as the probability of the target word, a softmax function or a sigmoid function may be used. Softmax is a common multi-class regression model. Judging whether a target word is a keyword is a two-class problem, so the corresponding softmax output has two dimensions: the first dimension represents the probability that the word is a keyword, and the second dimension the probability that it is not.
S514: Determine the keywords in the text to be processed according to the probability value of each target word and a preset threshold.
The probability that each target word in the text is a keyword is compared with the preset threshold, and the target words whose probability values are greater than the threshold are determined as the keywords of the text.
With the above keyword extraction method, the word vector of each target word and the word vectors of the words after it are input, in reverse order, into the trained first neural network to obtain the first result of each target word, and the word vectors of the words before each target word and the word vector of the target word are input, in order, into the trained second neural network to obtain the second result of each target word. The first result takes the following-context historical information of the target word into account, and the second result takes the preceding-context historical information into account, so the output result of each target word carries the information of both the preceding and the following direction of the target word. Since the output of a recurrent neural network is a vector, the output result of each target word models the information in both directions in a single vector, which represents the importance of the target word better and predicts the probability that it is a keyword.
In one embodiment, before step S308 the method further includes a step of training the first recurrent neural network, which includes the following steps 1 to 4:
Step 1: Obtain the texts to be used for training and the keywords corresponding to each training text.
Step 2: Preprocess each training text to obtain the words in it and determine the target words in each training text.
Step 3: Obtain the word vector of each word in each training text.
Step 4: Train the first recurrent neural network from the keywords corresponding to each training text and, for each target word in each training text, the word vector of the target word and the word vectors of the words after it, input in reverse order.
In this embodiment, the target words may be the non-stop words in each training text, or all the words in each training text. Preferably, all the words in each training text are used as target words, which can improve the accuracy of the recurrent neural network training.
Specifically, let S be a training text and K a keyword. During training, the word vector of the i-th target word of the training text and the word vectors of the words after it are input, in reverse order, into the first recurrent neural network; the word vector of the i-th keyword is then input into the first recurrent neural network, the loss of each keyword is obtained, and the parameters of the first recurrent neural network are updated using gradient descent. The recurrent neural network in this embodiment uses LSTM. During training, the network parameters of the recurrent neural network are initialized from a Gaussian distribution, and the network is trained using stochastic gradient descent.
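The two training details named here, Gaussian initialization and a stochastic-gradient-descent update, can be sketched as follows. Shapes, the standard deviation, and the learning rate are illustrative, and in practice the gradients come from backpropagating the keyword loss through the LSTM.

```python
import random

# Gaussian initialization of a parameter matrix and one plain SGD update.
# All sizes and constants are illustrative.
def gaussian_init(rows, cols, std=0.1, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def sgd_step(params, grads, lr=0.01):
    """Move every parameter a small step against its gradient."""
    return [[p - lr * g for p, g in zip(prow, grow)]
            for prow, grow in zip(params, grads)]

W = gaussian_init(2, 3)
W = sgd_step(W, [[1.0] * 3 for _ in range(2)], lr=0.5)
```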
In another embodiment, before step S508 the method further includes a step of training the first recurrent neural network and the second recurrent neural network, which includes the following steps 1 to 4:
Step 1: Obtain the texts to be used for training and the keywords corresponding to each training text.
Step 2: Preprocess each training text to obtain the words in it and determine the target words in each training text.
Step 3: Obtain the word vector of each word in each training text.
Step 4: Train the first recurrent neural network and the second recurrent neural network from the keywords corresponding to each training text and, for each target word in each training text, the word vector of the target word and the word vectors of the words after it input in reverse order, together with the word vectors of the words before the target word and the word vector of the target word input in order.
Specifically, let S be a training text and K a keyword. During training, the word vector of the i-th target word of the training text and the word vectors of the words after it are input, in reverse order, into the first recurrent neural network, and the word vectors of the words before the i-th target word and the word vector of the target word are input, in order, into the second recurrent neural network; the word vector of the i-th keyword is then input into the first recurrent neural network and the second recurrent neural network, the loss of each keyword is obtained, and the parameters of the first recurrent neural network and the second recurrent neural network are updated using gradient descent. The recurrent neural networks in this embodiment use LSTM. During training, the network parameters of the recurrent neural networks are initialized from a Gaussian distribution, and the networks are trained using stochastic gradient descent.
The above keyword extraction method lets a machine learn the features of keywords automatically, dispensing with the process of manual feature selection. By using two LSTM networks, the preceding-context information and the following-context information of each target word are fed into the model at the same time, and the information in both directions is modeled in a single vector, which represents keywords better. Moreover, because the LSTM structure can handle input sequences of arbitrary length and store historical information well, it makes fuller use of the contextual information of the words in a sentence.
The keyword extraction method of the present invention is described below with reference to specific embodiments.
A model corresponding to the keyword extraction method is shown in Fig. 6 and comprises a first LSTM network and a Softmax layer. The input of the model is term vectors; the first LSTM network produces the output result of the target word, represented as a vector, and the Softmax layer computes the probability value of the corresponding target word from that output result. Comparing the probability value of the target word with a preset threshold determines whether the target word is a keyword.
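A minimal sketch of this forward pass, under the same hypothetical simplifications (a plain tanh recurrent cell in place of the LSTM, random illustrative weights, a two-class Softmax), might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 8, 16
Wx = rng.normal(0.0, 0.1, (H, D))
Wh = rng.normal(0.0, 0.1, (H, H))
Wo = rng.normal(0.0, 0.1, (2, H))     # Softmax layer: keyword / not keyword

def keyword_probability(sent_vectors, i):
    """Probability that the i-th word is a keyword (Fig. 6 forward pass)."""
    h = np.zeros(H)
    # the target word and the words after it, in reverse order, target last
    for x in list(sent_vectors[i:])[::-1]:
        h = np.tanh(Wx @ x + Wh @ h)
    logits = Wo @ h
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return (e / e.sum())[0]            # class 0 = "keyword"

def is_keyword(sent_vectors, i, threshold=0.5):
    """Compare the probability value with the preset threshold."""
    return keyword_probability(sent_vectors, i) > threshold
```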
Take the pending text "What specialty does Ningbo have that can occupy a place at the Shanghai World Expo" as an example. After word segmentation, the determined target words include "Ningbo", "specialty", "Shanghai", "World Expo", "occupying" and "one seat". For each target word, the term vector of the target word and the term vectors of the words after it are input in reverse order into the trained first recurrent neural network, yielding the output result of each target word. For example, if the target word is "World Expo", the corresponding term vectors are input into the recurrent neural network in the order "", "one seat", "occupying", "World Expo": the term vector of "" is input to the first LSTM unit of the first LSTM network, the term vector of "one seat" to the second LSTM unit, and so on, with the term vector of the target word "World Expo" input to the last LSTM unit. Each LSTM unit is influenced by the output of the previous unit. The output of the first LSTM network is the output vector of the last LSTM unit, so the output result takes the following-context history of each target word, i.e. its semantic information, into account, which makes it possible to determine accurately whether the target word is a keyword and improves the accuracy of keyword extraction.
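The input ordering in this example can be made concrete with a short sketch. The English glosses below follow the translated text above (the untranslated Chinese function word appears there as an empty string and is kept here as ""); the helper itself is generic.

```python
def reverse_sequence(words, i):
    """Input to the first recurrent network for target word i: the words
    after it in reverse order, with the target word itself fed last."""
    return words[i + 1:][::-1] + [words[i]]

# Segmented sentence, glossed as in the example above
words = ["Ningbo", "having", "what", "specialty", "energy", "",
         "Shanghai", "World Expo", "occupying", "one seat", ""]

seq = reverse_sequence(words, words.index("World Expo"))
# seq == ["", "one seat", "occupying", "World Expo"]
```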
Another model corresponding to the keyword extraction method is shown in Fig. 7 and comprises a first LSTM network LSTM_R, a second LSTM network LSTM_L and a Softmax layer. The input of the model is term vectors. LSTM_R outputs the first result of the target word and LSTM_L outputs the second result; concatenating the first result and the second result gives the output result of the target word, represented as a vector. The Softmax layer computes the probability value of the corresponding target word from the output result, and comparing this probability value with a preset threshold determines whether the target word is a keyword.
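The two-network forward pass can be sketched as follows. Again this is a hypothetical simplification rather than the patent's implementation: two plain tanh recurrent cells stand in for LSTM_R and LSTM_L, and all weights are illustrative random values.

```python
import numpy as np

rng = np.random.default_rng(2)
D, H = 8, 16
# Two independent recurrent cells stand in for LSTM_R and LSTM_L
R = {"Wx": rng.normal(0, 0.1, (H, D)), "Wh": rng.normal(0, 0.1, (H, H))}
L = {"Wx": rng.normal(0, 0.1, (H, D)), "Wh": rng.normal(0, 0.1, (H, H))}
Wo = rng.normal(0, 0.1, (2, 2 * H))   # Softmax over the concatenated vector

def last_hidden(cell, vectors):
    """Run a sequence through one recurrent cell; return the final state."""
    h = np.zeros(H)
    for x in vectors:
        h = np.tanh(cell["Wx"] @ x + cell["Wh"] @ h)
    return h

def keyword_probability(sent_vectors, i):
    """Fig. 7 forward pass: first result from LSTM_R (reverse order),
    second result from LSTM_L (original order), concatenated, then Softmax."""
    first = last_hidden(R, list(sent_vectors[i:])[::-1])   # target word last
    second = last_hidden(L, list(sent_vectors[:i + 1]))    # target word last
    out = np.concatenate([first, second])  # the target word's output result
    logits = Wo @ out
    e = np.exp(logits - logits.max())
    return (e / e.sum())[0]                # class 0 = "keyword"
```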
Again take the pending text "What specialty does Ningbo have that can occupy a place at the Shanghai World Expo" as an example. After word segmentation, the determined target words include "Ningbo", "specialty", "Shanghai", "World Expo", "occupying" and "one seat". For each target word, the term vector of the target word and the term vectors of the words after it are input in reverse order into the trained LSTM_R, and the term vector of the target word and the term vectors of the words before it are input in order into LSTM_L, yielding the first result and the second result of each target word; concatenating the two gives the output result of each target word. For example, if the target word is "World Expo", the corresponding term vectors are input into LSTM_R in the order "", "one seat", "occupying", "World Expo": the term vector of "" is input to the first LSTM unit of LSTM_R, the term vector of "one seat" to the second LSTM unit, and so on, with the target word "World Expo" input to the last LSTM unit of LSTM_R. The corresponding term vectors are input into LSTM_L in the order "Ningbo", "having", "what", "specialty", "energy", "" and "Shanghai": the term vector of "Ningbo" is input to the first LSTM unit of LSTM_L, the term vector of "having" to the second LSTM unit, and so on, with the target word "World Expo" input to the last LSTM unit of LSTM_L. Each LSTM unit is influenced by the output of the previous unit, and the output of each LSTM network is the output vector of its last LSTM unit. The output result therefore takes both the following-context history and the preceding-context history of each target word into account: it models the information from both directions into a single vector, which represents the importance of the target word better and is used to predict the probability of whether it is a keyword.
In one embodiment, a keyword extraction apparatus is provided. As shown in Fig. 8, it comprises an acquisition module 801, a preprocessing module 802, a conversion module 803, a first recurrent neural network processing module 804, a computing module 805 and a keyword determining module 806.
The acquisition module 801 is configured to obtain the pending text.
The preprocessing module 802 is configured to preprocess the pending text to obtain the words in the pending text and to determine the target words in the pending text.
The conversion module 803 is configured to obtain the term vector of each word in the pending text.
The first recurrent neural network processing module 804 is configured to obtain the first result of each target word from the term vector of each target word in the pending text and the term vectors of the words after the target word, input in reverse order.
The computing module 805 is configured to calculate the probability value of each target word based on the output result of each target word, the output result including the first result.
The keyword determining module 806 is configured to determine the keywords in the pending text according to the probability value of each target word and a preset threshold.
In the keyword extraction apparatus above, the target words in the pending text are determined, and for each target word the term vectors of the target word and of the words after it are input in reverse order into the recurrent neural network to obtain the output result of that target word; the probability value of each target word is calculated from its output result, and the keywords in the pending text are determined according to the probability values and a preset threshold. Because the target words are determined from the pending text, and the input to the recurrent neural network is, for each target word, the term vectors of the target word and of the words after it in reverse order, with the network applied once per target word, the output result takes the following-context history of each target word, i.e. its semantic information, into account, so that whether a target word is a keyword can be determined accurately, improving the accuracy of keyword extraction.
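The module pipeline above can be tied together in one end-to-end sketch. Everything here is an illustrative stand-in: the embedding lookup, the whitespace tokenizer, the stop list, and the untrained random weights are all hypothetical, and a plain tanh cell replaces the LSTM.

```python
import numpy as np

rng = np.random.default_rng(3)
D, H = 8, 16
Wx = rng.normal(0.0, 0.1, (H, D))
Wh = rng.normal(0.0, 0.1, (H, H))
w = rng.normal(0.0, 0.1, H)

def term_vector(word):
    """Deterministic stand-in for a learned embedding lookup."""
    seed = sum(word.encode())
    return np.random.default_rng(seed).normal(size=D)

def probability(vectors, i):
    """Score target word i from its reverse-order context sequence."""
    h = np.zeros(H)
    for x in vectors[i + 1:][::-1] + [vectors[i]]:   # target word fed last
        h = np.tanh(Wx @ x + Wh @ h)
    return 1.0 / (1.0 + np.exp(-w @ h))              # sigmoid score

def extract_keywords(text, stop_words, threshold=0.5):
    """Obtain text -> preprocess -> embed -> score -> threshold."""
    words = text.split()                  # stand-in for real segmentation
    vectors = [term_vector(wd) for wd in words]
    return [wd for i, wd in enumerate(words)
            if wd not in stop_words and probability(vectors, i) > threshold]
```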
In another embodiment, a keyword extraction apparatus is provided. As shown in Fig. 9, it comprises an acquisition module 901, a preprocessing module 902, a conversion module 903, a first recurrent neural network processing module 904, a second recurrent neural network processing module 905, an output processing module 906, a computing module 907 and a keyword determining module 908.
The acquisition module 901 is configured to obtain the pending text.
The preprocessing module 902 is configured to preprocess the pending text to obtain the words in the pending text and to determine the target words in the pending text.
The conversion module 903 is configured to obtain the term vector of each word in the pending text.
The first recurrent neural network processing module 904 is configured to obtain the first result of each target word from the term vector of each target word in the pending text and the term vectors of the words after the target word, input in reverse order.
The second recurrent neural network processing module 905 is configured to obtain the second result of each target word from the term vector of each target word in the pending text and the term vectors of the words before the target word, input in order.
The output processing module 906 is configured to obtain the output result of each target word based on its first result and second result; specifically, the first result and the second result of each target word are concatenated to obtain the output result of that target word.
The computing module 907 is configured to calculate the probability value of each target word based on its output result.
The keyword determining module 908 is configured to determine the keywords in the pending text according to the probability value of each target word and a preset threshold.
The keyword extraction apparatus above inputs the term vector of each target word and the term vectors of the words after it into the trained first recurrent neural network in reverse order to obtain the first result of each target word, and inputs the term vector of each target word and the term vectors of the words before it into the trained second recurrent neural network in order to obtain the second result of each target word. The first result takes the following-context history of the target word into account, and the second result takes its preceding-context history into account, so the output result of each target word carries information from both the preceding and the following direction. Since the output of a recurrent neural network is a vector, the output result of each target word models the information from both directions into a single vector, which represents the importance of the target word better and is used to predict the probability of whether it is a keyword.
In another embodiment, referring to Fig. 10, the preprocessing module 802 includes a word segmentation module 8021 and an identification module 8022.
The word segmentation module 8021 is configured to perform word segmentation on the pending text to obtain the words in the pending text.
The identification module 8022 is configured to identify the stop words in the pending text and to determine the words in the pending text other than the stop words as target words.
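A minimal sketch of this two-step preprocessing, under hypothetical assumptions (whitespace tokenization in place of a real Chinese segmenter such as jieba, and an invented stop list), might be:

```python
# A naive whitespace split stands in for a real word segmenter;
# the stop list below is illustrative, not part of the patent.
STOP_WORDS = {"has", "what", "can", "at", "of"}

def preprocess(text):
    """Segment the pending text, then keep non-stop words as target words."""
    words = text.split()
    target_words = [w for w in words if w not in STOP_WORDS]
    return words, target_words
```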
In another embodiment, still referring to Fig. 10, the keyword extraction apparatus further includes a first training module 808.
The acquisition module 801 is further configured to obtain the training texts and the keywords corresponding to each training text.
The preprocessing module 802 is further configured to preprocess the training texts to obtain the words in each training text and to determine the target words in each training text.
The conversion module 803 is further configured to obtain the term vector of each word in each training text.
The first training module 808 is configured to train the first recurrent neural network according to the keywords corresponding to each training text and the term vectors, input in reverse order, of each target word in each training text and of the words after that target word.
In another embodiment, as shown in Fig. 11, the keyword extraction apparatus further includes a second training module 909.
The acquisition module 901 is further configured to obtain the training texts and the keywords corresponding to each training text.
The preprocessing module 902 is further configured to preprocess the training texts to obtain the words in each training text and to determine the target words in each training text.
The conversion module 903 is further configured to obtain the term vector of each word in each training text.
The second training module 909 is configured to train the first recurrent neural network and the second recurrent neural network according to the keywords corresponding to each training text, the term vectors, input in reverse order, of each target word in each training text and of the words after that target word, and the term vectors, input in order, of each target word and of the words before that target word.
The keyword extraction apparatus above lets the machine learn the features of keywords automatically, dispensing with manual feature selection. By using two LSTM networks, the preceding context and the following context of the target word are fed into the model at the same time, and the information from both directions is modeled into a single vector, which represents the keyword better. In addition, because the LSTM structure can accept input sequences of arbitrary length and is good at retaining historical information, it makes fuller use of the contextual information of the words in the sentence.
Those of ordinary skill in the art will appreciate that all or part of the flows of the above embodiment methods can be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium; in the embodiments of the present invention, the program may be stored in a storage medium of a computer system and executed by at least one processor of the computer system to implement flows such as those of the method embodiments above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.
Claims (12)
1. A keyword extraction method, characterized by comprising:
obtaining a pending text;
preprocessing the pending text to obtain the words in the pending text and determining the target words in the pending text;
obtaining the term vector of each word in the pending text;
inputting, in reverse order, the term vector of each target word in the pending text and the term vectors of the words after the target word into a trained first recurrent neural network, to obtain a first result of each target word output by the first recurrent neural network;
calculating a probability value of each target word based on an output result of the target word, the output result comprising the first result; and
determining the keywords in the pending text according to the probability value of each target word and a preset threshold.
2. The method according to claim 1, characterized in that before the step of calculating the probability value of each target word based on the output result of the target word, the method further comprises: inputting, in order, the term vector of each target word in the pending text and the term vectors of the words before the target word into a trained second recurrent neural network, to obtain a second result of each target word output by the second recurrent neural network;
the output result further comprising the second result.
3. The method according to claim 2, characterized in that the first result and the second result of each target word are concatenated to obtain the output result of the target word.
4. The method according to claim 1, characterized in that the step of preprocessing the pending text to obtain the words in the pending text and determining the target words in the pending text comprises:
performing word segmentation on the pending text to obtain the words in the pending text; and
identifying the stop words in the pending text and determining the words in the pending text other than the stop words as target words.
5. The method according to claim 1, characterized in that before the step of inputting, in reverse order, the term vector of each target word in the pending text and the term vectors of the words after the target word into the trained first recurrent neural network to obtain the first result of each target word output by the first recurrent neural network, the method further comprises:
obtaining training texts and the keywords corresponding to each training text;
preprocessing the training texts to obtain the words in each training text and determining the target words in each training text;
obtaining the term vector of each word in each training text; and
training the first recurrent neural network according to the keywords corresponding to each training text and the term vectors, input in reverse order, of each target word in each training text and of the words after that target word.
6. The method according to claim 2, characterized in that before the step of inputting, in reverse order, the term vector of each target word in the pending text and the term vectors of the words after the target word into the trained first recurrent neural network to obtain the first result of each target word output by the first recurrent neural network, the method further comprises:
obtaining training texts and the keywords corresponding to each training text;
preprocessing the training texts to obtain the words in each training text and determining the target words in each training text;
obtaining the term vector of each word in each training text; and
training the first recurrent neural network and the second recurrent neural network according to the keywords corresponding to each training text, the term vectors, input in reverse order, of each target word in each training text and of the words after that target word, and the term vectors, input in order, of each target word and of the words before that target word.
7. A keyword extraction apparatus, characterized by comprising:
an acquisition module, configured to obtain a pending text;
a preprocessing module, configured to preprocess the pending text to obtain the words in the pending text and to determine the target words in the pending text;
a conversion module, configured to obtain the term vector of each word in the pending text;
a first recurrent neural network processing module, configured to obtain a first result of each target word from the term vector of each target word in the pending text and the term vectors of the words after the target word, input in reverse order;
a computing module, configured to calculate a probability value of each target word based on an output result of the target word, the output result comprising the first result; and
a keyword determining module, configured to determine the keywords in the pending text according to the probability value of each target word and a preset threshold.
8. The apparatus according to claim 7, characterized by further comprising a second recurrent neural network processing module configured to obtain a second result of each target word from the term vector of each target word in the pending text and the term vectors of the words before the target word, input in order;
the computing module being configured to calculate the probability value of each target word based on the output result of the target word, the output result comprising the first result and the second result.
9. The apparatus according to claim 8, characterized by further comprising an output processing module configured to concatenate the first result and the second result of each target word to obtain the output result of the target word.
10. The apparatus according to claim 7, characterized in that the preprocessing module comprises a word segmentation module and an identification module;
the word segmentation module being configured to perform word segmentation on the pending text to obtain the words in the pending text; and
the identification module being configured to identify the stop words in the pending text and to determine the words in the pending text other than the stop words as target words.
11. The apparatus according to claim 7, characterized by further comprising a first training module;
the acquisition module being further configured to obtain training texts and the keywords corresponding to each training text;
the preprocessing module being further configured to preprocess the training texts to obtain the words in each training text and to determine the target words in each training text;
the conversion module being further configured to obtain the term vector of each word in each training text; and
the first training module being configured to train the first recurrent neural network according to the keywords corresponding to each training text and the term vectors, input in reverse order, of each target word in each training text and of the words after that target word.
12. The apparatus according to claim 8, characterized by further comprising a second training module;
the acquisition module being further configured to obtain training texts and the keywords corresponding to each training text;
the preprocessing module being further configured to preprocess the training texts to obtain the words in each training text and to determine the target words in each training text;
the conversion module being further configured to obtain the term vector of each word in each training text; and
the second training module being configured to train the first recurrent neural network and the second recurrent neural network according to the keywords corresponding to each training text, the term vectors, input in reverse order, of each target word in each training text and of the words after that target word, and the term vectors, input in order, of each target word and of the words before that target word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710101012.7A CN108304364A (en) | 2017-02-23 | 2017-02-23 | keyword extracting method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710101012.7A CN108304364A (en) | 2017-02-23 | 2017-02-23 | keyword extracting method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108304364A true CN108304364A (en) | 2018-07-20 |
Family
ID=62872340
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710101012.7A Pending CN108304364A (en) | 2017-02-23 | 2017-02-23 | keyword extracting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108304364A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145107A (en) * | 2018-09-27 | 2019-01-04 | 平安科技(深圳)有限公司 | Subject distillation method, apparatus, medium and equipment based on convolutional neural networks |
CN109388806A (en) * | 2018-10-26 | 2019-02-26 | 北京布本智能科技有限公司 | A kind of Chinese word cutting method based on deep learning and forgetting algorithm |
CN109471938A (en) * | 2018-10-11 | 2019-03-15 | 平安科技(深圳)有限公司 | A kind of file classification method and terminal |
CN109635273A (en) * | 2018-10-25 | 2019-04-16 | 平安科技(深圳)有限公司 | Text key word extracting method, device, equipment and storage medium |
CN110110330A (en) * | 2019-04-30 | 2019-08-09 | 腾讯科技(深圳)有限公司 | Text based keyword extracting method and computer equipment |
CN110851574A (en) * | 2018-07-27 | 2020-02-28 | 北京京东尚科信息技术有限公司 | Statement processing method, device and system |
CN110866393A (en) * | 2019-11-19 | 2020-03-06 | 北京网聘咨询有限公司 | Resume information extraction method and system based on domain knowledge base |
CN112528655A (en) * | 2020-12-18 | 2021-03-19 | 北京百度网讯科技有限公司 | Keyword generation method, device, equipment and storage medium |
CN112749251A (en) * | 2020-03-09 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Text processing method and device, computer equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106095749A (en) * | 2016-06-03 | 2016-11-09 | 杭州量知数据科技有限公司 | A kind of text key word extracting method based on degree of depth study |
US9508340B2 (en) * | 2014-12-22 | 2016-11-29 | Google Inc. | User specified keyword spotting using long short term memory neural network feature extractor |
CN106383817A (en) * | 2016-09-29 | 2017-02-08 | 北京理工大学 | Paper title generation method capable of utilizing distributed semantic information |
- 2017-02-23: CN201710101012.7A (published as CN108304364A), status Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9508340B2 (en) * | 2014-12-22 | 2016-11-29 | Google Inc. | User specified keyword spotting using long short term memory neural network feature extractor |
CN106095749A (en) * | 2016-06-03 | 2016-11-09 | 杭州量知数据科技有限公司 | A kind of text key word extracting method based on degree of depth study |
CN106383817A (en) * | 2016-09-29 | 2017-02-08 | 北京理工大学 | Paper title generation method capable of utilizing distributed semantic information |
Non-Patent Citations (1)
Title |
---|
Wang Xuxiang, "Research on Question Keyword Extraction for Question Answering" (面向问答的问句关键词提取技术研究), China Master's Theses Full-text Database * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851574A (en) * | 2018-07-27 | 2020-02-28 | 北京京东尚科信息技术有限公司 | Statement processing method, device and system |
CN109145107A (en) * | 2018-09-27 | 2019-01-04 | 平安科技(深圳)有限公司 | Subject distillation method, apparatus, medium and equipment based on convolutional neural networks |
CN109145107B (en) * | 2018-09-27 | 2023-07-25 | 平安科技(深圳)有限公司 | Theme extraction method, device, medium and equipment based on convolutional neural network |
CN109471938B (en) * | 2018-10-11 | 2023-06-16 | 平安科技(深圳)有限公司 | Text classification method and terminal |
CN109471938A (en) * | 2018-10-11 | 2019-03-15 | 平安科技(深圳)有限公司 | A kind of file classification method and terminal |
CN109635273A (en) * | 2018-10-25 | 2019-04-16 | 平安科技(深圳)有限公司 | Text key word extracting method, device, equipment and storage medium |
WO2020082560A1 (en) * | 2018-10-25 | 2020-04-30 | 平安科技(深圳)有限公司 | Method, apparatus and device for extracting text keyword, as well as computer readable storage medium |
CN109388806A (en) * | 2018-10-26 | 2019-02-26 | 北京布本智能科技有限公司 | A kind of Chinese word cutting method based on deep learning and forgetting algorithm |
CN109388806B (en) * | 2018-10-26 | 2023-06-27 | 北京布本智能科技有限公司 | Chinese word segmentation method based on deep learning and forgetting algorithm |
CN110110330A (en) * | 2019-04-30 | 2019-08-09 | 腾讯科技(深圳)有限公司 | Text based keyword extracting method and computer equipment |
CN110110330B (en) * | 2019-04-30 | 2023-08-11 | 腾讯科技(深圳)有限公司 | Keyword extraction method based on text and computer equipment |
CN110866393A (en) * | 2019-11-19 | 2020-03-06 | 北京网聘咨询有限公司 | Resume information extraction method and system based on domain knowledge base |
CN112749251A (en) * | 2020-03-09 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Text processing method and device, computer equipment and storage medium |
CN112749251B (en) * | 2020-03-09 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Text processing method, device, computer equipment and storage medium |
CN112528655A (en) * | 2020-12-18 | 2021-03-19 | 北京百度网讯科技有限公司 | Keyword generation method, device, equipment and storage medium |
CN112528655B (en) * | 2020-12-18 | 2023-12-29 | 北京百度网讯科技有限公司 | Keyword generation method, device, equipment and storage medium |
US11899699B2 (en) | 2020-12-18 | 2024-02-13 | Beijing Baidu Netcom Science Technology Co., Ltd. | Keyword generating method, apparatus, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108304364A (en) | keyword extracting method and device | |
US10963637B2 (en) | Keyword extraction method, computer equipment and storage medium | |
Bang et al. | Explaining a black-box by using a deep variational information bottleneck approach | |
CN106156003B (en) | A kind of question sentence understanding method in question answering system | |
CN110534087A (en) | A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium | |
CN104598611B (en) | The method and system being ranked up to search entry | |
CN109871538A (en) | A kind of Chinese electronic health record name entity recognition method | |
CN109977416A (en) | A kind of multi-level natural language anti-spam text method and system | |
CN109241255A (en) | A kind of intension recognizing method based on deep learning | |
CN108804677A (en) | In conjunction with the deep learning question classification method and system of multi-layer attention mechanism | |
CN108334487A (en) | Lack semantics information complementing method, device, computer equipment and storage medium | |
CN110083700A (en) | A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks | |
CN110489755A (en) | Document creation method and device | |
CN110502753A (en) | A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement | |
CN110427461A (en) | Intelligent answer information processing method, electronic equipment and computer readable storage medium | |
CN110969020A (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN107437100A (en) | A kind of picture position Forecasting Methodology based on the association study of cross-module state | |
CN108549658A (en) | A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree | |
CN110287489A (en) | Document creation method, device, storage medium and electronic equipment | |
CN106682089A (en) | RNNs-based method for automatic safety checking of short message | |
CN113254782B (en) | Question-answering community expert recommendation method and system | |
CN110516035A (en) | A kind of man-machine interaction method and system of mixing module | |
CN111309887A (en) | Method and system for training text key content extraction model | |
CN105975497A (en) | Automatic microblog topic recommendation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180720 |