CN108304364A - keyword extracting method and device - Google Patents

Keyword extracting method and device

Info

Publication number
CN108304364A
CN108304364A
Authority
CN
China
Prior art keywords
word
target word
text
term vector
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710101012.7A
Other languages
Chinese (zh)
Inventor
王煦祥
尹庆宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201710101012.7A
Publication of CN108304364A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology

Abstract

The present invention provides a keyword extraction method and device. The method includes: obtaining a text to be processed; preprocessing the text to obtain the words in the text and determining the target words in the text; obtaining the word vector of each word in the text; for each target word, inputting the word vector of the target word and the word vectors of the words following it, in reverse order, into a trained first recurrent neural network, and obtaining a first result for each target word output by the first recurrent neural network; calculating a probability value for each target word based on the output result of the target word, the output result including the first result; and determining the keywords in the text according to the probability value of each target word and a preset threshold. The method can accurately determine whether a target word is a keyword, improving the accuracy of keyword extraction.

Description

Keyword extracting method and device
Technical field
The present invention relates to the field of information technology, and in particular to a keyword extraction method and device.
Background
A keyword is a distillation of the subject information of a text. By summarizing the main content of a text at a high level, keywords help users quickly grasp the gist of the text and conveniently judge whether the text is the content they need, thereby improving the efficiency of information access and information retrieval. Moreover, because keywords are refined and concise, they can be used to compute text relevance at low complexity, enabling efficient text classification, text clustering, information retrieval, and similar processing.
Some common deep learning methods have also gradually been applied to the field of keyword extraction; through deep learning, a machine can learn the features of keywords automatically. A common approach is to perform deep learning with a recurrent neural network: the entire text to be processed is input into a trained recurrent neural network, which outputs the keywords of the text. A traditional recurrent neural network includes an input layer, a hidden layer, and an output layer, with the hidden units in the hidden layer doing the most important work. The feedback of the hidden layer not only enters the output at the current moment but also enters the input of the hidden layer at the next moment. Therefore, a traditional recurrent neural network can take historical information into account. However, when a long sentence is input in its entirety into a trained recurrent neural network, the influence of each word's historical information on the semantics cannot be considered, so the semantic information of the words is ignored.
As a result, the accuracy of traditional deep-learning-based keyword extraction methods is low.
Summary of the invention
In view of this, it is necessary to provide a keyword extraction method and device with high accuracy.
To achieve the above objective, the embodiments of the present invention adopt the following technical solutions:
A keyword extraction method, including:
obtaining a text to be processed;
preprocessing the text to be processed to obtain the words in the text and determining the target words in the text;
obtaining the word vector of each word in the text to be processed;
inputting, for each target word in the text, the word vector of the target word and the word vectors of the words following the target word, in reverse order, into a trained first recurrent neural network, and obtaining a first result for each target word output by the first recurrent neural network;
calculating a probability value for each target word based on the output result of the target word, the output result including the first result; and
determining the keywords in the text to be processed according to the probability value of each target word and a preset threshold.
A keyword extraction device, including:
an acquisition module, configured to obtain a text to be processed;
a preprocessing module, configured to preprocess the text to be processed to obtain the words in the text and determine the target words in the text;
a conversion module, configured to obtain the word vector of each word in the text to be processed;
a first recurrent neural network processing module, configured to obtain, for each target word in the text, a first result from the word vector of the target word and the word vectors of the words following it, input in reverse order;
a calculation module, configured to calculate a probability value for each target word based on the output result of the target word, the output result including the first result; and
a keyword determination module, configured to determine the keywords in the text to be processed according to the probability value of each target word and a preset threshold.
In the above keyword extraction method and device, during keyword extraction, for each target word determined in the text to be processed, the word vectors of the target word and of the words following it are input into a recurrent neural network in reverse order; the output result of each target word is obtained; the probability value of each target word is calculated from its output result; and the keywords in the text are determined according to the probability values of the target words and a preset threshold. Because target words are determined from the text to be processed, and the input to the recurrent neural network is, for each target word, the word vectors of the target word and of the words following it in reverse order, the recurrent neural network is applied repeatedly, once per target word, and its output result takes into account the following-context historical information of each target word, i.e., its semantic information. It can therefore be accurately determined whether a target word is a keyword, improving the accuracy of keyword extraction.
Brief description of the drawings
Fig. 1 is a schematic diagram of the application environment of the keyword extraction method and device according to one embodiment;
Fig. 2 is a schematic diagram of the internal structure of the server according to one embodiment;
Fig. 3 is a flowchart of the keyword extraction method according to one embodiment;
Fig. 4 is a structural diagram of an LSTM unit according to one embodiment;
Fig. 5 is a flowchart of the keyword extraction method according to another embodiment;
Fig. 6 is a schematic structural diagram of a model corresponding to the keyword extraction method according to one embodiment;
Fig. 7 is a schematic structural diagram of a model corresponding to the keyword extraction method according to another embodiment;
Fig. 8 is a structural block diagram of the keyword extraction device according to one embodiment;
Fig. 9 is a structural block diagram of the keyword extraction device according to another embodiment;
Fig. 10 is a structural block diagram of the keyword extraction device according to a further embodiment;
Fig. 11 is a structural block diagram of the keyword extraction device according to yet another embodiment.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and do not limit its scope of protection.
Fig. 1 is a schematic diagram of the application environment of the keyword extraction method and device provided by one embodiment. As shown in Fig. 1, the application environment includes a user terminal 101 and a server 102, which are communicatively connected. The user terminal 101 is equipped with a search engine or a question answering system. A user enters text through the user terminal 101; the input text is sent to the server 102 over a communication network; and the server 102 processes the input text, extracts the keywords in it, and provides search results or answers to the user. Alternatively, the user enters text through the user terminal 101, the user terminal 101 processes the input text and extracts its keywords, the keywords are sent to the server 102 over the communication network, and the server 102 provides search results or answers to the user.
Fig. 2 is a schematic diagram of the internal structure of the server in one embodiment. As shown in Fig. 2, the server includes a processor, a storage medium, a memory, and a network interface connected through a system bus. The storage medium of the server stores an operating system and a keyword extraction device, and the keyword extraction device is used to implement a keyword extraction method. The processor provides computing and control capability and supports the operation of the entire server. The memory in the server provides an environment for the operation of the keyword extraction device in the storage medium. The network interface is used for network communication with the user terminal: it receives the input text sent by the user terminal and sends the search results or answers found according to the keywords in the input text back to the user terminal. Those skilled in the art will appreciate that the structure shown in Fig. 2 is only a block diagram of the part of the structure relevant to the solution of the present invention and does not limit the server to which the solution is applied; a specific server may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
Referring to Fig. 3, in one embodiment, a keyword extraction method is provided. The method runs on the server 102 shown in Fig. 1 and includes the following steps:
S302: Obtain a text to be processed.
A user enters text through the user terminal, and the server obtains the text entered by the user over the communication network as the text to be processed.
S304: Preprocess the text to be processed to obtain the words in the text and determine the target words in the text.
The text to be processed is usually composed of individual characters. Compared with single characters, words express semantics better and carry more practical meaning. Target words are the potential keywords determined from the words of the text to be processed. Specifically, the words of the text can be obtained by segmenting the text. Word segmentation is the process of recombining the consecutive characters of the text into a word sequence according to certain specifications. In one embodiment, every word in the text to be processed may be taken as a target word. In another embodiment, the words with practical meaning may be extracted from the words of the text as target words.
In a specific embodiment, step S304 includes the following steps 1 and 2:
Step 1: Perform word segmentation on the text to be processed to obtain the words in the text.
Step 2: Identify the stop words in the text to be processed and determine the words other than the stop words as target words.
A stop word is a word that is discarded immediately upon being encountered during text processing. Stop words mainly include English characters, numbers, punctuation marks, and extremely frequent characters, and are usually function words without concrete meaning. Specifically, the stop words in the text can be determined by comparing the words in the text against a stop word dictionary. For example, common stop words include particles and question words such as "what"; such words can never be keywords. In this embodiment, the words other than the stop words are determined as target words. The words other than stop words are usually content words; taking content words as target words and not feeding stop words into the recurrent neural network as target words avoids, on the one hand, the output results of stop words degrading the accuracy of keyword extraction and, on the other hand, improves the speed of keyword extraction.
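As a rough illustration of steps 1 and 2, the sketch below assumes the text has already been segmented into words (the words are English glosses of the example sentence used later in this description, and the stop word set is a tiny made-up stand-in for a real stop word dictionary):

```python
# Minimal sketch of step 2: filter stop words to obtain target words.
# STOP_WORDS is an illustrative stand-in for a real stop word
# dictionary; segmentation itself is assumed to be already done.
STOP_WORDS = {"what", "does", "have", "that", "could", "at", "the"}

def extract_target_words(words):
    """Return the words that are not stop words, preserving order."""
    return [w for w in words if w not in STOP_WORDS]

# A toy segmentation of the example sentence (assumed).
words = ["Ningbo", "does", "have", "what", "specialty", "could",
         "at", "the", "Shanghai", "World Expo", "occupy", "a place"]
print(extract_target_words(words))
# ['Ningbo', 'specialty', 'Shanghai', 'World Expo', 'occupy', 'a place']
```

Only the content words survive the filter, so only they are later scored as keyword candidates.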
S306: Obtain the word vector of each word in the text to be processed.
A word vector is a vector representation of a word, a way of mathematizing the words of a natural language. Word vectors can be obtained by training a language model on large-scale data. A common language model is Word2vec, which draws on ideas from deep learning and, through training, reduces the processing of text content to vector operations in a K-dimensional vector space. In a specific embodiment, the word vector of each word is obtained by training Word2vec on large-scale text data, and the word vector of each word in the text to be processed is obtained by lookup.
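The lookup described above can be sketched as a simple table query. The vectors below are arbitrary toy values standing in for embeddings trained with Word2vec on a large corpus; a real system would load a trained model, and the random fallback for unseen words is only an assumed out-of-vocabulary strategy:

```python
# Minimal sketch of step S306: look up a word vector for each word.
import random

random.seed(0)
DIM = 4  # toy dimensionality; real word vectors are much larger
embedding_table = {}

def word_vector(word):
    """Return the word's vector, creating a random one for unseen
    words (an assumed stand-in for an OOV strategy)."""
    if word not in embedding_table:
        embedding_table[word] = [random.uniform(-1, 1) for _ in range(DIM)]
    return embedding_table[word]

vectors = [word_vector(w) for w in ["Ningbo", "specialty", "World Expo"]]
print(len(vectors), len(vectors[0]))  # 3 4
```

Repeated lookups of the same word return the same vector, which is all the downstream steps rely on.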
S308: For each target word in the text to be processed, input the word vector of the target word and the word vectors of the words following it, in reverse order, into the trained first recurrent neural network, and obtain the first result of each target word output by the first recurrent neural network.
The recurrent neural network in this embodiment may use an RNN (recurrent neural network) model, a long short-term memory model (LSTM), or a GRU (gated recurrent unit) model. An RNN includes an input layer, a hidden layer, and an output layer, with the hidden units in the hidden layer doing the most important work. The feedback of the hidden layer not only enters the output at the current moment but also enters the input of the hidden layer at the next moment. Therefore, the RNN structure can take historical information into account.
LSTM builds on the RNN by replacing the hidden layer of the RNN with LSTM units. The structure of one kind of LSTM unit is shown in Fig. 4. A memory cell stores historical information, and the updating and use of the historical information are controlled by three gates: the input gate, the forget gate, and the output gate.
This embodiment is described taking the case where the first recurrent neural network is an LSTM network. In this embodiment, centered on each target word, the word vector of the target word and the word vectors of the words following it are input, in reverse order, into the trained first LSTM network; the first LSTM network is thus applied repeatedly, once per target word, to obtain the first result of each target word. Because each target word serves as the input to the last LSTM unit, the output result of each target word takes into account the following-context historical information of the target word, i.e., its semantic information. The first result of each target word output by the first LSTM network is the output of the last hidden layer (LSTM unit) of the first LSTM network.
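The ordering described above, the words after the target word fed in reverse with the target word itself last, can be sketched as follows (the function name and word list are illustrative, not from the patent):

```python
# Sketch of assembling the first LSTM's input for one target word:
# the words after the target word, in reverse order, followed by the
# target word itself, so the final hidden state is conditioned on the
# target word and its following context.
def reverse_context_sequence(words, target_index):
    following = words[target_index + 1:]
    return list(reversed(following)) + [words[target_index]]

words = ["Ningbo", "specialty", "Shanghai", "World Expo", "occupy", "a place"]
print(reverse_context_sequence(words, words.index("World Expo")))
# ['a place', 'occupy', 'World Expo']
```

Each element of the returned sequence would be replaced by its word vector before being fed to the network, one element per LSTM time step.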
S310: Calculate the probability value of each target word based on the output result of the target word; the output result includes the first result.
In this embodiment, the output result of each target word includes the first result of the target word output by the first recurrent neural network.
The input of the first LSTM network is word vectors, and its output result is also a vector. To map the output result of each target word into the range 0 to 1 so that it represents the probability of the target word, a softmax function or a sigmoid function can be used. The softmax function is a common multi-class regression model. Judging whether a target word is a keyword is a two-class problem, so the corresponding softmax has two dimensions: the first dimension represents the probability of being a keyword, and the second dimension represents the probability of not being a keyword.
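The two-dimensional softmax described above can be written out directly. The raw scores here are made-up example values standing in for the network's output:

```python
# Sketch of the two-class softmax: it maps two raw output scores to a
# (keyword, non-keyword) probability pair summing to 1.
import math

def softmax2(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

keyword_prob, non_keyword_prob = softmax2([2.0, 0.5])
print(round(keyword_prob, 3))  # 0.818
```

The first component is then compared against the preset threshold in step S312.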
S312: Determine the keywords in the text to be processed according to the probability value of each target word and a preset threshold.
The probability value of each target word in the text being a keyword is compared with the preset threshold, and the target words whose probability values are greater than the preset threshold are determined as the keywords of the text.
The setting of the threshold depends on the specific requirements. If the threshold is set high, precision is high and recall is correspondingly reduced; if the threshold is set low, precision is low and recall is high. Users can set the threshold as needed.
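Step S312 amounts to a simple filter. The probabilities below are illustrative values, not outputs of any actual model:

```python
# Sketch of step S312: keep the target words whose keyword
# probability exceeds the preset threshold.
def select_keywords(probabilities, threshold):
    """probabilities: dict mapping target word -> keyword probability."""
    return [w for w, p in probabilities.items() if p > threshold]

probs = {"Ningbo": 0.81, "specialty": 0.72, "Shanghai": 0.35,
         "World Expo": 0.88, "occupy": 0.12, "a place": 0.44}
print(select_keywords(probs, 0.5))
# ['Ningbo', 'specialty', 'World Expo']
```

Raising the threshold shrinks this list (higher precision, lower recall), matching the trade-off described above.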
In the above keyword extraction method, during keyword extraction, for each target word determined in the text to be processed, the word vectors of the target word and of the words following it are input into the recurrent neural network in reverse order; the output result of each target word is obtained; the probability value of each target word is calculated from its output result; and the keywords of the text are determined according to the probability values of the target words and the preset threshold. Because target words are determined from the text to be processed, and the input to the recurrent neural network is, for each target word, the word vectors of the target word and of the words following it in reverse order, the recurrent neural network is applied repeatedly, once per target word, and its output result takes into account the following-context historical information of each target word, i.e., its semantic information. It can therefore be accurately determined whether a target word is a keyword, improving the accuracy of keyword extraction.
As shown in Fig. 5, in yet another embodiment, a keyword extraction method is provided. The method runs on the server 102 shown in Fig. 1 and includes the following steps:
S502: Obtain a text to be processed.
A user enters text through the user terminal, and the server obtains the text entered by the user over the communication network as the text to be processed.
S504: Preprocess the text to be processed to obtain the words in the text and determine the target words in the text.
The text to be processed is usually composed of individual characters. Compared with single characters, words express semantics better and carry more practical meaning. Specifically, the words of the text can be obtained by segmenting the text. Word segmentation is the process of recombining the consecutive characters of the text into a word sequence according to certain specifications. In one embodiment, every word in the text may be taken as a target word. In another embodiment, words with practical meaning may be extracted from the words of the text as target words.
S506: Obtain the word vector of each word in the text to be processed.
A word vector is a vector representation of a word, a way of mathematizing the words of a natural language. Word vectors can be obtained by training a language model on large-scale data. A common language model is Word2vec, which draws on ideas from deep learning and, through training, reduces the processing of text content to vector operations in a K-dimensional vector space. In a specific embodiment, the word vector of each word is obtained by training Word2vec on large-scale text data, and the word vector of each word in the text to be processed is obtained by lookup.
S508: For each target word in the text to be processed, input the word vector of the target word and the word vectors of the words following it, in reverse order, into the trained first recurrent neural network, and obtain the first result of each target word output by the first recurrent neural network.
In this embodiment, centered on each target word, the word vector of the target word and the word vectors of the words following it are input, in reverse order, into the trained first LSTM network; the first LSTM network is thus applied repeatedly, once per target word, to obtain the first result of each target word. Because each target word serves as the input to the last LSTM unit, the output result of each target word takes into account the following-context historical information of the target word, i.e., its semantic information. The first result of each target word output by the first LSTM network is the output of the last hidden layer (LSTM unit) of the first LSTM network.
S510: For each target word in the text to be processed, input the word vector of the target word and the word vectors of the words preceding it, in order, into the trained second recurrent neural network, and obtain the second result of each target word output by the second recurrent neural network.
In this embodiment, centered on each target word, the word vector of the target word and the word vectors of the words preceding it are input, in order, into the trained second LSTM network; the second LSTM network is thus applied repeatedly, once per target word, to obtain the second result of each target word. Because each target word serves as the input to the last unit of the second LSTM, the output result of each target word takes into account the preceding-context historical information of the target word, i.e., its semantic information. The second result of each target word output by the second LSTM network is the output of the last hidden layer (LSTM unit) of the second LSTM network.
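The second network's input ordering mirrors the first network's but runs forward: the words before the target word in their original order, with the target word again fed last. A sketch, with an illustrative word list:

```python
# Sketch of assembling the second LSTM's input for one target word:
# the words before the target word, in original order, followed by
# the target word itself, so the final hidden state is conditioned on
# the target word and its preceding context.
def forward_context_sequence(words, target_index):
    preceding = words[:target_index]
    return preceding + [words[target_index]]

words = ["Ningbo", "specialty", "Shanghai", "World Expo", "occupy", "a place"]
print(forward_context_sequence(words, words.index("World Expo")))
# ['Ningbo', 'specialty', 'Shanghai', 'World Expo']
```

Together with the reversed sequence fed to the first LSTM, every target word is thus scored with both its preceding and its following context.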
S512: Calculate the probability value of each target word based on the output result of the target word; the output result includes the first result and the second result. Specifically, the probability that each target word is a keyword is calculated from its output result.
In this embodiment, the output result of each target word is obtained from the first result of the target word output by the first recurrent neural network and the second result of the target word output by the second recurrent neural network. The first result takes into account the following-context historical information of the target word, and the second result takes into account its preceding-context historical information, so the output result of each target word carries information from both the preceding and the following directions. Since the output of a recurrent neural network is a vector, the output result of each target word jointly models the information in both directions of the target word into one vector, which can better represent the importance of the target word and thereby predict the probability of whether it is a keyword.
The output result of each target word can be obtained from its first result and second result by concatenation, addition, multiplication, averaging, dot product, taking the maximum, or similar operations.
In a specific embodiment, the output result of each target word can be obtained by concatenating its first result and second result. Concatenation means splicing the first result and the second result one after the other to obtain the output result of the target word. Specifically, the output of the hidden layer (LSTM unit) of the first LSTM network at the last moment is concatenated with the output of the hidden layer (LSTM unit) of the second LSTM network at the last moment to obtain the output result of each target word.
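With toy three-dimensional result vectors (arbitrary values, not real network outputs), the combination options can be sketched as:

```python
# Sketch of combining the two LSTM results for one target word.
# Concatenation is the mode used in the specific embodiment above;
# addition and averaging are two of the other modes mentioned.
first_result = [0.2, -0.5, 0.7]   # from the first (reverse-order) LSTM
second_result = [0.1, 0.4, -0.3]  # from the second (forward-order) LSTM

concatenated = first_result + second_result
summed = [a + b for a, b in zip(first_result, second_result)]
averaged = [(a + b) / 2 for a, b in zip(first_result, second_result)]

print(concatenated)  # [0.2, -0.5, 0.7, 0.1, 0.4, -0.3]
```

Note that concatenation doubles the dimensionality while the element-wise modes preserve it, which affects the size of the softmax layer that follows.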
To map the output result of each target word into the range 0 to 1 so that it represents the probability of the target word, a softmax function or a sigmoid function can be used. The softmax function is a common multi-class regression model. Judging whether a target word is a keyword is a two-class problem, so the corresponding softmax has two dimensions: the first dimension represents the probability of being a keyword, and the second dimension represents the probability of not being a keyword.
S514: Determine the keywords in the text to be processed according to the probability value of each target word and a preset threshold.
The probability value of each target word in the text being a keyword is compared with the preset threshold, and the target words whose probability values are greater than the preset threshold are determined as the keywords of the text.
In the above keyword extraction method, the word vector of each target word and the word vectors of the words following it are input in reverse order into the trained first neural network to obtain the first result of each target word, and the word vector of each target word and the word vectors of the words preceding it are input in order into the trained second neural network to obtain the second result of each target word. The first result takes into account the following-context historical information of the target word, and the second result takes into account its preceding-context historical information, so the output result of each target word carries information from both the preceding and the following directions. Since the output of a recurrent neural network is a vector, the output result of each target word jointly models the information in both directions into one vector, which can better represent the importance of the target word and thereby predict the probability of whether it is a keyword.
In one embodiment, before step S308 the method further includes a step of training the first recurrent neural network, which includes the following steps 1 to 4:
Step 1: Obtain texts for training and the keywords corresponding to each training text.
Step 2: Preprocess each training text to obtain the words in the training text and determine the target words in the training text.
Step 3: Obtain the word vector of each word in each training text.
Step 4: Train the first recurrent neural network according to the keywords corresponding to each training text and, input in reverse order, the word vector of each target word in the training text and the word vectors of the words following it.
In this embodiment, the target words may be the non-stop words in each training text, or all the words in each training text. Preferably, taking all the words in each training text as target words can improve the accuracy of training the recurrent neural network.
Specifically, let S be a training text and K be its keywords. During training, the word vector of the i-th target word of the training text and the word vectors of the words following it are input, in reverse order, into the first recurrent neural network, and then the word vector of the i-th keyword is input into the first recurrent neural network; the loss of each keyword is obtained, and the parameters of the first recurrent neural network are updated using gradient descent. The recurrent neural network in this embodiment uses LSTM. During training, the network parameters of the recurrent neural network are initialized with a Gaussian distribution, and the network is trained with stochastic gradient descent.
In another embodiment, before step S508 the method further includes a step of training the first recurrent neural network and the second recurrent neural network, which includes the following steps 1 to 4:
Step 1: Obtain texts for training and the keywords corresponding to each training text.
Step 2: Preprocess each training text to obtain the words in the training text and determine the target words in the training text.
Step 3: Obtain the word vector of each word in each training text.
Step 4: Train the first recurrent neural network and the second recurrent neural network according to the keywords corresponding to each training text; the word vector of each target word in the training text and the word vectors of the words following it, input in reverse order; and the word vector of each target word and the word vectors of the words preceding it, input in order.
Specifically, let S be a training text and K be its keywords. During training, the word vector of the i-th target word of the training text and the word vectors of the words following it are input, in reverse order, into the first recurrent neural network, and the word vector of the i-th target word and the word vectors of the words preceding it are input, in order, into the second recurrent neural network; then the word vector of the i-th keyword is input into the first recurrent neural network and the second recurrent neural network; the loss of each keyword is obtained, and the parameters of the first and second recurrent neural networks are updated using gradient descent. The recurrent neural networks in this embodiment use LSTM. During training, the network parameters of the recurrent neural networks are initialized with a Gaussian distribution, and the networks are trained with stochastic gradient descent.
The above keyword extraction method allows a machine to learn the features of keywords automatically, eliminating manual feature selection. By using two LSTM networks, the preceding-context information and the following-context information of a target word are input into the model simultaneously, and the information in both directions is modeled into one vector, which can better represent keywords. Furthermore, because the LSTM structure can accept input sequences of arbitrary length and store historical information well, the contextual information of the words in a sentence can be exploited more fully.
The keyword extraction method of the present invention is described below with reference to specific embodiments.
A model corresponding to one keyword extraction method is shown in Fig. 6 and comprises a first LSTM network and a Softmax layer. The input of the model is word vectors. The first LSTM network produces the output result of the target word, represented as a vector, and the Softmax layer computes the probability value of the corresponding target word from that output result. Comparing the probability value of the target word with a predetermined threshold value determines whether the target word is a keyword.
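The Softmax-plus-threshold decision described for this model can be sketched as follows. The two-class layout (index 1 = "keyword", index 0 = "not a keyword") and the 0.5 default threshold are assumptions made for illustration; the embodiment does not fix these values.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of scores
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_keyword(output_result, threshold=0.5):
    # output_result: the vector produced for a target word;
    # index 1 is taken to be the "keyword" class (an assumption)
    prob = softmax(output_result)[1]
    return prob, prob > threshold
```

A target word whose probability value exceeds the threshold is reported as a keyword, matching the comparison step described above.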
Take the pending text "What specialty of Ningbo can gain a foothold at the Shanghai World Expo?" as an example. After word segmentation, the determined target words include "Ningbo", "specialty", "Shanghai", "World Expo", "occupying" and "one seat". For each target word, the word vector of the target word and the word vectors of the words after it are input in reverse order into the trained first recurrent neural network to obtain the output result of that target word. For example, for the target word "World Expo", the corresponding word vectors are input into the recurrent neural network in the order "a place", "one seat", "occupying", "World Expo": the word vector of "a place" (the last word of the sentence) is input to the first LSTM unit of the first LSTM, the word vector of "one seat" to the second LSTM unit, and so on, with the word vector of the target word "World Expo" input to the last LSTM unit. Each LSTM unit is influenced by the output of the previous LSTM unit. The output of the first LSTM is the output vector of its last LSTM unit, so the output result takes into account the following-context history of each target word, that is, its semantic information, which makes it possible to determine accurately whether the target word is a keyword and improves the accuracy of keyword extraction.
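The input ordering in this example — the words after the target word fed starting from the end of the sentence, ending with the target word itself — can be sketched with a small helper. This is an illustrative reconstruction using a generic token list rather than the segmented Chinese sentence.

```python
def reverse_context_sequence(words, target_index):
    # Words after the target word, in reverse order, followed by the
    # target word itself: the last word of the sentence reaches the
    # first LSTM unit, and the target word reaches the last LSTM unit.
    return list(reversed(words[target_index + 1:])) + [words[target_index]]
```

The resulting token sequence is what gets mapped to word vectors and fed, one per LSTM unit, into the first recurrent neural network.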
Another model corresponding to the keyword extraction method is shown in Fig. 7 and comprises a first LSTM network LSTM_R, a second LSTM network LSTM_L and a Softmax layer. The input of the model is word vectors. The LSTM_R network outputs the first result of the target word, and the LSTM_L network outputs the second result of the target word; connecting the first result and the second result yields the output result of the target word, represented as a vector, and the Softmax layer computes the probability value of the corresponding target word from that output result. Comparing the probability value of the target word with a predetermined threshold value determines whether the target word is a keyword.
Take the pending text "What specialty of Ningbo can gain a foothold at the Shanghai World Expo?" as an example again. After word segmentation, the determined target words include "Ningbo", "specialty", "Shanghai", "World Expo", "occupying" and "one seat". For each target word, the word vector of the target word and the word vectors of the words after it are input in reverse order into the trained LSTM_R, and the word vector of the target word and the word vectors of the words before it are input in order into the trained LSTM_L, giving the first result and the second result of each target word; connecting the first result and the second result yields the output result of each target word. For example, for the target word "World Expo", the corresponding word vectors are input into LSTM_R in the order "a place", "one seat", "occupying", "World Expo": the word vector of "a place" goes to the first LSTM unit of LSTM_R, the word vector of "one seat" to the second LSTM unit, and so on, with the target word "World Expo" entering the last LSTM unit of LSTM_R. In the order "Ningbo", "has", "what", "specialty", "can", "at", "Shanghai", the corresponding word vectors are input into LSTM_L: the word vector of "Ningbo" goes to the first LSTM unit of LSTM_L, the word vector of "has" to the second LSTM unit, and so on, with the target word "World Expo" input to the last LSTM unit of LSTM_L. Each LSTM unit is influenced by the output of the previous LSTM unit, and the output of each LSTM is the output vector of its last LSTM unit, so the output result takes into account both the following-context history and the preceding-context history of each target word. The output result of each target word thus models the information in both directions around the target word jointly into a single vector, which represents the importance of the target word better and predicts the probability of it being a keyword.
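The forward pass for the second LSTM network (LSTM_L in Fig. 7) and the connection of the two results can be sketched the same way. This is again an illustrative reconstruction; `combine` models "connecting" the first and second results as vector concatenation, consistent with how the embodiment describes obtaining the output result.

```python
def forward_context_sequence(words, target_index):
    # Words before the target word, in order, ending with the target word
    return words[:target_index + 1]

def combine(first_result, second_result):
    # "Connect" the two output vectors of a target word into one vector
    return list(first_result) + list(second_result)
```

The concatenated vector then goes through the Softmax layer to give the target word's probability value.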
In one embodiment, a keyword extraction device is provided. As shown in Fig. 8, it includes an acquisition module 801, a preprocessing module 802, a conversion module 803, a first recurrent-neural-network processing module 804, a computing module 805 and a keyword determining module 806.
The acquisition module 801 is configured to obtain the pending text.
The preprocessing module 802 is configured to preprocess the pending text to obtain the words in the pending text and determine the target words in the pending text.
The conversion module 803 is configured to obtain the word vector of each word in the pending text.
The first recurrent-neural-network processing module 804 is configured to obtain the first result of each target word according to the word vector of each target word in the pending text and the word vectors of the words after that target word, input in reverse order.
The computing module 805 is configured to calculate the probability value of each target word based on the output result of that target word, the output result including the first result.
The keyword determining module 806 is configured to determine the keywords in the pending text according to the probability value of each target word and a predetermined threshold value.
With the above keyword extraction device, during keyword extraction the target words in the pending text are determined; for each target word, the word vectors of the target word and of the words after it are input in reverse order into a recurrent neural network to obtain the output result of that target word; the probability value of each target word is calculated from its output result; and the keywords in the pending text are determined according to the probability values of the target words and a predetermined threshold value. Because the target words are determined from the pending text, the input to the recurrent neural network is the reversed word vectors of each target word and the words after it, and the recurrent neural network is used repeatedly according to the number of target words, the output result takes into account the following-context history of each target word, that is, its semantic information, so whether a target word is a keyword can be determined accurately, improving the accuracy of keyword extraction.
In another embodiment, a keyword extraction device is provided. As shown in Fig. 9, it includes an acquisition module 901, a preprocessing module 902, a conversion module 903, a first recurrent-neural-network processing module 904, a second recurrent-neural-network processing module 905, an output processing module 906, a computing module 907 and a keyword determining module 908.
The acquisition module 901 is configured to obtain the pending text.
The preprocessing module 902 is configured to preprocess the pending text to obtain the words in the pending text and determine the target words in the pending text.
The conversion module 903 is configured to obtain the word vector of each word in the pending text.
The first recurrent-neural-network processing module 904 is configured to obtain the first result of each target word according to the word vector of each target word in the pending text and the word vectors of the words after that target word, input in reverse order.
The second recurrent-neural-network processing module 905 is configured to obtain the second result of each target word according to the word vector of each target word in the pending text and the word vectors of the words before that target word, input in order.
The output processing module 906 is configured to obtain the output result of each target word based on the first result and the second result of that target word; specifically, the first result and the second result of each target word are connected to obtain the output result of that target word.
The computing module 907 is configured to calculate the probability value of each target word based on the output result of that target word.
The keyword determining module 908 is configured to determine the keywords in the pending text according to the probability value of each target word and a predetermined threshold value.
With the above keyword extraction device, the word vector of each target word and the word vectors of the words after it are input in reverse order into the trained first recurrent neural network to obtain the first result of each target word, and the word vector of each target word and the word vectors of the words before it are input in order into the trained second recurrent neural network to obtain the second result of each target word. The first result takes into account the following-context history of the target word, and the second result takes into account its preceding-context history, so the output result of each target word carries the information in both directions around the target word. Because the output of a recurrent neural network is a vector, the output result of each target word models the information in both directions jointly into a single vector, which represents the importance of the target word better and predicts the probability of it being a keyword.
In another embodiment, referring to Fig. 10, the preprocessing module 802 includes a word segmentation module 8021 and an identification module 8022.
The word segmentation module 8021 is configured to perform word segmentation on the pending text to obtain the words in the pending text.
The identification module 8022 is configured to identify the stop words in the pending text and determine the words in the pending text other than the stop words as the target words.
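The two sub-modules can be sketched together. Real Chinese text would require a proper word-segmentation tool, so this sketch substitutes whitespace tokenization and a made-up English stop-word list purely for illustration; neither stands in the patent.

```python
STOP_WORDS = {"what", "has", "can", "at", "of", "a"}  # illustrative only

def preprocess(text):
    # Word segmentation module: split the pending text into words
    # (a stand-in for a real segmenter)
    words = text.split()
    # Identification module: words other than stop words become target words
    target_words = [w for w in words if w not in STOP_WORDS]
    return words, target_words
```

All segmented words still receive word vectors; only the non-stop words are scored as candidate keywords, which reduces the number of recurrent-network passes needed.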
In another embodiment, still referring to Fig. 10, the keyword extraction device further includes a first training module 808.
The acquisition module 801 is further configured to obtain training texts and the keyword corresponding to each training text.
The preprocessing module 802 is further configured to preprocess the training texts to obtain the words in each training text and determine the target words in each training text.
The conversion module 803 is further configured to obtain the word vector of each word in each training text.
The first training module 808 is configured to train the first recurrent neural network according to the keyword corresponding to each training text, using the word vector of each target word in each training text and the word vectors of the words after that target word, input in reverse order.
In another embodiment, as shown in Fig. 11, the keyword extraction device further includes a second training module 909.
The acquisition module 901 is further configured to obtain training texts and the keyword corresponding to each training text.
The preprocessing module 902 is further configured to preprocess the training texts to obtain the words in each training text and determine the target words in each training text.
The conversion module 903 is further configured to obtain the word vector of each word in each training text.
The second training module 909 is configured to train the first recurrent neural network and the second recurrent neural network according to the keyword corresponding to each training text, using the word vector of each target word in each training text and the word vectors of the words after that target word, input in reverse order, and the word vector of each target word in each training text and the word vectors of the words before that target word, input in order.
With the above keyword extraction device, the machine learns the features of keywords automatically, removing the need for manual feature selection. By using two LSTM networks, the preceding context and the following context of the target word are fed into the model simultaneously, and the information from both directions is modeled into a single vector, which represents a keyword better. Moreover, because of its structure, an LSTM can accept input sequences of arbitrary length and is better at storing historical information, so it makes fuller use of the contextual information of the words in a sentence.
Those of ordinary skill in the art will appreciate that all or part of the flows in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium; in the embodiments of the present invention, the program can be stored in the storage medium of a computer system and executed by at least one processor of the computer system to implement the flows of the method embodiments above. The storage medium can be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (12)

1. A keyword extraction method, characterized in that it comprises:
obtaining a pending text;
preprocessing the pending text to obtain the words in the pending text and determining the target words in the pending text;
obtaining the word vector of each word in the pending text;
inputting, in reverse order, the word vector of each target word in the pending text and the word vectors of the words after that target word into a trained first recurrent neural network, and obtaining the first result of each target word output by the first recurrent neural network;
calculating the probability value of each target word based on the output result of that target word, the output result including the first result; and
determining the keywords in the pending text according to the probability value of each target word and a predetermined threshold value.
2. The method according to claim 1, characterized in that, before the step of calculating the probability value of each target word based on the output result of that target word, the method further comprises: inputting, in order, the word vector of each target word in the pending text and the word vectors of the words before that target word into a trained second recurrent neural network, and obtaining the second result of each target word output by the second recurrent neural network;
the output result further including the second result.
3. The method according to claim 2, characterized in that the first result and the second result of each target word are connected to obtain the output result of that target word.
4. The method according to claim 1, characterized in that the step of preprocessing the pending text to obtain the words in the pending text and determining the target words in the pending text comprises:
performing word segmentation on the pending text to obtain the words in the pending text; and
identifying the stop words in the pending text, and determining the words in the pending text other than the stop words as the target words.
5. The method according to claim 1, characterized in that, before the step of inputting, in reverse order, the word vector of each target word in the pending text and the word vectors of the words after that target word into the trained first recurrent neural network and obtaining the first result of each target word output by the first recurrent neural network, the method further comprises:
obtaining training texts and the keyword corresponding to each training text;
preprocessing the training texts to obtain the words in each training text and determining the target words in each training text;
obtaining the word vector of each word in each training text; and
training the first recurrent neural network according to the keyword corresponding to each training text, using the word vector of each target word in each training text and the word vectors of the words after that target word, input in reverse order.
6. The method according to claim 2, characterized in that, before the step of inputting, in reverse order, the word vector of each target word in the pending text and the word vectors of the words after that target word into the trained first recurrent neural network and obtaining the first result of each target word output by the first recurrent neural network, the method further comprises:
obtaining training texts and the keyword corresponding to each training text;
preprocessing the training texts to obtain the words in each training text and determining the target words in each training text;
obtaining the word vector of each word in each training text; and
training the first recurrent neural network and the second recurrent neural network according to the keyword corresponding to each training text, using the word vector of each target word in each training text and the word vectors of the words after that target word, input in reverse order, and the word vector of each target word in each training text and the word vectors of the words before that target word, input in order.
7. A keyword extraction device, characterized in that it comprises:
an acquisition module, configured to obtain a pending text;
a preprocessing module, configured to preprocess the pending text to obtain the words in the pending text and determine the target words in the pending text;
a conversion module, configured to obtain the word vector of each word in the pending text;
a first recurrent-neural-network processing module, configured to obtain the first result of each target word according to the word vector of each target word in the pending text and the word vectors of the words after that target word, input in reverse order;
a computing module, configured to calculate the probability value of each target word based on the output result of that target word, the output result including the first result; and
a keyword determining module, configured to determine the keywords in the pending text according to the probability value of each target word and a predetermined threshold value.
8. The device according to claim 7, characterized in that it further comprises a second recurrent-neural-network processing module, configured to obtain the second result of each target word according to the word vector of each target word in the pending text and the word vectors of the words before that target word, input in order;
the computing module being configured to calculate the probability value of each target word based on the output result of that target word, the output result including the first result and the second result.
9. The device according to claim 8, characterized in that it further comprises an output processing module, configured to connect the first result and the second result of each target word to obtain the output result of that target word.
10. The device according to claim 7, characterized in that the preprocessing module includes a word segmentation module and an identification module;
the word segmentation module being configured to perform word segmentation on the pending text to obtain the words in the pending text; and
the identification module being configured to identify the stop words in the pending text and determine the words in the pending text other than the stop words as the target words.
11. The device according to claim 7, characterized in that it further comprises a first training module;
the acquisition module being further configured to obtain training texts and the keyword corresponding to each training text;
the preprocessing module being further configured to preprocess the training texts to obtain the words in each training text and determine the target words in each training text;
the conversion module being further configured to obtain the word vector of each word in each training text; and
the first training module being configured to train the first recurrent neural network according to the keyword corresponding to each training text, using the word vector of each target word in each training text and the word vectors of the words after that target word, input in reverse order.
12. The device according to claim 8, characterized in that it further comprises a second training module;
the acquisition module being further configured to obtain training texts and the keyword corresponding to each training text;
the preprocessing module being further configured to preprocess the training texts to obtain the words in each training text and determine the target words in each training text;
the conversion module being further configured to obtain the word vector of each word in each training text; and
the second training module being configured to train the first recurrent neural network and the second recurrent neural network according to the keyword corresponding to each training text, using the word vector of each target word in each training text and the word vectors of the words after that target word, input in reverse order, and the word vector of each target word in each training text and the word vectors of the words before that target word, input in order.
CN201710101012.7A 2017-02-23 2017-02-23 keyword extracting method and device Pending CN108304364A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710101012.7A CN108304364A (en) 2017-02-23 2017-02-23 keyword extracting method and device


Publications (1)

Publication Number Publication Date
CN108304364A true CN108304364A (en) 2018-07-20

Family

ID=62872340




Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095749A (en) * 2016-06-03 2016-11-09 杭州量知数据科技有限公司 A kind of text key word extracting method based on degree of depth study
US9508340B2 (en) * 2014-12-22 2016-11-29 Google Inc. User specified keyword spotting using long short term memory neural network feature extractor
CN106383817A (en) * 2016-09-29 2017-02-08 北京理工大学 Paper title generation method capable of utilizing distributed semantic information


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, XUXIANG: "Research on Keyword Extraction from Questions for Question Answering", China Master's Theses Full-text Database *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851574A (en) * 2018-07-27 2020-02-28 北京京东尚科信息技术有限公司 Statement processing method, device and system
CN109145107A (en) * 2018-09-27 2019-01-04 平安科技(深圳)有限公司 Subject distillation method, apparatus, medium and equipment based on convolutional neural networks
CN109145107B (en) * 2018-09-27 2023-07-25 平安科技(深圳)有限公司 Theme extraction method, device, medium and equipment based on convolutional neural network
CN109471938B (en) * 2018-10-11 2023-06-16 平安科技(深圳)有限公司 Text classification method and terminal
CN109471938A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 A kind of file classification method and terminal
CN109635273A (en) * 2018-10-25 2019-04-16 平安科技(深圳)有限公司 Text key word extracting method, device, equipment and storage medium
WO2020082560A1 (en) * 2018-10-25 2020-04-30 平安科技(深圳)有限公司 Method, apparatus and device for extracting text keyword, as well as computer readable storage medium
CN109388806A (en) * 2018-10-26 2019-02-26 北京布本智能科技有限公司 A kind of Chinese word cutting method based on deep learning and forgetting algorithm
CN109388806B (en) * 2018-10-26 2023-06-27 北京布本智能科技有限公司 Chinese word segmentation method based on deep learning and forgetting algorithm
CN110110330A (en) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 Text based keyword extracting method and computer equipment
CN110110330B (en) * 2019-04-30 2023-08-11 腾讯科技(深圳)有限公司 Keyword extraction method based on text and computer equipment
CN110866393A (en) * 2019-11-19 2020-03-06 北京网聘咨询有限公司 Resume information extraction method and system based on domain knowledge base
CN112749251A (en) * 2020-03-09 2021-05-04 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN112749251B (en) * 2020-03-09 2023-10-31 腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium
CN112528655A (en) * 2020-12-18 2021-03-19 北京百度网讯科技有限公司 Keyword generation method, device, equipment and storage medium
CN112528655B (en) * 2020-12-18 2023-12-29 北京百度网讯科技有限公司 Keyword generation method, device, equipment and storage medium
US11899699B2 (en) 2020-12-18 2024-02-13 Beijing Baidu Netcom Science Technology Co., Ltd. Keyword generating method, apparatus, device and storage medium

Similar Documents

Publication Publication Date Title
CN108304364A (en) keyword extracting method and device
US10963637B2 (en) Keyword extraction method, computer equipment and storage medium
Bang et al. Explaining a black-box by using a deep variational information bottleneck approach
CN106156003B (en) A kind of question sentence understanding method in question answering system
CN110534087A (en) A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN104598611B (en) The method and system being ranked up to search entry
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN109977416A (en) A kind of multi-level natural language anti-spam text method and system
CN109241255A (en) A kind of intension recognizing method based on deep learning
CN108804677A (en) In conjunction with the deep learning question classification method and system of multi-layer attention mechanism
CN108334487A (en) Lack semantics information complementing method, device, computer equipment and storage medium
CN110083700A (en) A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks
CN110489755A (en) Document creation method and device
CN110502753A (en) A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN110427461A (en) Intelligent answer information processing method, electronic equipment and computer readable storage medium
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN107437100A (en) A kind of picture position Forecasting Methodology based on the association study of cross-module state
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN110287489A (en) Document creation method, device, storage medium and electronic equipment
CN106682089A (en) RNNs-based method for automatic safety checking of short message
CN113254782B (en) Question-answering community expert recommendation method and system
CN110516035A (en) A kind of man-machine interaction method and system of mixing module
CN111309887A (en) Method and system for training text key content extraction model
CN105975497A (en) Automatic microblog topic recommendation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180720