CN110110330A - Text based keyword extracting method and computer equipment - Google Patents
- Publication number
- CN110110330A (application CN201910360872.1A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- vector
- analyzed
- term vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
This application discloses a text-based keyword extraction method and a computer device, belonging to the field of artificial intelligence, for efficiently mining the keywords in a text. The method employs a Seq2seq network structure comprising an encoder, a decoder, and a neural network module with an attention mechanism that adjusts the encoder's output. The entire text is taken as input, so the neural network can understand the text's contextual information. Because no feature vectors need to be extracted, the method avoids the trouble of hand-crafting features from text as in TextRank. Since no subjective feature abstraction is required, the implementation is relatively simple, keyword extraction applies to both long and short texts, and the results are more stable. In addition, the method outputs vectors rather than keywords directly, giving it good generalization ability. Furthermore, the attention mechanism makes keyword mining more accurate.
Description
Technical field
This application relates to the field of artificial intelligence, and in particular to a text-based keyword extraction method and a computer device.
Background technique
To facilitate understanding and retrieval, the meaning of a text is usually expressed by a few keywords. Because different words differ in their ability to express semantics, they also differ in how well they reflect the text's gist. How to extract keywords that express the gist of a text is an important topic in the field of natural language processing. Keyword extraction is also widely used in fields such as content recommendation and semantic search.
In the related art, indicators for measuring word importance include TF-IDF (term frequency-inverse document frequency), TextRank (an automatic summarization algorithm), and classification-based methods. TF-IDF counts a word's importance to a text via document-frequency weighting; TextRank computes word importance from the contextual relations between words; classification-based methods convert keyword mining into a classification problem, dividing the words of a text into keywords and non-keywords through feature extraction, neural network training, and neural network prediction. However, each of these methods has drawbacks and performs unsatisfactorily in practice.
Summary of the invention
Embodiments of the present application provide a text-based keyword extraction method and a computer device for intelligently and relatively accurately extracting keywords.
In one aspect, a text-based keyword extraction method is provided, the method comprising:
constructing a matrix of the text to be analyzed, the matrix including word vectors of the segmented words arranged in sequence, wherein the order is the order of the word vectors in the text to be analyzed;
inputting the matrix of the text to be analyzed into a pre-trained Seq2seq (sequence-to-sequence) neural network to obtain an output matrix including at least one output vector; wherein the Seq2seq neural network is trained on a corpus annotated with keywords, and during training, when the input of the Seq2seq neural network is the matrix of a training text, the output is the matrix formed by the training text's keywords, each vector in the keyword matrix corresponding to a keyword;
determining the keywords of the text to be analyzed according to the correspondence between output vectors and keywords.
Optionally, the Seq2seq neural network includes an encoder, a decoder, and a neural network module with an attention mechanism; the encoder and decoder are recurrent neural networks, and the neural network module with the attention mechanism is used to adjust the encoder's encoding result for each word vector.
Optionally, inputting the matrix of the text to be analyzed into the pre-trained Seq2seq neural network to obtain the output matrix includes:
inputting the word vectors in the matrix of the text to be analyzed into the encoder one by one, in their order in the text, to obtain the state of each input word vector;
inputting the encoder's currently input word vector and the state of the word vector preceding it into the neural network module with the attention mechanism, to obtain a weight parameter for the preceding word vector;
multiplying the weight parameter of the preceding word vector by its state to obtain the adjusted state of the preceding word vector;
inputting the adjusted state of each word vector into the decoder in sequence, to obtain the output matrix.
Optionally, the neural network module with the attention mechanism includes, connected in series, a fully connected layer, a random dropout layer, and a softmax normalization layer;
the fully connected layer processes the encoder's currently input word vector and the state of the preceding word vector;
the random dropout layer processes the output of the fully connected layer;
the softmax layer normalizes the output of the dropout layer to obtain the weight parameter of the preceding word vector.
Optionally, constructing the matrix of the text to be analyzed comprises:
performing word segmentation on the text to be analyzed to obtain the segmented words;
converting each segmented word into a word vector;
constructing the matrix from the word vectors in the order the segmented words appear in the text to be analyzed.
Optionally, determining the keywords of the text to be analyzed according to the correspondence between output vectors and keywords comprises:
searching the keyword vector set for the vector closest to the output vector;
determining the keyword corresponding to the found vector as a keyword of the text to be analyzed.
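A minimal nearest-vector lookup over a hypothetical keyword vector set (the keyword names, embedding values, and the choice of Euclidean distance are invented for illustration) might look like:

```python
import numpy as np

# Hypothetical keyword vector set: keyword -> embedding.
keyword_vectors = {
    "neural network": np.array([0.9, 0.1]),
    "keyword":        np.array([0.1, 0.9]),
    "text":           np.array([0.5, 0.5]),
}

def nearest_keyword(output_vec):
    """Return the keyword whose vector is closest to output_vec."""
    return min(keyword_vectors,
               key=lambda k: np.linalg.norm(keyword_vectors[k] - output_vec))

print(nearest_keyword(np.array([0.8, 0.2])))  # -> neural network
```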
Optionally, determining the keyword corresponding to the found vector as a keyword of the text to be analyzed comprises:
for each keyword corresponding to a vector found in the keyword vector set, if the keyword appears in the text to be analyzed, determining it as a keyword of the text to be analyzed; if it does not appear in the text to be analyzed, discarding it.
Optionally, the method further comprises:
if the number of keywords of the text to be analyzed is greater than a preset quantity, removing some of them so that the remaining number equals the preset quantity.
Optionally, the method further comprises:
if the number of keywords of the text to be analyzed is less than the preset quantity, searching the keyword vector set for keywords similar to the keywords of the text to be analyzed;
determining the similar keywords found as additional keywords of the text to be analyzed.
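Both branches (trimming surplus keywords and padding with similar ones from the keyword vector set) can be sketched as follows; the trimming policy and the cosine-similarity criterion are assumptions, since the text does not specify them:

```python
import numpy as np

def adjust_keyword_count(keywords, preset, keyword_vectors):
    """Trim to `preset` keywords, or pad with the most similar entries
    from the keyword vector set (cosine similarity; a sketch only)."""
    if len(keywords) >= preset:
        return keywords[:preset]          # drop surplus (ranking policy assumed)
    extras = []
    for cand, vec in keyword_vectors.items():
        if cand in keywords:
            continue
        # similarity of the candidate to the closest existing keyword
        sims = [float(vec @ keyword_vectors[k]) /
                (np.linalg.norm(vec) * np.linalg.norm(keyword_vectors[k]))
                for k in keywords if k in keyword_vectors]
        extras.append((max(sims, default=0.0), cand))
    extras.sort(reverse=True)
    return keywords + [c for _, c in extras[: preset - len(keywords)]]

kv = {"text": np.array([1.0, 0.0]),
      "document": np.array([0.9, 0.1]),
      "banana": np.array([0.0, 1.0])}
print(adjust_keyword_count(["text"], 2, kv))  # -> ['text', 'document']
```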
In a second aspect, an embodiment of the present application further provides a text-based keyword extraction apparatus, the apparatus comprising:
a text matrix construction unit, for constructing the matrix of the text to be analyzed, the matrix including word vectors of the segmented words arranged in sequence, wherein the order is the order of the word vectors in the text to be analyzed;
an output matrix determination unit, for inputting the matrix of the text to be analyzed into a pre-trained Seq2seq neural network to obtain an output matrix including at least one output vector; wherein the Seq2seq neural network is trained on a corpus annotated with keywords, and during training, when the input of the Seq2seq neural network is the matrix of a training text, the output is the matrix formed by the training text's keywords, each vector in the keyword matrix corresponding to a keyword;
a keyword determination unit, for determining the keywords of the text to be analyzed according to the correspondence between output vectors and keywords.
Optionally, the Seq2seq neural network includes an encoder, a decoder, and a neural network module with an attention mechanism; the encoder and decoder are recurrent neural networks, and the neural network module with the attention mechanism is used to adjust the encoder's encoding result for each word vector.
Optionally, the output matrix determination unit is used to:
input the word vectors in the matrix of the text to be analyzed into the encoder one by one, in their order in the text, to obtain the state of each input word vector;
input the encoder's currently input word vector and the state of the word vector preceding it into the neural network module with the attention mechanism, to obtain a weight parameter for the preceding word vector;
multiply the weight parameter of the preceding word vector by its state to obtain the adjusted state of the preceding word vector;
input the adjusted state of each word vector into the decoder in sequence, to obtain the output matrix.
Optionally, the neural network module with the attention mechanism includes, connected in series, a fully connected layer, a random dropout layer, and a softmax normalization layer;
the fully connected layer processes the encoder's currently input word vector and the state of the preceding word vector;
the random dropout layer processes the output of the fully connected layer;
the softmax layer normalizes the output of the dropout layer to obtain the weight parameter of the preceding word vector.
Optionally, the text matrix construction unit is used to:
perform word segmentation on the text to be analyzed to obtain the segmented words;
convert each segmented word into a word vector;
construct the matrix from the word vectors in the order the segmented words appear in the text to be analyzed.
Optionally, the keyword determination unit is used to:
search the keyword vector set for the vector closest to the output vector;
determine the keyword corresponding to the found vector as a keyword of the text to be analyzed.
Optionally, the keyword determination unit is used to:
for each keyword corresponding to a vector found in the keyword vector set, if the keyword appears in the text to be analyzed, determine it as a keyword of the text to be analyzed; if it does not appear in the text to be analyzed, discard it.
Optionally, the apparatus further includes:
a filter unit, used to remove some keywords of the text to be analyzed if their number is greater than a preset quantity, so that the remaining number equals the preset quantity.
Optionally, the apparatus further includes:
an expansion unit, used to search the keyword vector set for keywords similar to the keywords of the text to be analyzed if their number is less than the preset quantity, and to determine the similar keywords found as additional keywords of the text to be analyzed.
In a third aspect, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the method steps of the above aspect when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, storing computer instructions that, when run on a computer, enable the computer to execute the method described in the above aspect.
Embodiments of the present application provide a keyword extraction method that uses a Seq2seq network structure comprising an encoder and a decoder. The entire text is taken as the input of the Seq2seq neural network, so the network can understand the text's contextual information. In addition, the method requires no feature-vector extraction, avoiding the trouble of hand-crafting features from text as in TextRank. Since no subjective feature abstraction is needed, the implementation is relatively simple, keyword extraction applies to both long and short texts, and the results are more stable. Moreover, the method outputs vectors rather than keywords directly, giving it good generalization ability.
Detailed description of the invention
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a first schematic structural diagram of a Seq2seq neural network provided by an embodiment of the present application;
Fig. 2 is a second schematic structural diagram of the Seq2seq neural network provided by an embodiment of the present application;
Fig. 3 is a third schematic structural diagram of the Seq2seq neural network provided by an embodiment of the present application;
Fig. 4 is a general flowchart of the keyword extraction processing algorithm provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of training the Seq2seq neural network provided by an embodiment of the present application;
Fig. 6 is a schematic flowchart of the text-based keyword extraction method provided by an embodiment of the present application;
Fig. 7 is another schematic flowchart of the text-based keyword extraction method provided by an embodiment of the present application;
Fig. 8 is a fourth schematic structural diagram of the Seq2seq neural network provided by an embodiment of the present application;
Figs. 9-11 are effect display diagrams of the text-based keyword extraction method provided by an embodiment of the present application;
Fig. 12 is a schematic structural diagram of the text-based keyword extraction apparatus provided by an embodiment of the present application;
Fig. 13 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
Specific embodiment
To make the purposes, technical solutions, and advantages of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on these embodiments, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application. Where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another in any way. Moreover, although a logical order is shown in the flowcharts, in some cases the steps may be performed in an order different from that shown or described herein.
To facilitate understanding of the technical solutions provided by the embodiments of the present application, some key terms used in the embodiments are first explained:
Text: the written form of language; from a literary perspective, usually a combination of one or more sentences with complete, systematic meaning. A text can be a sentence, a paragraph, or a discourse.
Keyword extraction: the technique of automatically extracting a text's keywords by computer.
APP: the abbreviation of "application"; here, an application program installed on a smart device.
Attention mechanism: derived from research on human vision. In cognitive science, because of bottlenecks in information processing, humans selectively attend to part of the available information while ignoring the rest; this is commonly known as the attention mechanism. Different positions of the human retina have different degrees of information processing capability, i.e., acuity; only the fovea has the highest acuity. To make rational use of limited visual processing resources, humans select a specific part of the visual field and then focus on it. For example, when reading, usually only a small number of the words to be read are attended to and processed. In summary, the attention mechanism has two main aspects: deciding which part of the input needs attention, and allocating limited processing resources to the important parts. In cognitive neuroscience, attention is a complex cognitive function indispensable to humans, referring to the ability to focus on some information while ignoring other information. In daily life, people receive large amounts of sensory input through vision, hearing, touch, and other modalities, yet the brain works without confusion under this bombardment of external information because it can, consciously or unconsciously, select a small fraction of useful information from the massive input for focused processing while ignoring the rest. This ability is called attention. Attention can be directed at external stimuli (auditory, visual, gustatory, etc.) or at internal consciousness (thinking, recollection, etc.).
In addition, the term "and/or" herein merely describes an association between related objects and indicates that three relationships may exist; for example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" herein, unless otherwise specified, generally indicates an "or" relationship between the objects before and after it.
In the related art, the TF-IDF method measures whether a word can serve as a text's keyword only from the angle of word frequency; it fails to incorporate the text's contextual information, so the extracted keywords have a limited scope of application. Classification methods are relatively difficult to implement when performing feature abstraction on text, and their keyword extraction likewise fails to consider contextual information. Although TextRank does combine the text's contextual information, its feature abstraction process is complex to implement and requires subjective human involvement, so it performs poorly and unstably on short texts and small corpora.
In view of this, an embodiment of the present application provides a keyword extraction method that uses a Seq2seq (sequence-to-sequence) network structure comprising an encoder and a decoder. The entire text is taken as the input of the Seq2seq neural network, so the network can understand the text's contextual information. In addition, the method requires no feature-vector extraction, avoiding the trouble of hand-crafting features from text as in TextRank. Since no subjective feature abstraction is needed, the implementation is relatively simple, keyword extraction applies to both long and short texts, and the results are more stable. Moreover, the method outputs vectors rather than keywords directly, giving it good generalization ability.
Having introduced the design concept of the embodiments of the present application, their implementation is further explained below.
One, Seq2seq neural network training
This section mainly introduces the composition of the Seq2seq neural network in the embodiments of the present application, and how to train this Seq2seq neural network to perform keyword mining.
Fig. 1 is a schematic structural diagram of the Seq2seq neural network, which includes an encoder 11 and a decoder 12. The encoder encodes the input data; the decoder processes the encoder's output and produces output vectors, each of which corresponds to a keyword.
During training, texts annotated with keywords are first obtained as the corpus. The selected corpus may include texts of different lengths. For each training text in the corpus, its matrix is constructed. Concretely, the training text is first segmented into words; each word is then converted into a word vector; finally, the word vectors are assembled into a matrix in the order the words appear in the training text. That is, the matrix includes the word vectors of the segmented words arranged in sequence, where the order is the order of the corresponding words in the training text. Correspondingly, the matrix of the text's keywords is constructed. In the text matrix, one vector corresponds to one segmented word; in the keyword matrix, one vector corresponds to one keyword.
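Under the assumption of a whitespace tokenizer and random placeholder embeddings (a real pipeline would use a proper segmenter and trained word2vec vectors), constructing the text matrix and keyword matrix for one training pair might look like:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy embedding table standing in for word2vec vectors (random placeholders).
vocab = ["deep", "learning", "extracts", "keywords", "from", "text"]
embeddings = {w: rng.normal(size=3) for w in vocab}

def text_to_matrix(text):
    """Split into tokens (whitespace here; a real system would use a
    segmenter) and stack the tokens' vectors in their original order."""
    tokens = text.lower().split()
    return np.stack([embeddings[t] for t in tokens])

X = text_to_matrix("Deep learning extracts keywords from text")  # input matrix
Y = text_to_matrix("keywords")                                   # keyword matrix
print(X.shape, Y.shape)  # one row per word / per keyword
```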
The text matrix is then used as the input of the Seq2seq neural network and the corresponding keyword matrix as its output, and the Seq2seq neural network is trained.
Further, to strengthen the segmented words that can serve as keywords and weaken those that cannot, the embodiments of the present application also introduce an attention mechanism into the Seq2seq neural network.
Fig. 2 shows another schematic structural diagram of the Seq2seq neural network, including the encoder 11 and decoder 12 as well as a neural network module 13 with an attention mechanism. This second neural network, the one with the attention mechanism, mainly serves to adjust the encoder's output so that important words in the text are strengthened and unimportant ones weakened. In this way, when the encoder's adjusted encoding result is fed to the decoder, important keywords can be mined more accurately.
In a specific implementation, as shown in Fig. 3, the aforementioned neural network module with the attention mechanism includes, connected in series, a fully connected layer 31, a random dropout layer 32, and a softmax normalization layer 33, wherein:
the fully connected layer processes the encoder's currently input word vector and the state of the preceding word vector;
the random dropout layer processes the output of the fully connected layer;
the softmax layer normalizes the output of the dropout layer to obtain the weight parameter of the preceding word vector.
In brief, the processing flow in the embodiments of the present application may include the four stages shown in Fig. 4:
Data preprocessing: segment the text and obtain the word vector of each segmented word.
Seq2seq neural network training: train the Seq2seq neural network on texts annotated with keywords to obtain a network that can extract keywords.
Seq2seq neural network prediction: use the trained Seq2seq neural network to mine the word vectors of the candidate keywords of the text to be analyzed (discussed in detail below).
Result post-processing: determine the keywords of the text to be analyzed from the vectors predicted by the Seq2seq neural network.
For example, as shown in Fig. 5, keywords are manually annotated for a batch of texts, which serve as the training corpus. Each text in the corpus is segmented to obtain a sequence of words, which is converted into word vectors to obtain the text sequence (labeled A); the keywords of each article are likewise converted into word vectors to obtain the keyword sequence (labeled B). A is then fed to the Seq2seq neural network for training, so that the Seq2seq neural network learns to output the text's keyword sequence B.
Two, Seq2seq neural network prediction
This part mainly introduces how to extract keywords with the Seq2seq neural network trained above. As shown in Fig. 6, a schematic flowchart of the method, it may include the following steps:
Step 601: construct the matrix of the text to be analyzed, the matrix including word vectors of the segmented words arranged in sequence, wherein the order is the order of the word vectors in the text to be analyzed.
In one embodiment, the text to be analyzed may be segmented to obtain the segmented words; each segmented word is then converted into a word vector; afterwards, the word vectors are assembled into the matrix in the order the segmented words appear in the text to be analyzed.
In one embodiment, each segmented word may be converted into a word vector by word2vec (a model for generating word vectors). In a specific implementation, stop words may also be removed to reduce the data volume of the matrix of the text to be analyzed.
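A minimal sketch of this preprocessing, with an invented stop-word list and toy embeddings (the patent names word2vec but fixes no vocabulary or stop-word set):

```python
import numpy as np

STOP_WORDS = {"the", "a", "of", "is"}   # illustrative stop-word list

def build_matrix(text, embeddings):
    """Segment, drop stop words to shrink the matrix, and stack the
    remaining words' vectors in their original order."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return tokens, np.stack([embeddings[t] for t in tokens])

emb = {"cat": np.array([1.0, 0.0]), "sat": np.array([0.0, 1.0]),
       "mat": np.array([1.0, 1.0])}
tokens, M = build_matrix("The cat sat of the mat", emb)
print(tokens, M.shape)  # stop words removed -> ['cat', 'sat', 'mat'] (3, 2)
```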
Step 602: input the matrix of the text to be analyzed into the pre-trained Seq2seq neural network to obtain an output matrix including at least one output vector; wherein the Seq2seq neural network is trained on a corpus annotated with keywords, and during training, when the input of the Seq2seq neural network is the matrix of a training text, the output is the matrix formed by the training text's keywords, each vector in the keyword matrix corresponding to a keyword.
Step 603: determine the keywords of the text to be analyzed according to the correspondence between output vectors and keywords.
In one embodiment, to mine keywords better, the attention mechanism is introduced as described above. That is, the Seq2seq neural network includes an encoder, a decoder, and a neural network module with an attention mechanism; the encoder and decoder are recurrent neural networks, and the module with the attention mechanism is used to adjust the encoder's encoding result for each word vector. In this way, important information is strengthened and unimportant information weakened, making keyword mining more accurate.
In one embodiment, when the neural network module with the attention mechanism is used, as shown in FIG. 7, inputting the matrix of the text to be analyzed into the pre-trained Seq2seq neural network to obtain the output matrix may include the following steps:
Step 701: input the word vectors in the matrix of the text to be analyzed into the encoder one by one, in their order in the text to be analyzed, and obtain the state of each input word vector;
Step 702: input the word vector currently being fed to the encoder, together with the state of the previous word vector, into the neural network module with the attention mechanism, and obtain a weight parameter for the previous word vector;
Step 703: multiply the weight parameter of the previous word vector by the state of the previous word vector to obtain the adjusted state of the previous word vector;
Step 704: input the adjusted states of the word vectors into the decoder in sequence to obtain the output matrix.
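Steps 701-704 can be sketched in numpy as follows. This is an assumed minimal model, not the patented network: the LSTM encoder, the FC+dropout+softmax attention module and the decoder are replaced by toy operations of matching shape (the last state is left unadjusted, since it has no following word vector in this simplification).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy stand-ins for the patent's encoder and attention module.
def encoder_step(x, prev_state):               # step 701: state of each word vector
    return np.tanh(x + 0.5 * prev_state)

def attention_weight(current_x, prev_state):   # step 702: weight of previous vector
    return softmax(current_x + prev_state)

def run_attention_seq2seq(matrix):
    dim = matrix.shape[1]
    states, state = [], np.zeros(dim)
    for x in matrix:                           # step 701: encode in text order
        state = encoder_step(x, state)
        states.append(state)
    adjusted = [states[0]]
    for i in range(1, len(states)):
        alpha = attention_weight(matrix[i], states[i - 1])  # step 702
        adjusted[i - 1] = alpha * states[i - 1]             # step 703: reweight
        adjusted.append(states[i])
    # step 704: the adjusted states would be fed to the decoder (identity stub here)
    return np.stack(adjusted)

out = run_attention_seq2seq(np.ones((3, 4)))
```

Because the weight parameter has the same dimension as the word vector, the multiplication in step 703 is elementwise, consistent with the description of FIG. 8 below.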
For example, suppose the matrix of a text includes the word vectors of a plurality of words. The first vector is input to the encoder, and the encoder obtains the state of that vector. When the second vector is processed, the second vector and the state of the first vector are input to the neural network module with the attention mechanism to obtain the weight parameter of the first vector. The weight parameter of the first vector is multiplied by the state of the first vector to obtain the vector that is input to the decoder. Each vector is processed in this way, so that every vector input to the decoder incorporates contextual information. Moreover, when the encoder is a recurrent neural network, the state of each vector also incorporates the state of the previous vector, so that contextual information is taken into account even further.
In one embodiment, after the predicted matrix is obtained, for each output vector in the matrix, the vector closest to the output vector may be looked up in a keyword vector set, and the keyword corresponding to the found vector is determined as a keyword of the text to be analyzed.
Of course, in specific implementation, the distance between an output vector and the vectors in the keyword vector set may be computed, and a corresponding vector is considered found in the set only when this distance is smaller than a specified distance. In this way, it can be ensured that an accurate vector is found.
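The lookup described above can be sketched as a nearest-neighbor search with a distance threshold. The keyword vector set, the Euclidean distance metric, and the threshold value below are illustrative assumptions, not values from the patent.

```python
import numpy as np

# Toy keyword vector set standing in for the trained one.
KEYWORD_VECTORS = {
    "children": np.array([1.0, 0.0]),
    "game":     np.array([0.0, 1.0]),
}

def nearest_keyword(output_vector, max_distance=0.5):
    """Return the keyword whose vector is closest, or None beyond the threshold."""
    best_word, best_dist = None, float("inf")
    for word, vec in KEYWORD_VECTORS.items():
        dist = np.linalg.norm(output_vector - vec)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word if best_dist < max_distance else None

match = nearest_keyword(np.array([0.9, 0.1]))   # close to "children"
miss = nearest_keyword(np.array([5.0, 5.0]))    # beyond the specified distance
```

Returning None for far-away output vectors realizes the "specified distance" check, which prevents an inaccurate vector from being matched.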
Further, in general, an extracted keyword should be contained in the text to be analyzed. Therefore, in the embodiment of the present application, for each keyword corresponding to a vector found in the keyword vector set: if the keyword is contained in the text to be analyzed, the keyword is determined as a keyword of the text to be analyzed; if the keyword is not contained in the text to be analyzed, the keyword is discarded. That is, if an extracted keyword does not appear in the text to be analyzed, it is unsuitable as a final keyword of that text and is filtered out against the original text. As a result, the extracted keywords are more accurate.
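The filtering step amounts to keeping only those candidate keywords that occur in the original text; a minimal sketch (names and data are illustrative):

```python
# Keep only candidate keywords that actually occur in the text to be analyzed.
def filter_keywords(candidates, text):
    return [kw for kw in candidates if kw in text]

kept = filter_keywords(["children", "game", "puzzle"],
                       "a game suitable for children")
```

Here "puzzle" is dropped because it does not appear in the text, mirroring the robustness filter described above.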
In one embodiment, the number of keywords can be set according to actual needs. When the decoder extracts too many keywords, some of them can be weeded out; when the decoder extracts too few, the keywords can be extended. This may be implemented in the following two respects:
1. Rejecting surplus keywords
If the number of keywords of the text to be analyzed is greater than a preset number, some keywords are removed from the keywords of the text to be analyzed so that the number of remaining keywords equals the preset number.
In one embodiment, which keywords to remove may be determined according to the distance between each output vector of the output matrix and the vectors in the keyword vector set; for example, the keywords with the larger distances are removed.
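The distance-based pruning above can be sketched as sorting candidates by their output-vector distance and keeping the preset number of closest ones; the candidate list and distances below are illustrative.

```python
# Rank candidate keywords by the distance between their output vector and the
# matched keyword vector; keep only the preset number of closest candidates.
def prune_keywords(candidates_with_distance, preset_quantity):
    ranked = sorted(candidates_with_distance, key=lambda item: item[1])
    return [word for word, _ in ranked[:preset_quantity]]

kept = prune_keywords([("game", 0.1), ("puzzle", 0.9), ("children", 0.2)], 2)
```

The farthest candidate ("puzzle", distance 0.9) is rejected, consistent with removing keywords with larger distances.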
2. Extending with similar keywords
In one embodiment, if the number of keywords of the text to be analyzed is less than the preset number, keywords similar to the keywords of the text to be analyzed are looked up in the keyword vector set, and the similar keywords found are determined as additional keywords of the text to be analyzed.
For example, suppose 3 keywords are actually needed, the decoder produces one output vector, and one keyword is found in the keyword vector set from that output vector. To extend the keywords, the vectors closest to the obtained keyword may be looked up in the keyword vector set, and the keywords corresponding to those closest vectors are used as the extended keywords.
Of course, in one embodiment, keywords semantically similar to the determined keywords may also be used as extended keywords. For example, "cute" and "adorable" are semantically similar to a certain extent, so the keyword "adorable" may be used as an extension of "cute".
The following describes in detail how keywords are mined using the attention mechanism. In the embodiment of the present application, the neural network module with the attention mechanism includes a fully connected layer, a random deactivation (dropout) layer and a softmax layer. FIG. 8 is a structural schematic diagram of the Seq2seq neural network provided by the embodiment of the present application. Both the encoder (Encoder) and the decoder (Decoder) may be recurrent neural networks, for example LSTM (Long Short-Term Memory) networks. The internal structure of the neural network with the attention mechanism (Attention), expanded as shown on the right side of FIG. 8, includes: a fully connected layer, a random deactivation layer and a normalization layer. Here, Input denotes the input word vector sequence, in1...inn denote the current word vectors, h1...hn denote the states of the word vectors preceding the current word vectors, and α1...αn denote the weight parameters of the preceding word vectors. For any word vector, the dimension of its weight parameter is the same as the dimension of the word vector.
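The attention module just described (fully connected layer, then random deactivation, then softmax) can be sketched in numpy as follows. The weights are random placeholders rather than trained parameters, and the layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

DIM = 4
W = rng.standard_normal((DIM, 2 * DIM))  # placeholder FC weights
b = np.zeros(DIM)

def attention_module(current_vec, prev_state, train=False, drop_p=0.5):
    # Fully connected layer over [current word vector; previous state].
    hidden = W @ np.concatenate([current_vec, prev_state]) + b
    if train:  # random deactivation (dropout), active only during training
        hidden = hidden * (rng.random(DIM) >= drop_p) / (1 - drop_p)
    # Normalization layer: the weight parameter sums to 1 and has the same
    # dimension as the word vector.
    return softmax(hidden)

alpha = attention_module(np.ones(DIM), np.zeros(DIM))
```

At inference time dropout is skipped, so the module reduces to a fully connected layer followed by softmax, as in the forward pass described next.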
When keywords of a text to be analyzed are mined, the matrix formed by the word vectors of the text is input to the encoder, and the encoder processes the word vectors one by one to obtain the state of each word vector. The current vector and the state of the previous word vector are input to the neural network module with the attention mechanism; after being processed by the fully connected layer of that module, the result is passed to the random deactivation layer, and finally to the normalization layer, which yields the weight parameter of the previous word vector with respect to the current word vector. The state of the previous word vector is then multiplied (Multi) by its weight parameter and input to the decoder for processing.
The decoder decodes the input vectors to obtain output vectors; vectors matching the respective output vectors are then found in the keyword vector set, and the keywords corresponding to the matched vectors are determined as the keywords of the text to be analyzed.
When the Seq2seq neural network provided by the embodiment of the present application is used to extract keywords, since the output consists of keyword vectors rather than specific keywords, the Seq2seq neural network has better generalization ability. In addition, filtering out keywords that do not appear in the text to be analyzed means that the output is filtered against the original text, which improves the robustness of keyword extraction. Furthermore, since contextual information is fully considered in the extraction process, ambiguity can be suppressed, effectively improving the accuracy of keyword extraction.
The results of the keyword extraction method provided by the embodiment of the present application are illustrated below with reference to three measured results.
1) FIG. 9 shows the output of the attention module (the neural network with the attention mechanism) when keywords are mined from the description text of the App "Honor of Kings". In FIG. 9, a lighter color indicates a higher weight, and it can be seen that the weights at the positions where the annotated keywords (i.e., the underlined keywords in FIG. 9) appear are strengthened. Therefore, in the Seq2seq neural network provided by the embodiment of the present application, the attention module plays its role in mining keywords well.
2) FIG. 10 is an effect diagram of the keywords obtained after keyword extraction is performed on a text describing a game. It can be seen that, for this game ("Great Wilderness Records"), the extracted keywords include the game's title, "strategy", "rpg", "nurturing" and "immortal cultivation", and these keywords accurately describe the content of the corresponding text.
For the game "Hero Kill" the effect is similar, and it is not described again here.
3) For a text to be analyzed that is a search term, the amount of text is usually small. The keyword extraction scheme provided by the embodiment of the present application also works well on such short texts, and the extracted keywords can then be used for information retrieval.
For example, as shown in FIG. 11, suppose the input search term is "games suitable for children to play". The keywords extracted from this term by the scheme provided by the embodiment of the present application include "children" and "games"; then, assuming that 4 keywords are needed, the keywords "jigsaw puzzle" and "intelligence development" are added by extension. As a result, when an App search is performed, a children's jigsaw-puzzle game can be accurately located as a recommendation.
In the embodiment of the present application, how many keywords are ultimately needed can be determined according to actual needs. When one keyword is needed, the hit rate of the keywords can reach 96%; when multiple keywords are needed, the hit rate can reach 84%. Therefore, the Seq2seq neural network provided by the embodiment of the present application extracts keywords well.
Referring to FIG. 12, based on the same inventive concept, the embodiment of the present application further provides a text-based keyword extraction apparatus, including:
a text matrix construction unit 1201, configured to construct a matrix of a text to be analyzed, the matrix including sequentially arranged word vectors of words, where the arrangement order is the order of the word vectors in the text to be analyzed;
an output matrix determination unit 1202, configured to input the matrix of the text to be analyzed into a pre-trained Seq2seq neural network to obtain an output matrix including at least one output vector, where the Seq2seq neural network is obtained by training on a corpus annotated with keywords, and during training the input of the Seq2seq neural network is the matrix of a training text and the output is a matrix composed of the keywords corresponding to the training text, each vector in the keyword matrix corresponding to one keyword; and
a keyword determination unit 1203, configured to determine the keywords of the text to be analyzed according to the correspondence between output vectors and keywords.
Optionally, the Seq2seq neural network includes an encoder, a decoder and a neural network module with an attention mechanism; the encoder and the decoder are recurrent neural networks, and the neural network module with the attention mechanism is configured to adjust the encoding result of the encoder for each word vector.
Optionally, the output matrix determination unit is configured to:
input the word vectors in the matrix of the text to be analyzed into the encoder one by one, in their order in the text to be analyzed, and obtain the state of each input word vector;
input the word vector currently being fed to the encoder, together with the state of the previous word vector, into the neural network module with the attention mechanism, and obtain the weight parameter of the previous word vector;
multiply the weight parameter of the previous word vector by the state of the previous word vector to obtain the adjusted state of the previous word vector; and
input the adjusted states of the word vectors into the decoder in sequence to obtain the output matrix.
Optionally, the neural network module with the attention mechanism includes a fully connected layer, a random deactivation layer and a normalization (softmax) layer connected in series:
the fully connected layer is configured to process the input current word vector of the encoder and the state of the previous word vector of the current word vector;
the random deactivation layer is configured to process the processing result of the fully connected layer; and
the softmax layer is configured to normalize the processing result of the random deactivation layer to obtain the weight parameter of the previous word vector.
Optionally, the text matrix construction unit is configured to:
perform word segmentation on the text to be analyzed to obtain individual words;
convert each word into a word vector; and
construct the matrix from the word vectors of the words in the order in which the words appear in the text to be analyzed.
Optionally, the keyword determination unit is configured to:
look up, in a keyword vector set, the vector closest to an output vector; and
determine the keyword corresponding to the found vector as a keyword of the text to be analyzed.
Optionally, the keyword determination unit is configured to:
for each keyword corresponding to a vector found in the keyword vector set, determine the keyword as a keyword of the text to be analyzed if the keyword is contained in the text to be analyzed, and discard the keyword if it is not contained in the text to be analyzed.
Optionally, the apparatus further includes:
a filtering unit, configured to, if the number of keywords of the text to be analyzed is greater than a preset number, remove some keywords from the keywords of the text to be analyzed so that the number of remaining keywords equals the preset number.
Optionally, the apparatus further includes:
an extension unit, configured to, if the number of keywords of the text to be analyzed is less than the preset number, look up, in the keyword vector set, keywords similar to the keywords of the text to be analyzed, and determine the similar keywords found as additional keywords of the text to be analyzed.
Referring to FIG. 13, based on the same technical concept, the embodiment of the present application further provides a computer device 130, which may include a memory 1301 and a processor 1302.
The memory 1301 is configured to store a computer program executed by the processor 1302. The memory 1301 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, programs required by at least one function, and the like, and the data storage area may store data created according to the use of the computer device, and the like. The processor 1302 may be a central processing unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 1301 and the processor 1302 is not limited in the embodiment of the present application. In FIG. 13, the memory 1301 and the processor 1302 are connected by a bus 1303, the bus 1303 being indicated by a thick line; the connection manner between other components is only schematically illustrated and is not limiting. The bus 1303 may be divided into an address bus, a data bus, a control bus and the like. For ease of representation, only one thick line is drawn in FIG. 13, but this does not mean that there is only one bus or only one type of bus.
The memory 1301 may be a volatile memory, such as a random-access memory (RAM); the memory 1301 may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1301 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1301 may also be a combination of the above memories.
The processor 1302 is configured to execute, when calling the computer program stored in the memory 1301, the method performed by the device in the embodiments shown in FIGS. 6-7.
In some possible embodiments, various aspects of the method provided by the present application may also be implemented in the form of a program product including program code; when the program product runs on a computer device, the program code causes the computer device to execute the steps of the method according to the various exemplary embodiments of the application described above in this specification. For example, the computer device may execute the method performed by the device in the embodiments shown in FIGS. 6-7.
The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
Although preferred embodiments of the application have been described, a person skilled in the art, once aware of the basic creative concept, can make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications falling within the scope of the application.
Obviously, those skilled in the art can make various modifications and variations to the application without departing from the spirit and scope of the application. Thus, if these modifications and variations of the application fall within the scope of the claims of the application and their technical equivalents, the application is also intended to include them.
Claims (10)
1. A text-based keyword extraction method, characterized in that the method includes:
constructing a matrix of a text to be analyzed, the matrix including sequentially arranged word vectors of words, where the arrangement order is the order of the word vectors in the text to be analyzed;
inputting the matrix of the text to be analyzed into a pre-trained Seq2seq neural network to obtain an output matrix including at least one output vector, where the Seq2seq neural network is obtained by training on a corpus annotated with keywords, and during training the input of the Seq2seq neural network is the matrix of a training text and the output is a matrix composed of the keywords corresponding to the training text, each vector in the keyword matrix corresponding to one keyword; and
determining the keywords of the text to be analyzed according to the correspondence between output vectors and keywords.
2. The method according to claim 1, characterized in that the Seq2seq neural network includes an encoder, a decoder and a neural network module with an attention mechanism, the encoder and the decoder being recurrent neural networks, and the neural network module with the attention mechanism being configured to adjust the encoding result of the encoder for each word vector.
3. The method according to claim 2, characterized in that inputting the matrix of the text to be analyzed into the pre-trained Seq2seq neural network to obtain the output matrix includes:
inputting the word vectors in the matrix of the text to be analyzed into the encoder one by one, in their order in the text to be analyzed, and obtaining the state of each input word vector;
inputting the word vector currently being fed to the encoder, together with the state of the previous word vector, into the neural network module with the attention mechanism, and obtaining the weight parameter of the previous word vector;
multiplying the weight parameter of the previous word vector by the state of the previous word vector to obtain the adjusted state of the previous word vector; and
inputting the adjusted states of the word vectors into the decoder in sequence to obtain the output matrix.
4. The method according to claim 3, characterized in that the neural network module with the attention mechanism includes a fully connected layer, a random deactivation layer and a normalization (softmax) layer connected in series:
the fully connected layer being configured to process the input current word vector of the encoder and the state of the previous word vector of the current word vector;
the random deactivation layer being configured to process the processing result of the fully connected layer; and
the softmax layer being configured to normalize the processing result of the random deactivation layer to obtain the weight parameter of the previous word vector.
5. The method according to claim 1, characterized in that constructing the matrix of the text to be analyzed includes:
performing word segmentation on the text to be analyzed to obtain individual words;
converting each word into a word vector; and
constructing the matrix from the word vectors of the words in the order in which the words appear in the text to be analyzed.
6. The method according to claim 1, characterized in that determining the keywords of the text to be analyzed according to the correspondence between output vectors and keywords includes:
looking up, in a keyword vector set, the vector closest to an output vector; and
determining the keyword corresponding to the found vector as a keyword of the text to be analyzed.
7. The method according to claim 5, characterized in that determining the keyword corresponding to the found vector as a keyword of the text to be analyzed includes:
for each keyword corresponding to a vector found in the keyword vector set, determining the keyword as a keyword of the text to be analyzed if the keyword is contained in the text to be analyzed, and discarding the keyword if it is not contained in the text to be analyzed.
8. The method according to claim 1, characterized in that the method further includes:
if the number of keywords of the text to be analyzed is greater than a preset number, removing some keywords from the keywords of the text to be analyzed so that the number of remaining keywords equals the preset number.
9. The method according to claim 1, characterized in that the method further includes:
if the number of keywords of the text to be analyzed is less than a preset number, looking up, in the keyword vector set, keywords similar to the keywords of the text to be analyzed; and
determining the similar keywords found as additional keywords of the text to be analyzed.
10. A computer device including a memory, a processor and a computer program stored on the memory and runnable on the processor, characterized in that:
the processor, when executing the computer program, implements the method steps of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910360872.1A CN110110330B (en) | 2019-04-30 | 2019-04-30 | Keyword extraction method based on text and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910360872.1A CN110110330B (en) | 2019-04-30 | 2019-04-30 | Keyword extraction method based on text and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110110330A true CN110110330A (en) | 2019-08-09 |
CN110110330B CN110110330B (en) | 2023-08-11 |
Family
ID=67487802
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910360872.1A Active CN110110330B (en) | 2019-04-30 | 2019-04-30 | Keyword extraction method based on text and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110110330B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110598213A (en) * | 2019-09-06 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, equipment and storage medium |
CN110610003A (en) * | 2019-08-15 | 2019-12-24 | 阿里巴巴集团控股有限公司 | Method and system for assisting text annotation |
CN110705268A (en) * | 2019-09-02 | 2020-01-17 | 平安科技(深圳)有限公司 | Article subject extraction method and device based on artificial intelligence and computer-readable storage medium |
CN110796160A (en) * | 2019-09-16 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Text classification method, device and storage medium |
CN110866393A (en) * | 2019-11-19 | 2020-03-06 | 北京网聘咨询有限公司 | Resume information extraction method and system based on domain knowledge base |
CN110991175A (en) * | 2019-12-10 | 2020-04-10 | 爱驰汽车有限公司 | Text generation method, system, device and storage medium under multiple modes |
CN111178041A (en) * | 2019-12-31 | 2020-05-19 | 北京妙笔智能科技有限公司 | Intelligent text repeat system and method |
CN111274815A (en) * | 2020-01-15 | 2020-06-12 | 北京百度网讯科技有限公司 | Method and device for mining entity attention points in text |
CN111667192A (en) * | 2020-06-12 | 2020-09-15 | 北京卓越讯通科技有限公司 | Safety production risk assessment method based on NLP big data |
WO2021042516A1 (en) * | 2019-09-02 | 2021-03-11 | 平安科技(深圳)有限公司 | Named-entity recognition method and device, and computer readable storage medium |
WO2021147363A1 (en) * | 2020-01-20 | 2021-07-29 | 中国电子科技集团公司电子科学研究院 | Text-based major depressive disorder recognition method |
CN113360639A (en) * | 2020-03-06 | 2021-09-07 | 上海卓繁信息技术股份有限公司 | Short text emotion classification method and device and storage device |
CN114048742A (en) * | 2021-10-26 | 2022-02-15 | 北京师范大学 | Knowledge entity and relation extraction method of text information and text quality evaluation method |
WO2022134759A1 (en) * | 2020-12-21 | 2022-06-30 | 深圳壹账通智能科技有限公司 | Keyword generation method and apparatus, and electronic device and computer storage medium |
CN115114913A (en) * | 2021-03-18 | 2022-09-27 | 马上消费金融股份有限公司 | Labeling method, device, equipment and readable storage medium |
WO2023060795A1 (en) * | 2021-10-12 | 2023-04-20 | 平安科技(深圳)有限公司 | Automatic keyword extraction method and apparatus, and device and storage medium |
CN114048742B (en) * | 2021-10-26 | 2024-09-06 | 北京师范大学 | Knowledge entity and relation extraction method of text information and text quality assessment method |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018086470A1 (en) * | 2016-11-10 | 2018-05-17 | 腾讯科技(深圳)有限公司 | Keyword extraction method and device, and server |
CN108304364A (en) * | 2017-02-23 | 2018-07-20 | 腾讯科技(深圳)有限公司 | keyword extracting method and device |
CN108376131A (en) * | 2018-03-14 | 2018-08-07 | 中山大学 | Keyword abstraction method based on seq2seq deep neural network models |
WO2018153265A1 (en) * | 2017-02-23 | 2018-08-30 | 腾讯科技(深圳)有限公司 | Keyword extraction method, computer device, and storage medium |
CN108536678A (en) * | 2018-04-12 | 2018-09-14 | 腾讯科技(深圳)有限公司 | Text key message extracting method, device, computer equipment and storage medium |
CN108959396A (en) * | 2018-06-04 | 2018-12-07 | 众安信息技术服务有限公司 | Machine reading model training method and device, answering method and device |
CN109299268A (en) * | 2018-10-24 | 2019-02-01 | 河南理工大学 | A kind of text emotion analysis method based on dual channel model |
CN109446328A (en) * | 2018-11-02 | 2019-03-08 | 成都四方伟业软件股份有限公司 | A kind of text recognition method, device and its storage medium |
CN109446519A (en) * | 2018-10-10 | 2019-03-08 | 西安交通大学 | A kind of text feature of fused data classification information |
CN109471933A (en) * | 2018-10-11 | 2019-03-15 | 平安科技(深圳)有限公司 | A kind of generation method of text snippet, storage medium and server |
CN109472024A (en) * | 2018-10-25 | 2019-03-15 | 安徽工业大学 | A kind of file classification method based on bidirectional circulating attention neural network |
CN109597884A (en) * | 2018-12-28 | 2019-04-09 | 北京百度网讯科技有限公司 | Talk with method, apparatus, storage medium and the terminal device generated |
CN109635284A (en) * | 2018-11-26 | 2019-04-16 | 北京邮电大学 | Text snippet method and system based on deep learning associate cumulation attention mechanism |
2019-04-30: CN CN201910360872.1A patent/CN110110330B/en, status: Active
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018086470A1 (en) * | 2016-11-10 | 2018-05-17 | 腾讯科技(深圳)有限公司 | Keyword extraction method and device, and server |
CN108304364A (en) * | 2017-02-23 | 2018-07-20 | 腾讯科技(深圳)有限公司 | keyword extracting method and device |
WO2018153265A1 (en) * | 2017-02-23 | 2018-08-30 | 腾讯科技(深圳)有限公司 | Keyword extraction method, computer device, and storage medium |
CN108376131A (en) * | 2018-03-14 | 2018-08-07 | 中山大学 | Keyword abstraction method based on seq2seq deep neural network models |
CN108536678A (en) * | 2018-04-12 | 2018-09-14 | 腾讯科技(深圳)有限公司 | Text key message extracting method, device, computer equipment and storage medium |
CN108959396A (en) * | 2018-06-04 | 2018-12-07 | 众安信息技术服务有限公司 | Machine reading model training method and device, answering method and device |
CN109446519A (en) * | 2018-10-10 | 2019-03-08 | 西安交通大学 | A kind of text feature of fused data classification information |
CN109471933A (en) * | 2018-10-11 | 2019-03-15 | 平安科技(深圳)有限公司 | A kind of generation method of text snippet, storage medium and server |
CN109299268A (en) * | 2018-10-24 | 2019-02-01 | 河南理工大学 | A kind of text emotion analysis method based on dual channel model |
CN109472024A (en) * | 2018-10-25 | 2019-03-15 | 安徽工业大学 | A kind of file classification method based on bidirectional circulating attention neural network |
CN109446328A (en) * | 2018-11-02 | 2019-03-08 | 成都四方伟业软件股份有限公司 | A kind of text recognition method, device and its storage medium |
CN109635284A (en) * | 2018-11-26 | 2019-04-16 | 北京邮电大学 | Text snippet method and system based on deep learning associate cumulation attention mechanism |
CN109597884A (en) * | 2018-12-28 | 2019-04-09 | 北京百度网讯科技有限公司 | Talk with method, apparatus, storage medium and the terminal device generated |
Non-Patent Citations (3)
Title |
---|
XIAOYU LIU: "Generating Keyword Queries for Natural Language Queries to Alleviate Lexical Chasm Problem", CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management *
HE Hongye; ZHENG Jin; ZHANG Zuping: "Text sentiment analysis combining part-of-speech features and convolutional neural networks", Computer Engineering, no. 11 *
WANG Shengyu; ZENG Biqing; SHANG Qi; HAN Xuli: "Sentiment analysis based on a word-attention convolutional neural network model", Journal of Chinese Information Processing, no. 09 *
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110610003B (en) * | 2019-08-15 | 2023-09-15 | 创新先进技术有限公司 | Method and system for assisting text annotation |
CN110610003A (en) * | 2019-08-15 | 2019-12-24 | 阿里巴巴集团控股有限公司 | Method and system for assisting text annotation |
CN110705268A (en) * | 2019-09-02 | 2020-01-17 | 平安科技(深圳)有限公司 | Article subject matter extraction method and device based on artificial intelligence and computer-readable storage medium |
CN110705268B (en) * | 2019-09-02 | 2024-06-25 | 平安科技(深圳)有限公司 | Article subject matter extraction method and device based on artificial intelligence and computer-readable storage medium |
WO2021042517A1 (en) * | 2019-09-02 | 2021-03-11 | 平安科技(深圳)有限公司 | Artificial intelligence-based article gist extraction method and device, and storage medium |
WO2021042516A1 (en) * | 2019-09-02 | 2021-03-11 | 平安科技(深圳)有限公司 | Named-entity recognition method and device, and computer readable storage medium |
CN110598213A (en) * | 2019-09-06 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, equipment and storage medium |
CN110796160A (en) * | 2019-09-16 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Text classification method, device and storage medium |
CN110866393A (en) * | 2019-11-19 | 2020-03-06 | 北京网聘咨询有限公司 | Resume information extraction method and system based on domain knowledge base |
CN110991175A (en) * | 2019-12-10 | 2020-04-10 | 爱驰汽车有限公司 | Multi-modal text generation method, system, device and storage medium |
CN110991175B (en) * | 2019-12-10 | 2024-04-09 | 爱驰汽车有限公司 | Multi-modal text generation method, system, device and storage medium |
CN111178041A (en) * | 2019-12-31 | 2020-05-19 | 北京妙笔智能科技有限公司 | Intelligent text paraphrasing system and method |
CN111178041B (en) * | 2019-12-31 | 2023-04-07 | 北京妙笔智能科技有限公司 | Intelligent text paraphrasing system and method |
CN111274815A (en) * | 2020-01-15 | 2020-06-12 | 北京百度网讯科技有限公司 | Method and device for mining entity focus in text |
CN111274815B (en) * | 2020-01-15 | 2024-04-12 | 北京百度网讯科技有限公司 | Method and device for mining entity focus in text |
US11775761B2 (en) | 2020-01-15 | 2023-10-03 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for mining entity focus in text |
WO2021147363A1 (en) * | 2020-01-20 | 2021-07-29 | 中国电子科技集团公司电子科学研究院 | Text-based major depressive disorder recognition method |
CN113360639A (en) * | 2020-03-06 | 2021-09-07 | 上海卓繁信息技术股份有限公司 | Short text emotion classification method and device and storage device |
CN111667192A (en) * | 2020-06-12 | 2020-09-15 | 北京卓越讯通科技有限公司 | Safety production risk assessment method based on NLP big data |
WO2022134759A1 (en) * | 2020-12-21 | 2022-06-30 | 深圳壹账通智能科技有限公司 | Keyword generation method and apparatus, and electronic device and computer storage medium |
CN115114913B (en) * | 2021-03-18 | 2024-02-06 | 马上消费金融股份有限公司 | Labeling method, labeling device, labeling equipment and readable storage medium |
CN115114913A (en) * | 2021-03-18 | 2022-09-27 | 马上消费金融股份有限公司 | Labeling method, device, equipment and readable storage medium |
WO2023060795A1 (en) * | 2021-10-12 | 2023-04-20 | 平安科技(深圳)有限公司 | Automatic keyword extraction method and apparatus, and device and storage medium |
CN114048742A (en) * | 2021-10-26 | 2022-02-15 | 北京师范大学 | Knowledge entity and relation extraction method of text information and text quality evaluation method |
CN114048742B (en) * | 2021-10-26 | 2024-09-06 | 北京师范大学 | Knowledge entity and relation extraction method of text information and text quality assessment method |
Also Published As
Publication number | Publication date |
---|---|
CN110110330B (en) | 2023-08-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110110330A (en) | Text based keyword extracting method and computer equipment | |
Bamman et al. | An annotated dataset of coreference in English literature | |
Norris | Models of visual word recognition | |
CN108829719A (en) | Non-factoid question answer selection method and system |
WO2015093541A1 (en) | Scenario generation device and computer program therefor | |
CN110083705A (en) | Multi-hop attention deep model, method, storage medium and terminal for targeted sentiment classification |
CN110263324A (en) | Text processing method, model training method and device |
CN104484411B (en) | Construction method of a dictionary-based semantic knowledge base |
CN112650840A (en) | Intelligent medical question-answering processing method and system based on knowledge graph reasoning | |
JP6403382B2 (en) | Phrase pair collection device and computer program therefor | |
CN110188272A (en) | Tag recommendation method for community question answering websites based on user context |
JP5907393B2 (en) | Complex predicate template collection device and computer program therefor | |
CN108549658A (en) | Deep learning video question answering method and system based on an attention mechanism over syntactic parse trees |
Barhoom et al. | Sarcasm detection in headline news using machine and deep learning algorithms | |
CN114201683A (en) | Interest activation news recommendation method and system based on multi-level matching | |
CN114386410A (en) | Training method and text processing method of pre-training model | |
CN108920446A (en) | Processing method for engineering documents |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN116257698A (en) | Social network sensitivity and graceful language detection method based on supervised learning | |
Kim et al. | Enhancing Korean named entity recognition with linguistic tokenization strategies | |
CN117312514A (en) | Consultation reply method, consultation reply device and computer readable storage medium | |
Peng et al. | Encoding Text Information By Pre-trained Model For Authorship Verification. | |
Xu et al. | Research on depression tendency detection based on image and text fusion | |
CN106528764A (en) | Retrieval method and device for question-type search terms |
CN110245230A (en) | Book grading method, system, storage medium and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||