CN109710760A - Short text clustering method, device, medium and electronic device - Google Patents


Info

Publication number
CN109710760A
CN109710760A (application CN201811563089.7A)
Authority
CN
China
Prior art keywords
short text
sorted
feature
semantic
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811563089.7A
Other languages
Chinese (zh)
Inventor
李渊 (Li Yuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taikang Insurance Group Co Ltd
Taikang Online Property Insurance Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Taikang Online Property Insurance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd and Taikang Online Property Insurance Co Ltd
Priority to CN201811563089.7A
Publication of CN109710760A
Legal status: Pending


Abstract

The present disclosure provides a short text clustering method. The method obtains semantic feature vectors of multiple to-be-classified short texts through a recurrent neural network and an attention mechanism, and uses a clustering algorithm to iteratively cluster the semantic feature vectors of the multiple to-be-classified short texts according to k initial cluster center points, finally dividing the semantic feature vectors of the multiple to-be-classified short texts into multiple short text classes. The semantic feature vector of each to-be-classified short text contains the contextual local features and global features of the to-be-classified short text as well as the contextual local features and global features of a semantically related short text. In this way, a good short text clustering effect can be achieved when the semantic feature vectors of the multiple to-be-classified short texts are clustered, and the resulting multiple short text classes are more accurate.

Description

Short text clustering method, device, medium and electronic device
Technical field
The present invention relates to the field of natural language processing technology, and in particular to a short text clustering method, device, medium and electronic device.
Background technique
With the rapid spread of current Internet technology and the wide use of social media, the amount of text data has grown rapidly, and its principal form is short text, such as evaluation information, customer questions, and micro-insurance comments. Extracting valuable information from short text data has become a challenging task.
Traditional text clustering methods focus mainly on surface-level lexical features of the text, such as term frequency and inverse document frequency. Each short text is represented by a feature vector constructed with the vector space model. Because short texts contain few words, this representation suffers from feature sparsity and high computational cost, and it also ignores the semantic information between the words inside a short text, so it is not suitable for short text clustering. Meanwhile, topic models such as PLSA and LDA introduce the concept of topics for texts and words and analyze the topic distribution of words in the text, which addresses the near-synonym problem, but they are computationally difficult and perform poorly when clustering short texts; they are therefore not applicable to the ever-growing volume of short text data. Based on the above problems, the present invention proposes a short text clustering method based on a deep learning semantic matching model.
It should be noted that the information disclosed in the above background section is only intended to enhance understanding of the background of the invention, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary of the invention
Embodiments of the present invention aim to provide a short text clustering method, device, medium and electronic device, which can, at least to a certain extent, overcome the problem that the accuracy of semantic matching of question-and-answer texts is low.
Other features and advantages of the invention will become apparent from the following detailed description, or may be learned in part through practice of the invention.
According to a first aspect of the embodiments of the present invention, a short text clustering method is provided. The method includes: obtaining semantic feature vectors of multiple to-be-classified short texts through a recurrent neural network and an attention mechanism, where the semantic feature vector of each to-be-classified short text contains the contextual local features and global features of the to-be-classified short text and the contextual local features and global features of a semantically related short text, the semantically related short text being a semantic supplement to the to-be-classified short text; and using a clustering algorithm to iteratively cluster the semantic feature vectors of the multiple to-be-classified short texts according to k initial cluster center points, so as to divide the semantic feature vectors of the multiple to-be-classified short texts into multiple short text classes, where the k initial cluster center points include the semantic feature vectors of k to-be-classified short texts chosen from the semantic feature vectors of the multiple to-be-classified short texts.
In some embodiments of the invention, the clustering algorithm includes a K-means algorithm.
In some embodiments of the invention, based on the foregoing scheme, using the clustering algorithm to iteratively cluster the semantic feature vectors of the multiple to-be-classified short texts according to k initial cluster center points, so as to divide the semantic feature vectors into multiple short text classes, includes: successively calculating, with the K-means algorithm, the distance between the semantic feature vector of each unselected to-be-classified short text and the k cluster centers, and clustering each unselected semantic feature vector according to the minimum distance principle; taking, according to the clustering result, the mean of the semantic feature vectors of the to-be-classified short texts in each cluster as the center point of that cluster; and iteratively clustering the semantic feature vectors in each cluster according to the cluster center points until a preset condition is met, so that the semantic feature vectors of the multiple to-be-classified short texts are divided into multiple short text classes.
In some embodiments of the invention, based on the foregoing scheme, obtaining the semantic feature vectors of the multiple to-be-classified short texts through the recurrent neural network and the attention mechanism includes: using the recurrent neural network to obtain a feature vector sequence with contextual local features of the to-be-classified short text and a feature vector sequence with contextual local features of the semantically related short text, the semantically related short text being a semantic supplement to the to-be-classified short text; generating, based on the feature vector sequence with contextual local features of the to-be-classified short text and the attention weight of each feature vector in that sequence, a feature vector with contextual local features and global features of the to-be-classified short text, and generating, based on the feature vector sequence with contextual local features of the semantically related short text and the attention weight of each feature vector in that sequence, a feature vector with contextual local features and global features of the semantically related short text; and determining the semantic feature vectors of the multiple to-be-classified short texts according to the feature vector with contextual local features and global features of the to-be-classified short text and the feature vector with contextual local features and global features of the semantically related short text.
In some embodiments of the invention, based on the foregoing scheme, before using the recurrent neural network to obtain the feature vector sequences with contextual local features of the to-be-classified short text and of the semantically related short text, the method further includes: performing word segmentation on the to-be-classified short text and the semantically related short text respectively, to obtain the words of the to-be-classified short text and the words of the semantically related short text; and performing distributed representation on those words respectively, to obtain a word vector sequence of the to-be-classified short text and a word vector sequence of the semantically related short text.
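As a sketch of this preprocessing, the fragment below segments a short text with a toy forward-maximum-matching dictionary (standing in for a real Chinese segmenter, which the patent does not name) and maps each word to a dense vector from a small embedding table. The dictionary, embedding table, and dimensions are illustrative assumptions, not details from the patent.

```python
import numpy as np

VOCAB = {"短文本", "聚类", "方法", "文本"}   # toy segmentation dictionary
EMB_DIM = 8
rng = np.random.default_rng(42)
# Distributed representation: one dense vector per word (in practice these
# would come from a trained model such as word2vec).
EMBEDDINGS = {w: rng.standard_normal(EMB_DIM) for w in VOCAB | {"<unk>"}}

def segment(text: str, max_len: int = 4) -> list[str]:
    """Forward maximum matching: greedily take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in VOCAB:
                words.append(text[i:i + n])
                i += n
                break
    return words

def to_word_vectors(text: str) -> np.ndarray:
    """Word vector sequence of shape (T, EMB_DIM) for one short text."""
    return np.stack([EMBEDDINGS.get(w, EMBEDDINGS["<unk>"]) for w in segment(text)])
```

The resulting word vector sequence is what the recurrent neural network consumes in the following steps.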
In some embodiments of the invention, based on the foregoing scheme, the recurrent neural network includes a bidirectional recurrent neural network, and the recurrent units in the bidirectional recurrent neural network include networks based on long short-term memory (LSTM) and/or gated recurrent units (GRU).
In some embodiments of the invention, based on the foregoing scheme, the method further includes: generating background information based on the feature vector sequence with contextual local features of the to-be-classified short text and the feature vector sequence with contextual local features of the semantically related short text, the background information including the semantic information of all time states of the word vectors of the to-be-classified short text and of the semantically related short text before the last moment of the recurrent neural network; and determining, according to the background information, the attention weight of each moment's feature vector in the feature vector sequence with contextual local features of the to-be-classified short text and the attention weight of each moment's feature vector in the feature vector sequence with contextual local features of the semantically related short text.
In some embodiments of the invention, based on the foregoing scheme, generating the feature vector with contextual local features and global features of the to-be-classified short text includes: determining, according to the background information, the similarity between the background information and each moment's feature vector in the feature vector sequence with contextual local features of the to-be-classified short text; determining, according to that similarity, the attention weight of each moment's feature vector in the sequence; and weighting and summing the feature vectors at each moment in the sequence according to their attention weights, to obtain the feature vector with contextual local features and global features of the to-be-classified short text.
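One plausible instantiation of this similarity-then-weighted-sum step is sketched below. The patent does not fix the similarity function; dot-product similarity followed by a softmax is assumed here.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(background: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Weight and sum a (T, d) feature vector sequence by its similarity to a
    (d,) background vector, yielding one (d,) vector that mixes contextual
    local features with a global view of the whole text."""
    scores = features @ background   # similarity of each moment's feature vector
    weights = softmax(scores)        # attention weights, summing to 1
    return weights @ features        # weighted sum over the T moments

# Example: three moment vectors; the background is closest to the second one,
# so the pooled vector leans toward it.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
pooled = attend(np.array([0.0, 2.0]), feats)
```

The same routine applies symmetrically to the semantically related short text in the next embodiment.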
In some embodiments of the invention, based on the foregoing scheme, generating the feature vector with contextual local features and global features of the semantically related short text includes: determining, according to the background information, the similarity between the background information and each moment's feature vector in the feature vector sequence with contextual local features of the semantically related short text; determining, according to that similarity, the attention weight of each moment's feature vector in the sequence; and weighting and summing the feature vectors at each moment in the sequence according to their attention weights, to obtain the feature vector with contextual local features and global features of the semantically related short text.
According to a second aspect of the embodiments of the present invention, a short text clustering device is provided, including: an obtaining module, which obtains semantic feature vectors of multiple to-be-classified short texts through a recurrent neural network and an attention mechanism, where the semantic feature vector of each to-be-classified short text contains the contextual local features and global features of the to-be-classified short text and the contextual local features and global features of a semantically related short text, the semantically related short text being a semantic supplement to the to-be-classified short text; and a clustering module, which uses a clustering algorithm to iteratively cluster the semantic feature vectors of the multiple to-be-classified short texts according to k initial cluster center points, so as to divide the semantic feature vectors into multiple short text classes, where the k initial cluster center points include the semantic feature vectors of k to-be-classified short texts chosen from the semantic feature vectors of the multiple to-be-classified short texts.
According to a third aspect of the embodiments of the present invention, an electronic device is provided, including: one or more processors; and a storage device for storing one or more programs, which, when executed by the one or more processors, cause the one or more processors to implement the short text clustering method described in the first aspect of the above embodiments.
According to a fourth aspect of the embodiments of the present invention, a computer-readable medium is provided, on which a computer program is stored; when the program is executed by a processor, the short text clustering method described in the first aspect of the above embodiments is implemented.
The technical solutions provided in the embodiments of the present invention can have the following beneficial effects:
In the technical solutions of the present invention, semantic feature vectors of multiple to-be-classified short texts can be obtained through a recurrent neural network and an attention mechanism, and a clustering algorithm can iteratively cluster these semantic feature vectors according to k initial cluster center points, finally dividing them into multiple short text classes. Because the semantic feature vector of each to-be-classified short text contains the contextual local features and global features of the to-be-classified short text and the contextual local features and global features of the semantically related short text, a good short text clustering effect can be achieved when the semantic feature vectors of the multiple to-be-classified short texts are clustered, and the resulting multiple short text classes are more accurate.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present invention.
Detailed description of the invention
The drawings herein are incorporated into and form part of this specification; they show embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort. In the drawings:
Fig. 1 schematically shows a flow chart of a short text clustering method according to an embodiment of the invention;
Fig. 2 schematically shows a flow chart of a short text clustering method according to another embodiment of the invention;
Fig. 3 schematically shows a flow chart of a short text clustering method according to another embodiment of the invention;
Fig. 4 schematically shows a flow chart of a short text clustering method according to another embodiment of the invention;
Fig. 5 schematically shows a flow chart of a short text clustering method according to another embodiment of the invention;
Fig. 6 schematically shows a flow chart of a short text clustering method according to another embodiment of the invention;
Fig. 7 schematically shows a flow chart of a short text clustering method according to another embodiment of the invention;
Fig. 8 schematically shows a block diagram of a short text clustering device according to an embodiment of the invention;
Fig. 9 shows a structural schematic diagram of a computer system suitable for implementing the electronic device of an embodiment of the present invention.
Specific embodiment
Example embodiments will now be described more fully with reference to the drawings. However, example embodiments can be implemented in a variety of forms and should not be understood as limited to the examples set forth herein; rather, these embodiments are provided so that the present invention will be more thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
In addition, the described features, structures or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of the present invention. However, those skilled in the art will appreciate that the technical solutions of the present invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other cases, well-known methods, devices, implementations or operations are not shown or described in detail, to avoid obscuring aspects of the present invention.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow charts shown in the drawings are merely illustrative; they do not necessarily include all contents and operations/steps, nor must they be executed in the described order. For example, some operations/steps may be decomposed, and some may be merged or partially merged, so the order actually executed may change according to the actual situation.
Fig. 1 schematically shows a flow chart of a short text clustering method according to an embodiment of the invention.
As shown in Fig. 1, the short text clustering method includes steps S110 and S120.
In step S110, semantic feature vectors of multiple to-be-classified short texts are obtained through a recurrent neural network and an attention mechanism. The semantic feature vector of each to-be-classified short text contains the contextual local features and global features of the to-be-classified short text and the contextual local features and global features of a semantically related short text, the semantically related short text being a semantic supplement to the to-be-classified short text.
In step S120, a clustering algorithm iteratively clusters the semantic feature vectors of the multiple to-be-classified short texts according to k initial cluster center points, dividing the semantic feature vectors into multiple short text classes. The k initial cluster center points include the semantic feature vectors of k to-be-classified short texts chosen from the semantic feature vectors of the multiple to-be-classified short texts.
This method can obtain semantic feature vectors of multiple to-be-classified short texts through a recurrent neural network and an attention mechanism, use a clustering algorithm to iteratively cluster the semantic feature vectors according to k initial cluster center points, and finally divide them into multiple short text classes. Because the semantic feature vector of each to-be-classified short text contains the contextual local features and global features of the to-be-classified short text and the contextual local features and global features of the semantically related short text, a good short text clustering effect can be achieved when the semantic feature vectors of the multiple to-be-classified short texts are clustered, and the resulting multiple short text classes are more accurate.
In one embodiment of the invention, the above to-be-classified short text may be, but is not limited to, evaluation information of a shopping platform, a customer question, a micro-insurance comment, and the like. The semantically related short text may be retrieved from a database based on the to-be-classified short text; it can therefore be understood as a semantic supplement to the to-be-classified short text, i.e., the semantically related short text is related to the to-be-classified short text. For example, if the to-be-classified short text is evaluation information of a shopping platform, multiple pieces of evaluation information relevant to it can be retrieved from an evaluation information database. As another example, if the to-be-classified short text is a customer question, multiple candidate answers relevant to the question can be retrieved from a question-and-answer database.
In one embodiment of the invention, the recurrent neural network in step S110 includes a bidirectional recurrent neural network, and the recurrent units in the bidirectional recurrent neural network may be networks based on long short-term memory (LSTM) and/or gated recurrent units (GRU).
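To illustrate the bidirectional structure, the sketch below runs a forward pass and a backward pass over a word vector sequence and concatenates the two hidden states at each moment. A plain tanh recurrent cell stands in for the LSTM/GRU units the text mentions; this is a simplification, not the patent's model.

```python
import numpy as np

def rnn_pass(xs: np.ndarray, Wx: np.ndarray, Wh: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Simple tanh RNN over xs of shape (T, d_in); returns hidden states (T, d_h)."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)   # combine current input with previous state
        out.append(h)
    return np.stack(out)

def bidirectional(xs: np.ndarray, fwd_params, bwd_params) -> np.ndarray:
    """Concatenate forward states with time-aligned backward states: (T, 2*d_h)."""
    hf = rnn_pass(xs, *fwd_params)
    hb = rnn_pass(xs[::-1], *bwd_params)[::-1]   # run on reversed input, then re-align
    return np.concatenate([hf, hb], axis=1)

rng = np.random.default_rng(1)
d_in, d_h, T = 4, 3, 5
params = lambda: (rng.standard_normal((d_h, d_in)) * 0.1,
                  rng.standard_normal((d_h, d_h)) * 0.1,
                  np.zeros(d_h))
seq = rng.standard_normal((T, d_in))                 # a toy word vector sequence
ctx = bidirectional(seq, params(), params())         # feature vector sequence with
                                                     # contextual local features
```

Each row of `ctx` is one moment's feature vector, carrying context from both directions.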
In one embodiment of the invention, the contextual local feature information of the to-be-classified short text and the semantically related short text is extracted through the recurrent neural network, and the global feature information of both is then extracted in combination with the attention mechanism. In this way, the obtained semantic feature vectors of the multiple to-be-classified short texts can learn the deeper semantic feature information between the two texts, which helps to improve the clustering effect, so that the resulting multiple short text classes are more accurate.
In one embodiment of the invention, the k initial cluster center points may be set in advance. For example, the semantic feature vectors of k to-be-classified short texts may be chosen from the semantic feature vectors of the multiple to-be-classified short texts as the initial cluster center points. In other words, the k initial cluster center points set k initial short text classes for the semantic feature vectors of the multiple to-be-classified short texts.
Fig. 2 schematically shows a flow chart of a short text clustering method according to another embodiment of the invention.
As shown in Fig. 2, the above step S120 may specifically include steps S121 to S123.
In step S121, the K-means algorithm successively calculates the distance between the semantic feature vector of each unselected to-be-classified short text and the k cluster centers, and clusters each unselected semantic feature vector according to the minimum distance principle.
In step S122, according to the clustering result, the mean of the semantic feature vectors of the to-be-classified short texts in each cluster is taken as the center point of that cluster.
In step S123, according to the center point of each cluster, the semantic feature vectors of the to-be-classified short texts in each cluster are iteratively clustered until a preset condition is met, so that the semantic feature vectors of the multiple to-be-classified short texts are divided into multiple short text classes.
In one embodiment of the invention, the clustering result of step S122 may be k short text classes, and each of the k short text classes may include the semantic feature vectors of multiple to-be-classified short texts. At this point, the mean of the semantic feature vectors of the to-be-classified short texts in each cluster can be calculated and taken as the new center point of that cluster; in this way, the k center points of the clustering process can be redetermined.
In one embodiment of the invention, steps S122 and S123 can be executed in a loop until a preset condition is met, so as to divide the semantic feature vectors of the multiple to-be-classified short texts into multiple short text classes. For example, the semantic feature vectors in each cluster are iteratively clustered until the number of iterations reaches a specified value or the criterion function converges, at which point the clustering algorithm terminates; otherwise, the process jumps back to step S122.
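The loop over steps S121 to S123 can be sketched as a minimal K-means routine. Euclidean distance is assumed for the minimum distance principle, and convergence of the centers (with an iteration cap) serves as the preset condition; both are plausible readings rather than details the patent fixes.

```python
import numpy as np

def kmeans_short_text(vectors: np.ndarray, k: int, max_iter: int = 100, tol: float = 1e-6):
    """Cluster semantic feature vectors of shape (n, d) into k short text classes."""
    rng = np.random.default_rng(0)
    # Initial centers drawn from the vectors themselves, matching the requirement
    # that the k initial cluster center points come from the semantic feature
    # vectors of the to-be-classified short texts.
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(max_iter):
        # S121: assign each vector to the nearest of the k centers.
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # S122: recompute each center as the mean of its cluster's vectors.
        new_centers = np.array([
            vectors[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # S123: stop when the centers converge (the "preset condition").
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```

On two well-separated groups of vectors, the routine recovers the two groups as the two short text classes regardless of which vectors are picked as initial centers.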
In one embodiment of the invention, before a large number of to-be-classified short texts are clustered, they need to be converted into semantic feature vectors. During the conversion, semantically related short texts relevant to the to-be-classified short texts are used to expand the semantics of the to-be-classified short texts, so that the clustering effect is better. The implementation process of obtaining the semantic feature vectors of multiple to-be-classified short texts is described in detail below with reference to Figs. 3 to 7.
Fig. 3 schematically shows a flow chart of a short text clustering method according to another embodiment of the invention.
As shown in Fig. 3, the above step S110 may specifically include steps S111 to S113.
In step S111, the recurrent neural network is used to obtain a feature vector sequence with contextual local features of the to-be-classified short text and a feature vector sequence with contextual local features of the semantically related short text.
In step S112, based on the feature vector sequence with contextual local features of the to-be-classified short text and the attention weight of each feature vector in that sequence, a feature vector with contextual local features and global features of the to-be-classified short text is generated; and based on the feature vector sequence with contextual local features of the semantically related short text and the attention weight of each feature vector in that sequence, a feature vector with contextual local features and global features of the semantically related short text is generated.
In step S113, the semantic feature vectors of the multiple to-be-classified short texts are determined according to the feature vector with contextual local features and global features of the to-be-classified short text and the feature vector with contextual local features and global features of the semantically related short text.
In this way, the semantic feature vectors of the multiple to-be-classified short texts are determined from the feature vector with contextual local features and global features of the to-be-classified short text and the feature vector with contextual local features and global features of the semantically related short text. The semantic feature vector of each to-be-classified short text obtained by this method contains the contextual local features and global features of the to-be-classified short text and of the semantically related short text; that is, it has learned the deeper semantic feature information of both texts.
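The patent leaves open exactly how the two pooled representations are combined into one semantic feature vector; a simple concatenation, sketched below, is one assumption consistent with the description (element-wise fusion would be another option).

```python
import numpy as np

def semantic_feature_vector(text_vec: np.ndarray, related_vec: np.ndarray) -> np.ndarray:
    """Combine the to-be-classified short text's pooled feature vector with the
    semantically related short text's pooled feature vector into one semantic
    feature vector (concatenation is an assumed, not specified, choice)."""
    return np.concatenate([text_vec, related_vec])

# Toy pooled vectors for one short text and its related short text.
v = semantic_feature_vector(np.array([0.2, 0.8]), np.array([0.5, 0.1]))
```

The resulting vectors are what the K-means stage of step S120 clusters.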
In one embodiment of the invention, a trained deep learning semantic matching model can be used to obtain the semantic feature vectors of multiple to-be-classified short texts. For example, Bi-LSTM is used to extract the contextual local feature information of the to-be-classified short text and the semantically related short text, and the attention mechanism is used to extract the global feature information of both; in this way, the semantic feature vectors of the multiple to-be-classified short texts can fully capture the deep semantic feature information of the two texts.
In the present embodiment, customer question-and-answer data from an internet insurance channel is taken as an example for obtaining the semantic feature vectors of multiple short texts to be sorted with the trained deep-learning semantic matching model. For example, a question text can serve as the above short text to be sorted and a candidate answer text can serve as the above semantically related short text, where the question text and the candidate answer text are semantically related.
Specifically, the term vector sequence of the question text and the term vector sequence of the candidate answer text are each input into a respective recurrent neural network, which learns and extracts the feature vector sequence with context local features. Step S111 is described in detail below taking as an example the use of a bidirectional long short-term memory network (Bi-LSTM) to capture the context local features of the numericalized question-and-answer texts (i.e., the question text and the candidate answer text) and obtain the feature vector sequences with context features of the two.
Specifically, in step S111, after conversion through a professional question-and-answer dictionary, the term vector sequences of the question text and the candidate answer text, adjusted to the same vocabulary length, can be separately input into two bidirectional long short-term memory networks (Bi-LSTM) to extract context local features. Within a Bi-LSTM, the forward-order and reverse-order term vector sequences of the question text and the candidate answer text are each fed into a long short-term memory network (LSTM); during input, the information of the previous moment can be combined to compute the text information of the current moment. The calculation formulas of the LSTM are as follows:
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + W_ci·c_{t-1} + b_i)

f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_xc·x_t + W_hc·h_{t-1} + b_c)

o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_t + b_o)

h_t = o_t ⊙ tanh(c_t)
Here σ denotes the sigmoid activation function and tanh the hyperbolic tangent activation function; x_t denotes the word embedding vector at moment t obtained in step S110; i_t, f_t and o_t denote the output vectors of the input gate, forget gate and output gate at moment t; c_t and c_{t-1} denote the cell memory states at moments t and t-1; h_t and h_{t-1} denote the hidden-layer vectors at moments t and t-1. The weight matrices and bias parameters have the obvious meanings: for example, W_xi denotes the weight matrix between the input and the input gate, W_hi the weight matrix between the hidden layer and the input gate, W_ci the weight matrix between the cell unit and the input gate, and b_i and b_f the bias parameters of the input gate and forget gate; the subscripts indicate the computation they belong to. The weight matrices and bias parameters are first randomly initialized and then corrected automatically during training of the model based on the bidirectional recurrent neural network, so the final weights are obtained together with the recurrent neural network.
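The gate formulas above can be sketched as a single numpy time step. This is a minimal illustration, not the trained model of the embodiment: the weight matrices and biases are randomly initialized stand-ins for the learned parameters, and the toy dimensions (4-dimensional embeddings, 3-dimensional hidden state) are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate formulas above."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d_x, d_h = 4, 3  # toy dimensions: word-embedding size and hidden size
p = {n: rng.normal(scale=0.1, size=(d_h, d_x)) for n in ["W_xi", "W_xf", "W_xc", "W_xo"]}
p.update({n: rng.normal(scale=0.1, size=(d_h, d_h))
          for n in ["W_hi", "W_hf", "W_hc", "W_ho", "W_ci", "W_cf", "W_co"]})
p.update({n: np.zeros(d_h) for n in ["b_i", "b_f", "b_c", "b_o"]})

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):  # a 5-word term vector sequence
    h, c = lstm_step(x_t, h, c, p)     # each step combines the previous moment
print(h.shape)  # (3,)
```

In training, these parameters would be corrected by backpropagation rather than left random, as the text notes.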
For each moment t, the input at moment t can learn the semantic information of the preceding and following moments: the output feature vectors h_fw and h_bw of the two LSTMs that process, respectively, the forward-order and reverse-order term vector sequences of the question-and-answer texts (i.e., the question text and the candidate answer text) are spliced, and the spliced vector is output as the final feature vector of the Bi-LSTM at moment t. Its dimension is twice the dimension of a single LSTM output feature vector.
h_t = [h_fw, h_bw]
Here h_fw denotes the output of the LSTM network that processes the forward-order term vector sequence of the question-and-answer text (i.e., the question text and the candidate answer text), h_bw denotes the output of the LSTM network that processes the reverse-order term vector sequence, and h_t denotes the feature vector output of the Bi-LSTM at moment t.
According to an embodiment of the invention, the above Bi-LSTM is a bidirectional long short-term memory network formed from the two LSTMs.
According to an embodiment of the invention, after each term vector of the question text and the candidate answer text has been processed with the LSTM calculation formulas, the feature vector sequence with context local features of the question text and that of the candidate answer text are obtained.
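The bidirectional pass can be sketched as follows. For brevity a simplified tanh recurrent cell stands in for each full LSTM (an assumption of this sketch, not the embodiment's architecture); what the sketch shows is the splicing h_t = [h_fw, h_bw], which doubles the per-direction dimension.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h, T = 4, 3, 6  # embedding size, per-direction hidden size, sequence length

def run_rnn(xs, W_x, W_h):
    """Simplified recurrent pass; a tanh cell stands in for a full LSTM."""
    h, outs = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        outs.append(h)
    return np.stack(outs)

xs = rng.normal(size=(T, d_x))  # forward-order term vector sequence
fw = run_rnn(xs, rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)))
# process the reverse-order sequence, then flip back to forward time order
bw = run_rnn(xs[::-1], rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)))[::-1]
H = np.concatenate([fw, bw], axis=1)  # per moment: h_t = [h_fw, h_bw]
print(H.shape)  # (6, 6): twice the per-direction hidden size
```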
Fig. 4 diagrammatically illustrates the flow chart of the clustering method of short text according to another embodiment of the invention.
As shown in figure 4, the above method further includes step S210 and step S220 before step S111.
In step S210, word segmentation is performed on the short text to be sorted and on the semantically related short text, obtaining the words of the short text to be sorted and the words of the semantically related short text.
In step S220, distributed representations are computed for the words of the short text to be sorted and for the words of the semantically related short text, obtaining the term vector sequence of the short text to be sorted and the term vector sequence of the semantically related short text.
According to an embodiment of the invention, a word segmentation tool is used to segment the short text to be sorted and the semantically related short text, and the embedding layer of the deep-learning framework Keras converts the words of each text into distributed representations, i.e., their respective term vectors; the parameters of the embedding layer are trained together with the deep-learning model. The term vectors of the short text to be sorted and of the semantically related short text are then assembled into vector sequences. To facilitate computation on the term vector sequences, a fixed length is chosen: short vector sequences are padded with 0, and vector sequences longer than the limit are truncated.
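The length-fixing rule just described (pad short sequences with 0, truncate long ones) can be sketched as a small helper. The function name, the example token ids, and the length limit of 5 are illustrative assumptions, not values from the embodiment.

```python
def pad_or_truncate(seq_ids, max_len, pad_id=0):
    """Fix the term-id sequence length: pad short sequences with 0,
    truncate sequences longer than the limit."""
    return (seq_ids + [pad_id] * (max_len - len(seq_ids)))[:max_len]

print(pad_or_truncate([5, 9, 2], 5))           # [5, 9, 2, 0, 0]
print(pad_or_truncate([5, 9, 2, 7, 1, 4], 5))  # [5, 9, 2, 7, 1]
```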
Fig. 5 diagrammatically illustrates the flow chart of the clustering method of short text according to another embodiment of the invention.
As shown in Fig. 5, based on the foregoing scheme, the above method further includes step S310 and step S320.
In step S310, background information is generated based on the feature vector sequence with context local features of the short text to be sorted and the feature vector sequence with context local features of the semantically related short text; the background information contains the semantic information of all time states, before the last moment of the recurrent neural network, of the term vectors of the short text to be sorted and of the semantically related short text.
In step S320, the attention weight of each moment's feature vector in the feature vector sequence with context local features of the short text to be sorted, and the attention weight of each moment's feature vector in the feature vector sequence with context local features of the semantically related short text, are determined according to the background information.
By taking the semantic information of all time states before the last moment of the recurrent neural network as the background information of the short text to be sorted and the semantically related short text, and computing with reference to this background information the attention weight of each moment's feature vector in the feature vector sequences with context local features of the two texts, the attention weights obtained in this way can effectively reflect the deep semantic information of the short text to be sorted and the semantically related short text, overcoming the defect of the prior art, which reflects only the word frequency and inverse document frequency of the words of a short text.
In one embodiment of the invention, the above background information can be obtained by respectively choosing, from the feature vector sequence with context local features of the short text to be sorted and from that of the semantically related short text, the feature vector of the last-moment state of the Bi-LSTM, and splicing the two vectors; the spliced vector serves as the background information representation, and it contains the semantic information of all previous time states of the short text to be sorted and the semantically related short text. Moreover, since the background information is built from the feature vectors of the last-moment state of the Bi-LSTM, the last-moment state can be obtained through the LSTM calculation formulas: it is computed by combining the feature vectors of all previous time states in the LSTM, which is why the background information contains the semantic information of all previous time states of the two texts.
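The splicing of the two last-moment state vectors can be sketched as below. The fully connected layer that halves the spliced dimension back down (described later in this embodiment) appears here as a randomly initialized matrix standing in for the trained layer; all dimensions and sequence lengths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6                           # Bi-LSTM output dimension per moment
H_q = rng.normal(size=(8, d))   # question-text feature vector sequence
H_a = rng.normal(size=(10, d))  # answer-text feature vector sequence

# Splice the two last-moment state vectors into one background representation,
# then halve the dimension with a (here random, normally trained) fully
# connected layer so bkg matches the per-moment Bi-LSTM output dimension.
spliced = np.concatenate([H_q[-1], H_a[-1]])   # shape (2d,)
W_fc = rng.normal(scale=0.1, size=(d, 2 * d))  # stand-in for the trained layer
bkg = W_fc @ spliced
print(bkg.shape)  # (6,)
```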
How to obtain the feature vectors with context local features and global features of the short text to be sorted and of the semantically related short text is described in detail below with reference to Fig. 6 and Fig. 7.
Fig. 6 diagrammatically illustrates the flow chart of the clustering method of short text according to another embodiment of the invention.
As shown in Fig. 6, in step S112, "generating the feature vector with context local features and global features of the short text to be sorted based on the feature vector sequence with context local features of the short text to be sorted and the attention weight of each feature vector in that sequence" may specifically include step S112-1, step S112-2 and step S112-3.
In step S112-1, the similarity between the background information and each moment's feature vector in the feature vector sequence with context local features of the short text to be sorted is determined according to the background information.
In step S112-2, the attention weight of each moment's feature vector in the feature vector sequence with context local features of the short text to be sorted is determined according to the similarity between the background information and each moment's feature vector in that sequence.
In step S112-3, each moment's feature vector in the feature vector sequence with context local features of the short text to be sorted is weighted by its attention weight, and the weighted vectors are summed, obtaining the feature vector with context local features and global features of the short text to be sorted.
Fig. 7 diagrammatically illustrates the flow chart of the clustering method of short text according to another embodiment of the invention.
As shown in Fig. 7, in step S112, "generating the feature vector with context local features and global features of the semantically related short text based on the feature vector sequence with context local features of the semantically related short text and the attention weight of each feature vector in that sequence" may specifically include step S112-4, step S112-5 and step S112-6.
In step S112-4, the similarity between the background information and each moment's feature vector in the feature vector sequence with context local features of the semantically related short text is determined according to the background information.
In step S112-5, the attention weight of each moment's feature vector in the feature vector sequence with context local features of the semantically related short text is determined according to the similarity between the background information and each moment's feature vector in that sequence.
In step S112-6, each moment's feature vector in the feature vector sequence with context local features of the semantically related short text is weighted by its attention weight, and the weighted vectors are summed, obtaining the feature vector with context local features and global features of the semantically related short text.
According to an embodiment of the invention, the similarity between the background information and each moment's feature vector in the feature vector sequence with context local features of the short text to be sorted, and the corresponding similarity for the semantically related short text, are computed with reference to the above background information; the attention weight of each moment's feature vector in the two sequences is then computed from these similarities. Attention weights computed in this way can effectively reflect the deep semantic information of the short text to be sorted and the semantically related short text, overcoming the defect of the prior art, which reflects only the word frequency and inverse document frequency of the words of a short text.
In one embodiment of the invention, following the basic idea of the attention mechanism, the feature vectors of the last-moment state of the short text to be sorted and of the semantically related short text in the Bi-LSTM can be chosen and spliced as the background information representation; this background information contains the semantic information of all previous time states of the two texts. A fully connected layer then reduces its dimension by half, making it consistent with the dimension of the output vector sequences of the two texts in the Bi-LSTM. This parameter is denoted bkg. The feature vectors with context local features and global features of the short text to be sorted and the semantically related short text can then be obtained in three stages.
First stage: the similarity between the background information bkg and the output feature vector h_t of the short text to be sorted or the semantically related short text at moment t in the Bi-LSTM can be computed with a text similarity formula. The specific formula is as follows:
sim_t = bkg · h_t
Here sim_t denotes the similarity at moment t between the background information bkg and a feature vector h_t in the feature vector sequence with context local features of the short text to be sorted or the semantically related short text. According to this formula, the similarity vectors Sim_q and Sim_a of the short text to be sorted and the semantically related short text are computed separately.
Second stage: a softmax calculation is introduced to numerically convert the similarity scores of the first stage. On the one hand this normalizes them, organizing the original scores into a probability distribution in which the weights of all elements sum to 1; on the other hand, the inherent mechanism of softmax further highlights the weights of the important information in the short text to be sorted and the semantically related short text. The formula is as follows:

a_t = exp(sim_t) / Σ_{k=1}^{N} exp(sim_k)
Here a_t is the attention weight at moment t of a feature vector in the feature vector sequence with context local features of the short text to be sorted or the semantically related short text, and N is the length of the feature vector sequence of the short text to be sorted or the semantically related short text. According to this formula, the attention weights a_qt and a_at of each moment t of the short text to be sorted and the semantically related short text can be computed from the similarity vectors Sim_q and Sim_a, respectively.
Third stage: a_qt and a_at are the attention weights at moment t of the feature vectors in the feature vector sequences with context local features of the short text to be sorted and the semantically related short text, respectively. The output vector h_t of the word at moment t of the short text to be sorted or the semantically related short text is weighted by its attention weight, forming the new vector s_t of the word at moment t. The formula is as follows:
s_t = a_t · h_t
Then the vectors s_t obtained at each moment are summed, generating the respective attention value vectors of the short text to be sorted and the semantically related short text, i.e., the feature vector with context local features and global features of the short text to be sorted and that of the semantically related short text. The specific formula is as follows:

Attention = Σ_{t=1}^{N} s_t = Σ_{t=1}^{N} a_t · h_t
Here a_t is the attention weight of the word at moment t, N is the length of the feature vector sequence of the short text to be sorted or the semantically related short text, and Attention is the attention value vector.
Through the above stages, the attention weight at each moment t of each feature vector in the feature vector sequences with context local features (of length N) of the short text to be sorted and the semantically related short text is computed from the background information; the feature vectors at each moment are then weighted by their attention weights and summed. In this way the feature vector with context local features and global features of the short text to be sorted and that of the semantically related short text are constructed separately. The two feature vectors are then spliced into the feature vector of one short text to be sorted; by this method the semantic feature vectors of the multiple short texts to be sorted are obtained, where each semantic feature vector contains the context local features and global features of the short text to be sorted and of the semantically related short text. These semantic feature vectors can then be input into the clustering algorithm, improving the clustering effect.
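The three stages above can be sketched in numpy. The feature vectors and background vector here are random stand-ins, and the lengths N = 5 and d = 4 are assumptions; the three commented lines map directly onto the three formulas.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 5, 4
H = rng.normal(size=(N, d))  # h_t: feature vectors with context local features
bkg = rng.normal(size=d)     # background information vector

sim = H @ bkg                             # stage 1: sim_t = bkg . h_t
a = np.exp(sim) / np.exp(sim).sum()       # stage 2: softmax attention weights
attention = (a[:, None] * H).sum(axis=0)  # stage 3: sum of s_t = a_t * h_t

print(round(a.sum(), 6))  # 1.0 -- the weights form a probability distribution
print(attention.shape)    # (4,)
```

Running this once for the question text and once for the candidate answer text yields the two attention value vectors, which the embodiment then splices into the final semantic feature vector.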
Fig. 8 diagrammatically illustrates the block diagram of the clustering apparatus of short text according to an embodiment of the invention.
As shown in Fig. 8, the clustering apparatus 800 for short texts includes an obtaining module 810 and a clustering module 820.
Specifically, the obtaining module 810 obtains the semantic feature vectors of multiple short texts to be sorted through a recurrent neural network and an attention mechanism; the semantic feature vector of each of the multiple short texts to be sorted contains the context local features and global features of the short text to be sorted and of the semantically related short text, the semantically related short text being a semantic supplement to the short text to be sorted;
the clustering module 820 uses a clustering algorithm to iteratively cluster the semantic feature vectors of the multiple short texts to be sorted according to k initial cluster center points, dividing the semantic feature vectors of the multiple short texts to be sorted into multiple short text classes; the k initial cluster center points include the semantic feature vectors of k short texts to be sorted chosen from the semantic feature vectors of the multiple short texts to be sorted.
The clustering apparatus 800 for short texts can obtain the semantic feature vectors of multiple short texts to be sorted through the recurrent neural network and the attention mechanism, iteratively cluster them according to the k initial cluster center points with the clustering algorithm, and finally divide the semantic feature vectors of the multiple short texts to be sorted into multiple short text classes. Since the semantic feature vector of each short text to be sorted contains the context local features and global features of both the short text to be sorted and the semantically related short text, a better short text clustering effect can be achieved when clustering the semantic feature vectors, i.e., the multiple short text classes obtained are more accurate.
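The iterative clustering the module performs (K-means style, per the claims) can be sketched as follows. The data, the number of clusters k = 3, and the fixed iteration count are illustrative assumptions; a real run would iterate until the assignments stop changing.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 6))  # semantic feature vectors of 20 short texts
k = 3
# k initial cluster center points chosen from the vectors themselves
centers = X[rng.choice(len(X), size=k, replace=False)].copy()

for _ in range(10):  # iterate: assign by minimal distance, then re-center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # minimal-distance principle
    centers = np.stack([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])

print(labels.shape)  # (20,)
```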
According to an embodiment of the invention, the clustering apparatus 800 for short texts can be used to implement the clustering method of short texts described above with reference to Fig. 1 to Fig. 7.
Since the modules of the clustering apparatus 800 for short texts of the example embodiments of the present invention can be used to implement the steps of the above example embodiments of the clustering method of short texts, for details not disclosed in the apparatus embodiments of the present invention, please refer to the above embodiments of the clustering method of short texts of the present invention.
Referring now to Fig. 9, it illustrates a structural schematic diagram of a computer system 900 of an electronic device suitable for implementing the embodiments of the present invention. The computer system 900 of the electronic device shown in Fig. 9 is only an example and should not impose any restriction on the function and scope of use of the embodiments of the present invention.
As shown in Fig. 9, the computer system 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data needed for system operation. The CPU 901, the ROM 902 and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
The I/O interface 905 is connected to the following components: an input section 906 including a keyboard, a mouse and the like; an output section 907 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, and a loudspeaker and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A driver 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the driver 910 as needed, so that a computer program read therefrom is installed into the storage section 908 as needed.
In particular, according to an embodiment of the invention, the processes described above with reference to the flow charts may be implemented as computer software programs. For example, an embodiment of the present invention includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above functions defined in the system of the present application are executed.
It should be noted that the computer-readable medium shown in the present invention may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above. In the present invention, the computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in connection with, an instruction execution system, apparatus or device. In the present invention, the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any appropriate combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, which can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any appropriate combination of the above.
The flow charts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each box in a flow chart or block diagram may represent a module, program segment or part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes represented in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the opposite order, depending on the functions involved. It should also be noted that each box in a block diagram or flow chart, and combinations of boxes in a block diagram or flow chart, can be implemented by a dedicated hardware-based system that executes the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or by hardware, and the described units may also be provided in a processor; in some cases, the names of these units do not constitute a limitation on the units themselves.
As another aspect, the present invention also provides a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the clustering method of short texts as described in the above embodiments.
For example, the electronic device may implement the steps shown in Fig. 1: in step S110, the semantic feature vectors of multiple short texts to be sorted are obtained through a recurrent neural network and an attention mechanism, where the semantic feature vector of each of the multiple short texts to be sorted contains the context local features and global features of the short text to be sorted and of the semantically related short text, the semantically related short text being a semantic supplement to the short text to be sorted; in step S120, a clustering algorithm iteratively clusters the semantic feature vectors of the multiple short texts to be sorted according to k initial cluster center points, dividing the semantic feature vectors of the multiple short texts to be sorted into multiple short text classes, the k initial cluster center points including the semantic feature vectors of k short texts to be sorted chosen from the semantic feature vectors of the multiple short texts to be sorted.
It should be noted that although several modules or units of the device for executing actions are mentioned in the above detailed description, this division is not mandatory. In fact, according to embodiments of the present invention, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the example embodiments described here can be realized by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash disk, a mobile hard disk, etc.) or on a network, and includes several instructions that cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present invention.
Those skilled in the art, after considering the specification and practicing the invention disclosed here, will readily conceive of other embodiments of the present invention. This application is intended to cover any variations, uses or adaptations of the invention that follow the general principles of the invention and include common knowledge or conventional techniques in the art not disclosed by the present invention. The description and examples are to be considered illustrative only, and the true scope and spirit of the invention are pointed out by the following claims.
It should be understood that the invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (12)

1. A clustering method for short texts, characterized in that the method comprises:
obtaining semantic feature vectors of multiple short texts to be classified through a recurrent neural network and an attention mechanism, wherein the semantic feature vector of each short text to be classified contains the contextual local features and global features of the short text to be classified and the contextual local features and global features of a semantically related short text, the semantically related short text being a semantic supplement to the short text to be classified;
using a clustering algorithm to iteratively cluster the semantic feature vectors of the multiple short texts to be classified according to k initial cluster center points, dividing the semantic feature vectors of the multiple short texts to be classified into multiple short-text classes, wherein the k initial cluster center points include the semantic feature vectors of k short texts to be classified chosen from among the semantic feature vectors of the multiple short texts to be classified.
2. The method according to claim 1, characterized in that the clustering algorithm includes the K-means algorithm.
3. The method according to claim 2, characterized in that using the clustering algorithm to iteratively cluster the semantic feature vectors of the multiple short texts to be classified according to k initial cluster center points and divide them into multiple short-text classes includes:
successively computing, with the K-means algorithm, the distance between each not-yet-selected semantic feature vector among the semantic feature vectors of the multiple short texts to be classified and the k cluster centers, and clustering the semantic feature vectors according to the minimum-distance principle;
according to the clustering result, taking the mean of the semantic feature vectors of the short texts to be classified in each cluster as the center point of that cluster;
iteratively clustering the semantic feature vectors of the short texts to be classified in each cluster according to the center point of each cluster, until the semantic feature vectors of the multiple short texts to be classified are divided into multiple short-text classes that satisfy a preset condition.
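The iteration steps of claim 3 (minimum-distance assignment, recomputing each center as its cluster's mean, repeating until a preset condition holds) are ordinary K-means. A minimal pure-Python sketch, where "assignments no longer change" stands in for the unspecified preset condition and all names and toy vectors are illustrative:

```python
def kmeans(vectors, centers, max_iter=100):
    """Minimal K-means over semantic feature vectors: assign each
    vector to the nearest center (minimum-distance principle), move
    each center to the mean of its cluster, and repeat until the
    assignment stops changing (our stand-in 'preset condition')."""
    def dist2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def mean(points):  # component-wise mean of a list of tuples
        n = len(points)
        return tuple(sum(c) / n for c in zip(*points))

    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(len(centers)),
                              key=lambda j: dist2(v, centers[j]))
                          for v in vectors]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for j in range(len(centers)):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centers[j] = mean(members)
    return assignment, centers

# Two well-separated toy groups; initial centers drawn from the samples.
vectors = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels, centers = kmeans(vectors, centers=[(0.0, 0.0), (5.0, 5.0)])
assert labels == [0, 0, 1, 1]
```

After convergence each center sits at the mean of its cluster, which is exactly the "mean of the semantic feature vectors of each cluster" named in the claim.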
4. The method according to claim 1, characterized in that obtaining the semantic feature vectors of multiple short texts to be classified through a recurrent neural network and an attention mechanism includes:
using the recurrent neural network to obtain a feature vector sequence with contextual local features for the short text to be classified and a feature vector sequence with contextual local features for the semantically related short text;
based on the feature vector sequence with contextual local features of the short text to be classified and the attention weight of each feature vector in that sequence, generating a feature vector with contextual local features and global features for the short text to be classified; and based on the feature vector sequence with contextual local features of the semantically related short text and the attention weight of each feature vector in that sequence, generating a feature vector with contextual local features and global features for the semantically related short text;
determining the semantic feature vectors of the multiple short texts to be classified according to the feature vector with contextual local features and global features of the short text to be classified and the feature vector with contextual local features and global features of the semantically related short text.
5. The method according to claim 4, characterized in that before using the recurrent neural network to obtain the feature vector sequence with contextual local features of the short text to be classified and the feature vector sequence with contextual local features of the semantically related short text, the method further includes:
performing word segmentation on the short text to be classified and the semantically related short text respectively, obtaining the words of the short text to be classified and the words of the semantically related short text;
performing distributed representation on the words of the short text to be classified and the words of the semantically related short text respectively, obtaining a word vector sequence for the short text to be classified and a word vector sequence for the semantically related short text.
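The preprocessing of claim 5 — word segmentation followed by distributed representation — can be illustrated as below. The vocabulary, the 3-dimensional vectors, and the whitespace "segmenter" are all stand-ins of ours; a real system would use a proper Chinese word segmenter and embeddings from a trained model such as word2vec.

```python
# Toy embedding table: each word maps to a made-up 3-d distributed
# representation. In practice these vectors come from a trained model.
EMBEDDINGS = {
    "insurance": (0.9, 0.1, 0.0),
    "claim":     (0.8, 0.2, 0.1),
    "weather":   (0.0, 0.9, 0.3),
}
UNK = (0.0, 0.0, 0.0)  # fallback vector for out-of-vocabulary words

def segment(text):
    # Whitespace splitting stands in for a real word segmenter here.
    return text.lower().split()

def to_word_vectors(text):
    """Word segmentation followed by embedding lookup, yielding the
    word vector sequence fed to the recurrent neural network."""
    return [EMBEDDINGS.get(w, UNK) for w in segment(text)]

seq = to_word_vectors("Insurance claim")
assert seq == [(0.9, 0.1, 0.0), (0.8, 0.2, 0.1)]
```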
6. The method according to claim 1, characterized in that the recurrent neural network includes a bidirectional recurrent neural network, and the recurrent units in the bidirectional recurrent neural network include networks based on long short-term memory (LSTM) cells and/or gated recurrent units (GRU).
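A bidirectional recurrent encoder of the kind claim 6 describes can be sketched with a plain tanh recurrent cell standing in for the LSTM/GRU cells; the scalar weights and toy inputs are our simplifications, and only the forward/backward concatenation structure is the point.

```python
import math

def rnn_pass(word_vectors, w=0.5, u=0.3):
    """One-direction recurrent pass: h_t = tanh(w * mean(x_t) + u * h_{t-1}).
    A plain tanh cell stands in for the LSTM/GRU cells of claim 6;
    weights are scalars for brevity, not learned parameters."""
    h, states = 0.0, []
    for x in word_vectors:
        h = math.tanh(w * sum(x) / len(x) + u * h)
        states.append(h)
    return states

def bidirectional_encode(word_vectors):
    """Concatenate forward and backward states so each position's
    feature carries context from both sides (the 'contextual local
    features' of the feature vector sequence)."""
    fwd = rnn_pass(word_vectors)
    bwd = rnn_pass(word_vectors[::-1])[::-1]
    return list(zip(fwd, bwd))

feats = bidirectional_encode([(0.9, 0.1, 0.0), (0.8, 0.2, 0.1)])
assert len(feats) == 2 and len(feats[0]) == 2
```

Each position now yields one (forward, backward) feature pair, so the output sequence has the same length as the input word vector sequence.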
7. The method according to claim 5, characterized in that the method further includes:
generating background information based on the feature vector sequence with contextual local features of the short text to be classified and the feature vector sequence with contextual local features of the semantically related short text, the background information including the semantic information of all time states of the recurrent neural network, from the word vectors of the short text to be classified to the word vectors of the semantically related short text, up to the last moment;
determining, according to the background information, the attention weight of each moment's feature vector in the feature vector sequence with contextual local features of the short text to be classified and the attention weight of each moment's feature vector in the feature vector sequence with contextual local features of the semantically related short text.
8. The method according to claim 7, characterized in that generating the feature vector with contextual local features and global features of the short text to be classified, based on the feature vector sequence with contextual local features of the short text to be classified and the attention weight of each feature vector in that sequence, includes:
determining, according to the background information, the similarity between the background information and each moment's feature vector in the feature vector sequence with contextual local features of the short text to be classified;
determining the attention weight of each moment's feature vector in the feature vector sequence with contextual local features of the short text to be classified according to the similarity between the background information and each moment's feature vector in that sequence;
weighting and summing each moment's feature vector in the feature vector sequence with contextual local features of the short text to be classified according to its attention weight, obtaining the feature vector with contextual local features and global features of the short text to be classified.
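The three steps of claim 8 (similarity of each timestep's feature vector with the background information, normalization into attention weights, weighted summation) can be sketched as below. Using the dot product as the similarity measure and softmax as the normalization is our assumption; the claim does not fix either choice.

```python
import math

def attention_pool(features, background):
    """Attention pooling as in claims 7-8: score each timestep's
    feature vector by its similarity to the background vector,
    normalize the scores into attention weights with softmax, then
    take the weighted sum to obtain one fixed-length vector that
    combines contextual local features with global features."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(f, background) for f in features]
    m = max(scores)                         # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = tuple(sum(w * f[i] for w, f in zip(weights, features))
                   for i in range(len(features[0])))
    return pooled, weights

# Two toy timestep features; the background vector resembles the first.
features = [(1.0, 0.0), (0.0, 1.0)]
pooled, weights = attention_pool(features, background=(1.0, 0.0))
assert weights[0] > weights[1]  # the more similar timestep gets more weight
```

The same pooling applied to the semantically related short text's feature sequence yields its fixed-length vector as well, which is the symmetric case covered by claim 9.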
9. The method according to claim 7, characterized in that generating the feature vector with contextual local features and global features of the semantically related short text, based on the feature vector sequence with contextual local features of the semantically related short text and the attention weight of each feature vector in that sequence, includes:
determining, according to the background information, the similarity between the background information and each moment's feature vector in the feature vector sequence with contextual local features of the semantically related short text;
determining the attention weight of each moment's feature vector in the feature vector sequence with contextual local features of the semantically related short text according to the similarity between the background information and each moment's feature vector in that sequence;
weighting and summing each moment's feature vector in the feature vector sequence with contextual local features of the semantically related short text according to its attention weight, obtaining the feature vector with contextual local features and global features of the semantically related short text.
10. A clustering apparatus for short texts, characterized in that the apparatus includes:
an obtaining module, which obtains semantic feature vectors of multiple short texts to be classified through a recurrent neural network and an attention mechanism, wherein the semantic feature vector of each short text to be classified contains the contextual local features and global features of the short text to be classified and the contextual local features and global features of a semantically related short text, the semantically related short text being a semantic supplement to the short text to be classified;
a clustering module, which uses a clustering algorithm to iteratively cluster the semantic feature vectors of the multiple short texts to be classified according to k initial cluster center points, dividing the semantic feature vectors of the multiple short texts to be classified into multiple short-text classes, wherein the k initial cluster center points include the semantic feature vectors of k short texts to be classified chosen from among the semantic feature vectors of the multiple short texts to be classified.
11. An electronic device, comprising:
one or more processors; and
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 9.
12. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN201811563089.7A 2018-12-20 2018-12-20 Clustering method, device, medium and the electronic equipment of short text Pending CN109710760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811563089.7A CN109710760A (en) 2018-12-20 2018-12-20 Clustering method, device, medium and the electronic equipment of short text


Publications (1)

Publication Number Publication Date
CN109710760A (en) 2019-05-03

Family

ID=66256134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811563089.7A Pending CN109710760A (en) 2018-12-20 2018-12-20 Clustering method, device, medium and the electronic equipment of short text

Country Status (1)

Country Link
CN (1) CN109710760A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134965A (en) * 2019-05-21 2019-08-16 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for information processing
CN110222349A (en) * 2019-06-13 2019-09-10 成都信息工程大学 A kind of model and method, computer of the expression of depth dynamic context word
CN110297887A (en) * 2019-06-26 2019-10-01 山东大学 Service robot personalization conversational system and method based on cloud platform
CN110298005A (en) * 2019-06-26 2019-10-01 上海观安信息技术股份有限公司 The method that a kind of couple of URL is normalized
CN111862985A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition device, method, electronic equipment and storage medium
CN113485738A (en) * 2021-07-19 2021-10-08 上汽通用五菱汽车股份有限公司 Intelligent software fault classification method and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120030163A1 (en) * 2006-01-30 2012-02-02 Xerox Corporation Solution recommendation based on incomplete data sets
CN104834747A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Short text classification method based on convolution neutral network
CN106649853A (en) * 2016-12-30 2017-05-10 儒安科技有限公司 Short text clustering method based on deep learning
CN106776713A (en) * 2016-11-03 2017-05-31 中山大学 It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis
CN107451187A (en) * 2017-06-23 2017-12-08 天津科技大学 Sub-topic finds method in half structure assigned short text set based on mutual constraint topic model
CN108846077A (en) * 2018-06-08 2018-11-20 泰康保险集团股份有限公司 Semantic matching method, device, medium and the electronic equipment of question and answer text


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862985A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition device, method, electronic equipment and storage medium
CN110134965A (en) * 2019-05-21 2019-08-16 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for information processing
CN110134965B (en) * 2019-05-21 2023-08-18 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for information processing
CN110222349A (en) * 2019-06-13 2019-09-10 成都信息工程大学 A kind of model and method, computer of the expression of depth dynamic context word
CN110297887A (en) * 2019-06-26 2019-10-01 山东大学 Service robot personalization conversational system and method based on cloud platform
CN110298005A (en) * 2019-06-26 2019-10-01 上海观安信息技术股份有限公司 The method that a kind of couple of URL is normalized
CN110297887B (en) * 2019-06-26 2021-07-27 山东大学 Service robot personalized dialogue system and method based on cloud platform
CN113485738A (en) * 2021-07-19 2021-10-08 上汽通用五菱汽车股份有限公司 Intelligent software fault classification method and readable storage medium

Similar Documents

Publication Publication Date Title
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN109726396A (en) Semantic matching method, device, medium and the electronic equipment of question and answer text
CN110796190B (en) Exponential modeling with deep learning features
CN111444340B (en) Text classification method, device, equipment and storage medium
CN109101537B (en) Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
CN109710760A (en) Clustering method, device, medium and the electronic equipment of short text
CN108846077A (en) Semantic matching method, device, medium and the electronic equipment of question and answer text
CN109753566A (en) The model training method of cross-cutting sentiment analysis based on convolutional neural networks
CN110516073A (en) A kind of file classification method, device, equipment and medium
CN109271493A (en) A kind of language text processing method, device and storage medium
CN108959482A (en) Single-wheel dialogue data classification method, device and electronic equipment based on deep learning
CN112163165A (en) Information recommendation method, device, equipment and computer readable storage medium
CN109766557A (en) A kind of sentiment analysis method, apparatus, storage medium and terminal device
CN110334186A (en) Data query method, apparatus, computer equipment and computer readable storage medium
CN109933792A (en) Viewpoint type problem based on multi-layer biaxially oriented LSTM and verifying model reads understanding method
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111259647A (en) Question and answer text matching method, device, medium and electronic equipment based on artificial intelligence
CN109739960A (en) Sentiment analysis method, sentiment analysis device and the terminal of text
Bokka et al. Deep Learning for Natural Language Processing: Solve your natural language processing problems with smart deep neural networks
CN115422944A (en) Semantic recognition method, device, equipment and storage medium
CN110188158A (en) Keyword and topic label generating method, device, medium and electronic equipment
CN112084307A (en) Data processing method and device, server and computer readable storage medium
CN110851650B (en) Comment output method and device and computer storage medium
CN108268629A (en) Image Description Methods and device, equipment, medium, program based on keyword
Zhou et al. Deep personalized medical recommendations based on the integration of rating features and review sentiment analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190503