CN109388706A - Fine-grained question classification method, system, and device - Google Patents

Fine-grained question classification method, system, and device

Info

Publication number
CN109388706A
CN109388706A (application CN201710678652.4A)
Authority
CN
China
Prior art keywords
vector
text
fine-grained
word
question text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710678652.4A
Other languages
Chinese (zh)
Inventor
吕钊
谢雨飞
贺樑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN201710678652.4A
Publication of CN109388706A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2148: Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/243: Classification techniques relating to the number of classes
    • G06F 18/2431: Multiple classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Abstract

The present invention provides a fine-grained question classification method, system, and device, comprising the following steps: a semantic unit extraction step, which extracts semantic units from the original question text; a semantic unit expansion step, which expands the semantic units using a vector space model to obtain an expanded question text; a word encoding step, which encodes the expanded question text using a bidirectional long short-term memory network to obtain an encoded question text; a word focusing step, which applies an attention mechanism to the encoded question text to obtain a question text vector; and a fine-grained classification step, which classifies the question text vector using a softmax classifier. While guaranteeing a certain accuracy, the present invention automates this process as far as possible, improves the efficiency of fine-grained question classification, and can save labor cost to the greatest extent.

Description

Fine-grained question classification method, system, and device
Technical field
The present invention relates to the field of fine-grained text classification, and in particular to a fine-grained question classification method, system, and device.
Background technique
With the development of social networks, social tools such as community question answering, microblogs, and WeChat have become increasingly popular. Social Q&A websites such as Quora, ResearchGate, Yahoo! Answers, Zhihu, and Douban have attracted the attention of many scholars at home and abroad. Question classification is a core component of Q&A websites and directly affects users' retrieval efficiency. In recent years, with the development of deep learning, the fine-grained classification of questions has increasingly gained the favor of researchers. Fine-grained question classification, which aims at a more refined classification of questions, is a current research hotspot.
Generally, a question is a sentence of roughly 140-150 characters that needs to be answered or explained. Fine-grained question classification belongs to short-text classification; its main task is to accurately distinguish subcategories that belong to the same supercategory. Compared with general short-text classification, fine-grained question classification faces the following difficulties: (1) the global features of fine-grained categories are very similar, differing only in some local regions; (2) it is difficult to identify the local feature regions where the differences are large; (3) the question text itself is short, so the feature space is sparse. Traditional text representation models are therefore ineffective when applied directly to fine-grained question classification.
Summary of the invention
In view of the above problems, the object of the present invention is to overcome the inherent sparseness of the short-text feature space and to propose a fine-grained question classification method.
According to a first aspect of the present invention, a fine-grained question classification method is provided for classifying an original question text, comprising the following steps: a semantic unit extraction step, which extracts semantic units from the original question text; a semantic unit expansion step, which expands the semantic units using a vector space model to obtain an expanded question text; a word encoding step, which encodes the expanded question text using a bidirectional long short-term memory network to obtain an encoded question text; a word focusing step, which applies an attention mechanism to the encoded question text to obtain a question text vector; and a fine-grained classification step, which classifies the question text vector using a softmax classifier.
Preferably, the semantic unit extraction step comprises: traversing a dependency parse tree to find all noun-phrase nodes and verb-phrase nodes in the original question text as the semantic units.
Preferably, the semantic unit expansion step comprises: a semantic unit vector generation step, which converts the semantic units into a semantic unit vector V_u satisfying:
V_u = V_u1 + V_u2 + … + V_um = {z_1, z_2, …, z_d}
where V_u1, V_u2, …, V_um denote the unit vectors corresponding to the phrase nodes in the semantic units, m is a natural number, z_1, z_2, …, z_d are the components of the vector, and d is a natural number; a cosine similarity calculation step, which calculates the cosine similarity between the semantic unit vector and the vectors of all words in a Word2Vec model; and an expanded question text acquisition step, which selects the words or phrases corresponding to the largest cosine similarities as the expanded question text.
Preferably, in the word encoding step, the bidirectional long short-term memory network satisfies:
→h_i = →LSTM(x_i), i ∈ [1, T]
←h_i = ←LSTM(x_i), i ∈ [T, 1]
h_i = [→h_i, ←h_i]
where x_i denotes a word vector, →LSTM denotes the forward LSTM unit, which reads the words from the 1st word to the T-th word, ←LSTM denotes the backward LSTM unit, which reads the words from the T-th word to the 1st word, →h_i denotes the output of the i-th forward hidden layer, and ←h_i denotes the output of the i-th backward hidden layer.
Preferably, the word focusing step obtains the question text vector using the formulas:
u_i = tanh(E_w h_i + b_w)
α_i = exp(u_iᵀ u_w) / Σ_i exp(u_iᵀ u_w)
s = Σ_i α_i h_i
where α_i denotes the normalized weight, s denotes the resulting question text vector, E_w and u_w denote randomly initialized weight parameters, b_w is a bias, u_i is the output of a single-layer perceptron, and h_i is the output of the hidden layer.
Preferably, in the fine-grained classification step, the softmax classifier uses the formulas:
p = softmax(W_c s + b_c)
L = −Σ log p_j
where j denotes the true label of a question text vector s, W_c denotes a randomly initialized weight parameter, b_c is a bias, p gives the probability score of each class, and L denotes the value of the loss function, the sum running over the training samples.
According to a second aspect of the present invention, a fine-grained question classification system is provided for classifying an original question text, comprising: a semantic unit extraction module, for extracting semantic units from the original question text; a semantic unit expansion module, for expanding the semantic units using a vector space model to obtain an expanded question text; a word encoder module, for encoding the expanded question text using a bidirectional long short-term memory network to obtain an encoded question text; a word focusing module, for applying an attention mechanism to the encoded question text to obtain a question text vector; and a softmax classifier module, for performing fine-grained classification on the question text vector using a softmax classifier.
According to a third aspect of the present invention, a fine-grained question classification device is provided for classifying an original question text, comprising: a storage unit, for storing a program which, when executed by a processing unit, implements the steps of the fine-grained question classification method of the first aspect; and an execution unit, for executing the program in the storage unit.
Compared with traditional question classification methods based on statistical learning, the present invention extracts semantic units from the original question text, expands the semantic units using a vector space model, performs word encoding and word focusing on the expanded question text, and uses a softmax classifier to perform fine-grained classification on the resulting vectorized question text, thereby improving the accuracy of fine-grained question classification. Compared with existing fine-grained question classification performed manually or with machine assistance, the present invention automates this process as far as possible while guaranteeing a certain accuracy, improves the efficiency of fine-grained question classification, and can save labor cost to the greatest extent.
Detailed description of the invention
The technical solution of the present invention is described in detail below with reference to the drawings and specific embodiments, so that the characteristics and advantages of the invention become more apparent.
Fig. 1 is a flow diagram of the fine-grained question classification method according to an embodiment of the present invention;
Fig. 2 is a detailed flow diagram of step S102 in Fig. 1;
Fig. 3 is a curve of threshold versus fine-grained question classification accuracy;
Fig. 4 is a module diagram of the fine-grained question classification system of the present invention;
Fig. 5 is a schematic diagram of the fine-grained question classification device of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below. Although the invention is illustrated and described with reference to some specific embodiments, it should be noted that the invention is not limited to these embodiments. On the contrary, modifications and equivalent replacements of the invention are intended to fall within the scope of the claims of the invention.
Some exemplary embodiments are described as processes or methods depicted as flow charts. Although a flow chart describes the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the drawings. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
Fig. 1 is a flow diagram of the fine-grained question classification method according to an embodiment of the present invention. As shown in Fig. 1, the fine-grained question classification method of the present invention comprises the steps of:
S101: semantic unit extraction. Extract semantic units from the original question text.
S102: semantic unit expansion. Expand the semantic units using a vector space model to obtain an expanded question text.
S103: word encoding. Encode the expanded question text using a bidirectional long short-term memory network to obtain an encoded question text.
S104: word focusing. Apply an attention mechanism to the encoded question text to obtain a question text vector.
S105: fine-grained classification. Classify the question text vector using a softmax classifier.
First, in step S101, semantic units are extracted from the original question text. The original question text is the question, originally posed by the user, that needs to be answered or explained. The original question text is analyzed with a dependency parser: by traversing the dependency parse tree, all noun-phrase nodes and verb-phrase nodes in the original question text are found and used as the semantic units.
Dependency parsing was first proposed by the French linguist L. Tesnière. He analyzed a sentence into a dependency syntax tree that depicts the dependency relations between the words, i.e., the syntactic collocation relations between words, which are associated with the semantics. It is generally accepted that noun phrases and verb phrases best reflect the semantic information of a sentence; therefore, by traversing the dependency parse tree, all noun-phrase nodes and verb-phrase nodes are extracted as the semantic units of the whole sentence.
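For illustration only, the following minimal sketch shows such an extraction with the open-source spaCy parser; the model name, and the approximation of verb-phrase nodes by each verb head's subtree, are assumptions rather than the patent's implementation:

```python
import spacy

# A minimal sketch of semantic-unit extraction (not the patent's implementation):
# noun-phrase nodes are approximated by spaCy noun chunks, and verb-phrase
# nodes by the subtree spanned by each verb head.
nlp = spacy.load("en_core_web_sm")  # assumed model name

def extract_semantic_units(question: str) -> list[str]:
    doc = nlp(question)
    units = [chunk.text for chunk in doc.noun_chunks]          # noun-phrase nodes
    for token in doc:
        if token.pos_ == "VERB":                               # verb-phrase nodes
            units.append(" ".join(t.text for t in token.subtree))
    return units

print(extract_semantic_units("How do I train a bidirectional LSTM for question classification?"))
```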
In step S102, the semantic units are expanded using a vector space model to obtain the expanded question text. Fig. 2 is a detailed flow diagram of step S102 in Fig. 1. As shown in Fig. 2, step S1021 is executed first: the semantic unit vector is generated, i.e., the semantic units are converted into a semantic unit vector. The semantic unit vector V_u satisfies:
V_u = V_u1 + V_u2 + … + V_um = {z_1, z_2, …, z_d}
where V_u1, V_u2, …, V_um denote the unit vectors corresponding to the phrase nodes in the semantic units, m is a natural number, z_1, z_2, …, z_d are the components of the vector, and d is a natural number.
In step S1022, the cosine similarity cos θ between the semantic unit vector and the vectors of all words in a Word2Vec model is calculated. Word2Vec is an open-source tool provided by Google for computing word vectors. It includes the Skip-gram algorithm, which computes the cosine similarity between the input vector of an input word and the output vector of a target word. In this embodiment, the semantic unit vector serves as the input vector, and the vectors of all words in the Word2Vec model serve as the output vectors; the cosine similarity cos θ between the two is calculated as:
cos θ = Σ_i x_i y_i / ( √(Σ_i x_i²) · √(Σ_i y_i²) )
where x_i and y_i denote the values of the two vectors in dimension i.
In step S1023, the expanded question text is obtained: the target words in the Word2Vec model corresponding to the largest cosine similarities calculated in step S1022 are selected as the expanded question text.
The number of words selected is determined by a threshold: the cosine similarity of a target word must be greater than the threshold. Fig. 3 is a curve of threshold versus fine-grained question classification accuracy; the abscissa is the threshold and the ordinate is the accuracy of fine-grained question classification. The curve in Fig. 3 is fitted through training with the Expectation-Maximization (EM) algorithm, an iterative algorithm for maximum likelihood estimation or maximum a posteriori estimation of probabilistic models containing latent variables. As shown in Fig. 3, the threshold is set to 35 in this embodiment; when the threshold is greater than 35, the accuracy is higher.
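For illustration only, a minimal sketch of the expansion step with gensim follows; the vector file path is a placeholder, and the top-n cutoff stands in for the similarity threshold described above:

```python
import numpy as np
from gensim.models import KeyedVectors

# A minimal sketch of semantic-unit expansion; "vectors.bin" is a placeholder
# path to pretrained word2vec-format vectors.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def expand_semantic_units(units: list[str], topn: int = 10) -> list[str]:
    # Sum the vectors of the unit words to form the semantic unit vector V_u.
    vecs = [wv[w] for w in units if w in wv]
    if not vecs:
        return []
    v_u = np.sum(vecs, axis=0)
    # Rank the vocabulary by cosine similarity to V_u and keep the top-n words
    # (the top-n cutoff stands in for the similarity threshold of Fig. 3).
    return [word for word, _ in wv.similar_by_vector(v_u, topn=topn)]
```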
In steps S103 and S104, word encoding and word focusing are performed on the expanded question text. The word encoding step S103 feeds the expanded question text into a bidirectional long short-term memory network (Long Short-Term Memory, LSTM) to obtain the encoded question text. The bidirectional LSTM satisfies:
→h_i = →LSTM(x_i), i ∈ [1, T]
←h_i = ←LSTM(x_i), i ∈ [T, 1]
h_i = [→h_i, ←h_i]
where x_i denotes a word vector, →LSTM denotes the forward LSTM unit, which reads the words from the 1st word to the T-th word, ←LSTM denotes the backward LSTM unit, which reads the words from the T-th word to the 1st word, →h_i denotes the output of the i-th forward hidden layer, and ←h_i denotes the output of the i-th backward hidden layer.
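For illustration only, a minimal PyTorch sketch of such a bidirectional LSTM encoder follows; the vocabulary size, embedding dimension, and hidden size are illustrative choices, not values from the patent:

```python
import torch
import torch.nn as nn

# A minimal sketch of the word-encoding step; vocabulary size, embedding
# dimension, and hidden size are illustrative choices.
class WordEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):        # token_ids: (batch, T)
        x = self.embed(token_ids)        # word vectors x_i
        h, _ = self.bilstm(x)            # h_i = [forward h_i, backward h_i]
        return h                         # (batch, T, 2 * hidden_dim)

h = WordEncoder()(torch.randint(0, 10000, (2, 12)))  # two questions, 12 tokens each
```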
In the word focusing step S104, the attention mechanism is applied to the encoded question text obtained in step S103 to obtain the question text vector. Specifically, word focusing uses the following formulas:
u_i = tanh(E_w h_i + b_w)
α_i = exp(u_iᵀ u_w) / Σ_i exp(u_iᵀ u_w)
s = Σ_i α_i h_i
where α_i denotes the normalized weight, s denotes the resulting question text vector, E_w and u_w denote randomly initialized weight parameters, b_w is a bias, u_i is the output of a single-layer perceptron, and h_i is the output of the hidden layer.
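For illustration only, a minimal PyTorch sketch of this attention layer follows; the hidden size (twice the LSTM hidden dimension, matching the encoder sketch above) is illustrative:

```python
import torch
import torch.nn as nn

# A minimal sketch of the word-focusing (attention) step; the hidden size
# (twice the LSTM hidden dimension) is illustrative.
class WordAttention(nn.Module):
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)        # E_w and b_w
        self.u_w = nn.Parameter(torch.randn(hidden_dim))     # context vector u_w

    def forward(self, h):                        # h: (batch, T, hidden_dim)
        u = torch.tanh(self.proj(h))             # u_i = tanh(E_w h_i + b_w)
        alpha = torch.softmax(u.matmul(self.u_w), dim=1)   # normalized weights alpha_i
        return (alpha.unsqueeze(-1) * h).sum(dim=1)        # s = sum_i alpha_i h_i

h = torch.randn(2, 12, 256)      # stand-in for the BiLSTM outputs above
s = WordAttention()(h)           # question text vectors, shape (2, 256)
```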
After the question text vector s is obtained, step S105 is executed: the question text vector s is classified at fine granularity by the softmax classifier, and the classification result is output.
The softmax classifier uses the formulas:
p = softmax(W_c s + b_c)
L = −Σ log p_j
where j denotes the true label of a question text vector s, W_c denotes a randomly initialized weight parameter, b_c is a bias, p gives the probability score of each class, and L denotes the value of the loss function, the sum running over the training samples.
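For illustration only, a minimal sketch of this classification step follows, assuming PyTorch and cross-entropy as the loss L; the class count and dimensions are illustrative:

```python
import torch
import torch.nn as nn

# A minimal sketch of the fine-grained classification step; the number of
# classes and the input dimension are illustrative.
num_classes = 50
classifier = nn.Linear(256, num_classes)            # computes W_c s + b_c

s = torch.randn(2, 256)                             # stand-in question text vectors
logits = classifier(s)
p = torch.softmax(logits, dim=-1)                   # p = softmax(W_c s + b_c)

# Training minimizes the negative log-likelihood of the true labels j:
labels = torch.tensor([3, 17])                      # placeholder gold labels
loss = nn.functional.cross_entropy(logits, labels)  # L = -sum log p_j
```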
From the above description of the fine-grained question classification method of the present invention, it can be seen that the invention extracts semantic units from the original question text and expands them using a vector space model, thereby improving the comprehensiveness and accuracy with which the question's semantics are captured. Word encoding and word focusing are performed on the expanded question text, and the resulting vectorized question text is classified at fine granularity with a softmax classifier, improving the accuracy of fine-grained question classification. The steps of the fine-grained question classification method of the invention are completed essentially by machine, which automates this process as far as possible while guaranteeing a certain accuracy, improves the efficiency of fine-grained question classification, and can save labor cost to the greatest extent.
The present invention also provides a fine-grained question classification system 100. Fig. 4 is a module diagram of the fine-grained question classification system of the present invention. As shown in Fig. 4, the system 100 includes a semantic unit extraction module 101, a semantic unit expansion module 102, a word encoder module 103, a word focusing module 104, and a softmax classifier module 105.
The semantic unit extraction module 101 extracts semantic units from the original question text. The semantic unit expansion module 102 expands the semantic units using a vector space model to obtain an expanded question text. The word encoder module 103 encodes the expanded question text using a bidirectional long short-term memory network to obtain an encoded question text. The word focusing module 104 applies an attention mechanism to the encoded question text to obtain a question text vector. The softmax classifier module 105 performs fine-grained classification on the question text vector using a softmax classifier.
The fine-grained question classification system 100 implements each step of the above fine-grained question classification method; for the specific steps, refer to the above description of the method, which is not repeated here.
The present invention also provides a fine-grained question classification device 200. Fig. 5 is a schematic diagram of the fine-grained question classification device of the present invention. As shown in Fig. 5, the device 200 includes a storage unit 201 and an execution unit 202. The storage unit 201 stores a program which, when executed by the processing unit, implements each step of the fine-grained question classification method. The execution unit 202 executes the program in the storage unit 201. For the specific steps, refer to the above description of the fine-grained question classification method, which is not repeated here.
It should be noted that the present invention is an application of fine-grained question classification technology. The implementation of the invention may involve knowledge of semantic learning with neural networks and the application of the Word2Vec model. After carefully reading the application documents and accurately understanding the principles and objectives of the invention, and in combination with well-known prior art, a person skilled in the art can implement the invention with the software programming skills and neural network knowledge he or she has mastered. The specific algorithms and steps of the aforementioned Word2Vec model, long short-term memory network, and softmax classifier can refer to structures and methods in the prior art, all within the scope referred to by the present application files, and the applicant will not enumerate them.
The above are only specific application examples of the present invention and do not limit its scope of protection in any way. In addition to the above embodiments, the present invention may have other embodiments. All technical solutions formed by equivalent substitution or equivalent transformation fall within the scope of protection of the present invention.

Claims (8)

1. A fine-grained question classification method for classifying an original question text, characterized by comprising the following steps:
a semantic unit extraction step: extracting semantic units from the original question text;
a semantic unit expansion step: expanding said semantic units using a vector space model to obtain an expanded question text;
a word encoding step: encoding the expanded question text using a bidirectional long short-term memory network to obtain an encoded question text;
a word focusing step: applying an attention mechanism to the encoded question text to obtain a question text vector; and
a fine-grained classification step: classifying said question text vector using a softmax classifier.
2. The fine-grained question classification method of claim 1, characterized in that the semantic unit extraction step comprises:
traversing a dependency parse tree to find all noun-phrase nodes and verb-phrase nodes in the original question text as said semantic units.
3. The fine-grained question classification method of claim 1, characterized in that said semantic unit expansion step comprises:
a semantic unit vector generation step: converting said semantic units into a semantic unit vector V_u satisfying
V_u = V_u1 + V_u2 + … + V_um = {z_1, z_2, …, z_d}
where V_u1, V_u2, …, V_um denote the unit vectors corresponding to the phrase nodes in the semantic units, m is a natural number, z_1, z_2, …, z_d are the components of the vector, and d is a natural number;
a cosine similarity calculation step: calculating the cosine similarity between said semantic unit vector and the vectors of all words in a Word2Vec model; and
an expanded question text acquisition step: selecting the words or phrases corresponding to the largest cosine similarities as the expanded question text.
4. The fine-grained question classification method of claim 1, characterized in that, in said word encoding step, the bidirectional long short-term memory network satisfies:
→h_i = →LSTM(x_i), i ∈ [1, T]
←h_i = ←LSTM(x_i), i ∈ [T, 1]
h_i = [→h_i, ←h_i]
where x_i denotes a word vector, →LSTM denotes the forward LSTM unit, which reads the words from the 1st word to the T-th word, ←LSTM denotes the backward LSTM unit, which reads the words from the T-th word to the 1st word, →h_i denotes the output of the i-th forward hidden layer, and ←h_i denotes the output of the i-th backward hidden layer.
5. The fine-grained question classification method of claim 1, characterized in that said word focusing step obtains the question text vector using the formulas:
u_i = tanh(E_w h_i + b_w)
α_i = exp(u_iᵀ u_w) / Σ_i exp(u_iᵀ u_w)
s = Σ_i α_i h_i
where α_i denotes the normalized weight, s denotes the resulting question text vector, E_w and u_w denote randomly initialized weight parameters, b_w is a bias, u_i is the output of a single-layer perceptron, and h_i is the output of the hidden layer.
6. The fine-grained question classification method of claim 1, characterized in that, in said fine-grained classification step, the softmax classifier uses the formulas:
p = softmax(W_c s + b_c)
L = −Σ log p_j
where j denotes the true label of a question text vector s, W_c denotes a randomly initialized weight parameter, b_c is a bias, p gives the probability score of each class, and L denotes the value of the loss function, the sum running over the training samples.
7. A fine-grained question classification system for classifying an original question text, characterized by comprising:
a semantic unit extraction module, for extracting semantic units from the original question text;
a semantic unit expansion module, for expanding said semantic units using a vector space model to obtain an expanded question text;
a word encoder module, for encoding the expanded question text using a bidirectional long short-term memory network to obtain an encoded question text;
a word focusing module, for applying an attention mechanism to the encoded question text to obtain a question text vector; and
a softmax classifier module, for performing fine-grained classification on said question text vector using a softmax classifier.
8. A fine-grained question classification device for classifying an original question text, characterized by comprising:
a storage unit, for storing a program which, when executed by a processing unit, implements the steps of the fine-grained question classification method of any one of claims 1 to 6; and
an execution unit, for executing the program in the storage unit.
CN201710678652.4A 2017-08-10 2017-08-10 Fine-grained question classification method, system and device Pending CN109388706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710678652.4A CN109388706A (en) 2017-08-10 2017-08-10 Fine-grained question classification method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710678652.4A CN109388706A (en) 2017-08-10 2017-08-10 Fine-grained question classification method, system and device

Publications (1)

Publication Number Publication Date
CN109388706A (en) 2019-02-26

Family

ID=65414192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710678652.4A Pending CN109388706A (en) 2017-08-10 2017-08-10 Fine-grained question classification method, system and device

Country Status (1)

Country Link
CN (1) CN109388706A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059718A (en) * 2019-03-18 2019-07-26 国网浙江省电力有限公司信息通信分公司 Fine granularity detection method based on the more attention mechanism of multiclass
CN110146420A (en) * 2019-04-12 2019-08-20 中国石油大学(北京) A kind of glutenite granularity intelligent analysis system and method
CN110209824A (en) * 2019-06-13 2019-09-06 中国科学院自动化研究所 Text emotion analysis method based on built-up pattern, system, device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1804829A (en) * 2006-01-10 2006-07-19 西安交通大学 Semantic classification method for Chinese question
US20120253792A1 (en) * 2011-03-30 2012-10-04 Nec Laboratories America, Inc. Sentiment Classification Based on Supervised Latent N-Gram Analysis
CN104572892A (en) * 2014-12-24 2015-04-29 中国科学院自动化研究所 Text classification method based on cyclic convolution network
CN104991891A (en) * 2015-07-28 2015-10-21 北京大学 Short text feature extraction method
CN106126596A (en) * 2016-06-20 2016-11-16 中国科学院自动化研究所 A kind of answering method based on stratification memory network
CN106294684A (en) * 2016-08-06 2017-01-04 上海高欣计算机系统有限公司 The file classification method of term vector and terminal unit

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ZEYNEP AKATA et al.: "Evaluation of output embeddings for fine-grained image classification", Computer Vision and Pattern Recognition *
XI XUEFENG (奚雪峰) et al.: "A survey of deep learning for natural language processing" (面向自然语言处理的深度学习研究), Acta Automatica Sinica (自动化学报) *
XIONG HUIXIANG (熊回香): "Research on folksonomy for Web 3.0" (面向Web3.0的大众分类研究), China Doctoral Dissertations Full-text Database, Information Science and Technology *
XIE YUFEI (谢雨飞) et al.: "Fine-grained question classification based on semantic expansion and attention network" (基于语义扩展与注意力网络的问题细粒度分类), Computer Engineering (计算机工程) *
QIAN XIAODONG (钱晓东): "Research on data and text clustering and classification based on neural networks and other techniques" (基于神经网络等技术的数据与文本聚分类研究), China Master's and Doctoral Dissertations Full-text Database (Doctoral), Information Science and Technology *

Similar Documents

Publication Publication Date Title
CN108829719B Non-factoid question-answer selection method and system
Young et al. Augmenting end-to-end dialogue systems with commonsense knowledge
Huang et al. Deep sentiment representation based on CNN and LSTM
Fu et al. Zero-shot object recognition by semantic manifold distance
Ji et al. Representation learning for text-level discourse parsing
CN102662931B (en) Semantic role labeling method based on synergetic neural network
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
Sen et al. Approximate computing for long short term memory (LSTM) neural networks
Yan et al. Joint learning of response ranking and next utterance suggestion in human-computer conversation system
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN113743097B (en) Emotion triplet extraction method based on span sharing and grammar dependency relationship enhancement
CN112015868A (en) Question-answering method based on knowledge graph completion
CN112883193A (en) Training method, device and equipment of text classification model and readable medium
CN114722820A (en) Chinese entity relation extraction method based on gating mechanism and graph attention network
CN109388706A (en) A kind of problem fine grit classification method, system and device
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN113705196A (en) Chinese open information extraction method and device based on graph neural network
CN114036303A (en) Remote supervision relation extraction method based on double-granularity attention and confrontation training
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
CN114492451B (en) Text matching method, device, electronic equipment and computer readable storage medium
CN116127099A (en) Combined text enhanced table entity and type annotation method based on graph rolling network
CN114781380A (en) Chinese named entity recognition method, equipment and medium fusing multi-granularity information
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
Chen et al. Gaussian mixture embeddings for multiple word prototypes
Kang et al. A short texts matching method using shallow features and deep features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 2019-02-26