CN109033413A - Neural-network-based method for matching requirement documents and service documents - Google Patents

Neural-network-based method for matching requirement documents and service documents (Download PDF)

Info

Publication number
CN109033413A
CN109033413A (application CN201810883232.4A)
Authority
CN
China
Prior art keywords
documents
similarity
service
requirement documents
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810883232.4A
Other languages
Chinese (zh)
Other versions
CN109033413B (en)
Inventor
邹祥文 (Zou Xiangwen)
吴悦 (Wu Yue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Federation Of Scientific And Technological Enterprises
University of Shanghai for Science and Technology
Original Assignee
Shanghai Federation Of Scientific And Technological Enterprises
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Federation Of Scientific And Technological Enterprises and University of Shanghai for Science and Technology
Publication of CN109033413A
Application granted
Publication of CN109033413B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs


Abstract

The present invention relates to a neural-network-based method for matching requirement documents and service documents. Exploiting the structure of requirement and service documents, the method extracts the relevant document sections, converts them to vectors with paragraph embedding, segments the text with a long short-term memory neural network, computes similarity on the segments with a convolutional neural network, and takes a weighted average of the similarities of all segments to obtain the overall similarity between the requirement document and the service document.

Description

Neural-network-based method for matching requirement documents and service documents
Technical field
The present invention relates to the field of computer natural language processing, in particular to the matching of requirement documents and service documents, and specifically to a neural-network-based method for matching requirement documents and service documents.
Background technique
With the rapid development and spread of the Internet, modern enterprise production has come to rely on technical cooperation. To find a cooperating enterprise, the demanding party writes a requirement document describing its needs and the technical party writes a service document describing its technical capability; connecting the two over the Internet accelerates the discovery of partner enterprises and reduces the time and labor cost to both.
An enterprise requirement document contains the problem the enterprise needs solved and the targets that a solution must reach. An enterprise service document contains an overview of the method for solving the problem, experience on similar projects, the technical reserves available for undertaking the project, related patents obtained, the proposed research method, the main technical indicators to be achieved, and the project schedule. How to quickly find cooperation partners for enterprises by matching requirement documents with service documents has become a current hot and difficult problem.
A commonly used document-matching method converts each text into a vector space model (VSM) representation and, on the basis of the term frequency-inverse document frequency (TF-IDF) model, computes the similarity of two documents with a distance function: the smaller the distance, the more similar the documents. However, a requirement document may contain several demands that a cooperating enterprise must satisfy simultaneously, while a service document may enumerate the technical services the enterprise can currently provide at best; a service document is a correct match only when it satisfies most or all of the demands in the requirement document, and existing matching methods fall short in this respect.
Summary of the invention
To overcome the shortcomings of current methods for matching requirement documents with service documents and to improve matching accuracy, the present invention proposes a neural-network-based method for matching requirement documents and service documents. Exploiting the particular structure of requirement and service documents, the method extracts the document content, performs matching at a finer granularity, and finally combines the fine-grained results into an overall matching score.
To achieve the above objectives, the present invention adopts the following technical solution:
Step 1: Input one requirement document and one service document as the documents to be matched. The requirement document contains the problem the enterprise needs solved and the targets a solution must reach; the service document contains an overview of the method for solving the problem, experience on similar projects, the technical reserves available for undertaking the project, related patents obtained, the proposed research method, the main technical indicators to be achieved, and the project schedule;
Step 2: Judge from the document content whether the input document is a requirement document or a service document;
Step 2.1: A document that contains the sections "problem the enterprise needs solved" and "targets a solution must reach" is a requirement document; extract these two sections;
Step 2.2: A document that contains the sections "overview of the method for solving the problem", "experience on similar projects", "technical reserves available for undertaking the project", "related patents obtained", "proposed research method", "main technical indicators to be achieved", and "project schedule" is a service document; extract these sections;
Step 2.3: The final similarity of the requirement document and the service document is computed over all extracted requirement-document sections and all extracted service-document sections; in the following, the "problem to be solved" section of the requirement document and the "overview of the solution method" section of the service document are taken as the running example;
Step 3: Apply paragraph embedding (PE) to the sentences of the requirement document's "problem to be solved" section and the service document's "overview of the solution method" section, obtaining sentence vectors;
Step 4: Detect document cut points with a long short-term memory (LSTM) network;
Step 4.1: Feed the sentence vectors into a trained LSTM network; from the network's output, judge whether the previous sentence is a cut point;
Step 4.2: Use the cut points to split each section into several passages with distinct meanings: the passages of the requirement document's problem section are the individual demands, and the passages of the service document's solution section are the individual methods.
Step 5: Construct the similarity-model input according to the processing result type;
Step 5.1: For a requirement document, pass all sentences of one demand through the PE model and assemble the resulting sentence vectors into one matrix, while assembling all sentence vectors of one method into another matrix;
Step 5.2: For a service document, pass all sentences of one method through the PE model and assemble the resulting sentence vectors into one matrix, while assembling all sentence vectors of one demand into another matrix;
Step 6: Feed the two matrices into a trained convolutional neural network (CNN) to compute their similarity. Each demand is crossed with each method to compute a similarity, and for each demand the maximum similarity is taken as that demand's final value;
Step 7: Obtain the final similarity as a weighted average of the similarity values;
Step 7.1: After the final value of every demand has been obtained, take the weighted average as the final similarity value of the requirement document's "problem to be solved" section;
Step 7.2: The steps above use the requirement document's "problem to be solved" section and the service document's "overview of the solution method" section as the example; since the requirement document also contains the "targets a solution must reach" section, the same procedure is applied to obtain that section's similarity, and the weighted average of the two section similarities is the final similarity of the requirement document and the service document;
Step 8: Compare the final similarity with a preset threshold: if it exceeds the threshold the two documents match; otherwise they do not.
Here the cut point in step 4 means that a sentence of the document and the following sentence differ in meaning; the earlier sentence is then a cut point. The history-update rule of the LSTM network is:

C_t = 0 (when h_{t-1} → 1)

where C_t is the history (cell state) of the LSTM network at time t and h_{t-1} is the output of the previous state. When updating the history, if the output obtained at the previous time indicates a cut point, C_t is reset to 0; otherwise it is left unchanged.
Compared with the prior art, the present invention has the following prominent substantive features and significant advances. The text-segmentation method splits the requirement document and the service document into concrete demands and services, and the matching degree is computed on these, which solves the problem that most or all demands must be satisfied when matching requirement and service documents. Indicator figures appearing in the documents are structured into a separate dimension appended to the input matrix, which removes the influence of indicator information on the matching result. After the similarity of every pair of segments has been obtained, cross-matching is performed and the best match is taken, which removes the influence of differing user writing habits on the matching result.
Detailed description of the invention
Fig. 1 is the flow chart of the present invention.
Fig. 2 shows the convolutional network used for similarity calculation.
Fig. 3 shows the convolution operations in the similarity calculation.
Fig. 4 shows the similarity layer in the similarity calculation.
Fig. 5 shows the cross-matching.
Specific embodiment
Embodiment 1
The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings.
The present invention proposes a method for matching requirement documents and service documents; the specific flow chart is shown in Fig. 1, and the concrete steps are as follows:
Step 1: Input one requirement document and one service document as the documents to be matched. The requirement document contains the problem the enterprise needs solved and the targets a solution must reach; the service document contains an overview of the method for solving the problem, experience on similar projects, the technical reserves available for undertaking the project, related patents obtained, the proposed research method, the main technical indicators to be achieved, and the project schedule;
Step 2: Judge from the document content whether the input document is a requirement document or a service document;
Step 2.1: A document that contains the sections "problem the enterprise needs solved" and "targets a solution must reach" is a requirement document; extract these two sections;
Step 2.2: A document that contains the sections "overview of the method for solving the problem", "experience on similar projects", "technical reserves available for undertaking the project", "related patents obtained", "proposed research method", "main technical indicators to be achieved", and "project schedule" is a service document; extract these sections;
Step 2.3: The final similarity of the requirement document and the service document is computed over all extracted requirement-document sections and all extracted service-document sections; in the following, the "problem to be solved" section of the requirement document and the "overview of the solution method" section of the service document are taken as the running example;
Step 3: Apply paragraph embedding (PE) to the sentences of the requirement document's "problem to be solved" section and the service document's "overview of the solution method" section, obtaining sentence vectors;
In the word embedding (WE) model, each word is mapped to a unique column of a matrix W, the column index being the word's position in the vocabulary; the word vectors are then concatenated and used to predict the next word in the sentence. Given a word sequence w_1, w_2, w_3, ..., w_T, the objective of the word embedding model is to maximize the average log probability, computed as in formula (I):

1/T Σ_{t=k}^{T-k} log p(w_t | w_{t-k}, ..., w_{t+k})   (I)

where the probability p is the probability of correctly predicting the next word.
The prediction task is carried out by a multi-class classifier such as softmax, computed as in formula (II):

p(w_t | w_{t-k}, ..., w_{t+k}) = exp(y_{w_t}) / Σ_i exp(y_i)   (II)

For each output word i, y_i is the unnormalized log probability, computed as in formula (III):

y = b + U·h(w_{t-k}, ..., w_{t+k}; W)   (III)

where U and b are the parameters of the softmax classifier, and h is built by concatenating or averaging the word vectors extracted from W.
The PE model is inspired by WE: paragraph embeddings are likewise used to predict the next word in a sentence. Each paragraph is mapped to a unique column of a matrix D, and each word to a unique column of a matrix W. Compared with the WE model, the only change PE makes is in formula (III): h is built by concatenating or averaging vectors extracted from both W and D instead of from W alone.
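As a concrete illustration of formulas (II) and (III), the following minimal numpy sketch computes the softmax prediction from concatenated context-word vectors and a paragraph vector. The dimensions are toy values and the parameters are random and untrained; all variable names are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

V, P, d = 8, 3, 4             # vocabulary size, paragraph count, embedding dim
W = rng.normal(size=(d, V))   # word-embedding matrix: one column per word
D = rng.normal(size=(d, P))   # paragraph-embedding matrix: one column per paragraph
k = 2                         # context window: 2k context words

U = rng.normal(size=(V, d * (2 * k + 1)))  # softmax weights over [2k words + paragraph]
b = np.zeros(V)

def predict_next(context_ids, par_id):
    """Formula (III): h concatenates the context word vectors and the
    paragraph vector; formula (II): softmax over the vocabulary."""
    h = np.concatenate([W[:, i] for i in context_ids] + [D[:, par_id]])
    y = b + U @ h                 # unnormalized log probabilities
    p = np.exp(y - y.max())
    return p / p.sum()            # softmax probabilities

probs = predict_next([1, 3, 5, 2], par_id=0)
```

With trained parameters, the paragraph column of D acts as the "memory" that distinguishes otherwise identical contexts in different paragraphs.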
Step 4: Detect document cut points with a long short-term memory (LSTM) network;
Step 4.1: Feed the sentence vectors into a trained LSTM network; from the network's output, judge whether the previous sentence is a cut point;
Step 4.2: Use the cut points to split each section into several passages with distinct meanings: the passages of the requirement document's problem section are the individual demands, and the passages of the service document's solution section are the individual methods.
The LSTM network contains three kinds of gates: the forget gate, the input gate, and the output gate. Their roles are as follows:
Forget gate: the forget gate decides how to treat the stored history. It operates on the current input and the previous state and passes the result through a sigmoid layer with output range [0, 1]: an output of 0 discards the history, an output of 1 retains it. Whether to discard is judged with formula (IV):

f_t = σ(W_f·[h_{t-1}, x_t] + b_f)   (IV)

where σ is the sigmoid function, x is the vector obtained from the PE model, h is the output used to judge whether a sentence is a cut point, W_f is an LSTM connection-weight matrix, b_f is a bias, and f_t determines the information to be forgotten at time t.
Input gate: the input gate decides how the history is updated, i.e. whether the current input is written into the history. It contains a sigmoid layer and a tanh layer: the sigmoid layer decides what to update and the tanh layer generates the new candidate values, as in formulas (V) and (VI):

i_t = σ(W_i·[h_{t-1}, x_t] + b_i)   (V)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)   (VI)

where i_t determines the values to update, C̃_t is the candidate history, h is the output used to judge cut points, W_i and W_C are LSTM connection-weight matrices, b_i and b_C are biases, and C_t is the history of the LSTM network at time t.
The history from the forget gate and the update candidate from the input gate are combined with formula (VII):

C_t = f_t * C_{t-1} + i_t * C̃_t   (VII)

where C is the history of the LSTM network, f_t is given by formula (IV) and determines the information forgotten at time t, and i_t is given by formula (V) and determines the values updated.
Output gate: the output gate controls the information emitted by the current node. A sigmoid layer first decides which information to output, and the result is then multiplied with the tanh of the cell state, as in formulas (VIII) and (IX):

o_t = σ(W_o·[h_{t-1}, x_t] + b_o)   (VIII)
h_t = o_t * tanh(C_t)   (IX)

where σ is the sigmoid function, x is the vector obtained from the PE model, h is the output used to judge whether a sentence is a cut point, W_o is an LSTM connection-weight matrix, and b_o is a bias.
The LSTM output is passed through a further sigmoid layer so that it lies in [0, 1]; an output close to 1 indicates that the previous node is a cut point, otherwise it is a continuation point.
When the history is updated with formula (X), if the output obtained at the previous time indicates a cut point, C_t is reset to 0; otherwise it is left unchanged:

C_t = 0 (when h_{t-1} → 1)   (X)

In formulas (IV) through (X), σ denotes the sigmoid function, x the input, h the output used to judge cut points, W the connection weights, and b the biases.
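The gate formulas (IV) through (IX) and the cut-point reset (X) can be sketched as a single numpy step function. This is a toy illustration with random, untrained weights, not the patented implementation; all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 4, 3                     # sentence-vector and hidden sizes (toy)
rng = np.random.default_rng(1)
# One weight matrix and one bias per gate, acting on [h_{t-1}, x_t].
Wf, Wi, Wc, Wo = (rng.normal(size=(d_h, d_h + d_in)) for _ in range(4))
bf, bi, bc, bo = (np.zeros(d_h) for _ in range(4))

def lstm_step(x_t, h_prev, C_prev, prev_was_cut):
    """One step of formulas (IV)-(IX), plus the history reset of
    formula (X) when the previous sentence was judged a cut point."""
    if prev_was_cut:                 # formula (X): discard history at a cut
        C_prev = np.zeros_like(C_prev)
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)         # (IV)  forget gate
    i = sigmoid(Wi @ z + bi)         # (V)   input gate
    C_tilde = np.tanh(Wc @ z + bc)   # (VI)  candidate history
    C = f * C_prev + i * C_tilde     # (VII) history update
    o = sigmoid(Wo @ z + bo)         # (VIII) output gate
    h = o * np.tanh(C)               # (IX)  output
    return h, C

h, C = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):        # five toy sentence vectors
    h, C = lstm_step(x, h, C, prev_was_cut=False)
```

In the patented method the final h would additionally pass through a sigmoid layer whose output near 1 marks the previous sentence as a cut point.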
Step 5: Construct the similarity-model input according to the processing result type;
Step 5.1: For a requirement document, pass all sentences of one demand through the PE model and assemble the resulting sentence vectors into one matrix, while assembling all sentence vectors of one method into another matrix;
Step 5.2: For a service document, pass all sentences of one method through the PE model and assemble the resulting sentence vectors into one matrix, while assembling all sentence vectors of one demand into another matrix;
Step 6: Feed the two matrices into a trained convolutional neural network (CNN) to compute their similarity. Each demand is crossed with each method to compute a similarity, and for each demand the maximum similarity is taken as that demand's final value;
The CNN model of the present invention is shown in Fig. 2.
The CNN consists of an input layer, an output layer, convolutional layers, and a fully connected layer.
Input layer: the input layer acts directly on the input matrices, which in the present invention are the sentence matrices of the segmented text produced by the PE model.
Output layer: the output after CNN processing, which in the present invention is the similarity of the two text segments.
Convolutional layer: extracts features from the input and consists of convolution and sampling (pooling) layers. The convolution layers extract features from the input data, different convolution kernels extracting different features. The sampling layers reduce the data while retaining the important information, which speeds up processing; sampling neurons within the same layer share weights. The sampling layers use the sigmoid function as the activation function, which gives them shift invariance.
After the segments have been obtained, the text is tokenized and the words with high TF-IDF scores are kept. Because demands and services frequently contain indicator figures, all numbers are also kept; after deduplication the numbers form a separate dimension. Each sentence of the segmented text is then processed with the PE model, and the resulting sentence vectors are assembled into a matrix.
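A pure-Python sketch of this filtering step, keeping the highest-scoring TF-IDF words plus every numeric token, might look as follows. The tokenization, scoring details, and cutoff are assumptions for illustration, not taken from the patent.

```python
import math
import re
from collections import Counter

def tfidf_filter(docs, keep_top=3):
    """Keep, per document, the keep_top words with the highest TF-IDF
    score, plus every numeric token (indicator figures), deduplicated."""
    tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    kept = []
    for toks in tokenized:
        tf = Counter(toks)
        score = {w: (tf[w] / len(toks)) * math.log(n / df[w]) for w in tf}
        top = sorted(score, key=score.get, reverse=True)[:keep_top]
        numbers = list(dict.fromkeys(w for w in toks if w.isdigit()))
        kept.append(top + [x for x in numbers if x not in top])
    return kept

docs = ["reduce latency below 20 ms",
        "improve accuracy to 95 percent",
        "reduce cost"]
result = tfidf_filter(docs)
```

The numeric tokens survive regardless of their TF-IDF score, matching the patent's observation that indicator figures carry matching-relevant information.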
The matrices formed from the requirement document and the service document first pass through their respective convolutional layers; after convolution they are joined by a similarity layer, and the similarity is finally output through a fully connected layer.
To extract as many text features as possible, two kinds of convolution are used, as shown in Fig. 3: on the left, a window of size 2 covers entire word vectors; on the right, the window size is also 2 but only one dimension of the word vectors is covered at a time. In the actual experiments, window sizes of 1, Dim/2, and ∞ are used.
In the sampling layer, max pooling, min pooling, and mean pooling are applied to each of the two convolution results; different pooling methods collect different information, which facilitates the subsequent processing.
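The three pooling operations over a toy convolution output can be sketched in a few lines of numpy (shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
conv_out = rng.normal(size=(6, 5))   # toy convolution output: 6 positions x 5 filters

# Pool over the position axis with all three methods, then stack the
# three pooled vectors into a 3 x 5 matrix for the similarity layer.
pooled = np.stack([conv_out.max(axis=0),
                   conv_out.min(axis=0),
                   conv_out.mean(axis=0)])
```

Each row of `pooled` summarizes the same filters from a different angle, which is why the similarity layer compares all three against each other.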
The similarity measure used by the similarity layer is cosine similarity. Since three pooling methods (max, min, mean) were used, the pooled results are compared with one another: because each sampled result is a matrix, every row of one matrix is compared with every row of the other, and every column with every column, as shown in Fig. 4. For example, suppose the max-pooled result is an N × M matrix: row i of one matrix is compared with every row of the other, and column j with every column, and the collected results form the similarity layer. The similarity of one entire matrix with the other is also computed; since the row-wise and column-wise comparisons produce many more values than the whole-matrix comparison, the whole-matrix similarity is replicated so that the three groups carry equal weight. Finally a fully connected layer outputs the similarity result.
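Assuming the pooled results are small matrices, the row-wise, column-wise, and whole-matrix cosine comparisons of the similarity layer can be sketched as follows; the exact replication count for equal weighting is an illustrative choice, not specified in the patent.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_features(A, B):
    """Cosine similarities of every row pair, every column pair, and of
    the flattened matrices; the whole-matrix value is replicated so the
    three groups contribute comparable weight."""
    rows = [cos(ra, rb) for ra in A for rb in B]
    cols = [cos(ca, cb) for ca in A.T for cb in B.T]
    whole = [cos(A.ravel(), B.ravel())] * ((len(rows) + len(cols)) // 2)
    return np.array(rows + cols + whole)

rng = np.random.default_rng(3)
A, B = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
feats = similarity_features(A, B)   # feature vector fed to the dense layer
```

The resulting feature vector is what the final fully connected layer would consume to emit a single similarity score.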
Fully connected layer: as in a traditional neural network; the present invention uses one fully connected layer before the output.
Step 7: Obtain the final similarity as a weighted average of the similarity values;
Step 7.1: After the final value of every demand has been obtained, take the weighted average as the final similarity value of the requirement document's "problem to be solved" section;
Step 7.2: The steps above use the requirement document's "problem to be solved" section and the service document's "overview of the solution method" section as the example; since the requirement document also contains the "targets a solution must reach" section, the same procedure is applied to obtain that section's similarity, and the weighted average of the two section similarities is the final similarity of the requirement document and the service document;
The final similarity is computed on the segmentation results of every requirement-document section against every service-document section, as shown in Fig. 5. The requirement document has only two sections, the problem to be solved and the targets a solution must reach, so after each section has been segmented its segments are crossed with the segments of every service-document section to compute similarities, and the maximum of the crossed results is taken as that section's matching value. For example, if the "problem to be solved" section of the requirement document is split into N segments and the solution-overview section of the service document into M segments, the cross-computation yields N × M matching results; for each requirement-document segment the maximum similarity is taken as its final value, and after the final values of all segments have been obtained their weighted average is the final similarity of the "problem to be solved" section. Likewise, the best crossed result is sought between the "problem to be solved" section of the requirement document and every section of the service document.
The above steps take the requirement document's "problem to be solved" section and the service document's "overview of the solution method" section as the example; the requirement document also contains the "targets a solution must reach" section, whose similarity is obtained with the same procedure, and the weighted average of the two section similarities is the final similarity of the requirement document and the service document.
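The N × M cross-matching and per-demand maximum can be sketched as follows, with cosine similarity standing in for the trained CNN and a uniform weighted average assumed:

```python
import numpy as np

def section_similarity(demand_segs, method_segs, sim):
    """Cross every demand segment with every method segment, take each
    demand's best match, and average (uniform weights assumed here)."""
    scores = np.array([[sim(d, m) for m in method_segs] for d in demand_segs])
    best_per_demand = scores.max(axis=1)   # N x M results -> N maxima
    return best_per_demand.mean()

# Toy segments as vectors; cosine as a stand-in for the CNN similarity.
cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
rng = np.random.default_rng(4)
demands = rng.normal(size=(4, 6))          # N = 4 demand segments
methods = rng.normal(size=(3, 6))          # M = 3 method segments
final = section_similarity(demands, methods, cos)
```

Taking the maximum per demand (rather than per method) is what enforces that every demand find some matching service, the core requirement stated in the background section.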
Step 8: Compare the final similarity with a preset threshold: if it exceeds the threshold the two documents match; otherwise they do not.
Here the cut point in step 4 means that a sentence of the document and the following sentence differ in meaning; the earlier sentence is then a cut point. The history-update rule of the LSTM network is:

C_t = 0 (when h_{t-1} → 1)

where C_t is the history of the LSTM network at time t and h_{t-1} is the output of the previous state, used to judge whether a sentence is a cut point. When updating the history, if the output obtained at the previous time indicates a cut point, C_t is reset to 0; otherwise it is left unchanged.
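Under the assumption that the segmenter and the similarity model are available as callables, the overall flow of steps 2 through 8 can be sketched end to end. The `seg` and `sim` stand-ins below (split on ";", word-overlap score) are toy substitutes for the LSTM segmenter and the CNN similarity; every name here is illustrative.

```python
def match_documents(req_sections, srv_sections, seg, sim, threshold=0.5):
    """Steps 2-8 in miniature: segment each extracted section, cross-match
    demand segments with method segments, keep each demand's best score,
    average within a section, then average over sections and threshold."""
    section_scores = []
    for req in req_sections:                 # e.g. problem, targets
        best = []
        for d in seg(req):                   # each demand segment
            scores = [sim(d, m) for srv in srv_sections for m in seg(srv)]
            best.append(max(scores))         # best matching service segment
        section_scores.append(sum(best) / len(best))
    final = sum(section_scores) / len(section_scores)
    return final, final > threshold

# Toy stand-ins: split on ';' and score by word overlap (Jaccard).
seg = lambda text: [s.strip() for s in text.split(";") if s.strip()]
sim = lambda a, b: (len(set(a.split()) & set(b.split()))
                    / max(len(set(a.split()) | set(b.split())), 1))

req = ["reduce latency; improve accuracy", "latency below 20 ms"]
srv = ["we reduce latency with caching; we improve accuracy with tuning"]
score, matched = match_documents(req, srv, seg, sim)
```

Swapping in the trained PE + LSTM segmenter for `seg`, the trained CNN for `sim`, and learned section weights for the uniform averages recovers the patented pipeline.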

Claims (2)

1. A neural-network-based method for matching requirement documents and service documents, characterized in that the operating steps are as follows:
Step 1: as document to be matched, requirement documents include that enterprise needs to solve for one requirement documents of input and a service documents Certainly the problem of and index to be achieved is needed when solving the problems, such as this, service documents then include to summarize the side for solving the problem technology Method, the experience for solving similar item, accept technological reserve, related patents obtained that this project has, it is quasi- take grind Study carefully method, the technical indicator mainly realized and project schedule plan;
Step 2: judging that input document is requirement documents or service documents according to document content;
Step 2.1: it is then demand that indexing section to be achieved is needed when including enterprise's problem to be solved and solving the problems, such as this Document extracts enterprise's problem to be solved and needs indexing section to be achieved when solving the problems, such as this;
Step 2.2: having including summarizing the method for solving the problem technology, the experience for solving similar item, accepting this project Technological reserve, related patents obtained, the quasi- research method taken, the technical indicator mainly realized and project schedule plan Part is then service documents, extracts the method for solving the problem technology of summarizing, the experience for solving similar item, accepts this project Technological reserve, related patents obtained, the quasi- research method taken, the technical indicator and project process mainly realized having Plan part;
Step 2.3: the similarity of final requirement documents and service documents will extract part and all clothes to all requirement documents Business document extracts part and calculates similarity, and the general introduction of the problem to be solved and service documents of requirement documents is taken to solve to be somebody's turn to do below For the method for problem technology;
Step 3: the method that the general introduction of problem to be solved part and service documents to requirement documents solves the problem technology Sentence in part carries out paragraph insertion processing, obtains sentence vector;
Step 4: detect document segmentation points with a long short-term memory (LSTM) network;
Step 4.1: feed the obtained sentence vectors into a trained LSTM network, and judge from its output whether the preceding sentence is a segmentation point;
Step 4.2: split each part at the segmentation points into several passages of distinct meaning; each passage of the requirement document's problem part is one demand, and each passage of the service document's solution part is one method.
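Step 4.2 can be sketched as follows, assuming the trained LSTM's per-sentence segmentation decisions are already available as a boolean list (the LSTM itself is not reproduced here):

```python
def split_at_cut_points(sentences, is_cut_point):
    """Split a list of sentences into passages of distinct meaning.

    is_cut_point[i] is the (assumed) trained LSTM's decision that
    sentence i is a segmentation point, i.e. the passage ends after it.
    """
    passages, current = [], []
    for sent, cut in zip(sentences, is_cut_point):
        current.append(sent)
        if cut:              # passage boundary: close the current passage
            passages.append(current)
            current = []
    if current:              # trailing sentences form the last passage
        passages.append(current)
    return passages

demands = split_at_cut_points(["s1", "s2", "s3", "s4"],
                              [False, True, False, False])
# demands == [["s1", "s2"], ["s3", "s4"]]
```

Applied to the requirement document's problem part this yields the demands; applied to the service document's solution part it yields the methods.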
Step 5: construct the input of the similarity model according to the type of the processed document;
Step 5.1: for a requirement document, pass all sentences of one demand through the PE (paragraph embedding) model and stack the resulting sentence vectors into one matrix, while the sentence vectors of one method form the other matrix;
Step 5.2: for a service document, pass all sentences of one method through the PE model and stack the resulting sentence vectors into one matrix, while the sentence vectors of one demand form the other matrix;
Step 6: feed the two matrices into a trained convolutional neural network to compute their similarity; every demand is paired with every method, and for each demand the maximum similarity over all methods is taken as that demand's final value;
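The cross-pairing and per-demand maximum of step 6 can be sketched as below. The trained CNN's matrix-to-matrix similarity is stood in for by cosine similarity between single summary vectors, purely so the pairing logic is runnable; the patent's actual scorer is the CNN:

```python
def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def demand_scores(demand_vecs, method_vecs):
    # Pair every demand with every method; for each demand keep the
    # maximum similarity over all methods as that demand's final value.
    return [max(cosine(d, m) for m in method_vecs) for d in demand_vecs]

scores = demand_scores([[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 0.0], [0.5, 0.5]])
# scores[0] == 1.0: the first demand matches the first method exactly
```

Taking the maximum rather than the mean means a demand is considered covered as long as at least one method in the service document addresses it.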
Step 7: obtain the final similarity by a weighted average of the similarity values;
Step 7.1: after the final value of each demand is obtained, take the weighted average of these values as the final similarity of the problem-to-be-solved part of the requirement document;
Step 7.2: the steps above take the problem-to-be-solved part of the requirement document and the technical-method overview of the service document as the example; since a requirement document also contains the indicators to be achieved in solving the problem, the same procedure yields the similarity of the indicator part, and the weighted average of the two part similarities is the final similarity between the requirement document and the service document;
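Step 7's combination is a plain weighted mean. The claim does not fix the weights, so the equal weights below are an illustrative choice only:

```python
def weighted_average(values, weights):
    # Weighted mean of per-demand (or per-part) similarity values.
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Combine the two parts of the requirement document; 0.8 and 0.6 are
# made-up part similarities and 0.5/0.5 are assumed weights:
problem_sim = 0.8    # problem-to-be-solved part vs. technical-method overview
indicator_sim = 0.6  # indicator part, computed by the same procedure
final_sim = weighted_average([problem_sim, indicator_sim], [0.5, 0.5])
# final_sim is approximately 0.7
```

The same function serves both step 7.1 (averaging demand values within a part) and step 7.2 (averaging the two part similarities).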
Step 8: compare the final similarity with a preset threshold; if it is greater than the threshold the two documents match, and if it is less than the threshold they do not match.
2. The neural-network-based requirement-document and service-document matching method according to claim 1, characterized in that:
a segmentation point in step 4 means that the meaning of a sentence differs from that of the sentence following it; the earlier of the two sentences is then a segmentation point. The history-update formula of the LSTM network is:
Ct = 0 (when ht-1 → 1)
where Ct is the history information of the LSTM network at time t, and ht-1 is the output of the previous state, which judges whether a segmentation point occurred;
when updating the history, if the output at the previous time step indicates a segmentation point, Ct is reset to 0; otherwise no reset is performed.
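Claim 2's reset rule amounts to a single conditional in the cell-state update. In the sketch below, c_candidate stands in for whatever value the ordinary LSTM update would produce (that update itself is not part of the claim's formula):

```python
def update_history(c_candidate, h_prev_is_cut):
    # Claim-2 rule: if the previous output flagged a segmentation point
    # (h_{t-1} -> 1), the history C_t is reset to 0 so the next passage
    # starts with no carried-over context; otherwise the ordinary LSTM
    # history update (represented by c_candidate) applies unchanged.
    return 0.0 if h_prev_is_cut else c_candidate

c_t = update_history(0.9, h_prev_is_cut=True)
# c_t == 0.0: the history resets at a segmentation point
```

Resetting the cell state at topic boundaries keeps sentences of one passage from influencing the segmentation decisions of the next.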
CN201810883232.4A 2018-03-12 2018-08-06 Neural network-based demand document and service document matching method Active CN109033413B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810200624 2018-03-12
CN2018102006246 2018-03-12

Publications (2)

Publication Number Publication Date
CN109033413A true CN109033413A (en) 2018-12-18
CN109033413B CN109033413B (en) 2022-12-23

Family

ID=64649584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810883232.4A Active CN109033413B (en) 2018-03-12 2018-08-06 Neural network-based demand document and service document matching method

Country Status (1)

Country Link
CN (1) CN109033413B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595409A (en) * 2018-03-16 2018-09-28 上海大学 Neural-network-based requirement document and service document matching method
WO2022061833A1 (en) * 2020-09-27 2022-03-31 西门子股份公司 Text similarity determination method and apparatus and industrial diagnosis method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502985A (en) * 2016-10-20 2017-03-15 清华大学 A neural network modeling method and device for title generation
CN106528528A (en) * 2016-10-18 2017-03-22 哈尔滨工业大学深圳研究生院 A text sentiment analysis method and device
CN107133202A (en) * 2017-06-01 2017-09-05 北京百度网讯科技有限公司 Artificial-intelligence-based text verification method and device
CN107169035A (en) * 2017-04-19 2017-09-15 华南理工大学 A text classification method combining long short-term memory networks and convolutional neural networks
CN107291871A (en) * 2017-06-15 2017-10-24 北京百度网讯科技有限公司 Artificial-intelligence-based matching degree assessment method, device and medium for multi-domain information
CN107679234A (en) * 2017-10-24 2018-02-09 上海携程国际旅行社有限公司 Customer service information providing method and device, electronic device, and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIANGWEN ZOU: "Require- documents and provide-documents matching algorithm based on topic model", 《2016 INTERNATIONAL CONFERENCE ON AUDIO, LANGUAGE AND IMAGE PROCESSING》 *
尹庆宇 (Yin Qingyu): "基于长短期记忆循环神经网络的对话文本主题分割" (Dialogue text topic segmentation based on a long short-term memory recurrent neural network), 《哈工大SCIR》 (HIT SCIR) *


Also Published As

Publication number Publication date
CN109033413B (en) 2022-12-23

Similar Documents

Publication Publication Date Title
CN108595409A (en) Neural-network-based requirement document and service document matching method
CN110609897B (en) Multi-category Chinese text classification method integrating global and local features
CN110287481B (en) Named entity corpus labeling training system
CN108984526A (en) Deep-learning-based document topic vector extraction method
CN111241294B (en) Relation extraction method using a graph convolutional network based on dependency parsing and keywords
CN111444726A (en) Method and device for extracting Chinese semantic information with a long short-term memory network based on a bidirectional lattice structure
CN109325398A (en) A face attribute analysis method based on transfer learning
CN108875809A (en) Biomedical entity relation classification method combining an attention mechanism and neural networks
CN107423442A (en) Application recommendation method and system based on user-profile behavior analysis, storage medium and computer device
Wan et al. Auxiliary demographic information assisted age estimation with cascaded structure
CN110287323B (en) Target-oriented emotion classification method
CN110502753A (en) A semantically-enhanced deep learning sentiment analysis model and its analysis method
CN108427665A (en) An automatic text generation method based on LSTM-type RNN models
CN108874783A (en) Power information operation and maintenance knowledge model construction method
CN110188175A (en) Question-answer pair extraction method, system and storage medium based on the BiLSTM-CRF model
CN111046178B (en) Text sequence generation method and system
CN110580287A (en) Emotion classification method based on transfer learning and ON-LSTM
CN114091460A (en) Multitask Chinese named entity recognition method
Qi et al. Personalized sketch-based image retrieval by convolutional neural network and deep transfer learning
CN112883722B (en) Distributed text summarization method based on cloud data center
Yu et al. Research and implementation of CNN based on TensorFlow
Wang et al. Improvement of MNIST image recognition based on CNN
CN113886562A (en) AI resume screening method, system, device and storage medium
CN109033413A (en) Neural-network-based requirement document and service document matching method
CN114662456A (en) Ancient poem generation method for images based on a Faster R-CNN detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant