CN109033413B - Neural network-based demand document and service document matching method

Info

Publication number: CN109033413B
Application number: CN201810883232.4A
Authority: CN
Inventors: 邹祥文; 吴悦
Original assignee: Shanghai Federation Of Science And Technology Enterprises; University of Shanghai for Science and Technology
Current assignee: Shanghai Federation Of Science And Technology Enterprises; University of Shanghai for Science and Technology
Priority date: 2018-03-12
Filing date: 2018-08-06
Publication date: 2022-12-23
Anticipated expiration: 2038-08-06
Also published as: CN109033413A

Abstract

The invention relates to a neural network-based method for matching demand documents and service documents. The method extracts the relevant parts of a document according to the structure of the demand document and the service document, converts sentences into vectors by paragraph embedding, segments the text with a long short-term memory (LSTM) neural network, calculates the similarity between segmented texts with a convolutional neural network, and, once the similarities of all segmented texts are obtained, calculates their weighted average to obtain the final similarity of the demand document and the service document.

Description

Neural network-based demand document and service document matching method

Technical Field

The invention relates to the field of computer natural language processing, mainly aims at matching a demand document and a service document, and particularly relates to a neural network-based demand document and service document matching method.

Background

With the rapid development and popularization of the Internet, modern enterprise production has come to rely on technology-based collaboration between enterprises. To find partner enterprises, the demanding party writes a demand document describing its requirements, and the technical party writes a service document describing its technical capabilities. Connecting the two over the Internet speeds up the discovery of partner enterprises and reduces their time and labor costs.

An enterprise demand document contains the problems the enterprise needs solved and the indexes to be achieved when solving them. An enterprise service document contains an overview of the methods for solving the technical difficulties, experience with similar projects, the technical reserves for undertaking the project, related patents obtained, the research methods to be adopted, the main technical indexes to be achieved, and the project schedule. How to quickly find partners for enterprises through demand documents and service documents has become a new focus and a difficult problem.

The document matching method commonly used at present converts the text into a vector space model (VSM) and, on the basis of the term frequency-inverse document frequency (TF-IDF) model, calculates the similarity of two documents with a distance function: the smaller the distance, the more similar the two documents. This approach is insufficient here, because a demand document may contain several requirements that the partner enterprise must meet at the same time, while a service document may list as many as possible of the technical services the enterprise can currently provide; a match is correct only when the service document meets most or all of the requirements in the demand document.
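The conventional baseline described above can be sketched as follows. This is a minimal illustration only: the smoothed IDF formula, the whitespace tokenization, and the three toy documents are assumptions made for the example and are not taken from the source.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} TF-IDF vector per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption) so that shared terms keep a non-zero weight.
        vectors.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy documents, already tokenized by whitespace.
demand = "need corrosion resistant coating for steel pipe".split()
service = "we provide corrosion resistant coating technology".split()
other = "mobile app development and cloud hosting".split()

vecs = tfidf_vectors([demand, service, other])
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

Because the whole document is reduced to a single vector, this baseline cannot tell whether each individual requirement is covered, which is the gap the invention addresses.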

Disclosure of Invention

To overcome the shortcomings of existing methods in matching demand documents with service documents and to improve the matching accuracy, the invention provides a neural network-based method for matching demand documents and service documents.

To achieve this purpose, the invention adopts the following technical scheme:

Step 1: input a demand document and a service document as the documents to be matched, wherein the demand document comprises the problems the enterprise needs solved and the indexes to be achieved when solving them, and the service document comprises an overview of the methods for solving the technical difficulties, experience with similar projects, the technical reserves for undertaking the project, related patents obtained, the research methods to be adopted, the main technical indexes to be achieved, and the project schedule;

Step 2: determine from the document content whether the input document is a demand document or a service document;

Step 2.1: a document containing the problems the enterprise needs solved and the indexes to be achieved when solving them is a demand document; extract the problems-to-be-solved part and the indexes-to-be-achieved part;

Step 2.2: a document containing an overview of the methods for solving the technical difficulties, experience with similar projects, technical reserves for undertaking the project, related patents obtained, research methods to be adopted, main technical indexes to be achieved, and a project schedule is a service document; extract each of these parts;

Step 2.3: the final similarity of the demand document and the service document is calculated from the similarities between every extracted part of the demand document and every extracted part of the service document; the following steps take the problems-to-be-solved part of the demand document and the method-overview part of the service document as an example;

Step 3: apply paragraph embedding (PE) to the sentences in the problems-to-be-solved part of the demand document and in the method-overview part of the service document to obtain sentence vectors;

Step 4: determine the document segmentation points with a long short-term memory (LSTM) network;

Step 4.1: input the obtained sentence vectors into a trained LSTM network, and determine from the network output whether the previous sentence is a segmentation point;

Step 4.2: according to the segmentation points, divide each part into several text segments with different meanings; each segment of the problems-to-be-solved part of the demand document is a requirement, and each segment of the method-overview part of the service document is a method;

Step 5: construct the input of the similarity model according to the type of the processed document;

Step 5.1: if the document is a demand document, process all sentences of one requirement with the PE model to obtain sentence vectors that form one matrix, and take all sentence vectors of one method to form another matrix;

Step 5.2: if the document is a service document, process all sentences of one method with the PE model to obtain sentence vectors that form one matrix, and take all sentence vectors of one requirement to form another matrix;

Step 6: feed the two matrices into a trained convolutional neural network (CNN) to calculate their similarity; calculate the similarity for every pairing of a requirement with a method, and for each requirement take the maximum similarity as the final value of that requirement;

Step 7: take the weighted average of the similarity values to obtain the final similarity;

Step 7.1: after the final value of each requirement is obtained, take the weighted average as the final similarity value of the problems-to-be-solved part of the demand document;

Step 7.2: the above steps take the problems-to-be-solved part of the demand document and the method-overview part of the service document as an example; since the demand document comprises both the problems-to-be-solved part and the indexes-to-be-achieved part, the similarity for the indexes-to-be-achieved part is calculated in the same way, and the weighted average of the two parts is taken as the final similarity of the demand document and the service document;

Step 8: compare the final similarity with a preset threshold; if the final similarity is greater than the threshold, the two documents match, and if it is smaller, they do not match.
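Steps 6 to 8 can be summarized in formula form. The notation below is introduced only for illustration and does not appear in the original: $s_{ij}$ is the CNN similarity between requirement $i$ and method $j$, $w_i$ is the weight given to requirement $i$ (the source does not state how the weights are chosen), and $\tau$ is the preset threshold.

$$m_i = \max_{1 \le j \le M} s_{ij}, \qquad S = \frac{\sum_{i=1}^{N} w_i \, m_i}{\sum_{i=1}^{N} w_i}, \qquad \text{match} \iff S > \tau$$

Here $S$ is the similarity for one part of the demand document; the same weighted average is then applied across the two parts of the demand document before the threshold comparison.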

The segmentation point in step 4 is defined as follows: if the previous sentence and the next sentence of the document differ in meaning, the previous sentence is a segmentation point. The history information of the LSTM network is updated by the following formula:

$$C_t = 0 \quad (\text{when } h_{t-1} \to 1)$$

where $C_t$ is the history information of the LSTM network at time t, and $h_{t-1}$ is the output of the previous state.

When the history information is updated, $C_t$ is reset to 0 if the output obtained at the previous time step indicates a segmentation point; otherwise no processing is performed.

Compared with the prior art, the invention has the following obvious and prominent substantive features and notable technical progress: the demand document and the service document are segmented by a text segmentation method to obtain the specific requirements and services, and the degree of matching is calculated on the basis of these specific requirements and services, which solves the problem that a match must satisfy most or all of the requirements. The index information is constructed separately as a one-dimensional input matrix, which accounts for the influence of index information in the demand document and the service document on the matching result. After the similarity of each pair of segmented texts is obtained, cross matching is performed to find the best matching result, which removes the influence of users' different writing habits on the matching result.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 is a diagram of a convolution network of the similarity calculation model according to the present invention.

FIG. 3 is a diagram of convolution operations in the similarity calculation model according to the present invention.

FIG. 4 is a diagram of a similarity layer in the similarity calculation model according to the present invention.

FIG. 5 is a cross-matching diagram of the present invention.

Detailed Description

Example 1

The technical scheme of the invention is described clearly and completely below with reference to the accompanying drawings.

The invention provides a method for matching a demand document and a service document. The flow chart is shown in FIG. 1, and the specific implementation steps are as follows:

Step 2.3: the final similarity of the demand document and the service document is calculated from the similarities between every extracted part of the demand document and every extracted part of the service document; the following takes the problems-to-be-solved part of the demand document and the method-overview part of the service document as an example;

In the word embedding (WE) model, each word is mapped to a unique column in the matrix W, the index of the column being the position of the word in the vocabulary, and the word vectors are then concatenated to predict the next word in the sentence. Given a word sequence $w_1, w_2, w_3, \ldots, w_T$, the objective of the word embedding model is to maximize the average log probability, which is calculated as shown in formula (I):

where the probability p is the probability of correctly predicting the next word.

The prediction task is completed by a multi-class classifier, such as a softmax classifier, calculated as shown in formula (II):

where $y_i$ is the unnormalized log-probability for each output word i, calculated as shown in formula (III):

$$y = b + U h(w_{t-k}, \ldots, w_{t+k}; W) \qquad \text{(III)}$$

where U and b are the parameters of the softmax classifier, and h is constructed by concatenating or averaging the word vectors extracted from W.
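Formulas (I) and (II) are not reproduced in this text. Assuming they follow the usual word-embedding formulation with which formula (III) is consistent (an assumption, not confirmed by the source; $k$ is the context window half-width that appears in formula (III)), they would take the form:

$$\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k}) \qquad \text{(I)}$$

$$p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}} \qquad \text{(II)}$$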

The PE model is inspired by the WE model: paragraph embeddings can likewise be used to predict the next word in a sentence. Each paragraph is mapped to a unique column in matrix D, and each word is mapped to a unique column in matrix W. Compared with the WE model, the PE model differs only in formula (III): h is constructed by concatenating or averaging the vectors extracted from both W and D.
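The difference between the two models can be illustrated with a small sketch of formula (III) in its PE form. The averaging variant is shown; all numeric values, the vector size, and the function names are hypothetical and chosen only for the example.

```python
import math

def average(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def softmax(scores):
    """Normalize unnormalized log-probabilities into probabilities."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next_word(paragraph_vec, context_vecs, U, b):
    # PE variant of formula (III): h averages the paragraph vector from D
    # together with the context word vectors from W.
    h = average([paragraph_vec] + context_vecs)
    # y = b + U h gives one unnormalized score per vocabulary word.
    y = [b_i + sum(u * x for u, x in zip(row, h)) for row, b_i in zip(U, b)]
    return softmax(y)

# Toy values (hypothetical): 2-dimensional vectors and a 3-word vocabulary.
paragraph_vec = [0.2, -0.1]                 # one column of D
context_vecs = [[0.5, 0.3], [-0.4, 0.8]]    # columns of W for the context words
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # softmax weights
b = [0.0, 0.1, -0.1]                        # softmax biases
probs = predict_next_word(paragraph_vec, context_vecs, U, b)
```

Dropping `paragraph_vec` from the list passed to `average` gives the plain WE form of the same formula.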

Step 4.2: according to the segmentation points, divide each part into several text segments with different meanings; each segment of the problems-to-be-solved part of the demand document is a requirement, and each segment of the method-overview part of the service document is a method.

The LSTM network contains three gate structures: the forget gate, the input gate, and the output gate. Each gate has a different function, as follows:

Forget gate: the forget gate processes the stored history information. It takes the current input and the state at the previous time step, passes them through a sigmoid layer, and outputs a value in the range [0,1]; the history information is discarded when the output is 0 and retained when the output is 1. Whether to discard is determined by formula (IV):

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \qquad \text{(IV)}$$

where σ denotes the sigmoid function, x is the vector obtained from the PE model, h is the output, which is used to determine whether a sentence is a segmentation point, W is the connection weight of the LSTM network, b is the bias, and f determines the information to be forgotten at time t.

Input gate: the input gate decides how to update the history information. By operating on the input information, it determines whether the current input is written into the history information. It comprises a sigmoid layer and a tanh layer: the sigmoid layer decides which values will be updated, and the tanh layer generates new candidate values. The calculation is shown in formula (V) and formula (VI):

$$i_t = \sigma(W_f [h_{t-1}, x_t] + b_i) \qquad \text{(V)}$$

where i determines the values to be updated, h is the output, which is used to determine whether a sentence is a segmentation point, W is the connection weight of the LSTM network, b is the bias, and $C_t$ is the history information of the LSTM network at time t.

The retained history information is obtained from the forget gate and the candidate update values from the input gate, and the history information is then updated by formula (VII):

where C is the history information of the LSTM network, f is calculated by formula (IV) and determines the information to be forgotten at time t, and i is calculated by formula (V) and determines the values to be updated.
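Formulas (VI) and (VII) are not reproduced in this text. Assuming they follow the standard LSTM formulation, which matches the surrounding description of a tanh candidate layer and a forget-and-update step, they would read as follows. The symbols $\tilde{C}_t$, $W_C$ and $b_C$ are introduced here for illustration and do not appear in the source.

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \qquad \text{(VI)}$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \qquad \text{(VII)}$$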

Output gate: the output gate controls the information output by the current node. A sigmoid layer decides which information is output, and its result is multiplied by the output of a tanh layer to obtain the final output. The calculation is shown in formula (VIII) and formula (IX):

$$o_t = \sigma(W_f [h_{t-1}, x_t] + b_o) \qquad \text{(VIII)}$$

$$h_t = o_t * \tanh(C_t) \qquad \text{(IX)}$$

where σ denotes the sigmoid function, x is the vector obtained from the PE model, h is the output, which is used to determine whether a sentence is a segmentation point, W is the connection weight of the LSTM network, and b is the bias.

After the LSTM output is obtained, it is passed through a sigmoid layer so that it lies in [0,1]. An output close to 1 indicates that the previous node is a segmentation point; otherwise the previous node is a continuation point.
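Turning the per-sentence outputs into text segments might look like the sketch below. It assumes the scores have already been aligned so that `scores[i]` refers to `sentences[i]`, and it uses a cut-off of 0.5 for "close to 1"; both the alignment and the cut-off are assumptions, since the source does not quantify them.

```python
def split_at_segmentation_points(sentences, scores, threshold=0.5):
    """Group sentences into segments.

    scores[i] is the sigmoid output in [0, 1] for sentences[i]; a value
    above `threshold` marks sentences[i] as the last sentence of a segment.
    """
    segments, current = [], []
    for sentence, score in zip(sentences, scores):
        current.append(sentence)
        if score > threshold:
            segments.append(current)
            current = []
    if current:  # trailing sentences without a closing segmentation point
        segments.append(current)
    return segments

# Hypothetical sentences and scores.
sentences = ["s1", "s2", "s3", "s4", "s5"]
scores = [0.1, 0.9, 0.2, 0.8, 0.3]
segments = split_at_segmentation_points(sentences, scores)
```

Each resulting segment corresponds to one requirement (for a demand document) or one method (for a service document).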

When the history information is updated, formula (X) is applied: $C_t$ is reset to 0 if the output obtained at the previous time step indicates a segmentation point, and no processing is performed otherwise.

$$C_t = 0 \quad (\text{when } h_{t-1} \to 1) \qquad \text{(X)}$$

In formulas (IV) to (X), σ denotes the sigmoid function, x the input, h the output, which is used to determine whether a sentence is a segmentation point, W the connection weights, and b the bias.
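A single LSTM step with the reset rule of formula (X) can be sketched as follows. Several points are assumptions rather than statements of the source: each gate is given its own weight matrix (the text writes $W_f$ in every gate formula), the candidate and cell update use the standard LSTM form, and the reset is interpreted as clearing the carried history before the update. The parameter names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params, was_split):
    """One LSTM step in the spirit of formulas (IV) to (X).

    was_split is True when the previous output marked a segmentation point,
    in which case the history is cleared before the update (formula (X)).
    """
    if was_split:
        c_prev = [0.0] * len(c_prev)
    hx = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], hx)]    # forget gate
    i = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], hx)]    # input gate
    g = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], hx)]  # candidates
    o = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], hx)]    # output gate
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Hypothetical parameters: hidden size 1, input size 2.
params = {
    "W_f": [[0.5, 0.1, -0.2]], "b_f": [0.0],
    "W_i": [[0.3, -0.1, 0.4]], "b_i": [0.0],
    "W_c": [[0.2, 0.6, -0.3]], "b_c": [0.0],
    "W_o": [[-0.4, 0.2, 0.1]], "b_o": [0.0],
}
x, h_prev, c_prev = [1.0, 0.5], [0.2], [5.0]
h_reset, c_reset = lstm_step(x, h_prev, c_prev, params, was_split=True)
h_keep, c_keep = lstm_step(x, h_prev, c_prev, params, was_split=False)
```

With the reset, the cell state after a segmentation point depends only on the new sentence, so information from the previous segment does not leak into the next one.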

The CNN model of the invention is shown in FIG. 2.

A CNN is generally divided into an input layer, an output layer, convolutional layers, and a fully connected layer.

Input layer: the input layer acts directly on the input matrix, which in the invention is the matrix of sentence vectors of a segmented text produced by the PE model.

Output layer: the output after CNN processing is the similarity of the two text segments.

Convolutional layer: performs feature extraction on the input and consists of a convolution layer and a sampling layer. The convolution layer extracts features from the input data, and different convolution kernels extract different features. The sampling layer reduces the amount of data while keeping the important information, which speeds up processing; sampling neurons in the same layer share weights. The sampling layer uses a sigmoid function as its activation function, which gives it displacement invariance.

After the segmented text is obtained, word segmentation is applied to it and the words with high TF-IDF values are kept. Because requirements and services often contain index information, all numbers in the text are also kept. Each sentence of the segmented text is then processed with the PE model and the resulting sentence vectors are combined into a matrix, with the retained numbers used as a separate single dimension.

The matrices formed from the demand document and the service document first pass through their respective convolutional layers, are then connected to a similarity layer, and the similarity is finally output through a fully connected layer.

To capture as many features of the text as possible, two kinds of convolution operation are used, as shown in FIG. 3. On the left, the window size is 2 and each window covers the entire word vector. On the right, the window size is also 2, but each window covers only one dimension of the word vector at a time. In the actual experiments, three window sizes are adopted: 1, dim/2, and infinity.

In the sampling layer, maximum pooling, minimum pooling, and mean pooling are applied to the results of each of the two kinds of convolution; different pooling methods capture different information, which benefits subsequent processing.
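The two kinds of convolution and the three pooling methods can be illustrated with a small sketch. It uses a window size of 2 as in FIG. 3 and omits bias terms and activation functions for brevity; the kernel values, the input matrix, and the function names are hypothetical.

```python
def holistic_conv(matrix, kernel):
    """Slide a window over rows; each window spans every embedding dimension."""
    ws = len(kernel)
    return [
        sum(k * v
            for k_row, m_row in zip(kernel, matrix[s:s + ws])
            for k, v in zip(k_row, m_row))
        for s in range(len(matrix) - ws + 1)
    ]

def per_dimension_conv(matrix, kernel):
    """Slide a window over rows separately for each embedding dimension."""
    ws = len(kernel)
    columns = list(zip(*matrix))
    return [
        [sum(k * v for k, v in zip(kernel, col[s:s + ws]))
         for s in range(len(col) - ws + 1)]
        for col in columns
    ]

def pool(values):
    """Return (max, min, mean) pooling of a sequence of convolution outputs."""
    return max(values), min(values), sum(values) / len(values)

# Three sentence vectors of dimension 2 (hypothetical values).
sentences = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
whole = holistic_conv(sentences, [[1.0, 1.0], [1.0, 1.0]])
per_dim = per_dimension_conv(sentences, [1.0, -1.0])
pooled_whole = pool(whole)
pooled_per_dim = [pool(row) for row in per_dim]
```

The holistic filter yields one value per window position, while the per-dimension filter yields one sequence per embedding dimension; pooling then summarizes each sequence in three different ways.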

The similarity layer uses cosine similarity. Since the three pooling methods (maximum, minimum, and mean) are used, similarities are calculated between the corresponding pooled results of the two texts. Because each result after sampling is a matrix, similarities are calculated between each row of one matrix and the rows of the other, and between each column of one matrix and the columns of the other, as shown in FIG. 4. For example, suppose the result after maximum pooling is an N × M matrix: the similarity is calculated between the i-th row of this matrix and the rows of the other matrix, and between the j-th column of this matrix and the columns of the other matrix. These results together form the similarity layer; in addition, the similarity between the two whole matrices is calculated once.
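A possible reading of the similarity layer is sketched below. It assumes that every row of one matrix is compared with every row of the other (and likewise for columns) and that the two matrices have the same shape; the function names and the example matrices are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_features(A, B):
    """Row-wise, column-wise, and whole-matrix cosine similarities of A and B."""
    rows = [cosine(ra, rb) for ra in A for rb in B]
    cols = [cosine(ca, cb) for ca in zip(*A) for cb in zip(*B)]
    flat_a = [x for row in A for x in row]
    flat_b = [x for row in B for x in row]
    return rows, cols, cosine(flat_a, flat_b)

# Two pooled 2 x 2 matrices (hypothetical values).
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[2.0, 0.0], [0.0, 3.0]]
rows, cols, whole = similarity_features(A, B)
```

The three groups of values would then be passed together to the fully connected layer.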

Fully connected layer: a fully connected layer is used before the output, the same as the fully connected layer in a conventional neural network.

Step 7.1: after the final value of each requirement is obtained, take the weighted average as the final similarity value of the problems-to-be-solved part of the demand document;

Step 7.2: the above steps take the problems-to-be-solved part of the demand document and the method-overview part of the service document as an example; since the demand document comprises both the problems-to-be-solved part and the indexes-to-be-achieved part, the similarity for the indexes-to-be-achieved part is calculated in the same way, and the weighted average of the two parts is taken as the final similarity of the demand document and the service document;

The final similarity is calculated from the segmentation results of each part of the demand document and of each part of the service document, as shown in FIG. 5. The demand document has only two parts: the problems to be solved and the indexes to be achieved when solving them. After text segmentation, the segments of each part are cross-matched with the segments of each part of the service document to obtain similarities, and the maximum of the cross-matching results is taken as the matching value. For example, suppose the problems-to-be-solved part of the demand document is segmented into N segments and the method-overview part of the service document into M segments. Cross calculation yields N × M matching results; for each segment of the demand document, the maximum similarity is taken as its final value, and once the final values of all segments are obtained, their weighted average is taken as the final similarity value of the problems-to-be-solved part. In the same way, the best cross-matching result is found between the problems-to-be-solved part of the demand document and every part of the service document.
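The cross-matching calculation can be sketched as follows. Equal weights are used when none are supplied, and the scores and the threshold of 0.6 are made up for the example; the source does not specify how the weights or the threshold are chosen.

```python
def part_similarity(similarity, weights=None):
    """Combine an N x M table of requirement-versus-method similarities.

    similarity[i][j] is the score between requirement i and method j.
    """
    best = [max(row) for row in similarity]  # best-matching method per requirement
    if weights is None:
        weights = [1.0] * len(best)          # equal weights (an assumption)
    return sum(w * b for w, b in zip(weights, best)) / sum(weights)

def is_match(final_similarity, threshold):
    """Step 8: the documents match when the similarity exceeds the threshold."""
    return final_similarity > threshold

# N = 2 requirements, M = 3 methods (hypothetical scores).
scores = [[0.2, 0.9, 0.4],
          [0.7, 0.1, 0.3]]
problem_part = part_similarity(scores)
matched = is_match(problem_part, threshold=0.6)
```

Taking the row-wise maximum means a requirement counts as covered if any single method fits it well, regardless of the order in which the author of the service document listed the methods.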

The above steps take the problems-to-be-solved part of the demand document and the method-overview part of the service document as an example. Since the demand document comprises both the problems-to-be-solved part and the indexes-to-be-achieved part, the similarity for the indexes-to-be-achieved part is calculated in the same way, and the weighted average of the two parts is taken as the final similarity of the demand document and the service document.

The segmentation point in step 4 is defined as follows: if the previous sentence and the next sentence of the document differ in meaning, the previous sentence is a segmentation point. The history information of the LSTM network is updated by the following formula:

$$C_t = 0 \quad (\text{when } h_{t-1} \to 1)$$

where $C_t$ is the history information of the LSTM network at time t, and $h_{t-1}$ is the output of the previous state, which is used to determine whether a segmentation point has been reached.

Claims

1. A neural network-based method for matching demand documents and service documents, characterized by comprising the following operation steps:

Step 2.1: a document containing the problems the enterprise needs solved and the indexes to be achieved when solving them is a demand document; extract the problems-to-be-solved part and the indexes-to-be-achieved part;

Step 3: apply paragraph embedding to the sentences in the problems-to-be-solved part of the demand document and in the method-overview part of the service document to obtain sentence vectors;

Step 4: determine the document segmentation points with a long short-term memory network;

Step 4.1: input the obtained sentence vectors into a trained long short-term memory network, and determine from the network output whether the previous sentence is a segmentation point;

when determining whether the previous sentence is a segmentation point, the output of the long short-term memory network is passed through a sigmoid layer so that it lies in [0,1]; an output close to 1 indicates that the previous node is a segmentation point, otherwise the previous node is a continuation point;

Step 4.2: according to the segmentation points, divide each part into several text segments with different meanings; each segment of the problems-to-be-solved part of the demand document is a requirement, and each segment of the method-overview part of the service document is a method;

Step 5.2: if the document is a service document, process all sentences of one method with the PE model to obtain sentence vectors that form one matrix, and take all sentence vectors of one requirement to form another matrix;

Step 6: feed the two matrices into a trained convolutional neural network to calculate their similarity; calculate the similarity for every pairing of a requirement with a method, and for each requirement take the maximum similarity as the final value of that requirement;

two kinds of convolution operation are used, with three window sizes, namely 1, dim/2, and infinity; during sampling, maximum pooling, minimum pooling, and mean pooling are applied to the results of each of the two kinds of convolution, the different pooling methods collecting different information; the result after sampling is a matrix, and for each matrix the similarity is calculated between each of its rows and the rows of the other matrix and between each of its columns and the columns of the other matrix; the similarity between the two whole matrices is also calculated once, and because the row-wise and column-wise similarity results outnumber the whole-matrix result, the whole-matrix similarity result is replicated so that the three carry equal weight; finally, a fully connected layer is connected to output the similarity result;

2. The neural network-based demand document and service document matching method according to claim 1, wherein:

the segmentation point in step 4 is defined as follows: if the previous sentence and the next sentence of the document differ in meaning, the previous sentence is a segmentation point; the history information of the long short-term memory network is updated by the following formula:

$$C_t = 0 \quad (\text{when } h_{t-1} \to 1)$$

where $C_t$ is the history information of the long short-term memory network at time t, and $h_{t-1}$ is the output of the previous state, which is used to determine whether a segmentation point has been reached.