CN109710744B - Data matching method, device, equipment and storage medium - Google Patents


Info

Publication number
CN109710744B
CN109710744B CN201811620836.6A
Authority
CN
China
Prior art keywords
semantic vector
sentence
question
target
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811620836.6A
Other languages
Chinese (zh)
Other versions
CN109710744A (en)
Inventor
吴飞
王硕
汪鸿翔
方四安
徐承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Ustc Iflytek Co ltd
Original Assignee
Hefei Ustc Iflytek Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Ustc Iflytek Co ltd filed Critical Hefei Ustc Iflytek Co ltd
Priority to CN201811620836.6A priority Critical patent/CN109710744B/en
Publication of CN109710744A publication Critical patent/CN109710744A/en
Application granted granted Critical
Publication of CN109710744B publication Critical patent/CN109710744B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a data matching method, apparatus, device and storage medium. The data matching method comprises: obtaining a target question sentence; determining a first semantic vector that corresponds to the target question sentence and contains its deep semantic information, and a second semantic vector that corresponds to the target question sentence and contains its dependency syntax relations; and determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector. The data matching method provided by the application can accurately match an answer to the target question sentence.

Description

Data matching method, device, equipment and storage medium
Technical Field
The present application relates to the field of intelligent question answering technologies, and in particular, to a data matching method, apparatus, device, and storage medium.
Background
Driven by the development of deep learning algorithms in the field of natural language processing, intelligent question answering has become a popular application: a system automatically analyzes and understands user input and returns a corresponding answer. Many intelligent question-answering products already exist on the market. For example, in intelligent customer service, a machine can answer simple, repetitive questions by building a customer-service knowledge base; in vehicle-mounted control, functions such as navigation interaction and in-vehicle system control can be realized through intelligent voice; in intelligent chatting, an algorithm can simulate human conversation, construct an intelligent chat partner, and set personalized answers. In short, intelligent question answering has broad development space in many industries and plays an increasingly important role.
Influenced by this rapid development, intelligent question-answering systems for specific fields have emerged, and many fields now expect question-answering systems that can be customized for them. The key to realizing a customized intelligent question-answering system is to fully understand the question data in the specified field so as to accurately match the answers corresponding to that question data. However, current question-data matching schemes cannot fully understand the question data, so the answers they match are inaccurate or irrelevant; that is, current schemes do not meet the requirements of a customized intelligent question-answering system.
Disclosure of Invention
In view of this, the present application provides a data matching method to solve the problem that existing question-data matching schemes cannot fully understand the question data and therefore match inaccurate or irrelevant answers. The technical scheme is as follows:
a method of data matching, comprising:
obtaining a target question sentence;
determining a first semantic vector corresponding to the target question statement and containing deep semantic information of the target question statement, and determining a second semantic vector corresponding to the target question statement and containing a dependency syntax relationship of the target question statement;
determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence;
wherein the process of determining the first semantic vector comprises:
determining the first semantic vector through a pre-established semantic vector determination model, wherein the semantic vector determination model comprises a basic mapping module and a residual mapping module; the first semantic vector is determined by a basic semantic vector output by the basic mapping module and a residual semantic vector output by the residual mapping module, the basic semantic vector contains basic semantic information of the target question sentence, and the residual semantic vector contains the semantic information lost by the basic semantic vector.
Optionally, the semantic vector determination model is obtained by training based on training question sentences in a training sentence set, and the training process of the semantic vector determination model includes:
obtaining original training question sentences from the training sentence set and generating training question sentences, wherein the generated training question sentences are generated based on the original training question sentences;
inputting the original training question sentence and the generated training question sentence into a basic mapping module of the semantic vector determination model to obtain basic semantic vectors corresponding to the original training question sentence and the generated training question sentence respectively;
inputting the basic semantic vectors corresponding to the original training question sentence and the generated training question sentence into a residual mapping module of the semantic vector determination model to obtain residual semantic vectors corresponding to the original training question sentence and the generated training question sentence;
inputting the basic semantic vector and the residual semantic vector corresponding to the original training question sentence, and the basic semantic vector and the residual semantic vector corresponding to the generated training question sentence, into a semantic vector determination module of the semantic vector determination model, to obtain the semantic vectors respectively corresponding to the original training question sentence and the generated training question sentence;
and updating parameters of the semantic vector determination model based on semantic vectors respectively corresponding to the original training question sentences and the generated training question sentences and a preset loss function.
Optionally, the process of generating the training question sentence based on the original training question sentence includes:
and inputting the original training question sentence into a pre-established sentence generation model, and obtaining a generated training question sentence, output by the sentence generation model, that is semantically related or close to the original training question sentence.
Optionally, the sentence generation model is obtained by training on original training question sentences, and the sentence generation model includes a generation module and an adversarial discrimination module;
the training process of the statement generation model comprises the following steps:
inputting an original training question sentence into a generation module of the sentence generation model, and obtaining a generated training question sentence output by the generation module;
evaluating, through the adversarial discrimination module, the generated training question sentences output by the generation module to obtain an evaluation result;
updating parameters of the generation module based on the evaluation result.
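The adversarial training loop in the three steps above can be sketched as follows. This is a toy, hypothetical setup: sentences are stand-in embedding vectors, the adversarial discrimination module is a logistic scorer, and the generation module is a linear map updated by gradient ascent on the discriminator's evaluation. None of these concrete choices come from the patent, which leaves the module architectures open.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in (assumption): sentences are fixed-size embedding vectors.
DIM = 8

def discriminator(x, w):
    """Adversarial discrimination module: probability that x looks 'real'."""
    return 1.0 / (1.0 + np.exp(-w @ x))

def generator(z, W):
    """Generation module: maps a seed vector to a sentence embedding."""
    return W @ z

def generator_step(z, W, w, lr=0.1):
    """Update generator params so the discriminator scores its output higher."""
    x = generator(z, W)
    d = discriminator(x, w)
    # Gradient of log D(G(z)) with respect to W is (1 - d) * outer(w, z).
    grad_W = (1.0 - d) * np.outer(w, z)
    return W + lr * grad_W  # gradient ascent on the evaluation result

W = rng.normal(size=(DIM, DIM)) * 0.1
w = rng.normal(size=DIM)
z = rng.normal(size=DIM)

before = discriminator(generator(z, W), w)
for _ in range(50):
    W = generator_step(z, W, w)
after = discriminator(generator(z, W), w)
print(before < after)  # the generator learns to satisfy the fixed discriminator
```

In the patent both modules would be trained; here the discriminator is frozen purely to keep the update step for the generation module visible.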
Optionally, the determining a second semantic vector corresponding to the target question statement includes:
removing stop words from the target question sentences to obtain sentences from which the stop words are removed;
determining word vectors corresponding to all words in the sentence without the stop words, and determining a plurality of dependency syntax relations of the sentence without the stop words;
determining vectors corresponding to the plurality of dependency syntax relations respectively based on the word vectors corresponding to the words and the plurality of dependency syntax relations;
and weighting and summing the vectors corresponding to the plurality of dependency syntax relations respectively, wherein the vector obtained by weighting and summing is used as the second semantic vector corresponding to the target question sentence, and the weight value corresponding to each dependency syntax relation is determined based on the importance of each dependency syntax relation in a sentence structure.
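The weighted-sum step above can be sketched in numpy as follows. The word vectors, dependency triples, relation-importance weights, and the choice of averaging head and dependent vectors to form a per-relation vector are all illustrative assumptions; the claim leaves these concrete choices open.

```python
import numpy as np

# Hypothetical inputs: word vectors for a stop-word-filtered sentence and
# its dependency relations as (head, dependent, relation_type) triples.
word_vecs = {
    "china": np.array([1.0, 0.0, 0.0]),
    "air_force": np.array([0.0, 1.0, 0.0]),
    "history": np.array([0.0, 0.0, 1.0]),
}
relations = [
    ("history", "air_force", "nmod"),     # noun modifier
    ("air_force", "china", "compound"),
]
# Assumed relation-importance weights (the patent leaves the scheme open).
rel_weight = {"nmod": 0.7, "compound": 0.3}

def relation_vector(head, dep):
    """One simple choice: average the head and dependent word vectors."""
    return (word_vecs[head] + word_vecs[dep]) / 2.0

# Second semantic vector: weighted sum of the per-relation vectors.
second_vec = sum(
    rel_weight[r] * relation_vector(h, d) for h, d, r in relations
)
print(second_vec)
```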
Optionally, the determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence includes:
determining a target semantic vector corresponding to the target question statement through the first semantic vector and the second semantic vector corresponding to the target question statement;
determining a question sentence with the highest similarity to the target question sentence from a question sentence set based on the target semantic vector corresponding to the target question sentence;
and determining the answer sentence corresponding to the question sentence with the highest similarity to the target question sentence as the target answer sentence corresponding to the target question sentence.
Optionally, the determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence includes:
for any question statement in a question statement set, determining a first similarity between a target question statement and the question statement based on the first semantic vector corresponding to the target question statement, and determining a second similarity between the target question statement and the question statement based on a second semantic vector corresponding to the target question statement; determining a target similarity corresponding to the question based on the first similarity and the second similarity to obtain a target similarity corresponding to each question statement in the question statement set;
acquiring the maximum similarity among the target similarities corresponding to all question sentences in the question sentence set;
and taking the answer sentence corresponding to the question sentence corresponding to the maximum similarity as the target answer sentence corresponding to the target question sentence.
A data matching apparatus, comprising: the system comprises a question sentence acquisition module, a first semantic vector determination module, a second semantic vector determination module and an answer sentence determination module;
the question sentence acquisition module is used for acquiring a target question sentence;
the first semantic vector determining module is used for determining a first semantic vector which corresponds to the target question statement and contains deep semantic information of the target question statement;
the second semantic vector determining module is configured to determine a second semantic vector corresponding to the target question statement and including a dependency syntax relationship of the target question statement;
the answer sentence determining module is configured to determine a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence;
the first semantic vector determining module, when determining the first semantic vector, is specifically configured to determine the first semantic vector through a pre-established semantic vector determination model, where the semantic vector determination model comprises a basic mapping module and a residual mapping module; the first semantic vector is determined by a basic semantic vector output by the basic mapping module and a residual semantic vector output by the residual mapping module, the basic semantic vector contains basic semantic information of the target question sentence, and the residual semantic vector contains the semantic information lost by the basic semantic vector.
A data matching device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement each step of the data matching method.
A readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the data matching method.
According to the above technical scheme, after obtaining the target question sentence, the data matching method, apparatus, device and storage medium provided by the application obtain a first semantic vector that corresponds to the target question sentence and contains its deep semantic information, and a second semantic vector that corresponds to the target question sentence and contains its dependency syntax relations, and then determine the target answer sentence corresponding to the target question sentence based on these two semantic vectors. Because the first semantic vector contains deep semantic information of the target question sentence and the second semantic vector contains its dependency syntax relations, the two vectors represent the target question sentence well, and an accurate answer can be matched to the target question sentence based on them.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of a data matching method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a residual network provided in an embodiment of the present application;
fig. 3 is a schematic diagram of determining a semantic vector based on CNN according to an embodiment of the present disclosure;
fig. 4 and fig. 5 are schematic diagrams of a training process of a semantic vector determination model provided in an embodiment of the present application;
FIG. 6 is a diagram illustrating reinforcement learning according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a generative adversarial model provided by an embodiment of the present application;
fig. 8 is a schematic flowchart illustrating a process of determining a second semantic vector corresponding to a target question statement and including a dependency syntax relationship of the target question statement in the data matching method according to the embodiment of the present application;
FIG. 9 is a diagram illustrating dependency syntax for an example statement provided by an embodiment of the present application;
fig. 10 is a schematic structural diagram of a data matching apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a data matching device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The current intelligent question-answering technology is mainly a CNN-based semantic matching method, whose implementation process is roughly as follows: collect question data, sort out the corresponding answers and generate a training set; from the training set, generate a vocabulary using a word-segmentation tool and a stop-word tool, then train the vocabulary with a vectorization method to obtain a vector model; label the training set into classes according to specified rules, convert the vocabulary into vectors, train the labeled data with the obtained dictionary to get a vector model of the training set, train that vector model with a CNN, and finally obtain a classifier. For unknown question data, the input is vectorized through the training-set vector model and sent to the trained classifier to obtain a predicted class label, and a corresponding answer is generated according to the obtained label.
In the process of realizing the invention, the inventors found that the prior scheme has the following problems. First, when the data size is small, classifier training easily overfits, which ultimately affects the generalization ability of the classifier, so the robustness of the model is poor. Second, the accuracy of classification based on a single model is insufficient. Two kinds of classifiers are commonly used at present, one based on word-vector matching and the other based on deep learning, and both have shortcomings. Specifically, the word-vector matching method matches well at the word surface level but cannot extract deep word-meaning features; for example, the similarity score between "the whole process of the establishment of the air force in China?" and "the development history of the air force in China?" is not high under word-vector matching. The deep-learning matching method extracts deep semantic features well, but for sentences with a very simple structure and few words its matching effect is not as good as that of the word-vector method.
In view of the above problems, the present inventors have conducted intensive studies and finally have proposed a solution, and the data matching method provided by the present application is described by the following embodiments.
Referring to fig. 1, a schematic flow chart of a data matching method provided in an embodiment of the present application is shown, where the method may include:
step S101: and acquiring a target question statement.
Step S102: and determining a first semantic vector corresponding to the target question sentence and containing deep semantic information of the target question sentence, and determining a second semantic vector corresponding to the target question sentence and containing the dependency syntax relation of the target question sentence.
The specific implementation process of determining the first semantic vector and the second semantic vector corresponding to the target question statement may refer to the description of the following embodiments.
Step S103: and determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence.
The implementation manner of determining the target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence is various:
in one possible implementation, the determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence may include: determining a target semantic vector corresponding to the target question sentence through a first semantic vector and a second semantic vector corresponding to the target question sentence; determining a question sentence with the highest similarity to the target question sentence from the question sentence set based on the target semantic vector corresponding to the target question sentence; and determining the answer sentence corresponding to the question sentence with the highest similarity to the target question sentence as the target answer sentence corresponding to the target question sentence.
There are various ways to determine the target semantic vector corresponding to the target question statement from the first semantic vector and the second semantic vector. In one possible implementation, the first semantic vector and the second semantic vector corresponding to the target question statement may be directly summed, and the summed vector used as the target semantic vector. In another possible implementation, the first semantic vector and the second semantic vector may be weighted and summed, and the resulting vector used as the target semantic vector. The weight values corresponding to the first semantic vector and the second semantic vector may be determined based on the specific application scenario.
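The first implementation manner can be sketched as follows. Cosine similarity, the weight eta, and the toy precomputed question set are all assumptions for illustration; the patent does not fix the similarity measure or the weights.

```python
import numpy as np

def target_semantic_vector(first_vec, second_vec, eta=0.5):
    """Weighted sum of the two semantic vectors; eta is an assumed
    application-dependent weight (the patent leaves its value open)."""
    return eta * first_vec + (1.0 - eta) * second_vec

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy question set: precomputed target semantic vectors for known questions,
# each paired with its answer sentence.
question_set = {
    "q1": np.array([1.0, 0.0]),
    "q2": np.array([0.6, 0.8]),
}
answers = {"q1": "answer 1", "q2": "answer 2"}

# Fuse the two vectors of the target question, then pick the most similar
# known question and return its answer.
target = target_semantic_vector(np.array([0.5, 0.9]), np.array([0.7, 0.7]))
best = max(question_set, key=lambda q: cosine(target, question_set[q]))
print(answers[best])
```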
In another possible implementation manner, the determining, based on the first semantic vector and the second semantic vector corresponding to the target question statement, the target answer statement corresponding to the target question statement may include: for any question statement in the question statement set, determining a first similarity between a target question statement and the question statement based on a first semantic vector corresponding to the target question statement, and determining a second similarity between the target question statement and the question statement based on a second semantic vector corresponding to the target question statement; determining target similarity corresponding to the question statement based on the first similarity and the second similarity to obtain target similarity corresponding to each question statement in the question statement set; acquiring the maximum similarity in the target similarities corresponding to all question sentences in the question set; and taking the answer sentence corresponding to the question sentence corresponding to the maximum similarity as the target answer sentence corresponding to the target question sentence.
In one possible implementation, the first similarity and the second similarity corresponding to a question sentence may be directly summed, and the summed similarity used as the target similarity for that question sentence. In another possible implementation, the first similarity and the second similarity may be weighted and summed, and the weighted sum used as the target similarity. The weights corresponding to the first similarity and the second similarity can be determined based on the specific application scenario.
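The similarity-fusion variant can be sketched as follows. The weight alpha and the per-question similarity scores are made up for illustration; the patent leaves both the weights and the similarity measure to the application scenario.

```python
def fused_similarity(sim_first, sim_second, alpha=0.6):
    """Weighted sum of the two similarities; alpha is an assumed
    scenario-dependent weight."""
    return alpha * sim_first + (1.0 - alpha) * sim_second

# Toy per-question similarity pairs: (first: deep-semantic, second: dependency).
sims = {
    "q1": (0.9, 0.2),
    "q2": (0.5, 0.95),
    "q3": (0.7, 0.7),
}
answers = {"q1": "a1", "q2": "a2", "q3": "a3"}

# Fuse per question, then take the question with the maximum target similarity
# and return its answer sentence.
scores = {q: fused_similarity(s1, s2) for q, (s1, s2) in sims.items()}
best = max(scores, key=scores.get)
print(best, answers[best])
```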
According to the data matching method provided by the embodiment of the application, after the target question statement is obtained, a first semantic vector which corresponds to the target question statement and contains deep semantic information of the target question statement and a second semantic vector which corresponds to the target question statement and contains dependency syntax of the target question statement can be obtained, and then the target answer statement corresponding to the target question statement is determined based on the first semantic vector and the second semantic vector which correspond to the target question statement. Because the first semantic vector contains deep semantic information of the target question sentence and the second semantic vector contains the dependency syntax relation of the target question sentence, the two vectors can better represent the target question sentence, and the accurate answer can be matched for the target question sentence by combining the two semantic vectors.
In another embodiment of the present application, a process for determining a first semantic vector corresponding to a target question statement and containing deep semantic information of the target question statement is described.
The process of determining a first semantic vector corresponding to the target question statement and containing deep semantic information of the target question statement may include: and determining a first semantic vector corresponding to the target question sentence through a pre-established semantic vector determination model.
The semantic vector determination model is a model based on a residual network. The residual network comprises a basic mapping module and a residual mapping module. The first semantic vector is determined from the basic semantic vector output by the basic mapping module and the residual semantic vector output by the residual mapping module (for example, the two are summed to obtain the first semantic vector). The basic semantic vector contains the basic semantic information of the target question sentence, and the residual semantic vector contains the semantic information lost by the basic semantic vector.
It should be noted that the semantic vectors determined by current semantic-vector schemes usually lose some semantic information. For this situation, the present application provides a semantic vector determination model based on a residual network, in which the basic mapping module captures the basic semantic information of the target question statement and the residual mapping module supplements the important semantic information lost by the basic mapping module.
Referring to FIG. 2, a schematic diagram of the residual network is shown. Assume H(X) denotes the optimal mapping of the input X and F_B(X) denotes the basic mapping; the residual function is defined as:

F_R(X) = H(X) - F_B(X)    (1)

The training goal of the residual network is to drive the residual toward 0, and the network output expected by the present application is:

F(X) = F_B(X) + F_R(X) = F_B(X, {W_B}) + F_R(X, {W_R})    (2)

where F_R(X, {W_R}) denotes residual mapping learning and W_R is the general form of a convolutional layer with bias; the ReLU is omitted to simplify the notation. In residual learning, a three-layer small-scale filter can be adopted to capture the residual not present in the output of the basic mapping module, and the final mapping of the input X is obtained by summing the basic mapping and the residual mapping.
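A minimal numpy sketch of F(X) = F_B(X) + F_R(X) under the equations above might look as follows. The dense-layer stand-in for the basic mapping, the three small layers for the residual filter, and feeding the basic output into the residual module (as the training steps later describe) are simplifying assumptions, not the patent's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16

def relu(x):
    return np.maximum(x, 0.0)

# Basic mapping F_B: one dense layer standing in for the CNN encoder.
W_B = rng.normal(size=(DIM, DIM)) * 0.1

# Residual mapping F_R: three small layers, per the three-layer filter note.
W_R = [rng.normal(size=(DIM, DIM)) * 0.1 for _ in range(3)]

def basic_map(x):
    return relu(W_B @ x)

def residual_map(x):
    h = x
    for W in W_R[:-1]:
        h = relu(W @ h)
    return W_R[-1] @ h  # last layer kept linear

def semantic_vector(x):
    """F(X) = F_B(X) + F_R(X): the residual output supplements the basic map."""
    fb = basic_map(x)
    return fb + residual_map(fb)

x = rng.normal(size=DIM)
v = semantic_vector(x)
print(v.shape)
```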
The residual network in this embodiment can well capture the difference between the semantic information contained in the basic semantic vector output by the basic mapping module and the real semantic information, and thus supplement the output of the basic mapping module so that the output of the whole network is closer to the true value.
The basic mapping module in this embodiment may be, but is not limited to, a CNN; it may also be, for example, an RNN. Referring to FIG. 3, a schematic diagram of determining a semantic vector based on a CNN is shown. Specifically, the stop words in a sentence are removed; the word vectors corresponding to the words in the stop-word-filtered sentence are determined and input to the CNN; convolution is performed on the input word vectors using convolution kernels of 3 different sizes to obtain 3 kinds of column vectors; the column vectors are passed through a pooling layer, which extracts the maximum value of each column vector; finally the pooled results are concatenated to obtain the final output, which is the basic semantic vector of the sentence.
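The CNN pipeline just described (three kernel sizes, max pooling, concatenation) can be sketched in numpy as follows. The embedding size, filter counts, and random filter weights are illustrative assumptions; the patent does not fix these hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(2)

EMB, N_FILTERS = 8, 4
words = rng.normal(size=(6, EMB))  # word vectors, stop words already removed

def conv_max_pool(x, kernel_size, filters):
    """1-D convolution over word positions, then max pooling per filter."""
    windows = [
        x[i:i + kernel_size].ravel() for i in range(len(x) - kernel_size + 1)
    ]
    feats = np.array(windows) @ filters  # (n_windows, n_filters)
    return feats.max(axis=0)             # max over positions

# Three kernel sizes, mirroring the "convolution kernels of 3 different sizes".
kernel_sizes = (2, 3, 4)
filter_banks = {
    k: rng.normal(size=(k * EMB, N_FILTERS)) for k in kernel_sizes
}

# Basic semantic vector: concatenation of pooled features from each kernel size.
basic_vec = np.concatenate(
    [conv_max_pool(words, k, filter_banks[k]) for k in kernel_sizes]
)
print(basic_vec.shape)  # (12,): 3 kernel sizes x 4 filters
```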
The semantic vector determination model in the embodiment is obtained by training with a training question sentence. Referring to fig. 4 and 5, schematic diagrams of a training process of the semantic vector determination model are shown, which may include:
step S401: and acquiring original training question sentences from the training sentence set and generating training question sentences.
For the implementation process of generating the training question sentence based on the original training question sentence, reference may be made to the description of the following embodiments.
It should be noted that, in this embodiment, the semantic vector determination model is trained with the original training question sentences and the generated training question sentences simultaneously, because their generalization abilities differ: the original training question sentences are closer to the true values, while the generated training question sentences provide more diversity.
Step S402: inputting the original training question sentence and the generated training question sentence into a basic mapping module of a semantic vector determination model, and obtaining basic semantic vectors which are output by the basic mapping module and respectively correspond to the original training question sentence and the generated training question sentence.
Step S403: and inputting the basic semantic vectors corresponding to the original training question sentence and the generated training question sentence into a residual mapping module of the semantic vector determination model to obtain the residual semantic vectors corresponding to the original training question sentence and the generated training question sentence respectively.
Step S404: and inputting the basic semantic vector and the residual semantic vector corresponding to the original training question sentence and the generated basic semantic vector and the residual semantic vector corresponding to the training question sentence into a semantic vector determination module of a semantic vector determination model respectively to obtain the semantic vectors corresponding to the original training question sentence and the generated training question sentence respectively.
Assume the original training question sentence is X_i and the training question sentence generated based on X_i is X_g. Then the semantic vector F_i(X) corresponding to the original training question sentence X_i and the semantic vector F_g(X) corresponding to the generated training question sentence X_g are respectively:

F_i(X) = F_B-i(X) + F_R-i(X)   (3)

F_g(X) = F_B-g(X) + F_R-g(X)   (4)

where F_B-i(X) is the basic semantic vector corresponding to the original training question sentence X_i, F_R-i(X) is the residual semantic vector corresponding to X_i, F_B-g(X) is the basic semantic vector corresponding to the generated training question sentence X_g, and F_R-g(X) is the residual semantic vector corresponding to X_g.
Step S405: and updating the parameters of the semantic vector determination model based on the semantic vectors respectively corresponding to the original training question sentences and the generated training question sentences and a preset loss function.
Specifically, the semantic vector F_i(X) corresponding to the original training question sentence and the semantic vector F_g(X) corresponding to the generated training question sentence are combined by weighted summation as follows:

F = η·F_i + (1 - η)·F_g   (5)
The model parameters are then updated based on the vector obtained by the weighted summation and the loss function of the following formula:
[Equation (6), the loss function, is rendered only as an image in the original publication.]
where N is the total number of training samples, ω and b are the weight and bias of the model, η is the weighting parameter, and F_k is the F corresponding to the k-th training sample.
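The composition of equations (3)-(5) can be sketched as follows; the function name and the value of η are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def combined_training_vector(fb_i, fr_i, fb_g, fr_g, eta=0.7):
    """Combine basic and residual semantic vectors per equations (3)-(4),
    then mix the original and generated sentence vectors per equation (5)."""
    f_i = np.asarray(fb_i) + np.asarray(fr_i)   # equation (3): F_i = F_B-i + F_R-i
    f_g = np.asarray(fb_g) + np.asarray(fr_g)   # equation (4): F_g = F_B-g + F_R-g
    return eta * f_i + (1.0 - eta) * f_g        # equation (5): weighted summation
```

The mixed vector F would then feed into the loss of equation (6), whose exact form is not recoverable from the extracted text.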
In another embodiment of the present application, an introduction is presented to generate a corresponding generated training question statement based on an original training question statement.
The process of generating a corresponding generated training question sentence based on the original training question sentence includes: inputting the original training question sentence into a pre-established sentence generation model, and obtaining a generated training question sentence, output by the sentence generation model, whose semantics are related or similar to those of the original training sentence.
It should be noted that in some fields question data are usually scarce; scarce question data means little training data for the semantic vector determination model, which in turn harms the model's generalization ability and robustness. In view of this, the present application provides a sentence generation model that generates new training question sentences from the original training question sentences, thereby increasing the number of training question sentences.
The sentence generation model in this embodiment may be a generative adversarial network, which may include a generation module and an adversarial discrimination module. The generation module learns from real sentences so that the sentences it generates become more realistic, and the adversarial discrimination module judges whether the sentences produced by the generation module are real or fake.
The sentence generation model in this embodiment is obtained by training on the original training question sentences, and its training process may include: inputting an original training question sentence into the generation module of the sentence generation model and obtaining the generated training question sentence output by the generation module; evaluating the generated training question sentence output by the generation module through the adversarial discrimination module to obtain an evaluation result; and updating the parameters of the generation module based on the evaluation result.
This embodiment can use a reinforcement learning algorithm to evaluate the generated training question sentence output by the generation module and then feed the evaluation result back to the generation module to guide the iterative updating of the generation module's parameters, so that the sentences produced by the generation module are as close to real sentences as possible.
As shown in FIG. 6, reinforcement learning mainly involves two parts, an agent (Agent) and an environment (Environment). When the agent takes an action (Action), the state (State) of the environment is affected, and the environment returns a reward (Reward) to the agent; the reward guides the agent in taking its next action.
In the present application, given the particularity of corpora in the field of intelligent question answering and their strong question specificity, reinforcement learning is applied to the generative adversarial model. As shown in FIG. 7, the generation module of the generative adversarial model is regarded as the agent of reinforcement learning, the adversarial discrimination module is regarded as the environment, and the sentence produced by the generation module is regarded as the action.
Specifically, the sentences generated by the generation module are first evaluated, the evaluation result is used as the reward of reinforcement learning, and the parameters of the generation module are then updated from the reward by the policy gradient method, so that the generated sentences are as close as possible to real sentences.
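A minimal policy-gradient (REINFORCE-style) parameter update of the kind described above might look like the following; the patent does not give the exact update rule, so the learning rate and the update form are standard-textbook assumptions:

```python
import numpy as np

def reinforce_update(theta, grad_log_prob, reward, lr=0.01):
    """One policy-gradient step: scale the gradient of the log-probability
    of the generated sentence (the action) by the discriminator's score
    (the reward) and ascend: theta += lr * reward * grad(log pi(action))."""
    return theta + lr * reward * np.asarray(grad_log_prob)
```

A higher reward from the adversarial discrimination module thus pushes the generator's parameters further in the direction that made the sentence likely.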
It should be noted that the adversarial discrimination module in the adversarial network scores the sentence as a whole, which can cause certain problems in sentence generation. For example, the sentence "how many degrees are checked in surgery for physical examination?" may receive a high overall score and yet not be fluent, because the sentence is composed word by word. For this case, the present application can use a Monte Carlo search algorithm to score and complete each word in the sentence, thereby obtaining a better sentence; for example, after each word in the sentence is scored and completed, it becomes "how do the medical examination system in physical examination?".
A specific structure of the above sentence generation model is described below. In one possible implementation, the generation module of the sentence generation model may be a long short-term memory network (LSTM), whose response is as follows:
h_t = g(h_{t-1}, x_t)   (7)
where x_1, ..., x_t are the inputs to the LSTM and h_t is the sentence vector output by the hidden layer of the LSTM for the input sequence.
The sentence vector output by the hidden layer of the LSTM is then processed by a softmax layer, computed as follows:
p(y_t | x_1, ..., x_t) = z(h_t) = softmax(c + V·h_t)   (8)
where c and V are parameters of the LSTM. The adversarial discrimination module guides the generator's output by updating these parameters so that the finally output sentence is as close as possible to a real sentence.
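Equation (8) can be sketched directly; the shapes of V and c are assumptions for illustration:

```python
import numpy as np

def next_word_distribution(h_t, V, c):
    """Equation (8): project the LSTM hidden-layer vector h_t through
    parameters V and c, then apply softmax to obtain a distribution over
    the vocabulary. Shapes: V is (vocab_size, hidden_dim), c is (vocab_size,)."""
    logits = c + V @ h_t
    exp = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exp / exp.sum()
```

Sampling from this distribution at each step yields the next word of the generated sentence.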
It should be noted that the input to the LSTM is the word vector corresponding to each word in a sentence; the LSTM learns the structure of the sentence and the interrelationships of its words from these word vectors and thereby generates a sentence, which is then input to the adversarial discrimination module. In this embodiment, the word vector for each word may be determined using, but not limited to, a word2vec model; the word vector corresponding to a word is used to characterize that word.
Illustratively, the words "commander", "is", "what", and "level" in the sentence "what level is the commander" are characterized by the following word vectors:

commander: [0.992734, -0.476647, ..., 0.217249]
is: [-0.135216, 0.156160, ..., 0.001139]
what: [0.088582, 0.240145, ..., -0.006931]
level: [0.616357, -0.150043, ..., 0.243982]
After the word vectors corresponding to "commander", "is", "what", and "level" are input to the generation module, the generation module outputs sentences whose semantics are related or similar to "what level is the commander".
The adversarial discrimination module in this embodiment may be a DNN, CNN, RCNN, or the like. Considering that a CNN performs well both when the input sentence is complete and when it is incomplete, the adversarial discrimination module in this embodiment is preferably a CNN. Meanwhile, to improve the effectiveness and efficiency of the adversarial discrimination module, a Highway structure is preferably used on the pooling layer; the main problem the Highway structure solves is that as network depth increases, the backward flow of gradient information is blocked, making the network difficult to train.
Table 1 below shows sentences generated by the trained sentence generation model. As can be seen from Table 1, the sentence generation effect of the model is good, and the generated sentences are fluent.
TABLE 1 statement Generation model generated statements
[Table 1 is rendered only as an image in the original publication.]
Another embodiment of the present application introduces the determination of the second semantic vector, which contains the dependency syntax relationships of the target question sentence.
The dependency syntax relationships of a sentence can reveal the semantic modification relations among its components, that is, indicate the syntactic collocation relations among the words in the sentence, and analyze its subject, predicate, object, attributive, adverbial, and complement structure. The dependency syntax axiom proposed by Robinson states that no component may depend on two or more components, and that every entity in a sentence must appear as a semantic component in the dependency structure. Based on this, the present application provides a semantic vector determination method based on dependency syntax relationships.
Referring to FIG. 8, a flowchart of determining the second semantic vector containing the dependency syntax relationships of the target question sentence is shown; the process may include:
step S801: and removing the stop words from the target question sentences to obtain the sentences from which the stop words are removed.
Specifically, word segmentation is performed on the target question sentence, and stop words are then identified among the segmented words and removed based on a pre-built stop-word table, yielding the sentence with stop words removed. It should be noted that stop words are words that contribute no substantive meaning to the sentence, such as particles and interjections. Illustratively, for the target question sentence "which items are included in all surgical examinations in a flight examination", the sentence after removing stop words is "what items are included in the surgical examination in the fly".
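Step S801 reduces to a lookup against the stop-word table once the sentence has been segmented; a minimal sketch (the tokens and stop-word set below are illustrative assumptions):

```python
def remove_stop_words(tokens, stop_words):
    """Sketch of step S801: after word segmentation, drop every token
    that appears in a pre-built stop-word table."""
    return [t for t in tokens if t not in stop_words]
```

For a Chinese question sentence, `tokens` would come from a word segmenter rather than from whitespace splitting.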
Step S802: determining a word vector corresponding to each word in the sentence without the stop word, and determining a plurality of dependency syntax relations of the sentence without the stop word.
The dependency syntax relationships include one or more of: subject-verb relationship (SBV), verb-object relationship (VOB), indirect-object relationship (IOB), fronted-object relationship (FOB), double-object structure (DBL), attributive relationship (ATT), adverbial structure (ADV), verb-complement structure (CMP), coordinate relationship (COO), preposition-object relationship (POB), left-adjunct relationship (LAD), right-adjunct relationship (RAD), independent structure (IS), punctuation (WP), and core relationship (HED).
Referring to FIG. 9, the dependency syntax relationship of the statement "what items are included in the surgical examination in the fly" after removing stop words is shown.
Step S803: and determining vectors corresponding to the plurality of dependency syntax relations respectively based on the plurality of dependency syntax relations of the sentence without the stop word and the word vectors corresponding to the words in the sentence without the stop word.
Specifically, the vector corresponding to any dependency syntax relationship is the sum of the word vectors of all the words covered by that relationship in the sentence with stop words removed.
Step S804: and weighting and summing the vectors respectively corresponding to the plurality of dependency syntax relations, wherein the vector obtained by weighting and summing is used as a second semantic vector corresponding to the target question statement.
The weight corresponding to each dependency syntax relationship is determined based on the importance of that relationship in the sentence structure; for example, the core relationship is the core of the whole sentence and can be assigned a larger weight.
Specifically, the second semantic vector corresponding to the target question statement may be determined based on the following formula:
SV_DPM = ω_SBV·V_SBV + ω_VOB·V_VOB + ... + ω_HED·V_HED   (9)

where ω_SBV + ω_VOB + ... + ω_HED = 1, SV_DPM is the second semantic vector corresponding to the target question sentence, and V_SBV, V_VOB, ..., V_HED are the vectors corresponding to the respective dependency syntax relationships.
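Steps S803-S804 and equation (9) can be sketched as follows; the relation labels, example weights, and data structures are illustrative assumptions:

```python
import numpy as np

def dependency_semantic_vector(relation_words, word_vectors, weights):
    """Step S803: the vector of each dependency relation is the sum of the
    word vectors of the words it covers. Step S804 / equation (9): the
    second semantic vector is the weighted sum of the relation vectors,
    with weights assumed to sum to 1 as in the patent."""
    sv = np.zeros_like(next(iter(word_vectors.values())))
    for rel, words in relation_words.items():
        rel_vec = np.sum([word_vectors[w] for w in words], axis=0)  # step S803
        sv = sv + weights[rel] * rel_vec                            # equation (9)
    return sv
```

The weight table would encode the importance of each relation, e.g. a larger weight for the core (HED) relation.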
After the first semantic vector and the second semantic vector corresponding to the target question sentence are obtained, matching can be performed for the target question sentence based on both. Matching with the first semantic vector emphasizes the deep meaning of the question sentence, while matching with the second semantic vector emphasizes the dependency relationships of the sentence; matching that combines the deep meaning and the dependency relationships of the sentence can therefore improve the sentence matching effect.
In the data matching method provided by the embodiments of the present application, to address the fact that question data are usually scarce in some fields, a sentence generation model based on reinforcement learning is provided: the sentence generation model generates new question data from the original question data, and the original question data together with the corresponding generated question data are used as training question data to train the semantic vector determination model. The semantic vector determination model comprises a basic mapping module that captures the basic semantic information in the question data and a residual mapping module that restores the semantic information lost by the basic mapping module, which greatly improves the model's understanding of deep semantics. To further improve the matching effect, a dependency-relationship-based semantic vector determination method is also introduced: a first semantic vector containing deep semantic information is obtained from the semantic vector determination model, a second semantic vector containing the dependency syntax relationships is obtained from the dependency-based method, and the question data are matched based on both semantic vectors, thereby improving the matching effect.
The data matching device provided by the embodiment of the application is described below, and the data matching device described below and the data matching method described above can be referred to correspondingly.
Referring to fig. 10, a schematic structural diagram of a data matching apparatus according to an embodiment of the present application is shown, and as shown in fig. 10, the apparatus may include: a question sentence acquisition module 1001, a first semantic vector determination module 1002, a second semantic vector determination module 1003, and an answer sentence determination module 1004.
A question sentence acquisition module 1001 configured to acquire a target question sentence.
A first semantic vector determining module 1002, configured to determine a first semantic vector corresponding to the target question statement and including deep semantic information of the target question statement.
A second semantic vector determining module 1003, configured to determine a second semantic vector that includes a dependency syntax relationship of the target question statement and corresponds to the target question statement.
An answer sentence determining module 1004, configured to determine a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence.
The data matching device provided in the embodiment of the application may obtain, after obtaining the target question statement, a first semantic vector corresponding to the target question statement and including deep semantic information of the target question statement and a second semantic vector corresponding to the target question statement and including a dependency syntax of the target question statement, and determine, based on the first semantic vector and the second semantic vector corresponding to the target question statement, a target answer statement corresponding to the target question statement. Because the first semantic vector contains deep semantic information of the target question sentence and the second semantic vector contains the dependency syntax relation of the target question sentence, the two vectors can better represent the target question sentence, and the accurate answer can be matched for the target question sentence by combining the two semantic vectors.
In a possible implementation manner, the first semantic vector determining module 1002 in the foregoing embodiment is specifically configured to determine the first semantic vector corresponding to the target question statement through a pre-established semantic vector determining model.
Wherein the semantic vector determination model is a model based on a residual error network, the residual error network comprising: a basic mapping module and a residual mapping module; the first semantic vector is determined by a basic semantic vector output by the basic mapping module and a residual semantic vector output by the residual mapping module; the basic semantic vector comprises basic semantic information of the target question statement, and the residual semantic vector comprises semantic information lost by the basic semantic vector.
In a possible implementation manner, the semantic vector determination model in the above embodiment is obtained by training based on a training question sentence in a training sentence set. The data matching device provided by the above embodiment further includes: a first training module.
The first training module includes: an acquisition submodule, a basic semantic vector determination submodule, a residual semantic vector determination submodule, a semantic vector determination submodule, and a parameter updating submodule.

The acquisition submodule is configured to acquire original training question sentences from the training sentence set, together with generated training question sentences, the generated training question sentences being generated based on the original training question sentences;

the basic semantic vector determination submodule is configured to input the original training question sentence and the generated training question sentence into the basic mapping module of the semantic vector determination model, and obtain the basic semantic vectors corresponding to the original training question sentence and the generated training question sentence respectively.

The residual semantic vector determination submodule is configured to input the basic semantic vectors corresponding to the original training question sentence and the generated training question sentence into the residual mapping module of the semantic vector determination model, and obtain the residual semantic vectors corresponding to each.

The semantic vector determination submodule is configured to input the basic semantic vector and the residual semantic vector corresponding to the original training question sentence, and the basic semantic vector and the residual semantic vector corresponding to the generated training question sentence, into the semantic vector determination module of the semantic vector determination model, and obtain the semantic vectors corresponding to the original training question sentence and the generated training question sentence respectively.

The parameter updating submodule is configured to update the parameters of the semantic vector determination model based on the semantic vectors respectively corresponding to the original training question sentences and the generated training question sentences and a preset loss function.
The data matching device provided by the above embodiment further includes a sentence generation module.

The sentence generation module is configured to input the original training question sentence into a pre-established sentence generation model and obtain a generated training question sentence, output by the sentence generation model, whose semantics are related or similar to those of the original training question sentence.
The sentence generation model in the above embodiment is obtained by training on the original training question sentences and includes a generation module and an adversarial discrimination module. The data matching device provided by the above embodiment further includes a second training module.
The second training module is configured to input the original training question sentence into the generation module of the sentence generation model and obtain the generated training question sentence output by the generation module; evaluate the generated training question sentence output by the generation module through the adversarial discrimination module to obtain an evaluation result; and update the parameters of the generation module based on the evaluation result.
In a possible implementation manner, the second semantic vector determining module 1003 in the foregoing embodiment is specifically configured to remove a stop word from the target question statement, and obtain a statement without the stop word; determining word vectors corresponding to all words in the sentence without the stop words, and determining a plurality of dependency syntax relations of the sentence without the stop words; determining vectors corresponding to the plurality of dependency syntax relations respectively based on the plurality of dependency syntax relations and the word vectors corresponding to the words; and weighting and summing the vectors corresponding to the plurality of dependency syntax relations respectively, wherein the vector obtained by weighting and summing is used as the second semantic vector corresponding to the target question sentence, and the weight value corresponding to each dependency syntax relation is determined based on the importance of each dependency syntax relation in a sentence structure.
The dependency syntax relationships in the above embodiments include one or more of the following: subject-verb relationship, verb-object relationship, indirect-object relationship, fronted object, double object, attributive relationship, adverbial structure, verb-complement structure, coordinate relationship, preposition-object relationship, left-adjunct relationship, right-adjunct relationship, independent structure, punctuation, and core relationship.
In a possible implementation manner, the answer sentence determination module 1004 in the foregoing embodiment may include: a target semantic vector determining sub-module, a question matching sub-module and an answer sentence determining sub-module.
The target semantic vector determining submodule is configured to determine a target semantic vector corresponding to the target question statement according to the first semantic vector and the second semantic vector corresponding to the target question statement.
The question matching sub-module is specifically configured to determine, from a question statement set, a question statement with the highest similarity to the target question statement based on the target semantic vector corresponding to the target question statement.
The answer sentence determination submodule is specifically configured to determine an answer sentence corresponding to the question sentence with the highest similarity to the target question sentence as the target answer sentence corresponding to the target question sentence.
In another possible implementation manner, the answer sentence determination module 1004 in the above embodiment may include: a similarity determination submodule and an answer sentence determination submodule.
The similarity determination submodule is configured to, for any question sentence in a question sentence set, determine a first similarity between the target question sentence and that question sentence based on the first semantic vector corresponding to the target question sentence, and determine a second similarity between them based on the second semantic vector corresponding to the target question sentence; determine a target similarity for that question sentence based on the first similarity and the second similarity, thereby obtaining a target similarity for each question sentence in the question sentence set; and obtain the maximum among the target similarities of the question sentences in the set.
And the answer sentence determining submodule is used for taking the answer sentence corresponding to the question sentence corresponding to the maximum similarity as the target answer sentence corresponding to the target question sentence.
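The two-similarity matching variant above can be sketched as follows; cosine similarity and the mixing weight alpha are assumptions, since the patent does not fix the similarity measure or the combination rule:

```python
import numpy as np

def match_answer(q1, q2, candidates, alpha=0.5):
    """For each candidate question (first_vec, second_vec, answer), combine
    the similarity of the first semantic vectors and of the second semantic
    vectors, and return the answer of the best-scoring candidate."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    best = max(candidates,
               key=lambda c: alpha * cos(q1, c[0]) + (1 - alpha) * cos(q2, c[1]))
    return best[2]
```

Here `q1` and `q2` are the first and second semantic vectors of the target question sentence, and `candidates` plays the role of the question sentence set with its associated answers.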
An embodiment of the present application further provides a data matching device, please refer to fig. 11, which shows a schematic structural diagram of the data matching device, where the data matching device may include: at least one processor 1101, at least one communication interface 1102, at least one memory 1103, and at least one communication bus 1104;
in the embodiment of the present application, the number of the processor 1101, the communication interface 1102, the memory 1103 and the communication bus 1104 is at least one, and the processor 1101, the communication interface 1102 and the memory 1103 complete communication with each other through the communication bus 1104;
the processor 1101 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;

the memory 1103 may include a high-speed RAM memory and may also include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
obtaining a target question sentence;
determining a first semantic vector corresponding to the target question statement and containing deep semantic information of the target question statement, and determining a second semantic vector corresponding to the target question statement and containing a dependency syntax relationship of the target question statement;
and determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
obtaining a target question sentence;
determining a first semantic vector corresponding to the target question statement and containing deep semantic information of the target question statement, and determining a second semantic vector corresponding to the target question statement and containing a dependency syntax relationship of the target question statement;
and determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of data matching, comprising:
obtaining a target question sentence;
determining a first semantic vector corresponding to the target question statement and containing deep semantic information of the target question statement, and determining a second semantic vector corresponding to the target question statement and containing a dependency syntax relationship of the target question statement;
determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence;
wherein the process of determining the first semantic vector comprises:
determining the first semantic vector through a pre-established semantic vector determination model, wherein the semantic vector determination model comprises a basic mapping module and a residual mapping module; the first semantic vector is determined by a basic semantic vector output by the basic mapping module and a residual semantic vector output by the residual mapping module, the basic semantic vector contains basic semantic information of the target question sentence, and the residual semantic vector contains semantic information lost by the basic semantic vector.
2. The data matching method according to claim 1, wherein the semantic vector determination model is trained based on training question sentences in a training sentence set, and the training process of the semantic vector determination model comprises:
obtaining original training question sentences from the training sentence set and generating training question sentences, wherein the generated training question sentences are generated based on the original training question sentences;
inputting the original training question sentence and the generated training question sentence into a basic mapping module of the semantic vector determination model to obtain basic semantic vectors corresponding to the original training question sentence and the generated training question sentence respectively;
inputting the basic semantic vectors corresponding to the original training question sentence and the generated training question sentence into a residual mapping module of the semantic vector determination model to obtain residual semantic vectors corresponding to the original training question sentence and the generated training question sentence;
inputting the basic semantic vector and the residual semantic vector corresponding to the original training question sentence, and the basic semantic vector and the residual semantic vector corresponding to the generated training question sentence, respectively, into a semantic vector determination module of the semantic vector determination model, to obtain semantic vectors respectively corresponding to the original training question sentence and the generated training question sentence;
and updating parameters of the semantic vector determination model based on semantic vectors respectively corresponding to the original training question sentences and the generated training question sentences and a preset loss function.
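As a hedged illustration of the loss step above (the claim does not fix the loss function; the vectors and the squared-distance loss below are invented for the example), a preset loss can pull the semantic vectors of an original sentence and its semantically close generated sentence together:

```python
import numpy as np

# Hypothetical semantic-vector parts for an original training question sentence
# and a generated (semantically close) one: each final vector = basic + residual.
basic_orig, res_orig = np.array([0.5, 0.2]), np.array([0.1, -0.1])
basic_gen, res_gen = np.array([0.4, 0.3]), np.array([0.1, -0.1])

sem_orig = basic_orig + res_orig   # semantic vector of the original sentence
sem_gen = basic_gen + res_gen      # semantic vector of the generated sentence

# Preset loss: squared distance, so related sentences are pulled together;
# its gradient would drive the parameter update in the training step.
loss = float(np.sum((sem_orig - sem_gen) ** 2))
print(loss)
```

A real implementation would backpropagate this loss through both mapping modules; here only the loss value itself is computed.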
3. The data matching method of claim 2, wherein generating the generated training question sentence based on the original training question sentence comprises:
inputting the original training question sentence into a pre-established sentence generation model, and obtaining a generated training question sentence, output by the sentence generation model, that is semantically related or close to the original training question sentence.
4. The data matching method of claim 3, wherein the sentence generation model is obtained by training on original training question sentences, and comprises a generation module and an adversarial discrimination module;
the training process of the statement generation model comprises the following steps:
inputting an original training question sentence into a generation module of the sentence generation model, and obtaining a generated training question sentence output by the generation module;
evaluating, through the adversarial discrimination module, the generated training question sentence output by the generation module to obtain an evaluation result;
updating parameters of the generation module based on the evaluation result.
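The generator-discriminator loop in the steps above can be caricatured with scalars (everything here is a toy assumption: the one-parameter "generator", the fixed scoring rule, and the finite-difference update stand in for real networks and backpropagation):

```python
# Toy generation module: a single scalar parameter shifts the input "sentence vector".
theta = 0.0

def generate(x, theta):
    return x + theta  # generated training question sentence, as a 1-D vector

def discriminate(x):
    # Adversarial discrimination module: scores how "real" the sentence looks.
    # Toy assumption: real sentences sit near 1.0, so distance from 1.0 is penalized.
    return -abs(x - 1.0)

original = 0.2  # stand-in for an original training question sentence
for _ in range(200):
    generated = generate(original, theta)
    score = discriminate(generated)          # evaluation result
    # Update the generation module so the discriminator's score rises
    # (finite-difference gradient ascent, purely illustrative).
    eps = 1e-3
    grad = (discriminate(generate(original, theta + eps)) - score) / eps
    theta += 0.1 * grad

print(generate(original, theta))  # settles near 1.0
```

In an actual GAN-style setup the discriminator is trained in alternation with the generator; the claim's steps only cover the generator-update half shown here.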
5. The data matching method of claim 1, wherein the determining a second semantic vector corresponding to the target question sentence comprises:
removing stop words from the target question sentence to obtain a sentence with the stop words removed;
determining word vectors corresponding to the respective words in the sentence with the stop words removed, and determining a plurality of dependency syntax relations of that sentence;
determining vectors corresponding to the plurality of dependency syntax relations respectively based on the word vectors corresponding to the words and the plurality of dependency syntax relations;
and performing a weighted summation of the vectors respectively corresponding to the plurality of dependency syntax relations, wherein the vector obtained by the weighted summation serves as the second semantic vector corresponding to the target question sentence, and the weight corresponding to each dependency syntax relation is determined based on the importance of that dependency syntax relation in the sentence structure.
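A minimal sketch of the weighted summation over dependency relations (the word vectors, the relation pairs, the per-relation vector rule, and the importance weights are all invented for illustration; the claim fixes none of them):

```python
import numpy as np

# Hypothetical word vectors for a stop-word-filtered sentence "cat chased mouse".
word_vecs = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "chased": np.array([0.0, 1.0, 0.0]),
    "mouse": np.array([0.0, 0.0, 1.0]),
}

# Dependency relations as (head, dependent) pairs; one vector per relation,
# here taken as the mean of the two word vectors involved.
relations = [("chased", "cat"), ("chased", "mouse")]
relation_vecs = [(word_vecs[h] + word_vecs[d]) / 2 for h, d in relations]

# Weights reflecting each relation's importance in the sentence structure
# (subject relation weighted above object, purely for illustration).
weights = [0.6, 0.4]

# Second semantic vector: weighted sum over the relation vectors.
second_semantic_vector = sum(w * v for w, v in zip(weights, relation_vecs))
print(second_semantic_vector)
```

With these toy numbers the result is `[0.3, 0.5, 0.2]`; in practice the relations would come from a dependency parser and the weights from the importance scheme the method learns or presets.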
6. The data matching method according to any one of claims 1 to 5, wherein the determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence comprises:
determining a target semantic vector corresponding to the target question sentence through the first semantic vector and the second semantic vector corresponding to the target question sentence;
determining a question sentence with the highest similarity to the target question sentence from a question sentence set based on the target semantic vector corresponding to the target question sentence;
and determining the answer sentence corresponding to the question sentence with the highest similarity to the target question sentence as the target answer sentence corresponding to the target question sentence.
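The retrieval steps above can be sketched as a nearest-neighbor lookup (the concatenation used to form the target vector, the cosine measure, and the question set contents are all assumptions; the claim leaves the combination and similarity measure open):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Target semantic vector combined from the first and second semantic vectors
# (concatenation is one possible combination).
first_vec = np.array([1.0, 0.0])
second_vec = np.array([0.0, 1.0])
target_vec = np.concatenate([first_vec, second_vec])

# Hypothetical question set: precomputed semantic vectors paired with answers.
question_set = {
    "q1": (np.array([1.0, 0.1, 0.0, 0.9]), "answer for q1"),
    "q2": (np.array([0.9, 0.0, 0.1, 1.1]), "answer for q2"),
}

# Pick the question sentence with the highest similarity; its answer becomes
# the target answer sentence.
best_q = max(question_set, key=lambda q: cosine(target_vec, question_set[q][0]))
target_answer = question_set[best_q][1]
print(best_q, target_answer)
```

For a large question set, an approximate-nearest-neighbor index would replace the linear scan, but the claim only requires selecting the highest-similarity question.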
7. The data matching method according to any one of claims 1 to 5, wherein the determining a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence comprises:
for any question sentence in a question sentence set, determining a first similarity between the target question sentence and that question sentence based on the first semantic vector corresponding to the target question sentence, and determining a second similarity between the target question sentence and that question sentence based on the second semantic vector corresponding to the target question sentence; and determining a target similarity corresponding to that question sentence based on the first similarity and the second similarity, so as to obtain a target similarity corresponding to each question sentence in the question sentence set;
acquiring the maximum similarity among the target similarities corresponding to all question sentences in the question sentence set;
and taking the answer sentence corresponding to the question sentence corresponding to the maximum similarity as the target answer sentence corresponding to the target question sentence.
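The per-question combination of the two similarities can be sketched as follows (the weighted mean with `alpha` and the similarity values are invented; the claim only requires that both similarities contribute to the target similarity):

```python
# Combine first and second similarities into one target similarity per question.
def target_similarity(sim1, sim2, alpha=0.5):
    return alpha * sim1 + (1 - alpha) * sim2

# Hypothetical (first, second) similarities for a three-question set.
sims = {
    "q1": (0.9, 0.4),
    "q2": (0.7, 0.8),
    "q3": (0.5, 0.5),
}

scores = {q: target_similarity(s1, s2) for q, (s1, s2) in sims.items()}
best = max(scores, key=scores.get)  # question sentence with the maximum target similarity
print(best, scores[best])
```

Here `q2` wins despite `q1` having the highest deep-semantic similarity, which is the point of combining the two views: the dependency-syntax channel can overturn a purely semantic ranking.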
8. A data matching apparatus, comprising: a question sentence acquisition module, a first semantic vector determination module, a second semantic vector determination module and an answer sentence determination module;
the question sentence acquisition module is configured to acquire a target question sentence;
the first semantic vector determination module is configured to determine a first semantic vector which corresponds to the target question sentence and contains deep semantic information of the target question sentence;
the second semantic vector determination module is configured to determine a second semantic vector which corresponds to the target question sentence and contains a dependency syntax relationship of the target question sentence;
the answer sentence determination module is configured to determine a target answer sentence corresponding to the target question sentence based on the first semantic vector and the second semantic vector corresponding to the target question sentence;
the first semantic vector determination module, when determining the first semantic vector, is specifically configured to determine the first semantic vector through a pre-established semantic vector determination model, wherein the semantic vector determination model comprises a basic mapping module and a residual mapping module; the first semantic vector is determined from a basic semantic vector output by the basic mapping module and a residual semantic vector output by the residual mapping module, the basic semantic vector contains basic semantic information of the target question sentence, and the residual semantic vector contains semantic information lost by the basic semantic vector.
9. A data matching device, comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the data matching method according to any one of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the data matching method according to any one of claims 1 to 7.
CN201811620836.6A 2018-12-28 2018-12-28 Data matching method, device, equipment and storage medium Active CN109710744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811620836.6A CN109710744B (en) 2018-12-28 2018-12-28 Data matching method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109710744A CN109710744A (en) 2019-05-03
CN109710744B true CN109710744B (en) 2021-04-06

Family

ID=66258926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811620836.6A Active CN109710744B (en) 2018-12-28 2018-12-28 Data matching method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109710744B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442689A (en) * 2019-06-25 2019-11-12 平安科技(深圳)有限公司 A kind of question and answer relationship sort method, device, computer equipment and storage medium
CN111008267A (en) * 2019-10-29 2020-04-14 平安科技(深圳)有限公司 Intelligent dialogue method and related equipment
CN112837676B (en) * 2019-11-04 2023-12-01 深圳市优必选科技股份有限公司 Statement generation method, statement generation device and intelligent device
CN110990549B (en) * 2019-12-02 2023-04-28 腾讯科技(深圳)有限公司 Method, device, electronic equipment and storage medium for obtaining answer
CN111191034B (en) * 2019-12-30 2023-01-17 科大讯飞股份有限公司 Human-computer interaction method, related device and readable storage medium
CN111241285B (en) * 2020-01-15 2023-09-01 北京百度网讯科技有限公司 Method, device, equipment and storage medium for identifying question answer type
CN114065014A (en) * 2020-07-31 2022-02-18 北京达佳互联信息技术有限公司 Information matching method, device, equipment and storage medium
CN112434514B (en) * 2020-11-25 2022-06-21 重庆邮电大学 Multi-granularity multi-channel neural network based semantic matching method and device and computer equipment
CN113297354A (en) * 2021-06-16 2021-08-24 深圳前海微众银行股份有限公司 Text matching method, device, equipment and storage medium
CN114297357B (en) * 2021-12-27 2022-08-19 北京中科闻歌科技股份有限公司 Question-answer model construction method and device based on quantum computation and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562784A (en) * 2017-07-25 2018-01-09 同济大学 Short text classification method based on ResLCNN models

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824933B (en) * 2016-03-18 2019-02-26 苏州大学 Automatically request-answering system and its implementation based on main rheme
US10650068B2 (en) * 2017-01-09 2020-05-12 Google Llc Search engine
US11200269B2 (en) * 2017-06-15 2021-12-14 Microsoft Technology Licensing, Llc Method and system for highlighting answer phrases
AU2018100321A4 (en) * 2018-03-15 2018-04-26 Chen, Jinghan Mr Person ReID method based on metric learning with hard mining

Also Published As

Publication number Publication date
CN109710744A (en) 2019-05-03

Similar Documents

Publication Publication Date Title
CN109710744B (en) Data matching method, device, equipment and storage medium
White et al. Inference is everything: Recasting semantic resources into a unified evaluation framework
Li et al. Leveraging linguistic structures for named entity recognition with bidirectional recursive neural networks
CN110222178B (en) Text emotion classification method and device, electronic equipment and readable storage medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
CN110377916B (en) Word prediction method, word prediction device, computer equipment and storage medium
CN111339269B (en) Knowledge graph question-answering training and application service system capable of automatically generating templates
CN106547737A (en) Based on the sequence labelling method in the natural language processing of deep learning
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
US11669740B2 (en) Graph-based labeling rule augmentation for weakly supervised training of machine-learning-based named entity recognition
CN112328800A (en) System and method for automatically generating programming specification question answers
JP6729095B2 (en) Information processing device and program
CN112069801A (en) Sentence backbone extraction method, equipment and readable storage medium based on dependency syntax
CN110245349B (en) Syntax dependence analysis method and apparatus, and electronic device
US20230237084A1 (en) Method and apparatus for question-answering using a database consist of query vectors
US11048737B2 (en) Concept identification in a question answering system
CN110659392B (en) Retrieval method and device, and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN110969005A (en) Method and device for determining similarity between entity corpora
US11288265B2 (en) Method and apparatus for building a paraphrasing model for question-answering
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
KR101983477B1 (en) Method and System for zero subject resolution in Korean using a paragraph-based pivotal entity identification
CN114579605B (en) Table question-answer data processing method, electronic equipment and computer storage medium
CN110069601A (en) Mood determination method and relevant apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant