CN116150380A - Text matching method, device, storage medium and equipment - Google Patents

Text matching method, device, storage medium and equipment

Info

Publication number
CN116150380A
CN116150380A (application CN202310413634.9A)
Authority
CN
China
Prior art keywords
text
similarity
original
pair
characterization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310413634.9A
Other languages
Chinese (zh)
Other versions
CN116150380B (en)
Inventor
王一张
杨非
邱鹏
陆陈昊
候星甫
王智明
崔鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202310413634.9A
Publication of CN116150380A
Application granted
Publication of CN116150380B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The specification discloses a text matching method, apparatus, storage medium and device. The loss of a matched text pair is determined from a first similarity between the characterization vectors of the two original texts in the pair, second similarities between the characterization vector of each original text in the pair and that of its deformed text, and third similarities between the characterization vector of each original text in the pair and those of other original texts. The first similarity and the second similarities are inversely related to the loss of the text pair, and the third similarities are positively related to it. A total loss is determined from the losses of all text pairs, and the characterization model is trained with minimizing the total loss as the objective. In response to a matching request, the trained characterization model outputs the characterization vectors of the text pair to be matched, and the matching result of the pair is determined from these vectors. The method trains a characterization model capable of outputting accurate characterization vectors, and thereby improves the accuracy of text matching based on the characterization model.

Description

Text matching method, device, storage medium and equipment
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text matching method, apparatus, storage medium, and device.
Background
Currently, text matching is applied in numerous scenarios, such as question-answering systems, text deduplication, information retrieval, etc.
Text matching typically requires determining a token vector for the text to determine a text matching result based on the token vector for the text. Thus, the text matching effect is closely related to the quality of the text token vector.
In order to improve the text matching effect by improving the quality of text token vectors, the specification provides a text matching method.
Disclosure of Invention
The present disclosure provides a text matching method, apparatus, storage medium, and electronic device, so as to at least partially solve the foregoing problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a text matching method, which comprises the following steps:
acquiring a text set, and determining deformed text of each original text in the text set, wherein the text set comprises text pairs consisting of the original texts with matching relations;
respectively obtaining characterization vectors of each original text and the deformed text thereof through a characterization model to be trained;
for each text pair in the text set, determining the similarity between the characterization vectors of the two original texts in the text pair as a first similarity;
For each original text in the text pair, determining the similarity between the original text and the characterization vector of the deformed text of the original text as a second similarity, and determining the similarity between the original text and the characterization vector of each other original text respectively as each third similarity; wherein the other original texts are original texts except the text pair in the text set;
determining the loss of the text pair according to the first similarity, each second similarity corresponding to the text pair and each third similarity corresponding to the text pair, wherein the first similarity and each second similarity are inversely related to the loss of the text pair, and each third similarity is positively related to the loss of the text pair;
determining total loss according to the determined loss of each text pair, and training the characterization model by taking the minimum total loss as an optimization target to obtain the trained characterization model;
and responding to a matching request carrying a text pair to be matched, inputting the text pair to be matched into the trained characterization model, and determining a matching result of the text pair to be matched according to a characterization vector output by the characterization model.
Optionally, determining deformed text of each original text in the text set specifically includes:
and aiming at each original text in the text set, obtaining a deformed text of the original text by a preset data enhancement mode.
Optionally, the data enhancement mode is position transformation;
obtaining a deformed text of the original text through a preset data enhancement mode, wherein the deformed text specifically comprises the following steps:
determining each word element of the original text;
and obtaining the deformed text of the original text by carrying out position transformation on at least part of the word elements of the original text.
Optionally, the data enhancement mode is random zero setting;
obtaining a deformed text of the original text through a preset data enhancement mode, wherein the deformed text specifically comprises the following steps:
determining each word element of the original text;
and randomly setting zero for part of the word elements of the original text according to a preset first probability to obtain the deformed text of the original text.
Optionally, determining deformed text of each original text in the text set specifically includes:
determining original texts with matching relations with other original texts from all original texts in the text set, and taking the original texts as matching texts;
and respectively determining deformed texts of the matched texts.
Optionally, through a characterization model to be trained, characterization vectors of each original text and its deformed text are obtained respectively, which specifically includes:
and respectively obtaining the matching texts and the characterization vectors corresponding to the deformation texts through the characterization model to be trained.
Optionally, the method further comprises:
for each original text of each text pair in the text set, determining similarity between the deformed text of the original text and the characterization vector of the other original text in the text pair as a fourth similarity.
Optionally, determining the loss of the text pair according to the first similarity, each second similarity corresponding to the text and each third similarity corresponding to the text pair specifically includes:
determining the loss of the text pair according to the first similarity, the second similarities corresponding to the text, the third similarities corresponding to the text and the fourth similarities corresponding to the text pair;
wherein the fourth similarity is inversely related to the loss of the text pair.
Optionally, the characterization model is composed of a pre-trained feature model and a weak encoder;
the method comprises the steps of respectively obtaining characterization vectors of each original text and deformation text thereof through a characterization model to be trained, wherein the method specifically comprises the following steps:
Inputting each text in each original text and each deformed text into the feature model to obtain the coding feature of the text;
and inputting the coding characteristics of the text into the weak coder to obtain the characterization vector of the text.
Optionally, the feature model includes a number of encoders;
inputting the text into the feature model to obtain the coding feature of the text, which comprises the following steps:
randomly setting zero for parameters of an encoder of the feature model according to a preset second probability;
and inputting the text into the feature model after random zeroing to obtain the coding feature of the text.
Optionally, determining the matching result of the text pair to be matched according to the characterization vector output by the characterization model specifically includes:
determining the similarity of the text pairs to be matched according to the characterization vector output by the characterization model;
and inputting the similarity into a preset activation function to obtain a matching result of the text pair to be matched.
Optionally, the matching result corresponds to a binary classification result or a multi-classification result;
inputting the similarity into a preset activation function to obtain a matching result of the text pair to be matched, wherein the matching result comprises the following specific steps:
When binary classification is performed on the relation of the text pairs to be matched, inputting the similarity into a preset first activation function to obtain a first score indicating that the text pair to be matched has a matching relation, and when the first score is larger than a preset threshold value, determining that the text pair to be matched has a matching relation as the matching result, otherwise determining that the text pair to be matched does not have a matching relation as the matching result;
when the relation of the text pairs to be matched is subjected to multi-classification, inputting the similarity into a preset second activation function to obtain second scores of the text pairs to be matched corresponding to preset matching degrees, and taking the matching degree corresponding to the largest second score as a matching result of the text pairs to be matched.
The present specification provides a text matching apparatus including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a text set and determining deformed texts of original texts in the text set, wherein the text set contains text pairs composed of the original texts with matching relations;
the characterization module is used for respectively obtaining characterization vectors of each original text and the deformation text thereof through a characterization model to be trained;
The first determining module is used for determining, for each text pair in the text set, the similarity between the characterization vectors of the two original texts in the text pair as a first similarity;
a second determining module, configured to determine, for each original text in the text pair, a similarity between the original text and a feature vector of a deformed text of the original text, as a second similarity, and determine, as each third similarity, a similarity between the original text and a feature vector of each other original text, respectively; wherein the other original texts are original texts except the text pair in the text set;
a third determining module, configured to determine a loss of the text pair according to the first similarity, each second similarity corresponding to the text, and each third similarity corresponding to the text pair, where the first similarity and each second similarity are inversely related to the loss, and each third similarity is positively related to the loss of the text pair;
the training module is used for determining total loss according to the determined loss of each text pair, and training the characterization model by taking the minimum total loss as an optimization target to obtain the trained characterization model;
The matching module is used for responding to a matching request carrying a text pair to be matched, inputting the text pair to be matched into the trained characterization model, and determining a matching result of the text pair to be matched according to the characterization vector output by the characterization model.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above text matching method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above text matching method when executing the program.
At least one of the technical solutions adopted in the present specification can achieve the following beneficial effects:
in the text matching method, the loss of the text pair is determined by utilizing the first similarity of the characterization vectors of two original texts in the matched text pair, the second similarity between the characterization vectors of each original text and each deformed text in the text pair and the third similarity between the characterization vectors of each original text and other original texts. The first similarity and the second similarity are both inversely related to the loss of the text pair, and the third similarity is positively related to the loss of the text pair. After determining the total loss according to the loss of each text pair and training with the minimum total loss as a target to obtain a trained characterization model, determining a characterization vector of the text pair to be matched through the trained characterization model in response to the matching request, so as to determine a matching result of the text pair to be matched according to the obtained characterization vector.
From the above, it can be seen that the text matching method provided in the present specification can output a characterization model of an accurate characterization vector by using a matched text pair through comparison and learning training, so as to improve the matching accuracy of text matching based on the characterization model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
fig. 1 is a schematic flow chart of a text matching method provided in the present specification;
FIG. 2 is a schematic representation of one characterization model provided in the present specification;
FIG. 3 is a schematic diagram of a text matching device provided in the present specification;
fig. 4 is a schematic diagram of an electronic device provided in the present specification.
Detailed Description
In a text matching scenario, whether a text can be encoded according to its semantics is a decisive factor in the quality of the resulting characterization vector. The present specification therefore aims to enable the characterization model that outputs the characterization vector of a text to learn the semantics of the text and accurately output a high-quality characterization vector, so that the matching result between texts can be accurately obtained based on the characterization vectors of the texts.
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a text matching method in the present specification, specifically including the following steps:
s100: and acquiring a text set, and determining deformed text of each original text in the text set, wherein the text set comprises text pairs consisting of the original texts with matching relations.
In this specification, the text matching method may be performed by a server. The server may be a single server or may be a cluster of servers, such as a distributed system.
The server can train the characterization model by executing the text matching method, and determine the matching result of the text pair to be matched according to the characterization vector output by the trained characterization model. In the present specification, step S100 to step S110 are training processes of the characterization model, and step S112 is a process of determining a matching result of the text pair to be matched based on the trained characterization model.
In the present specification, the labeled text corpus is used as a training sample to train the model. The characterization model is a deep learning model.
In addition, the present specification adopts a contrast learning method to train the model.
Through contrastive learning, the characterization model can learn, based on the semantics of texts, the differences in similarity between contrasted texts and learn to distinguish different texts. The characterization model can then accurately distinguish similar texts from dissimilar texts and accurately output characterization vectors of the texts, so that the more similar two texts are, the closer their characterization vectors output by the model are in the feature space, and the less similar two texts are, the farther apart their characterization vectors are.
Contrastive learning is a form of self-supervised learning, while the texts used to train the characterization model in this specification are determined based on annotated texts. The specification therefore combines a supervised mode with a self-supervised mode, training the characterization model in a self-supervised manner on annotated data so as to improve the accuracy of the characterization vectors output by the model.
In training the representation model, first, the server may obtain a set of text.
Wherein the text set contains text pairs composed of original texts with matching relations. The matching relationship is determined based on the annotation. That is, at least a portion of the original text in the set of text contains annotations. When one original text is marked as matching with the other original text, the two have matching relationship, otherwise, the two do not have matching relationship.
In order to enable the characterization model to recognize the semantics of complex texts, after determining the original texts in the text set, the server can determine the deformed text of each original text so as to train the characterization model based on the original texts and their deformed texts. In this way, even for complex, deformed texts, the characterization model can accurately identify their semantics and output accurate characterization vectors.
The original text is text which is not deformed, and the deformed text is obtained by adjusting the original text.
Specifically, the server may obtain, for each original text in the text set, a deformed text of the original text by using a preset data enhancement method.
S102: and respectively obtaining the characterization vectors of each original text and the deformed text thereof through the characterization model to be trained.
In one or more embodiments of the present disclosure, after determining each original text and its deformed text, the server may input the text into the characterization model for each of the original text and the deformed text, to obtain a characterization vector of the text.
After the original texts and the deformed texts thereof are input into the characterization model, the characterization vectors corresponding to the original texts and the deformed texts thereof can be obtained respectively.
S104: for each text pair in the text set, determining the similarity between the characterization vectors of the two original texts in the text pair as a first similarity.
In one or more embodiments of the present description, whether there is a match between texts may be determined based on a similarity between texts. And the similarity between the texts can be determined based on the token vector of the text.
Therefore, after determining the characterization vectors corresponding to each original text and the deformed text thereof, the server can determine the similarity between the characterization vectors of the texts so as to determine the loss based on the similarity, and adjust the parameters of the characterization model according to the loss.
Parameters of the characterization model are adjusted based on the loss determined by the similarity, so that the characterization model can accurately output a characterization vector of the text. To achieve the following: for more similar texts, the closer the distance between the characterization vectors output by the characterization model in the feature space is, and for less similar texts, the farther the distance between the characterization vectors output by the characterization model in the feature space is.
First, the server may determine, for each text pair in the set of texts, a similarity between token vectors of two original texts in the text pair as a first similarity.
S106: for each original text in the text pair, determining the similarity between the original text and the characterization vector of the deformed text of the original text as a second similarity, and determining the similarity between the original text and the characterization vector of each other original text respectively as each third similarity; wherein the other original text is the original text except the text pair in the text set.
The server may also determine, for each original text in the text pair, a similarity between the original text and a token vector of a deformed text of the original text as a second similarity, and a similarity between the original text and token vectors of respective other original texts as respective third similarities.
Wherein the other original text is the original text except the text pair in the text set. That is, the original text having the matching relationship may be included, or the original text not having the matching relationship with any of the original texts in the text set may be included.
S108: and determining the loss of the text pair according to the first similarity, each second similarity corresponding to the text and each third similarity corresponding to the text pair, wherein the first similarity and each second similarity are inversely related to the loss, and each third similarity is positively related to the loss of the text pair.
After determining, for each text pair, the respective similarities corresponding to the text pair, the server may determine a loss of the text pair based on the first similarity, the respective second similarities corresponding to the text pair, and the respective third similarities corresponding to the text pair.
As described above, the present specification trains the characterization model by means of contrastive learning. In addition to making the characterization model learn, by comparing similar or matched texts, for which texts it should output similar characterization vectors, the characterization model can also be made to learn, by comparing dissimilar texts, for which texts it should output dissimilar characterization vectors.
Moreover, the similarity between the characterization vectors of texts that do not belong to the same text pair should be lower than that between the matched texts of the same text pair.
Thus, the first similarity and each second similarity are inversely related to the loss of the text pair. Each third similarity is positively correlated with the loss of the text pair.
S110: and determining total loss according to the determined loss of each text pair, and training the characterization model by taking the minimum total loss as an optimization target to obtain the trained characterization model.
After determining the corresponding loss of each text pair, the server can train the representation model according to the corresponding loss of each text pair until the convergence condition is determined to be met, so as to obtain the trained representation model.
For example, after each training round, characterization vectors may be output through the characterization model for a labelled test set of text pairs, a matching result may be determined for each text pair based on the characterization vectors, and the F1 score of the characterization model may then be calculated from the obtained matching results and the labels.
The matching result may be set as required. For example, in a binary classification scenario (where the types of matching result may include matching and non-matching), the matching result of a text pair may be a score in [0,1]; when the score is greater than, say, 0.6, the text pair may be determined to match, and otherwise not to match. In a multi-classification scenario (where the types of matching result may include dissimilar, generally similar and similar), the classification result may include a score for each type.
When the difference between the F1 scores obtained in two adjacent rounds is not larger than a preset threshold, the characterization model can be determined to have converged, and training of the characterization model can be stopped.
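As a sketch of this convergence check (not part of the original text), assuming the predicted and labelled match results of the test pairs are already available as 0/1 values and scikit-learn is used for the F1 score:
```python
from sklearn.metrics import f1_score

def check_convergence(labels, preds, prev_f1, threshold=0.001):
    """Return (converged, current_f1): training may stop once the F1 scores of
    two adjacent training rounds differ by no more than the preset threshold."""
    current_f1 = f1_score(labels, preds)
    return abs(current_f1 - prev_f1) <= threshold, current_f1

# usage per training round (labels/preds come from the held-out test pairs):
# converged, last_f1 = check_convergence(test_labels, test_preds, last_f1)
```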
In one or more embodiments of the present disclosure, for each text pair, a similarity between the token vectors of the text pair may be input into a preset activation function, so as to obtain a matching result.
In a binary classification scenario, the activation function may be a sigmoid function; in a multi-classification scenario, it may be a softmax function.
It should be noted that, in the multi-classification scenario, the text pairs with the matching result being generally similar and similar can be regarded as text pairs with matching relationship. Alternatively, only text pairs whose matching results are similar may be regarded as text pairs having a matching relationship.
S112: and responding to a matching request carrying a text pair to be matched, inputting the text pair to be matched into the trained characterization model, and determining a matching result of the text pair to be matched according to a characterization vector output by the characterization model.
After the trained characterization model is obtained, the server can output the characterization vector of the text in the text pair to be matched according to the characterization model so as to determine the matching result of the text pair to be matched.
The server can receive a matching request carrying a text pair to be matched, input the text pair to be matched into a trained characterization model in response to the matching request carrying the text pair to be matched, and determine a matching result of the text pair to be matched according to a characterization vector output by the characterization model.
When determining the matching result of the text pair to be matched according to the characterization vector output by the characterization model, specifically, the server can determine the similarity of the text pair to be matched according to the characterization vector output by the characterization model. And inputting the determined similarity into a preset activation function to obtain a matching result of the text pair to be matched.
As described above, the matching result may correspond to a classification result or a multi-classification result.
When the server inputs the determined similarity into a preset activation function to obtain a matching result of the text pair to be matched, the similarity can be input into a preset first activation function when the relation of the text pair to be matched is classified into two categories, and a first score of the text pair to be matched as the matching relation is obtained. When the first score is larger than a preset threshold value, the text pair to be matched can be determined to have a matching relationship as a matching result, otherwise, the text pair to be matched is determined to have no matching relationship as a matching result.
When multi-classification is performed on the relation of the text pair to be matched, the similarity can be input into a preset second activation function to obtain second scores of the text pair to be matched corresponding to the preset matching degrees respectively, and the matching degree corresponding to the largest second score is used as the matching result of the text pair to be matched.
Wherein the first activation function may be a sigmoid function. The second activation function may be a softmax function.
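As an illustration (not part of the original text), the following PyTorch sketch shows one way the similarity could be turned into a matching result; the 0.6 threshold and the three matching degrees echo the examples above, and the assumption that the multi-class case starts from a vector of per-degree scores rather than a single scalar is purely illustrative:
```python
import torch

def binary_match(similarity: float, threshold: float = 0.6) -> bool:
    """Binary case: sigmoid turns the similarity into a first score; a score
    above the preset threshold means the pair has a matching relationship."""
    first_score = torch.sigmoid(torch.tensor(similarity))
    return bool(first_score > threshold)

def multiclass_match(degree_logits: torch.Tensor,
                     degrees=("dissimilar", "generally similar", "similar")) -> str:
    """Multi-class case: softmax yields a second score per matching degree;
    the degree with the largest score is the matching result."""
    second_scores = torch.softmax(degree_logits, dim=-1)
    return degrees[int(second_scores.argmax())]

# e.g. multiclass_match(torch.tensor([0.2, 1.5, 0.9])) -> "generally similar"
```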
Based on the text matching method shown in fig. 1, the loss of the text pair is determined by using the first similarity of the characterization vectors of two original texts in the matched text pair, the second similarity between the characterization vectors of each original text and each deformed text in the text pair, and the third similarity of the characterization vectors of each original text and other original texts. The first similarity and the second similarity are both inversely related to the loss of the text pair, and the third similarity is positively related to the loss of the text pair. After determining the total loss according to the loss of each text pair and training with the minimum total loss as a target to obtain a trained characterization model, determining a characterization vector of the text pair to be matched through the trained characterization model in response to the matching request, so as to determine a matching result of the text pair to be matched according to the obtained characterization vector.
According to the method, the matched text pairs can be utilized, the characterization model of the accurate characterization vector can be output through comparison learning training, and the matching accuracy of text matching based on the characterization model is improved.
In one or more embodiments of the present disclosure, the predetermined data enhancement mode may be a position transformation. I.e. shuffle.
In step S100, when obtaining the deformed text of the original text by the preset data enhancement method, specifically, the server may determine each word element of the original text. And obtaining the deformed text of the original text by carrying out position transformation on at least part of the word elements of the original text.
In one or more embodiments of the present disclosure, the preset data enhancement mode may also be random zeroing (dropout).
In step S100, when obtaining the deformed text of the original text by the preset data enhancement method, the server may further determine each word element of the original text. And randomly zeroing part of the word elements of the original text according to a preset first probability to obtain a deformed text of the original text.
In one or more embodiments of the present description, each original text in the text set may be the text itself. Alternatively, it may be a text matrix obtained by encoding the text through an embedding layer, specifically a matrix obtained by tokenizing the original text and then embedding each of its word elements. That is, each original text in the text set may also be a text in matrix form: each word element of the original text corresponds to an embedding vector, and the original text is the matrix formed by these embedding vectors. In the above embodiments, the determined word element may then refer to the embedding vector corresponding to that word element of the original text.
In addition, the preset data enhancement mode may also be clipping (cutoff). Specifically, one of token clipping (token clipping) and feature clipping (feature clipping) may be included. Wherein a feature is a value in the matrix to which the original text corresponds.
When the data enhancement mode is word clipping, the server can randomly determine a target word from each word of the original text. And setting the embedded vector of the target word element to zero to obtain the deformed text of the original text.
When the data enhancement mode is feature clipping, the server can determine any one row of features or any one column of features from a matrix corresponding to the original text as a target feature, and zero the target feature to obtain a deformed text of the original text.
Alternatively, in one or more embodiments of the present disclosure, at least some of the three data enhancement modes described above may be combined to obtain the deformed text of the original text. For example, the positions of the word elements of the original text may be transformed first and the result then randomly zeroed, to obtain the deformed text of the original text.
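As an illustration of the three enhancement modes (a sketch only, not the patent's implementation), the following operates on an original text represented as an embedding matrix with one row per word element; the shapes, probabilities and combination order are assumptions:
```python
import torch

def shuffle_tokens(text: torch.Tensor) -> torch.Tensor:
    """Position transformation: permute the word elements of the text."""
    perm = torch.randperm(text.size(0))
    return text[perm]

def random_zero_tokens(text: torch.Tensor, first_prob: float = 0.1) -> torch.Tensor:
    """Random zeroing: zero each word element's embedding with the first probability."""
    keep = (torch.rand(text.size(0)) >= first_prob).float().unsqueeze(-1)
    return text * keep

def cutoff(text: torch.Tensor, mode: str = "token") -> torch.Tensor:
    """Clipping: zero one randomly chosen row (token cutoff) or column (feature cutoff)."""
    out = text.clone()
    if mode == "token":
        out[torch.randint(text.size(0), (1,))] = 0.0
    else:
        out[:, torch.randint(text.size(1), (1,))] = 0.0
    return out

# original text as an L x d embedding matrix; modes may also be combined,
# e.g. position transformation followed by random zeroing:
x = torch.randn(12, 768)
deformed = random_zero_tokens(shuffle_tokens(x))
```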
In the present specification, a word element may correspond to a single word, or may also correspond to a phrase, etc., which is not limited herein.
Additionally, in one or more embodiments of the present description, the characterization model may also be trained based solely on matching text. That is, the original text in the text collection that does not have a matching relationship with any of the original text may not participate in training the characterization model.
Then, when determining the deformed text of each original text in the text set in step S100, the server may specifically determine, from each original text in the text set, an original text having a matching relationship with other original texts as a matching text. Then, the deformed text of each matching text is determined. That is, for an original text in the text set that does not have a matching relationship with any other original text, its deformed text may not be determined.
Then, in step S102, when the token vectors of the original texts and the deformed texts thereof are obtained through the token model to be trained, the server may obtain the token vectors corresponding to the matching texts and the deformed texts thereof through the token model to be trained.
Further, in step S106, when determining, for each original text in the text pair, the similarities between its characterization vector and those of the other original texts as the third similarities, the other original texts used are the original texts of the other text pairs. Original texts that have no matching relationship with any original text are not used as other original texts and do not participate in the calculation of the third similarities.
In addition, when determining the loss of the text pair according to the first similarity, the second similarities corresponding to the text, and the third similarities corresponding to the text pair in step S108, the server may determine, for each original text in the text pair, the loss corresponding to the original text according to the first similarity, the second similarities corresponding to the original text, and the third similarities corresponding to the original text.
Then, the loss of the text pair can be determined according to the loss corresponding to each original text in the text pair.
For ease of description, for each text pair, one of the original text in the text pair is taken as the first text and the other text is taken as the second text.
The first similarity of the text pair is the similarity between the token vectors of the first text and the second text of the text pair.
In step S106, the server may then determine, for each text pair, a similarity between the token vector of the first text of the text pair and the deformed text of the first text as a second similarity of the first text. And taking the similarity between the second text of the text pair and the characterization vector of the deformed text of the second text as the second similarity of the second text.
And, the server may determine the similarity between the first text and the token vector of each other original text, as each third similarity corresponding to the first text. And determining the similarity between the second text and the characterization vectors of the other original texts respectively, and taking the similarity as each third similarity corresponding to the second text. As described above, each other original text is an original text other than the text pair in the text set.
Then, the server may determine a loss corresponding to the first text as a first loss according to the first similarity, the second similarity corresponding to the first text, and the third similarity corresponding to the first text. And determining a loss corresponding to the second text as a second loss according to the first similarity, the second similarity corresponding to the second text, and the third similarity corresponding to the second text.
The server may then determine the loss of the text pair based on the first loss and the second loss. Specifically, the penalty for the text pair may be obtained by summing the first penalty and the second penalty.
Wherein the first similarity is inversely related to the first penalty and inversely related to the second penalty. The second similarity corresponding to the first text is inversely related to the first penalty. The second similarity corresponding to the second text is inversely related to the second penalty. The third similarity corresponding to the first text is positively correlated with the first penalty. The third similarity corresponding to the second text is positively correlated with the second penalty.
In one or more embodiments of the present description, the first loss may be determined by the following formula:
[first-loss formula, rendered only as an image in the original]
In the formula, Loss_A^i denotes the first loss of the i-th text pair. In this specification, the text set may contain at least one batch of text pairs, and N denotes the total number of text pairs. A denotes the first text, B denotes the second text matching the first text, and A′ denotes the deformed text of the first text. h_A^i, h_B^i and h_A′^i are the characterization vectors of the first text, the second text and the deformed text of the first text of the i-th text pair respectively, and h_A^j is the characterization vector of the first text of the j-th text pair. In calculating the first loss of the i-th text pair, the first text of the i-th text pair is taken as the positive sample, its deformed text as the enhanced positive sample, and the second text of the i-th text pair as the weak positive sample, while the first texts of the other text pairs serve as negative samples; in this specification, positive and negative samples are thus relative. λ is a preset weight coefficient used to adjust the proportions of the enhanced positive sample and the weak positive sample in the loss, τ is a preset temperature coefficient, and sim(·,·) denotes similarity. For example, sim(h_A^i, h_B^i), the similarity between the characterization vectors of the first text and the second text of the i-th text pair, is the first similarity, sim(h_A^i, h_A′^i) is the second similarity, and sim(h_A^i, h_A^j) is a third similarity.
In one or more embodiments of the present disclosure, when determining the similarity, the cosine similarity between the characterization vectors may be used, i.e. sim(u, v) = u·v / (‖u‖·‖v‖). Alternatively, the Euclidean distance between the characterization vectors may be determined, and the similarity between the characterization vectors may be determined according to the Euclidean distance.
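A small sketch of these two similarity choices; the way a Euclidean distance is converted into a similarity is not specified above, so the reciprocal form used here is only an assumption:
```python
import torch
import torch.nn.functional as F

def cosine_similarity(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two characterization vectors."""
    return F.cosine_similarity(u, v, dim=-1)

def euclidean_similarity(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Similarity derived from the Euclidean distance (assumed form: 1 / (1 + d))."""
    return 1.0 / (1.0 + torch.norm(u - v, dim=-1))
```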
In this specification, the first texts of different text pairs may have commonality, and the second texts of different text pairs may also have commonality. For example, when the method is applied in a question-answering system, the first texts may all be question texts and the second texts may all be answer texts. Alternatively, there may be no commonality among the first texts of different text pairs or among the second texts, for example in a text deduplication scenario or a similar-meaning text recognition scenario.
In one or more embodiments of the present disclosure, the second loss may be determined by the following formula:
[second-loss formula, rendered only as an image in the original]
In the formula, Loss_B^i denotes the second loss of the i-th text pair, B′ denotes the deformed text of the second text of the i-th text pair, and h_B^j is the characterization vector of the second text of the j-th text pair. The explanations of the other symbols are the same as for the formula corresponding to the first loss and are not repeated here. sim(h_B^i, h_B′^i) is the second similarity, and sim(h_B^i, h_B^j) is a third similarity.
As described above, for each text pair, the first loss and the second loss of that text pair may be summed to obtain the loss of that text pair, i.e. Loss_i = Loss_A^i + Loss_B^i.
The total loss is then determined from the losses of the individual text pairs:
[total-loss formula, rendered only as an image in the original]
In the formula, Loss denotes the total loss, Loss_i is the loss of the i-th text pair, and N is still the number of text pairs.
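Since the loss formulas appear only as images in the original, the following PyTorch sketch shows one InfoNCE-style form that is consistent with the description above — enhanced and weak positives weighted by the coefficient in the numerator, third similarities to the corresponding texts of other pairs in the denominator, a temperature throughout, first and second losses summed per pair, and pair losses accumulated into the total loss. The exact arrangement in the patent may differ.
```python
import torch
import torch.nn.functional as F

def pair_loss(h_a, h_b, h_a_def, h_b_def, h_a_all, h_b_all, i,
              lam: float = 0.5, tau: float = 0.05) -> torch.Tensor:
    """Hedged sketch of the loss of the i-th text pair.

    h_a, h_b         : characterization vectors of the first / second text of pair i
    h_a_def, h_b_def : characterization vectors of their deformed texts
    h_a_all, h_b_all : (N, d) characterization vectors of the first / second texts of all pairs
    lam              : preset weight coefficient balancing enhanced vs. weak positives
    tau              : preset temperature coefficient
    """
    def sim(u, v):
        return F.cosine_similarity(u, v, dim=-1)

    def one_side(anchor, enhanced_pos, weak_pos, negatives):
        # numerator: enhanced positive (deformed text) and weak positive (matched text)
        pos = lam * torch.exp(sim(anchor, enhanced_pos) / tau) \
            + (1 - lam) * torch.exp(sim(anchor, weak_pos) / tau)
        # denominator adds the third similarities to the corresponding texts of other pairs
        mask = torch.ones(negatives.size(0), dtype=torch.bool)
        mask[i] = False
        neg = torch.exp(sim(anchor.unsqueeze(0), negatives[mask]) / tau).sum()
        return -torch.log(pos / (pos + neg))

    first_loss = one_side(h_a, h_a_def, h_b, h_a_all)   # anchored on the first text
    second_loss = one_side(h_b, h_b_def, h_a, h_b_all)  # anchored on the second text
    return first_loss + second_loss

# total loss over a batch of N text pairs (this sketch assumes a plain sum):
# total = sum(pair_loss(..., i=i) for i in range(N))
```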
In addition, the server may determine, for each text pair, the similarity between the characterization vectors of the deformed text of the first text and the second text of the pair, and the similarity between the characterization vectors of the deformed text of the second text and the first text of the pair.
That is, the server may determine, for each original text of each text pair in the text set, the similarity between the characterization vectors of the deformed text of that original text and the other original text in the text pair, as a fourth similarity.
In step S108, when determining the loss of the text pair according to the first similarity, the second similarities corresponding to the text, and the third similarities corresponding to the text pair, the server may further determine the loss of the text pair according to the first similarity, the second similarities corresponding to the text, the third similarities corresponding to the text, and the fourth similarities corresponding to the text pair.
Wherein the fourth similarity is inversely related to the loss of the text pair.
Specifically, the server may determine, for each original text in the text pair, a loss corresponding to the original text according to the first similarity, the second similarity corresponding to the original text, the third similarities corresponding to the original text, and the fourth similarity. Wherein the fourth similarity corresponding to the original text is likewise inversely related to the loss corresponding to the original text.
Then, the loss of the text pair can be determined according to the loss corresponding to each original text of the text pair.
That is, the server may determine the first loss of the first text based on the first similarity between the characterization vectors of the first text and the second text of the text pair, the second similarity between the characterization vectors of the first text and its deformed text, the fourth similarity between the characterization vectors of the deformed text of the first text and the second text, and the third similarities between the characterization vectors of the first text and the other original texts.
Likewise, the server may determine the second loss of the second text based on the first similarity between the characterization vectors of the first text and the second text of the text pair, the second similarity between the characterization vectors of the second text and its deformed text, the fourth similarity between the characterization vectors of the deformed text of the second text and the first text, and the third similarities between the characterization vectors of the second text and the other original texts.
Alternatively, in one or more embodiments of the present disclosure, the server may determine the first loss of the first text based only on the second similarity between the characterization vectors of the first text and its deformed text, the fourth similarity between the characterization vectors of the deformed text of the first text and the second text, and the third similarities between the characterization vectors of the first text and the other original texts.
And the server may determine the second loss of the second text based only on the second similarity between the characterization vectors of the second text and its deformed text, the fourth similarity between the characterization vectors of the deformed text of the second text and the first text, and the third similarities between the characterization vectors of the second text and the other original texts.
Then, in one or more embodiments of the present disclosure, the first loss and the second loss may also be determined by corresponding formulas that additionally contain the fourth similarities:
[alternative first-loss and second-loss formulas, rendered only as images in the original]
In these formulas, sim(h_A′^i, h_B^i) and sim(h_B′^i, h_A^i) are both fourth similarities.
For ease of understanding, the present specification also provides a schematic representation of the characterization model, as shown in fig. 2. As shown in the figure, the deformed text of the first text of a text pair can be obtained by performing data enhancement on the first text. After the first text, the deformed text of the first text and the second text matched with the first text are input into the characterization model, the characterization vector of the first text, the characterization vector of the deformed text and the characterization vector of the second text can be obtained respectively. In the figure, the shapes formed by a rectangle and two unfilled circles represent the characterization vectors corresponding to the first text, the deformed text of the first text and the second text, while the shapes consisting of a rectangle and two diagonally filled circles represent the characterization vectors of the other original samples.
As shown in fig. 2, the characterization model is trained with the aim of drawing the characterization vectors of the first text and of its deformed text in the same text pair closer in the feature space, and pushing the characterization vectors of the original texts of different text pairs farther apart. By learning the semantics of texts, the differences between texts with similar semantics and the differences between texts with different semantics, the characterization model can accurately output the characterization vectors of texts. Of course, fig. 2 only exemplarily shows part of the texts corresponding to a training sample and their characterization vectors, and only exemplifies the objective of drawing characterization vectors closer. As described above, drawing closer the characterization vectors of the first text and the second text of a training sample in the feature space, or of the second text and its deformed text, and so on, can all be objectives of training the characterization model in the present specification, which are not listed here one by one.
In one or more embodiments of the present description, the characterization model may be composed of a pre-trained feature model and a weak encoder. For example, the pre-trained feature model may be a BERT (Bidirectional Encoder Representations from Transformers) model, and the weak encoder may be a linear layer, which may be used to prevent the characterization model from collapsing; for example, a linear layer with 512 nodes. Since the input dimension of the feature model may be fixed, the text may be pre-processed before being input into the feature model, and the pre-processing may include truncation and padding so that the dimension of the text matches the input dimension of the feature model.
Then, in step S102, when the token vectors of each original text and its deformed text are obtained through the token model to be trained, the server may input the text into the feature model for each text of each original text and each deformed text, to obtain the coding feature of the text, and input the coding feature of the text into the weak encoder, to obtain the token vector of the text.
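A minimal sketch of such a characterization model, assuming the Hugging Face transformers BERT implementation, a 512-node linear layer as the weak encoder, and the vector at the first ([CLS]) position as the coding feature; the model name, maximum length and choice of coding feature are assumptions:
```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class CharacterizationModel(nn.Module):
    """Pre-trained feature model (BERT) followed by a weak encoder (linear layer)."""
    def __init__(self, name: str = "bert-base-chinese", hidden: int = 512):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(name)
        self.feature_model = BertModel.from_pretrained(name)
        self.weak_encoder = nn.Linear(self.feature_model.config.hidden_size, hidden)

    def forward(self, texts):
        # preprocessing: truncation and padding to the fixed input length
        enc = self.tokenizer(texts, padding="max_length", truncation=True,
                             max_length=128, return_tensors="pt")
        # coding feature of each text (assumed here: the [CLS] position)
        features = self.feature_model(**enc).last_hidden_state[:, 0]
        return self.weak_encoder(features)  # characterization vectors

# vectors = CharacterizationModel()(["text A", "text B"])  # shape (2, 512)
```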
Also, to enhance the performance of the characterization model, dropouts may be performed in the feature model to enhance the perturbation.
In one or more embodiments of the present description, the feature model may include a number of encoders.
When the text is input into the feature model to obtain the coding feature of the text, the parameters of the encoders of the feature model can be randomly set to zero according to a preset second probability; that is, dropout is performed on the parameters of the encoders of the feature model according to the preset second probability.
And then inputting the text into a feature model after randomly zeroing to obtain the coding feature of the text.
For example, dropout may be performed specifically in the self-attention layers of the feature model, e.g. on the three matrices Q (query), K (key) and V (value) of each self-attention layer respectively. That is, the parameters of the encoder may be the Q, K and V matrices. The dropout rate may be set as desired, for example to 0.1, i.e. dropout is performed on one tenth of the values in each matrix.
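A sketch of where such a perturbation could be applied, assuming a Hugging Face BERT feature model; the zeroing is shown in place here for brevity, whereas in training it would typically be re-sampled for every forward pass like ordinary dropout rather than applied once and kept:
```python
import torch

def dropout_attention_params(bert_model, second_prob: float = 0.1):
    """Randomly zero a fraction of the Q, K and V projection weights of every encoder layer."""
    with torch.no_grad():
        for layer in bert_model.encoder.layer:
            attn = layer.attention.self
            for proj in (attn.query, attn.key, attn.value):
                # in a real training loop this mask would be re-sampled each forward
                # pass and applied non-destructively (e.g. on a copy of the weights)
                mask = (torch.rand_like(proj.weight) >= second_prob).float()
                proj.weight.mul_(mask)
```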
Alternatively, the characterization model may also be composed of a pre-trained feature model, an average pooling layer and a linear layer. The feature model is used to output a vector corresponding to each word element of each input text; the average pooling layer performs average pooling over the word-element vectors of the text; and the linear layer, as the weak encoder, outputs the characterization vector based on the vector obtained after average pooling, where that vector is the coding feature of the text output by the feature model.
In addition, in this specification, the text set used to train the characterization model may be derived from a chinese financial semantic similarity corpus, a credit text matching similarity corpus, a question semantic matching dataset, a synonym recognition dataset, and the like.
Fig. 3 is a schematic diagram of a text matching device provided in the present specification, where the device includes:
the obtaining module 200 is configured to obtain a text set, and determine deformed text of each original text in the text set, where the text set includes text pairs composed of original texts with matching relationships;
the characterization module 201 is configured to obtain, through a characterization model to be trained, a characterization vector of each original text and a deformed text thereof;
A first determining module 202, configured to determine, for each text pair in the text set, a similarity between token vectors of two original texts in the text pair as a first similarity;
a second determining module 203, configured to determine, for each original text in the text pair, a similarity between the original text and a token vector of a deformed text of the original text, as a second similarity, and determine, as each third similarity, a similarity between the original text and a token vector of each other original text, respectively; wherein the other original texts are original texts except the text pair in the text set;
a third determining module 204, configured to determine a loss of the text pair according to the first similarity, each second similarity corresponding to the text, and each third similarity corresponding to the text pair, where the first similarity and each second similarity are inversely related to the loss, and each third similarity is positively related to the loss of the text pair;
the training module 205 is configured to determine a total loss according to the determined loss of each text pair, and train the characterization model with the minimum total loss as an optimization target to obtain the trained characterization model;
And the matching module 206 is configured to input the text pair to be matched into the trained characterization model in response to a matching request carrying the text pair to be matched, and determine a matching result of the text pair to be matched according to a characterization vector output by the characterization model.
Optionally, the obtaining module 200 is further configured to obtain, for each original text in the text set, a deformed text of the original text by a preset data enhancement mode.
Optionally, the data enhancement mode is position transformation, and the obtaining module 200 is further configured to determine each word element of the original text, and obtain the deformed text of the original text by performing position transformation on at least part of the word elements of the original text.
Optionally, the data enhancement mode is random zeroing, and the obtaining module 200 is further configured to determine each word element of the original text, and to randomly set part of the word elements of the original text to zero according to a preset first probability to obtain the deformed text of the original text.
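For illustration, the two data enhancement modes described above can be sketched as follows; the tokenization, the number of swapped positions, and the choice of replacement token used for "zeroing" are assumptions for the sketch, not specifics given by this specification.

```python
import random

def position_transform(tokens, num_swaps=1):
    """Deform a text by swapping the positions of some of its word elements."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(num_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_zero(tokens, first_probability=0.15, zero_token="[PAD]"):
    """Deform a text by randomly zeroing word elements with a preset first probability.
    'Zeroing' is assumed here to mean replacing the token with a placeholder."""
    return [zero_token if random.random() < first_probability else t for t in tokens]

# Example usage with a hypothetical token list:
# position_transform(["how", "to", "repay", "the", "loan"])
# random_zero(["how", "to", "repay", "the", "loan"], first_probability=0.15)
```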
Optionally, the obtaining module 200 is further configured to determine, from the original texts in the text set, the original texts having a matching relationship with other original texts as matching texts, and to respectively determine the deformed text of each matching text.
Optionally, the characterization module 201 is further configured to obtain, through a characterization model to be trained, a characterization vector corresponding to each matching text and its deformed text.
Optionally, the apparatus further comprises:
a fourth determining module 207, configured to determine, for each original text of each text pair in the text set, the similarity between the characterization vectors of the deformed text of the original text and the other original text in the text pair as a fourth similarity.
Optionally, the third determining module 204 is further configured to determine the loss of the text pair according to the first similarity, each second similarity, each third similarity, and each fourth similarity corresponding to the text pair, where the fourth similarity is inversely related to the loss of the text pair.
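To make the relationship between these similarities and the loss concrete, the following schematic, contrastive-style loss is one possible reading; the use of cosine similarity, the temperature value, and the exact log-ratio form are our own assumptions and are not prescribed by this specification.

```python
import torch
import torch.nn.functional as F

def text_pair_loss(h_a, h_b, h_a_aug, h_b_aug, h_others, temperature=0.05):
    """Schematic loss for one text pair (a, b) with deformed texts a', b'.

    Positives (inversely related to the loss):
        sim(a, b)                 -> first similarity
        sim(a, a'), sim(b, b')    -> second similarities
        sim(a', b), sim(b', a)    -> fourth similarities
    Negatives (positively related to the loss):
        sim(a, other), sim(b, other) for originals outside the pair -> third similarities
    h_* are 1-D characterization vectors; h_others is a non-empty list of vectors.
    """
    def sim(x, y):
        return F.cosine_similarity(x, y, dim=-1) / temperature

    positives = torch.stack([
        sim(h_a, h_b),        # first similarity
        sim(h_a, h_a_aug),    # second similarity for a
        sim(h_b, h_b_aug),    # second similarity for b
        sim(h_a_aug, h_b),    # fourth similarity
        sim(h_b_aug, h_a),    # fourth similarity
    ])
    negatives = torch.stack([sim(h_a, h_o) for h_o in h_others]
                            + [sim(h_b, h_o) for h_o in h_others])
    pos_mass = positives.exp().sum()
    neg_mass = negatives.exp().sum()
    # larger positive similarities and smaller negative similarities reduce the loss
    return -torch.log(pos_mass / (pos_mass + neg_mass))
```

The total loss would then be obtained by summing (or averaging) such per-pair losses over all text pairs in the text set, as the training module does.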
Optionally, the characterization model is composed of a pre-trained feature model and a weak encoder, and the characterization module 201 is further configured to, for each text among the original texts and the deformed texts, input the text into the feature model to obtain the coding feature of the text, and input the coding feature of the text into the weak encoder to obtain the characterization vector of the text.
Optionally, the feature model includes a number of encoders, and the characterization module 201 is further configured to randomly set parameters of the encoders of the feature model to zero according to a preset second probability, and to input the text into the feature model after the random zeroing to obtain the coding feature of the text.
Optionally, the matching module 206 is further configured to determine the similarity of the text pair to be matched according to the characterization vector output by the characterization model, and to input the similarity into a preset activation function to obtain the matching result of the text pair to be matched.
Optionally, the matching result is a binary classification result or a multi-classification result, and the matching module 206 is further configured to: when the relationship of the text pair to be matched is binary-classified, input the similarity into a preset first activation function to obtain a first score indicating that the text pair to be matched has a matching relationship, determine that the text pair has a matching relationship as the matching result when the first score is greater than a preset threshold, and otherwise determine that the text pair does not have a matching relationship as the matching result; when the relationship of the text pair to be matched is multi-classified, input the similarity into a preset second activation function to obtain second scores of the text pair for each preset matching degree, and take the matching degree corresponding to the largest second score as the matching result of the text pair to be matched.
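An illustrative sketch of the two decision branches follows; the use of cosine similarity, a sigmoid as the first activation function, a softmax over a small linear head as the second activation function, and the threshold value are all placeholder assumptions rather than details fixed by this specification.

```python
import torch
import torch.nn.functional as F

def binary_match(vec_a, vec_b, threshold=0.5):
    """Binary matching: a sigmoid first score compared with a preset threshold."""
    similarity = F.cosine_similarity(vec_a, vec_b, dim=-1)
    first_score = torch.sigmoid(similarity)      # assumed first activation function
    return bool(first_score > threshold)         # True: the pair has a matching relationship

def multiclass_match(vec_a, vec_b, degree_head: torch.nn.Linear):
    """Multi-class matching: softmax second scores over preset matching degrees,
    returning the index of the degree with the largest score.
    degree_head is a hypothetical nn.Linear(1, num_degrees) layer."""
    similarity = F.cosine_similarity(vec_a, vec_b, dim=-1).unsqueeze(-1)
    second_scores = torch.softmax(degree_head(similarity), dim=-1)   # assumed second activation
    return int(second_scores.argmax())
```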
The present specification also provides a computer readable storage medium storing a computer program operable to perform the above text matching method.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 4. At the hardware level, as shown in fig. 4, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to implement the text matching method described above. Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present specification; that is, the execution subject of the processing flow is not limited to logic units, and may also be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements of method flows today can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without requiring the chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in purely computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for implementing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer readable medium.
Computer readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner, and identical or similar parts of the embodiments may be referred to one another; each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for the relevant parts, reference may be made to the description of the method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (15)

1. A text matching method, comprising:
acquiring a text set, and determining deformed text of each original text in the text set, wherein the text set comprises text pairs consisting of the original texts with matching relations;
respectively obtaining characterization vectors of each original text and the deformed text thereof through a characterization model to be trained;
for each text pair in the text set, determining the similarity between the characterization vectors of the two original texts in the text pair as a first similarity;
for each original text in the text pair, determining the similarity between the characterization vectors of the original text and its deformed text as a second similarity, and determining the similarities between the characterization vectors of the original text and each other original text as third similarities, respectively; wherein the other original texts are the original texts in the text set other than those in the text pair;
determining the loss of the text pair according to the first similarity, each second similarity and each third similarity corresponding to the text pair, wherein the first similarity and each second similarity are inversely related to the loss, and each third similarity is positively related to the loss of the text pair;
determining a total loss according to the determined loss of each text pair, and training the characterization model by taking the minimum total loss as an optimization target to obtain the trained characterization model;
and responding to a matching request carrying a text pair to be matched, inputting the text pair to be matched into the trained characterization model, and determining a matching result of the text pair to be matched according to a characterization vector output by the characterization model.
2. The method of claim 1, wherein determining deformed text for each original text in the set of text comprises:
and aiming at each original text in the text set, obtaining a deformed text of the original text by a preset data enhancement mode.
3. The method of claim 2, wherein the data enhancement mode is a position transformation;
obtaining a deformed text of the original text through a preset data enhancement mode, wherein the deformed text specifically comprises the following steps:
determining each word element of the original text;
and obtaining the deformed text of the original text by carrying out position transformation on at least part of the word elements of the original text.
4. The method of claim 2, wherein the data enhancement mode is random zeroing;
obtaining a deformed text of the original text through a preset data enhancement mode, wherein the deformed text specifically comprises the following steps:
determining each word element of the original text;
and randomly setting zero for part of the word elements of the original text according to a preset first probability to obtain the deformed text of the original text.
5. The method of claim 1, wherein determining deformed text for each original text in the set of text comprises:
determining original texts with matching relations with other original texts from all original texts in the text set, and taking the original texts as matching texts;
and respectively determining deformed texts of the matched texts.
6. The method of claim 5, wherein the characterization vectors of each original text and its deformed text are obtained by the characterization model to be trained, specifically comprising:
and respectively obtaining the matching texts and the characterization vectors corresponding to the deformation texts through the characterization model to be trained.
7. The method of claim 1, wherein the method further comprises:
for each original text of each text pair in the text set, determining the similarity between the characterization vectors of the deformed text of the original text and the other original text in the text pair as a fourth similarity.
8. The method of claim 7, wherein determining the loss of the text pair according to the first similarity, each second similarity and each third similarity corresponding to the text pair specifically comprises:
determining the loss of the text pair according to the first similarity, each second similarity, each third similarity, and each fourth similarity corresponding to the text pair;
wherein the fourth similarity is inversely related to the loss of the text pair.
9. The method of claim 1, wherein the characterization model consists of a pre-trained feature model and a weak encoder;
the method comprises the steps of respectively obtaining characterization vectors of each original text and deformation text thereof through a characterization model to be trained, wherein the method specifically comprises the following steps:
inputting each text in each original text and each deformed text into the feature model to obtain the coding feature of the text;
and inputting the coding characteristics of the text into the weak coder to obtain the characterization vector of the text.
10. The method of claim 9, wherein the feature model comprises a number of encoders;
inputting the text into the feature model to obtain the coding feature of the text, which comprises the following steps:
randomly setting zero for parameters of an encoder of the feature model according to a preset second probability;
and inputting the text into the feature model after random zeroing to obtain the coding feature of the text.
11. The method of claim 1, wherein determining the matching result of the text pair to be matched according to the token vector output by the token model specifically comprises:
determining the similarity of the text pairs to be matched according to the characterization vector output by the characterization model;
and inputting the similarity into a preset activation function to obtain a matching result of the text pair to be matched.
12. The method of claim 11, wherein the matching result corresponds to a bi-classification result or a multi-classification result;
inputting the similarity into a preset activation function to obtain a matching result of the text pair to be matched, wherein the matching result comprises the following specific steps:
when the relationship of the text pair to be matched is binary-classified, inputting the similarity into a preset first activation function to obtain a first score indicating that the text pair to be matched has a matching relationship, and when the first score is greater than a preset threshold, determining that the text pair to be matched has a matching relationship as the matching result, otherwise determining that the text pair to be matched does not have a matching relationship as the matching result;
when the relationship of the text pair to be matched is multi-classified, inputting the similarity into a preset second activation function to obtain second scores of the text pair to be matched corresponding to each preset matching degree, and taking the matching degree corresponding to the largest second score as the matching result of the text pair to be matched.
13. A text matching apparatus, comprising:
an acquisition module, configured to acquire a text set and determine the deformed text of each original text in the text set, wherein the text set contains text pairs composed of original texts with matching relations;
the characterization module is used for respectively obtaining characterization vectors of each original text and the deformation text thereof through a characterization model to be trained;
the first determining module is used for determining, for each text pair in the text set, the similarity between the characterization vectors of the two original texts in the text pair as a first similarity;
a second determining module, configured to determine, for each original text in the text pair, the similarity between the characterization vectors of the original text and its deformed text as a second similarity, and to determine the similarities between the characterization vectors of the original text and each other original text as third similarities, respectively; wherein the other original texts are the original texts in the text set other than those in the text pair;
a third determining module, configured to determine the loss of the text pair according to the first similarity, each second similarity and each third similarity corresponding to the text pair, where the first similarity and each second similarity are inversely related to the loss, and each third similarity is positively related to the loss of the text pair;
the training module is used for determining total loss according to the determined loss of each text pair, and training the characterization model by taking the minimum total loss as an optimization target to obtain the trained characterization model;
the matching module is used for responding to a matching request carrying a text pair to be matched, inputting the text pair to be matched into the trained characterization model, and determining a matching result of the text pair to be matched according to the characterization vector output by the characterization model.
14. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-12.
15. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-12 when executing the program.
CN202310413634.9A 2023-04-18 2023-04-18 Text matching method, device, storage medium and equipment Active CN116150380B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310413634.9A CN116150380B (en) 2023-04-18 2023-04-18 Text matching method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310413634.9A CN116150380B (en) 2023-04-18 2023-04-18 Text matching method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN116150380A true CN116150380A (en) 2023-05-23
CN116150380B CN116150380B (en) 2023-06-27

Family

ID=86354636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310413634.9A Active CN116150380B (en) 2023-04-18 2023-04-18 Text matching method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN116150380B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259113A (en) * 2020-01-15 2020-06-09 腾讯科技(深圳)有限公司 Text matching method and device, computer readable storage medium and computer equipment
CN112749540A (en) * 2020-02-06 2021-05-04 腾讯科技(深圳)有限公司 Text matching method, training method, device and equipment of text matching model
US20210365444A1 (en) * 2020-05-20 2021-11-25 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing dataset
CN113836885A (en) * 2020-06-24 2021-12-24 阿里巴巴集团控股有限公司 Text matching model training method, text matching device and electronic equipment
US20220129638A1 (en) * 2020-10-23 2022-04-28 Google Llc Systems and Methods for Machine-Learned Prediction of Semantic Similarity Between Documents
US20220172065A1 (en) * 2020-11-30 2022-06-02 Mercari, Inc. Automatic ontology generation by embedding representations
CN112527967A (en) * 2020-12-17 2021-03-19 重庆金融资产交易所有限责任公司 Text matching method, device, terminal and storage medium
CN113420128A (en) * 2021-08-23 2021-09-21 腾讯科技(深圳)有限公司 Text matching method and device, storage medium and computer equipment
CN114077841A (en) * 2021-11-18 2022-02-22 平安普惠企业管理有限公司 Semantic extraction method and device based on artificial intelligence, electronic equipment and medium
CN114490923A (en) * 2021-11-29 2022-05-13 腾讯科技(深圳)有限公司 Training method, device and equipment for similar text matching model and storage medium
CN114818729A (en) * 2022-04-28 2022-07-29 阳光保险集团股份有限公司 Method, device and medium for training semantic recognition model and searching sentence
CN114756659A (en) * 2022-05-10 2022-07-15 平安科技(深圳)有限公司 Language model training method, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QIANG SUN ET AL.: "A Multi-level Attention Model for Text Matching", Artificial Neural Networks and Machine Learning - ICANN 2018, pages 142 - 153 *
周艳平 et al.: "Text Similarity Matching Model Based on Positive and Negative Samples and Bi-LSTM", Computer Systems & Applications, pages 175 - 180 *
赵宇: "Research on Text Matching Based on Hierarchical Multi-word Expressions", China Master's Theses Full-text Database (Information Science and Technology), pages 138 - 1351 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117252183A (en) * 2023-10-07 2023-12-19 之江实验室 Semantic-based multi-source table automatic matching method, device and storage medium
CN117252183B (en) * 2023-10-07 2024-04-02 之江实验室 Semantic-based multi-source table automatic matching method, device and storage medium

Also Published As

Publication number Publication date
CN116150380B (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN117235226A (en) Question response method and device based on large language model
CN116227474B (en) Method and device for generating countermeasure text, storage medium and electronic equipment
CN111881973A (en) Sample selection method and device, storage medium and electronic equipment
CN113221555B (en) Keyword recognition method, device and equipment based on multitasking model
CN116150380B (en) Text matching method, device, storage medium and equipment
CN111401062B (en) Text risk identification method, device and equipment
CN115712866B (en) Data processing method, device and equipment
CN112417093B (en) Model training method and device
CN116303989A (en) Patent retrieval method, device and equipment for multiple retrieval scenes
CN113887206B (en) Model training and keyword extraction method and device
CN117113174A (en) Model training method and device, storage medium and electronic equipment
CN116630480A (en) Interactive text-driven image editing method and device and electronic equipment
CN116863484A (en) Character recognition method, device, storage medium and electronic equipment
CN114912513A (en) Model training method, information identification method and device
CN117034942B (en) Named entity recognition method, device, equipment and readable storage medium
CN117033469B (en) Database retrieval method, device and equipment based on table semantic annotation
CN114611517B (en) Named entity recognition method, device, equipment and medium based on deep learning
CN117079646B (en) Training method, device, equipment and storage medium of voice recognition model
CN116451808B (en) Model training method and device, storage medium and electronic equipment
CN117271611B (en) Information retrieval method, device and equipment based on large model
CN117494068B (en) Network public opinion analysis method and device combining deep learning and causal inference
CN117252183B (en) Semantic-based multi-source table automatic matching method, device and storage medium
CN117171346A (en) Entity linking method and device, storage medium and electronic equipment
CN118193797A (en) Method and device for executing service, storage medium and electronic equipment
CN118194949A (en) Training method, device and storage medium for keyword extraction model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant