CN107729300B - Text similarity processing method, device and equipment and computer storage medium


Info

Publication number
CN107729300B
CN107729300B
Authority
CN
China
Prior art keywords
similarity
text
similarity determination
splicing
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710841945.XA
Other languages
Chinese (zh)
Other versions
CN107729300A (en)
Inventor
范淼 (Fan Miao)
李传勇 (Li Chuanyong)
孙明明 (Sun Mingming)
施鹏 (Shi Peng)
冯悦 (Feng Yue)
李平 (Li Ping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd filed Critical Baidu Online Network Technology Beijing Co Ltd
Priority to CN201710841945.XA priority Critical patent/CN107729300B/en
Publication of CN107729300A publication Critical patent/CN107729300A/en
Application granted granted Critical
Publication of CN107729300B publication Critical patent/CN107729300B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text similarity processing method, apparatus, device and computer storage medium. The method comprises the following steps: obtaining similarity determination results of a text pair produced by a plurality of similarity determination methods; splicing the similarity determination results of the text pair to obtain a splicing feature; and taking the splicing feature as the input of a similarity determination model and obtaining the text similarity of the text pair according to the output of the similarity determination model, wherein the similarity determination model is trained in advance. The technical scheme of the invention integrates the similarity determination results obtained by multiple similarity determination methods and improves the accuracy of determining the text similarity of the text pair, so that the accuracy of the integrated similarity calculation is higher than that of any single similarity calculation mode.

Description

Text similarity processing method, device and equipment and computer storage medium
[ technical field ]
The present invention relates to natural language processing technologies, and in particular, to a method, an apparatus, a device, and a computer storage medium for processing text similarity.
[ background of the invention ]
Many internet applications (e.g., search engines, question and answer platforms) rely on accurate text similarity calculation to provide users with content that matches an input query or a posed question, so text similarity calculation has long been a research and development issue in need of improvement. Many text similarity calculation methods exist in the prior art, but they largely remain at the level of lexical analysis, part-of-speech analysis and syntactic template extraction from traditional natural linguistics, and texts need to be processed with tools such as word segmentation, part-of-speech tagging and text matching templates. Moreover, each single similarity calculation method tends to have limited calculation accuracy owing to the limitations of its algorithm.
[ summary of the invention ]
In view of this, the present invention provides a method, an apparatus, a device and a computer storage medium for processing text similarity, which are used to implement integrated processing on multiple similarity determination results of text pairs and improve the accuracy of calculating the similarity of the text pairs.
The technical scheme adopted by the invention for solving the technical problem is to provide a text similarity processing method, which comprises the following steps: obtaining the similarity determination results of the text pairs obtained by a plurality of similarity determination methods; splicing the similarity determination results of the text pairs to obtain splicing characteristics; the splicing characteristics are used as the input of a similarity determination model, and the text similarity of the text pair is obtained according to the output of the similarity determination model; wherein the similarity determination model is trained in advance.
According to a preferred embodiment of the present invention, the similarity determination model is obtained by pre-training in the following manner: obtaining similarity determination results of the text pairs marked with the similarities obtained by multiple similarity determination methods; splicing the similarity determination results of the text pairs to obtain the splicing characteristics of the text pairs; and training a classification model by taking the splicing characteristics of each text pair and the labeling similarity of each text pair as training samples to obtain a similarity determination model.
According to a preferred embodiment of the present invention, the training goal of the classification model is to minimize the loss value of the classification model; and in the process of training the classification model, performing parameter adjustment on the classification model by using the loss value.
According to a preferred embodiment of the present invention, the loss value is an error between the text similarity of the text pair output by the classification model and the labeled similarity of the text pair.
According to a preferred embodiment of the present invention, the similarity determination result of the text pair obtained by the multiple similarity determination methods includes: the similarity feature vector and the similarity score of the text pair.
According to a preferred embodiment of the present invention, before the splicing the similarity determination results of the text pairs, the method further includes: randomly sampling the similarity characteristic vector to obtain a sampling characteristic vector; and splicing the sampling feature vector and the similarity score to obtain the splicing feature.
According to a preferred embodiment of the present invention, the randomly sampling the similarity feature vector to obtain a sampled feature vector includes: and randomly sampling the characteristic values in the similarity characteristic vector according to a preset probability, and setting the characteristic values which are not sampled in the similarity characteristic vector as 0 to obtain a sampling characteristic vector.
According to a preferred embodiment of the present invention, the similarity determination model is a neural network-based classification model.
The technical scheme adopted by the invention for solving the technical problem is to provide a text similarity processing device, which comprises: the acquisition unit is used for acquiring the similarity determination results of the text pairs obtained by the multiple similarity determination methods; the splicing unit is used for splicing the similarity determination results of the text pairs to obtain splicing characteristics; the processing unit is used for taking the splicing characteristics as the input of a similarity determination model and obtaining the text similarity of the text pair according to the output of the similarity determination model; wherein the similarity determination model is trained in advance.
According to a preferred embodiment of the present invention, the apparatus further comprises: the training unit is used for pre-training in the following mode to obtain the similarity determination model: obtaining similarity determination results of the text pairs marked with the similarities obtained by multiple similarity determination methods; splicing the similarity determination results of the text pairs to obtain the splicing characteristics of the text pairs; and training a classification model by taking the splicing characteristics of each text pair and the labeling similarity of each text pair as training samples to obtain a similarity determination model.
According to a preferred embodiment of the present invention, the training goal of the classification model is to minimize the loss value of the classification model; and in the process of training the classification model, performing parameter adjustment on the classification model by using the loss value.
According to a preferred embodiment of the present invention, the loss value is an error between the text similarity of the text pair output by the classification model and the labeled similarity of the text pair.
According to a preferred embodiment of the present invention, the similarity determination result of the text pair obtained by the multiple similarity determination methods includes: the similarity feature vector and the similarity score of the text pair.
According to a preferred embodiment of the present invention, before splicing the similarity determination results of the text pairs, the splicing unit further performs: randomly sampling the similarity feature vector to obtain a sampling feature vector; and splicing the sampling feature vector and the similarity score to obtain the splicing feature.
According to a preferred embodiment of the present invention, the stitching unit performs random sampling on the similarity feature vector to obtain a sampling feature vector, and specifically performs: and randomly sampling the characteristic values in the similarity characteristic vector according to a preset probability, and setting the characteristic values which are not sampled in the similarity characteristic vector as 0 to obtain a sampling characteristic vector.
According to a preferred embodiment of the present invention, the similarity determination model is a neural network-based classification model.
According to the technical scheme above, the similarity determination results of a text pair obtained by multiple similarity determination methods are spliced, and the splicing feature is used as the input of the similarity determination model. The multiple similarity determination results of the text pair are thereby integrated, the calculation accuracy of the text similarity of the text pair is improved, and the accuracy of the integrated similarity calculation is higher than that of any single similarity calculation mode.
[ description of the drawings ]
Fig. 1 is a flowchart of a text similarity processing method according to an embodiment of the present invention.
Fig. 2 is a block diagram of a text similarity processing apparatus according to an embodiment of the present invention.
Fig. 3 is a block diagram of a computer system/server according to an embodiment of the invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
The core idea of the text similarity processing method provided by the invention is as follows: an integrated processing method is provided for multiple similarity determination results of a text pair, so that the similarity determination results obtained by multiple single text similarity determination methods are calculated in an integrated manner, and the text similarity obtained by the integrated processing method is more accurate than that obtained by any single similarity determination method. In the present invention, two kinds of text similarity determination results, i.e., the similarity feature vector of a text pair and the similarity score of the text pair, are taken as an example for explanation. It can be understood that both the similarity feature vector and the similarity score reflect the similarity of the corresponding text pair.
Fig. 1 is a flowchart of a text similarity processing method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
in 101, similarity determination results of text pairs obtained by a plurality of similarity determination methods are obtained.
In this step, similarity determination results of the same text pair obtained by a plurality of similarity determination methods are obtained. The plurality of similarity determination manners may be two, three or more, and thus the obtained similarity determination results of the text pairs are correspondingly two, three or more.
In this embodiment, the obtained similarity determination results of the text pair are the similarity feature vector and the similarity score of the text pair. Optionally, in a specific implementation of this embodiment, an existing text similarity calculation system may be used to obtain the similarity score of the text pair, or a cosine similarity calculation method, a BM25 similarity calculation method, or the like may be used. The similarity feature vector of a text pair may be obtained by a text similarity calculation method based on a neural network or a deep learning model; for example, after a text pair is input to a text matching algorithm based on a convolutional neural network, the algorithm outputs a feature vector corresponding to the text pair, and this feature vector is used as the similarity feature vector of the text pair. The invention does not limit the type of the neural network or learning model: any neural network or learning model that can output a feature vector for an input text pair may be used.
For example, for a text pair P and Q, the similarity score of the text pair may be obtained by an existing text similarity calculation system A. The existing system A may be packaged as a callable interface, and the similarity score of the text pair is obtained directly by calling the interface; since the existing system A is packaged, its internal parameters or code cannot be changed. The similarity feature vector of the text pair may be obtained by a newly developed neural network-based similarity calculation method B; because the calculation method B is newly developed, its internal parameters can be changed. The similarity score obtained in this step may be denoted X_A, and the similarity feature vector X_B.
It can be understood that the similarity determination result of the text pair obtained in this step may also be a similarity feature vector of the text pair obtained by different similarity determination methods, so as to perform subsequent processing as multiple similarity determination results of the text pair.
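As an illustration, the following sketch (in Python) shows one way the two kinds of determination results of step 101 might be collected for a text pair. The cosine scorer merely stands in for the packaged system A, and neural_matcher is a hypothetical callable standing in for the neural method B; neither name comes from the patent itself.

# A minimal sketch of step 101, assuming two similarity determination methods:
# a scoring method (here cosine similarity over word counts, standing in for
# the packaged system A) and a neural matcher B returning a feature vector.
from collections import Counter
import math

def cosine_score(text_p: str, text_q: str) -> float:
    # Cosine similarity over bag-of-words count vectors.
    p, q = Counter(text_p.split()), Counter(text_q.split())
    dot = sum(p[w] * q[w] for w in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def get_determination_results(text_p, text_q, neural_matcher):
    x_a = cosine_score(text_p, text_q)    # similarity score X_A
    x_b = neural_matcher(text_p, text_q)  # similarity feature vector X_B
    return x_a, x_b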
In 102, the similarity determination results of the text pairs are spliced to obtain a splicing characteristic.
In this step, the multiple similarity determination results obtained in step 101 are spliced, so as to obtain the splicing characteristics of the multiple similarity determination results.
If there is a similarity feature vector in the multiple similarity determination results obtained in step 101, the following processing may be performed on the similarity feature vector: and after randomly sampling the similarity characteristic vector, splicing the sampled characteristic vector with other similarity determination results to obtain the splicing characteristic of the text pair.
Specifically, when the similarity feature vector is randomly sampled, the following method may be adopted: and randomly sampling the characteristic value in the similarity characteristic vector according to a preset probability, and setting the characteristic value which is not sampled in the similarity characteristic vector as 0, thereby obtaining a sampling characteristic vector.
For example, if the similarity feature vector obtained in step 101 has 5 dimensions and the preset probability is 0.6, the feature values of 3 (5 × 0.6) of the 5 dimensions of the similarity feature vector are randomly retained, and the feature values of the other two dimensions are set to 0. If the obtained similarity feature vector is X_B = [8.0, 6.0, 3.0, 4.0, 5.0] and the preset probability is 0.6, the feature values of any 3 dimensions in the similarity feature vector are sampled and the feature values of the other two dimensions are set to 0. If the sampled feature vector is X_B' = [8.0, 6.0, 0.0, 0.0, 5.0], the feature values with index numbers 0, 1 and 4 were sampled and the feature values with index numbers 2 and 3 were set to 0; if X_B' = [8.0, 0.0, 0.0, 4.0, 5.0], the feature values with index numbers 0, 3 and 4 were sampled and the feature values with index numbers 1 and 2 were set to 0. The preset probability can be set according to actual conditions.
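A minimal sketch of this sampling step, assuming the preset probability is the fraction of dimensions to retain, as in the 5-dimension example above; the function name is illustrative:

import random

def random_sample(x_b, keep_prob=0.6, rng=random):
    # Randomly retain a keep_prob fraction of the feature values in X_B
    # and set the unsampled dimensions to 0, yielding X_B'.
    n_keep = int(len(x_b) * keep_prob)               # e.g. 5 * 0.6 = 3 dims
    kept = set(rng.sample(range(len(x_b)), n_keep))  # randomly chosen indices
    return [v if i in kept else 0.0 for i, v in enumerate(x_b)]

# random_sample([8.0, 6.0, 3.0, 4.0, 5.0]) may yield, among other outcomes,
# [8.0, 6.0, 0.0, 0.0, 5.0] or [8.0, 0.0, 0.0, 4.0, 5.0]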
In this step, after the similarity feature vector is randomly sampled to obtain the sampled feature vector, the sampled feature vector is spliced with the text similarity score obtained in step 101. The splicing may be performed according to the following formula:

X_C = X_A ⊕ X_B'^T

where X_C is the splicing feature, X_A is the text similarity score, ⊕ is the row-vector splicing operator, X_B' is the sampled feature vector, and T is the vector transpose operator.
When performing the splicing, the multiple similarity determination results obtained in step 101 may also be spliced using a preset function. For example, if the multiple similarity determination results obtained in step 101 are all similarity feature vectors, then after the similarity feature vectors are randomly sampled to obtain sampled feature vectors, the [ ] concatenation operator in MATLAB may be used to splice the sampled feature vectors to obtain the splicing feature of the multiple similarity determination results.
Preferably, in this step, when the similarity feature vector of the text pair is randomly sampled, the preset probability may be set to 1, that is, the obtained similarity feature vector may not be randomly sampled, and the similarity feature vector obtained in step 101 and the similarity score are directly spliced to obtain the splicing feature.
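Continuing the sketch, the splicing itself reduces to concatenating the score with the sampled vector; again the function name is illustrative, not taken from the patent:

def splice(x_a: float, x_b_sampled: list) -> list:
    # Concatenate the similarity score X_A and the sampled feature
    # vector X_B' into the splicing feature X_C.
    return [x_a] + list(x_b_sampled)

# splice(0.82, [8.0, 6.0, 0.0, 0.0, 5.0])
# -> [0.82, 8.0, 6.0, 0.0, 0.0, 5.0]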
In 103, the splicing features are used as the input of a similarity determination model, and the text similarity of the text pair is obtained according to the output of the similarity determination model.
In this step, the splicing feature obtained in step 102 is used as the input of the similarity determination model, and the output of the similarity determination model is used as the text similarity of the text pair.
In this step, the similarity determination model used may be obtained by pre-training in the following manner:
1) Obtain the similarity determination results, produced by the multiple similarity determination methods, of text pairs labeled with similarity.
In this step, the similarity of each text pair is labeled in advance, that is, whether the text pairs are similar or not can be known through the labeled similarity. For example, if the similarity degree marked by a certain text pair is 1, it indicates that the text pair is similar; if 0, it indicates no similarity.
The obtained similarity determination results of a text pair are the similarity feature vector of the text pair and the similarity score of the text pair. The process of obtaining the similarity feature vector and the similarity score of a text pair labeled with similarity by different similarity determination methods is the same as in step 101, and details are not repeated here.
2) Splice the obtained similarity determination results of each text pair to obtain the splicing feature of the text pair.
In this step, the similarity feature vector of each text pair labeled with similarity is spliced with its similarity score to obtain the splicing feature corresponding to the text pair. Before splicing, the similarity feature vector of the text pair is randomly sampled, and the sampled feature vector is then spliced with the similarity score to obtain the splicing feature of the text pair. The random sampling process and the splicing process in this step are the same as those in step 102 and are not described here again. Preferably, when randomly sampling the similarity feature vector in this step, the preset probability for sampling is set to be less than 1.
3) Train a classification model by taking the splicing feature of each text pair and the labeled similarity of each text pair as training samples, obtaining the similarity determination model.
In this step, the classification model is trained by taking the splicing feature of each text pair as the input of the classification model and the labeled similarity of each text pair as its target output. The training goal of the classification model is to minimize the loss value of the classification model. Specifically, the loss value may be the error between the calculated similarity output by the classification model for a text pair and the labeled similarity of that text pair.
The error between the calculated similarity output by the classification model for a text pair and the labeled similarity of the text pair can be obtained by a formula of the form

L = ℓ(y, ŷ)

where L is the error between the calculated similarity and the labeled similarity of the text pair, y is the labeled similarity of the text pair, and ŷ is the calculated similarity of the text pair.
Specifically, minimizing the loss value of the classification model means minimizing the error L between the calculated similarity output for a text pair and the labeled similarity of the text pair. Optionally, in a specific implementation of this embodiment, the loss value of the classification model is considered minimized if the errors obtained within a preset number of iterations converge; or if the obtained error converges to a preset value; or if the number of training iterations exceeds a preset number. When the loss value of the classification model is minimized, the training process of the classification model is considered complete, and the similarity determination model is obtained. The classification model is a neural network-based classification model, and may be a convolutional neural network or a recurrent neural network, which is not limited by the present invention.
In the training process of the classification model, the process of minimizing the loss value is actually a process of feeding the loss value back to adjust the parameters of the classification model. The adjusted parameters include the weights, within the classification model, of the similarity determination results obtained by the multiple similarity determination methods. After iterative training, the finally obtained parameters of the classification model minimize the loss value, so that the weighting of the different similarity determination methods is realized automatically.
Meanwhile, when the classification model is trained, parameters of each similarity determination method can be adjusted according to the loss value of the classification model. For example, in the above description, since the internal parameters of the method B for calculating the similarity based on the neural network can be changed, the parameters in the method for calculating the similarity based on the neural network can be changed according to the loss values of the classification model, so that the parameters of the method gradually reach the optimal values. When the loss value obtained in training the classification model is minimized, the parameters in the classification model are optimized, and the parameters in the method B are also optimized. Therefore, the method B outputs a feature vector representing the similarity of the text pair more accurately according to the input text pair, thereby further improving the accuracy of the similarity of the text output by the similarity determination model.
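A minimal PyTorch-style sketch of this joint training, assuming a small feed-forward classifier, squared error as the error function, and a trainable linear module standing in for the neural method B; all shapes and names are illustrative, and the random sampling of X_B is omitted for brevity:

import torch
import torch.nn as nn

dim_b = 5                                     # dims of X_B, as in the example
classifier = nn.Sequential(                   # classification model to train
    nn.Linear(1 + dim_b, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),
)
method_b = nn.Linear(32, dim_b)               # stand-in for neural method B
optimizer = torch.optim.SGD(                  # tunes classifier and B jointly
    list(classifier.parameters()) + list(method_b.parameters()), lr=0.01)

def train_step(pair_repr, x_a, label):
    # pair_repr: a (32,) encoding of the text pair; x_a: scalar tensor with
    # the similarity score from system A; label: scalar tensor in {0, 1}.
    x_b = method_b(pair_repr)                 # similarity feature vector X_B
    x_c = torch.cat([x_a.view(1), x_b])       # splicing feature X_C
    y_hat = classifier(x_c).squeeze()         # calculated similarity
    loss = (y_hat - label) ** 2               # error vs. labeled similarity
    optimizer.zero_grad()
    loss.backward()                           # loss feeds back into both
    optimizer.step()
    return loss.item()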
After the similarity determination model is obtained through training, the text similarity of the text pair can be obtained through the model. In step 103, the splicing characteristics of the text pair obtained in step 102 are input into a similarity determination model, and the output obtained by the similarity determination model is the text similarity of the text pair.
In the training process of the similarity determination model, the parameters in the model, the weight parameters of each similarity determination method and the parameters in each similarity determination method are optimized to the optimal values, so that a more accurate text similarity determination result of the text pair can be output according to the splicing characteristics of the input corresponding text pair.
The above is illustrated with an example:
if the text similarity determination method is a text similarity calculation system A and a text matching algorithm B based on a neural network, the obtained text similarity determination results are respectively a similarity score and a similarity feature vector. And when the similarity determination model is trained, updating the parameters in the classification model and the parameters in the method B by taking the error between the calculated similarity and the labeled similarity obtained by the classification model as a loss value. When the loss value is minimized, the parameters in the classification model reach the optimal value, so that a similarity determination model is obtained; meanwhile, the parameters in the method B are optimal, and the similarity characteristic vector which represents the similarity of the text pairs more accurately can be obtained. And for the text pair to be evaluated, obtaining the similarity score of the text pair through A, obtaining a similarity feature vector through B after the parameters are updated, and inputting the splicing features into a similarity determination model after splicing, thereby obtaining the text similarity of the text pair to be evaluated. Therefore, the method can further improve the calculation accuracy of the text to the similarity.
Fig. 2 is a block diagram of a text similarity processing apparatus according to an embodiment of the present invention, as shown in fig. 2, the apparatus includes: an acquisition unit 21, a stitching unit 22, a training unit 23 and a processing unit 24.
An obtaining unit 21, configured to obtain similarity determination results of the text pairs obtained by the multiple similarity determination methods.
The similarity determination results of the same text pair obtained by the plurality of similarity determination methods are acquired by the acquisition unit 21. The plurality of similarity determination manners may be two, three, or more, and therefore the similarity determination results of the text pairs obtained by the obtaining unit 21 are also two, three, or more accordingly.
In this embodiment, the similarity determination results of the text pair acquired by the acquisition unit 21 are the similarity feature vector and the similarity score of the text pair. Optionally, in a specific implementation of this embodiment, the acquisition unit 21 may use an existing text similarity calculation system to obtain the similarity score of the text pair, or may use a cosine similarity calculation method, a BM25 similarity calculation method, or the like. The similarity feature vector of a text pair may be obtained by a text similarity calculation method based on a neural network or a deep learning model; for example, after a text pair is input to a text matching algorithm based on a convolutional neural network, the algorithm outputs a feature vector corresponding to the text pair, and the acquisition unit 21 uses this feature vector as the similarity feature vector of the text pair. The invention does not limit the type of the neural network or learning model: any neural network or learning model that can output a feature vector for an input text pair may be used.
For example, for a text pair P and Q, when the acquisition unit 21 obtains the similarity score of the text pair, the similarity score may be obtained by an existing text similarity calculation system A. The existing system A may be packaged as a callable interface, and the similarity score of the text pair is obtained directly by calling the interface; since the existing system A is packaged, its internal parameters or code cannot be changed. When the acquisition unit 21 obtains the similarity feature vector of the text pair, it can do so by a newly developed neural network-based similarity calculation method B; because the calculation method B is newly developed, its internal parameters can be changed. The similarity score obtained here may be denoted X_A, and the similarity feature vector X_B.
It is understood that the similarity determination result of the text pair obtained by the obtaining unit 21 may also be a similarity feature vector of the text pair obtained by different similarity determination methods, so as to perform subsequent processing as multiple similarity determination results of the text pair.
And the splicing unit 22 is configured to splice the similarity determination results of the text pairs to obtain a splicing characteristic.
The stitching unit 22 stitches the plurality of similarity determination results acquired by the acquisition unit 21, thereby obtaining stitching characteristics of the plurality of similarity determination results.
If there is a similarity feature vector in the multiple similarity determination results obtained by the obtaining unit 21, the stitching unit 22 may further perform the following processing on the similarity feature vector: and after randomly sampling the similarity characteristic vector, splicing the sampled characteristic vector with other similarity determination results to obtain the splicing characteristic of the text pair.
Specifically, when the stitching unit 22 randomly samples the similarity feature vector, the following method may be adopted: and randomly sampling the characteristic value in the similarity characteristic vector according to a preset probability, and setting the characteristic value which is not sampled in the similarity characteristic vector as 0, thereby obtaining a sampling characteristic vector.
For example, if the similarity feature vector acquired by the acquisition unit 21 has 5 dimensions and the preset probability is 0.6, the splicing unit 22 randomly retains the feature values of 3 (5 × 0.6) of the 5 dimensions of the similarity feature vector and sets the feature values of the other two dimensions to 0. If the obtained similarity feature vector is X_B = [8.0, 6.0, 3.0, 4.0, 5.0] and the preset probability is 0.6, the feature values of any 3 dimensions in the similarity feature vector are sampled and the feature values of the other two dimensions are set to 0. If the sampled feature vector is X_B' = [8.0, 6.0, 0.0, 0.0, 5.0], the feature values with index numbers 0, 1 and 4 were sampled and the feature values with index numbers 2 and 3 were set to 0; if X_B' = [8.0, 0.0, 0.0, 4.0, 5.0], the feature values with index numbers 0, 3 and 4 were sampled and the feature values with index numbers 1 and 2 were set to 0. The preset probability can be set according to actual conditions.
The splicing unit 22 randomly samples the similarity feature vector to obtain the sampled feature vector, and splices the obtained sampled feature vector with the text similarity score obtained by the acquisition unit 21. When splicing the sampled feature vector with the text similarity score, the splicing unit 22 may perform the following operation:

X_C = X_A ⊕ X_B'^T

where X_C is the splicing feature, X_A is the text similarity score, ⊕ is the row-vector splicing operator, X_B' is the sampled feature vector, and T is the vector transpose operator.
When performing the splicing, the splicing unit 22 may also splice the multiple similarity determination results acquired by the acquisition unit 21 using a preset function. For example, if the multiple similarity determination results obtained by the acquisition unit 21 are all similarity feature vectors, the splicing unit 22 may, after randomly sampling the similarity feature vectors to obtain sampled feature vectors, splice the sampled feature vectors using the [ ] concatenation operator in MATLAB to obtain the splicing feature of the multiple similarity determination results.
Preferably, when the concatenation unit 22 randomly samples the similarity feature vector of the text pair, the preset probability may be set to 1, that is, the concatenation unit 22 may not randomly sample the acquired similarity feature vector, and directly concatenate the similarity feature vector acquired in the acquisition unit 21 and the similarity score to obtain the concatenation feature.
The training unit 23 is configured to obtain a similarity determination model through pre-training.
The training unit 23 trains the obtained similarity determination model in advance to determine the text similarity of the text pair to be evaluated. Specifically, the training unit 23 may pre-train to obtain the similarity determination model in the following manner:
1) Obtain the similarity determination results, produced by the multiple similarity determination methods, of text pairs labeled with similarity.
The text pairs used by the training unit 23 are each labeled with similarity in advance; that is, whether the texts in a pair are similar can be known from the labeled similarity. For example, if the labeled similarity of a certain text pair is 1, the text pair is similar; if it is 0, the text pair is not similar.
The obtained similarity determination results of a text pair are the similarity feature vector of the text pair and the similarity score of the text pair. The process of obtaining the similarity feature vector and the similarity score of a text pair labeled with similarity by different similarity determination methods is the same as that performed by the acquisition unit 21, and details are not repeated here.
2) Splice the obtained similarity determination results of each text pair to obtain the splicing feature of the text pair.
The training unit 23 splices the similarity feature vector of each acquired text pair labeled with similarity with its similarity score to obtain the splicing feature corresponding to the text pair. Before splicing, the training unit 23 may further randomly sample the similarity feature vector of the text pair and then splice the sampled feature vector with the similarity score to obtain the splicing feature of the text pair. The random sampling process and the splicing process performed by the training unit 23 are the same as those performed by the splicing unit 22 and are not described here again. Preferably, when the training unit 23 randomly samples the similarity feature vectors of the text pairs, the preset probability for sampling is set to be less than 1.
3) Train a classification model by taking the splicing feature of each text pair and the labeled similarity of each text pair as training samples, obtaining the similarity determination model.
The training unit 23 trains the classification model by taking the splicing feature of each text pair as the input of the classification model and the labeled similarity of each text pair as its target output. The training goal of the classification model is to minimize the loss value of the classification model. Specifically, the loss value may be the error between the calculated similarity output by the classification model for a text pair and the labeled similarity of that text pair.
The error between the calculation similarity of the text pair output by the classification model and the labeling similarity of the text pair can be obtained by using the following formula:
Figure BDA0001410975870000131
in the formula: l is the error between the calculated similarity and the labeled similarity of the text pair, y is the labeled similarity of the text pair,
Figure BDA0001410975870000132
similarity is calculated for the text pairs.
Specifically, the training unit 23 minimizes the loss value of the classification model, i.e., minimizes the error L between the calculated similarity output for a text pair and the labeled similarity of the text pair. Optionally, in a specific implementation of this embodiment, the loss value of the classification model is considered minimized if the errors obtained by the training unit 23 within a preset number of iterations converge; or if the error obtained by the training unit 23 converges to a preset value; or if the number of training iterations of the training unit 23 exceeds a preset number. When the loss value of the classification model is minimized, the training of the classification model by the training unit 23 is considered complete, and the similarity determination model is obtained. The classification model is a neural network-based classification model, and may be a convolutional neural network or a recurrent neural network, which is not limited by the present invention.
In the training of the classification model by the training unit 23, the process of minimizing the loss value is actually a process of feeding the loss value back to adjust the parameters of the classification model. The adjusted parameters include the weights, within the classification model, of the similarity determination results obtained by the multiple similarity determination methods. After iterative training, the finally obtained parameters of the classification model minimize the loss value, so that the weighting of the different similarity determination methods is realized automatically.
Meanwhile, when the training unit 23 trains the classification model, parameters of each similarity determination method may also be adjusted according to the loss value of the classification model. For example, in the above description, since the internal parameters of the method B for calculating the similarity based on the neural network can be changed, the parameters in the method for calculating the similarity based on the neural network can be changed according to the loss values of the classification model, so that the parameters of the method gradually reach the optimal values. When the loss value obtained in training the classification model is minimized, the parameters in the classification model are optimized, and the parameters in the method B are also optimized. Therefore, the method B outputs a feature vector representing the similarity of the text pair more accurately according to the input text pair, thereby further improving the accuracy of the similarity of the text output by the similarity determination model.
And the processing unit 24 is configured to use the splicing characteristics as an input of a similarity determination model, and obtain the text similarity of the text pair according to an output of the similarity determination model.
The processing unit 24 takes the splicing feature obtained by the splicing unit 22 as the input of the similarity determination model trained by the training unit 23, and takes the output of the similarity determination model as the text similarity of the text pair.
After the training unit 23 has trained the similarity determination model, the processing unit 24 may obtain the text similarity of the text pair from the model. That is, the processing unit 24 inputs the splicing characteristics of the text pair obtained by the splicing unit 22 into the similarity determination model, and the output obtained by the similarity determination model is the text similarity of the text pair.
In the training process of the similarity determination model, the parameters in the model and the parameters in each similarity determination method are optimized to the optimal values, so that a more accurate text similarity determination result of the text pair can be output according to the splicing characteristics of the input corresponding text pair.
Fig. 3 illustrates a block diagram of an exemplary computer system/server 012 suitable for use in implementing embodiments of the invention. The computer system/server 012 shown in fig. 3 is only an example, and should not bring any limitations to the function and the scope of use of the embodiments of the present invention.
As shown in fig. 3, the computer system/server 012 is embodied as a general purpose computing device. The components of computer system/server 012 may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 that couples various system components including the system memory 028 and the processing unit 016.
Bus 018 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 012 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 012 and includes both volatile and nonvolatile media, removable and non-removable media.
System memory 028 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 030 and/or cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 034 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 3, commonly referred to as a "hard drive"). Although not shown in FIG. 3, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be connected to bus 018 via one or more data media interfaces. Memory 028 can include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the present invention.
Program/utility 040 having a set (at least one) of program modules 042 can be stored, for example, in memory 028, such program modules 042 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof might include an implementation of a network environment. Program modules 042 generally perform the functions and/or methodologies of embodiments of the present invention as described herein.
The computer system/server 012 may also communicate with one or more external devices 014 (e.g., a keyboard, a pointing device, a display 024, etc.). In the present invention, the computer system/server 012 may also communicate with one or more devices that enable a user to interact with the computer system/server 012, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer system/server 012 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 022. Also, the computer system/server 012 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the internet) via the network adapter 020. As shown, the network adapter 020 communicates with the other modules of the computer system/server 012 via bus 018. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 016 executes various functional applications and data processing by running programs stored in the system memory 028, for example, implementing a processing method for text similarity, and may include:
obtaining the similarity determination results of the text pairs obtained by a plurality of similarity determination methods;
splicing the similarity determination results of the text pairs to obtain splicing characteristics;
the splicing characteristics are used as the input of a similarity determination model, and the text similarity of the text pair is obtained according to the output of the similarity determination model;
wherein the similarity determination model is trained in advance.
The computer program described above may be provided in a computer storage medium encoded with a computer program that, when executed by one or more computers, causes the one or more computers to perform the method flows and/or apparatus operations shown in the above-described embodiments of the invention. For example, the method flows executed by the one or more processors may include:
obtaining the similarity determination results of the text pairs obtained by a plurality of similarity determination methods;
splicing the similarity determination results of the text pairs to obtain splicing characteristics;
the splicing characteristics are used as the input of a similarity determination model, and the text similarity of the text pair is obtained according to the output of the similarity determination model;
wherein the similarity determination model is trained in advance.
With the development of time and technology, the meaning of media is more and more extensive, and the propagation path of computer programs is not limited to tangible media any more, and can also be downloaded from a network directly and the like. Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
By the technical scheme provided by the invention, the similarity determination results of a text pair obtained by multiple similarity determination methods are spliced, and the splicing feature is used as the input of the similarity determination model, thereby realizing integrated processing of the multiple similarity determination results, improving the accuracy of the text-pair similarity, and making the accuracy of the integrated similarity calculation higher than that of any single similarity calculation mode.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (12)

1. A text similarity processing method, characterized by comprising the following steps:
obtaining similarity determination results of a text pair produced by a plurality of similarity determination methods, wherein the similarity determination results of the text pair comprise similarity feature vectors and similarity scores of the text pair;
splicing the similarity determination results of the text pair to obtain a splicing feature;
taking the splicing feature as the input of a similarity determination model, and obtaining the text similarity of the text pair according to the output of the similarity determination model;
wherein the similarity determination model is obtained by pre-training;
wherein splicing the similarity determination results of the text pair to obtain the splicing feature comprises:
randomly sampling the similarity feature vector to obtain a sampled feature vector;
and splicing the sampled feature vector and the similarity score to obtain the splicing feature;
and wherein the similarity determination model is obtained through the following pre-training:
obtaining similarity determination results, produced by the plurality of similarity determination methods, of text pairs labeled with similarities;
splicing the similarity determination results of each such text pair to obtain its splicing feature;
and training a classification model with the splicing feature and the labeled similarity of each text pair as training samples, to obtain the similarity determination model.
2. The method of claim 1, wherein the training goal of the classification model is to minimize a loss value of the classification model;
and during training of the classification model, the parameters of the classification model are adjusted using the loss value.
3. The method of claim 2, wherein the loss value is the error between the text similarity of a text pair output by the classification model and the labeled similarity of that text pair.
4. The method of claim 1, wherein randomly sampling the similarity feature vector to obtain the sampled feature vector comprises:
randomly sampling the feature values in the similarity feature vector according to a preset probability, and setting the feature values in the similarity feature vector that are not sampled to 0, to obtain the sampled feature vector.
5. The method of claim 1, wherein the similarity determination model is a neural network-based classification model.
6. An apparatus for processing text similarity, the apparatus comprising:
an acquisition unit, configured to acquire similarity determination results of a text pair produced by a plurality of similarity determination methods, wherein the similarity determination results of the text pair comprise similarity feature vectors and similarity scores of the text pair;
a splicing unit, configured to splice the similarity determination results of the text pair to obtain a splicing feature;
a processing unit, configured to take the splicing feature as the input of a similarity determination model and to obtain the text similarity of the text pair according to the output of the similarity determination model;
wherein the similarity determination model is obtained by pre-training;
wherein, when splicing the similarity determination results of the text pair to obtain the splicing feature, the splicing unit specifically:
randomly samples the similarity feature vector to obtain a sampled feature vector;
and splices the sampled feature vector and the similarity score to obtain the splicing feature;
and a training unit, configured to obtain the similarity determination model through the following pre-training:
obtaining similarity determination results, produced by the plurality of similarity determination methods, of text pairs labeled with similarities;
splicing the similarity determination results of each such text pair to obtain its splicing feature;
and training a classification model with the splicing feature and the labeled similarity of each text pair as training samples, to obtain the similarity determination model.
7. The apparatus of claim 6, wherein the training goal of the classification model is to minimize a loss value of the classification model;
and during training of the classification model, the parameters of the classification model are adjusted using the loss value.
8. The apparatus of claim 7, wherein the loss value is the error between the text similarity of a text pair output by the classification model and the labeled similarity of that text pair.
9. The apparatus according to claim 6, wherein, when randomly sampling the similarity feature vector to obtain the sampled feature vector, the splicing unit specifically:
randomly samples the feature values in the similarity feature vector according to a preset probability, and sets the feature values in the similarity feature vector that are not sampled to 0, to obtain the sampled feature vector.
10. The apparatus of claim 6, wherein the similarity determination model is a neural network-based classification model.
11. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method of any one of claims 1-5.
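For illustration only, and not as part of the claims: the sketch below shows, in Python/NumPy, one possible reading of the random sampling, splicing, and loss-driven training recited in claims 1-4. The keep probability of 0.8, the squared-error loss, the tiny logistic-regression classifier, and all function names are assumptions made for this sketch; the patent leaves the concrete classifier and the preset probability open.

import numpy as np

rng = np.random.default_rng(0)

def sample_feature_vector(vec, keep_prob=0.8):
    """Claim 4: randomly sample feature values with a preset
    probability; feature values that are not sampled are set to 0."""
    mask = rng.random(vec.shape) < keep_prob
    return np.where(mask, vec, 0.0)

def make_splicing_feature(feature_vec, score, keep_prob=0.8):
    """Claim 1: sample the similarity feature vector, then splice
    (concatenate) it with the similarity score."""
    sampled = sample_feature_vector(np.asarray(feature_vec, dtype=float), keep_prob)
    return np.concatenate([sampled, [score]])

def train_similarity_model(features, labels, lr=0.5, epochs=500):
    """Claims 1-3: train a classifier on (splicing feature, labeled
    similarity) samples by adjusting parameters with the gradient of
    a squared-error loss between output and labeled similarity."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid output in [0, 1]
        err = pred - y          # claim 3: error between output and label
        grad = err * pred * (1.0 - pred)           # d(loss)/d(logit)
        w -= lr * (X.T @ grad) / len(y)            # claim 2: parameter
        b -= lr * grad.mean()                      # adjustment via loss
    return w, b

# Toy usage: two labeled text pairs with 3-dim similarity feature
# vectors and scalar scores (values invented for illustration).
X_train = np.array([make_splicing_feature([0.9, 0.8, 0.7], 0.85),
                    make_splicing_feature([0.1, 0.2, 0.1], 0.15)])
y_train = np.array([1.0, 0.0])  # labeled similarities
w, b = train_similarity_model(X_train, y_train)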
CN201710841945.XA 2017-09-18 2017-09-18 Text similarity processing method, device and equipment and computer storage medium Active CN107729300B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710841945.XA CN107729300B (en) 2017-09-18 2017-09-18 Text similarity processing method, device and equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN107729300A CN107729300A (en) 2018-02-23
CN107729300B (en) 2021-12-24

Family

ID=61207603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710841945.XA Active CN107729300B (en) 2017-09-18 2017-09-18 Text similarity processing method, device and equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN107729300B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101489B (en) * 2018-07-18 2022-05-20 武汉数博科技有限责任公司 Text automatic summarization method and device and electronic equipment
CN110852056A (en) * 2018-07-25 2020-02-28 中兴通讯股份有限公司 Method, device and equipment for acquiring text similarity and readable storage medium
CN111027994B (en) * 2018-10-09 2023-08-01 百度在线网络技术(北京)有限公司 Similar object determining method, device, equipment and medium
CN110874526B (en) * 2018-12-29 2024-03-01 北京安天网络安全技术有限公司 File similarity detection method and device, electronic equipment and storage medium
CN111400462A (en) * 2019-01-02 2020-07-10 珠海格力电器股份有限公司 Question-answering robot training method and system and question-answering robot
CN109992772A (en) * 2019-03-13 2019-07-09 众安信息技术服务有限公司 Text similarity calculation method and device
CN110059180B (en) * 2019-03-13 2022-09-23 百度在线网络技术(北京)有限公司 Article author identity recognition and evaluation model training method and device and storage medium
CN110083834B (en) * 2019-04-24 2023-05-09 北京百度网讯科技有限公司 Semantic matching model training method and device, electronic equipment and storage medium
CN110532267A (en) * 2019-08-28 2019-12-03 北京明略软件系统有限公司 Field determination method, apparatus, storage medium, and electronic device
CN110851546B (en) * 2019-09-23 2021-06-29 京东数字科技控股有限公司 Verification method, model training method, model sharing method, system and medium
CN110929499B (en) * 2019-10-15 2022-02-11 平安科技(深圳)有限公司 Text similarity obtaining method, device, medium and electronic equipment
CN111680136B (en) * 2020-04-28 2023-08-25 平安科技(深圳)有限公司 Method and device for semantic matching of spoken language
CN112329430B (en) * 2021-01-04 2021-03-16 恒生电子股份有限公司 Model training method, text similarity determination method and text similarity determination device
CN112966708B (en) * 2021-01-27 2024-05-28 中国人民解放军陆军工程大学 Chinese crowdsourcing test report clustering method based on semantic similarity

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887443B (en) * 2009-05-13 2012-12-19 华为技术有限公司 Method and device for classifying texts
CA2742059C (en) * 2010-06-22 2019-10-29 Blaze Software Inc. Method and system for automated analysis and transformation of web pages
CN103268346B (en) * 2013-05-27 2016-08-10 翁时锋 Semisupervised classification method and system
CN103605694A (en) * 2013-11-04 2014-02-26 北京奇虎科技有限公司 Device and method for detecting similar texts
CN104699763B (en) * 2015-02-11 2017-10-17 中国科学院新疆理化技术研究所 The text similarity gauging system of multiple features fusion
JP5810240B1 (en) * 2015-04-10 2015-11-11 株式会社ソリトンシステムズ E-mail erroneous transmission determination device, e-mail transmission system, and program
CN105045781B (en) * 2015-08-27 2020-06-23 广州神马移动信息科技有限公司 Query term similarity calculation method and device and query term search method and device
CN106815593B (en) * 2015-11-27 2019-12-10 北京国双科技有限公司 Method and device for determining similarity of Chinese texts
CN106250858B (en) * 2016-08-05 2021-08-13 重庆中科云从科技有限公司 Recognition method and system fusing multiple face recognition algorithms
CN106502989A (en) * 2016-10-31 2017-03-15 东软集团股份有限公司 Sentiment analysis method and device
CN107066555B (en) * 2017-03-26 2020-03-17 天津大学 On-line theme detection method for professional field

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2379055A1 (en) * 1999-07-14 2001-01-25 Transgenic Inc. Trap vectors and gene trapping using the same
CN102646091A (en) * 2011-02-22 2012-08-22 日电(中国)有限公司 Dependence relationship labeling method, device and system
CN104462060A (en) * 2014-12-03 2015-03-25 百度在线网络技术(北京)有限公司 Method and device for calculating text similarity and realizing search processing through computer
CN104899188A (en) * 2015-03-11 2015-09-09 浙江大学 Question similarity calculation method based on question topics and focuses
CN106484678A (en) * 2016-10-13 2017-03-08 北京智能管家科技有限公司 Short text similarity calculation method and device
CN106815311A (en) * 2016-12-21 2017-06-09 杭州朗和科技有限公司 Question matching method and device
CN106844346A (en) * 2017-02-09 2017-06-13 北京红马传媒文化发展有限公司 Short text semantic similarity discrimination method and system based on the deep learning model Word2Vec
CN106874258A (en) * 2017-02-16 2017-06-20 西南石油大学 Text similarity calculation method and system based on Chinese character attribute vector representation
CN107133202A (en) * 2017-06-01 2017-09-05 北京百度网讯科技有限公司 Artificial-intelligence-based text verification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Short Text Semantic Similarity Computation Based on Deep Learning; Chen Xiaoyang; China Master's Theses Full-text Database, Information Science and Technology Series; 2016-04-15 (No. 04); Section 4.1, pp. 37-39; Section 4.3, pp. 41-42; Section 4.5, pp. 44-46 *
Domain Term Recognition Method Based on Word Vectors and Conditional Random Fields; Feng Yanhong et al.; Journal of Computer Applications; 2016-11-10; Vol. 36, No. 11; pp. 3146-3151 *

Also Published As

Publication number Publication date
CN107729300A (en) 2018-02-23

Similar Documents

Publication Publication Date Title
CN107729300B (en) Text similarity processing method, device and equipment and computer storage medium
CN107797985B (en) Method and device for establishing synonymous identification model and identifying synonymous text
CN107273503B (en) Method and device for generating parallel text in same language
CN109933662B (en) Model training method, information generation method, device, electronic equipment and computer readable medium
US11004448B2 (en) Method and device for recognizing text segmentation position
CN107291828B (en) Spoken language query analysis method and device based on artificial intelligence and storage medium
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
US20170270912A1 (en) Language modeling based on spoken and unspeakable corpuses
CN109614625B (en) Method, device and equipment for determining title text relevancy and storage medium
CN109599095B (en) Method, device and equipment for marking voice data and computer storage medium
EP3614378A1 (en) Method and apparatus for identifying key phrase in audio, device and medium
CN110334209B (en) Text classification method, device, medium and electronic equipment
CN111125317A (en) Model training, classification, system, device and medium for conversational text classification
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN111597800B (en) Method, device, equipment and storage medium for obtaining synonyms
CN110704597B (en) Dialogue system reliability verification method, model generation method and device
CN110569335A (en) triple verification method and device based on artificial intelligence and storage medium
CN110377750B (en) Comment generation method, comment generation device, comment generation model training device and storage medium
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN113407698B (en) Method and device for training and recognizing intention of intention recognition model
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN112860919A (en) Data labeling method, device and equipment based on generative model and storage medium
CN113656763B (en) Method and device for determining feature vector of applet and electronic equipment
WO2019118257A1 (en) Assertion-based question answering
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant