CN116644180A - Training method and training system for text matching model and text label determining method - Google Patents

Training method and training system for text matching model and text label determining method

Info

Publication number: CN116644180A
Application number: CN202310511425.8A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: data, text, model, training, label
Legal status: Pending
Inventors: 张乐中, 方俊
Current and original assignees: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202310511425.8A
Publication of CN116644180A

Classifications

    • G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING; G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
    • G06F18/214 — Pattern recognition; Analysing; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F40/30 — Handling natural language data; Semantic analysis
    • G06N20/00 — Computing arrangements based on specific computational models; Machine learning

Abstract

The embodiments of the present disclosure disclose a training method, a training system, and a text label determining method for a text matching model. One embodiment of the training method comprises the following steps: inputting label data and unlabeled text data in the training data into a pre-trained interaction model to obtain first predicted label data of the unlabeled text data; generating first new label data based on the first predicted label data and the label data; and fine-tuning a pre-trained text matching model according to the text data in the training data and the first new label data, wherein the text matching model adopts a double-tower model structure. The embodiment relates to model training technology and uses the deep semantic expression capability of the interaction model to guide the text matching model toward a more accurate text tagging model, so that the processing effect of the model can be improved while its data processing efficiency is preserved.

Description

Training method and training system for text matching model and text label determining method
Technical Field
The embodiments of the present disclosure relate to the technical field of model training, and in particular to a training method, a training system, and a text label determining method for a text matching model.
Background
Matching text to labels belongs to the sentence semantic similarity learning problem in the field of NLP (Natural Language Processing). The essence of text matching is similarity calculation over text semantics, i.e., semantic matching. There are currently two main solutions: the double-tower approach (Bi-Encoder) and the interactive approach (Cross-Encoder). A Bi-Encoder typically uses two encoders to encode the query and the document into vectors, and the similarity between the two vectors is then calculated by a relevance discriminant function. A Cross-Encoder, in contrast, usually splices the two sentences together, feeds them to the encoder in a single pass, and outputs the semantic score of the pair.
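As an illustrative aside (not part of the original disclosure), the following minimal Python sketch contrasts the two patterns using the open-source sentence-transformers library; the model names and texts are examples only.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Bi-Encoder: query and document are encoded independently, so document vectors
# can be pre-computed and indexed; similarity is a cheap vector operation.
bi_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
q_vec = bi_encoder.encode("wireless noise-cancelling headphones")
d_vec = bi_encoder.encode("Bluetooth over-ear headset with ANC")
print("bi-encoder score:", util.cos_sim(q_vec, d_vec))

# Cross-Encoder: the two sentences are read jointly by one encoder, which gives
# richer interaction but must be re-run for every candidate pair.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
print("cross-encoder score:",
      cross_encoder.predict([("wireless noise-cancelling headphones",
                              "Bluetooth over-ear headset with ANC")]))
```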
However, the inventors have found that although the Bi-Encoder is more efficient for information retrieval in text matching and its output can be used directly by downstream tasks, the lack of information interaction means that the knowledge of the pre-trained model (e.g., BERT) cannot be fully utilized, which often results in performance noticeably inferior to that of the Cross-Encoder.
The above information disclosed in this background section is only for enhancement of understanding of the background of the inventive concept and, therefore, may contain information that does not constitute prior art already known to a person of ordinary skill in the art in this country.
Disclosure of Invention
This summary is provided to introduce concepts in a simplified form that are further described below in the detailed description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a training method, a training apparatus, a model training system, a text label determining method, an electronic device, a computer readable medium and a computer program product for a text matching model, to solve one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a training method of a text matching model, including: inputting label data and unlabeled text data in training data into a pre-trained interaction model to obtain first predicted label data of the unlabeled text data, wherein the interaction model is used for determining semantic similarity between each input text data and each label data; generating first new tag data based on the first predicted tag data and the tag data; and fine tuning a pre-trained text matching model according to the text data in the training data and the first new label data, wherein the text matching model adopts a double-tower model structure.
In some embodiments, generating first new label data based on the first predicted label data and the label data includes: dividing the first predicted label data into positive and negative samples according to a comparison between the semantic similarity of the first predicted label data to the unlabeled text data and a corresponding preset similarity threshold; and, according to the division result, taking the first predicted label data and the label data as the first new label data.
In some embodiments, the method further comprises: inputting the tag data and the test text data into a pre-trained text matching model to obtain semantic similarity between the test text data and each tag data; determining similarity distribution between each label data and each category according to the category of the test text data; and determining a preset similarity threshold between each label data and each category by utilizing a search strategy based on the similarity distribution.
In some embodiments, the pre-trained interaction model is trained by the following method: inputting the label data and the unlabeled text data into a pre-trained text matching model to obtain second predicted label data of the unlabeled text data; generating second new tag data according to the second predicted tag data and the tag data; and initializing and training the interaction model according to the text data and the second new label data in the training data.
In some embodiments, the method further comprises: in response to completion of fine-tuning of the text matching model, regenerating second new label data according to the adjusted text matching model and the training data so as to fine-tune the interaction model; and in response to completion of fine-tuning of the interaction model, regenerating first new label data according to the adjusted interaction model and the training data so as to continue fine-tuning the adjusted text matching model, until the knowledge distillation iteration of the models is completed.
In some embodiments, the text matching model employs a mean square error loss function and the interaction model employs a cross entropy loss function during knowledge distillation iterations.
In some embodiments, fine tuning a pre-trained text matching model includes: and adopting a prompt learning method to finely tune the pre-trained text matching model.
In some embodiments, the pre-trained text matching model is trained using the following method: and initializing and training a text matching model by using text data and label data in the training data, and adopting a cross entropy loss function, wherein the text matching model adopts a double-tower model structure based on prompt learning.
In a second aspect, some embodiments of the present disclosure provide a training apparatus for a text matching model, including: the first prediction tag determining unit is configured to input tag data and unlabeled text data in training data into a pre-trained interaction model to obtain first prediction tag data of the unlabeled text data, wherein the interaction model is used for determining semantic similarity between each input text data and each tag data; a first new tag generation unit configured to generate first new tag data based on the first predicted tag data and the tag data; and the text matching model fine tuning unit is configured to fine tune a pre-trained text matching model according to the text data in the training data and the first new label data, wherein the text matching model adopts a double-tower model structure.
In some embodiments, the first new label generating unit is further configured to divide the first predicted label data into positive and negative samples according to the semantic similarity between the first predicted label data and the unlabeled text data and the comparison result of the first predicted label data and the corresponding preset similarity threshold; and according to the dividing result, the first predicted tag data and the tag data are used as first new tag data.
In some embodiments, the training device further includes a threshold determining unit configured to input the tag data and the test text data into a pre-trained text matching model to obtain semantic similarity between the test text data and each tag data; determining similarity distribution between each label data and each category according to the category of the test text data; and determining a preset similarity threshold between each label data and each category by utilizing a search strategy based on the similarity distribution.
In some embodiments, the training device further includes an interaction model initial training unit configured to input the tag data and the unlabeled text data into a pre-trained text matching model to obtain second predicted tag data of the unlabeled text data; generating second new tag data according to the second predicted tag data and the tag data; and initializing and training the interaction model according to the text data and the second new label data in the training data.
In some embodiments, the training device further comprises: the interactive model fine tuning unit is configured to respond to the completion of fine tuning of the text matching model, and re-generate second new tag data according to the adjusted text matching model and training data so as to fine tune the interactive model; and a text matching model fine tuning unit further configured to, in response to completion of the fine tuning of the interactive model, regenerate the first new tag data from the adjusted interactive model and the training data to continue fine tuning the adjusted text matching model until knowledge distillation iterations of the model are completed.
In some embodiments, the text matching model employs a mean square error loss function and the interaction model employs a cross entropy loss function during knowledge distillation iterations.
In some embodiments, the text matching model fine tuning unit is further configured to fine tune the pre-trained text matching model using a prompt learning method.
In some embodiments, the training apparatus further comprises a text matching model initial training unit configured to perform initial training on the text matching model using text data and tag data in the training data, and to employ a cross entropy loss function, wherein the text matching model employs a prompt learning based dual-tower model structure.
In a third aspect, some embodiments of the present disclosure provide a model training system comprising: a first server on which a double-tower model is installed, the double-tower model being obtained using the training method described in any of the implementations of the first aspect above; and the second server is provided with an interaction model which is used for carrying out knowledge distillation iterative training with the double-tower model.
In a fourth aspect, some embodiments of the present disclosure provide a text label determining method, including: in response to receiving text data to be analyzed, determining the text data to be analyzed as target text data; inputting target text data and preset tag data into a text matching model, and outputting predicted tag data for obtaining the target text data, wherein the text matching model is obtained by adopting the training method described in any implementation manner of the first aspect; and determining and outputting the label data of the target text data according to the predicted label data.
In a fifth aspect, some embodiments of the present disclosure provide an electronic device comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors causes the one or more processors to implement the method described in any of the implementations of the first or fourth aspects above.
In a sixth aspect, some embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method described in any of the implementations of the first or fourth aspects above.
In a seventh aspect, some embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the method described in any one of the implementations of the first or fourth aspects above.
The above embodiments of the present disclosure have the following beneficial effects: according to the training method in some embodiments of the present disclosure, first predicted label data of some unlabeled text data is obtained through an interaction model. This enriches the sample data and thus helps to improve the prediction accuracy of the text matching model. Furthermore, the semantic modeling capability of the interaction model is generally far higher than that of the double-tower model, so the powerful semantic modeling capability of the interaction model yields more diverse pseudo label data (i.e., predicted label data). Because the predicted label data is predicted by a model rather than being original label data in the sample data, it may be called pseudo label data. The text matching model can therefore acquire and learn more effective information from the text-label semantic retrieval (i.e., tagging) task, so the prediction accuracy of the model can be further improved. That is, the training method of the present disclosure enables the finally obtained text matching model to possess the deep semantic expression capability of the interaction model while retaining efficient information retrieval capability, thereby achieving a win-win effect.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flow chart of some embodiments of the training method of the present disclosure;
FIG. 2A is a structural schematic diagram of some embodiments of the interaction model of the present disclosure;
FIG. 2B is a schematic structural view of some embodiments of the dual tower model of the present disclosure;
FIG. 3 is a flow chart of further embodiments of the training method of the present disclosure;
FIG. 4A is a schematic illustration of some application scenarios of the training method of the present disclosure;
FIG. 4B is a schematic illustration of further application scenarios of the training method of the present disclosure;
FIG. 5 is a schematic structural diagram of some embodiments of the training apparatus of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the references to "one" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates a flow 100 of some embodiments of a training method for a text matching model according to the present disclosure. The training method comprises the following steps:
Step 101: input label data and unlabeled text data in the training data into a pre-trained interaction model to obtain first predicted label data of the unlabeled text data.
In some embodiments, the execution body of the training method of the text matching model (e.g., a model training server) may be communicatively connected with other electronic devices (e.g., the first server 401 and the second server 402 shown in FIG. 4B) through a wired or wireless connection. Here, the execution body may input the label data and the unlabeled text data in the training data into the pre-trained interaction model, and then perform semantic similarity analysis through the interaction model to obtain the first predicted label data of the unlabeled text data. The interaction model may be used to determine the semantic similarity between each input text data and each label data; the interaction model is the Cross-Encoder model described above.
It will be appreciated that the interaction model may determine the semantic similarity between each input text data and each label data respectively, and a specified number of labels with the highest similarity may be taken, in descending order of similarity, as the first predicted label data of the text data. The specified number may be set according to the actual situation, for example, 5.
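A minimal illustration (not from the disclosure; all names and scores are hypothetical) of selecting the top-K most similar labels as predicted label data:

```python
def top_k_predicted_labels(similarities, k=5):
    """similarities: dict mapping label -> semantic similarity score produced by
    the interaction model for one piece of unlabeled text."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

# hypothetical scores for one unlabeled text
scores = {"drum set recommendation": 0.82, "percussion toys": 0.74,
          "early education": 0.33, "phone accessories": 0.11, "shampoo": 0.02}
print(top_k_predicted_labels(scores, k=3))
```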
In some embodiments, the structure of the interaction model may be as shown in FIG. 2A. As can be seen from FIG. 2A, the input layer of the interaction model splices the text data and the label data together (i.e., In_x1 through In_yNy) and inputs them to the Encoder in a single pass. Representation vectors (i.e., Out_x1 through Out_yNy) are obtained from the encoder output. Then, through information aggregation (i.e., Agg Ranker), a candidate vector (i.e., Candidate Embedding) can be obtained. Finally, through dimension reduction (i.e., Dimension Reduction), a semantic similarity score (i.e., Score) is output.
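The following PyTorch sketch (not part of the disclosure; layer sizes, the pooling choice, and all hyperparameters are illustrative assumptions) mirrors the structure of FIG. 2A: splice, encode once, aggregate, and reduce to a score.

```python
import torch
import torch.nn as nn

class InteractionModel(nn.Module):
    """Illustrative Cross-Encoder: text and label tokens are spliced into one
    sequence, encoded jointly, aggregated, and reduced to a similarity score."""
    def __init__(self, vocab_size=30000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # the Encoder
        self.score_head = nn.Linear(hidden, 1)                     # dimension reduction

    def forward(self, text_ids, label_ids):
        pair = torch.cat([text_ids, label_ids], dim=1)   # splice In_x and In_y
        out = self.encoder(self.embed(pair))              # Out_x1 ... Out_yNy
        candidate = out.mean(dim=1)                       # information aggregation
        return self.score_head(candidate).squeeze(-1)     # semantic similarity Score

model = InteractionModel()
text_ids = torch.randint(0, 30000, (2, 16))    # a batch of two tokenized texts
label_ids = torch.randint(0, 30000, (2, 4))    # the corresponding label tokens
print(model(text_ids, label_ids).shape)         # torch.Size([2])
```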
Step 102, generating first new tag data based on the first predicted tag data and the tag data.
In some embodiments, the execution body may generate first new tag data based on the first predicted tag data obtained in step 101 and the previously existing tag data. As an example, the execution subject may update the original training data by using the unlabeled text data and the corresponding first predictive label data as positive sample data. That is, the first predicted tag data and the existing tag data are used as new tag data, and the first new tag data is obtained.
In some embodiments, because the prediction results of the model may be biased, the execution body may correct the first predicted label data. As an example, the execution body may divide the first predicted label data into positive and negative samples according to a comparison between the semantic similarity of the first predicted label data to the unlabeled text data and a corresponding preset similarity threshold. For example, if the semantic similarity of a piece of first predicted label data is greater than the preset similarity threshold, that first predicted label data may be used as a positive sample label of the corresponding text data; if its semantic similarity is less than the preset similarity threshold, it may be used as a negative sample label of the corresponding text data.
Then, according to the division result, the execution body may take the first predicted label data and the label data as the first new label data. That is, according to the positive/negative sample division, the execution body may add the first predicted label data of the positive samples to the original positive label data set, add the first predicted label data of the negative samples to the original negative label data set, and then use each label data in the updated positive and negative label data sets as the first new label data.
It should be noted that the preset similarity threshold may be set in advance. As an example, in an actual business scenario, each text is typically associated with at least one third-level category. In this case, correction of the predicted label data can be achieved by setting a similarity threshold between each label data and each third-level category.
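A minimal sketch (not from the disclosure; the default threshold and all data are hypothetical) of correcting pseudo labels with per-(label, third-level category) thresholds:

```python
def correct_pseudo_labels(predictions, thresholds, text_categories, default_thr=0.5):
    """predictions: list of (text_id, label, similarity) from the model.
    thresholds: dict mapping (label, cid3) -> preset similarity threshold.
    text_categories: dict mapping text_id -> its third-level category (cid3)."""
    positives, negatives = [], []
    for text_id, label, sim in predictions:
        thr = thresholds.get((label, text_categories[text_id]), default_thr)
        (positives if sim >= thr else negatives).append((text_id, label))
    return positives, negatives

preds = [("t1", "drum set recommendation", 0.72), ("t1", "shampoo", 0.20)]
rules = {("drum set recommendation", "drum sets/jazz drums"): 0.5,
         ("shampoo", "drum sets/jazz drums"): 0.6}
pos, neg = correct_pseudo_labels(preds, rules, {"t1": "drum sets/jazz drums"})
print(pos, neg)
```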
Optionally, the preset similarity threshold may also be obtained by the method shown in fig. 3, and specifically, reference may be made to the related description in the embodiment of fig. 3, which is not repeated herein.
Step 103: fine-tune the pre-trained text matching model according to the text data in the training data and the first new label data.
In some embodiments, the execution body may fine-tune the pre-trained text matching model based on the text data in the training data and the first new label data obtained in step 102. As an example, the execution body may input the text data and the first new label data into the text matching model, and then perform loss function analysis on the label data predicted by the model and the first new label data corresponding to the text data, so as to fine-tune the text matching model. Here, fine-tuning generally means starting from an already trained model and its weight parameter values, and training a model suited to one's own data by adjusting the parameters and the output categories of the last layer.
It will be appreciated that the text matching model may employ a mean square error loss function during fine-tuning. The advantage of this loss function is that the text matching model can accurately fit the pseudo-scores of the interaction model and thereby learn deeper semantic knowledge.
Here, the text matching model may employ a double-tower model structure. As an example, as shown in FIG. 2B, the text matching model may include a text encoder (Context Encoder) and a candidate label encoder (Candidate Encoder). The input layer feeds the text data (i.e., In_x1 through In_xNx) into the text encoder to obtain text representation vectors (i.e., Out_x1 through Out_xNx), after which text information aggregation (i.e., Context Aggregator) is performed to obtain a text vector (i.e., Context Embedding). Likewise, the input layer feeds the label data (i.e., In_y1 through In_yNy) into the candidate label encoder to obtain label representation vectors (i.e., Out_y1 through Out_yNy), after which label information aggregation (i.e., Candidate Aggregator) is performed to obtain a label vector (i.e., Candidate Embedding). Finally, a similarity score (i.e., Score) between the text vector and the label vector is calculated and output.
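For comparison with the Cross-Encoder sketch above, the following PyTorch sketch (again illustrative only; sizes and pooling are assumptions) mirrors the double-tower structure of FIG. 2B.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TowerEncoder(nn.Module):
    def __init__(self, vocab_size=30000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ids):
        out = self.encoder(self.embed(ids))  # Out_x / Out_y representation vectors
        return out.mean(dim=1)               # aggregation -> context / candidate embedding

class TextMatchingModel(nn.Module):
    """Illustrative double-tower model: separate text and candidate-label
    encoders whose pooled vectors are compared by cosine similarity."""
    def __init__(self):
        super().__init__()
        self.context_encoder = TowerEncoder()    # Context Encoder
        self.candidate_encoder = TowerEncoder()  # Candidate Encoder

    def forward(self, text_ids, label_ids):
        ctx = self.context_encoder(text_ids)       # Context Embedding
        cand = self.candidate_encoder(label_ids)   # Candidate Embedding
        return F.cosine_similarity(ctx, cand)      # similarity Score

model = TextMatchingModel()
print(model(torch.randint(0, 30000, (2, 16)),
            torch.randint(0, 30000, (2, 4))).shape)  # torch.Size([2])
```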
In some application scenarios, the execution body may employ a prompt learning (Prompt Learning) method to fine-tune the pre-trained text matching model. Note that prompt learning is generally an effective way to adapt a pre-trained model to a particular task. Without significantly changing the structure and parameters of the pre-trained language model, adding prompt information to the input allows the true capability of the pre-trained model to be exploited, thereby improving the zero-shot and few-shot capabilities of the model.
Prompt learning typically processes the input text information according to a specific template, reconstructing the task into a form that makes fuller use of the pre-trained language model. As an example, a traditional supervised learning task trains a model P(y|x) that receives x as input and predicts y. Prompt learning, by contrast, relies on the pre-trained language model: by introducing a suitable template, the input x is adjusted to x' in a cloze (fill-in-the-blank) form. The adjusted input x' typically contains some empty slots; after the language model fills the empty slots, the corresponding y can be derived.
In some embodiments, the execution body may employ the prompt learning method to add K randomly initialized tokens, i.e., prompt tokens, to the label tokens; the vector parameters of these tokens are trainable. As an example, assume that the text, after passing through a tokenizer, yields text tokens S[1], S[2], ... S[M], and the label, after passing through the tokenizer, yields label tokens T = (T[1], T[2], ... T[N]). The label tokens combined with the prompt may then be T' = (P[1], P[2], ... P[K], T[1], T[2], ... T[N]).
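A minimal sketch (not from the disclosure; the hidden size and K are illustrative) of prepending K trainable prompt vectors to the embedded label tokens:

```python
import torch
import torch.nn as nn

K = 4  # number of prompt tokens prepended to the label tokens

# Randomly initialized, trainable prompt vectors P[1..K].
prompt_embeddings = nn.Parameter(torch.randn(K, 256))

def build_label_input(label_token_embeds):
    """label_token_embeds: tensor (N, 256) for label tokens T[1..N] after the
    tokenizer and embedding lookup. Returns (K + N, 256): P[1..K] then T[1..N]."""
    return torch.cat([prompt_embeddings, label_token_embeds], dim=0)

label_embeds = torch.randn(6, 256)             # pretend embeddings of T[1..6]
print(build_label_input(label_embeds).shape)   # torch.Size([10, 256])
```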
It will be appreciated that in embodiments of the present disclosure, a fine-tuning method based on a pre-trained language model (BERT, Bidirectional Encoder Representations from Transformers) may be employed to fine-tune the text matching model. Such a pre-trained language model is usually not pre-trained with a traditional unidirectional language model or a shallow splicing of two unidirectional language models as in the prior art; instead, a masked language model (MLM) is employed so that a deep bidirectional language representation can be generated. Therefore, adding prompt tokens helps further mine the knowledge of the pre-trained language model and improves the text-label semantic retrieval effect of the text matching model (double-tower model).
It should be noted that, in practical applications, the most accurate semantic similarity method is often the interaction model, but its retrieval efficiency is far lower than that of the double-tower model, so the industry generally prefers the double-tower model for the sentence semantic similarity problem. The main drawback of the double-tower model, however, is that the query side and the document side do not interact until the final relevance discriminant function; that is, the query and the document are completely independent before entering the model, with no information interaction. This means the model representation is typically a static vector, and the learned semantic representation is limited. The interaction model, in contrast, performs full perceptual interaction between query and document at the feature level and can extract richer semantic expression information; its effect is therefore much better than that of the double-tower model.
In addition, supervised training of a text matching model typically requires a large amount of sample text data and corresponding sample label data, i.e., "text-label" pair data; the sample label data and the label data predicted by the model are analyzed and a loss function is calculated to adjust the model parameters. However, in some practical application scenarios, the amount of sample data tends to be small and its acquisition cost is high.
As described above, the training method in some embodiments of the present disclosure obtains first predicted label data of some unlabeled text data through the interaction model. This enriches the sample data and thus helps to improve the prediction accuracy of the text matching model. Furthermore, the semantic modeling capability of the interaction model is generally far higher than that of the double-tower model, so the powerful semantic modeling capability of the interaction model yields more diverse pseudo label data (i.e., predicted label data). Because the predicted label data is predicted by a model rather than being original label data in the sample data, it may be called pseudo label data. The text matching model can therefore acquire and learn more effective information from the text-label semantic retrieval (i.e., tagging) task, so the prediction accuracy of the model can be further improved. That is, the training method of the present disclosure enables the finally obtained text matching model to possess the deep semantic expression capability of the interaction model while retaining efficient information retrieval capability, thereby achieving a win-win effect.
It should be noted that the pre-trained text matching model and the pre-trained interaction model may be obtained from public resources such as the Internet, i.e., models for which initial training has already been completed. This can improve the overall training efficiency.
Optionally, in order to meet the use requirement of the actual scene, the accuracy of model prediction is improved, and the pre-trained text matching model and the pre-trained interaction model can also be obtained by initializing and training through training data of the present disclosure.
In some embodiments, the pre-trained text matching model may be trained using the following method: performing initial training on the text matching model using the text data and the label data in the training data, with a cross entropy loss function. The text matching model here may employ a double-tower model structure. In addition, in order to improve the initial training effect of the text matching model, training and parameter adjustment of the model may be based on the prompt learning method.
And the pre-trained interaction model can be obtained by training the following method: inputting the label data and the unlabeled text data into a text matching model trained in advance so as to obtain second predictive label data of the unlabeled text data; generating second new tag data according to the second predicted tag data and the tag data; and initializing and training the interaction model according to the text data and the second new label data in the training data.
Here, the process of the initial training of the interaction model can also refer to the relevant description of steps 101 to 103 in the embodiment of FIG. 1. Because the semantic modeling capability of the interaction model is far higher than that of the double-tower model, directly training the interaction model on the pseudo label data may cause it to severely fit the erroneous information in the pseudo label data, so that erroneous information propagates during the iteration loop and the distillation iteration cannot achieve the expected effect. Therefore, the second predicted label data may also be corrected here with the preset similarity threshold. In addition, when training the interaction model, the model parameters may be adjusted using the prompt learning method. The interaction model here may employ a cross entropy loss function (e.g., a binary cross entropy loss function), which prevents the interaction model from overfitting the pseudo-scores of the text matching model.
It will be appreciated that the pseudo label data generated by the trained text matching model (double-tower model) is used to train the interaction model. Because the amount of generated label data is relatively small, training the interaction model does not take too long, and a satisfactory interaction model can be obtained. Compared with a general-purpose interaction model, the interaction model obtained in this way improves the in-domain prediction accuracy of the model, and thus improves the prediction effect of the subsequently fine-tuned text matching model.
With continued reference to fig. 3, a flow 300 of further embodiments of training methods according to the present disclosure is shown. The training method comprises the following steps:
step 301, inputting the label data and the test text data into a pre-trained text matching model to obtain semantic similarity between the test text data and each label data.
In some embodiments, as shown in FIG. 4A, the execution body of the training method of the text matching model may divide the <text-label> pair data into a training data set and a test data set according to a certain ratio. After the text matching model has been trained on the <text-label> training data set, the execution body can run inference on the <text-label> test/validation set to obtain the similarity distribution between texts and labels. Specifically, the execution body may input the label data and the test text data into the pre-trained text matching model, thereby obtaining the semantic similarity between the test text data and each label data.
Step 302, determining similarity distribution between each label data and each category according to the category of the test text data.
In some embodiments, the execution body may determine the similarity distribution between each label data and each category based on the category to which the test text data belongs. It will be appreciated that the similarity between a text and a label is, in effect, the similarity between the category to which the text belongs and the label. The category may be set according to the actual situation. As an example, each piece of test text data may be associated with at least one third-level category, such as mobile phones or shampoo. From the predicted semantic similarity between the test texts and the labels, together with the third-level categories associated with the test texts, the similarity distribution of each label over the third-level categories can be obtained.
Step 303, determining a preset similarity threshold between each tag data and each category by using a search strategy based on the similarity distribution.
In some embodiments, the execution body may determine the preset similarity threshold between each label data and each category from the similarity distribution using a search strategy. As an example, based on the ground truth of the test texts (i.e., the truly annotated label data), the execution body may first derive the true relationship distribution between each label data and each category. Then, for each label data, the execution body may select an appropriate similarity threshold for each category for screening, according to the similarity distribution, so that the overall tagging accuracy of the label reaches a target value (e.g., 0.9).
For example, the label "drum set recommendation" may have a semantic matching relationship with texts from three third-level categories [percussion toys, drum sets/jazz drums, early education]. Because the text topics of different categories differ, the optimal similarity thresholds of the label in different third-level categories are often inconsistent if the semantic matching accuracy of the label in each third-level category is to exceed 0.9. For example, the semantic similarity thresholds for percussion toys, drum sets/jazz drums, and early education may be 0.6, 0.5, and 0.8, respectively.
The search strategy here is not limited and may include greedy search and/or pruning, among others. As an example, the optimal similarity threshold selected by each label on each category may be calculated by a greedy search strategy combined with pruning. That is, as shown in FIG. 4A, using the text cid3-label similarity threshold search strategy together with the third-level category (cid3) information of the texts, a text cid3-label similarity threshold rule can be obtained. The threshold rule may be used to filter erroneous information during the knowledge distillation iteration phase.
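By way of illustration only (not from the disclosure; the candidate grid, target precision, and all data are hypothetical assumptions), a greedy search over candidate thresholds for one (label, cid3) pair might look like this:

```python
import numpy as np

def search_threshold(scores, is_match, target_precision=0.9,
                     candidates=np.arange(0.05, 1.0, 0.05)):
    """Greedy scan from low to high: return the smallest threshold whose
    precision on the test split reaches the target, keeping recall as high as
    possible. scores: predicted text-label similarities for one (label, cid3)
    pair; is_match: ground-truth booleans from the test set."""
    scores, is_match = np.asarray(scores), np.asarray(is_match)
    for thr in candidates:
        kept = scores >= thr
        if kept.any() and is_match[kept].mean() >= target_precision:
            return float(thr)
    return 1.0  # no candidate threshold reaches the target precision

# hypothetical similarities of one label on one third-level category
sims = [0.82, 0.75, 0.63, 0.58, 0.40, 0.35]
truth = [True, True, True, False, False, False]
print(search_threshold(sims, truth))  # -> approximately 0.6
```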
The training method in these embodiments of the present disclosure thus enriches and refines the process of generating the preset similarity thresholds. Based on the initially trained text matching model, the semantic similarity between texts and labels is predicted on the test set and combined with the ground truth of the test set, and finally a threshold rule for each label and each category is obtained. The threshold rules can be applied when transferring from the interaction model to the text matching model (or from the text matching model to the interaction model), correcting the output pseudo label data and effectively reducing the transmission of erroneous information in the knowledge distillation iteration mechanism.
In some embodiments, the training method of the present disclosure may use the text matching model (double-tower model) and the interaction model to form a knowledge distillation iteration for fine-tuning the text matching model. Specifically, in response to completion of the fine-tuning of the text matching model, second new label data is regenerated according to the adjusted text matching model and the training data to fine-tune the interaction model; in response to completion of the fine-tuning of the interaction model, first new label data is regenerated according to the adjusted interaction model and the training data to continue fine-tuning the adjusted text matching model. This cycle continues until the knowledge distillation iteration of the models is completed. That is, the advantages of both the Cross-Encoder and the Bi-Encoder can be exploited, in an unsupervised manner, to guide an accurate text-label semantic retrieval model.
As an example, as shown in FIG. 4B, during the distillation iteration stage, pseudo label data may first be generated using the Bi-Encoder tagging model (i.e., the text matching model). The pseudo label data is then error-corrected by means of the threshold rules for labels and third-level categories, and the Cross-Encoder tagging model is trained and fine-tuned in combination with the existing supervision data, i.e., the <text-label> pair data. That is, using the error-corrected pseudo label data and the existing label data, a binary cross entropy loss function is used to fine-tune the Cross-Encoder combined with prompt learning. The cross entropy loss function can prevent the Cross-Encoder from overfitting the pseudo-scores of the Bi-Encoder.
At this time, the Cross-Encoder tagging model can acquire more effective information from the tagging task by virtue of its strong semantic modeling capability and the more diverse pseudo label data. It should be noted that, since the tagging model based on the pre-trained language model itself has a certain fault tolerance, the effect of a correctly attached label being erroneously corrected into a negative sample is negligible.
In turn, pseudo label data is generated using the resulting Cross-Encoder tagging model. Similarly, the pseudo label data is corrected by means of the threshold rules for labels and third-level categories, and the Bi-Encoder tagging model is fine-tuned in combination with the existing supervision data. That is, using the error-corrected pseudo label data and the existing label data, a mean square error loss function is used to fine-tune the Bi-Encoder based on pre-training with prompt learning. The mean square error (MSE) loss function helps the Bi-Encoder accurately fit the pseudo-scores of the Cross-Encoder. At this point, under the guidance of the Cross-Encoder and with the more diverse pseudo label data, the Bi-Encoder can further improve the tagging effect.
Unlike traditional distillation tasks, the embodiments of the present disclosure start from the supervised setting and, for the Bi-Encoder's lack of sentence interaction, adopt a method of mutual distillation iteration with the Cross-Encoder. That is, knowledge transfer is performed between the Bi-Encoder and the Cross-Encoder by means of data (the original label data as well as the pseudo label data), and the deep semantic expression capability of the Cross-Encoder is used to guide the Bi-Encoder to build a more accurate text tagging model. Meanwhile, a label-based similarity threshold strategy mechanism is added to the distillation process, which can prevent the transmission of erroneous information during distillation and further improve the transmission and utilization of effective information. The Bi-Encoder model finally obtained possesses the deep semantic expression capability of the Cross-Encoder while retaining efficient information retrieval capability.
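The following Python sketch is purely illustrative (the placeholder linear scorers, random feature tensors, and fixed threshold stand in for the real prompt-learning Bi-Encoder, the Cross-Encoder, tokenized <text, label> pairs, and the cid3 threshold rules); it shows the shape of one mutual-distillation iteration with the two loss functions named above.

```python
import torch
import torch.nn as nn

bi_encoder = nn.Linear(8, 1)      # placeholder for the double-tower scorer
cross_encoder = nn.Linear(8, 1)   # placeholder for the interaction scorer

bce = nn.BCEWithLogitsLoss()      # used when distilling into the Cross-Encoder
mse = nn.MSELoss()                # used when distilling into the Bi-Encoder

features = torch.randn(32, 8)     # stands in for tokenized <text, label> pairs

def corrected_targets(scores, thr=0.5):
    """Stand-in for the similarity-threshold error correction: pseudo labels
    whose score passes the (label, cid3) threshold become positives (1.0)."""
    return (scores.detach() > thr).float()

for _ in range(2):  # two distillation iterations, as in the reported experiment
    # Bi-Encoder -> Cross-Encoder: corrected pseudo labels from the double tower
    # fine-tune the Cross-Encoder with a binary cross entropy loss.
    pseudo = corrected_targets(torch.sigmoid(bi_encoder(features)))
    loss_ce = bce(cross_encoder(features), pseudo)

    # Cross-Encoder -> Bi-Encoder: the double tower fits the pseudo-scores of
    # the Cross-Encoder with a mean square error loss.
    teacher_scores = torch.sigmoid(cross_encoder(features)).detach()
    loss_mse = mse(torch.sigmoid(bi_encoder(features)), teacher_scores)

    (loss_ce + loss_mse).backward()  # optimizer steps omitted for brevity
    print(float(loss_ce), float(loss_mse))
```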
In particular, one iteration includes distillation from the Bi-Encoder to the Cross-Encoder and distillation from the Cross-Encoder to the Bi-Encoder. Through this distillation iteration mechanism, the cycle can be run one or more times to continuously optimize the performance of the encoders. In a topic-label tagging application for long texts, with 712 labels in total and 30,000 annotated text-label pairs, one distillation iteration with this method improves the AUC (the area enclosed by the ROC curve and the coordinate axes) of the Bi-Encoder tagging model by 0.020 (from 0.825 to 0.845), and two rounds of iteration improve it by 0.034 (from 0.825 to 0.859).
As can be seen from the above description, the training method in the embodiments of the present disclosure may achieve the following beneficial effects:
First, in supervised text semantic similarity retrieval, a more effective Bi-Encoder is built with the help of the Cross-Encoder in combination with pseudo label data. Compared with the traditional Bi-Encoder method, this not only maintains the efficient retrieval of the Bi-Encoder but also carries part of the deep semantic knowledge of the Cross-Encoder, achieving a win-win effect.
Secondly, in the knowledge distillation iteration process, information error correction is carried out through a similarity threshold value, so that transmission of error information in the iteration process can be effectively prevented.
In addition, in the semantic retrieval of texts and labels, both the Bi-Encoder and the Cross-Encoder use the prompt learning method. By virtue of the prompt design, prompt tokens are added to the input of the label encoder, so that the knowledge of the pre-trained language model itself can be mined and a better effect can be achieved than with ordinary fine-tuning.
With further reference to fig. 5, as an implementation of the training method of fig. 1-3 described above, the present disclosure provides some embodiments of a training apparatus. These training device embodiments correspond to those method embodiments shown in fig. 1-3. The training device can be applied to various electronic equipment.
As shown in fig. 5, the training apparatus 500 of some embodiments may include: a first predicted tag determining unit 501 configured to input tag data and unlabeled text data in training data into an interaction model trained in advance to obtain first predicted tag data of the unlabeled text data, wherein the interaction model is used for determining semantic similarity between each input text data and each tag data; a first new tag generation unit 502 configured to generate first new tag data based on the first predicted tag data and the tag data; the text matching model fine tuning unit 503 is configured to fine tune a pre-trained text matching model according to text data and first new tag data in the training data, wherein the text matching model adopts a double-tower model structure.
In some embodiments, the first new label generating unit 502 may be further configured to divide the first predicted label data into positive and negative samples according to the semantic similarity between the first predicted label data and the unlabeled text data and the comparison result of the semantic similarity with the corresponding preset similarity threshold; and according to the dividing result, the first predicted tag data and the tag data are used as first new tag data.
In some embodiments, the training device 500 may further include a threshold determining unit (not shown in the figure) configured to input the tag data and the test text data into a pre-trained text matching model, so as to obtain a semantic similarity between the test text data and each tag data; determining similarity distribution between each label data and each category according to the category of the test text data; and determining a preset similarity threshold between each label data and each category by utilizing a search strategy based on the similarity distribution.
In some embodiments, the training device 500 may further include an interaction model initial training unit (not shown in the figure) configured to input the tag data and the unlabeled text data into a pre-trained text matching model to obtain second predicted tag data of the unlabeled text data; generating second new tag data according to the second predicted tag data and the tag data; and initializing and training the interaction model according to the text data and the second new label data in the training data.
In some embodiments, the training device 500 may further include: an interactive model fine tuning unit (not shown in the figure) configured to re-generate second new tag data according to the adjusted text matching model and training data in response to completion of fine tuning of the text matching model, so as to fine tune the interactive model; and a text matching model fine tuning unit further configured to, in response to completion of the fine tuning of the interactive model, regenerate the first new tag data from the adjusted interactive model and the training data to continue fine tuning the adjusted text matching model until knowledge distillation iterations of the model are completed.
In some embodiments, the text matching model employs a mean square error loss function and the interaction model employs a cross entropy loss function during knowledge distillation iterations.
In some embodiments, the text matching model fine tuning unit 503 may be further configured to fine tune the pre-trained text matching model using a prompt learning method.
In some embodiments, the training apparatus 500 may further include a text matching model initial training unit (not shown in the figure) configured to perform initial training on the text matching model using text data and tag data in the training data, and employ a cross entropy loss function, wherein the text matching model employs a dual-tower model structure based on prompt learning.
It will be appreciated that the units described in the training apparatus 500 correspond to the respective steps in the methods described with reference to FIGS. 1 to 3. Thus, the operations, features, and benefits described above with respect to the methods are equally applicable to the training apparatus 500 and the units contained therein, and are not repeated here.
The embodiment of the disclosure also provides a model training system. The model training system may include a first server (e.g., first server 401 shown in fig. 4B) and a second server (e.g., second server 402 shown in fig. 4B). Wherein, the first server can be provided with a double-tower model. The dual tower model herein may be derived using the training method described in any of the embodiments of fig. 1-3 above. And the second server may have an interaction model installed thereon. The interaction model can be used for knowledge distillation iterative training with the double-tower model. Therefore, the double-tower model not only has high-efficiency retrieval, but also has deep semantic knowledge of the interaction model, and the prediction effect of the model is improved.
In addition, the embodiment of the disclosure also provides a text label determining method. The text label determining method may include: in response to receiving text data to be analyzed, determining the text data to be analyzed as target text data; inputting the target text data and the preset tag data into a text matching model, and outputting predicted tag data for obtaining the target text data, wherein the text matching model is obtained by adopting the training method described in any implementation mode in the embodiments of the figures 1 to 3; and determining and outputting the label data of the target text data according to the predicted label data.
That is, in practical application, the text data to be analyzed may be input into the trained text matching model described above. The text data to be analyzed may be, for example, text data that has no label data yet, or text data with existing label data whose accuracy is to be verified. Here, the predicted label data output by the model may be determined directly as the label data of the target text data, or at least one piece of predicted label data may be selected from a plurality of pieces of predicted label data as the label data of the text data. The output method of the data is likewise not limited; for example, the data may be displayed on a terminal, or the correspondence between the target text data and the label data may be stored.
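As an illustrative usage sketch (not part of the disclosure; match_fn, the word-overlap stand-in, and all data are hypothetical), label determination with a trained matching model and optional per-label thresholds could look like:

```python
def determine_text_labels(text, candidate_labels, match_fn, thresholds=None, top_k=3):
    """Score the target text against every preset label with the trained text
    matching model (match_fn), then keep the best-scoring labels, optionally
    filtered by per-label similarity thresholds."""
    scored = [(lbl, match_fn(text, lbl)) for lbl in candidate_labels]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    if thresholds:
        scored = [(lbl, s) for lbl, s in scored if s >= thresholds.get(lbl, 0.0)]
    return scored[:top_k]

# crude word-overlap stand-in for the trained double-tower model's similarity
def fake_match(text, label):
    return len(set(text.split()) & set(label.split())) / (len(label.split()) or 1)

print(determine_text_labels("jazz drum set buying guide for kids",
                            ["drum set recommendation", "shampoo", "early education"],
                            fake_match))
```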
It should be noted that the operations, features and the beneficial effects described above for the training method of the text matching model are also applicable to the text label determining method of the present disclosure, and are not described herein.
Referring now to fig. 6, a schematic diagram of an electronic device 600 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 6 is merely an example and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in FIG. 6, the electronic device 600 may include a processing device 601 (e.g., a central processing unit, a graphics processing unit, etc.), which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. Various programs and data required for the operation of the electronic device 600 are also stored in the RAM 603. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; output devices 607 including, for example, speakers, vibrators, etc.; storage 608 including, for example, magnetic disks, hard disks, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 6 may represent one device or a plurality of devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 609, or from storage device 608, or from ROM 602. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.
It should be noted that, the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: inputting label data and unlabeled text data in training data into a pre-trained interaction model to obtain first predicted label data of the unlabeled text data, wherein the interaction model is used for determining semantic similarity between each input text data and each label data; generating first new tag data based on the first predicted tag data and the tag data; and fine tuning a pre-trained text matching model according to the text data in the training data and the first new label data, wherein the text matching model adopts a double-tower model structure.
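Purely as an illustration of the flow just described (and not as part of the claimed embodiments), a minimal Python sketch of this training step might look as follows. The `interaction_model.score`, `encode_text`, and `encode_label` interfaces, the pseudo-labeling threshold, and the hyperparameters are assumptions of this sketch, not an API defined by the disclosure; the MSE objective on the double-tower similarity follows the loss choice stated later in claim 6.

```python
# Illustrative sketch only, assuming hypothetical `interaction_model` (cross-encoder)
# and `matching_model` (double-tower/bi-encoder) objects with the interfaces shown.
import torch
import torch.nn.functional as F

def pseudo_label(interaction_model, unlabeled_texts, labels, threshold=0.5):
    """Score every (text, label) pair with the interaction model and turn the
    scores into pseudo targets (the "first predicted label data")."""
    pairs = []
    for text in unlabeled_texts:
        scores = interaction_model.score(text, labels)    # semantic similarity per label (assumed API)
        for label, score in zip(labels, scores):
            target = 1.0 if score >= threshold else 0.0   # positive / negative pseudo sample
            pairs.append((text, label, target))
    return pairs

def finetune_double_tower(matching_model, labeled_pairs, pseudo_pairs, lr=2e-5, epochs=1):
    """Fine-tune the double-tower model on the original label data plus the pseudo labels."""
    optimizer = torch.optim.AdamW(matching_model.parameters(), lr=lr)
    for _ in range(epochs):
        for text, label, target in labeled_pairs + pseudo_pairs:
            text_vec = matching_model.encode_text(text)      # text tower (assumed API)
            label_vec = matching_model.encode_label(label)   # label tower (assumed API)
            sim = F.cosine_similarity(text_vec, label_vec, dim=-1)
            loss = F.mse_loss(sim, torch.full_like(sim, target))   # MSE objective, cf. claim 6
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return matching_model
```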
Alternatively, the one or more programs may cause the electronic device to: in response to receiving text data to be analyzed, determining the text data to be analyzed as target text data; inputting the target text data and preset tag data into a text matching model, and outputting predicted tag data of the target text data, wherein the text matching model is obtained by the training method described in any of the above implementations; and determining and outputting the label data of the target text data according to the predicted tag data.
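An equally informal sketch of this label determination flow is given below; the encoder interfaces and the output threshold are again assumptions. Because the two towers are independent, the label vectors can be precomputed offline, which is the main advantage of the double-tower structure at inference time.

```python
# Illustrative inference sketch; the encoder interfaces and threshold are assumptions.
import torch.nn.functional as F

def determine_labels(matching_model, target_text, preset_labels, threshold=0.5):
    text_vec = matching_model.encode_text(target_text)      # encode the target text once
    scored = []
    for label in preset_labels:
        label_vec = matching_model.encode_label(label)      # can be cached / precomputed offline
        sim = F.cosine_similarity(text_vec, label_vec, dim=-1).item()
        scored.append((label, sim))
    # output the label data whose predicted similarity clears the threshold
    return [label for label, sim in scored if sim >= threshold]
```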
Furthermore, computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages or combinations thereof, including object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising a first predicted label determination unit, a first new label generation unit, and a text matching model fine-tuning unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the first predicted label determination unit may also be described as "a unit that obtains first predicted label data of unlabeled text data".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Some embodiments of the present disclosure also provide a computer program product comprising a computer program which, when executed by a processor, implements any of the training methods described above, or a text label determination method.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention covered by the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by substituting the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (14)

1. A training method of a text matching model, comprising:
inputting label data and unlabeled text data in training data into a pre-trained interaction model to obtain first predicted label data of the unlabeled text data, wherein the interaction model is used for determining semantic similarity between each input text data and each label data;
generating first new tag data based on the first predicted tag data and the tag data;
and fine tuning a pre-trained text matching model according to the text data in the training data and the first new label data, wherein the text matching model adopts a double-tower model structure.
2. The training method of claim 1, wherein generating the first new tag data based on the first predicted tag data and the tag data comprises:
dividing the first predicted tag data into positive and negative samples according to a result of comparing the semantic similarity between the first predicted tag data and the unlabeled text data with a corresponding preset similarity threshold;
and using, according to the division result, the first predicted tag data and the tag data as the first new tag data.
3. The training method of claim 2, wherein the method further comprises:
inputting the tag data and test text data into the pre-trained text matching model to obtain a semantic similarity between the test text data and each tag data;
determining a similarity distribution between each tag data and each category according to the categories of the test text data;
and determining the preset similarity threshold between each tag data and each category by using a search strategy based on the similarity distribution.
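As one possible reading of claims 2-3 above, the preset similarity threshold for each (label, category) pair could be found by a simple grid search over the similarity distribution measured on the test text data. The candidate grid, the F1 criterion, and the function name below are assumptions made purely for illustration; the disclosure only specifies "a search strategy based on the similarity distribution".

```python
# Illustrative threshold search for one (label, category) pair.
import numpy as np

def search_threshold(similarities, is_positive, candidates=np.linspace(0.1, 0.9, 81)):
    """similarities: model scores for one (label, category) pair on the test data;
    is_positive: boolean array saying whether the label truly applies to each text."""
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        pred = similarities >= t
        tp = np.sum(pred & is_positive)
        fp = np.sum(pred & ~is_positive)
        fn = np.sum(~pred & is_positive)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

The threshold found for each pair would then drive the positive/negative sample division of claim 2.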
4. The training method of claim 1, wherein the pre-trained interaction model is trained by:
inputting the label data and the unlabeled text data into the pre-trained text matching model to obtain second predicted label data of the unlabeled text data;
generating second new tag data according to the second predicted tag data and the tag data;
and initializing and training the interaction model according to the text data in the training data and the second new label data.
5. The training method of claim 4, wherein the method further comprises:
in response to completion of the fine tuning of the text matching model, regenerating the second new tag data according to the adjusted text matching model and the training data, so as to fine-tune the interaction model; and
in response to completion of the fine tuning of the interaction model, regenerating the first new tag data according to the adjusted interaction model and the training data, so as to continue fine-tuning the adjusted text matching model, until the knowledge distillation iteration of the models is completed.
6. The training method of claim 5, wherein the text matching model employs a mean square error loss function and the interaction model employs a cross entropy loss function during knowledge distillation iterations.
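Read together, claims 4-6 describe an alternating knowledge-distillation loop in which each model pseudo-labels the unlabeled text for the other. A minimal structural sketch is given below; the pseudo-labeling and fine-tuning steps are passed in as assumed callables rather than defined here, and the loop structure itself is only one plausible arrangement of the claimed iteration.

```python
# Structural sketch of the alternating distillation loop (claims 4-6).
def distillation_iterations(matching_model, interaction_model, data,
                            generate_pseudo_labels, finetune_matching, finetune_interaction,
                            rounds=3):
    """data: {'labeled': [...], 'unlabeled': [...], 'labels': [...]}; the three
    callables are assumed stand-ins for the pseudo-labeling and fine-tuning steps."""
    for _ in range(rounds):
        # the double-tower model produces "second new tag data" for the interaction model
        second_new = generate_pseudo_labels(matching_model, data['unlabeled'], data['labels'])
        interaction_model = finetune_interaction(interaction_model,
                                                 data['labeled'] + second_new)   # cross entropy loss (claim 6)
        # the fine-tuned interaction model produces "first new tag data" for the double-tower model
        first_new = generate_pseudo_labels(interaction_model, data['unlabeled'], data['labels'])
        matching_model = finetune_matching(matching_model,
                                           data['labeled'] + first_new)          # mean square error loss (claim 6)
    return matching_model, interaction_model
```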
7. The training method of claim 1, wherein the fine-tuning of the pre-trained text matching model comprises:
fine-tuning the pre-trained text matching model by adopting a prompt learning method.
8. The training method according to any one of claims 1-7, wherein the pre-trained text matching model is trained by:
performing initialization training on the text matching model by using the text data and the label data in the training data and adopting a cross entropy loss function, wherein the text matching model adopts a double-tower model structure based on prompt learning.
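One way the prompt-learning-based tower of claims 7-8 might be realized is to wrap each raw label in a natural-language template before it is encoded. The template text and the Hugging Face-style tokenizer/encoder interfaces in the sketch below are assumptions, not details given in the disclosure.

```python
# Illustrative only: prompt wrapping for the label tower, assuming a BERT-style
# Hugging Face encoder and tokenizer supplied by the caller.
def build_label_prompt(label_text, template="This text is about [LABEL]."):
    """Wrap a raw label in a natural-language prompt before it enters the label tower."""
    return template.replace("[LABEL]", label_text)

def encode_label_with_prompt(encoder, tokenizer, label_text):
    prompt = build_label_prompt(label_text)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]   # use the [CLS] vector as the label embedding
```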
9. A training device for a text matching model, comprising:
a first predicted label determination unit configured to input label data and unlabeled text data in training data into a pre-trained interaction model to obtain first predicted label data of the unlabeled text data, wherein the interaction model is used for determining semantic similarity between each input text data and each label data;
a first new label generation unit configured to generate first new label data based on the first predicted label data and the label data;
and a text matching model fine-tuning unit configured to fine-tune a pre-trained text matching model according to the text data in the training data and the first new label data, wherein the text matching model adopts a double-tower model structure.
10. A model training system, comprising:
a first server on which a double-tower model is deployed, the double-tower model being obtained by using the training method of any one of claims 1-8; and
a second server on which an interaction model is deployed, the interaction model being configured to perform knowledge distillation iterative training with the double-tower model.
11. A text label determination method, comprising:
in response to receiving text data to be analyzed, determining the text data to be analyzed as target text data;
inputting the target text data and preset tag data into a text matching model, and outputting predicted tag data of the target text data, wherein the text matching model is obtained by using the training method according to any one of claims 1-8;
and determining and outputting the label data of the target text data according to the predicted label data.
12. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-8 or 11.
13. A computer readable medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-8 or 11.
14. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-8 or 11.