CN114490923A - Training method, device and equipment for similar text matching model and storage medium

Info

Publication number: CN114490923A
Application number: CN202111436420.0A
Authority: CN (China)
Prior art keywords: batch, similar, target, similar text, matching model
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 田上萱, 何文栋, 蔡成飞, 赵文哲, 孔伟杰
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3347 Query execution using vector based model

Abstract

The embodiment of the application discloses a training method, apparatus, device and storage medium for a similar text matching model; the related embodiments can be applied to various scenes such as cloud technology, artificial intelligence and intelligent traffic, and serve to improve the recall rate of similar texts. The method comprises the following steps: obtaining a first batch sample set corresponding to a target scene; inputting the first batch sample set into an original similar text matching model for a vector conversion operation to obtain first batch positive example sentence vectors and first batch negative example sentence vectors; carrying out a triple construction operation on the first batch positive example sentence vectors to obtain a plurality of first batch triples; carrying out a loss calculation operation to obtain a first batch loss function; carrying out a parameter adjustment operation on the original similar text matching model to obtain an intermediate similar text matching model; and repeatedly obtaining a second batch sample set corresponding to the target scene and carrying out the vector conversion, triple construction, loss calculation and parameter adjustment operations to obtain a target similar text matching model.

Description

Training method, device and equipment for similar text matching model and storage medium
Technical Field
The embodiments of the present application relate to the technical field of data processing, and in particular to a training method, apparatus, device and storage medium for a similar text matching model.
Background
With the development of science and technology, retrieval of similar texts is used ever more widely. When a user searches for information with a search engine, the same word in the entered query and in the returned results often expresses inconsistent concepts, so a means is needed to retrieve the content the user requires from massive data quickly and accurately and provide or push it to the user.
Retrieval of similar texts is usually realized with a deep-learning text-similarity model. However, such a model is supervised: it requires a large number of manually labeled training samples as supervision signals in order to fit and generalize well.
Disclosure of Invention
The embodiments of the present application provide a training method, apparatus, device and storage medium for a similar text matching model. A triplet loss function is used to pull a positive example sample close to its same-class samples and push it away from heterogeneous samples, so that similar text vectors form clusters in the feature space. This improves the model's ability to learn the similarity between text vectors, lets the similar text matching model fit better, and raises the recall rate of the target similar text matching model for similar texts.
An embodiment of the present application provides a training method for a similar text matching model, including:
acquiring a first batch sample set corresponding to a target scene, wherein the first batch sample set comprises first batch positive example samples and first batch negative example samples;
respectively inputting the first batch of positive example samples and the first batch of negative example samples into an original similar text matching model for vector conversion operation to obtain first batch of positive example sentence vectors and first batch of negative example sentence vectors;
carrying out a triple construction operation on the first batch positive example sentence vectors to obtain a plurality of first batch triples, wherein each first batch triple comprises a first batch positive example sentence vector, a first batch same-class sentence vector and a first batch heterogeneous sentence vector, and the first batch same-class sentence vector and the first batch heterogeneous sentence vector are derived from the first batch negative example sentence vectors;
performing loss calculation operation on the plurality of first batch triples to obtain a first batch loss function corresponding to the first batch sample set;
according to the first batch loss function, performing a parameter adjustment operation on the original similar text matching model to obtain an intermediate similar text matching model;
and repeatedly acquiring a second batch of sample sets corresponding to the target scene based on the intermediate similar text matching model, and executing vector conversion operation, triple construction operation, loss calculation operation and parameter adjustment operation to obtain the target similar text matching model.
Another aspect of the present application provides a training apparatus for a similar text matching model, comprising an acquisition unit and a processing unit:
the acquisition unit is used for acquiring a first batch sample set corresponding to a target scene, wherein the first batch sample set comprises first batch positive example samples and first batch negative example samples;
the processing unit is used for inputting the first batch of positive example samples and the first batch of negative example samples to the original similar text matching model respectively to perform vector conversion operation to obtain first batch of positive example sentence vectors and first batch of negative example sentence vectors;
the processing unit is further used for carrying out a triple construction operation on the first batch positive example sentence vectors to obtain a plurality of first batch triples, wherein each first batch triple comprises a first batch positive example sentence vector, a first batch same-class sentence vector and a first batch heterogeneous sentence vector, and the first batch same-class sentence vector and the first batch heterogeneous sentence vector are derived from the first batch negative example sentence vectors;
the processing unit is further used for performing loss calculation operation on the plurality of first batch triples to obtain a first batch loss function corresponding to the first batch sample set;
the processing unit is also used for carrying out parameter adjustment operation on the original similar text matching model according to the first batch loss function to obtain an intermediate similar text matching model;
and the processing unit is further used for repeatedly acquiring a second batch of sample sets corresponding to the target scene based on the intermediate similar text matching model, and executing vector conversion operation, triple construction operation, loss calculation operation and parameter adjustment operation to obtain the target similar text matching model.
In a possible design, in an implementation manner of another aspect of the embodiment of the present application, the obtaining unit may be specifically configured to:
acquiring a target text data set corresponding to the target scene, wherein the target text data set at least comprises the first batch positive example samples and source text data corresponding to the target scene;
retrieving N first matching texts corresponding to the first batch positive example samples from the target text data set as N first batch negative example samples, wherein N is an integer greater than 1;
calculating the matching scores between the first batch positive example samples and each first batch negative example sample to obtain N first matching scores;
respectively carrying out normalization operation on the N first matching scores to obtain N sample matching scores;
constructing the first batch sample set according to the first batch positive example samples, the first batch negative example samples and the sample matching scores.
In a possible design, in an implementation manner of another aspect of the embodiment of the present application, the processing unit may be specifically configured to:
according to the sample matching scores, dividing the first batch negative example sentence vectors into a same-class sentence vector set and a heterogeneous sentence vector set;
extracting any same-class sentence vector from the same-class sentence vector set to obtain the first batch same-class sentence vectors;
and extracting any heterogeneous sentence vector from the heterogeneous sentence vector set to obtain the first batch heterogeneous sentence vectors.
In a possible design, in an implementation manner of another aspect of the embodiment of the present application, the processing unit may be specifically configured to:
respectively performing loss calculation operations on the first batch positive example sentence vectors, the first batch same-class sentence vectors and the first batch heterogeneous sentence vectors to obtain loss functions corresponding to the plurality of first batch triples;
and performing a weighted calculation operation on the loss functions corresponding to the plurality of first batch triples to obtain the first batch loss function.
In a possible design, in an implementation manner of another aspect of the embodiment of the present application, the processing unit may be specifically configured to:
acquiring a second batch sample set corresponding to the target scene, and executing the vector conversion operation, triple construction operation and loss calculation operation according to the second batch sample set to obtain a second loss function;
and if the second loss function is smaller than the first threshold value, taking the current intermediate similar text matching model as the target similar text matching model.
In a possible design, in an implementation manner of another aspect of the embodiment of the present application, the processing unit may be specifically configured to:
acquiring intermediate model parameters of the intermediate similar text matching model;
acquiring a second batch sample set corresponding to the target scene and executing the vector conversion operation, triple construction operation and parameter adjustment operation to obtain a current similar text matching model, wherein the current similar text matching model comprises current model parameters;
and if the difference value between the intermediate model parameter and the current model parameter meets a second threshold value, taking the current intermediate similar text matching model as a target similar text matching model.
In one possible design, in one implementation of another aspect of an embodiment of the present application,
the acquisition unit is also used for receiving a text to be matched;
the processing unit is also used for respectively passing the text to be matched and the target text data set through the target similar text matching model to obtain a sentence vector to be matched and a plurality of original sentence vectors;
the processing unit is also used for calculating the similarity between the sentence vector to be matched and each original sentence vector to obtain a plurality of similarity scores;
and the determining unit is used for determining the target similar text according to the plurality of similarity scores and pushing the target similar text to the target terminal device.
According to the technical scheme, the embodiment of the application has the following advantages:
A first batch sample set corresponding to a target scene is obtained, and the first batch positive example samples and first batch negative example samples in it are input into an original similar text matching model for a vector conversion operation, yielding first batch positive example sentence vectors and first batch negative example sentence vectors. A triple construction operation is performed on the first batch positive example sentence vectors to obtain a plurality of first batch triples; a loss calculation operation is performed on the plurality of first batch triples to obtain a first batch loss function corresponding to the first batch sample set; and a parameter adjustment operation is performed on the original similar text matching model according to the first batch loss function to obtain an intermediate similar text matching model. Based on the intermediate similar text matching model, a second batch sample set corresponding to the target scene is repeatedly obtained, and the vector conversion, triple construction, loss calculation and parameter adjustment operations are executed to obtain the target similar text matching model. In this way, triples constructed from the first batch positive example sentence vectors and first batch negative example sentence vectors yield a triplet loss function. That loss pulls a positive example sample close to its same-class samples and pushes it away from heterogeneous samples, so that similar text vectors form clusters in the feature space; this improves the model's ability to learn the similarity between text vectors, lets the similar text matching model fit better, and raises the recall rate of the target similar text matching model for similar texts.
Drawings
FIG. 1 is a schematic diagram of an architecture of a text data control system in an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for training similar text matching models in the embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a search principle of a training method for similar text matching models in the embodiment of the present application;
FIG. 4 is a schematic diagram of another search principle of a training method of a similar text matching model in the embodiment of the present application;
FIG. 5 is a schematic diagram of a sample set construction of a training method for similar text matching models in the embodiment of the present application;
FIG. 6 is a schematic diagram of a model training flow of the training method for a similar text matching model in the embodiment of the present application;
FIG. 7 is a schematic diagram of an embodiment of a training apparatus for a similar text matching model in the embodiment of the present application;
FIG. 8 is a schematic diagram of an embodiment of a computer device in the embodiment of the present application.
Detailed Description
The embodiments of the present application provide a training method, apparatus, device and storage medium for a similar text matching model, in which a triplet loss function pulls a positive example sample close to its same-class samples and pushes it away from heterogeneous samples, so that similar text vectors form clusters in the feature space. This improves the model's ability to learn the similarity between text vectors, lets the similar text matching model fit better, and raises the recall rate of the target similar text matching model for similar texts.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims and drawings of the present application, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With the rapid development of information technology, cloud technology has gradually entered every aspect of people's lives. Cloud technology is a general term for network, information, integration, management-platform and application technologies based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems, such as video websites, image websites and portal sites, require large amounts of computing and storage resources. With the development of the internet industry, each article may carry its own identification mark that must be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data need strong system background support, which can only be realized through cloud computing.
Cloud security (Cloud Security) is the general name for security software, hardware, users, organizations and security cloud platforms based on the cloud computing business model. Cloud security integrates emerging technologies and concepts such as parallel processing, grid computing and judgment of unknown virus behavior: abnormal software behavior in the network is monitored through a large number of meshed clients to acquire the latest information on trojans and malicious programs in the internet, this information is sent to the server for automatic analysis and processing, and the virus and trojan solutions are then distributed to every client. The training method for a similar text matching model provided by the embodiments of the present application can be realized through cloud computing technology and cloud security technology.
It should be understood that the training method for a similar text matching model provided by the present application can be applied in fields such as cloud technology, artificial intelligence and intelligent transportation, in any scene where similar texts are pushed or delivered to a target object through similar-text matching. For example, better-matched advertisements can be recommended to a target object through similar-text matching of advertisement texts; better-matched goods can be recommended through similar-text matching of product texts; and better-matched books or documents can be recommended through similar-text matching of book texts. In all of these scenarios, matching of similar texts is usually realized with a deep-learning text-similarity model, which needs a large number of manually labeled training samples to fit and generalize well. However, text semantics are rich, and different annotators apply different standards of text similarity that are hard to unify, which makes manual labeling of training samples very difficult; training, optimizing and iterating the model then consumes huge cost, the fitting and generalization of the model are poor, and the recall rate of the model drops.
To solve the above problems, the present application provides a training method for a similar text matching model, applied to the text data control system shown in fig. 1. Referring to fig. 1, a schematic structural diagram of the text data control system in the embodiment of the present application: the server obtains a first batch sample set corresponding to a target scene, and respectively inputs the first batch positive example samples and first batch negative example samples in it into an original similar text matching model for a vector conversion operation, obtaining first batch positive example sentence vectors and first batch negative example sentence vectors. It performs a triple construction operation on the first batch positive example sentence vectors to obtain a plurality of first batch triples, performs a loss calculation operation on them to obtain a first batch loss function corresponding to the first batch sample set, and performs a parameter adjustment operation on the original similar text matching model according to the first batch loss function to obtain an intermediate similar text matching model. It then repeatedly obtains a second batch sample set corresponding to the target scene based on the intermediate similar text matching model, and performs the vector conversion, triple construction, loss calculation and parameter adjustment operations to obtain the target similar text matching model. In this way, triples constructed from the first batch positive example sentence vectors and first batch negative example sentence vectors yield a triplet loss function that pulls a positive example sample close to its same-class samples and pushes it away from heterogeneous samples, so that similar text vectors form clusters in the feature space; this improves the model's ability to learn the similarity between text vectors, lets the similar text matching model fit better, and raises the recall rate of the target similar text matching model for similar texts.
It is understood that fig. 1 shows only one terminal device; in an actual scene more varieties of terminal device may participate in the data processing, including but not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances and vehicle-mounted terminals. The specific number and variety depend on the actual scene and are not limited here. In addition, fig. 1 shows one server, but in an actual scenario a plurality of servers may participate, particularly in multi-model training interaction; the number of servers depends on the actual scenario and is not limited here.
It should be noted that in this embodiment the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), big data and artificial intelligence platforms. The terminal device and the server may be connected directly or indirectly through wired or wireless communication, and may be connected to form a blockchain network, which is not limited here.
To solve the above problems, the present application proposes a training method for a similar text matching model, which is generally executed by a server or terminal device; accordingly, the training apparatus for the similar text matching model is generally disposed in the server or terminal device.
It is understood that for the training method, apparatus, device and storage medium for a similar text matching model disclosed in the present application, a plurality of servers or terminal devices may be combined into a blockchain, with each server or terminal device a node on the blockchain. In practical applications, data sharing may be required between nodes in the blockchain, and text data and the like may be stored on each node.
Referring to fig. 2 to fig. 8, the training method for a similar text matching model in the present application is described below. One embodiment of the training method for a similar text matching model in the embodiment of the present application includes:
in step S101, a first batch sample set corresponding to a target scene is obtained, where the first batch sample set includes first batch positive example samples and first batch negative example samples;
in this embodiment, before a matching request or retrieval request sent by the target terminal device is received, i.e. before the text to be matched or the text to be retrieved is obtained, the first batch sample set corresponding to the target scene may be acquired, so that the original similar text matching model can subsequently be trained with the first batch positive example samples and first batch negative example samples in that set, thereby optimizing the original similar text matching model.
Specifically, a target scene stored in the database may first be obtained; the target scene may be an advertisement scene, a news information scene, a book management scene or another target scene, which is not specifically limited here. The first batch sample set corresponding to the target scene may then be acquired: it may be a batch of sample data randomly extracted from the overall sample set, comprising positive example samples and negative example samples. For example, in a known advertisement scene the first batch positive example samples are advertisement texts, and the first batch negative example samples are text data obtained by ES matching against those positive example samples; as shown in Table 1, one positive example sample may correspond to one or more negative example samples.
TABLE 1 (the table appears as an image in the original and its contents are not reproduced here)
For example, a first batch sample set may contain 128 pieces of sample data: 10 positive example samples, i.e. the first batch positive example samples, and 118 negative example samples obtained by ES matching against those 10 positive example samples, i.e. the first batch negative example samples.
In step S102, inputting the first batch positive example samples and the first batch negative example samples into the original similar text matching model for a vector conversion operation, so as to obtain first batch positive example sentence vectors and first batch negative example sentence vectors;
in this embodiment, after the first batch positive example samples and first batch negative example samples are obtained, they may be respectively input into the original similar text matching model for vector conversion, yielding the first batch positive example sentence vectors and first batch negative example sentence vectors. With these sentence vectors, the distance between each positive example sample and each negative example sample can be calculated, and that distance can represent the similarity between the positive example sample and the negative example sample.
Specifically, as shown in fig. 6, the original similar text matching model may be a combination of a BERT model with fully connected layers and pooling layers; other text processing models may also be used, which is not limited here. After the first batch positive example samples and first batch negative example samples are obtained, as shown in fig. 6, they may be respectively input into the original similar text matching model for vector conversion: each sample is first encoded by the BERT model to obtain at least two word vectors, and those word vectors are then passed through the fully connected layers and pooling layers to obtain the first batch positive example sentence vectors and first batch negative example sentence vectors.
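As an illustration of this encoder structure, the following is a minimal PyTorch sketch, assuming a BERT backbone with mean pooling and one fully connected layer; the model name, output dimension and pooling choice are assumptions for illustration and are not specified by the patent.

```python
import torch
from transformers import BertModel, BertTokenizer

class SentenceEncoder(torch.nn.Module):
    """BERT word vectors -> pooling -> fully connected layer -> sentence vector."""
    def __init__(self, bert_name="bert-base-chinese", out_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.fc = torch.nn.Linear(self.bert.config.hidden_size, out_dim)

    def forward(self, input_ids, attention_mask):
        # Word vectors for every token in the sentence.
        token_vecs = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Mean-pool the word vectors into one sentence vector, ignoring padding.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_vecs * mask).sum(dim=1) / mask.sum(dim=1)
        return self.fc(pooled)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = SentenceEncoder()
batch = tokenizer(["a positive example text", "a negative example text"],
                  padding=True, truncation=True, return_tensors="pt")
sentence_vectors = encoder(batch["input_ids"], batch["attention_mask"])  # shape (2, 128)
```

Both the positive example samples and the negative example samples pass through the same encoder, so their sentence vectors live in one feature space and the distances between them are meaningful.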
In step S103, performing a triple construction operation on the first batch positive example sentence vectors to obtain a plurality of first batch triples, where each first batch triple includes a first batch positive example sentence vector, a first batch same-class sentence vector and a first batch heterogeneous sentence vector, and the first batch same-class sentence vector and the first batch heterogeneous sentence vector are derived from the first batch negative example sentence vectors;
in this embodiment, after the first batch positive example sentence vectors and first batch negative example sentence vectors are obtained, for each first batch positive example sentence vector a first batch same-class sentence vector and a first batch heterogeneous sentence vector may be randomly combined with it into a triplet. One first batch positive example sentence vector may correspond to one or more triples, so a plurality of first batch triples can be obtained.
The first batch same-class sentence vectors can be understood as the sentence vectors of negative example samples with higher similarity to the first batch positive example sentence vector, specifically those whose sample matching score is greater than 0.5; the first batch heterogeneous sentence vectors can be understood as the sentence vectors of negative example samples with lower similarity, specifically those whose sample matching score is less than 0.5.
Specifically, suppose 10 first batch positive example samples and 118 first batch negative example samples are obtained, and that one of the positive example samples has 3 ES-matched negative example samples whose sentence vectors comprise 2 same-class sentence vectors and 1 heterogeneous sentence vector. Randomly pairing the positive example sentence vector with one same-class sentence vector and one heterogeneous sentence vector then yields 2 triples corresponding to that positive example sample.
In step S104, performing a loss calculation operation on the plurality of first batch triples to obtain a first batch loss function corresponding to the first batch sample set;
in this embodiment, after the plurality of first batch triples are obtained, a loss function may be calculated for each triplet, and the obtained loss functions may then be integrated into one loss function, namely the first batch loss function corresponding to the first batch sample set. The triplet loss function constructed from the triples pulls the positive example sample close to the same-class sample and pushes it away from the heterogeneous sample, so that similar text vectors form clusters in the feature space, achieving the purpose of text matching.
Furthermore, the similarity between text sentences can be regressed through the triplet loss function, so that after learning, the original similar text matching model expresses the similarity between sentence vectors through an embedding vector (Embedding), with that similarity as close as possible to the normalized matching score.
The first batch loss function may be expressed as a triple loss function, and specifically may be as follows:
(The formula appears as an image in the original and is not recoverable from this text.)
where L is the first batch loss function; ES(a, p) is the normalized matching score between the positive example sample a, which corresponds to the positive example sentence vector, and the negative example sample p, which corresponds to the same-class sentence vector; and d(p, n) is the cosine distance between the same-class sentence vector of negative example sample p and the heterogeneous sentence vector of negative example sample n. In general, the cosine distance is one minus the cosine similarity, e.g. d(a, p) = 1 - cosine(a, p).
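Since the formula image is not recoverable, the following is only a plausible reconstruction consistent with the surrounding description (a margin term separating same-class from heterogeneous vectors, plus regression of the cosine distance onto the normalized matching score); the exact form and the margin m are assumptions, not the patent's verbatim formula:

$$
L \;=\; \sum_{(a,\,p,\,n)} \Big[ \max\!\big(0,\; d(a,p) - d(a,n) + m\big) \;+\; \big|\, d(a,p) - \big(1 - \mathrm{ES}(a,p)\big) \big| \Big]
$$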
In step S105, performing parameter adjustment operation on the original similar text matching model according to the first batch loss function to obtain an intermediate similar text matching model;
specifically, after the first batch loss function is obtained, a parameter adjustment operation may be performed on the original similar text matching model, specifically, a reverse gradient descent algorithm may be adopted to update the model parameters in the bert until convergence, so that the intermediate similar text matching model may be obtained.
In step S106, based on the intermediate similar text matching model, repeatedly obtaining a second batch sample set corresponding to the target scene and executing the vector conversion operation, triple construction operation, loss calculation operation and parameter adjustment operation to obtain the target similar text matching model.
In this embodiment, after the intermediate similar text matching model is obtained, a second batch sample set corresponding to the target scene may be acquired repeatedly, and the vector conversion, triple construction, loss calculation and parameter adjustment operations of steps S102 to S105 performed on it, until the model parameters of the intermediate similar text matching model become stable; the intermediate similar text matching model can then be used as the target similar text matching model. It is understood that each second batch sample set corresponding to the target scene may again be a batch of sample data randomly extracted from the overall sample set, comprising positive example samples and negative example samples.
In the embodiment of the present application, a training method for a similar text matching model is provided. With this method, triples constructed from the first batch positive example sentence vectors and first batch negative example sentence vectors yield a triplet loss function that pulls a positive example sample close to its same-class samples and pushes it away from heterogeneous samples, so that similar text vectors form clusters in the feature space. This improves the model's ability to learn the similarity between text vectors, lets the similar text matching model fit better, and raises the recall rate of the target similar text matching model for similar texts.
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the training method for a similar text matching model provided in the embodiment of the present application, the obtaining a first batch sample set corresponding to a target scene includes:
acquiring a target text data set corresponding to the target scene, wherein the target text data set at least comprises the first batch positive example samples and source text data corresponding to the target scene;
retrieving N first matching texts corresponding to the first batch positive example samples from the target text data set as N first batch negative example samples, wherein N is an integer greater than 1;
calculating the matching scores between the first batch positive example samples and each first batch negative example sample to obtain N first matching scores;
respectively carrying out normalization operation on the N first matching scores to obtain N sample matching scores;
and constructing a first batch sample set according to the first batch positive sample, the first batch negative sample and the sample matching score.
In this embodiment, as shown in fig. 5, before model training a corresponding target text data set may be obtained for the target scene; N first matching texts corresponding to the first batch positive example samples are retrieved from it as the first batch negative example samples, and the N first matching scores are normalized into N sample matching scores, which yields the first batch sample set corresponding to the target scene. The target text data set may be built with a search engine (ES), a text retrieval engine that supports efficient retrieval and multiple scoring strategies. Self-supervised training samples such as the first batch sample set are thus obtained from the target text data set without spending a lot of time formulating a text-similarity standard and without cumbersome manual labeling. The target text data set can be replaced and adjusted for different target scenes and requirements, so a sample set suited to the target scene is obtained better and more accurately; the construction of sample sets is therefore more flexible and more extensible.
Specifically, before the target text data set corresponding to the target scene is obtained, a retrieval library corresponding to the target scene may be established with the search engine; that is, source text data corresponding to the target scene is obtained. In an advertisement scene the source text data may be advertisement copy, descriptions, product copy and the like, or other text data, which is not specifically limited here; obtaining the source text data may mean obtaining initial text data corresponding to the target scene through the search engine. Because initial text data such as advertisement copy contains many near-duplicate texts, differing only in punctuation or in a few individual characters, the obtained initial text data may be deduplicated to improve the diversity of the text data as much as possible: the edit distance between every two texts may be calculated, or the length of the longer text of each pair counted, and if the edit distance or the length difference between two texts is smaller than a preset distance threshold, the shorter text is filtered out as a duplicate sample, which yields the source text data (see the sketch below). Then, when a piece of source text data is stored, the ES uses a tokenizer to extract word units (tokens) from it to support storage and retrieval of an index; any of the tokenizers built into the ES may be used, or an optional tokenizer such as the IK Analyzer (IK) may be used as the word segmentation tool for each piece of source text data.
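A hypothetical sketch of this deduplication step, assuming a plain Levenshtein edit distance and an illustrative threshold (neither is specified by the patent); the pairwise comparison is shown only for clarity and would be expensive over millions of texts:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def deduplicate(texts: list[str], threshold: int = 3) -> list[str]:
    kept: list[str] = []
    for t in sorted(texts, key=len, reverse=True):  # prefer keeping longer texts
        if all(edit_distance(t, k) >= threshold for k in kept):
            kept.append(t)
    return kept  # texts that differ only slightly are filtered as duplicates
```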
Further, as shown in fig. 5, after the source text data is acquired, query samples (queries) collected for the target scene may be obtained, and the query samples together with the source text data form the target text data. For example, if six thousand query samples are collected under the advertisement scene, they form 2,006,000 entries of target text data together with two million pieces of source text data under that scene. When a piece of target text data is stored, the ES extracts word units from it with the tokenizer to build an index over the target text data.
Further, as shown in fig. 5, a search may be performed in the target text data for each positive example sample; for example, six thousand positive example samples are searched over the 2,006,000 entries of target text data. The search uses ES matching, which may be one or more matching methods such as keyword matching based on the bag-of-words model, or sentence matching in which word vectors generate a sentence vector. For example, as shown in fig. 3, suppose a positive example sample is "Beijing snacks"; the tokenizer produces the word units "Beijing" and "snacks". If a piece of source text data is "I love Beijing snacks", its word units are "I", "love", "Beijing" and "snacks", and bag-of-words keyword matching hits "Beijing" and "snacks". A relevance score, i.e. a matching score, between the positive example sample and the source text data can then be calculated from the hit word units. In the same way, the topN pieces of text data similar to each positive example sample in the target text data, i.e. the N first matching texts, are returned together with the matching score of each first matching text, with the matching scores ordered from high to low.
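As an illustration, a hypothetical sketch of the topN retrieval with the Python Elasticsearch client (8.x API assumed); the index name, field name and N are invented for the example and are not given by the patent:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def top_n_matches(positive_sample: str, n: int = 20):
    # Full-text "match" query: the query text is tokenized and scored against
    # the indexed word units, which yields the relevance (matching) score.
    resp = es.search(index="target_texts",
                     query={"match": {"text": positive_sample}},
                     size=n)
    # Hits come back ordered from the highest matching score to the lowest.
    return [(hit["_source"]["text"], hit["_score"])
            for hit in resp["hits"]["hits"]]
```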
For example, as shown in fig. 4, assume a positive example sample that is an electric screwdriver advertisement ("Drill brand screwdriver, a smash hit: 250 revolutions per minute, drills holes easily! 47 bits, drives 500 screws!"). Matching it against the target text data, which contains the positive example itself, returns the maximum matching score, e.g. "133.11464", for the positive example itself. Matching it against a source text such as "Drill brand screwdriver on sale! 2500 revolutions per minute, drills a hole in as little as 3 seconds! 47 bits, drives 500 screws!" may give a matching score such as "105.32658"; against another source text such as "This screwdriver is a hit! 250 revolutions per minute, drills easily! 47 bits can drive 500 screws!" a matching score such as "102.89305"; and against a source text such as "This electric screwdriver is superb: 250 revolutions per minute, 47 bits, drills easily and drives 500 screws!" a matching score such as "99.31".
Further, after the N first matching scores are acquired, note that a matching score has no fixed range: the highest matching score may vary from several tens to several hundreds depending on the positive example sample, and fitting the raw matching scores directly with a deep learning model would make the training process difficult to converge. Therefore, as shown in fig. 5, each matching score may be normalized to obtain a sample matching score. This makes the matching scores comparable and puts them on the same order of magnitude, so that the deep learning model fits better and the training process converges better and more easily; the first batch sample set can then be constructed from the first batch positive example samples, the first batch negative example samples and the sample matching scores. Specifically, each matching score may be normalized by the following formula (1):

es_norm(q, d) = score(q, d) / max_{i=1..k} score(q, d_i)   (1)

where q is a positive example sample, d is a piece of source text data, k is the total number of target text data, es_norm(q, d) is the sample matching score, score(q, d) is the matching score, and the denominator max_{i=1..k} score(q, d_i) is the maximum matching score. Because a positive example sample matches itself best, the maximum matching score is the score obtained by matching the positive example sample with itself.
Wherein score (q, d) can be obtained by the following formula (2):
score(q, d) = coord(q, d) * queryNorm(q) * ∑_{t∈q} ( tf(t∈d) * idf(t)² * boost(t) * norm(t, d) )   (2)

where coord(q, d) is the coordination factor, queryNorm(q) is the query normalization factor, tf(t∈d) is the term frequency of term t in document d, idf(t)² is the squared inverse document frequency of t, boost(t) is the term weight, and norm(t, d) is the length norm.
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the training method for a similar text matching model provided in the embodiment of the present application, the triple construction operation is performed on the first batch of positive example sentence vectors to obtain a plurality of first batch of triples, including:
according to the sample matching scores, dividing the first batch negative example sentence vectors into a same-class sentence vector set and a heterogeneous sentence vector set;
extracting any same-class sentence vector from the same-class sentence vector set to obtain the first batch same-class sentence vectors;
and extracting any heterogeneous sentence vector from the heterogeneous sentence vector set to obtain the first batch heterogeneous sentence vectors.
In this embodiment, after the first batch positive example sentence vectors and first batch negative example sentence vectors are obtained, the first batch negative example sentence vectors may be divided into a same-class sentence vector set and a heterogeneous sentence vector set according to the sample matching scores of the first batch sample set. Any same-class sentence vector may then be extracted from the same-class sentence vector set to obtain a first batch same-class sentence vector, and likewise any heterogeneous sentence vector may be extracted from the heterogeneous sentence vector set to obtain a first batch heterogeneous sentence vector. A triplet can thus be constructed for each first batch positive example sentence vector from a first batch same-class sentence vector and a first batch heterogeneous sentence vector, so that the triplet better represents the distance between similar and non-similar texts.
Specifically, the first batch negative example sentence vectors are divided according to the sample matching scores: sentence vectors whose sample matching score is greater than 0.5 form the same-class sentence vector set, and sentence vectors whose sample matching score is less than 0.5 form the heterogeneous sentence vector set. Then any same-class sentence vector may be randomly extracted from the same-class sentence vector set to obtain a first batch same-class sentence vector, and in the same way any heterogeneous sentence vector may be randomly extracted from the heterogeneous sentence vector set to obtain a first batch heterogeneous sentence vector (a sketch of this split-and-sample step follows the example after Table 2).
TABLE 2 (the table appears as an image in the original; per the example below, it lists a positive example sample together with negative example samples 1 and 2 and their sample matching scores of 0.89 and 0.18)
For example, as shown in Table 2, the matching scores corresponding to the negative example samples can be used: negative example samples 1 and 2 of the positive example sample (the electric screwdriver advertisement text quoted above) are converted into negative example sentence vectors 1 and 2, and then, based on the sample matching score of 0.89 corresponding to negative example sentence vector 1 and the sample matching score of 0.18 corresponding to negative example sentence vector 2, negative example sentence vector 1 is taken as a same-class sentence vector and negative example sentence vector 2 as a heterogeneous sentence vector.
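A hypothetical sketch of this step, assuming sentence vectors paired with their sample matching scores and the 0.5 threshold from the text; the function and variable names are invented for illustration:

```python
import random

def build_triplet(pos_vec, neg_vecs_with_scores):
    """neg_vecs_with_scores: list of (negative example sentence vector, sample matching score)."""
    same_class = [(v, s) for v, s in neg_vecs_with_scores if s > 0.5]
    heterogeneous = [(v, s) for v, s in neg_vecs_with_scores if s < 0.5]
    if not same_class or not heterogeneous:
        return None  # this positive example cannot form a triplet
    p, es_ap = random.choice(same_class)    # first batch same-class sentence vector
    n, _ = random.choice(heterogeneous)     # first batch heterogeneous sentence vector
    return pos_vec, p, n, es_ap             # anchor, same-class, heterogeneous, ES score
```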
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the training method for a similar text matching model provided in the embodiment of the present application, performing a loss calculation operation on a plurality of first batch triples to obtain a first batch loss function corresponding to a first batch sample set includes:
respectively performing loss calculation operations on the first batch positive example sentence vectors, the first batch same-class sentence vectors and the first batch heterogeneous sentence vectors to obtain loss functions corresponding to the plurality of first batch triples;
and performing a weighted calculation operation on the loss functions corresponding to the plurality of first batch triples to obtain the first batch loss function.
In this embodiment, after the plurality of first batch triples constructed from the first batch positive example sentence vectors are obtained, a loss calculation may be performed on each triplet, i.e. on its first batch positive example sentence vector, first batch same-class sentence vector and first batch heterogeneous sentence vector, to obtain the loss function corresponding to each first batch triplet. The loss functions of the plurality of first batch triples may then be weighted with preset weights to obtain the first batch loss function. The triplet loss function constructed from the triples pulls the positive example close to the same-class example and pushes it away from the heterogeneous example, so that similar text vectors form clusters in the feature space, achieving the purpose of text matching.
Specifically, after the plurality of first batch triples are obtained, a loss function may be calculated for each triplet by substituting its first batch positive example sentence vector, first batch same-class sentence vector and first batch heterogeneous sentence vector into the triplet loss function expression of step S104, giving the loss value of that triplet. The resulting loss functions may then be integrated into one loss function: a weighted calculation operation is performed on the loss functions of the plurality of first batch triples with preset weights to obtain the first batch loss function corresponding to the first batch sample set, where the preset weights are set according to the actual application requirements and are not specifically limited here (a PyTorch sketch follows).
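A hedged PyTorch sketch of the per-triplet loss and its weighted combination. Because the patent's formula image is not recoverable, the per-triplet loss below uses the reconstruction assumed earlier (cosine-distance margin term plus regression onto 1 - ES(a, p)); the margin and the uniform default weights are assumptions:

```python
import torch
import torch.nn.functional as F

def triplet_loss(a, p, n, es_ap, margin=0.2):
    d_ap = 1.0 - F.cosine_similarity(a, p, dim=-1)      # d(a, p) = 1 - cosine(a, p)
    d_an = 1.0 - F.cosine_similarity(a, n, dim=-1)
    rank = torch.clamp(d_ap - d_an + margin, min=0.0)   # pull p close, push n away
    regress = torch.abs(d_ap - (1.0 - es_ap))           # fit the normalized matching score
    return rank + regress

def first_batch_loss(triplets, weights=None):
    # triplets: list of (anchor, same-class, heterogeneous, sample matching score)
    losses = torch.stack([triplet_loss(a, p, n, s) for a, p, n, s in triplets])
    if weights is None:
        weights = torch.full_like(losses, 1.0 / len(losses))  # uniform preset weights
    return (losses * weights).sum()  # weighted combination = first batch loss function
```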
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the training method for a similar text matching model provided in the embodiment of the present application, repeatedly obtaining a second batch sample set corresponding to the target scene based on the intermediate similar text matching model and executing the vector conversion operation, triple construction operation, loss calculation operation and parameter adjustment operation to obtain the target similar text matching model includes:
acquiring a second batch sample set corresponding to the target scene, and executing the vector conversion operation, triple construction operation and loss calculation operation on the second batch sample set to obtain a second loss function;
and if the second loss function is smaller than the first threshold value, taking the current intermediate similar text matching model as the target similar text matching model.
In this embodiment, after the intermediate similar text matching model is obtained, second batch sample sets corresponding to the target scene may be continuously acquired for the vector conversion, triple construction and loss calculation operations to obtain a second loss function, and the intermediate similar text matching model is iteratively trained by back-propagating that second loss function. When the second loss function is smaller than the first threshold, the current intermediate similar text matching model may be used as the target similar text matching model. In this way the model learns the similarity between text vectors sufficiently and fits better, which raises the recall rate of the target similar text matching model for similar texts to a certain extent.
Specifically, the second batch sample set is acquired; "second batch sample set" here refers generically to sample sets of batches other than the first, extracted from the overall sample set, i.e. the third batch, fourth batch or N-th batch of samples. Operations like the vector conversion, triple construction and loss calculation operations of steps S102 to S104, not repeated here, are then performed to obtain a second loss function, where "second loss function" likewise refers generically to the loss function corresponding to each such batch.
Further, after the second loss function is obtained, it is understood that the smaller the second loss function, the better the model fits. The second loss function may therefore be compared with a first threshold, which may be a small value such as 0.18 and is set according to the actual application requirements, without specific limitation here. When the second loss function is smaller than the first threshold, i.e. the loss is already small enough, the current intermediate similar text matching model has stabilized and converged, and it may be taken as the target similar text matching model (a sketch of this stopping rule follows).
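A sketch of the full loop with this loss-threshold stopping rule, reusing the encoder and first_batch_loss sketches above; sample_batches, encode and build_triplets stand in for the operations described earlier, the learning rate is an assumption, and 0.18 is the example threshold from the text:

```python
optimizer = torch.optim.Adam(encoder.parameters(), lr=2e-5)
first_threshold = 0.18

for batch in sample_batches():                  # first batch, second batch, ...
    triplets = build_triplets(encode(batch))    # vector conversion + triple construction
    loss = first_batch_loss(triplets)           # loss calculation operation
    optimizer.zero_grad()
    loss.backward()                             # back-propagation
    optimizer.step()                            # gradient descent parameter adjustment
    if loss.item() < first_threshold:           # second loss function below first threshold
        break                                   # current model becomes the target model
```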
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the training method for similar text matching models provided in the embodiment of the present application, based on the intermediate similar text matching model, a second batch of sample sets corresponding to the target scene is repeatedly obtained, and a vector transformation operation, a triple construction operation, a loss calculation operation, and a parameter adjustment operation are performed to obtain a target similar text matching model, where the method includes:
acquiring intermediate model parameters of the intermediate similar text matching model;
obtaining a second batch of sample sets corresponding to the target scene, and executing the vector transformation operation, triple construction operation and parameter adjustment operation to obtain a current similar text matching model, wherein the current similar text matching model comprises current model parameters;
and if the difference value between the intermediate model parameter and the current model parameter meets a second threshold value, taking the current intermediate similar text matching model as a target similar text matching model.
In this embodiment, after the intermediate similar text matching model is obtained, a second batch of sample sets corresponding to the target scene may be continuously obtained, and the vector transformation operation, triple construction operation, and loss calculation operation may be performed to obtain a second loss function, through which the intermediate similar text matching model is iteratively trained by back-propagation. When the difference between the intermediate model parameters and the current model parameters satisfies a second threshold, the current intermediate similar text matching model may be used as the target similar text matching model, so that the model sufficiently learns the similarity between text vectors and fits better, thereby improving, to a certain extent, the recall rate of the target similar text matching model for similar texts. The second threshold may specifically be a small value; it is set according to actual application requirements and is not specifically limited here.
Specifically, after the intermediate similar text matching model is obtained, its intermediate model parameters may be extracted. A sample set of the next batch corresponding to the target scene may then be obtained, and operations similar to the vector transformation operation, the triple construction operation, and the parameter adjustment operation in steps S102 to S105 may be repeated (not described again here) to obtain the current similar text matching model, whose model parameters are extracted as the current model parameters.
Further, since convergence of the intermediate similar text matching model is reflected in the stabilization of its model parameters, the difference between the intermediate model parameters and the current model parameters may be calculated. If this difference is smaller than the second threshold, it may be understood that the change in the model parameters is small enough, that is, the model parameters tend to be stable and the intermediate similar text matching model tends to be stable, and the current intermediate similar text matching model may be taken as the target similar text matching model.
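For illustration, one possible reading of this parameter-difference test is sketched below. Taking the "difference value" to be the L2 norm of the flattened parameter difference is an assumption (the text does not fix a metric), as is the example threshold; the two models are assumed to share the same architecture so their parameters align.

```python
import torch

def params_stabilized(intermediate_model, current_model, second_threshold=1e-4):
    # Flatten both models' parameters and check whether the overall change is
    # below the second threshold, here interpreted as an L2 norm.
    prev = torch.cat([p.detach().flatten() for p in intermediate_model.parameters()])
    curr = torch.cat([p.detach().flatten() for p in current_model.parameters()])
    return torch.norm(curr - prev).item() < second_threshold
```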
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the training method for a similar text matching model provided in the embodiment of the present application, after a second batch of sample sets corresponding to the target scene is repeatedly obtained based on the intermediate similar text matching model, and the vector transformation operation, triple construction operation, loss calculation operation, and parameter adjustment operation are performed to obtain the target similar text matching model, the method further includes:
receiving a text to be matched;
passing the text to be matched and the target text data set respectively through the target similar text matching model to obtain a sentence vector to be matched and a plurality of original sentence vectors;
calculating the similarity between the sentence vector to be matched and each original sentence vector to obtain a plurality of similarity scores;
and determining a target similar text according to the plurality of similarity scores, and pushing the target similar text to the target terminal device.
In this embodiment, after the target similar text matching model is obtained, it may be applied as follows: a text to be matched is received, and a target text data set corresponding to the text to be matched is obtained; the text to be matched and the target text data set are then respectively input into the target similar text matching model to obtain a sentence vector to be matched and a plurality of original sentence vectors; the similarity between the sentence vector to be matched and each original sentence vector is calculated to obtain a plurality of similarity scores; and a target similar text is determined according to the plurality of similarity scores and pushed to the target terminal device, so that matched target similar texts can be recommended to the target object more accurately.
Specifically, the text to be matched may be embodied as advertisement copy, a commodity keyword, or other text, which is not specifically limited here. After the target similar text matching model is obtained, if a text to be matched, such as advertisement copy A, sent by a target object through a target terminal device is received, the target scene to which the text belongs, such as an advertisement retrieval scene, may be determined, and the corresponding target similar text matching model and target text data set, such as an advertisement copy retrieval library, may be determined according to that scene. The text to be matched and the target text data set may then be respectively input into the target similar text matching model to obtain a sentence vector to be matched and a plurality of original sentence vectors. The sentence vector to be matched may further be matched pairwise against each original sentence vector, and the similarity between them calculated, so as to obtain a similarity score between the sentence vector to be matched and each original sentence vector. The similarity may specifically be calculated through the Euclidean distance or the cosine similarity, or through other similarity calculation methods, which is not limited here.
Further, after the similarity scores between the sentence vector to be matched and each original sentence vector are obtained, the scores may be sorted from high to low on the principle that a higher similarity score indicates greater similarity. The original sentence vectors corresponding to, for example, the top ten or top hundred similarity scores may be selected according to the requirements of the target scene, and the texts corresponding to those selected original sentence vectors are determined as the target similar texts and pushed to the target terminal device.
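As a non-authoritative sketch of this matching flow, the following assumes a hypothetical `model.encode` method that maps a list of texts to a tensor of sentence vectors, and uses cosine similarity for scoring (Euclidean distance would serve equally, as noted above):

```python
import torch
import torch.nn.functional as F

def retrieve_similar_texts(model, query_text, corpus_texts, top_k=10):
    # Encode the text to be matched and the target text data set, score every
    # pair, and return the texts with the top-k similarity scores.
    query_vec = model.encode([query_text])                # shape [1, d]
    corpus_vecs = model.encode(corpus_texts)              # shape [n, d]
    scores = F.cosine_similarity(query_vec, corpus_vecs)  # shape [n]
    top = torch.topk(scores, k=min(top_k, len(corpus_texts)))
    return [(corpus_texts[i], scores[i].item()) for i in top.indices.tolist()]
```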
When the target similar text is pushed to the target terminal device, it may be pushed according to its type. For example, if the target similar text is represented as a commodity image or a commodity link, it may be pushed to the target terminal device directly; if it is represented as a video advertisement or hotspot information, the texts that better match the target object may first be determined from the target similar texts according to the target object's historical click-through rate, conversion rate, and the like, and then pushed to the target terminal device.
It can be understood that the target similar text matching model can also be used in links such as commodity retrieval, advertisement material retrieval, advertisement estimation model features, and advertisement analysis and diagnosis, improving the overall advertisement delivery effect across the whole chain. For example, goods and advertisement materials generally have corresponding copy, and combining text keyword search with target similar text matching model search can greatly improve matching accuracy and recall rate, so that better-matched goods or advertisements are recommended to target objects. The model can also be applied to links related to the advertisement delivery process, such as recall, rough-ranking or fine-ranking model estimation and strategy adjustment, and delivery effect analysis and diagnosis. For example, in the rough-ranking link, similar texts and picture videos in advertisements can be obtained through the target similar text matching model to filter similar advertisements and increase advertisement diversity; and combining the rough-ranking or fine-ranking estimation model with the target similar text matching model allows advertisements to be understood better and increases the generalization performance of the models.
It is to be understood that the target similar text matching model may also be applied to other text retrieval scenarios, such as retrieval of common public information, virtual or physical item retrieval, book retrieval in fine-grained domains, legal document retrieval, medical record retrieval, and the like, which is not limited here.
Referring to fig. 7, fig. 7 is a schematic diagram of an embodiment of a similar text matching model training apparatus in an embodiment of the present application, where the similar text matching model training apparatus 20 includes:
an obtaining unit 201, configured to obtain a first batch sample set corresponding to a target scene, where the first batch sample set includes a first batch positive example sample and a first batch negative example sample;
the processing unit 202 is configured to input the first batch of positive example samples and the first batch of negative example samples to the original similar text matching model for vector transformation operation, so as to obtain first batch of positive example sentence vectors and first batch of negative example sentence vectors;
the processing unit 202 is further configured to perform a triple construction operation on the first batch of positive example sentence vectors to obtain a plurality of first batch of triples, where each first batch of triples includes a first batch of positive example sentence vectors, a first batch of similar sentence vectors, and a first batch of heterogeneous sentence vectors, and the first batch of similar sentence vectors and the first batch of heterogeneous sentence vectors are derived from the first batch of negative example sentence vectors;
the processing unit 202 is further configured to perform a loss calculation operation on the plurality of first batch triples, and obtain a first batch loss function corresponding to the first batch sample set;
the processing unit 202 is further configured to perform parameter adjustment operation on the original similar text matching model according to the first batch loss function to obtain an intermediate similar text matching model;
the processing unit 202 is further configured to repeatedly obtain a second batch of sample sets corresponding to the target scene based on the intermediate similar text matching model, and perform a vector transformation operation, a triple construction operation, a loss calculation operation, and a parameter adjustment operation to obtain the target similar text matching model.
Optionally, on the basis of the embodiment corresponding to fig. 7, in another embodiment of the training apparatus for a similar text matching model provided in the embodiment of the present application, the obtaining unit 201 may specifically be configured to:
acquiring a target text data set corresponding to the target scene, wherein the target text data set at least comprises the first batch positive example samples and source text data corresponding to the target scene;
retrieving N first matching texts corresponding to the first batch positive example samples from the target text data set as N first batch negative example samples;
calculating the matching scores between the first batch positive example samples and each first batch negative example sample to obtain N first matching scores;
respectively carrying out normalization operation on the N first matching scores to obtain N sample matching scores;
constructing the first batch sample set according to the first batch positive example sample, the first batch negative example sample, and the sample matching score, as sketched below.
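A minimal sketch of this construction follows. The `retrieve` and `score` callables stand in for whatever retrieval method and matching-score function are actually used (neither is specified here), and min-max normalization into [0, 1] is an assumption, since the text only says "normalization".

```python
def build_first_batch(positive_text, retrieve, score, n=8):
    # Retrieve N texts matching the positive example as negative samples,
    # score each positive/negative pair, then min-max normalize the scores.
    negatives = retrieve(positive_text, n)                  # N first matching texts
    raw = [score(positive_text, neg) for neg in negatives]  # N first matching scores
    lo, hi = min(raw), max(raw)
    norm = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in raw]
    return {"positive": positive_text, "negatives": negatives, "match_scores": norm}
```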
Optionally, on the basis of the embodiment corresponding to fig. 7, in another embodiment of the training apparatus for a similar text matching model provided in the embodiment of the present application, the processing unit 202 may specifically be configured to:
according to the sample matching scores, dividing the first batch of negative example sentence vectors to obtain a similar sentence vector set and a heterogeneous sentence vector set;
extracting any one of the similar sentence vectors in the similar sentence vector set to obtain a first batch of similar sentence vectors;
and extracting any heterogeneous sentence vector from the heterogeneous sentence vector set to obtain the first batch of heterogeneous sentence vectors, as sketched below.
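For illustration only, the sketch below partitions the negative-example sentence vectors by their sample matching scores and draws one vector from each set at random. The 0.5 split point is an assumption, and both sets are assumed to be non-empty.

```python
import random

def build_triplet(pos_vec, neg_vecs, match_scores, split=0.5):
    # High-score negatives form the similar (same-class) set; low-score
    # negatives form the heterogeneous set. One of each completes the triplet.
    similar = [v for v, s in zip(neg_vecs, match_scores) if s >= split]
    hetero = [v for v, s in zip(neg_vecs, match_scores) if s < split]
    return (pos_vec, random.choice(similar), random.choice(hetero))
```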
Optionally, on the basis of the embodiment corresponding to fig. 7, in another embodiment of the training apparatus for a similar text matching model provided in the embodiment of the present application, the processing unit 202 may specifically be configured to:
respectively performing loss calculation operation on the first batch of positive example sentence vectors, the first batch of similar sentence vectors and the first batch of heterogeneous sentence vectors to obtain loss functions corresponding to a plurality of first batch of triples;
and performing weighted calculation operation on the loss functions corresponding to the first batch triples to obtain the first batch loss functions.
Optionally, on the basis of the embodiment corresponding to fig. 7, in another embodiment of the training apparatus for a similar text matching model provided in the embodiment of the present application, the processing unit 202 may specifically be configured to:
acquiring a second batch of sample sets corresponding to the target scene, and executing vector transformation operation, triple construction operation and loss calculation operation according to the second batch of sample sets to obtain a second loss function;
and if the second loss function is smaller than the first threshold value, taking the current intermediate similar text matching model as the target similar text matching model.
Optionally, on the basis of the embodiment corresponding to fig. 7, in another embodiment of the training apparatus for a similar text matching model provided in the embodiment of the present application, the processing unit 202 may specifically be configured to:
acquiring intermediate model parameters of the intermediate similar text matching model;
obtaining a current similar text matching model after acquiring a second batch of sample sets corresponding to a target scene and executing vector transformation operation, triple construction operation and parameter adjustment operation, wherein the current similar text matching model comprises current model parameters;
and if the difference value between the intermediate model parameter and the current model parameter meets a second threshold value, taking the current intermediate similar text matching model as a target similar text matching model.
Optionally, on the basis of the above embodiment corresponding to fig. 7, in another embodiment of the training apparatus for a similar text matching model provided in the embodiment of the present application,
the acquiring unit 201 is further configured to receive a text to be matched;
the processing unit 202 is further configured to pass the text to be matched and the target text data set through a target similar text matching model respectively to obtain a sentence vector to be matched and a plurality of original sentence vectors;
the processing unit 202 is further configured to calculate similarity between the sentence vector to be matched and each original sentence vector, so as to obtain a plurality of similarity scores;
the determining unit 203 is configured to determine a target similar text according to the plurality of similarity scores, and push the target similar text to the target terminal device.
Another exemplary computer device is provided. As shown in fig. 8, which is a schematic structural diagram of the computer device provided in this embodiment, the computer device 300 may vary considerably in configuration and performance, and may include one or more central processing units (CPUs) 310 (e.g., one or more processors), a memory 320, and one or more storage media 330 (e.g., one or more mass storage devices) storing an application 331 or data 332. The memory 320 and the storage medium 330 may be transient or persistent storage. The program stored on the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations on the computer device 300. Furthermore, the central processing unit 310 may be configured to communicate with the storage medium 330 to execute, on the computer device 300, the series of instruction operations in the storage medium 330.
The computer device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input-output interfaces 360, and/or one or more operating systems 333, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The computer device 300 described above is also used to perform the steps in the corresponding embodiment of fig. 2.
Another aspect of the present application provides a computer-readable storage medium having instructions stored thereon, which, when executed on a computer, cause the computer to perform the steps of the method as described in the embodiment shown in fig. 2.
Another aspect of the application provides a computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the steps of the method as described in the embodiment shown in fig. 2.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application essentially, or the part of it contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (11)

1. A training method of a similar text matching model is characterized by comprising the following steps:
acquiring a first batch sample set corresponding to a target scene, wherein the first batch sample set comprises a first batch positive example sample and a first batch negative example sample;
respectively inputting the first batch of positive example samples and the first batch of negative example samples into an original similar text matching model for vector conversion operation to obtain first batch of positive example sentence vectors and first batch of negative example sentence vectors;
performing a triple construction operation on the first batch of positive example sentence vectors to obtain a plurality of first batch of triples, wherein each first batch of triples comprises the first batch of positive example sentence vectors, a first batch of similar sentence vectors and a first batch of heterogeneous sentence vectors, and the first batch of similar sentence vectors and the first batch of heterogeneous sentence vectors are derived from the first batch of negative example sentence vectors;
performing loss calculation operation on the plurality of first batch triples to obtain a first batch loss function corresponding to the first batch sample set;
according to the first batch loss function, performing parameter adjustment operation on the original similar text matching model to obtain a middle similar text matching model;
and repeatedly acquiring a second batch of sample sets corresponding to the target scene based on the intermediate similar text matching model, and executing the vector conversion operation, the triple construction operation, the loss calculation operation and the parameter adjustment operation to obtain the target similar text matching model.
2. The method of claim 1, wherein obtaining a first set of batch samples corresponding to a target scene comprises:
acquiring a target text data set corresponding to the target scene, wherein the target text data set at least comprises the first batch of positive example samples and source text data corresponding to the target scene;
retrieving N first matching texts corresponding to the first batch positive example samples from the target text data set as N first batch negative example samples, wherein N is an integer greater than 1;
calculating the matching scores between the first batch positive example samples and each first batch negative example sample to obtain N first matching scores;
respectively carrying out normalization operation on the N first matching scores to obtain N sample matching scores;
constructing the first batch sample set from the first batch positive example sample, the first batch negative example sample, and the sample match score.
3. The method of claim 2, wherein the performing a triple construction operation on the first batch of positive example sentence vectors to obtain a plurality of first batch triples comprises:
according to the sample matching scores, dividing the first batch of negative example sentence vectors to obtain a homogeneous sentence vector set and a heterogeneous sentence vector set;
extracting any one of the similar sentence vectors in the similar sentence vector set to obtain the first batch of similar sentence vectors;
and extracting any heterogeneous sentence vector in the heterogeneous sentence vector set to obtain the first batch of heterogeneous sentence vectors.
4. The method of claim 1, wherein performing a loss calculation operation on the first batch triples to obtain a first batch loss function corresponding to the first batch sample set comprises:
respectively performing loss calculation operation on the first batch of positive example sentence vectors, the first batch of similar sentence vectors and the first batch of heterogeneous sentence vectors to obtain loss functions corresponding to the plurality of first batch of triples;
and performing weighted calculation operation on the loss functions corresponding to the first batch of triples to obtain the first batch of loss functions.
5. The method according to claim 1, wherein the repeatedly obtaining a second batch of sample sets corresponding to the target scene based on the intermediate similar text matching model and performing the vector transformation operation, the triple construction operation, the loss calculation operation, and the parameter adjustment operation to obtain a target similar text matching model comprises:
acquiring a second batch of sample sets corresponding to the target scene, and executing the vector transformation operation, the triple construction operation and the loss calculation operation according to the second batch of sample sets to obtain a second loss function;
and if the second loss function is smaller than a first threshold value, taking the current intermediate similar text matching model as the target similar text matching model.
6. The method according to claim 1, wherein the repeatedly obtaining a second batch of sample sets corresponding to the target scene based on the intermediate similar text matching model and performing the vector transformation operation, the triple construction operation, the loss calculation operation, and the parameter adjustment operation to obtain a target similar text matching model comprises:
acquiring intermediate model parameters of the intermediate similar text matching model;
obtaining a current similar text matching model after acquiring a second batch of sample sets corresponding to the target scene and executing the vector conversion operation, the triple construction operation and the parameter adjustment operation, wherein the current similar text matching model comprises current model parameters;
and if the difference value between the intermediate model parameter and the current model parameter meets a second threshold value, taking the current intermediate similar text matching model as the target similar text matching model.
7. The method according to claim 1, wherein after repeatedly obtaining a second batch of sample sets corresponding to the target scene based on the intermediate similar text matching model and performing the vector transformation operation, the triple construction operation, the loss calculation operation, and the parameter adjustment operation to obtain the target similar text matching model, the method further comprises:
receiving a text to be matched;
passing the text to be matched and the target text data set respectively through the target similar text matching model to obtain a sentence vector to be matched and a plurality of original sentence vectors;
calculating the similarity between the sentence vector to be matched and each original sentence vector to obtain a plurality of similarity scores;
and determining a target similar text according to the plurality of similarity scores, and pushing the target similar text to the target terminal device.
8. A training apparatus for a similar text matching model, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a first batch sample set corresponding to a target scene, and the first batch sample set comprises a first batch positive example sample and a first batch negative example sample;
the processing unit is used for respectively inputting the first batch of positive example samples and the first batch of negative example samples to an original similar text matching model for vector conversion operation to obtain a first batch of positive example sentence vectors and a first batch of negative example sentence vectors;
the processing unit is further configured to perform triple construction operation on the first batch of positive example sentence vectors to obtain a plurality of first batch of triples, where each first batch of triples includes the first batch of positive example sentence vectors, a first batch of similar sentence vectors, and a first batch of heterogeneous sentence vectors, and the first batch of similar sentence vectors and the first batch of heterogeneous sentence vectors are derived from the first batch of negative example sentence vectors;
the processing unit is further configured to perform a loss calculation operation on the plurality of first batch triples, and obtain a first batch loss function corresponding to the first batch sample set;
the processing unit is further configured to perform parameter adjustment operation on the original similar text matching model according to the first batch loss function to obtain an intermediate similar text matching model;
the processing unit is further configured to repeatedly obtain a second batch of sample sets corresponding to the target scene based on the intermediate similar text matching model, and execute the vector transformation operation, the triple construction operation, the loss calculation operation, and the parameter adjustment operation to obtain a target similar text matching model.
9. A computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor when executing the program in the memory implementing the method of any one of claims 1 to 7;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
10. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7.
11. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method of any of claims 1 to 7.
CN202111436420.0A 2021-11-29 2021-11-29 Training method, device and equipment for similar text matching model and storage medium Pending CN114490923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111436420.0A CN114490923A (en) 2021-11-29 2021-11-29 Training method, device and equipment for similar text matching model and storage medium

Publications (1)

Publication Number Publication Date
CN114490923A true CN114490923A (en) 2022-05-13

Family

ID=81492144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111436420.0A Pending CN114490923A (en) 2021-11-29 2021-11-29 Training method, device and equipment for similar text matching model and storage medium

Country Status (1)

Country Link
CN (1) CN114490923A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658903A (en) * 2022-11-01 2023-01-31 百度在线网络技术(北京)有限公司 Text classification method, model training method, related device and electronic equipment
CN115658903B (en) * 2022-11-01 2023-09-05 百度在线网络技术(北京)有限公司 Text classification method, model training method, related device and electronic equipment
CN116150380A (en) * 2023-04-18 2023-05-23 之江实验室 Text matching method, device, storage medium and equipment
CN116150380B (en) * 2023-04-18 2023-06-27 之江实验室 Text matching method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN111931062B (en) Training method and related device of information recommendation model
CN107436875B (en) Text classification method and device
US9147154B2 (en) Classifying resources using a deep network
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
US9020947B2 (en) Web knowledge extraction for search task simplification
Tong et al. A shilling attack detector based on convolutional neural network for collaborative recommender system in social aware network
CN110019794B (en) Text resource classification method and device, storage medium and electronic device
CN110968684A (en) Information processing method, device, equipment and storage medium
CN110413875A (en) A kind of method and relevant apparatus of text information push
CN109471978B (en) Electronic resource recommendation method and device
CN106354856B (en) Artificial intelligence-based deep neural network enhanced search method and device
CN114490923A (en) Training method, device and equipment for similar text matching model and storage medium
Zubiaga et al. Content-based clustering for tag cloud visualization
CN112084307A (en) Data processing method and device, server and computer readable storage medium
CN113641797A (en) Data processing method, device, equipment, storage medium and computer program product
Vishwakarma et al. A comparative study of K-means and K-medoid clustering for social media text mining
Harakawa et al. accurate and efficient extraction of hierarchical structure ofWeb communities forWeb video retrieval
CN110162769B (en) Text theme output method and device, storage medium and electronic device
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN109819002B (en) Data pushing method and device, storage medium and electronic device
CN110347916B (en) Cross-scene item recommendation method and device, electronic equipment and storage medium
CN116957128A (en) Service index prediction method, device, equipment and storage medium
CN108509449B (en) Information processing method and server
JPWO2012077818A1 (en) Method for determining transformation matrix of hash function, hash type approximate nearest neighbor search method using the hash function, apparatus and computer program thereof
Harakawa et al. Extraction of hierarchical structure of Web communities including salient keyword estimation for Web video retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination