CN112347791B - Method, system, computer equipment and storage medium for constructing text matching model - Google Patents

Info

Publication number
CN112347791B
CN112347791B · Application CN202011235662.9A
Authority
CN
China
Prior art keywords
text
vector
matching model
loss function
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011235662.9A
Other languages
Chinese (zh)
Other versions
CN112347791A (en)
Inventor
赵海林
刘庆宇
魏强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202011235662.9A priority Critical patent/CN112347791B/en
Publication of CN112347791A publication Critical patent/CN112347791A/en
Application granted granted Critical
Publication of CN112347791B publication Critical patent/CN112347791B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention relates to a method, a system, computer equipment and a storage medium for constructing a text matching model. A search text sample set and a video title sample set are input into a feature extraction module, and the output result of the feature extraction module is used as the input of an encoding module to train the text matching model. A semantic interaction module is adopted to supervise the training process of the text matching model, and a first association relationship between the text matching model and the semantic interaction module is determined, where the first association relationship is used for representing the training result of the text matching model. The text matching model is trained with the assistance of the first association relationship until the first association relationship meets the preset convergence condition, at which point training of the text matching model is confirmed to be complete. Compared with a text matching model trained directly by supervised learning without auxiliary modules, the accuracy is higher, and the output result of the text matching model better meets the requirements of users.

Description

Method, system, computer equipment and storage medium for constructing text matching model
Technical Field
The embodiment of the invention relates to the field of videos, in particular to a method, a system, computer equipment and a storage medium for constructing a text matching model.
Background
With the continuous development of science and technology, electronic technology has also advanced rapidly. People can download and install various video or information applications (such as a TengX video or XX bar, etc.) on computer devices such as smart phones and tablet computers to watch videos.
In the prior art, a user acquires a video to watch by entering a search term on an application interface. For example, a representation-based text matching method may be used to obtain search results: a representation extraction model for the search term and the search results is built, and the high-frequency representations of the search term and the search results are extracted directly when a search operation is executed. However, only sentence-level semantic matching is considered when such a model is constructed, so the model's precision is low, and it frequently returns videos the user does not want, or videos of such poor quality that the user's viewing requirements cannot be well satisfied.
Disclosure of Invention
In view of this, in order to solve the above technical problems or part of the technical problems, embodiments of the present invention provide a method, a system, a computer device, and a storage medium for constructing a text matching model.
In a first aspect, an embodiment of the present invention provides a method for constructing a text matching model, where the text matching model includes a feature extraction module and an encoding module, and the method includes:
Inputting a search text sample set and a video title sample set into the feature extraction module, taking the output result of the feature extraction module as the input of the coding module, and training the text matching model; monitoring a training process of the text matching model by adopting the semantic interaction module, and determining a first association relationship between the text matching model and the semantic interaction module, wherein the first association relationship is used for representing a training result of the text matching model; and training the text matching model based on the first association relationship in an auxiliary manner until the first association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
In one possible embodiment, the method further comprises: monitoring the training process of the text matching model by adopting a decoding module and the semantic interaction module, and determining a second association relationship among the text matching model, the semantic interaction module and the decoding module, wherein the second association relationship is used for representing the training result of the text matching model; and training the text matching model based on the second association relationship until the second association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
In one possible implementation manner, the inputting the search text sample set and the video title sample set into the feature extraction module, and taking the output result of the feature extraction module as the input of the encoding module includes:
vectorizing a search text sample set and a video title sample set through the feature extraction module to obtain a first text vector corresponding to the search text sample set and a second text vector corresponding to the video title sample set, wherein the first text vector and the second text vector are used as input vectors of the semantic interaction module;
and encoding the first text vector and the second text vector through the encoding module to obtain a third text vector corresponding to the search text sample set and a fourth text vector corresponding to the video title sample set, wherein the third text vector and the fourth text vector are used as input vectors of the decoding module.
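The two vectorization and encoding steps above can be sketched in a few lines of numpy. This is only an illustrative stand-in, not the patented implementation: the feature extraction module (a BERT-style model, per the description) is replaced by a hypothetical random embedding lookup, and the encoding module is modeled as the three fully connected layers the description mentions.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_extraction(token_ids, emb_table):
    """Stand-in for the feature extraction module (e.g. a BERT-style model):
    maps token ids to a sequence of d-dimensional vectors."""
    return emb_table[token_ids]                # (seq_len, d)

def encode(token_vecs, layers):
    """Encoding module sketched as three fully connected layers with ReLU."""
    h = token_vecs.mean(axis=0)                # pool token vectors into one sentence vector
    for w, b in layers:
        h = np.maximum(w @ h + b, 0.0)         # ReLU
    return h

d = 16
emb_table = rng.normal(size=(100, d))          # hypothetical 100-token vocabulary
layers = [(0.1 * rng.normal(size=(d, d)), np.zeros(d)) for _ in range(3)]

first_vec  = feature_extraction(np.array([1, 5, 9]), emb_table)   # search text sample
second_vec = feature_extraction(np.array([2, 5, 7]), emb_table)   # video title sample
third_vec  = encode(first_vec, layers)         # becomes input of the decoding module
fourth_vec = encode(second_vec, layers)
print(third_vec.shape, fourth_vec.shape)       # (16,) (16,)
```

Here the first and second text vectors feed the semantic interaction module, while the third and fourth text vectors feed the decoding module, matching the roles described above.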
In one possible implementation manner, the training process of the text matching model is supervised by adopting a decoding module and the semantic interaction module, and the determining a second association relationship among the text matching model, the semantic interaction module and the decoding module includes:
Decoding the third text vector and the fourth text vector through the decoding module to obtain a fifth text vector corresponding to the search text sample set and a sixth text vector corresponding to the video title sample set;
determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector;
determining a second association relationship among the third text vector, the fourth text vector, a pre-trained search text vector, a pre-trained video title vector, the fifth text vector, the sixth text vector, the feature matching vector and the feature interaction vector;
the training process of the text matching model is supervised by adopting the semantic interaction module, and a first association relationship between the text matching model and the semantic interaction module is determined, which comprises the following steps:
determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector;
And determining a first association relationship among the third text vector, the fourth text vector, the feature matching vector and the feature interaction vector.
In one possible implementation manner, the determining the second association relationship among the third text vector, the fourth text vector, the pre-trained search text vector, the pre-trained video title vector, the fifth text vector, the sixth text vector, the feature matching vector, and the feature interaction vector includes:
determining a first loss function of matching between the set of search text samples and the set of video title samples based on the third text vector and the fourth text vector;
determining a corresponding second loss function based on a mean square error between the pre-trained search text vector and the fifth text vector;
determining a corresponding third loss function based on a mean square error between the pre-trained video title vector and the sixth text vector;
determining a corresponding fourth loss function based on cross entropy between the feature matching vector and the feature interaction vector;
determining a fifth loss function corresponding to the text matching model based on the first loss function, the second loss function, the third loss function and the fourth loss function, wherein the fifth loss function is used for representing the second association relation;
The determining a first association relationship among the third text vector, the fourth text vector, the feature matching vector and the feature interaction vector includes:
determining a first loss function of matching between the set of search text samples and the set of video title samples based on the third text vector and the fourth text vector;
determining a corresponding fourth loss function based on cross entropy between the feature matching vector and the feature interaction vector;
and determining a sixth loss function corresponding to the text matching model based on the first loss function and the fourth loss function, wherein the sixth loss function is used for representing the first association relation.
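The patent names these loss functions but gives no formulas. The numpy sketch below shows one plausible instantiation under stated assumptions: the first loss as a binary cross entropy over the cosine similarity of the third and fourth text vectors, the second and third losses as mean squared errors against the pre-trained vectors, and the fourth loss as a cross entropy between softmax-normalized feature matching and feature interaction vectors. All vectors here are random placeholders.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(p, q, eps=1e-9):
    return float(-np.sum(p * np.log(q + eps)))

def matching_loss(u, v, label, eps=1e-9):
    """Assumed form of the first loss: binary cross entropy on the cosine
    similarity of the two sentence vectors, rescaled into (0, 1)."""
    cos = float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)
    p = (cos + 1.0) / 2.0
    return float(-(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps)))

rng = np.random.default_rng(1)
third, fourth = rng.normal(size=16), rng.normal(size=16)         # encoder outputs
pre_query, pre_title = rng.normal(size=16), rng.normal(size=16)  # pre-trained vectors
fifth, sixth = rng.normal(size=16), rng.normal(size=16)          # decoder outputs
match_vec, inter_vec = rng.normal(size=8), rng.normal(size=8)

loss1 = matching_loss(third, fourth, label=1)                    # first loss function
loss2 = mse(pre_query, fifth)                                    # second loss function
loss3 = mse(pre_title, sixth)                                    # third loss function
loss4 = cross_entropy(softmax(match_vec), softmax(inter_vec))    # fourth loss function
```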
In one possible implementation manner, the training of the text matching model is assisted based on the second association relationship until the second association relationship meets a preset convergence condition, and the determining that the training of the text matching model is completed includes:
assisting the text matching model to train through the fifth loss function;
if the fifth loss function meets a preset convergence condition, determining that the text matching model training is completed;
If the fifth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the fifth loss function to meet the preset convergence condition;
the training of the text matching model is assisted based on the first association relationship until the first association relationship meets a preset convergence condition, and the determining that the training of the text matching model is completed includes:
assisting the text matching model to train through the sixth loss function;
if the sixth loss function meets a preset convergence condition, determining that the text matching model training is completed;
and if the sixth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the sixth loss function to meet the preset convergence condition.
In one possible implementation manner, the determining, based on the first loss function, the second loss function, the third loss function, and the fourth loss function, a fifth loss function corresponding to the text matching model includes:
setting weight values respectively corresponding to the first loss function, the second loss function, the third loss function and the fourth loss function;
Summing the first loss function, the second loss function, the third loss function and the fourth loss function based on the weight value to obtain a fifth loss function;
the determining a sixth loss function corresponding to the text matching model based on the first loss function and the fourth loss function includes:
setting weight values respectively corresponding to the first loss function and the fourth loss function;
and summing the first loss function and the fourth loss function based on the weight value to obtain a sixth loss function.
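The weighted summations above can be sketched as follows; the weight values are hypothetical, since the patent states only that weights are set, not what they are.

```python
# Hypothetical weight values; the patent does not specify them.
w1, w2, w3, w4 = 1.0, 0.3, 0.3, 0.5

def fifth_loss(l1, l2, l3, l4):
    """Weighted sum of all four losses, representing the second association relationship."""
    return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4

def sixth_loss(l1, l4):
    """Weighted sum of the first and fourth losses, representing the first association relationship."""
    return w1 * l1 + w4 * l4

print(round(fifth_loss(1.0, 2.0, 2.0, 0.5), 2))  # 2.45
print(sixth_loss(1.0, 0.5))                      # 1.25
```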
In one possible embodiment, the method further comprises:
positive samples and negative samples are obtained, wherein the positive samples comprise positive search text samples and positive video title samples with the frequency higher than a set frequency threshold, and the negative samples comprise negative search text samples and negative video title samples which are randomly sampled;
combining the positive search text samples in the positive samples and the negative search text samples in the negative samples into a search text sample set;
and combining the positive video title samples in the positive samples and the negative video title samples in the negative samples into a video title sample set.
In one possible embodiment, the method further comprises:
acquiring a negative search text sample and a negative video title sample from the log file through negative random sampling;
clustering the negative video title samples through a clustering algorithm to obtain a negative video title cluster;
determining the distance between each negative search text sample and the center of each negative video title cluster;
and taking the negative search text sample with the distance smaller than or equal to the set distance threshold value and the corresponding negative video title sample as a negative sample.
In a possible implementation manner, the determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector includes:
determining an inner product of the first text vector and the second text vector to obtain a matching matrix;
convolving and pooling the matching matrix to obtain a feature matching vector between the first text vector and the second text vector;
and determining, through the third text vector and the fourth text vector, a feature interaction vector representing the inter-sentence interaction and matching between the search text sample set and the video title sample set.
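The matching-matrix steps can be illustrated as follows. The kernel size and the number of kernels are assumptions made for the sketch; the patent does not specify the convolution configuration.

```python
import numpy as np

def matching_matrix(query_tokens, title_tokens):
    """Inner products between every search-text token vector and every
    video-title token vector (the matching matrix)."""
    return query_tokens @ title_tokens.T            # (len_q, len_t)

def conv_and_max_pool(m, kernel):
    """Valid 2-D convolution followed by global max pooling: one scalar per kernel."""
    kh, kw = kernel.shape
    rows = m.shape[0] - kh + 1
    cols = m.shape[1] - kw + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(m[i:i + kh, j:j + kw] * kernel)
    return out.max()

rng = np.random.default_rng(2)
first_text_vec = rng.normal(size=(5, 16))     # 5 search-text token vectors
second_text_vec = rng.normal(size=(7, 16))    # 7 video-title token vectors
m = matching_matrix(first_text_vec, second_text_vec)
kernels = [rng.normal(size=(2, 2)) for _ in range(4)]    # 4 assumed 2x2 kernels
feature_matching_vector = np.array([conv_and_max_pool(m, k) for k in kernels])
print(m.shape, feature_matching_vector.shape)            # (5, 7) (4,)
```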
In a second aspect, an embodiment of the present invention provides a device for constructing a text matching model, including:
the feature extraction module is used for inputting the search text sample set and the video title sample set into the feature extraction module;
the coding module is used for taking the output result of the feature extraction module as the input of the coding module;
the supervising module is used for supervising the training process of the text matching model by adopting the decoding module and the semantic interaction module, and determining a first association relationship among the text matching model, the semantic interaction module and the decoding module, wherein the first association relationship is used for representing the training result of the text matching model;
and the training module is used for assisting the text matching model to train based on the first association relation until the first association relation meets a preset convergence condition, and determining that the training of the text matching model is completed.
In a third aspect, an embodiment of the present invention provides a system for constructing a text matching model, including: the system comprises a text matching model, a semantic interaction module and a training module, wherein the text matching model comprises a feature extraction module and a coding module;
The text matching model is used for inputting a search text sample set and a video title sample set into the feature extraction module, taking the output result of the feature extraction module as the input of the coding module, and training the text matching model;
the semantic interaction module is used for supervising the training process of the text matching model;
the training module is used for determining a first association relation between the text matching model and the semantic interaction module, and the first association relation is used for representing a training result of the text matching model;
the training module is further configured to assist the text matching model to train based on the first association relationship until the first association relationship meets a preset convergence condition, and determine that training of the text matching model is completed.
In one possible embodiment, the system further comprises: the decoding module is used for supervising the training process of the text matching model;
the training module is used for determining a second association relationship among the text matching model, the semantic interaction module and the decoding module, and the second association relationship is used for representing a training result of the text matching model; and training the text matching model based on the second association relationship until the second association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
In a fourth aspect, an embodiment of the present invention provides a computer apparatus, including a processor and a memory, where the processor is used for executing a construction program of the text matching model stored in the memory, so as to realize the method for constructing a text matching model according to any one of the first aspect.
In a fifth aspect, an embodiment of the present invention provides a storage medium storing one or more programs, where the one or more programs are executable by one or more processors to implement the method for constructing a text matching model according to any one of the first aspects.
According to the scheme for constructing a text matching model provided by the embodiments of the invention, a semantic interaction module is added during training of the text matching model to supervise the whole training process, and the text matching model is adjusted according to the association relationship between the vectors of the text matching model and the semantic interaction module, and the interaction relationship between the search text samples and the video title samples, observed during training.
Drawings
FIG. 1 is a schematic flow chart of a method for constructing a text matching model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of generating a search text sample set and a video title sample set according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of determining a second association relationship according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating another method for constructing a text matching model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a text matching model according to an embodiment of the present invention;
FIG. 6 is a flowchart of another method for constructing a text matching model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another text matching model according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a device for constructing a text matching model according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a text matching model building system according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For the purpose of facilitating an understanding of the embodiments of the present invention, reference will now be made to the following description of specific embodiments, taken in conjunction with the accompanying drawings, which are not intended to limit the embodiments of the invention.
Fig. 1 is a flow chart of a method for constructing a text matching model according to an embodiment of the present invention, as shown in fig. 1, where the method specifically includes:
s11, inputting a search text sample set and a video title sample set into the feature extraction module, taking the output result of the feature extraction module as the input of the coding module, and training the text matching model.
According to the method for constructing the text matching model, a semantic interaction module is added during training of the text matching model so that the training process is supervised by the semantic interaction module and training is completed with its assistance; the trained text matching model thereby has the capability of interactive matching and can be applied in the search field.
Further, the semantic interaction module added during training provides low-level interaction capability: the trained text matching model can segment the input data (such as keywords or a passage of text) in order to reduce the text length, and then calculate the matching degree of the segmented low-level text.
The text matching model comprises a feature extraction module and an encoding module (e.g., an Encoder), where the output of the feature extraction module is connected to the input of the encoding module. The feature extraction module may be a Bert model, an Albert model, or the like; the encoding module may consist of three fully connected layers; and the training process of the text matching model is supervised by the semantic interaction module.
Further, a set of search text samples and a set of video title samples are obtained from the historical data, each search text sample in the set of search text samples having a corresponding video title sample in the set of video title samples.
The search text sample set and the video title sample set serve as the input of the feature extraction module in the text matching model, the output result of the feature extraction module serves as the input of the encoding module in the text matching model, and the text matching model is trained on the search text sample set and the video title sample set.
S12, supervising the training process of the text matching model by adopting the semantic interaction module, and determining a first association relation between the text matching model and the semantic interaction module, wherein the first association relation is used for representing the training result of the text matching model.
In the embodiment of the invention, the feature extraction module and the encoding module in the text matching model are each connected to the semantic interaction module. The semantic interaction module acts as a supervision module: it supervises the training process of the text matching model and determines a corresponding supervision criterion. The criterion may be a first association relationship between the text matching model and the semantic interaction module, determined either directly or indirectly from the search text sample set and the video title sample set, and the first association relationship may be expressed in the form of a function.
The first association relationship is used for representing the training result of the text matching model; it can also be understood as feedback on the training result when the semantic interaction module assists the training of the text matching model, and whether the text matching model has been successfully trained is judged through the first association relationship.
S13, training the text matching model based on the first association relationship, and determining that training of the text matching model is completed until the first association relationship meets a preset convergence condition.
Training is assisted by the first association relationship, and whether the first association relationship meets a preset convergence condition serves as the criterion for whether training of the text matching model is complete: if the function value corresponding to the functional relationship between the text matching model and the semantic interaction module meets the preset convergence condition, training of the text matching model is determined to be complete; if it does not, the text matching model is adjusted and model training continues.
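The convergence check in S13 can be sketched as a generic training loop; `train_step` and `loss_fn` here are hypothetical stand-ins for the parameter adjustment and for the function value of the first association relationship.

```python
def train_until_converged(params, loss_fn, train_step, tol=1e-3, max_steps=1000):
    """Keep adjusting the operating parameters until the change in the
    (assumed) association-relationship function value drops below tol."""
    prev = float("inf")
    for _ in range(max_steps):
        params = train_step(params)        # adjust the operating parameters
        loss = loss_fn(params)
        if abs(prev - loss) < tol:         # preset convergence condition met
            return params, True            # training determined to be complete
        prev = loss
    return params, False                   # not converged: training would continue

# Toy usage: minimise (x - 3)^2 by gradient descent.
loss_fn = lambda x: (x - 3.0) ** 2
train_step = lambda x: x - 0.1 * 2.0 * (x - 3.0)
params, done = train_until_converged(0.0, loss_fn, train_step)
print(round(params, 2), done)              # 2.97 True
```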
According to the method for constructing a text matching model provided by the embodiment of the invention, a semantic interaction module is added during training to supervise the whole training process, and the text matching model is adjusted according to the association relationship between the vectors of the semantic interaction module and the text matching model, and the interaction relationship between the search text samples and the video title samples, observed during training. Compared with a text matching model trained directly by supervised learning without auxiliary modules, the accuracy is higher, and the output result of the text matching model better meets the requirements of users.
Fig. 2 is a schematic flow chart of generating a search text sample set and a video title sample set according to an embodiment of the present invention, where as shown in fig. 2, the method specifically includes:
s21, acquiring a positive sample and a negative sample.
The training samples of the text matching model in the embodiment of the invention are divided into positive samples and negative samples. Both are obtained from a log file in which historical data of users searching for and watching videos are recorded, including historical input texts and historically clicked video titles.
A positive sample contains a positive search text sample and a positive video title sample, which may be obtained as follows: obtain from the log file the search texts (called queries) entered by users within a set time period (for example, one week), together with the corresponding video titles (called docs), where each video title is a title among the videos the user clicked for that search text, yielding search text-video title pairs; count the search text-video title pairs appearing in the log file within the set period, and take the positive search text samples and positive video title samples whose frequency is higher than a set frequency threshold (e.g., 10) as positive samples.
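The frequency filtering described above can be sketched as follows; the log format (a list of query-title pairs) and the sample queries are assumptions for the example, while the threshold of 10 comes from the text.

```python
from collections import Counter

def build_positive_samples(click_log, freq_threshold=10):
    """Keep the (query, doc) pairs whose click frequency exceeds the threshold.
    click_log is an assumed list of (search_text, video_title) pairs from the log file."""
    counts = Counter(click_log)
    return [pair for pair, n in counts.items() if n > freq_threshold]

log = [("cartoon movie", "Title A")] * 12 + [("rare query", "Title B")] * 2
print(build_positive_samples(log))   # [('cartoon movie', 'Title A')]
```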
Adopting as positive samples only the positive search text samples and positive video title samples whose frequency is higher than the set frequency threshold, rather than directly adopting all search text-video title pairs clicked by users, filters out low-frequency search text-video title pairs after frequency screening, so that the positive samples are more representative.
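The counting-and-threshold rule above can be sketched as follows; the log layout (an iterable of query-title tuples) and the function name are hypothetical, and only the frequency-filtering rule comes from the description:

```python
from collections import Counter

def build_positive_samples(log_records, freq_threshold=10):
    """Count (search text, clicked video title) pairs from a click log and
    keep only the pairs whose click frequency exceeds the threshold as
    (positive search text sample, positive video title sample) pairs."""
    pair_counts = Counter(log_records)
    return [pair for pair, count in pair_counts.items() if count > freq_threshold]
```

A pair clicked twelve times in the window survives a threshold of 10; a pair clicked three times is filtered out as low-frequency.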
The negative samples comprise negative search text samples and negative video title samples. Hard negative mining is performed on the negatively random-sampled data, and the harder data are taken as hard negative samples, specifically comprising: acquiring negative search text samples and negative video title samples from the log file through negative random sampling; clustering the negative video title samples through a clustering algorithm to obtain negative video title clusters; determining the distance between each negative search text sample and the center of the negative video title cluster; and taking the negative search text samples whose distance is smaller than or equal to a set distance threshold, together with the corresponding negative video title samples, as negative samples.
Further, the number of clusters is set (e.g., 500), and the negative video title samples obtained by negative random sampling are clustered through a k-means clustering algorithm to obtain a plurality of negative video title clusters. The distance from each negative search text sample obtained by negative random sampling to the center of each negative video title cluster is then calculated; this distance characterizes how difficult it is to distinguish the negative search text sample from the negative video title samples in the corresponding cluster, with a smaller distance indicating a harder negative sample. Training on harder negatives makes the text matching model more accurate and more adaptable, so the negative search text samples whose distance is smaller than or equal to the set distance threshold, together with the corresponding negative video title samples, are taken as negative samples.
Through negative random sampling and selecting the harder samples as negative samples, the complexity of the negative samples is increased, which helps the model learn finer-grained distinctions during training and improves the accuracy of the text matching model.
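A minimal sketch of the hard-negative mining above, assuming toy 2-D embedding vectors and a hand-rolled k-means loop (a real pipeline would run a full k-means, e.g. 500 clusters over title embeddings); only the cluster-then-threshold-by-distance rule comes from the text:

```python
import numpy as np

def mine_hard_negatives(query_vecs, title_vecs, n_clusters=2,
                        dist_threshold=1.0, n_iter=20, seed=0):
    """Hard negative mining: cluster the negative video title vectors with a
    minimal k-means, then keep the pairs whose negative search text vector
    lies within dist_threshold of its paired title's cluster center
    (smaller distance = harder negative). Returns the kept pair indices."""
    rng = np.random.default_rng(seed)
    centers = title_vecs[rng.choice(len(title_vecs), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign every title vector to its nearest cluster center
        dists = np.linalg.norm(title_vecs[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centers (skip a cluster that received no members)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = title_vecs[labels == k].mean(axis=0)
    hard = []
    for i, (q, lbl) in enumerate(zip(query_vecs, labels)):
        if np.linalg.norm(q - centers[lbl]) <= dist_threshold:
            hard.append(i)
    return hard
```

With two well-separated title clusters, only the query-title pairs whose query sits near its paired title's cluster center are kept as hard negatives.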
S22, combining the positive search text samples in the positive samples and the negative search text samples in the negative samples into a search text sample set.
S23, combining the positive video title samples in the positive samples and the negative video title samples in the negative samples into a video title sample set.
In this embodiment, the positive and negative samples are obtained in the form of search text sample-video title sample pairs. During training of the text matching model, the positive and negative search text samples are combined, and the positive and negative video title samples are combined, so that a search text sample set and a video title sample set are obtained; adding supervised training of semantic interaction improves the text matching model's handling of low-level interaction, thereby improving the accuracy of the text matching model.
In an alternative scheme of the embodiment of the invention, a decoding module (Decoder) and a semantic interaction module may be adopted to assist in supervising the training process of the text matching model. A pre-trained vector is arranged in the decoding module, so that prior knowledge can be fused into the training process of the text matching model according to the pre-trained vector; the semantic interaction module supervises the training process and improves the interactive capability of the text matching model. Through the decoding module and the semantic interaction module, the purpose of improving the accuracy of the text matching model is achieved.
Fig. 3 is a flow chart of another method for constructing a text matching model according to an embodiment of the present invention, as shown in fig. 3, where the method specifically includes:
S31, acquiring a positive sample and a negative sample.
The positive sample contains a positive search text sample and a positive video title sample, which may be acquired as follows: obtain from the log file the search texts (which may be called queries) input by users within a set time period (for example, one week) and the corresponding video titles (which may be called docs), where each video title is one of the video titles clicked by the user for that search text, thereby obtaining search text-video title pairs; count the search text-video title pairs in the log file within the set time period, and take the search text samples and video title samples whose frequency is higher than a set frequency threshold (e.g., 10) as positive samples.
The negative sample comprises a negative search text sample and a negative video title sample; hard negative mining is performed on the negatively random-sampled data, and the harder data are taken as negative samples, specifically comprising: acquiring negative search text samples and negative video title samples from the log file through negative random sampling; clustering the negative video title samples through a clustering algorithm to obtain negative video title clusters; determining the distance between each negative search text sample and the center of the negative video title cluster; and taking the negative search text samples whose distance is smaller than or equal to a set distance threshold, together with the corresponding negative video title samples, as negative samples. Selecting these harder pairs increases the difficulty of the negative samples and thereby improves the accuracy of the text matching model training result.
S32, combining the positive search text samples in the positive samples and the negative search text samples in the negative samples into a search text sample set.
S33, combining the positive video title samples in the positive samples and the negative video title samples in the negative samples into a video title sample set.
S34, vectorizing the search text sample set and the video title sample set through the feature extraction module to obtain a first text vector corresponding to the search text sample set and a second text vector corresponding to the video title sample set.
S35, encoding the first text vector and the second text vector through the encoding module to obtain a third text vector corresponding to the search text sample set and a fourth text vector corresponding to the video title sample set.
As shown in fig. 4, a schematic structural diagram of a text matching model is shown, where a decoding module and a semantic interaction module are used to supervise training of the text matching model. The text matching model includes an initial model to be trained; the corresponding initial model may be an Albert model plus an encoding module, where the encoding module (Encoder) comprises a 3-layer double-tower fully connected structure (two branches: one branch processes the search text sample set and the other branch processes the video title sample set). The Albert model receives the search text sample set and the video title sample set, and outputs the first text vector and the second text vector of the search text and the video title, where the first text vector is the feature representation vector corresponding to the search text and the second text vector is the feature representation vector of the video title. The first text vector and the second text vector serve as inputs of the encoding module, so that the encoding module outputs the third text vector corresponding to the search text sample set and the fourth text vector corresponding to the video title sample set, where the third text vector is the embedding vector corresponding to the search text and the fourth text vector is the embedding vector of the video title.
Further, the obtained first text vector and second text vector can be used as input vectors of the semantic interaction module, which improves the inter-text interaction capability of the text matching model. The obtained third text vector and fourth text vector can be used as input vectors of the decoding module, which compares the vectors output by the text matching model with the pre-trained vectors so that they become consistent with the pre-trained vectors. In this way the text matching model can incorporate prior knowledge, and the purpose of improving the semantic matching accuracy of the text matching model through the decoding module is achieved.
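The double-tower encoding step can be sketched as follows; the tanh nonlinearity, the dimensions, and all names are illustrative assumptions, since the description only specifies a 3-layer fully connected branch per tower:

```python
import numpy as np

def tower(x, weights):
    """One branch of the double tower: three fully connected layers mapping a
    feature representation vector to an embedding vector."""
    for w in weights:
        x = np.tanh(x @ w)  # nonlinearity is an assumption; the text only says "fully connected"
    return x

rng = np.random.default_rng(0)
# Illustrative shapes: 8-dim feature vectors in, 4-dim embeddings out.
w_query = [rng.standard_normal(s) for s in [(8, 8), (8, 8), (8, 4)]]
w_title = [rng.standard_normal(s) for s in [(8, 8), (8, 8), (8, 4)]]
first_text_vector = rng.standard_normal(8)   # stand-in for the Albert output for a search text
second_text_vector = rng.standard_normal(8)  # stand-in for the Albert output for a video title
third_text_vector = tower(first_text_vector, w_query)    # embedding e_Q
fourth_text_vector = tower(second_text_vector, w_title)  # embedding e_D
```

Each branch has its own weights, matching the two-branch (query/title) layout described above.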
S36, decoding the third text vector and the fourth text vector through the decoding module to obtain a fifth text vector corresponding to the search text sample set and a sixth text vector corresponding to the video title sample set.
In the embodiment of the invention, the decoding module (Decoder) comprises two parts: a pre-trained vector and a fully connected layer. The pre-trained vector can be obtained by training on the user's historical input search text sequences and historical clicked video title sequences through a preset model, where the preset model may be a Word2vec model, a Keras model, or the like. The fully connected layer in the decoding module is connected with the last fully connected layer of the encoding module (Encoder) in the text matching model.
Further, the decoding module is added in the text matching model training process to increase sentence-level semantic matching capability without reducing the queries per second (Queries Per Second, QPS) of the text matching model in online operation.
In an alternative scheme of the embodiment of the present invention, as shown in fig. 4, the last fully connected layer of the encoding module in the text matching model is connected to the decoding module, where the decoding module includes a fully connected layer and a pre-trained vector (the pre-trained vector may be a Word2vec vector). The decoding process performs a nonlinear transformation on the third text vector and the fourth text vector, and the decoding target is that the decoded vector be consistent with the pre-trained vector, so as to obtain a fifth text vector corresponding to the search text sample set and a sixth text vector corresponding to the video title sample set.
S37, determining a feature matching vector between the first text vector and the second text vector and a feature interaction vector between the third text vector and the fourth text vector through the semantic interaction module.
As shown in fig. 4, the semantic interaction module mainly calculates a Character-level feature interaction and a Sentence-level feature of the vector, wherein the input of the Character-level feature interaction is the output result of the feature extraction module in the text matching model, namely, a first text vector and a second text vector, and the input of the Sentence-level feature is the output result of the coding module in the text matching model, namely, a third text vector and a fourth text vector.
Taking the feature extraction module being an Albert model and the encoding module being three fully connected layers as an example: according to the first text vector and the second text vector obtained from the Albert model, the inner product of the Character-level representations between the search text sample set and the video title sample set is calculated, a matching matrix of the search text sample set and the video title sample set is obtained through matrix multiplication, and the feature matching vector between the first text vector and the second text vector is obtained by passing the matching matrix through two convolution layers and a maximum pooling layer; and from the third text vector and the fourth text vector output by the last fully connected layer of the encoding module, the Sentence-level feature interaction vector between the search text sample set and the video title sample set is calculated, namely [e_Q, e_D, e_Q*e_D, |e_Q - e_D|].
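The two interaction features can be sketched as follows; shapes and names are illustrative, and the Character-level sketch stops at the matching matrix, before the convolution and pooling layers:

```python
import numpy as np

def matching_matrix(q_chars, d_chars):
    """Character-level matching matrix: inner products between every query
    character representation and every title character representation."""
    return q_chars @ d_chars.T

def sentence_level_features(e_q, e_d):
    """Sentence-level feature interaction vector between a query embedding e_Q
    and a title embedding e_D: [e_Q, e_D, e_Q * e_D, |e_Q - e_D|]."""
    return np.concatenate([e_q, e_d, e_q * e_d, np.abs(e_q - e_d)])
```

For 2-dim embeddings the sentence-level vector is 8-dim: the two embeddings, their elementwise product, and their elementwise absolute difference.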
S38, determining a second association relationship among the third text vector, the fourth text vector, a pre-trained search text vector, a pre-trained video title vector, the fifth text vector, the sixth text vector, the feature matching vector and the feature interaction vector.
In this embodiment, the second association relationship may characterize the effect/standard of the decoding module and the semantic interaction module on performing supervised learning on the text matching model, where the second association relationship is used to assist multi-objective joint learning in constructing the text matching model.
For example, the second association relationship may be: the matching degree of the third text vector and the fourth text vector, the mean square error between the pre-trained search text vector and the fifth text vector, the mean square error between the pre-trained video title vector and the sixth text vector, and the cross entropy between the feature matching vector and the feature interaction vector.
In an embodiment of the present invention, fig. 5 is a schematic flow chart for determining a second association relationship according to an embodiment of the present invention, as shown in fig. 5, and specifically includes:
S51, determining a first loss function of the matching degree between the search text sample set and the video title sample set based on the third text vector and the fourth text vector.
In this embodiment, the text matching model includes a feature extraction module plus an encoding module, where the feature extraction module may be the Albert model, and the encoding module comprises a 3-layer double-tower fully connected structure (two branches: one branch processes the search text sample set and the other branch processes the video title sample set).
The Albert model receives a set of search text samples (positive search text samples and negative search text samples) and a set of video title samples (positive video title samples and negative video title samples), and outputs first text vectors and second text vectors of the search text and the video title, the first text vectors being feature representation vectors corresponding to the search text and the second text vectors being feature representation vectors of the video title. The first text vectors and the second text vectors are inputs to the encoding module, such that the encoding module outputs a third text vector corresponding to the set of search text samples and a fourth text vector corresponding to the set of video title samples, the third text vector being the embedding vector corresponding to the search text and the fourth text vector being the embedding vector of the video title.
Constructing a first loss function for the matching of search texts and video titles through a triplet loss, specifically:

Triplet loss(e_Q, e_Q^-, e_D^+, e_D^-) = max(0, d(e_Q, e_D^+) - d(e_Q, e_D^-) + α)

where e_Q is the embedding vector representation of the search text sample, e_D^+ is the embedding vector representation of the positive video title sample, e_Q^- is the embedding vector representation of the negative search text sample, e_D^- is the embedding vector representation of the negative video title sample, d(·,·) denotes the distance between embedding vectors, and α is a constant taking the value 1.0.
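A toy implementation of a margin-based triplet loss consistent with the operands listed above might look like this; the max-margin form and the Euclidean distance are assumptions, while the operands and α = 1.0 come from the description:

```python
import numpy as np

def triplet_loss(e_q, e_d_pos, e_d_neg, alpha=1.0):
    """Margin-based triplet loss sketch: push the query embedding closer to
    the positive title embedding than to the negative title embedding by a
    margin alpha; zero loss once the margin is satisfied."""
    d_pos = np.linalg.norm(e_q - e_d_pos)  # distance to the positive title
    d_neg = np.linalg.norm(e_q - e_d_neg)  # distance to the negative title
    return max(0.0, d_pos - d_neg + alpha)
```

When the positive title coincides with the query and the negative is far away, the loss vanishes; swapping them yields a positive penalty.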
S52, determining a corresponding second loss function based on the mean square error between the pre-trained search text vector and the fifth text vector.
S53, determining a corresponding third loss function based on the mean square error between the pre-trained video title vector and the sixth text vector.
In this embodiment, the pre-trained search text vector is obtained by training the Word2vec model on the user's historical input search text sequences, and the pre-trained video title vector is obtained by training the Word2vec model on the user's historical clicked video title sequences. The decoding module is added to decode the search text sample set and the video title sample set, obtaining a fifth text vector corresponding to the search text sample set and a sixth text vector corresponding to the video title sample set.
The pre-trained search text vector may be a pre-trained training search text Word2vec vector and the pre-trained video title vector may be a pre-trained video title Word2vec vector.
The second loss function (L2 loss) is obtained by calculating the sum of squared differences between the pre-trained search text Word2vec vector and the fifth text vector, and the third loss function (L2 loss) is obtained by calculating the sum of squared differences between the pre-trained video title Word2vec vector and the sixth text vector; both are used as regression loss functions.
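The sum-of-squared-differences form described above can be sketched as:

```python
import numpy as np

def l2_loss(pretrained_vec, decoded_vec):
    """Regression loss between a pre-trained Word2vec vector and the decoder
    output: the sum of squared differences, as used for the second and third
    loss functions."""
    return float(np.sum((pretrained_vec - decoded_vec) ** 2))
```

The same function serves both the search-text pair (second loss) and the video-title pair (third loss).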
It should be noted that, in this embodiment, besides obtaining the pre-trained search text vector and the pre-trained video title vector by using the Word2vec model, other models may be used to replace the Word2vec model to obtain the two vectors, for example, obtaining the pre-trained search text vector by using the Keras model based on the input search text sequence of the user history, obtaining the pre-trained video title vector by using the Keras model based on the click video title sequence of the user history, and setting the model type specifically adopted according to the actual requirement.
S54, determining a corresponding fourth loss function based on the cross entropy between the feature matching vector and the feature interaction vector.
In this embodiment, the semantic interaction module employs Character-level feature interactions and Sentence-level feature interactions to construct a cross entropy loss function between the search text sample set and the video title sample set.
Further, the inner product of the first text vector and the second text vector is determined to obtain the matching matrix; the matching matrix is convolved and pooled to obtain the feature matching vector between the first text vector and the second text vector; and the feature interaction vector characterizing the inter-sentence interaction and matching between the search text sample set and the video title sample set is determined through the third text vector and the fourth text vector.
Specifically, according to the first text vector and the second text vector obtained from the Albert model, the inner product of the Character-level representations between the search text sample set and the video title sample set is obtained, the matching matrix of the search text sample set and the video title sample set is obtained through matrix multiplication, and the feature matching vector between the first text vector and the second text vector is obtained by passing the matching matrix through two convolution layers and a maximum pooling layer; and from the third text vector and the fourth text vector, the Sentence-level feature interaction vector between the search text sample set and the video title sample set is calculated, namely [e_Q, e_D, e_Q*e_D, |e_Q - e_D|].
The feature matching vector and the feature interaction vector are passed through a fully connected layer to obtain the matching-based and interaction-based prediction probability logits, and cross entropy modeling over these logits yields the fourth loss function (CE loss).
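One possible reading of the fourth loss, sketched under stated assumptions: concatenate the two vectors, project to a single logit through a fully connected layer, and apply binary cross entropy against the match label (the binary formulation, the single-logit layer shape, and all names are assumptions):

```python
import numpy as np

def fourth_loss(match_vec, interact_vec, w, label):
    """Hedged CE-loss sketch: fully connected projection of the concatenated
    feature matching vector and feature interaction vector to a logit, then
    binary cross entropy against label (1 = clicked pair, 0 = negative)."""
    z = np.concatenate([match_vec, interact_vec]) @ w  # fully connected layer -> logit
    p = 1.0 / (1.0 + np.exp(-z))                       # sigmoid probability
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)))
```

With zero weights the logit is 0, the probability is 0.5, and the loss equals ln 2 regardless of the label.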
S55, determining a fifth loss function corresponding to the text matching model based on the first loss function, the second loss function, the third loss function and the fourth loss function.
Specifically, weight values respectively corresponding to the first loss function, the second loss function, the third loss function and the fourth loss function are set; and summing the first loss function, the second loss function, the third loss function and the fourth loss function based on the weight value to obtain a fifth loss function, wherein the fifth loss function is used for representing a second association relation.
The fifth loss function may be:
L = α * Triplet loss(e_Q, e_Q^-, e_D^+, e_D^-) + β * (L2 loss(e″_Q, e′_Q) + L2 loss(e″_D, e′_D)) + γ * CE loss(Q, D)

where Triplet loss is the first loss function and α is a constant taking the value 1; L2 loss denotes the second and third loss functions and β is a constant taking the value 0.5; CE loss is the fourth loss function and γ is a constant taking the value 0.5.
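The weighted sum itself is straightforward; with the stated default weights:

```python
def fifth_loss(triplet, l2_q, l2_d, ce, alpha=1.0, beta=0.5, gamma=0.5):
    """Weighted multi-objective loss:
    L = alpha * Triplet + beta * (L2_Q + L2_D) + gamma * CE,
    with the default weights alpha = 1, beta = 0.5, gamma = 0.5."""
    return alpha * triplet + beta * (l2_q + l2_d) + gamma * ce
```

For example, a triplet loss of 2, two L2 losses of 1 each, and a CE loss of 4 combine to 2 + 0.5*2 + 0.5*4 = 5.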
S39, training the text matching model through the fifth loss function.
Whether the fifth loss function converges is taken as the criterion for whether the text matching model has finished training, where convergence of the fifth loss function may be that the function value of the fifth loss function no longer decreases.
S310, judging whether the fifth loss function converges.
S311, if the fifth loss function meets a preset convergence condition, determining that the text matching model training is completed.
S312, if the fifth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the fifth loss function to meet the preset convergence condition.
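Steps S39-S312 amount to a loop that stops once the loss stops decreasing; a minimal sketch, where the one-update-step interface `step_fn` is hypothetical:

```python
def train_until_converged(step_fn, max_steps=100, tol=1e-6):
    """Run training steps until the loss no longer decreases (by more than
    tol), i.e. the convergence criterion described above. step_fn performs
    one parameter update and returns the current loss value."""
    prev = float("inf")
    for i in range(max_steps):
        loss = step_fn()
        if prev - loss <= tol:  # loss stopped decreasing -> converged
            return i, loss
        prev = loss
    return max_steps, loss
```

A loss sequence 5, 3, 2, 2 converges at the fourth step, when the value first fails to decrease.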
In this embodiment, the difficulty of the model training process is increased by adopting a clustering algorithm to screen hard negative samples from the negative samples. The positive and negative samples are obtained in the form of search text sample-video title sample pairs; during training of the text matching model, the positive and negative search text samples are combined, and the positive and negative video title samples are combined, so that a search text sample set and a video title sample set are obtained. Mixing positive/negative search text samples and positive/negative video title samples increases the difficulty of the training samples, and the added supervised training of semantic interaction improves the text matching model's handling of low-level interaction, so the accuracy of the text matching model can be improved.
In an alternative scheme of the embodiment of the invention, only the semantic interaction module is adopted to assist in supervising the training process of the text matching model; the semantic interaction module supervises the training process and improves the interactive capability of the text matching model, thereby achieving the purpose of improving the accuracy of the text matching model.
Fig. 6 is a flow chart of another method for constructing a text matching model according to an embodiment of the present invention, as shown in fig. 6, where the method specifically includes:
S61, acquiring a positive sample and a negative sample.
S62, combining the positive search text samples in the positive samples and the negative search text samples in the negative samples into a search text sample set.
S63, combining the positive video title samples in the positive samples and the negative video title samples in the negative samples into a video title sample set.
S64, vectorizing the search text sample set and the video title sample set through the feature extraction module to obtain a first text vector corresponding to the search text sample set and a second text vector corresponding to the video title sample set.
S65, encoding the first text vector and the second text vector through the encoding module to obtain a third text vector corresponding to the search text sample set and a fourth text vector corresponding to the video title sample set.
As shown in fig. 7, a schematic structural diagram of a text matching model is shown, where the semantic interaction module is used to supervise training of the text matching model. The text matching model includes an initial model to be trained; the corresponding initial model may be an Albert model plus an encoding module, where the encoding module comprises a 3-layer double-tower fully connected structure (two branches: one branch processes the search text sample set and the other branch processes the video title sample set). The Albert model receives the search text sample set and the video title sample set, and outputs the first text vector and the second text vector of the search text and the video title, where the first text vector is the feature representation vector corresponding to the search text and the second text vector is the feature representation vector of the video title. The first text vector and the second text vector serve as inputs of the encoding module, so that the encoding module outputs the third text vector corresponding to the search text sample set and the fourth text vector corresponding to the video title sample set, where the third text vector is the embedding vector corresponding to the search text and the fourth text vector is the embedding vector of the video title.
Further, the obtained first text vector and the second text vector can be used as input vectors of a semantic interaction module, and interaction capability among texts of the text matching model is improved through the semantic interaction module.
S66, determining a feature matching vector between the first text vector and the second text vector and a feature interaction vector between the third text vector and the fourth text vector through the semantic interaction module.
S61-S66 are similar to S31-S35 and S37 in fig. 3; reference is made to the above description related to fig. 3, which will not be repeated here.
As shown in fig. 7, the semantic interaction module mainly calculates a Character-level feature interaction and a Sentence-level feature of the vector, wherein the input of the Character-level feature interaction is the output result of the feature extraction module in the text matching model, namely, a first text vector and a second text vector, and the input of the Sentence-level feature is the output result of the coding module in the text matching model, namely, a third text vector and a fourth text vector.
Taking the feature extraction module being an Albert model and the encoding module being three fully connected layers as an example: according to the first text vector and the second text vector obtained from the Albert model, the inner product of the Character-level representations between the search text sample set and the video title sample set is calculated, a matching matrix of the search text sample set and the video title sample set is obtained through matrix multiplication, and the feature matching vector between the first text vector and the second text vector is obtained by passing the matching matrix through two convolution layers and a maximum pooling layer; and from the third text vector and the fourth text vector output by the last fully connected layer of the encoding module, the Sentence-level feature interaction vector between the search text sample set and the video title sample set is calculated, namely [e_Q, e_D, e_Q*e_D, |e_Q - e_D|].
S67, determining a first association relationship among the third text vector, the fourth text vector, the feature matching vector and the feature interaction vector.
In this embodiment, the first association relationship may characterize the effect/standard of the semantic interaction module on performing supervised learning on the text matching model, where the first association relationship is used to assist multi-objective joint learning in constructing the text matching model.
For example, the first association relationship may be: the matching degree of the third text vector and the fourth text vector, and the cross entropy between the feature matching vector and the feature interaction vector.
In the embodiment of the present invention, determining a first association relationship among the third text vector, the fourth text vector, the feature matching vector and the feature interaction vector specifically includes:
determining a first loss function of matching between the set of search text samples and the set of video title samples based on the third text vector and the fourth text vector; determining a corresponding fourth loss function based on cross entropy between the feature matching vector and the feature interaction vector; and determining a sixth loss function corresponding to the text matching model based on the first loss function and the fourth loss function, wherein the sixth loss function is used for representing the first association relation.
The steps of determining the first loss function and the fourth loss function are similar to the principles of S51 and S54, and reference is made to the description related to fig. 5, which is not repeated here.
Further, setting weight values respectively corresponding to the first loss function and the fourth loss function; and summing the first loss function and the fourth loss function based on the weight value to obtain a sixth loss function.
The sixth loss function may be:
L = α * Triplet loss(e_Q, e_Q^-, e_D^+, e_D^-) + γ * CE loss(Q, D)

where Triplet loss is the first loss function and α is a constant taking the value 1; CE loss is the fourth loss function and γ is a constant taking the value 0.5.
S68, training the text matching model through the sixth loss function.
Whether the sixth loss function converges is taken as the criterion for whether the text matching model has finished training, where convergence of the sixth loss function may be that the function value corresponding to the sixth loss function no longer decreases.
S69, judging whether the sixth loss function converges.
And S610, if the sixth loss function meets a preset convergence condition, determining that the text matching model training is completed.
S611, if the sixth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the sixth loss function to meet the preset convergence condition.
S68-S611 are similar to S39-S312 in fig. 3, and reference is made to the above description related to fig. 3, and will not be repeated here.
In this embodiment, the difficulty of the model training process is increased by adopting a clustering algorithm to screen hard negative samples from the negative samples. The positive and negative samples are obtained in the form of search text sample-video title sample pairs; during training of the text matching model, the positive and negative search text samples are combined, and the positive and negative video title samples are combined, so that a search text sample set and a video title sample set are obtained. Mixing positive/negative search text samples and positive/negative video title samples increases the difficulty of the training samples, and the added supervised training of semantic interaction improves the text matching model's handling of low-level interaction, so the accuracy of the text matching model can be improved.
Fig. 8 is a schematic structural diagram of a device for constructing a text matching model according to an embodiment of the present invention, where, as shown in fig. 8, the structure specifically includes:
a feature extraction module 81 for inputting a set of search text samples and a set of video title samples into the feature extraction module;
An encoding module 82, configured to take an output result of the feature extraction module as an input of the encoding module;
a supervision module 83, configured to supervise a training process of the text matching model by using the semantic interaction module, and determine a first association relationship between the text matching model and the semantic interaction module, where the first association relationship is used to characterize a training result of the text matching model;
the training module 84 is configured to assist the text matching model to train based on the first association relationship until the first association relationship meets a preset convergence condition, and determine that training of the text matching model is completed.
In a possible implementation manner, the supervision module 83 is further configured to supervise the training process of the text matching model by using the decoding module and the semantic interaction module, and determine a second association relationship among the text matching model, the semantic interaction module and the decoding module, where the second association relationship is used to characterize a training result of the text matching model;
the training module 84 is further configured to assist the text matching model to train based on the second association relationship until the second association relationship meets a preset convergence condition, and determine that training of the text matching model is completed.
The text matching model construction device provided in this embodiment may be a text matching model construction device as shown in fig. 8, and may perform all steps of the text matching model construction method shown in fig. 1-6, so as to achieve the technical effects of the text matching model construction method shown in fig. 1-6, and refer to the related description of fig. 1-6, which is omitted herein for brevity.
Fig. 9 is a schematic structural diagram of a system for constructing a text matching model according to an embodiment of the present invention, where, as shown in fig. 9, the structure specifically includes:
a text matching model 91, configured to input a search text sample set and a video title sample set into the feature extraction module, and train the text matching model by using an output result of the feature extraction module as an input of the encoding module;
the semantic interaction module 92 is configured to supervise a training process of the text matching model;
a training module 93, configured to determine a first association between the text matching model and the semantic interaction module, where the first association is used to characterize a training result of the text matching model;
the training module 93 is further configured to assist the text matching model to train based on the first association relationship until the first association relationship meets a preset convergence condition, and determine that training of the text matching model is completed.
In one possible embodiment, the system further comprises: a decoding module 94, configured to supervise a training process of the text matching model;
the training module 93 is further configured to determine a second association relationship between the text matching model, the semantic interaction module, and the decoding module, where the second association relationship is used to characterize a training result of the text matching model;
the training module 93 is further configured to assist the text matching model to train based on the second association relationship until the second association relationship meets a preset convergence condition, and determine that training of the text matching model is completed.
The text matching model construction system provided in this embodiment may be a text matching model construction system as shown in fig. 9, and may perform all steps of the text matching model construction method shown in fig. 1-6, so as to achieve the technical effects of the text matching model construction method shown in fig. 1-6, and detailed descriptions with reference to fig. 1-6 are omitted herein for brevity.
Fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device 1000 shown in fig. 10 includes: at least one processor 1001, a memory 1002, at least one network interface 1004, and other user interfaces 1003. The various components in the computer device 1000 are coupled together by a bus system 1005. It is appreciated that the bus system 1005 is used to enable connection and communication between these components. In addition to the data bus, the bus system 1005 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are labeled as the bus system 1005 in fig. 10.
The user interface 1003 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen, etc.).
It is to be appreciated that the memory 1002 in embodiments of the present invention can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 1002 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 1002 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system 10021 and application programs 10022.
The operating system 10021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application 10022 includes various applications, such as a Media Player (Media Player), a Browser (Browser), etc., for implementing various application services. A program for implementing the method according to the embodiment of the present invention may be included in the application 10022.
In the embodiment of the present invention, the processor 1001 is configured to execute the method steps provided by the method embodiments by calling a program or an instruction stored in the memory 1002, specifically, a program or an instruction stored in the application program 10022, for example, including:
inputting a search text sample set and a video title sample set into the feature extraction module, taking the output result of the feature extraction module as the input of the coding module, and training the text matching model; monitoring a training process of the text matching model by adopting the semantic interaction module, and determining a first association relationship between the text matching model and the semantic interaction module, wherein the first association relationship is used for representing a training result of the text matching model; and training the text matching model based on the first association relationship in an auxiliary manner until the first association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
In one possible implementation manner, the decoding module and the semantic interaction module are adopted to monitor the training process of the text matching model, and a second association relationship among the text matching model, the semantic interaction module and the decoding module is determined, wherein the second association relationship is used for representing the training result of the text matching model; and training the text matching model based on the second association relationship until the second association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
In a possible implementation manner, the feature extraction module performs vectorization processing on a search text sample set and a video title sample set to obtain a first text vector corresponding to the search text sample set and a second text vector corresponding to the video title sample set, wherein the first text vector and the second text vector are used as input vectors of the semantic interaction module; and encoding the first text vector and the second text vector through the encoding module to obtain a third text vector corresponding to the search text sample set and a fourth text vector corresponding to the video title sample set, wherein the third text vector and the fourth text vector are used as input vectors of the decoding module.
In a possible implementation manner, the decoding module decodes the third text vector and the fourth text vector to obtain a fifth text vector corresponding to the search text sample set and a sixth text vector corresponding to the video title sample set; determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector; and determining a second association relationship among the third text vector, the fourth text vector, a pre-trained search text vector, a pre-trained video title vector, the fifth text vector, the sixth text vector, the feature matching vector and the feature interaction vector;
or alternatively,
determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector; and determining a first association relationship among the third text vector, the fourth text vector, the feature matching vector and the feature interaction vector.
In one possible implementation, a first penalty function for the degree of matching between the set of search text samples and the set of video title samples is determined based on the third text vector and the fourth text vector; determining a corresponding second loss function based on a mean square error between the pre-trained search text vector and the fifth text vector; determining a corresponding third loss function based on a mean square error between the pre-trained video title vector and the sixth text vector; determining a corresponding fourth loss function based on cross entropy between the feature matching vector and the feature interaction vector; determining a fifth loss function corresponding to the text matching model based on the first loss function, the second loss function, the third loss function and the fourth loss function, wherein the fifth loss function is used for representing the second association relation;
or alternatively,
determining a first loss function of matching between the set of search text samples and the set of video title samples based on the third text vector and the fourth text vector; determining a corresponding fourth loss function based on cross entropy between the feature matching vector and the feature interaction vector; and determining a sixth loss function corresponding to the text matching model based on the first loss function and the fourth loss function, wherein the sixth loss function is used for representing the first association relation.
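As an illustrative (not authoritative) reading of the loss construction above, the component losses might be computed as follows — the first loss as a cosine-distance matching loss between the encoded vectors, the second/third losses as mean squared errors against the pre-trained vectors, and the fourth loss as a cross entropy between the feature matching and feature interaction distributions. The exact formulations are assumptions, since the embodiment does not fix them:

```python
import numpy as np

def match_loss(q, t):
    """First loss (illustrative): 1 - cosine similarity between the encoded
    search-text vector q and the encoded video-title vector t."""
    cos = float(np.dot(q, t) / (np.linalg.norm(q) * np.linalg.norm(t)))
    return 1.0 - cos

def mse_loss(a, b):
    """Second/third loss: mean squared error between a decoded vector and
    the corresponding pre-trained vector."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def cross_entropy(p, q, eps=1e-12):
    """Fourth loss: cross entropy between two probability distributions
    (e.g. softmax-normalized feature matching / interaction vectors)."""
    p, q = np.asarray(p), np.asarray(q)
    return float(-np.sum(p * np.log(q + eps)))
```

A perfectly matched pair yields a first loss of 0, and identical decoded/pre-trained vectors yield a second or third loss of 0.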
In one possible implementation, training the text matching model is aided by the fifth loss function; if the fifth loss function meets a preset convergence condition, determining that the text matching model training is completed; if the fifth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the fifth loss function to meet the preset convergence condition;
or alternatively,
assisting the text matching model to train through the sixth loss function; if the sixth loss function meets a preset convergence condition, determining that the text matching model training is completed; and if the sixth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the sixth loss function to meet the preset convergence condition.
In one possible implementation manner, weight values respectively corresponding to the first loss function, the second loss function, the third loss function and the fourth loss function are set; summing the first loss function, the second loss function, the third loss function and the fourth loss function based on the weight values to obtain a fifth loss function;
Or alternatively,
setting weight values respectively corresponding to the first loss function and the fourth loss function; and summing the first loss function and the fourth loss function based on the weight values to obtain a sixth loss function.
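The weighted summation that yields the fifth or sixth loss function reduces to a simple weighted sum; the weight values themselves are free hyper-parameters not fixed by the embodiment:

```python
def weighted_loss(losses, weights):
    """Combine component loss values (first..fourth) into a total loss
    (the fifth or sixth loss function) via a weighted sum."""
    assert len(losses) == len(weights)
    return sum(w * l for w, l in zip(weights, losses))
```

For the sixth loss function this would be called with the first and fourth component losses and their two weights; for the fifth loss function, with all four components.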
In one possible implementation manner, a positive sample and a negative sample are obtained, wherein the positive sample comprises a positive search text sample and a positive video title sample with frequencies higher than a set frequency threshold, and the negative sample comprises a negative search text sample and a negative video title sample which are randomly sampled; combining the positive search text samples in the positive samples and the negative search text samples in the negative samples into a search text sample set; and combining the positive video title samples in the positive samples and the negative video title samples in the negative samples into a video title sample set.
In one possible implementation, a negative search text sample and a negative video title sample are obtained from the log file by negative random sampling; clustering the negative video title samples through a clustering algorithm to obtain a negative video title cluster; determining the distance between each negative search text sample in the positive samples and the center of the negative video title cluster; and taking the negative search text sample with the distance smaller than or equal to the set distance threshold value and the corresponding negative video title sample as a negative sample.
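The hard-negative screening described above — clustering the negative video titles and keeping only the negative search texts whose distance to a cluster center is within the set threshold — might be sketched as follows. The vectors and distance threshold are illustrative; a real system would use the model's own embeddings and a tuned clustering algorithm such as k-means:

```python
import numpy as np

def screen_hard_negatives(neg_query_vecs, cluster_centers, max_dist):
    """Keep only 'hard' negatives: indices of negative search-text vectors
    whose distance to the nearest negative-title cluster center is less
    than or equal to the set distance threshold."""
    kept = []
    for i, q in enumerate(neg_query_vecs):
        nearest = min(np.linalg.norm(q - c) for c in cluster_centers)
        if nearest <= max_dist:
            kept.append(i)
    return kept
```

The retained indices identify the negative search text samples (and their corresponding negative video title samples) that are used as negative samples in training.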
In a possible implementation manner, determining an inner product of the first text vector and the second text vector to obtain a matching matrix; convolving and pooling the matching matrix to obtain a feature matching vector between the first text vector and the second text vector; and determining, through the third text vector and the fourth text vector, a feature interaction vector representing the inter-sentence interaction and matching between the search text sample set and the video title sample set.
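The feature-matching computation in this implementation — an inner-product matching matrix over the token vectors of the two texts, followed by convolution and pooling — can be illustrated with a per-row max pooling as a simplified stand-in for the conv+pool stage (kernel sizes and the pooling scheme are assumptions not fixed by the embodiment):

```python
import numpy as np

def feature_match_vector(query_tokens, title_tokens):
    """Inner-product matching matrix between the token embeddings of the
    two texts, then per-row max pooling (simplified conv+pool stage)."""
    m = query_tokens @ title_tokens.T   # matching matrix of inner products
    return m.max(axis=1)                # pool over the video-title-token axis
```

Each entry of the result reflects how strongly one search-text token matches its best-matching title token, which is the low-level interaction signal the semantic interaction module supervises.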
The method disclosed in the above embodiments of the present invention may be applied to the processor 1001 or implemented by the processor 1001. The processor 1001 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 1001 or by instructions in the form of software. The processor 1001 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software units may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or another storage medium well known in the art. The storage medium is located in the memory 1002; the processor 1001 reads the information in the memory 1002 and, in combination with its hardware, performs the steps of the above method.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field-Programmable Gate Arrays (FPGA), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The computer device provided in this embodiment may be a computer device as shown in fig. 10, and may perform all steps of the method for constructing a text matching model as shown in fig. 1-6, so as to achieve the technical effects of the method for constructing a text matching model as shown in fig. 1-6, and refer to the related descriptions in fig. 1-6, which are not repeated herein for brevity.
The embodiment of the invention also provides a storage medium (a computer readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; nonvolatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; or a combination of the above types of memory.
The one or more programs stored in the storage medium are executable by one or more processors to implement the above-described method of constructing a text matching model, which is performed on the side of the device for constructing the text matching model.
The processor is configured to execute a construction program of a text matching model stored in the memory, so as to implement the following steps of a construction method of the text matching model, which are executed on a construction device side of the text matching model:
inputting a search text sample set and a video title sample set into the feature extraction module, taking the output result of the feature extraction module as the input of the coding module, and training the text matching model; monitoring a training process of the text matching model by adopting the semantic interaction module, and determining a first association relationship between the text matching model and the semantic interaction module, wherein the first association relationship is used for representing a training result of the text matching model; and training the text matching model based on the first association relationship in an auxiliary manner until the first association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
In one possible implementation manner, the decoding module and the semantic interaction module are adopted to monitor the training process of the text matching model, and a second association relationship among the text matching model, the semantic interaction module and the decoding module is determined, wherein the second association relationship is used for representing the training result of the text matching model; and training the text matching model based on the second association relationship until the second association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
In a possible implementation manner, the feature extraction module performs vectorization processing on a search text sample set and a video title sample set to obtain a first text vector corresponding to the search text sample set and a second text vector corresponding to the video title sample set, wherein the first text vector and the second text vector are used as input vectors of the semantic interaction module; and encoding the first text vector and the second text vector through the encoding module to obtain a third text vector corresponding to the search text sample set and a fourth text vector corresponding to the video title sample set, wherein the third text vector and the fourth text vector are used as input vectors of the decoding module.
In a possible implementation manner, the decoding module decodes the third text vector and the fourth text vector to obtain a fifth text vector corresponding to the search text sample set and a sixth text vector corresponding to the video title sample set; determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector; and determining a second association relationship among the third text vector, the fourth text vector, a pre-trained search text vector, a pre-trained video title vector, the fifth text vector, the sixth text vector, the feature matching vector and the feature interaction vector;
or alternatively,
determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector; and determining a first association relationship among the third text vector, the fourth text vector, the feature matching vector and the feature interaction vector.
In one possible implementation, a first penalty function for the degree of matching between the set of search text samples and the set of video title samples is determined based on the third text vector and the fourth text vector; determining a corresponding second loss function based on a mean square error between the pre-trained search text vector and the fifth text vector; determining a corresponding third loss function based on a mean square error between the pre-trained video title vector and the sixth text vector; determining a corresponding fourth loss function based on cross entropy between the feature matching vector and the feature interaction vector; determining a fifth loss function corresponding to the text matching model based on the first loss function, the second loss function, the third loss function and the fourth loss function, wherein the fifth loss function is used for representing the second association relation;
or alternatively,
determining a first loss function of matching between the set of search text samples and the set of video title samples based on the third text vector and the fourth text vector; determining a corresponding fourth loss function based on cross entropy between the feature matching vector and the feature interaction vector; and determining a sixth loss function corresponding to the text matching model based on the first loss function and the fourth loss function, wherein the sixth loss function is used for representing the first association relation.
In one possible implementation, training the text matching model is aided by the fifth loss function; if the fifth loss function meets a preset convergence condition, determining that the text matching model training is completed; if the fifth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the fifth loss function to meet the preset convergence condition;
or alternatively,
assisting the text matching model to train through the sixth loss function; if the sixth loss function meets a preset convergence condition, determining that the text matching model training is completed; and if the sixth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the sixth loss function to meet the preset convergence condition.
In one possible implementation manner, weight values respectively corresponding to the first loss function, the second loss function, the third loss function and the fourth loss function are set; summing the first loss function, the second loss function, the third loss function and the fourth loss function based on the weight values to obtain a fifth loss function;
Or alternatively,
setting weight values respectively corresponding to the first loss function and the fourth loss function; and summing the first loss function and the fourth loss function based on the weight values to obtain a sixth loss function.
In one possible implementation manner, a positive sample and a negative sample are obtained, wherein the positive sample comprises a positive search text sample and a positive video title sample with frequencies higher than a set frequency threshold, and the negative sample comprises a negative search text sample and a negative video title sample which are randomly sampled; combining the positive search text samples in the positive samples and the negative search text samples in the negative samples into a search text sample set; and combining the positive video title samples in the positive samples and the negative video title samples in the negative samples into a video title sample set.
In one possible implementation, a negative search text sample and a negative video title sample are obtained from the log file by negative random sampling; clustering the negative video title samples through a clustering algorithm to obtain a negative video title cluster; determining the distance between each negative search text sample in the positive samples and the center of the negative video title cluster; and taking the negative search text sample with the distance smaller than or equal to the set distance threshold value and the corresponding negative video title sample as a negative sample.
In a possible implementation manner, determining an inner product of the first text vector and the second text vector to obtain a matching matrix; convolving and pooling the matching matrix to obtain a feature matching vector between the first text vector and the second text vector; and determining, through the third text vector and the fourth text vector, a feature interaction vector representing the inter-sentence interaction and matching between the search text sample set and the video title sample set.
Those of skill would further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments illustrates the general principles of the invention and is not intended to limit the invention to the particular embodiments disclosed; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (14)

1. A method for constructing a text matching model, the text matching model comprising a feature extraction module and an encoding module, the method comprising:
inputting a search text sample set and a video title sample set into the feature extraction module, taking the output result of the feature extraction module as the input of the coding module, and training the text matching model;
Monitoring a training process of the text matching model by adopting a semantic interaction module, and determining a first association relationship between the text matching model and the semantic interaction module, wherein the first association relationship is used for representing a training result of the text matching model;
and training the text matching model based on the first association relationship in an auxiliary manner until the first association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
2. The method of claim 1, wherein the method further comprises:
monitoring the training process of the text matching model by adopting a decoding module and the semantic interaction module, and determining a second association relationship among the text matching model, the semantic interaction module and the decoding module, wherein the second association relationship is used for representing the training result of the text matching model;
and training the text matching model based on the second association relationship until the second association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
3. The method of claim 2, wherein inputting the set of search text samples and the set of video title samples into the feature extraction module and taking the output of the feature extraction module as the input of the encoding module comprises:
vectorizing the search text sample set and the video title sample set through the feature extraction module to obtain a first text vector corresponding to the search text sample set and a second text vector corresponding to the video title sample set, wherein the first text vector and the second text vector are used as input vectors of the semantic interaction module;
and encoding the first text vector and the second text vector through the encoding module to obtain a third text vector corresponding to the search text sample set and a fourth text vector corresponding to the video title sample set, wherein the third text vector and the fourth text vector are used as input vectors of the decoding module.
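Purely as an illustration, and not as part of the claimed subject matter, the two stages of claim 3 can be sketched as an embedding lookup followed by a toy encoder. The vocabulary size, embedding width, and the mean-pooling "encoder" are all assumptions; the claim does not fix the concrete networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature-extraction module: an embedding lookup table.
# Vocabulary size and embedding width are illustrative assumptions.
VOCAB, DIM = 100, 8
embedding_table = rng.normal(size=(VOCAB, DIM))

def extract_features(token_ids):
    """Vectorize a tokenized sample (the first/second text vector of claim 3)."""
    return embedding_table[np.asarray(token_ids)]   # shape (seq_len, DIM)

def encode(text_vector):
    """Toy stand-in for the encoding module: mean-pool token embeddings
    into one sentence vector (the third/fourth text vector)."""
    return text_vector.mean(axis=0)                 # shape (DIM,)

search_tokens = [3, 14, 15]      # token ids of a search text sample
title_tokens = [9, 2, 6, 5]      # token ids of a video title sample

first_vector = extract_features(search_tokens)   # fed to semantic interaction
second_vector = extract_features(title_tokens)
third_vector = encode(first_vector)              # fed to the decoding module
fourth_vector = encode(second_vector)
```

The split mirrors the claim: the token-level outputs feed the semantic interaction module, while the pooled sentence vectors feed the decoding module.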
4. The method of claim 3, wherein the monitoring the training process of the text matching model by using a decoding module and the semantic interaction module, and determining a second association relationship among the text matching model, the semantic interaction module, and the decoding module, comprises:
decoding the third text vector and the fourth text vector through the decoding module to obtain a fifth text vector corresponding to the search text sample set and a sixth text vector corresponding to the video title sample set;
determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector;
determining a second association relationship among the third text vector, the fourth text vector, a pre-trained search text vector, a pre-trained video title vector, the fifth text vector, the sixth text vector, the feature matching vector and the feature interaction vector;
wherein the monitoring the training process of the text matching model by using the semantic interaction module, and determining a first association relationship between the text matching model and the semantic interaction module, comprises:
determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector;
and determining a first association relationship among the third text vector, the fourth text vector, the feature matching vector and the feature interaction vector.
5. The method of claim 4, wherein the determining a second association between the third text vector, the fourth text vector, a pre-trained search text vector, a pre-trained video title vector, the fifth text vector, the sixth text vector, the feature matching vector, and the feature interaction vector comprises:
determining a first loss function of matching between the set of search text samples and the set of video title samples based on the third text vector and the fourth text vector;
determining a corresponding second loss function based on a mean square error between the pre-trained search text vector and the fifth text vector;
determining a corresponding third loss function based on a mean square error between the pre-trained video title vector and the sixth text vector;
determining a corresponding fourth loss function based on cross entropy between the feature matching vector and the feature interaction vector;
determining a fifth loss function corresponding to the text matching model based on the first loss function, the second loss function, the third loss function and the fourth loss function, wherein the fifth loss function is used for representing the second association relation;
the determining a first association relationship among the third text vector, the fourth text vector, the feature matching vector and the feature interaction vector includes:
determining a first loss function of matching between the set of search text samples and the set of video title samples based on the third text vector and the fourth text vector;
determining a corresponding fourth loss function based on cross entropy between the feature matching vector and the feature interaction vector;
and determining a sixth loss function corresponding to the text matching model based on the first loss function and the fourth loss function, wherein the sixth loss function is used for representing the first association relation.
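For orientation only, the four component losses of claim 5 and the weighted sums of claim 7 can be sketched numerically. The concrete form of the first (matching) loss and the weight values are assumptions, since the claims leave them open.

```python
import numpy as np

def mse_loss(a, b):
    """Second/third loss: mean squared error against a pre-trained vector."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def cross_entropy_loss(p, q, eps=1e-12):
    """Fourth loss: cross entropy between the feature matching vector and
    the feature interaction vector, both normalized to distributions."""
    p = np.asarray(p, dtype=float); q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return float(-np.sum(p * np.log(q + eps)))

def matching_loss(third_vec, fourth_vec, label):
    """First loss, in one plausible form (the claim does not fix it):
    logistic loss on the cosine similarity of the encoded vectors."""
    cos = np.dot(third_vec, fourth_vec) / (
        np.linalg.norm(third_vec) * np.linalg.norm(fourth_vec))
    prob = 1.0 / (1.0 + np.exp(-cos))
    return float(-(label * np.log(prob) + (1 - label) * np.log(1 - prob)))

def combined_loss(l1, l2, l3, l4, weights=(1.0, 0.5, 0.5, 1.0)):
    """Fifth loss (claims 5 and 7): weighted sum of the four losses.
    The weight values here are assumptions."""
    return sum(w * l for w, l in zip(weights, (l1, l2, l3, l4)))
```

The sixth loss function of claim 5 is the same weighted sum restricted to the first and fourth components, i.e. `combined_loss(l1, 0, 0, l4)` with the middle weights unused.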
6. The method of claim 5, wherein the training the text matching model based on the second association relationship until the second association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed, comprises:
assisting the text matching model to train through the fifth loss function;
if the fifth loss function meets a preset convergence condition, determining that the text matching model training is completed;
if the fifth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the fifth loss function to meet the preset convergence condition;
wherein the assisting the training of the text matching model based on the first association relationship until the first association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed, comprises:
assisting the text matching model to train through the sixth loss function;
if the sixth loss function meets a preset convergence condition, determining that the text matching model training is completed;
and if the sixth loss function does not meet the preset convergence condition, adjusting the operation parameters of the text matching model to enable the sixth loss function to meet the preset convergence condition.
7. The method of claim 5 or 6, wherein the determining a fifth loss function corresponding to the text matching model based on the first loss function, the second loss function, the third loss function, and the fourth loss function comprises:
setting weight values respectively corresponding to the first loss function, the second loss function, the third loss function and the fourth loss function;
summing the first loss function, the second loss function, the third loss function and the fourth loss function based on the weight value to obtain a fifth loss function;
the determining a sixth loss function corresponding to the text matching model based on the first loss function and the fourth loss function includes:
setting weight values respectively corresponding to the first loss function and the fourth loss function;
and summing the first loss function and the fourth loss function based on the weight value to obtain a sixth loss function.
8. The method according to claim 1, wherein the method further comprises:
obtaining positive samples and negative samples, wherein the positive samples comprise positive search text samples and positive video title samples whose occurrence frequency is higher than a set frequency threshold, and the negative samples comprise randomly sampled negative search text samples and negative video title samples;
combining the positive search text samples in the positive samples and the negative search text samples in the negative samples into a search text sample set;
and combining the positive video title samples in the positive samples and the negative video title samples in the negative samples into a video title sample set.
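A minimal sketch of the sample assembly in claim 8, using synthetic log data. The threshold value, the log format, and the "pairs that never co-occur" criterion for negatives are all assumptions beyond what the claim states.

```python
import numpy as np

rng = np.random.default_rng(1)

# (search_text, video_title, click_count) triples from a hypothetical log.
log = [("movie a", "title a", 120), ("movie b", "title b", 3),
       ("movie c", "title c", 80), ("movie d", "title d", 1)]

FREQ_THRESHOLD = 50  # the "set frequency threshold" of claim 8 (value assumed)

# Positive samples: pairs whose click frequency exceeds the threshold.
positives = [(q, t) for q, t, f in log if f > FREQ_THRESHOLD]

# Negative samples: randomly drawn (query, title) pairs that never co-occur.
queries = [q for q, _, _ in log]
titles = [t for _, t, _ in log]
observed = {(q, t) for q, t, _ in log}
negatives = []
while len(negatives) < len(positives):
    q, t = rng.choice(queries), rng.choice(titles)
    if (q, t) not in observed:
        negatives.append((q, t))

# Combine positives and negatives into the two sample sets of claim 8.
search_text_set = [q for q, _ in positives] + [q for q, _ in negatives]
video_title_set = [t for _, t in positives] + [t for _, t in negatives]
```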
9. The method of claim 8, wherein the method further comprises:
acquiring negative search text samples and negative video title samples from a log file through random negative sampling;
clustering the negative video title samples through a clustering algorithm to obtain a negative video title cluster;
determining the distance between each negative search text sample in the negative samples and the center of the negative video title cluster;
and taking the negative search text sample with the distance smaller than or equal to the set distance threshold value and the corresponding negative video title sample as a negative sample.
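The distance-based filtering of claim 9 can be sketched as follows. For brevity the sketch collapses the clustering step to a single centroid; a real implementation would run a clustering algorithm such as k-means over several clusters. The embeddings and the threshold value are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical embeddings of randomly sampled negative video titles and
# of the corresponding negative search text samples (values are synthetic).
neg_title_vecs = rng.normal(size=(20, 4))
neg_search_vecs = neg_title_vecs + rng.normal(scale=0.1, size=(20, 4))

# Single-centroid simplification: the mean stands in for a k-means
# cluster center of the negative video title cluster.
cluster_center = neg_title_vecs.mean(axis=0)

DIST_THRESHOLD = 2.5  # the "set distance threshold" of claim 9 (value assumed)

# Keep only pairs whose search vector lies within the threshold distance
# of the cluster center; these become the final negative samples.
distances = np.linalg.norm(neg_search_vecs - cluster_center, axis=1)
negative_samples = [(s, t) for s, t, d
                    in zip(neg_search_vecs, neg_title_vecs, distances)
                    if d <= DIST_THRESHOLD]
```

The filter discards negatives whose search text lies far from the title cluster, i.e. pairs unlikely to be informative hard negatives.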
10. The method of claim 4, wherein the determining, by the semantic interaction module, a feature matching vector between the first text vector and the second text vector, and a feature interaction vector between the third text vector and the fourth text vector, comprises:
determining an inner product of the first text vector and the second text vector to obtain a matching matrix;
convolving and pooling the matching matrix to obtain the feature matching vector between the first text vector and the second text vector; and
determining, through the third text vector and the fourth text vector, the feature interaction vector representing inter-sentence interaction and matching between the search text sample set and the video title sample set.
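The inner-product-plus-convolution pipeline of claim 10 can be sketched as below. Sequence lengths, embedding width, and the single 2x2 kernel with global max pooling are illustrative assumptions standing in for the convolution/pooling stack the claim leaves unspecified.

```python
import numpy as np

rng = np.random.default_rng(3)

# Token-level embeddings of one search text (5 tokens) and one video
# title (7 tokens); lengths and width are illustrative assumptions.
first_vector = rng.normal(size=(5, 8))
second_vector = rng.normal(size=(7, 8))

# Step 1 (claim 10): inner product of the two sequences -> matching matrix.
matching_matrix = first_vector @ second_vector.T        # shape (5, 7)

# Step 2: one 2x2 convolution followed by max pooling per row, a minimal
# stand-in for the convolution/pooling of the semantic interaction module.
kernel = rng.normal(size=(2, 2))
h, w = matching_matrix.shape
conv_map = np.array([[np.sum(matching_matrix[i:i + 2, j:j + 2] * kernel)
                      for j in range(w - 1)] for i in range(h - 1)])

feature_matching_vector = conv_map.max(axis=1)          # pooled -> shape (4,)
```

Entry (i, j) of the matching matrix scores how well token i of the search text matches token j of the title; convolution then aggregates local match patterns, and pooling compresses them into the feature matching vector.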
11. A system for constructing a text matching model, comprising: a text matching model, a semantic interaction module and a training module, wherein the text matching model comprises a feature extraction module and an encoding module;
the text matching model is used for inputting a search text sample set and a video title sample set into the feature extraction module, taking the output result of the feature extraction module as the input of the encoding module, and training the text matching model;
the semantic interaction module is used for supervising the training process of the text matching model;
the training module is used for determining a first association relation between the text matching model and the semantic interaction module, and the first association relation is used for representing a training result of the text matching model;
the training module is further configured to assist the text matching model to train based on the first association relationship until the first association relationship meets a preset convergence condition, and determine that training of the text matching model is completed.
12. The system of claim 11, wherein the system further comprises:
the decoding module is used for supervising the training process of the text matching model;
the training module is further configured to determine a second association relationship among the text matching model, the semantic interaction module and the decoding module, where the second association relationship is used to characterize a training result of the text matching model; and training the text matching model based on the second association relationship until the second association relationship meets a preset convergence condition, and determining that the training of the text matching model is completed.
13. A computer device, comprising a processor and a memory, wherein the processor is configured to execute a construction program of a text matching model stored in the memory, so as to implement the method of constructing a text matching model according to any one of claims 1-10.
14. A storage medium storing one or more programs executable by one or more processors to implement the method of constructing a text matching model of any one of claims 1-10.
CN202011235662.9A 2020-11-06 2020-11-06 Method, system, computer equipment and storage medium for constructing text matching model Active CN112347791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011235662.9A CN112347791B (en) 2020-11-06 2020-11-06 Method, system, computer equipment and storage medium for constructing text matching model


Publications (2)

Publication Number Publication Date
CN112347791A CN112347791A (en) 2021-02-09
CN112347791B true CN112347791B (en) 2023-10-13

Family

ID=74429483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011235662.9A Active CN112347791B (en) 2020-11-06 2020-11-06 Method, system, computer equipment and storage medium for constructing text matching model

Country Status (1)

Country Link
CN (1) CN112347791B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254587B (en) * 2021-05-31 2023-10-13 Beijing QIYI Century Science and Technology Co., Ltd. Search text recognition method and device, computer equipment and storage medium
CN114780709B (en) * 2022-03-22 2023-04-07 Beijing Sankuai Online Technology Co., Ltd. Text matching method and device and electronic equipment
CN116108163B (en) * 2023-04-04 2023-06-27 Zhejiang Lab Text matching method, device, equipment and storage medium

Citations (7)

Publication number Priority date Publication date Assignee Title
CN108256082A (en) * 2018-01-22 2018-07-06 Beijing University of Posts and Telecommunications Multi-label image retrieval method based on deep multi-similarity hashing
CN108288067A (en) * 2017-09-12 2018-07-17 Tencent Technology (Shenzhen) Co., Ltd. Training method of image-text matching model, bidirectional search method, and related apparatus
CN108563779A (en) * 2018-04-25 2018-09-21 Beijing Institute of Computer Technology and Application Template-free natural language text answer generation method based on neural networks
CN110609955A (en) * 2019-09-16 2019-12-24 Tencent Technology (Shenzhen) Co., Ltd. Video recommendation method and related equipment
CN111339249A (en) * 2020-02-20 2020-06-26 Qilu University of Technology Deep intelligent text matching method and device combining multi-angle features
US10754883B1 (en) * 2020-02-18 2020-08-25 Fractal Analytics Private Limited System and method for insight automation from social data
CN111738004A (en) * 2020-06-16 2020-10-02 Institute of Computing Technology, Chinese Academy of Sciences Training method of named entity recognition model and named entity recognition method

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN107491534B (en) * 2017-08-22 2020-11-20 Beijing Baidu Netcom Science and Technology Co., Ltd. Information processing method and device
KR20190057687A (en) * 2017-11-20 2019-05-29 Samsung Electronics Co., Ltd. Electronic device and method for changing chatbot
US11544308B2 (en) * 2019-03-28 2023-01-03 Microsoft Technology Licensing, Llc Semantic matching of search terms to results


Non-Patent Citations (1)

Title
A Survey of Deep Hashing Image Retrieval Methods; Liu Ying; Cheng Mei; Wang Fuping; Li Daxiang; Liu Wei; Fan Jiulun; Journal of Image and Graphics (Issue 07); full text *


Similar Documents

Publication Publication Date Title
CN112347791B (en) Method, system, computer equipment and storage medium for constructing text matching model
CN110012356B (en) Video recommendation method, device and equipment and computer storage medium
WO2021217935A1 (en) Method for training question generation model, question generation method, and related device
US11868738B2 (en) Method and apparatus for generating natural language description information
WO2020238783A1 (en) Information processing method and device, and storage medium
CN109819282B (en) Video user category identification method, device and medium
CN109919221B (en) Image description method based on bidirectional double-attention mechanism
US11436419B2 (en) Bilingual corpora screening method and apparatus, and storage medium
US11064040B2 (en) Information push method, readable medium, and electronic device
CN109376222A (en) Question and answer matching degree calculation method, question and answer automatic matching method and device
CN109902273B (en) Modeling method and device for keyword generation model
CN116737895A (en) Data processing method and related equipment
Amara et al. Cross-network representation learning for anchor users on multiplex heterogeneous social network
CN117132923A (en) Video classification method, device, electronic equipment and storage medium
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
CN114580444A (en) Training method and device of text translation model and storage medium
CN110807693A (en) Album recommendation method, device, equipment and storage medium
CN114329051A (en) Data information identification method, device, equipment, storage medium and program product
CN117390074A (en) Serialization recommendation method and device based on long user behaviors and storage medium
CN116738060A (en) Content generation method and device and electronic equipment
CN116959418A (en) Audio processing method and device
CN115080856A (en) Recommendation method and device and training method and device of recommendation model
KR20230148523A (en) Multimedia recommendation method and system preserving the unique characteristics of modality
CN116992875B (en) Text generation method, apparatus, computer device and storage medium
US20240013523A1 (en) Model training method and model training system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant