CN115114931A - Model training method, short video recall method, device, equipment and medium - Google Patents
- Publication number
- CN115114931A (application number CN202210579279.8A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- understanding model
- semantic understanding
- vector
- training sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/30—Semantic analysis (G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data)
- G06F16/7343—Query language or query format (G06F16/00—Information retrieval; G06F16/70—Information retrieval of video data; G06F16/73—Querying; G06F16/732—Query formulation)
- G06F16/735—Filtering based on additional data, e.g. user or group profiles (G06F16/73—Querying)
- G06N3/08—Learning methods (G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks)
Abstract
The present disclosure relates to a model training method, a short video recall method, an apparatus, a device, and a medium. The method comprises: obtaining training samples; defining a semantic understanding model, wherein the semantic understanding model comprises a first semantic understanding model and a second semantic understanding model connected in parallel; generating a corresponding first semantic vector based on the first training sample and the first semantic understanding model, and generating a corresponding second semantic vector based on the second training sample and the second semantic understanding model; and training the semantic understanding model based on the loss function value of the first semantic vector and the second semantic vector to obtain a target semantic understanding model. The trained model produces more accurate semantic representations, improving the accuracy of domain dialogue understanding.
Description
Technical Field
The present disclosure relates to the fields of artificial intelligence and natural language processing, and in particular to a semantic understanding model training method, a short video recall method, an apparatus, a device, and a medium.
Background
Intelligent voice assistants and intelligent customer service are among the most widespread and important real-world applications of technologies such as Artificial Intelligence (AI) and Natural Language Processing (NLP), and correctly understanding the user's query is the core capability of an intelligent voice assistant. At present, an intelligent voice assistant can parse the domain, intent, and slots of a user query using a rule engine or a deep learning model, thereby understanding the query. Meanwhile, with the development of the mobile internet and related technologies, short videos have grown explosively, and users increasingly prefer watching fragmented, shorter-duration video assets, namely short video assets.
Short video assets differ markedly from normal video assets in duration, asset title, and other respects; for example, short video assets are shorter in duration but have longer titles. Most existing recommendation systems recall related videos at word granularity and can recommend normal video assets well. However, because user queries are usually short, such systems struggle to capture slight linguistic variations in a query, so the voice assistant cannot accurately recommend short video assets to the user, resulting in a poor user experience.
Disclosure of Invention
To solve the above technical problem or to at least partially solve the above technical problem, the present disclosure provides a model training method, a short video recall method, an apparatus, a device, and a medium.
In a first aspect, an embodiment of the present disclosure provides a semantic understanding model training method, including:
acquiring training samples, wherein the training samples comprise a first training sample and a second training sample, the first training sample comprises a natural language text, a label corresponding to the natural language text and a field to which the natural language text belongs, and the second training sample comprises title information of a short video asset recalled based on the natural language text and a label corresponding to the recalled short video asset;
defining a semantic understanding model, wherein the semantic understanding model comprises a first semantic understanding model and a second semantic understanding model, and the first semantic understanding model and the second semantic understanding model are connected in parallel;
generating a corresponding first semantic vector based on the first training sample and the first semantic understanding model, and generating a corresponding second semantic vector based on the second training sample and the second semantic understanding model;
and training the semantic understanding model based on the loss function values of the first semantic vector and the second semantic vector to obtain a target semantic understanding model.
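The parallel first/second model structure described above is essentially a dual-tower (twin-encoder) arrangement: one tower encodes the query text, the other encodes the asset title, and both emit comparable semantic vectors. The minimal sketch below illustrates that data flow with toy hash-based encoders; the hashing, mean pooling, and 8-dimensional vectors are illustrative stand-ins, not the patent's actual networks:

```python
import hashlib
import math

DIM = 8  # toy embedding size; a real model would use a deep text encoder


def token_vector(token: str, tower: str) -> list[float]:
    # Hash the token (salted per tower) into a deterministic pseudo-embedding.
    digest = hashlib.md5(f"{tower}:{token}".encode()).digest()
    return [(b - 128) / 128 for b in digest[:DIM]]


def encode(text: str, tower: str) -> list[float]:
    # Mean-pool token vectors, then L2-normalize -- a stand-in for the
    # first/second semantic understanding model producing a semantic vector.
    tokens = text.split()
    pooled = [sum(v) / len(tokens)
              for v in zip(*(token_vector(t, tower) for t in tokens))]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


# First tower encodes the user query; second tower encodes the asset title.
query_vec = encode("funny cat clips", tower="query")
title_vec = encode("top 10 funny cat clips compilation", tower="title")
```

Because both towers output vectors of the same dimension, the two representations can be compared directly by the loss function during training.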
As an optional implementation manner of the embodiment of the present disclosure, the training the semantic understanding model based on the loss function values of the first semantic vector and the second semantic vector to obtain a target semantic understanding model includes:
determining a loss function value of the first semantic vector and the second semantic vector based on a preset loss function;
and adjusting parameters of the semantic understanding model according to the loss function value until the semantic understanding model converges to obtain a target semantic understanding model.
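The disclosure leaves the "preset loss function" open. One simple choice consistent with comparing two semantic vectors is to score the pair by cosine similarity and penalize its distance from a 0/1 relevance label; this squared-error-on-cosine loss is an assumption for illustration, not the claimed function:

```python
import math


def cosine(u, v):
    # Cosine similarity between two semantic vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def pair_loss(first_vec, second_vec, label):
    # Squared error between the cosine similarity of the first and second
    # semantic vectors and the 0/1 relevance label (1 = positive pair).
    return (cosine(first_vec, second_vec) - label) ** 2


# Identical vectors with a positive label incur zero loss.
loss = pair_loss([1.0, 0.0], [1.0, 0.0], label=1.0)
assert loss == 0.0
```

A margin-based or in-batch softmax contrastive loss would be an equally plausible reading of the claim.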
As an optional implementation manner of the embodiment of the present disclosure, adjusting parameters of the semantic understanding model according to the loss function value until the semantic understanding model converges to obtain a target semantic understanding model, including:
when the loss function value does not meet a preset threshold value, training the first semantic understanding model based on the first training sample, adjusting parameters of the first semantic understanding model, training the second semantic understanding model based on the second training sample, and adjusting parameters of the second semantic understanding model;
and obtaining a target semantic understanding model when the loss function value of the first semantic vector output by the first semantic understanding model and the loss function value of the second semantic vector output by the second semantic understanding model meet a preset threshold value.
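The convergence criterion above (iterate until the loss function value meets the preset threshold) can be sketched as a generic loop that alternately evaluates the loss and updates the parameters of both models. The scalar objective and update rule below are toy stand-ins for the towers' real gradient updates:

```python
def train_until_converged(step_fn, loss_fn, params, threshold=1e-3, max_epochs=100):
    # Evaluate the loss; if it does not meet the preset threshold, adjust the
    # parameters and repeat, mirroring the claim's convergence condition.
    for epoch in range(max_epochs):
        loss = loss_fn(params)
        if loss <= threshold:
            return params, epoch
        params = step_fn(params, loss)
    return params, max_epochs


# Toy objective: drive a scalar parameter toward 2.0.
final, epochs = train_until_converged(
    step_fn=lambda p, loss: p + 0.5 * (2.0 - p),  # stand-in parameter update
    loss_fn=lambda p: (p - 2.0) ** 2,
    params=0.0,
)
```

In practice the step function would backpropagate through both the first and second semantic understanding models so their parameters are adjusted jointly.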
As an optional implementation manner of the embodiment of the present disclosure, the first semantic understanding model includes a semantic feature extraction unit, a fully connected layer, and an activation function, where the fully connected layer includes N sub fully connected layers and the activation function includes N sub activation functions; the semantic feature extraction unit, the fully connected layer, and the activation function are connected in series, while the N sub fully connected layers are connected in parallel and the N sub activation functions are connected in parallel;
the semantic feature extraction unit is used for generating a corresponding semantic feature vector from the first training sample;
the fully connected layer is used for mapping the semantic feature vector to different subspaces to obtain sub semantic feature vectors;
the activation function is used for extracting features from the sub semantic feature vectors mapped to the subspaces, and the sub semantic feature vectors obtained from the subspaces are spliced to obtain the first semantic vector.
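A plausible reading of this optional structure is a multi-subspace projection head: the extractor's feature vector passes through N parallel sub fully connected layers (one per subspace), each output goes through its own activation, and the per-subspace results are concatenated into the first semantic vector. The sketch below assumes random weights and ReLU activations purely for illustration:

```python
import random

random.seed(0)


def linear(vec, weight):
    # One sub fully connected layer: project vec into a subspace.
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def relu(v):
    # One sub activation function applied to a subspace projection.
    return [max(0.0, x) for x in v]


def multi_subspace_head(feature_vec, n_heads=4, sub_dim=3):
    # N parallel sub fully connected layers, each followed by its own
    # activation; the subspace outputs are concatenated (spliced) into
    # the first semantic vector.
    out = []
    for _ in range(n_heads):
        weight = [[random.uniform(-1, 1) for _ in feature_vec]
                  for _ in range(sub_dim)]
        out.extend(relu(linear(feature_vec, weight)))
    return out


semantic_feature = [0.2, -0.1, 0.7, 0.4]  # stand-in extractor output
first_semantic_vec = multi_subspace_head(semantic_feature)
```

The resulting vector has length N × sub_dim (here 4 × 3 = 12), one segment per subspace.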
As an optional implementation manner of the embodiment of the present disclosure, before defining the semantic understanding model, the method further includes:
and acquiring a negative training sample corresponding to the training sample, wherein the negative training sample comprises title information of a short video asset selected from a training sample data set.
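Negative sampling of this kind is commonly implemented by drawing titles from the training set that are not the positive title. A hedged sketch follows; the random sampling policy and the number of negatives per positive are assumptions, not specified by the disclosure:

```python
import random


def sample_negatives(dataset_titles, positive_title, k=3, seed=42):
    # Draw k titles from the training sample data set, excluding the
    # positive title, to serve as negative training samples for the
    # recalled-title tower.
    pool = [t for t in dataset_titles if t != positive_title]
    rng = random.Random(seed)
    return rng.sample(pool, k)


titles = ["cat clip", "dog tricks", "recipe short", "travel vlog", "cat clip 2"]
negatives = sample_negatives(titles, positive_title="cat clip")
```

Harder negatives (e.g. titles from the same domain) often train a sharper model than uniform sampling, but that refinement is beyond what the disclosure states.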
In a second aspect, an embodiment of the present disclosure provides a short video recall method, including:
acquiring a text to be predicted and a short video asset to be recalled;
inputting the text to be predicted into a first target semantic understanding model to obtain a first target semantic vector corresponding to the text to be predicted, and inputting the short video asset to be recalled into a second target semantic understanding model to obtain a second target semantic vector corresponding to the short video asset to be recalled;
calculating a recall score of the text to be predicted and the short video asset to be recalled according to the first target semantic vector and the second target semantic vector;
determining the recalled short video asset according to the recall score;
wherein the target semantic understanding model comprises a first target semantic understanding model and a second target semantic understanding model, and the target semantic understanding model is obtained by training based on the method of any one of claims 1 to 5.
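Under the dual-tower reading, the recall score of the second aspect is naturally a similarity between the two target semantic vectors, with the highest-scoring assets recalled. The sketch below assumes cosine similarity and a top-k cutoff; the disclosure fixes neither choice:

```python
import math


def cosine(u, v):
    # Cosine similarity between the first and second target semantic vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def recall_top_k(query_vec, title_vecs, k=2):
    # Score every candidate short-video title vector against the query
    # vector and return the k highest-scoring asset ids.
    scored = sorted(title_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [asset_id for asset_id, _ in scored[:k]]


query = [1.0, 0.0]                                  # first target semantic vector
candidates = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.7, 0.7]}
recalled = recall_top_k(query, candidates)
```

At serving scale, the sorted scan would typically be replaced by an approximate nearest-neighbor index over the precomputed title vectors.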
In a third aspect, an embodiment of the present disclosure provides a semantic understanding model training apparatus, including:
the training sample acquisition module is used for acquiring training samples, wherein the training samples comprise a first training sample and a second training sample, the first training sample comprises a natural language text, a label corresponding to the natural language text and a field to which the natural language text belongs, and the second training sample comprises title information of a short video asset recalled based on the natural language text and a label corresponding to the recalled short video asset;
the semantic understanding module is used for defining a semantic understanding model, wherein the semantic understanding model comprises a first semantic understanding model and a second semantic understanding model, and the first semantic understanding model and the second semantic understanding model are connected in parallel;
a semantic vector generation module, configured to generate a corresponding first semantic vector based on the first training sample and the first semantic understanding model, and generate a corresponding second semantic vector based on the second training sample and the second semantic understanding model;
and the model determining module is used for training the semantic understanding model based on the loss function values of the first semantic vector and the second semantic vector to obtain a target semantic understanding model.
In a fourth aspect, an embodiment of the present disclosure provides a short video recall device, including:
the information acquisition module is used for acquiring a text to be predicted and a short video asset to be recalled;
a semantic vector acquisition module, configured to input the text to be predicted into a first target semantic understanding model to obtain a first target semantic vector corresponding to the text to be predicted, and to input the short video asset to be recalled into a second target semantic understanding model to obtain a second target semantic vector corresponding to the short video asset to be recalled;
the recall score calculation module is used for calculating a recall score of the text to be predicted and the short video asset to be recalled according to the first target semantic vector and the second target semantic vector;
the recall module is used for determining the recalled short video asset according to the recall score;
wherein the target semantic understanding model comprises a first target semantic understanding model and a second target semantic understanding model, and the target semantic understanding model is obtained by training based on the method of any one of claims 1 to 5.
In a fifth aspect, the present disclosure also provides a computer device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of the first aspect or the method of the second aspect.
In a sixth aspect, the present disclosure also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of any one of the first aspect or the method of the second aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
the semantic understanding model training method, the short video recall method, the device, the equipment and the medium provided by the embodiment of the disclosure are characterized in that training samples are firstly obtained, the training samples comprise a first training sample and a second training sample, the first training sample comprises a natural language text, a label corresponding to the natural language text and a field to which the natural language text belongs, the second training sample comprises title information of short video assets recalled based on the natural language text and a label corresponding to the recalled short video assets, then a semantic understanding model is defined, wherein the semantic understanding model comprises a first semantic understanding model and a second semantic understanding model, the first semantic understanding model and the second semantic understanding model are connected in parallel, a corresponding first semantic vector is generated based on the first training sample and the first semantic understanding model, and a corresponding first semantic vector is generated based on the second training sample and the second semantic understanding model, and finally, training the semantic understanding model based on the loss function values of the first semantic vector and the second semantic vector to obtain a target semantic understanding model. 
The semantic understanding model comprises a first semantic understanding model and a second semantic understanding model, the first semantic understanding model obtains a first semantic vector based on an input first training sample, the second semantic understanding model obtains a second semantic vector based on an input second training sample, the first semantic vector is a representation of a semantic vector of an input natural language text, the second semantic vector is a representation of a title vector of a short video medium recalled based on the natural language text, namely, the semantic understanding model is trained by calculating loss function values of the first semantic vector and the second semantic vector, the semantic understanding model is guaranteed to be more accurate, and the accuracy of field dialogue understanding is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in describing the embodiments or the prior art are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1A is a schematic diagram of a semantic understanding process in the prior art;
FIG. 1B is a schematic diagram illustrating an application scenario of a semantic understanding process according to an embodiment of the present disclosure;
FIG. 2A is a block diagram of a hardware configuration of a computer device according to one or more embodiments of the present disclosure;
FIG. 2B is a software configuration diagram of a computer device according to one or more embodiments of the present disclosure;
FIG. 2C is a schematic illustration of an icon control interface display of an application program included in a smart device in accordance with one or more embodiments of the present disclosure;
fig. 3A is a schematic flowchart of a semantic understanding model training method provided by the embodiment of the present disclosure;
FIG. 3B is a schematic diagram illustrating a semantic understanding model training method according to an embodiment of the disclosure;
FIG. 4A is a schematic flow chart diagram of another semantic understanding model training method provided by the embodiment of the present disclosure;
FIG. 4B is a schematic diagram illustrating another semantic understanding model training method provided by the disclosed embodiment;
FIG. 5A is a schematic structural diagram of a semantic understanding model provided by an embodiment of the present disclosure;
fig. 5B is a specific structural diagram of a semantic understanding model provided by an embodiment of the present disclosure;
FIG. 5C is a structural schematic diagram of a first semantic understanding model of an exemplary embodiment of the present disclosure;
FIG. 5D is a structural schematic diagram of a second semantic understanding model of an exemplary embodiment of the present disclosure;
FIG. 5E is a specific structural diagram of another semantic understanding model of an exemplary embodiment of the present disclosure;
fig. 6A is a schematic flowchart of a short video recall method according to an embodiment of the present disclosure;
FIG. 6B is a schematic diagram illustrating a short video recall method according to an embodiment of the present disclosure;
fig. 7A is a schematic structural diagram of a semantic understanding model training apparatus provided in an embodiment of the present disclosure;
fig. 7B is a schematic structural diagram of a semantic understanding model training apparatus provided in the embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a short video recall device according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
The terms "first" and "second," etc. in this disclosure are used to distinguish between different objects, rather than to describe a particular order of objects. For example, the first training sample and the second training sample are used to distinguish different training samples, rather than describing a particular order of training samples.
Intelligent voice assistants and intelligent customer service are among the most widespread and important real-world applications of technologies such as Artificial Intelligence (AI) and Natural Language Processing (NLP), and correctly understanding the user's query is the core capability of an intelligent voice assistant. At present, an intelligent voice assistant can parse the domain, intent, and slots of a user query using a rule engine or a deep learning model, thereby understanding the query. Meanwhile, with the development of the mobile internet and related technologies, short videos have grown explosively, and users increasingly prefer watching fragmented, shorter-duration video assets, namely short video assets.
Short video assets differ markedly from normal video assets in duration, asset title, and other respects; for example, short video assets are shorter in duration but have longer titles. Most existing recommendation systems recall related videos at word granularity and can recommend normal video assets well. However, because user queries are usually short, such systems struggle to capture slight linguistic variations in a query, and so the voice assistant cannot accurately recommend short video assets to the user. For example, if the natural language text "how a baby is ill" is input to the recommendation system, the system may recommend the related video "how a baby dog is ill" and fail to recall results semantically similar to the input text, resulting in a poor user experience.
Fig. 1A is a schematic diagram of a semantic understanding process in the prior art. As shown in fig. 1A, the main flow is as follows. The natural language text is text content obtained by recognizing the user's audio data, and may also be other text content, which is not limited in this embodiment. The natural language text is first segmented at word granularity; the intent corresponding to the natural language text and the labels of the keywords among the segmented words are then determined from those words; finally, a semantic understanding result corresponding to the natural language text is obtained based on the text, its intent, and its labels. However, because this method segments the natural language text at word granularity, when the user query is short the recommendation system struggles to capture slight linguistic variations in the query, and the semantic understanding result obtained from the text, its intent information, and its label information is not accurate enough, so short video assets cannot be accurately recommended to the user.
To address these drawbacks, the embodiments of the present disclosure first obtain training samples comprising a first training sample and a second training sample, where the first training sample comprises a natural language text, a label corresponding to the natural language text, and the domain to which the natural language text belongs, and the second training sample comprises title information of a short video asset recalled based on the natural language text and a label corresponding to the recalled short video asset. A semantic understanding model is then defined, comprising a first semantic understanding model and a second semantic understanding model connected in parallel. A corresponding first semantic vector is generated based on the first training sample and the first semantic understanding model, and a corresponding second semantic vector is generated based on the second training sample and the second semantic understanding model. Finally, the semantic understanding model is trained based on the loss function value of the first semantic vector and the second semantic vector to obtain a target semantic understanding model.
Because the semantic understanding model comprises a first semantic understanding model and a second semantic understanding model, where the first semantic understanding model derives a first semantic vector from the input first training sample and the second semantic understanding model derives a second semantic vector from the input second training sample, the first semantic vector represents the semantics of the input natural language text and the second semantic vector represents the title of the input recalled short video asset. Training the semantic understanding model by computing the loss function value of the first and second semantic vectors therefore makes the model more accurate and improves the accuracy of domain dialogue understanding.
Fig. 1B is a schematic view of an application scenario of the semantic understanding process in the embodiment of the present disclosure. As shown in fig. 1B, the semantic understanding process may be used in a voice interaction scenario between a user and an intelligent terminal. Assuming that the intelligent terminal in this scenario includes an intelligent display device 001, when the user wants to switch the short video asset displayed on the intelligent display device, the user first issues a voice instruction. Upon receiving the voice instruction, the intelligent terminal performs semantic understanding on it and determines the corresponding semantic understanding result, so that the device can subsequently search for more relevant short video assets to display on the intelligent display device according to that result, thereby meeting the user's search requirement.
The semantic understanding model training method and the short video recall method provided by the embodiment of the disclosure can be implemented based on computer equipment, or a functional module or a functional entity in the computer equipment.
The computer device may be a Personal Computer (PC), a server, a mobile phone, a tablet computer, a notebook computer, a mainframe computer, and the like, which is not particularly limited in this disclosure.
Illustratively, fig. 2A is a block diagram of a hardware configuration of a computer device according to one or more embodiments of the present disclosure. As shown in fig. 2A, the computer device includes at least one of a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller 250 includes a central processing unit, a video processor, an audio processor, a graphics processor, RAM, ROM, and first through nth input/output interfaces. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device with a projection screen. The tuner demodulator 210 receives broadcast television signals by wired or wireless reception and demodulates audio and video signals, such as EPG audio and video data signals, from a plurality of wireless or wired broadcast television signals. The communicator 220 is a component for communicating with an external device or a server according to various communication protocol types; for example, the communicator may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, other network or near field communication protocol chips, and an infrared receiver. The computer device may establish transmission and reception of control signals and data signals with the server or a local control device through the communicator 220. The detector 230 is used to collect signals from the external environment or from interaction with the outside. The controller 250 and the tuner demodulator 210 may be located in separate devices; that is, the tuner demodulator 210 may also be located in a device external to the main device where the controller 250 is located, such as an external set-top box.
In some embodiments, controller 250 controls the operation of the computer device and responds to user actions through various software control programs stored in memory. The controller 250 controls the overall operation of the computer device. A user may input a user command on a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
Fig. 2B is a schematic software configuration diagram of a computer device according to one or more embodiments of the present disclosure, and as shown in fig. 2B, the system is divided into four layers, which are, from top to bottom, an Application (Applications) layer (referred to as an "Application layer"), an Application Framework (Application Framework) layer (referred to as a "Framework layer"), an Android runtime (Android runtime) and system library layer (referred to as a "system runtime library layer"), and a kernel layer.
Fig. 2C is a schematic diagram illustrating an icon control interface display of an application program included in an intelligent terminal (mainly an intelligent playback device, such as an intelligent television, a digital cinema system, or a video server) according to one or more embodiments of the present disclosure. As shown in fig. 2C, the application layer includes at least one application program that can display a corresponding icon control in the display, for example: a live television application icon control, a video on demand (VOD) application icon control, a media center application icon control, an application center icon control, a game application icon control, and the like. The live television application program can provide live television through different signal sources. A video on demand (VOD) application can provide video from different storage sources; unlike a live television application, video on demand displays video from a storage source. The media center application program can provide various applications for playing multimedia content. The application center can provide and store various application programs.
For a more detailed description of the semantic understanding model training scheme, an example is given below in conjunction with fig. 3A. It is understood that an actual implementation may include more or fewer steps than shown in fig. 3A, and the order of the steps may also differ, while still enabling the semantic understanding model training method provided in the embodiments of the present application.
Fig. 3A is a schematic flowchart of a semantic understanding model training method provided by an embodiment of the present disclosure; fig. 3B is a schematic diagram illustrating the principle of a semantic understanding model training method according to an embodiment of the present disclosure. This embodiment is applicable to scenarios in which a target semantic understanding model is obtained by training a semantic understanding model. The method of this embodiment can be executed by a semantic understanding model training device, which can be implemented in hardware and/or software and can be configured in a computer device.
As shown in fig. 3A, the method specifically includes the following steps:
S310, obtaining a training sample.
The training samples comprise a first training sample and a second training sample, the first training sample comprises a natural language text, a label corresponding to the natural language text and a field to which the natural language text belongs, and the second training sample comprises title information of a short video asset recalled based on the natural language text and a label corresponding to the recalled short video asset.
In the embodiment of the present disclosure, the first training sample may be a training sample randomly extracted from a predetermined training data set, or may be a training sample grouped from the training data set. The training data set is a set formed by the text content corresponding to each piece of audio data, determined based on multiple types of audio data from multiple different users (or other collected text content), together with the label corresponding to each text content and the domain to which each text content belongs. The label information can be understood as attributes corresponding to keywords in the text content, such as person attributes (actor, singer, athlete, etc.) or film-and-television attributes (TV drama, movie, variety show). The domain information can be understood as the domain and/or intention corresponding to the text content in the historical index, such as the movie domain or searching for related movies. The natural language text is the text content determined from the training data set.
The second training sample is the title information of the target short video asset corresponding to the target short video selected by the user from the short video assets recalled based on the natural language text, together with the label corresponding to the target short video asset. Illustratively, the short video assets recalled based on the natural language text include short video asset 1, short video asset 2, short video asset 3, …, and short video asset N. If the user triggers a click to browse short video asset 2 among the recalled short video assets, the second training sample is the title information of short video asset 2 and the label corresponding to short video asset 2.
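The structure of the two training-sample types can be sketched as plain records. This is an illustrative sketch only — the field names and example values below are assumptions, not taken from the embodiment:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FirstTrainingSample:
    text: str          # the natural language text (e.g. a user query)
    labels: List[str]  # labels for keywords in the text (e.g. "actor", "singer")
    domain: str        # domain/intent the text belongs to

@dataclass
class SecondTrainingSample:
    title: str         # title information of the clicked (recalled) short video asset
    labels: List[str]  # labels corresponding to that asset

# A hypothetical pair: the user issued an actor-related query and clicked asset 2.
sample1 = FirstTrainingSample(text="play videos of actor X",
                              labels=["actor"], domain="MovieGeneralSearch")
sample2 = SecondTrainingSample(title="title of short video asset 2",
                               labels=["actor"])
```

A batch of such pairs would then be fed to the two parallel models described in the following steps.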
S320, defining a semantic understanding model, wherein the semantic understanding model comprises a first semantic understanding model and a second semantic understanding model, and the first semantic understanding model and the second semantic understanding model are connected in parallel.
In the embodiment of the present disclosure, a semantic understanding model is defined such that the first semantic understanding model in it identifies the semantic information of the natural language text, and the second semantic understanding model in it identifies the semantic information corresponding to the title of the short video asset recalled based on the natural language text. That is, semantic information recognition of the natural language text and semantic information recognition of the short video asset title are fused in one semantic understanding model, so that the natural language text can be recognized accurately.
S330, generating a corresponding first semantic vector based on the first training sample and the first semantic understanding model, and generating a corresponding second semantic vector based on the second training sample and the second semantic understanding model.
In the embodiment of the disclosure, a first training sample is input to a first semantic understanding model, and a first semantic vector corresponding to the first training sample is obtained, wherein the first semantic vector represents the semantics of the natural language text.
In a specific embodiment, the first training sample input to the first semantic understanding model comprises the natural language text, the label corresponding to the natural language text, and the domain to which the natural language text belongs. Including the label and the domain in addition to the text itself ensures the accuracy with which the first semantic vector output by the first semantic understanding model represents the semantics of the natural language text.
The second training sample is input into the second semantic understanding model to obtain a second semantic vector corresponding to the second training sample, where the second semantic vector represents the semantics of the title text of the short video asset that the user selected and browsed among the short video assets recalled based on the natural language text.
In a specific implementation, the second training sample comprises the title information of the short video asset recalled based on the natural language text and the label corresponding to the recalled short video asset. Setting the input of the second semantic understanding model to include the label corresponding to the short video asset, in addition to the title information, ensures the accuracy with which the second semantic vector output by the second semantic understanding model represents the text semantics of the short video asset's title information.
S340, training the semantic understanding model based on the loss function values of the first semantic vector and the second semantic vector to obtain a target semantic understanding model.
The first training sample is input into the first semantic understanding model to obtain the first semantic vector, and the second training sample is input into the second semantic understanding model to obtain the second semantic vector. The loss function value of the first semantic vector and the second semantic vector is calculated through a preset loss function, and whether the semantic understanding model has reached the convergence condition is judged based on the relation between the calculated loss function value and a preset threshold. When the semantic understanding model has not reached the convergence condition, training continues until the convergence condition is reached, at which point the target semantic understanding model is determined.
It should be noted that when the semantic understanding model is judged not to have reached the convergence condition based on the relation between the calculated loss function value and the preset threshold, training may continue in one of two ways. One implementation is to train only the first semantic understanding model in the semantic understanding model, which ensures a higher correlation between the first semantic vector output by the first semantic understanding model and the second semantic vector of the text corresponding to the title information of the short video asset, thereby ensuring the precision of the semantic understanding model. The other implementation is to train both the first semantic understanding model and the second semantic understanding model, which ensures the accuracy of the first semantic vector output by the first semantic understanding model and of the second semantic vector output by the second semantic understanding model, thereby ensuring the accuracy of the semantic understanding model.
The semantic understanding model training method provided by the embodiment of the disclosure first obtains training samples, where the training samples comprise a first training sample and a second training sample: the first training sample comprises a natural language text, the label corresponding to the natural language text, and the domain to which the natural language text belongs, and the second training sample comprises the title information of the short video assets recalled based on the natural language text and the labels corresponding to the recalled short video assets. A semantic understanding model is then defined, comprising a first semantic understanding model and a second semantic understanding model connected in parallel. A corresponding first semantic vector is generated based on the first training sample and the first semantic understanding model, and a corresponding second semantic vector is generated based on the second training sample and the second semantic understanding model. Finally, the semantic understanding model is trained based on the loss function value of the first semantic vector and the second semantic vector to obtain the target semantic understanding model.
The semantic understanding model comprises a first semantic understanding model and a second semantic understanding model. The first semantic understanding model obtains a first semantic vector from the input first training sample, and the second semantic understanding model obtains a second semantic vector from the input second training sample. The first semantic vector is a representation of the semantics of the input natural language text, and the second semantic vector is a representation of the title of a short video asset recalled based on that natural language text. That is, the semantic understanding model is trained by calculating the loss function value of the first semantic vector and the second semantic vector, which makes the semantic understanding model more accurate and improves the accuracy of domain dialogue understanding.
Fig. 4A is a schematic flowchart of another semantic understanding model training method provided in the embodiment of the present disclosure, and fig. 4B is a schematic principle diagram of a semantic understanding model training method provided in the embodiment of the present disclosure. This embodiment is further expanded and optimized on the basis of the above embodiments. As shown in fig. 4A and fig. 4B, a specific implementation of step S340 includes:
S3401, determining loss function values of the first semantic vector and the second semantic vector based on a preset loss function.
The preset loss function may be a ranking loss (RankingLoss) function from metric learning. It may be determined according to actual use requirements or set by user definition, which is not limited in the embodiment of the present disclosure.
In the embodiment of the disclosure, the preset loss function is a measurement standard for determining whether the semantic understanding model is qualified or not, so that the semantic understanding model obtained by training has a high-precision recognition result. The loss function value between a first semantic vector generated by the first semantic understanding model and a second semantic vector generated by the second semantic understanding model can be calculated through a preset loss function, and the recognition precision of the semantic understanding model is verified according to the relation between the loss function value and a preset threshold value, so that the semantic understanding model with high accuracy is trained.
In a specific embodiment, after the first semantic vector output by the first semantic understanding model and the second semantic vector output by the second semantic understanding model are obtained, the loss function value of the first semantic vector and the second semantic vector is determined based on the preset loss function, and the parameters of the semantic understanding model are optimized accordingly. The RankingLoss loss function can be expressed in the following form:
loss(s1, s2, y) = max(0, -y * (s1 - s2) + margin)
where s1 represents the score of the first semantic vector, s2 represents the score of the second semantic vector, y indicates whether s1 should be closer to the optimal score than s2, taking a value from {+1, -1}, and margin is a preset separation margin.
Illustratively, take margin = 0.1 and y = +1 (s1 is expected to be closer to the optimal score than s2). When the score s1 of the first semantic vector is 0.7 and the score s2 of the second semantic vector is 0.5, s1 exceeds s2 by more than the margin, so the loss function outputs 0; the semantic understanding model satisfies the convergence condition and training may end. When s1 is 0.5 and s2 is 0.7, s1 is not closer to the optimal score than s2, so the loss function outputs -y * (s1 - s2) + margin = 0.3 > 0; the semantic understanding model does not satisfy the convergence condition and needs to keep training.
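The RankingLoss computation above can be written directly from the formula. A minimal sketch (the margin value 0.1 is an assumption for illustration; the embodiment does not fix it):

```python
def ranking_loss(s1: float, s2: float, y: int, margin: float = 0.1) -> float:
    """loss(s1, s2, y) = max(0, -y * (s1 - s2) + margin)."""
    return max(0.0, -y * (s1 - s2) + margin)

# y = +1: s1 is expected to be closer to the optimal score than s2.
converged = ranking_loss(0.7, 0.5, y=1)      # 0.0 -- margin satisfied, training may end
not_converged = ranking_loss(0.5, 0.7, y=1)  # positive -- keep training
```

Note that the loss is zero only when s1 exceeds s2 by at least the margin, which is why a small positive margin forces a clear separation between the two scores.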
S3402, adjusting parameters of the semantic understanding model according to the loss function value until the semantic understanding model converges to obtain a target semantic understanding model.
In a specific embodiment, when the loss function value does not meet the preset threshold, the first semantic understanding model is trained based on the first training sample and its parameters are adjusted, and the second semantic understanding model is trained based on the second training sample and its parameters are adjusted. When the loss function value of the first semantic vector output by the first semantic understanding model and the second semantic vector output by the second semantic understanding model meets the preset threshold, the target semantic understanding model is obtained; in this case, the target semantic understanding model consists of the first semantic understanding model after parameter optimization and the second semantic understanding model after parameter optimization.
As another implementable manner, when the loss function value does not meet the preset threshold, only the parameters of the first semantic understanding model may be adjusted, until the loss function value of the first semantic vector output by the first semantic understanding model and the second semantic vector output by the second semantic understanding model meets the preset threshold, and the target semantic understanding model is obtained. In this case, the target semantic understanding model is composed of the second semantic understanding model and the first semantic understanding model after parameter optimization.
That is, when only the first semantic understanding model is trained based on the first training sample, the parameters of the second semantic understanding model are kept unchanged and only the parameters of the first semantic understanding model are updated, until the loss function value of the first semantic vector output by the first semantic understanding model and the second semantic vector output by the second semantic understanding model meets the preset threshold.
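The strategy of updating only the first model can be illustrated with toy one-parameter models. Everything here (scalar weights, learning rate, margin, inputs) is an illustrative assumption, not the embodiment's actual networks:

```python
# Toy one-parameter "models": only the first model's weight is updated,
# the second model's weight stays frozen, mirroring the strategy above.
model1 = {"w": 0.5}   # first semantic understanding model (trainable)
model2 = {"w": 0.9}   # second semantic understanding model (frozen)
MARGIN, LR = 0.1, 0.1

def train_step(x1: float, x2: float) -> float:
    s1 = model1["w"] * x1                  # score from the first model
    s2 = model2["w"] * x2                  # score from the frozen second model
    loss = max(0.0, -(s1 - s2) + MARGIN)   # RankingLoss with y = +1
    if loss > 0.0:
        # d(loss)/d(w1) = -x1 while the loss is active, so gradient descent
        # increases w1. model2 is never touched.
        model1["w"] += LR * x1
    return loss

for _ in range(50):
    train_step(1.0, 1.0)   # iterate until the ranking loss reaches zero
```

After the loop, model1's weight has grown just enough that its score beats the frozen model's score by the margin, while model2 is unchanged — the one-sided version of the convergence condition described above.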
The semantic understanding model training method provided by the embodiment of the disclosure determines loss function values of a first semantic vector and a second semantic vector based on a preset loss function, and then adjusts parameters of the semantic understanding model based on the loss function values until the semantic understanding model converges to obtain a target semantic understanding model.
In some embodiments, the first semantic understanding model includes a semantic feature extraction unit, a fully-connected layer, and an activation function, where the fully-connected layer includes N sub fully-connected layers and the activation function includes N sub activation functions. The semantic feature extraction unit, the fully-connected layer, and the activation function are connected in series; the N sub fully-connected layers are connected in parallel, and the N sub activation functions are connected in parallel. The semantic feature extraction unit is configured to generate a corresponding semantic feature vector from the first training sample; the fully-connected layer is configured to map the semantic feature vector to different subspaces to obtain sub semantic feature vectors; and the activation function is configured to perform feature extraction on the sub semantic feature vectors mapped to the subspaces by the fully-connected layer, after which the sub semantic feature vectors obtained in the respective subspaces are spliced to obtain the first semantic vector.
Specifically, fig. 5A and fig. 5B are schematic structural diagrams of a semantic understanding model provided by the embodiment of the present disclosure, and fig. 5C is a schematic structural diagram of a first semantic understanding model provided by the embodiment of the present disclosure. As shown in figs. 5A, 5B, and 5C, the first semantic understanding model 510 includes a semantic feature extraction unit 5101, a fully-connected layer 5102, and an activation function 5103, where the fully-connected layer 5102 includes N sub fully-connected layers and the activation function 5103 includes N sub activation functions. After the semantic feature extraction unit generates the semantic feature vector corresponding to the first training sample (the natural language text), it inputs the generated semantic feature vector to the fully-connected layer. As shown in fig. 5C, a sub fully-connected layer is exemplarily denoted SubFC; all N sub fully-connected layers are connected in parallel, the output end of the semantic feature extraction unit is connected to the input ends of the N sub fully-connected layers, and the different sub fully-connected layers map the semantic feature vector generated by the semantic feature extraction unit to different subspaces to obtain the sub semantic feature vectors. The activation function is exemplarily represented by ReLU; the output end of each sub fully-connected layer is connected to one sub activation function, and the N sub activation functions are connected in parallel. Feature extraction is performed by the activation functions on the sub semantic feature vectors mapped to the subspaces, and the sub semantic feature vectors obtained in the subspaces are spliced to obtain the first semantic vector.
In this embodiment, the fully-connected layer maps the semantic feature vector generated by the semantic feature extraction unit to different subspaces; that is, the first semantic understanding model extracts semantic information from the semantic feature vector by modeling it in multiple subspaces. This process can be represented as:
Head_i = SubFC_i(OriginVector) = W_i * OriginVector
where Head_i represents the sub semantic feature vector of the semantic feature vector in the i-th subspace, OriginVector represents the semantic feature vector, and W_i represents the mapping weight for mapping the semantic feature vector from the original semantic space to the i-th subspace.
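A toy numeric sketch of this multi-head subspace modeling, with N = 2 subspaces and hand-picked weight matrices standing in for trained SubFC layers:

```python
def relu(vector):
    # Sub activation function applied element-wise.
    return [max(0.0, x) for x in vector]

def matvec(matrix, vector):
    # SubFC_i(OriginVector) = W_i * OriginVector as a plain matrix-vector product.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def multi_head(origin_vector, subspace_weights):
    """Map the semantic feature vector into each subspace, apply the
    activation, and splice (concatenate) the heads into the first semantic vector."""
    heads = [relu(matvec(w_i, origin_vector)) for w_i in subspace_weights]
    return [x for head in heads for x in head]

# N = 2 subspaces, each projecting a 3-dim feature vector down to 2 dims.
weights = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]],
]
first_semantic_vector = multi_head([0.5, -0.2, 0.3], weights)  # [0.5, 0.0, 0.3, 0.0]
```

Each head sees the same input through a different learned projection, and the concatenation yields a first semantic vector whose length is N times the subspace dimension.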
In the embodiment of the disclosure, in order to enable the first semantic understanding model to selectively attend to the natural language text, the label corresponding to the natural language text, and the domain to which the natural language text belongs in the input first training sample, the first semantic understanding model uses a Transformer to realize a self-attention mechanism, which improves the semantic expression capability of the model.
In the embodiment of the disclosure, the natural language text, the tags corresponding to the natural language text and the field to which the natural language text belongs are jointly modeled based on the first semantic understanding model, and the semantic feature vectors of the first training sample are modeled in a plurality of subspaces by using a multi-head mechanism, so that the semantic understanding capability is improved.
In some embodiments, the second semantic understanding model includes a semantic feature extraction unit and an activation function, the semantic feature extraction unit is configured to generate a corresponding semantic feature vector from the second training sample, and the activation function is configured to perform feature extraction on the semantic feature vector to obtain a second semantic vector.
Specifically, the structure of the second semantic understanding model is shown in fig. 5D. After the semantic feature extraction unit generates the semantic feature vector corresponding to the second training sample, it inputs the generated semantic feature vector to the activation function. As shown in fig. 5D, the activation function is exemplarily represented by ReLU; the output end of the semantic feature extraction unit is connected to the activation function, and feature extraction is performed on the semantic feature vector of the semantic feature extraction unit through the activation function to obtain the second semantic vector.
In a specific implementation manner, the activation function in the first semantic understanding model and the second semantic understanding model may include a plurality of activation functions, which is not specifically limited by the embodiments of the present disclosure.
Fig. 5D is a general framework diagram of a semantic understanding model provided by the embodiment of the present disclosure. As shown in fig. 5D, an exemplary first training sample is {MovieGeneralSearch, X, Actor, Singer, VideoRoleName}: the natural language text is X, the labels corresponding to the natural language text are Actor, Singer, and VideoRoleName, and the domain to which the natural language text belongs is MovieGeneralSearch. The second training sample consists of the title information corresponding to the target short video asset that the user triggered and clicked among the short video assets recalled based on the natural language text, together with the labels corresponding to that target short video asset. The first training sample is input into the first semantic understanding model to obtain the first semantic vector (Multi-head representation) output by the first semantic understanding model, and the second training sample is input into the second semantic understanding model to obtain the second semantic vector (Media title representation) output by the second semantic understanding model. The loss function value of the first semantic vector and the second semantic vector is then calculated based on the preset loss function. When the loss function value meets the preset threshold, the target semantic understanding model
is composed of the first semantic understanding model and the second semantic understanding model. When the loss function value does not meet the preset threshold, the parameters of the first semantic understanding model and the second semantic understanding model are first adjusted; the first training sample is then input into the first semantic understanding model after parameter adjustment, the second training sample is input into the second semantic understanding model after parameter adjustment, and the loss function value is recalculated based on the first semantic vector output by the first semantic understanding model and the second semantic vector output by the second semantic understanding model. When the recalculated loss function value satisfies the preset threshold, it can be determined that the semantic understanding model with the modified parameters has reached the convergence condition, and the target semantic understanding model is determined to be composed of the first semantic understanding model and the second semantic understanding model with the modified parameters. When the recalculated loss function value does not satisfy the preset threshold, the semantic understanding model with the modified parameters has not reached the convergence condition; the parameters of the first semantic understanding model and the second semantic understanding model are adjusted again until the semantic understanding model reaches the convergence condition, at which point training of the semantic understanding model ends.
In addition, the method uses the ALBERT tokenizer to segment the first training sample and the second training sample. For example, a sentence such as "I and my XX" can be divided into five words including "I" and "XX", with the word "XX" kept whole rather than split into "X" and "X". This reduces the size of the word store and facilitates model training.
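The whole-word effect can be sketched with a greedy longest-match segmenter over a small word store. Note this is a simplification for illustration — ALBERT actually uses a SentencePiece tokenizer, and the vocabulary and sentence below are assumed, not from the embodiment:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation: at each position take the longest
    vocabulary entry, so a multi-character word stays whole instead of being
    split into single characters."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # fall back to a single character
            i += 1
    return tokens

vocab = {"I", "and", "my", "XX", "X"}
greedy_tokenize("IandmyXX", vocab)   # -> ["I", "and", "my", "XX"]
```

Because "XX" is matched before the single character "X", the multi-character word survives as one token, which is the behavior the paragraph above relies on.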
As an implementation manner, the semantic understanding model training method further comprises the following steps:
Acquiring a negative training sample corresponding to the training sample, where the negative training sample comprises the title information of short video assets selected from the training sample data set.
In order to ensure the training efficiency of the semantic understanding model, after the training samples are obtained, negative training samples corresponding to the training samples are obtained, namely negative samples are constructed based on the training sample set.
Fig. 6A is a schematic flowchart of a short video recall method according to an embodiment of the present disclosure, and fig. 6B is a schematic principle diagram of the short video recall method according to the embodiment of the present disclosure. This embodiment is applicable to short video recall based on a text to be predicted. The method of this embodiment can be executed by a short video recall device, which can be implemented in hardware and/or software and can be configured in a computer device.
As shown in fig. 6A, the method specifically includes the following steps:
S610, obtaining the text to be predicted and the short video assets to be recalled.
The text to be predicted may be audio data of a user, for example, voice data in a voice interaction process between the user and the intelligent device, or may be an input text, which is not limited in this embodiment.
And the short video assets to be recalled are all the short video assets stored in the database.
S620, inputting the text to be predicted into the first target semantic understanding model to obtain a first target semantic vector corresponding to the text to be predicted, and inputting the short video assets to be recalled into the second target semantic understanding model to obtain second target semantic vectors corresponding to the short video assets to be recalled.
The target semantic understanding model comprises a first target semantic understanding model and a second target semantic understanding model, and the target semantic understanding model is obtained by training based on the semantic understanding model training method in any embodiment.
The text to be predicted is input into the first target semantic understanding model, so that a first target semantic vector corresponding to the text to be predicted can be obtained, wherein the text to be predicted input into the first target semantic understanding model comprises the text to be predicted, a label corresponding to the text to be predicted and the field to which the text to be predicted belongs.
Each short video asset to be recalled in the database is input into the second target semantic understanding model to obtain the second target semantic vector corresponding to the title information of the short video asset, wherein the short video asset input into the second target semantic understanding model comprises the title information of the short video asset and the label corresponding to the short video asset.
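At prediction time, each tower encodes its input independently: the first target semantic understanding model receives the query text together with its label and field, while the second receives the title information and label of each asset. As an illustration only, the following Python sketch substitutes a hash-based placeholder for the trained towers (the `toy_encode` function, the input formats, and the dimension are assumptions, not the models of this disclosure):

```python
import hashlib
import numpy as np

def toy_encode(text, dim=8):
    """Placeholder encoder: deterministic pseudo-embedding from a text's hash.
    A real first/second target semantic understanding model would go here."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = np.frombuffer(digest[:dim], dtype=np.uint8).astype(float)
    return vec / (np.linalg.norm(vec) + 1e-8)  # unit-normalize the vector

# First tower input: text to be predicted with its label and field.
query_vector1 = toy_encode("query1 | label:music | field:video")
# Second tower input: title information of a short video asset with its label.
media1_vector = toy_encode("media1 title | label:music")

print(query_vector1.shape, media1_vector.shape)  # (8,) (8,)
```

The key point the sketch preserves is that the two inputs never see each other during encoding; only the resulting vectors are compared later.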
S630, calculating recall scores of the text to be predicted and the short video assets to be recalled according to the first target semantic vector and the second target semantic vector.
After a first target semantic vector corresponding to a text to be predicted and a second target semantic vector corresponding to the title information of the short video asset are obtained, the recall score of the first target semantic vector and the second target semantic vector can be calculated.
In a specific embodiment, suppose the text to be predicted input into the first target semantic understanding model is query1, and the short video assets to be recalled input into the second target semantic understanding model are media1, media2, media3, media4, …, media n. Inputting query1 into the first target semantic understanding model yields the first target semantic vector query_vector1, and inputting each short video asset to be recalled into the second target semantic understanding model yields the second target semantic vectors media1_vector, media2_vector, media3_vector, media4_vector, …, media n_vector. The recall score of the first target semantic vector query_vector1 against each second target semantic vector is then calculated, and the short video assets to be recalled are determined according to the recall scores.
Specifically, the recall score of the first target semantic vector query_vector1 and each second target semantic vector media1_vector, media2_vector, media3_vector, …, media n_vector is calculated as:

score_i = (x1 · x2i) / (||x1|| × ||x2i|| + ε)

Wherein x1 represents the first target semantic vector, x2i represents the i-th second target semantic vector, and ε is a very small value that prevents division by zero.
S640, determining the short video assets to be recalled according to the recall scores.
In a specific implementation manner, since the database contains a plurality of short video assets to be recalled, the short video assets to be recalled are input into the second target semantic understanding model in turn to obtain the second target semantic vector corresponding to each asset; the recall score between the first target semantic vector and each second target semantic vector is then calculated in turn, and the target short video assets to be recalled are determined according to the calculated recall scores. Illustratively, the recall score of the first target semantic vector query_vector1 and the second target semantic vector media1_vector is similarity1, the recall score of query_vector1 and media2_vector is similarity2, …, and the recall score of query_vector1 and media n_vector is similarity n. If the recall scores satisfy similarity1 > similarity2 > … > similarity n, then, according to the ranking of the recall scores, a preset number of short video assets are selected as the target short video assets for recall display. For example, if the number of short video assets to be recalled for display is 5, then media1, media2, media3, media4, and media5 are recalled, i.e., the recalled short video assets are determined to be media1, media2, media3, media4, and media5.
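The ranking step above can be sketched as follows (the candidate vectors and the preset number k are made-up illustrations):

```python
import numpy as np

def recall_score(x1, x2, eps=1e-8):
    # Epsilon-stabilized cosine similarity between two semantic vectors.
    return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2) + eps))

def recall_top_k(query_vector, media_vectors, k=5):
    """Score every candidate asset and return the k names with the highest recall score."""
    ranked = sorted(media_vectors,
                    key=lambda name: recall_score(query_vector, media_vectors[name]),
                    reverse=True)
    return ranked[:k]

query_vector1 = np.array([1.0, 0.0])
candidates = {
    "media1": np.array([1.0, 0.0]),   # aligned with the query
    "media2": np.array([0.9, 0.1]),   # close to the query
    "media3": np.array([0.0, 1.0]),   # orthogonal to the query
}
print(recall_top_k(query_vector1, candidates, k=2))  # ['media1', 'media2']
```

With a preset number of 2, the two best-scoring assets are selected for recall display, matching the ranking logic described above.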
In this embodiment, the short video recall method enables the target short videos to be recalled quickly and accurately, so that the target short videos recommended to the user better meet the user's requirements, improving the user experience.
Fig. 7A is a schematic structural diagram of a semantic understanding model training apparatus provided in an embodiment of the present disclosure, where the apparatus is configured in a computer device, and may implement the semantic understanding model training method according to any embodiment of the present disclosure, where the apparatus specifically includes the following:
a training sample obtaining module 710, configured to obtain a training sample, where the training sample includes a first training sample and a second training sample, the first training sample includes a natural language text, a label corresponding to the natural language text, and a field to which the natural language text belongs, and the second training sample includes title information of a short video asset recalled based on the natural language text and a label corresponding to the recalled short video asset;
the definition module 720 is configured to define a semantic understanding model, where the semantic understanding model includes a first semantic understanding model and a second semantic understanding model, and the first semantic understanding model and the second semantic understanding model are connected in parallel;
a semantic vector generation module 730, configured to generate a corresponding first semantic vector based on the first training sample and the first semantic understanding model, and generate a corresponding second semantic vector based on the second training sample and the second semantic understanding model;
and the model determining module 740 is configured to train the semantic understanding model based on the loss function values of the first semantic vector and the second semantic vector to obtain a target semantic understanding model.
As an optional implementation manner of the embodiment of the present disclosure, fig. 7B is a schematic structural diagram of a semantic understanding model training apparatus provided in the embodiment of the present disclosure, and as shown in fig. 7B, the model determining module includes:
a loss function value determining unit 7401, configured to determine a loss function value of the first semantic vector and the second semantic vector based on a preset loss function;
and the adjusting unit 7402 is used for adjusting parameters of the semantic understanding model according to the loss function value until the semantic understanding model converges to obtain the target semantic understanding model.
As an optional implementation manner of the embodiment of the present disclosure, a specific implementation manner of the adjustment unit includes:
when the loss function value does not meet a preset threshold value, training the first semantic understanding model based on the first training sample, adjusting parameters of the first semantic understanding model, training the second semantic understanding model based on the second training sample, and adjusting parameters of the second semantic understanding model;
and obtaining the target semantic understanding model when the loss function values of the first semantic vector output by the first semantic understanding model and the second semantic vector output by the second semantic understanding model meet a preset threshold value.
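A minimal sketch of this adjustment loop, using toy linear towers and plain gradient descent (the models, the learning rate, and the threshold are assumptions; the disclosure does not fix a particular optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 2))   # parameters of the first semantic understanding model (toy)
W2 = rng.standard_normal((4, 2))   # parameters of the second semantic understanding model (toy)
x1 = rng.standard_normal(4)        # stands in for a first training sample's features
x2 = rng.standard_normal(4)        # stands in for the matching second training sample's features

threshold, lr = 1e-3, 0.01         # preset threshold and learning rate (assumed)

def loss(W1, W2):
    # Matching query/title pair: pull the two semantic vectors together.
    diff = x1 @ W1 - x2 @ W2
    return float(diff @ diff)

# "When the loss function value does not meet the preset threshold",
# adjust the parameters of both models; stop once it does.
while loss(W1, W2) > threshold:
    diff = x1 @ W1 - x2 @ W2
    W1 -= lr * np.outer(x1, diff)   # adjust the first model's parameters
    W2 -= lr * np.outer(x2, -diff)  # adjust the second model's parameters

print(loss(W1, W2) <= threshold)    # True
```

The loop terminates exactly when the loss of the two output semantic vectors meets the preset threshold, mirroring the convergence condition of the adjusting unit.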
As an optional implementation manner of the embodiment of the present disclosure, the first semantic understanding model includes a semantic feature extraction unit, a full connection layer, and an activation function, where the full connection layer includes N sub full connection layers, the activation function includes N sub activation functions, the semantic feature extraction unit, the full connection layer, and the activation function are connected in series, the N sub full connection layers are connected in parallel, and the N sub activation functions are connected in parallel;
the semantic feature extraction unit is used for generating a corresponding semantic feature vector from the first training sample;
the full connection layer is used for mapping the semantic feature vectors to different subspaces to obtain sub-semantic feature vectors;
the activation function is used for extracting features of the sub-semantic feature vectors which are mapped to the subspaces in a full-connection mode, and splicing the sub-semantic feature vectors obtained from the subspaces to obtain a first semantic vector.
As an optional implementation manner of the embodiment of the present disclosure, the apparatus further includes:
and the negative training sample acquisition module is used for acquiring a negative training sample corresponding to the training sample, wherein the negative training sample comprises the title information of the short video media asset selected from the training sample data set.
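Selecting negatives from the training sample data set could look like this sketch (the data set contents, sampling count, and uniform-random strategy are assumptions):

```python
import random

def sample_negative_titles(dataset, positive_titles, num_negatives=3, seed=0):
    """Pick title information from the training sample data set that is not a
    positive (recalled) title for this query."""
    candidates = [t for t in dataset if t not in positive_titles]
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    return rng.sample(candidates, min(num_negatives, len(candidates)))

dataset = ["title_a", "title_b", "title_c", "title_d", "title_e"]
positives = {"title_a"}
negs = sample_negative_titles(dataset, positives, num_negatives=2)
print(negs)  # two titles, neither equal to "title_a"
```

Excluding the positives before sampling ensures a negative training sample never duplicates a title that was actually recalled for the query.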
Fig. 8 is a schematic structural diagram of a short video recall apparatus configured in a computer device and capable of implementing the short video recall method according to any embodiment of the present disclosure, where the apparatus specifically includes the following:
the information acquisition module 810 is used for acquiring a text to be predicted and a short video asset to be recalled;
a semantic vector obtaining module 820, configured to input the text to be predicted into the first target semantic understanding model to obtain a first target semantic vector corresponding to the text to be predicted, and input the short video asset to be recalled into the second target semantic understanding model to obtain a second target semantic vector corresponding to the short video asset to be recalled;
the recall score calculating module 830 is configured to calculate a recall score of the text to be predicted and the short video asset to be recalled according to the first target semantic vector and the second target semantic vector;
a recall module 840, configured to determine the short video assets to be recalled according to the recall scores;
the target semantic understanding model comprises a first target semantic understanding model and a second target semantic understanding model, and the target semantic understanding model is obtained by training based on the semantic understanding model training method in any embodiment.
The short video recall device provided by the embodiment of the disclosure can execute the short video recall method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the executed method, which are not repeated here.
The disclosed embodiment provides a computer device, including: one or more processors; a storage device configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the semantic understanding model training method of any of the embodiments of the present disclosure or the short video recall method of any of the embodiments of the present disclosure.
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 9, the computer apparatus includes a processor 910 and a storage 920; the number of the processors 910 in the computer device may be one or more, and fig. 9 illustrates one processor 910 as an example; the processor 910 and the storage 920 in the computer device may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 9.
The storage 920 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Additionally, the storage 920 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the storage 920 may further include memory located remotely from the processor 910, which may be connected to a computer device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The computer device provided by the embodiment can be used for executing the method provided by any embodiment, and has corresponding functions and beneficial effects.
The embodiments of the present disclosure further provide a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, implement each process executed by the method provided in any of the above embodiments, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description is only for the purpose of describing particular embodiments of the present disclosure, so as to enable those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A semantic understanding model training method is characterized by comprising the following steps:
acquiring training samples, wherein the training samples comprise a first training sample and a second training sample, the first training sample comprises a natural language text, a label corresponding to the natural language text and a field to which the natural language text belongs, and the second training sample comprises title information of a short video asset recalled based on the natural language text and a label corresponding to the recalled short video asset;
defining a semantic understanding model, wherein the semantic understanding model comprises a first semantic understanding model and a second semantic understanding model, and the first semantic understanding model and the second semantic understanding model are connected in parallel;
generating a corresponding first semantic vector based on the first training sample and the first semantic understanding model, and generating a corresponding second semantic vector based on the second training sample and the second semantic understanding model;
and training the semantic understanding model based on the loss function values of the first semantic vector and the second semantic vector to obtain a target semantic understanding model.
2. The method of claim 1, wherein training the semantic understanding model based on the loss function values of the first semantic vector and the second semantic vector to obtain a target semantic understanding model comprises:
determining a loss function value of the first semantic vector and the second semantic vector based on a preset loss function;
and adjusting parameters of the semantic understanding model according to the loss function value until the semantic understanding model converges to obtain a target semantic understanding model.
3. The method of claim 2, wherein adjusting parameters of the semantic understanding model according to the loss function values until the semantic understanding model converges to obtain a target semantic understanding model comprises:
when the loss function value does not meet a preset threshold value, training the first semantic understanding model based on the first training sample, adjusting parameters of the first semantic understanding model, training the second semantic understanding model based on the second training sample, and adjusting parameters of the second semantic understanding model;
and obtaining a target semantic understanding model when the loss function value of the first semantic vector output by the first semantic understanding model and the loss function value of the second semantic vector output by the second semantic understanding model meet a preset threshold value.
4. The method according to claim 1, wherein the first semantic understanding model comprises a semantic feature extraction unit, a full connection layer and an activation function, wherein the full connection layer comprises N sub full connection layers, the activation function comprises N sub activation functions, the semantic feature extraction unit, the full connection layer and the activation function are connected in series, N sub full connection layers are connected in parallel, and N sub activation functions are connected in parallel;
the semantic feature extraction unit is used for generating a corresponding semantic feature vector from the first training sample;
the full connection layer is used for mapping the semantic feature vectors to different subspaces to obtain sub-semantic feature vectors;
the activation function is used for performing feature extraction on the sub-semantic feature vectors mapped into the subspaces by the full connection layer, and splicing the sub-semantic feature vectors obtained from the subspaces to obtain the first semantic vector.
5. The method of claim 1, wherein before defining the semantic understanding model, further comprising:
and acquiring a negative training sample corresponding to the training sample, wherein the negative training sample comprises the title information of the short video media asset selected from the training sample data set.
6. A short video recall method, comprising:
acquiring a text to be predicted and a short video medium resource to be recalled;
inputting the text to be predicted into a first target semantic understanding model to obtain a first target semantic vector corresponding to the text to be predicted, and inputting the short video asset to be recalled into a second target semantic understanding model to obtain a second target semantic vector corresponding to the short video asset to be recalled;
calculating recall scores of the text to be predicted and the short video assets to be recalled according to the first target semantic vector and the second target semantic vector;
determining the recalled short video assets according to the recall scores;
wherein the target semantic understanding model comprises a first target semantic understanding model and a second target semantic understanding model, and the target semantic understanding model is obtained by training based on the method of any one of claims 1 to 5.
7. A semantic understanding model training device is characterized by comprising:
the training sample acquisition module is used for acquiring training samples, wherein the training samples comprise a first training sample and a second training sample, the first training sample comprises a natural language text, a label corresponding to the natural language text and a field to which the natural language text belongs, and the second training sample comprises title information of a short video asset recalled based on the natural language text and a label corresponding to the recalled short video asset;
the semantic understanding module is used for defining a semantic understanding model, wherein the semantic understanding model comprises a first semantic understanding model and a second semantic understanding model, and the first semantic understanding model and the second semantic understanding model are connected in parallel;
a semantic vector generation module, configured to generate a corresponding first semantic vector based on the first training sample and the first semantic understanding model, and generate a corresponding second semantic vector based on the second training sample and the second semantic understanding model;
and the model determining module is used for training the semantic understanding model based on the loss function values of the first semantic vector and the second semantic vector to obtain a target semantic understanding model.
8. A short video recall device, comprising:
the information acquisition module is used for acquiring the text to be predicted and the short video assets to be recalled;
a semantic vector acquisition module, configured to input the text to be predicted into a first target semantic understanding model to obtain a first target semantic vector corresponding to the text to be predicted, and input the short video asset to be recalled into a second target semantic understanding model to obtain a second target semantic vector corresponding to the short video asset to be recalled;
the recall score calculation module is used for calculating the recall score of the text to be predicted and the short video asset to be recalled according to the first target semantic vector and the second target semantic vector;
the recall module is used for determining the short video assets to be recalled according to the recall scores;
wherein the target semantic understanding model comprises a first target semantic understanding model and a second target semantic understanding model, and the target semantic understanding model is obtained by training based on the method of any one of claims 1 to 5.
9. A computer device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5 or to implement the method of claim 6.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 5 or carries out the method of claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210579279.8A CN115114931A (en) | 2022-05-25 | 2022-05-25 | Model training method, short video recall method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115114931A true CN115114931A (en) | 2022-09-27 |
Family
ID=83325915
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210579279.8A Pending CN115114931A (en) | 2022-05-25 | 2022-05-25 | Model training method, short video recall method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115114931A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116756579A (en) * | 2023-08-22 | 2023-09-15 | 腾讯科技(深圳)有限公司 | Training method of large language model and text processing method based on large language model |
CN116756579B (en) * | 2023-08-22 | 2023-12-12 | 腾讯科技(深圳)有限公司 | Training method of large language model and text processing method based on large language model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||