CN115098729A - Video processing method, sample generation method, model training method and device - Google Patents

Video processing method, sample generation method, model training method and device

Info

Publication number
CN115098729A
Authority
CN
China
Prior art keywords
video
event
data
determining
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210732217.6A
Other languages
Chinese (zh)
Inventor
韩翠云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210732217.6A
Publication of CN115098729A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a video processing method, a sample generation method, a model training method and an apparatus, which relate to the technical field of artificial intelligence, in particular to the fields of knowledge graphs, natural language processing, deep learning and the like, and can be applied to scenes such as content understanding. The specific implementation scheme is as follows: determining a video representation according to video text information and key frames of a video to be processed; determining a first similarity between the video representation and an event representation of each of a plurality of event information, each of the plurality of event information comprising event text information and an event representation determined from the event text information; and determining target event information corresponding to the video to be processed from the plurality of event information according to the first similarity.

Description

Video processing method, sample generation method, model training method and device
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular, to the fields of knowledge-graphs, natural language processing, deep learning, and the like, which can be applied to scenes such as content understanding.
Background
With the development and popularization of communication and transmission technology, video has become one of the ways in which people acquire information. According to user requirements, videos in which a user is interested can be determined from a video library and then pushed, so that the user can obtain information through those videos.
Disclosure of Invention
The present disclosure provides a video processing method, a sample generation method, a training method of a deep learning model, an apparatus, an electronic device, a computer readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a video processing method including: determining a video representation according to video text information and key frames of a video to be processed; determining a first similarity between the video representation and an event representation of each of a plurality of event information, each of the plurality of event information comprising event text information and an event representation determined from the event text information; and determining target event information corresponding to the video to be processed from the plurality of event information according to the first similarity.
According to another aspect of the present disclosure, there is provided a sample generation method including: determining key frame data and video text data in the video data; according to the video text data, determining event text data matched with the video text data from a plurality of event text data to obtain a matching relation between the video text data and the event text data; generating sample data according to the matching relationship, the key frame data, the video text data and the event text data; based on the matching relation, taking the actual similarity between the video representation data and the event representation data as a label, and adding the label to the sample data; wherein the video representation data is determined based on the key frame data and the video text data, and the event representation data is determined based on the event text data.
According to another aspect of the present disclosure, there is provided a training method of a deep learning model, including: acquiring sample data; inputting sample data into a deep learning model to obtain video representation data and event representation data; determining a similarity between the video representation data and the event representation data; adjusting parameters of the deep learning model according to the similarity and the label corresponding to the sample data; wherein the sample data is generated according to the sample generation method.
According to another aspect of the present disclosure, there is provided a video processing apparatus including: the device comprises a video representation determining module, a first similarity determining module and a target event information determining module. The video representation determining module is used for determining video representation according to video text information and key frames of a video to be processed; the first similarity determining module is used for determining first similarity between an event representation and a video representation of each of a plurality of pieces of event information, wherein each of the plurality of pieces of event information comprises event text information and the event representation determined according to the event text information; the target event information determining module is used for determining target event information corresponding to the video to be processed from the plurality of event information according to the first similarity.
According to another aspect of the present disclosure, there is provided a sample generation apparatus comprising: the device comprises a data determining module, a matching relation determining module, a generating module and a label determining module. The data determining module is used for determining key frame data and video text data in the video data; the matching relation determining module is used for determining event text data matched with the video text data from the plurality of event text data according to the video text data to obtain a matching relation between the video text data and the event text data; the generating module is used for generating sample data according to the matching relation, the key frame data, the video text data and the event text data; the label determining module is used for taking the actual similarity between the video representation data and the event representation data as a label based on the matching relation and adding the label to the sample data; wherein the video representation data is determined based on the key frame data and the video text data, and the event representation data is determined based on the event text data.
According to another aspect of the present disclosure, there is provided a training apparatus for a deep learning model, including: the device comprises an acquisition module, an input module, a similarity determination module and an adjustment module. The acquisition module is used for acquiring sample data; the input module is used for inputting the sample data into the deep learning model to obtain video representation data and event representation data; the similarity determining module is used for determining the similarity between the video representation data and the event representation data; the adjusting module is used for adjusting parameters of the deep learning model according to the similarity and the label corresponding to the sample data; wherein the sample data is generated by the sample generation device.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided by the present disclosure.
According to another aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method provided by the disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic diagram of an application scenario of a video processing method, a sample generation method, a training method of a deep learning model, and an apparatus according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow diagram of a sample generation method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow diagram of a method of training a deep learning model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a training method of a deep learning model according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart diagram of a video processing method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of constructing a database according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a video processing method according to an embodiment of the present disclosure;
FIG. 8 is a block schematic diagram of a sample generation apparatus according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of a schematic structure of a deep learning model training apparatus according to an embodiment of the present disclosure;
fig. 10 is a schematic configuration block diagram of a video processing apparatus according to an embodiment of the present disclosure; and
fig. 11 is a block diagram of an electronic device for implementing a video processing method, a sample generation method, and a training method of a deep learning model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In some technical solutions, video data may be characterized and learned to obtain a video representation, and both the video data and the video representation may be stored in a database. In a video recommendation and search scenario, after a search video is received, the similarity between the search video and the video representations stored in the database can be calculated, and a target video whose pictures are similar to those of the search video is then screened from the database according to the similarity, thereby realizing a video retrieval function.
It can be understood that the above technical solution uses the video representation to represent the video data, but the video representation itself is not interpretable content, and the content of the video data cannot be understood through the video representation alone. For example, the event name, event description, event type, event occurrence time, event occurrence place, event participants and other content involved in the video data cannot be understood through the video representation.
The embodiment of the disclosure aims to provide a video processing method, which can associate a video to be processed with corresponding structured target event information and understand video content by using the target event information. In addition, under the video recommendation and search scene, the target event information can be utilized to screen the video to be pushed, so that the accuracy of video pushing is improved.
The technical solutions provided in the present disclosure will be described in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a schematic view of an application scenario of a video processing method, a sample generation method, a deep learning model training method and an apparatus according to an embodiment of the present disclosure.
As shown in fig. 1, the application scenario 100 of this embodiment may include an electronic device 110, and the electronic device 110 may be any electronic device with processing functionality, including but not limited to a smartphone, a tablet, a laptop, a desktop computer, a server, and so on.
The electronic device 110 may process the input video 120 to be processed, for example, to obtain the target event information 130. For example, the electronic device 110 can determine the video representation of the video 120 to be processed using a deep learning model, and then determine the target event information 130 based on a first similarity between the video representation and an event representation.
According to an embodiment of the present disclosure, as shown in fig. 1, the application scenario 100 may further include a server 140. The electronic device 110 may be communicatively coupled to the server 140 via a network, which may include wireless or wired communication links.
Illustratively, the server 140 may be configured to train the deep learning model 150, and send the trained deep learning model 150 to the electronic device 110 in response to a model obtaining request sent by the electronic device 110, so as to facilitate the electronic device 110 to determine a video representation of the to-be-processed video 120 by using the deep learning model 150. In an embodiment, the electronic device 110 may further send the video 120 to be processed to the server 140 through a network, and the server 140 processes the obtained video 120 to be processed according to the trained deep learning model 150 to obtain a processing result, and then returns the processing result to the electronic device 110, where the processing result may include a video representation and may also include target event information.
According to an embodiment of the present disclosure, as shown in fig. 1, the application scenario 100 may further include a database 160, and the database 160 may maintain a huge amount of sample data. For example, sample data with a tag may be generated according to key frame data and video text data in the video data, and event text data, and a matching relationship between the video text data and the event text data. The server 140 may access the database 160 and extract portions of the sample data from the database 160 to train the deep learning model 150.
In training the deep learning model 150, a similarity between video representation data determined from the key frame data and the video text data and event representation data determined from the event text data may be calculated, and using the similarity and the label, a loss function is employed to determine a loss of the deep learning model 150, and training of the model is performed by minimizing model loss.
It should be noted that the video processing method provided by the present disclosure may be executed by the electronic device 110 or the server 140, the sample generation method provided by the present disclosure may be executed by the electronic device 110 or the server 140, and the training method of the deep learning model provided by the present disclosure may be executed by the server 140. Accordingly, the video processing apparatus provided by the present disclosure may be disposed in the electronic device 110 or the server 140, the sample generation apparatus provided by the present disclosure may be disposed in the electronic device 110 or the server 140, and the training apparatus of the deep learning model provided by the present disclosure may be disposed in the server 140.
It should be understood that the number and type of electronic devices, servers, and databases in FIG. 1 are merely illustrative. There may be any number and type of electronic devices, servers, and databases, as desired for an implementation.
Fig. 2 is a schematic flow diagram of a sample generation method according to an embodiment of the disclosure.
As shown in fig. 2, the sample generation method 200 may include operations S210 to S240.
In operation S210, key frame data and video text data in the video data are determined.
For example, the video data may be a current-affairs news video, an entertainment news video, or the like. For a current-affairs news video, the images in the video can include shots of a presenter explaining the event as well as scene shots of the event itself without the presenter.
For example, the key frame data may be extracted according to image differences between different image frames, for example, if differences between adjacent image frames are small, one of the image frames may be randomly selected as the key frame.
For example, the video text data may include at least one of video caption text data, voice recognition text data, and subtitle text data. For example, a video clip of a period of time before and after the key frame data is captured from the video data, and then the video clip is subjected to speech recognition to obtain speech recognition text data. For example, optical character recognition may be performed on the video segment to obtain subtitle text data.
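As a rough illustration of the key-frame selection idea described above, the sketch below (assuming OpenCV and NumPy are available) keeps one frame whenever the mean absolute difference to the previous frame exceeds a threshold; the threshold value and the choice of keeping the first frame of each run of similar frames, rather than a random one, are illustrative assumptions rather than values prescribed by this disclosure.

```python
import cv2
import numpy as np

def extract_key_frames(video_path: str, diff_threshold: float = 30.0):
    """Keep one frame from each run of visually similar frames."""
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # A large difference to the previous frame starts a new run;
        # the first frame of the run is taken as the key frame.
        if prev_gray is None or float(np.mean(cv2.absdiff(gray, prev_gray))) > diff_threshold:
            key_frames.append(frame)
        prev_gray = gray
    cap.release()
    return key_frames
```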
In operation S220, event text data matched with the video text data is determined from the plurality of event text data according to the video text data, so that a matching relationship between the video text data and the event text data is obtained.
For example, the event text data may include data such as an event name, an event description, an event type, a time when the event occurred, a location where the event occurred, participants, and the like. A match between the event text data and the video text data indicates that they describe the same event; for example, the event text data and the video text data both describe the event "Zhang San won the championship in a diving competition".
In one example, a similarity between the event text data and the video text data may be determined, and the similarity may be a text similarity. The event text data may be determined to match the video text data if the similarity is greater than a first threshold. The first threshold may be, for example, 0.6.
It should be understood that determining whether the event text data and the video text data match only by the similarity may cause errors in the matching result. For example, the event text data reads "On May 1, 2022, Zhang San won the championship in a diving competition held in a first city", while the video text data reads "On May 15, 2022, Zhang San won the championship in a diving competition held in a second city". The two texts describe two different events, because the events occurred at different times and in different places. However, the event text data and the video text data have a high text similarity, so similarity calculation alone would determine that they match, resulting in an error in the matching result.
In another example, candidate event text data may be determined according to a similarity between each of the plurality of event text data and the video text data. And then determining that the candidate event text data is matched with the video text data under the condition that the attribute information corresponding to the candidate event text data is consistent with the attribute information corresponding to the video text data.
For example, a plurality of event text data may be stored in a database in advance, and then matching candidate event text data may be retrieved in the database based on the video text data. For example, event text data having a similarity equal to or greater than a second threshold may be determined as candidate event text data.
For example, the attribute information may include at least one of an event type, an event occurrence time, an event occurrence location, and an event participant. Whether the candidate event text data are matched with the video text data or not is determined through consistency of the attributes, the matched event text data and the matched video text data can be ensured to describe the same event, and further the accuracy of a matching result is ensured.
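A minimal sketch of this two-stage matching follows, with a character-level similarity standing in for whatever text-similarity measure is actually used, the 0.6 threshold taken from the example above, and illustrative attribute keys; all of these are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    # Character-level ratio as a stand-in for the real text-similarity measure.
    return SequenceMatcher(None, a, b).ratio()

def match_event(video_text: str, video_attrs: dict, events: list, sim_threshold: float = 0.6):
    """events: list of {"text": ..., "attrs": {"type": ..., "time": ..., "place": ..., "participants": ...}}.
    First screen candidates by text similarity, then require attribute consistency."""
    candidates = [e for e in events if text_similarity(video_text, e["text"]) >= sim_threshold]
    for event in candidates:
        if all(video_attrs.get(key) == value for key, value in event["attrs"].items()):
            return event   # matched event text data
    return None            # no event text data matches the video text data
```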
In operation S230, sample data is generated according to the matching relationship, the key frame data, the video text data, and the plurality of event text data.
For example, a positive sample may be generated from keyframe data, video text data, and event text data that matches the video text data. The three data involved by the positive sample generated in the mode describe the same event, so that the training effect of the model is ensured.
For example, a negative example may be generated from keyframe data, video text data, and event text data that does not match the video text data. The three kinds of data related to the negative samples generated in the mode describe different events, and therefore the model training effect is guaranteed. Of course negative examples may be generated in other ways.
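A sketch of how one positive sample and several negative samples could be assembled, with the 1/0 labels described in the next operation; the number of negatives taken per video is an assumption made for illustration.

```python
def build_samples(key_frames, video_text, matched_event_text, all_event_texts, num_negatives=3):
    """One positive sample (matched event) plus a few negatives (unmatched events)."""
    samples = [{"key_frames": key_frames, "video_text": video_text,
                "event_text": matched_event_text, "label": 1.0}]
    negatives = [t for t in all_event_texts if t != matched_event_text][:num_negatives]
    for event_text in negatives:
        samples.append({"key_frames": key_frames, "video_text": video_text,
                        "event_text": event_text, "label": 0.0})
    return samples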
In operation S240, an actual similarity between the video representation data and the event representation data is taken as a tag based on the matching relationship, and the tag is added to the sample data. The video presentation data is determined based on the keyframe data and the video text data, and the event presentation data is determined based on the event text data.
For example, when the event text data matches the video text data, the value of the tag may be determined to be 1. When the event text data does not match the video text data, the value of the tag may be determined to be 0.
For example, key frame data and video text data may be input into a deep learning model, resulting in video representation data. Event text data can be input into the deep learning model to obtain event representation data. The deep learning model may be an ERNIE (Enhanced Language Representation with Informative Entities) model.
According to the technical scheme provided by the embodiment of the disclosure, the key frame data and the video text data are determined from the same video data, and the matching relationship between the video text data and the event text data is also determined, so that whether the key frame data, the video text data and the event text data describe the same event can be determined, and the label of sample data can be accurately determined, thereby ensuring the accuracy of the sample. In addition, the technical scheme provided by the embodiment of the disclosure can automatically construct the sample according to the matching relationship, the key frame data, the video text data and the event text data, thereby reducing the cost of generating the sample.
In some embodiments, the samples generated using the sample generation methods described above may be used to train a deep learning model.
FIG. 3 is a schematic flow diagram of a method of training a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 3, the training method 300 of the deep learning model may include operations S310 to S340.
In operation S310, sample data is acquired.
For example, the sample data may be generated using the sample generation method described above.
In operation S320, sample data is input to the deep learning model, resulting in video representation data and event representation data.
For example, key frame data and video text data may be input into a deep learning model, resulting in video representation data. For example, the deep learning model may extract features of key frame data and features of video text data, and then perform feature fusion on the two extracted features to obtain video representation data.
For example, event text data may be input into a pre-trained model, resulting in event representation data.
In operation S330, a similarity between the video representation data and the event representation data is determined.
In operation S340, parameters of the deep learning model are adjusted according to the similarity and the label corresponding to the sample data.
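One possible shape of operations S320 to S340 as a single training step is sketched below in PyTorch; the `encode_video` and `encode_event` methods are hypothetical names for the model's two encoding paths, and the cosine similarity with a binary cross-entropy loss against the 0/1 label is an illustrative choice, not the objective mandated by this disclosure.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, batch):
    """batch carries key frame features, video text, event text and a 0/1 label."""
    video_repr = model.encode_video(batch["key_frames"], batch["video_text"])  # hypothetical API
    event_repr = model.encode_event(batch["event_text"])                       # hypothetical API
    similarity = F.cosine_similarity(video_repr, event_repr, dim=-1)
    # Map the similarity from [-1, 1] to [0, 1] and compare against the label.
    loss = F.binary_cross_entropy((similarity + 1) / 2, batch["label"].float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```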
According to the technical scheme provided by the embodiment of the disclosure, the sample data input by the deep learning model comprises three data, wherein the three data are respectively key frame data, video text data and event text data, and whether the three data describe the same event or not is known. Therefore, in the training process, the video representation data and the event representation data output by the deep learning model can be optimized by judging whether the three data describe the same event or not, so that the deep learning model outputs more accurate video representation data and event representation data, and the training precision of the model is improved.
In some embodiments, the deep learning model obtained by the training method can be applied to a video processing method to obtain target event information corresponding to a video to be processed, so that the effect of understanding video content is achieved.
FIG. 4 is a schematic diagram of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 4, in this embodiment, key frame data and video text data in sample data may be input into the deep learning model 430 to be trained as the video side input information 410, and event text data 420 in the sample data may be input into the deep learning model 430 to be trained. The deep learning model 430 outputs a multimodal semantic representation 440, the multimodal semantic representation 440 including a video representation and an event representation. Meanwhile, in the training process, the parameters of the deep learning model 430 are optimized through a gradient descent method.
In some embodiments, entity-granularity matching may be performed between the key frame data and the event text data in the sample data. For example, on the one hand, the key frame data may relate to two events, Zhang San winning a championship and Li Si winning a championship; that is, the key frame data may include a first sub-image related to Zhang San winning the championship and a second sub-image related to Li Si winning the championship. On the other hand, the event text data includes only text data describing the Zhang San event. In this case, the first sub-image may be cut out of the key frame data and matched against the event text data.
In practical applications, in order to simplify the process of constructing sample data and improve its efficiency, entity-granularity matching between the key frame data and the event text data in the sample data may be omitted. Taking the above example in which the key frame data includes the first sub-image and the second sub-image, if the position of each sub-image within the key frame data is unknown, it is not possible to tell which region of the key frame data is related to Zhang San winning the championship and which region is not. In this case, instead of matching the first sub-image against the event text data, the entire key frame data may be matched against the event text data.
In the model training process, for example, a weakly supervised contrastive learning framework can be adopted to map semantic features of different modalities into the same feature space, and the output video representation and event representation are optimized by a gradient descent method, so that more accurate video representations and event representations can be obtained. In the contrastive learning process, for example, for a positive sample, the distance between the video representation data and the event representation data output by the deep learning model can be made small; for a negative sample, that distance may be made large.
For example, inputting the key frame data, the video text data and the event text data of a first event and a second event into the deep learning model respectively can yield the video representation of the first event, the event representation of the first event, the video representation of the second event and the event representation of the second event. Through contrastive learning, a first distance may be made smaller than a second distance, where the first distance is the distance between the video representation of the first event and the event representation of the first event, and the second distance is the distance between the video representation of the first event and the event representation of the second event.
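An in-batch contrastive objective of this kind can be written as the standard InfoNCE loss; the formulation and the temperature value below are common choices offered for illustration, not the exact loss used in this disclosure.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_reprs, event_reprs, temperature=0.05):
    """video_reprs, event_reprs: [B, D] tensors where row i of both describes the same event.
    Pulls matching pairs together and pushes non-matching pairs apart."""
    video_reprs = F.normalize(video_reprs, dim=-1)
    event_reprs = F.normalize(event_reprs, dim=-1)
    logits = video_reprs @ event_reprs.T / temperature        # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)                   # diagonal entries are the positives
```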
Fig. 5 is a schematic flow diagram of a video processing method according to an embodiment of the present disclosure.
As shown in fig. 5, the video processing method 500 may include operations S510 to S530.
In operation S510, a video representation is determined according to video text information and key frames of a video to be processed.
For example, the key frame data may be extracted according to image differences between different image frames, for example, if the differences between adjacent image frames are small, one of the image frames may be randomly selected as a key frame.
For example, the video text information may include at least one of video title text information, voice recognition text information, and subtitle text information. For example, speech recognition may be performed on the video to be processed to obtain speech recognition text information. For example, optical character recognition may be performed on a video to be processed to obtain subtitle text information.
For example, key frames and video text information can be input into a deep learning model, resulting in a video representation. The deep learning model may be, for example, a model obtained by the training method described above.
In operation S520, a first similarity between an event representation and a video representation of each of a plurality of event information is determined, each of the plurality of event information including event text information and an event representation determined from the event text information.
For example, the event text information may include at least one of an event name, an event description, an event type, an event occurrence time, an event occurrence place, participants, and the like.
For example, a plurality of event text messages may be stored in the database, and the plurality of event text messages may be input to the deep learning model to obtain a plurality of event representations corresponding to the plurality of event text messages, and the event representations may be stored in the database. Furthermore, a corresponding relation between the event text information and the event representation can be established, and the corresponding relation is stored in the database.
In operation S530, target event information corresponding to the video to be processed is determined from the plurality of event information according to the first similarity.
For example, a target event representation may be retrieved from the database based on the video representation, e.g., a first similarity between the video representation and the plurality of event representations is calculated, and then the event representation corresponding to the largest first similarity is determined as the target event representation. In addition, target text information corresponding to the target event representation can be retrieved from the database according to the corresponding relation between the event representation and the event text information. And then taking the event information including the target event representation and the target event text information as target event information.
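A sketch of this retrieval step is given below, with cosine similarity playing the role of the first similarity; a real deployment would rely on the vector-retrieval capability of the database rather than the brute-force scan shown here.

```python
import numpy as np

def retrieve_target_event(video_repr, event_reprs, event_infos):
    """event_reprs: [N, D] array; event_infos: the N pieces of event information.
    Returns the event information whose representation is most similar to the video representation."""
    video_repr = video_repr / np.linalg.norm(video_repr)
    event_reprs = event_reprs / np.linalg.norm(event_reprs, axis=1, keepdims=True)
    first_similarity = event_reprs @ video_repr               # one score per event
    best = int(np.argmax(first_similarity))
    return event_infos[best], float(first_similarity[best])
```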
According to the technical scheme provided by the embodiment of the disclosure, the video representation can be determined according to the video text information and the key frame of the video to be processed, then the target event information matched with the video representation is retrieved from the plurality of event information according to the first similarity between the video representation and the event representation, and then the target event information is returned. The target event information includes event text information and event representation, the event text information in the target event information is understanding of the content of the video to be processed, and for example, the content of the video to be processed is understood through information including an event name, an event description, an event type, an event occurrence time, an event occurrence place, a participant and the like included in the event text information, so that the effect of understanding the video content is achieved.
According to another embodiment of the present disclosure, video text information may also be determined prior to determining the video representation. In this embodiment, the operation of determining the video text information may include the following operations: and determining the video clips in the video to be processed according to the key frames in the video to be processed. And then determining video segment text information corresponding to the video segment, wherein the video segment text information comprises at least one of subtitle text information obtained by performing optical character recognition on the video segment and voice recognition text information obtained by performing voice recognition on the video segment. And then determining the video text information according to the video clip text information.
For example, the time information of the key frame may be determined, and then a video clip covering a predetermined time before and after the key frame is captured from the video to be processed, where the predetermined time may be 3 seconds, 5 seconds, or the like.
For example, when the video text information includes subtitle text information and voice recognition text information, the subtitle text information and the voice recognition text information may be spliced and then the result of the splicing is taken as the video text information.
For example, at least one of the subtitle text information and the speech recognition text information may be spliced with a video title, and the result of the splicing may be taken as the video text information.
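The sketch below assembles the video text information as described above; `run_ocr` and `run_asr` are hypothetical callables standing in for whatever optical character recognition and speech recognition services are used, and the 3-second window is one of the example values mentioned earlier.

```python
def build_video_text(title, key_frame_time, run_ocr, run_asr, window_seconds=3.0):
    """Splice the video title, subtitle text and speech-recognition text
    for the clip surrounding one key frame into a single video text."""
    start = max(0.0, key_frame_time - window_seconds)
    end = key_frame_time + window_seconds
    subtitle_text = run_ocr(start, end)   # subtitle text from optical character recognition
    speech_text = run_asr(start, end)     # text from speech recognition
    return " ".join(part for part in (title, subtitle_text, speech_text) if part)
```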
According to the embodiment of the disclosure, the key frames are extracted from the video to be processed, then the video clips are determined according to the key frames, and then the video text information is determined according to the video clip text information corresponding to the video clips. Compared with the technical scheme that voice recognition or optical character recognition is carried out on the complete video to be processed and the recognized text information is used as the video text information, the embodiment of the disclosure can extract effective information from the video to be processed, so that the data volume of the video text information is reduced, and the data processing speed is increased.
According to another embodiment of the present disclosure, the operation of determining a video representation according to the video text information and the key frame of the video to be processed may include the following operations: extracting image features of key frames in the video to be processed and text features of video text information, and then performing feature fusion on the image features and the text features to obtain video representation.
For example, the deep learning model may include a first feature extraction network, a second feature extraction network, and a feature fusion network, the first feature extraction network may be used to extract image features of key frames, the second feature extraction network may be used to extract text features of video text information, and then the feature fusion network may be used to perform feature fusion on the image features and the text features, where the fusion may be, for example, stitching the image features and the text features.
The embodiment of the disclosure performs feature extraction and feature fusion on the key frames and the video text information in the video to be processed, thereby ensuring the accuracy of video representation.
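A minimal sketch of such a two-branch encoder with fusion by concatenation, written in PyTorch; the feature dimensions and the use of plain linear projections are assumptions made for illustration, not the networks specified by this disclosure.

```python
import torch
from torch import nn

class VideoRepresentationModel(nn.Module):
    """First branch projects key-frame image features, second branch projects
    video text features; a fusion layer combines them into the video representation."""
    def __init__(self, image_dim=2048, text_dim=768, out_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, out_dim)    # first feature extraction network
        self.text_proj = nn.Linear(text_dim, out_dim)      # second feature extraction network
        self.fusion = nn.Linear(2 * out_dim, out_dim)      # feature fusion network

    def forward(self, image_features, text_features):
        img = self.image_proj(image_features)
        txt = self.text_proj(text_features)
        return self.fusion(torch.cat([img, txt], dim=-1))  # fused video representation
```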
FIG. 6 is a schematic diagram of constructing a database according to an embodiment of the present disclosure.
As shown in fig. 6, the present embodiment explains a process of constructing a database. Event extraction can be performed on the information data 610 to obtain the structured event text information 620. Event text information 620 may then be input into the trained deep learning model described above, resulting in an event representation 630. The event text information 620 and the event representation 630 may then be stored in a database 640, and the database 640 may also store a correspondence between the event text information 620 and the event representation 630. It can be seen that database 640, constructed in the manner described above, supports vector retrieval functionality.
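A toy in-memory stand-in for the database 640 is sketched below: it holds event text information, event representations and their correspondence, and answers nearest-neighbour queries by cosine similarity. A production system would use a dedicated vector database; this class only illustrates the stored relationships.

```python
import numpy as np

class EventDatabase:
    """Stores event text information together with its event representation."""
    def __init__(self):
        self.event_texts, self.event_reprs = [], []

    def add(self, event_text, event_repr):
        self.event_texts.append(event_text)
        self.event_reprs.append(np.asarray(event_repr) / np.linalg.norm(event_repr))

    def search(self, query_repr, top_k=5):
        query = np.asarray(query_repr) / np.linalg.norm(query_repr)
        sims = np.stack(self.event_reprs) @ query          # cosine similarity to every event
        order = np.argsort(-sims)[:top_k]
        return [(self.event_texts[i], float(sims[i])) for i in order]
```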
In some embodiments, the database described above may be applied to a sample generation method and a training method of a deep learning model. For example, event text information can be obtained from a database, and then a sample is constructed and a deep learning model is trained by using the event text information.
In some embodiments, the database described above may be applied to a video processing method. For example, the event representation may be obtained from a database to determine a first similarity, and the target event information may be determined from the database according to the first similarity and the corresponding relationship between the event text information and the event representation.
According to another embodiment of the present disclosure, the video processing method may further include the following operations: a second similarity between the event text information and the video text information of each of the plurality of event information is determined. Accordingly, the operation of determining the target event information corresponding to the video to be processed from the plurality of event information according to the first similarity may include the following operations: and determining target event information corresponding to the video to be processed from the plurality of event information according to the first similarity and the second similarity.
In one example, a weighted sum of the first similarity and the second similarity may be calculated, and then the event information corresponding to the maximum weighted sum may be determined as the target event information.
In another example, event information, of the plurality of event information, of which the second similarity is greater than or equal to the similarity threshold and the first similarity is the largest, may be determined as the target event information.
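Both selection strategies can be sketched as below; the weight and the similarity threshold are illustrative values, not ones prescribed by this disclosure.

```python
def pick_target_event(candidates, weight=0.5, second_sim_threshold=0.5):
    """candidates: list of (event_info, first_similarity, second_similarity) tuples."""
    # Strategy 1: event with the largest weighted sum of the two similarities.
    by_weighted_sum = max(candidates, key=lambda c: weight * c[1] + (1 - weight) * c[2])
    # Strategy 2: keep events whose second similarity clears the threshold,
    # then take the largest first similarity among them.
    eligible = [c for c in candidates if c[2] >= second_sim_threshold]
    by_filter = max(eligible, key=lambda c: c[1]) if eligible else None
    return by_weighted_sum[0], (by_filter[0] if by_filter else None)
```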
The embodiment of the disclosure determines the target event information according to the first similarity related to the semantics and the second similarity related to the text, thereby ensuring the accuracy of the target event information.
Fig. 7 is a schematic diagram of a video processing method according to an embodiment of the present disclosure.
As shown in fig. 7, in this embodiment, a key frame 720 and video text information 730 in a to-be-processed video 710 may be determined, and then the key frame 720 and the video text information 730 may be input into the trained deep learning model, so as to obtain a video representation 740. Video representation 740 and video text information 730 may be considered video information 750.
A search may then be performed in database 760 to recall a plurality of event information 770 based on at least one of video representation 740 and video text information 730, the plurality of event information 770 each including an event representation and event text information.
A determination 780 of similarity may then be performed, such as a first similarity between the video representation 740 and the event representation, and a second similarity between the video textual information 730 and the event textual information. The target event information 790 is determined from the recalled plurality of event information 770 according to the first similarity and the second similarity. The target event information 790 may include event text information, which may include, for example, event name, event description, event type, time of event occurrence, location of event occurrence, attendees, and the like, and event representation.
According to another embodiment of the present disclosure, the video processing method may be applied to a video search and recommendation scene, in which the number of the videos to be processed is multiple, and the video processing method may further include a post-processing operation.
In one example, the post-processing operations may include the following operations: and determining respective video tags of the plurality of videos to be processed according to the event text information of the plurality of target event information corresponding to the plurality of videos to be processed. Then, in response to receiving a video search request including a search term, determining a third similarity between the video tags of the plurality of videos to be processed and the search term. And then determining a video to be pushed from the plurality of videos to be processed according to the third similarity as a search result aiming at the video search request.
For example, content entered by a user in a search page of an electronic device may be taken as a search term and trigger a video search request. For example, the search terms may be determined from the browsing history of the user and trigger a video search request.
For example, if the event text information in the target event information includes "On May 1, 2022, Zhang San won the championship in a diving competition held in the first city", the video tag of the video to be processed may be "Zhang San-diving-champion".
For example, a video to be processed with a third similarity greater than or equal to a third threshold may be determined as a video to be pushed, and then the video to be pushed may be presented to the user.
In another example, the post-processing operations may include the following operations: in response to receiving a video search request that includes a search term, a search term representation corresponding to the search term is determined. And then determining a fourth similarity between the event representation of the target event information corresponding to the videos to be processed and the search word representation. And then determining a video to be pushed from the plurality of videos to be processed according to the fourth similarity as a search result aiming at the video search request.
For example, a search term may be input into the deep learning model described above, resulting in a search term representation.
For example, a video to be processed with a fourth similarity greater than or equal to a fourth threshold may be determined as a video to be pushed, and then the video to be pushed may be presented to the user.
In another example, the video to be pushed may be determined according to the third similarity and the fourth similarity, for example, the video to be processed with the third similarity being equal to or greater than a third threshold and the fourth similarity being equal to or greater than a fourth threshold is determined as the video to be pushed.
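A sketch of this post-processing, combining the third and fourth similarities, is given below; the character-level text similarity, the cosine similarity between representations, and both thresholds are illustrative assumptions.

```python
import numpy as np
from difflib import SequenceMatcher

def select_videos_to_push(videos, search_term, search_term_repr,
                          third_threshold=0.5, fourth_threshold=0.5):
    """videos: list of dicts with a `tag` (built from the target event text information)
    and an `event_repr` (the target event representation)."""
    query = np.asarray(search_term_repr)
    query = query / np.linalg.norm(query)
    to_push = []
    for video in videos:
        # Third similarity: video tag vs. search term, at the text level.
        third = SequenceMatcher(None, video["tag"], search_term).ratio()
        # Fourth similarity: target event representation vs. search term representation.
        event_repr = np.asarray(video["event_repr"])
        fourth = float((event_repr / np.linalg.norm(event_repr)) @ query)
        if third >= third_threshold and fourth >= fourth_threshold:
            to_push.append(video)
    return to_push
```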
It should be understood that in practical applications, a large number of short videos are propagated in the network, and when the to-be-processed video is a short video cut from a complete information video, the to-be-processed video has incomplete content, so that the video representation cannot well represent events related to the to-be-processed video. With the increasing maturity of database construction technologies based on multimedia information, information related to events can be extracted from various types of multimedia information such as videos and news, and also information related to events can be extracted from a plurality of videos, and structured event information is generated based on the extracted information. Therefore, compared with the video to be processed, the event information can more completely describe the event related to the video to be processed.
Therefore, the post-processing operation provided by the embodiment of the disclosure determines the video to be pushed according to the target event information corresponding to the videos to be processed, so that the accuracy of video recommendation can be improved.
Fig. 8 is a schematic block diagram of a sample generation device according to an embodiment of the present disclosure.
As shown in fig. 8, the sample generation apparatus 800 may include a data determination module 810, a matching relationship determination module 820, a generation module 830, and a label determination module 840.
The data determination module 810 is configured to determine keyframe data and video text data in the video data.
The matching relationship determining module 820 is configured to determine event text data matching the video text data from the plurality of event text data according to the video text data, so as to obtain a matching relationship between the video text data and the event text data.
The generating module 830 is configured to generate sample data according to the matching relationship, the key frame data, the video text data, and the event text data.
The tag determination module 840 is configured to take the actual similarity between the video representation data and the event representation data as a tag based on the matching relationship, and add the tag to the sample data. The video presentation data is determined based on the keyframe data and the video text data, and the event presentation data is determined based on the event text data.
According to another embodiment of the present disclosure, the generation module includes a first generation submodule and a second generation submodule. The first generation submodule is used for generating a positive sample according to the key frame data, the video text data and the event text data matched with the video text data. And the second generation submodule is used for generating a negative sample according to the key frame data, the video text data and the event text data which is not matched with the video text data.
According to another embodiment of the present disclosure, the matching relationship determination module includes a candidate determination submodule and a second determination submodule. The candidate determination submodule is used for determining candidate event text data according to the similarity between each of the plurality of event text data and the video text data. The second determination submodule is used for determining that the candidate event text data matches the video text data when the attribute information corresponding to the candidate event text data is consistent with the attribute information corresponding to the video text data.
Fig. 9 is a schematic block diagram of a deep learning model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 9, the training apparatus 900 for the deep learning model may include an obtaining module 910, an inputting module 920, a similarity determining module 930, and an adjusting module 940.
The obtaining module 910 is configured to obtain sample data. For example, the sample data may be generated according to the above-described sample generation apparatus.
The input module 920 is used for inputting the sample data into the deep learning model to obtain video representation data and event representation data.
The similarity determination module 930 is configured to determine a similarity between the video representation data and the event representation data.
The adjusting module 940 is configured to adjust parameters of the deep learning model according to the similarity and the label corresponding to the sample data.
Fig. 10 is a schematic block diagram of a video processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 10, the video processing apparatus 1000 may include a video representation determining module 1010, a first similarity determining module 1020, and a target event information determining module 1030.
The video representation determining module 1010 is configured to determine a video representation according to the video text information and the key frame of the video to be processed.
The first similarity determination module 1020 is configured to determine a first similarity between an event representation and a video representation of each of a plurality of event information, each of the plurality of event information including event text information and an event representation determined according to the event text information.
The target event information determining module 1030 is configured to determine target event information corresponding to the to-be-processed video from the plurality of event information according to the first similarity.
According to another embodiment of the present disclosure, the video processing apparatus further includes a second similarity determining module, configured to determine a second similarity between the video text information and the event text information of each of the plurality of event information; the target event information determining module comprises a first determining submodule and is used for determining target event information corresponding to the video to be processed from the plurality of event information according to the first similarity and the second similarity.
According to another embodiment of the present disclosure, the first determination submodule includes a determination unit configured to determine, as the target event information, event information in which the second similarity is equal to or greater than a similarity threshold and the first similarity is the largest, from among the plurality of event information.
According to another embodiment of the present disclosure, the video representation determination module includes an extraction sub-module and a fusion sub-module. The extraction submodule is used for extracting the image characteristics of the key frames in the video to be processed and the text characteristics of the video text information. And the fusion sub-module is used for carrying out feature fusion on the image features and the text features to obtain video representation.
According to another embodiment of the present disclosure, the video processing apparatus further includes a segment determining module, a segment information determining module, and a video text information determining module. The segment determining module is used for determining video segments in the video to be processed according to the key frames in the video to be processed before determining the video representation. The segment information determining module is used for determining video segment text information corresponding to the video segment, wherein the video segment text information comprises at least one of subtitle text information obtained by performing optical character recognition on the video segment and voice recognition text information obtained by performing voice recognition on the video segment. The video text information determining module is used for determining the video text information according to the video segment text information.
According to another embodiment of the present disclosure, the number of videos to be processed is plural; the device further comprises a video tag determination module, a third similarity determination module and a first result determination module. The video label determining module is used for determining respective video labels of the multiple videos to be processed according to event text information of the multiple target event information corresponding to the multiple videos to be processed. The third similarity determining module is used for determining third similarities between video labels of the videos to be processed and the search terms in response to receiving a video search request comprising the search terms. The first result determining module is used for determining a video to be pushed from the plurality of videos to be processed according to the third similarity, and the video to be pushed is used as a searching result aiming at the video searching request.
According to another embodiment of the present disclosure, the number of videos to be processed is plural; the apparatus further comprises a search term representation determining module, a fourth similarity determining module, and a second result determining module. The search term representation determining module is configured to determine, in response to receiving a video search request comprising a search term, a search term representation corresponding to the search term. The fourth similarity determining module is used for determining a fourth similarity between the event representation of the target event information corresponding to each video to be processed and the search term representation. The second result determining module is used for determining a video to be pushed from the plurality of videos to be processed according to the fourth similarity, and the video to be pushed is used as a search result for the video search request.
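For the representation-based retrieval path, a comparable sketch is shown below. It assumes the search term representation comes from an encoder compatible with the event representations, and again uses cosine similarity and a top_k cut-off as illustrative choices.

```python
import numpy as np

def search_by_event_representation(term_repr, event_reprs, top_k=10):
    """Rank candidate videos by the similarity between the search-term representation
    and the event representation of each video's target event information.
    """
    term_norm = term_repr / np.linalg.norm(term_repr)
    event_norms = event_reprs / np.linalg.norm(event_reprs, axis=1, keepdims=True)
    sims = event_norms @ term_norm
    return np.argsort(-sims)[:top_k].tolist()   # indices of the videos to be pushed
```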
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users involved all comply with the relevant laws and regulations, and do not violate public order and good customs.
In the technical solution of the present disclosure, the authorization or consent of the user is obtained before the personal information of the user is acquired or collected.
According to an embodiment of the present disclosure, there is also provided an electronic device, comprising at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video processing method, the sample generation method, and the training method of the deep learning model described above.
According to an embodiment of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the video processing method, the sample generation method, and the training method of the deep learning model described above.
According to an embodiment of the present disclosure, the present disclosure also provides a computer program product, including a computer program, which when executed by a processor implements the video processing method, the sample generation method, and the training method of the deep learning model described above.
FIG. 11 shows a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 11, the device 1100 comprises a computing unit 1101, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the device 1100 may also be stored. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
A number of components in device 1100 connect to I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, and the like; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108, such as a magnetic disk, optical disk, or the like; and a communication unit 1109 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 may be any of various general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1101 performs the various methods and processes described above, such as the video processing method, the sample generation method, and the training method of the deep learning model. For example, in some embodiments, the video processing method, the sample generation method, and the training method of the deep learning model may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the video processing method, the sample generation method, or the training method of the deep learning model described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured in any other suitable manner (for example, by means of firmware) to perform the video processing method, the sample generation method, or the training method of the deep learning model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (25)

1. A video processing method, comprising:
determining video representation according to video text information and key frames of a video to be processed;
determining a first similarity between an event representation and the video representation for each of a plurality of event information, the plurality of event information each including event text information and an event representation determined from the event text information; and
determining target event information corresponding to the video to be processed from the plurality of event information according to the first similarity.
2. The method of claim 1, further comprising:
determining a second similarity between the event text information and the video text information of each of the plurality of event information;
wherein the determining, according to the first similarity, target event information corresponding to the video to be processed from the plurality of event information includes:
determining target event information corresponding to the video to be processed from the plurality of event information according to the first similarity and the second similarity.
3. The method of claim 2, wherein the determining, from the plurality of event information, target event information corresponding to the video to be processed according to the first similarity and the second similarity comprises:
determining event information, of the plurality of event information, of which the second similarity is greater than or equal to a similarity threshold and the first similarity is the maximum, as the target event information.
4. The method of claim 1, wherein the determining a video representation from video text information and key frames of the video to be processed comprises:
extracting image features of key frames in the video to be processed and text features of the video text information; and
performing feature fusion on the image features and the text features to obtain the video representation.
5. The method of claim 1, further comprising: prior to the determination of the video representation,
determining a video segment in the video to be processed according to the key frame in the video to be processed;
determining video segment text information corresponding to the video segment, wherein the video segment text information comprises at least one of subtitle text information obtained by performing optical character recognition on the video segment and voice recognition text information obtained by performing voice recognition on the video segment; and
determining the video text information according to the video segment text information.
6. The method according to any one of claims 1 to 5, wherein the number of videos to be processed is plural; the method further comprises the following steps:
determining video tags of a plurality of videos to be processed according to event text information of a plurality of target event information corresponding to the videos to be processed;
in response to receiving a video search request comprising a search term, determining a third similarity between video tags of the videos to be processed and the search term; and
determining a video to be pushed from the plurality of videos to be processed according to the third similarity, as a search result for the video search request.
7. The method according to any one of claims 1 to 5, wherein the number of videos to be processed is plural; the method further comprises the following steps:
in response to receiving a video search request that includes a search term, determining a search term representation that corresponds to the search term;
determining a fourth similarity between the event representation of a plurality of target event information corresponding to a plurality of videos to be processed and the search word representation; and
determining a video to be pushed from the plurality of videos to be processed according to the fourth similarity, as a search result for the video search request.
8. A sample generation method, comprising:
determining key frame data and video text data in the video data;
according to the video text data, determining event text data matched with the video text data from a plurality of event text data to obtain a matching relation between the video text data and the event text data;
generating sample data according to the matching relation, the key frame data, the video text data and the event text data; and
based on the matching relation, taking the actual similarity between the video representation data and the event representation data as a label, and adding the label to the sample data;
wherein the video representation data is determined based on the keyframe data and the video text data, and the event representation data is determined based on the event text data.
9. The method of claim 8, wherein the generating sample data according to the matching relation, the key frame data, the video text data and the event text data comprises:
generating a positive sample according to the key frame data, the video text data and the event text data matched with the video text data; and
generating a negative sample according to the key frame data, the video text data and the event text data which is not matched with the video text data.
10. The method of claim 8 or 9, wherein the determining, according to the video text data, event text data matched with the video text data from the plurality of event text data comprises:
determining candidate event text data according to the similarity between each of the event text data and the video text data; and
under the condition that the attribute information corresponding to the candidate event text data is determined to be consistent with the attribute information corresponding to the video text data, determining that the candidate event text data is matched with the video text data.
11. A training method of a deep learning model comprises the following steps:
acquiring sample data;
inputting the sample data into a deep learning model to obtain video representation data and event representation data;
determining a similarity between the video representation data and the event representation data; and
adjusting parameters of the deep learning model according to the similarity and the label corresponding to the sample data;
wherein the sample data is generated according to the method of any one of claims 8 to 10.
12. A video processing apparatus comprising:
the video representation determining module is used for determining video representation according to the video text information and the key frame of the video to be processed;
a first similarity determination module, configured to determine a first similarity between an event representation of each of a plurality of event information and the video representation, where the plurality of event information each includes event text information and an event representation determined according to the event text information; and
the target event information determining module is used for determining target event information corresponding to the video to be processed from the plurality of event information according to the first similarity.
13. The apparatus of claim 12, further comprising:
a second similarity determination module, configured to determine a second similarity between event text information of each of the plurality of event information and the video text information;
wherein the target event information determination module includes:
the first determining submodule is used for determining target event information corresponding to the video to be processed from the plurality of event information according to the first similarity and the second similarity.
14. The apparatus of claim 13, wherein the first determination submodule comprises:
a determining unit, configured to determine event information, of the plurality of event information, for which the second similarity is greater than or equal to a similarity threshold and the first similarity is the greatest, as the target event information.
15. The apparatus of claim 12, wherein the video representation determination module comprises:
the extraction submodule is used for extracting the image characteristics of the key frames in the video to be processed and the text characteristics of the video text information; and
the fusion submodule is used for performing feature fusion on the image features and the text features to obtain the video representation.
16. The apparatus of claim 12, further comprising:
the segment determining module is used for determining a video segment in the video to be processed according to the key frame in the video to be processed before the video representation is determined;
a segment information determining module, configured to determine video segment text information corresponding to the video segment, where the video segment text information includes at least one of subtitle text information obtained by performing optical character recognition on the video segment and voice recognition text information obtained by performing voice recognition on the video segment; and
the video text information determining module is used for determining the video text information according to the video segment text information.
17. The apparatus according to any one of claims 12 to 16, wherein the number of the videos to be processed is plural; the device further comprises:
the video tag determining module is used for determining video tags of the multiple videos to be processed according to event text information of multiple target event information corresponding to the multiple videos to be processed;
the third similarity determining module is used for determining third similarities between video tags of the videos to be processed and the search terms in response to receiving a video search request comprising the search terms; and
the first result determining module is used for determining a video to be pushed from the plurality of videos to be processed according to the third similarity, and the video to be pushed is used as a search result for the video search request.
18. The apparatus according to any one of claims 12 to 16, wherein the number of videos to be processed is plural; the device further comprises:
the device comprises a search word representation determining module, a searching word representation determining module and a searching word representation generating module, wherein the search word representation determining module is used for determining search word representations corresponding to search words in response to receiving a video search request comprising the search words;
the fourth similarity determining module is used for determining fourth similarities between the event representations of the target event information corresponding to the videos to be processed and the search word representation; and
the second result determining module is used for determining a video to be pushed from the plurality of videos to be processed according to the fourth similarity, and the video to be pushed is used as a search result for the video search request.
19. A sample generation device, comprising:
the data determining module is used for determining key frame data and video text data in the video data;
the matching relation determining module is used for determining event text data matched with the video text data from a plurality of event text data according to the video text data to obtain the matching relation between the video text data and the event text data;
the generating module is used for generating sample data according to the matching relation, the key frame data, the video text data and the event text data; and
a tag determination module, configured to take an actual similarity between the video representation data and the event representation data as a tag based on the matching relationship, and add the tag to the sample data;
wherein the video representation data is determined based on the keyframe data and the video text data, and the event representation data is determined based on the event text data.
20. The apparatus of claim 19, wherein the generating means comprises:
the first generation submodule is used for generating a positive sample according to the key frame data, the video text data and the event text data matched with the video text data; and
the second generation submodule is used for generating a negative sample according to the key frame data, the video text data and the event text data which is not matched with the video text data.
21. The apparatus of claim 19 or 20, wherein the match relationship determination module comprises:
the candidate determining submodule is used for determining candidate event text data according to the similarity between each of the event text data and the video text data; and
the second determining submodule is used for determining that the candidate event text data is matched with the video text data under the condition that the attribute information corresponding to the candidate event text data is determined to be consistent with the attribute information corresponding to the video text data.
22. A training apparatus for deep learning models, comprising:
the acquisition module is used for acquiring sample data;
the input module is used for inputting the sample data into a deep learning model to obtain video representation data and event representation data;
a similarity determination module for determining a similarity between the video representation data and the event representation data; and
the adjusting module is used for adjusting the parameters of the deep learning model according to the similarity and the label corresponding to the sample data;
wherein the sample data is generated by the apparatus of any of claims 19 to 21.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 11.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 11.
Priority Applications (1)

Application Number: CN202210732217.6A; Priority Date: 2022-06-23; Filing Date: 2022-06-23; Publication: CN115098729A (en), status Pending; Title: Video processing method, sample generation method, model training method and device

Publications (1)

Publication Number: CN115098729A; Publication Date: 2022-09-23

Family ID: 83293729

Family Applications (1)

Application Number: CN202210732217.6A; Status: Pending; Publication: CN115098729A (en); Title: Video processing method, sample generation method, model training method and device

Country Status (1)

CN: CN115098729A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024067276A1 * 2022-09-30 2024-04-04 Huawei Technologies Co., Ltd. Video tag determination method and apparatus, device and medium


Similar Documents

Publication Publication Date Title
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN114861889B (en) Deep learning model training method, target object detection method and device
CN113806588B (en) Method and device for searching video
CN115982376B (en) Method and device for training model based on text, multimode data and knowledge
CN110019948B (en) Method and apparatus for outputting information
CN111708909A (en) Video tag adding method and device, electronic equipment and computer-readable storage medium
CN111639228B (en) Video retrieval method, device, equipment and storage medium
CN115099239A (en) Resource identification method, device, equipment and storage medium
CN113301382B (en) Video processing method, device, medium, and program product
CN115098729A (en) Video processing method, sample generation method, model training method and device
CN114880498B (en) Event information display method and device, equipment and medium
CN116049370A (en) Information query method and training method and device of information generation model
CN113365138B (en) Content display method and device, electronic equipment and storage medium
CN113076932B (en) Method for training audio language identification model, video detection method and device thereof
CN113542910A (en) Method, device and equipment for generating video abstract and computer readable storage medium
CN111460971A (en) Video concept detection method and device and electronic equipment
CN113377922B (en) Method, device, electronic equipment and medium for matching information
CN113377921B (en) Method, device, electronic equipment and medium for matching information
CN115828915B (en) Entity disambiguation method, device, electronic equipment and storage medium
CN113722496B (en) Triple extraction method and device, readable storage medium and electronic equipment
CN113934918A (en) Searching method and device for live broadcast, electronic equipment and storage medium
CN116680441A (en) Video content identification method, device, electronic equipment and readable storage medium
CN118102024A (en) Video information generation method, device, electronic equipment and computer readable medium
CN115098730A (en) Method for acquiring video data and training method and device of deep learning model
CN114496256A (en) Event detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination