CN115098730A - Method for acquiring video data and training method and device of deep learning model - Google Patents


Info

Publication number
CN115098730A
CN115098730A (application CN202210796905.9A)
Authority
CN
China
Prior art keywords
word
video data
category
target
candidate words
Prior art date
Legal status
Pending
Application number
CN202210796905.9A
Other languages
Chinese (zh)
Inventor
杨虎
李国豪
冯知凡
柴春光
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210796905.9A
Publication of CN115098730A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/785Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a method for acquiring video data and a training method, apparatus, device, medium, and product for a deep learning model, and relates to artificial intelligence fields such as knowledge graphs, natural language processing, and deep learning. The method for acquiring video data comprises the following steps: processing first text data associated with first type video data to obtain candidate words and word categories corresponding to the candidate words; determining a target word from the candidate words based on the word category; and acquiring target video data associated with the first type video data from second type video data based on the target word.

Description

Method for acquiring video data and training method and device of deep learning model
Technical Field
The present disclosure relates to the field of artificial intelligence technologies such as knowledge graphs, natural language processing, and deep learning, and more particularly, to a method for acquiring video data, a method and an apparatus for training a deep learning model, an electronic device, a medium, and a program product.
Background
In some cases, video data is used for related processing, for example, as training samples for training a deep learning model. However, existing methods for acquiring video data are costly and the acquired video is of low quality, which leads to poor results when the video data is used for such processing.
Disclosure of Invention
The present disclosure provides a method of acquiring video data, a method of training a deep learning model, an apparatus, an electronic device, a storage medium, and a program product.
According to an aspect of the present disclosure, there is provided a method of acquiring video data, including: processing first text data associated with first type video data to obtain candidate words and word categories corresponding to the candidate words; determining a target word from the candidate words based on the word category; and acquiring target video data associated with the first type of video data from second type of video data based on the target words.
According to another aspect of the present disclosure, there is provided a training method of a deep learning model, including: acquiring sample video data; and training a deep learning model by using the sample video data, wherein the sample video data is obtained according to the method for obtaining the video data.
According to another aspect of the present disclosure, there is provided an apparatus for acquiring video data, including: the device comprises a processing module, a determining module and an obtaining module. The processing module is used for processing first text data associated with the first type video data to obtain candidate words and word categories corresponding to the candidate words; a determining module for determining a target word from the candidate words based on the word category; and the acquisition module is used for acquiring target video data associated with the first type of video data from second type of video data based on the target words.
According to another aspect of the present disclosure, there is provided a training apparatus for a deep learning model, including: the device comprises an acquisition module and a training module. The acquisition module is used for acquiring sample video data; and the training module is used for training a deep learning model by utilizing the sample video data, wherein the sample video data is obtained according to the device for obtaining the video data.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one or more of the methods of acquiring video data, training deep learning models described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform any one or more of the above-described method of acquiring video data, training method of deep learning model.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program/instructions stored on a readable storage medium and/or an electronic device which, when executed by a processor, implement the steps of any one or more of the above-described method for acquiring video data and training method of a deep learning model.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates a system architecture for acquiring video data and/or training of deep learning models according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow chart of a method of acquiring video data according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of a method of acquiring video data according to an embodiment of the present disclosure;
FIG. 4A is a diagram that schematically illustrates a list of candidate words, in accordance with an embodiment of the present disclosure;
FIG. 4B is a diagram that schematically illustrates a list of candidate words, in accordance with another embodiment of the present disclosure;
FIG. 4C schematically illustrates a schematic diagram of a tree structure according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram of a method of training a deep learning model according to an embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of an apparatus for acquiring video data according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a training apparatus for deep learning models according to an embodiment of the present disclosure; and
FIG. 8 is a block diagram of an electronic device for performing acquisition of video data and/or training of a deep learning model, used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
Fig. 1 schematically illustrates a system architecture for acquiring video data and/or training of deep learning models according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include clients 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between clients 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may use clients 101, 102, 103 to interact with server 105 over network 104 to receive or send messages, etc. Various messaging client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the clients 101, 102, 103.
Clients 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop and desktop computers, and the like. The clients 101, 102, 103 of the disclosed embodiments may run applications, for example.
The server 105 may be a server that provides various services, such as a back-office management server (for example only) that provides support for websites browsed by users using the clients 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the client. In addition, the server 105 may also be a cloud server, i.e., the server 105 has a cloud computing function.
It should be noted that the method for acquiring video data provided by the embodiment of the present disclosure may be executed by the server 105. Accordingly, the apparatus for acquiring video data provided by the embodiment of the present disclosure may be disposed in the server 105. Alternatively, the training method of the deep learning model provided by the embodiment of the present disclosure may be performed by the server 105. Accordingly, the training device of the deep learning model provided by the embodiment of the present disclosure may be disposed in the server 105.
In one example, the client 101, 102, 103 may send an instruction to the server 105 to obtain video data, which may include the first type of video data. The server 105 processes the first type video data in response to the instruction to acquire video data, resulting in target video data. Alternatively, the first type video data is stored in the server 105, and after the server 105 receives the video data acquisition instruction from the clients 101, 102, 103, the server 105 processes the first type video data based on the video data acquisition instruction to obtain the target video data.
In another example, the clients 101, 102, 103 may send training instructions of the deep learning model to the server 105, which may include sample video data therein. The server 105 trains the deep learning model with the sample video data in response to the training instructions of the deep learning model. Alternatively, the sample video data is stored in the server 105, and after the server 105 receives a training instruction of the deep learning model from the clients 101, 102, 103, the server 105 trains the deep learning model using the sample video data based on the training instruction of the deep learning model.
It should be understood that the number of clients, networks, and servers in FIG. 1 is merely illustrative. There may be any number of clients, networks, and servers, as desired for an implementation.
A method of acquiring video data and a training method of a deep learning model according to an exemplary embodiment of the present disclosure are described below with reference to fig. 2 to 5 in conjunction with the system architecture of fig. 1. The method for acquiring video data and the training method for deep learning model according to the embodiments of the present disclosure may be performed by, for example, a server shown in fig. 1, which is, for example, the same as or similar to the electronic device below.
Fig. 2 schematically shows a flow chart of a method of acquiring video data according to an embodiment of the present disclosure.
As shown in fig. 2, the method 200 of acquiring video data according to the embodiment of the present disclosure may include, for example, operations S210 to S230.
In operation S210, first text data associated with the first type of video data is processed to obtain candidate words and word categories corresponding to the candidate words.
In operation S220, a target word is determined from the candidate words based on the word category.
In operation S230, target video data associated with the first type of video data is acquired from the second type of video data based on the target word.
Illustratively, the first type of video data includes, for example, video data in a first language and the second type of video data includes, for example, video data in a second language. The first language is, for example, English, and the second language is, for example, Chinese. Alternatively, the first language is, for example, Chinese and the second language is, for example, English. For convenience of illustration, the embodiments of the present disclosure take the first language being English and the second language being Chinese as an example.
For the first type of video data, the associated first text data is, for example, English text, and the first text data is processed to obtain a plurality of candidate words and a word category corresponding to each candidate word. The candidate words include Chinese words, each of which is a single character or a word.
After obtaining the plurality of candidate words and the word category of each candidate word, at least one target word may be selected from the plurality of candidate words based on the word category, and the word category of the target word conforms to, for example, a preset word category. The selected target word is, for example, an important word of the plurality of candidate words, and the target word can represent the video content of the first type of video data.
After the target word is obtained, the target word may be matched with each of the plurality of second type video data, and the second type video data matching the target word is used as the target video data. The target video data is, for example, a Chinese video and is associated with the video content of the first type video data; for example, the target video data is consistent with or similar to the video theme of the first type video data.
According to the embodiments of the present disclosure, first type video data is usually abundant and of high quality. To obtain second type video data, the first type video data can be used as a reference: target words are obtained by processing the first type video data, and the target video data is obtained by searching the second type video data based on the target words. The video content and video quality of the target video data are then consistent with those of the first type video data, so target video data meeting the requirements can be obtained while reducing the acquisition cost of the video data.
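For illustration only, operations S220 and S230 can be sketched in Python as follows. The category labels, the dictionary keys, and the containment-based matching rule are assumptions made for this sketch and are not limitations of the disclosed method; the candidate words and their word categories from operation S210 are assumed to be already available.

```python
# A minimal sketch of operations S220-S230, assuming the candidate words and
# their word categories are available as (word, category) pairs; labels and
# dictionary keys are illustrative assumptions.

TARGET_CATEGORIES = {"noun", "scene", "sensory_feature"}  # preset word categories

def select_target_words(candidates):
    # S220: keep candidate words whose word category is a preset category
    return [word for word, category in candidates if category in TARGET_CATEGORIES]

def acquire_target_videos(candidates, second_type_videos):
    # S230: a second type video whose associated text contains a target word
    # is taken as target video data
    target_words = select_target_words(candidates)
    return [video for video in second_type_videos
            if any(word in video["second_text_data"] for word in target_words)]

if __name__ == "__main__":
    candidates = [("gold", "sensory_feature"), ("potato", "noun"), ("will", "auxiliary")]
    videos = [{"id": 1, "second_text_data": "how to make classic mashed potato"},
              {"id": 2, "second_text_data": "spring travel vlog"}]
    print(acquire_target_videos(candidates, videos))   # only video 1 matches
```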
In another example of the present disclosure, the first text data is obtained based on any one or more of title data, description information, subtitle information, and voice information of the first type video data, for example.
For example, title data of the first type of video data may be determined as the first text data.
Alternatively, the description information of the first type video data may be determined as the first text data. The description information includes, for example, a subject of the first type of video data or summary information of the content of the first type of video data.
Or, identifying the caption information of the first type of video data to obtain first text data. For example, the subtitle information may be recognized by an Optical Character Recognition (OCR) method, and the recognized subtitle information may be determined as the first text data.
Or recognizing the voice information of the first type video data to obtain first text data. For example, the speech information may be recognized by using a speech recognition technique to obtain a text, and the recognized text may be determined as the first text data.
Embodiments of the present disclosure can obtain the first text data in various ways, which improves the flexibility of acquisition and the accuracy of the first text data.
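As a rough illustration of the above, the snippet below assembles the first text data from whichever of the four sources are available. The callables ocr_subtitles and transcribe_speech are hypothetical placeholders standing in for an OCR engine and a speech-recognition engine, not APIs of any specific library.

```python
# A sketch of building the first text data from title data, description
# information, recognized subtitle information, and recognized voice information.

def build_first_text_data(video, ocr_subtitles=None, transcribe_speech=None):
    parts = []
    if video.get("title"):
        parts.append(video["title"])                      # title data
    if video.get("description"):
        parts.append(video["description"])                # description information
    if ocr_subtitles is not None and video.get("frames"):
        parts.append(ocr_subtitles(video["frames"]))      # subtitle information (OCR)
    if transcribe_speech is not None and video.get("audio"):
        parts.append(transcribe_speech(video["audio"]))   # voice information (ASR)
    return " ".join(parts)
```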
In another example of the present disclosure, the text type of the first text data includes, for example, a first text type, which is, for example, an English type.
The video library includes, for example, a plurality of second type video data. For each second type of video data, second text data associated with the second type of video data may be obtained, the text type of the second text data being, for example, a second text type, the second text type including, for example, a Chinese type.
For example, the text type of the first text data may be converted from the first text type to the second text type, resulting in the converted first text data. For example, the first text data is translated to obtain the converted first text data. And then, processing the converted first text data to obtain the target word.
Next, target video data is acquired from the second type of video data based on the target words and the second text data, and the second text data corresponding to the target video data matches the target words. For example, at least one second type video data may be selected from the plurality of second type video data as the target video data. For example, for each second type of video data, if the second text data corresponding to the second type of video data includes the target word, indicating that the second text data matches the target word, the second type of video data may be used as the target video data. Alternatively, if the second text data corresponding to the second type of video data is similar to the target word, indicating that the second text data matches the target word, the second type of video data may be used as the target video data. For example, the second text data corresponds to a first vector (or a portion of words in the second text data corresponds to a first vector), the target word corresponds to a second vector, and the second text data may be determined to be similar to the target word if the vector distance between the first vector and the second vector is less than a preset distance.
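The two matching rules above (containment and vector distance) can be illustrated with the toy sketch below. The letter-frequency embedding is only a stand-in for whatever text encoder an implementation would actually use, and the preset distance value is an assumption.

```python
import math

def toy_embedding(text):
    # hypothetical placeholder encoder: letter-frequency vector, L2-normalized
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def matches(second_text_data, target_word, preset_distance=0.5):
    # rule 1: the second text data contains the target word
    if target_word in second_text_data:
        return True
    # rule 2: the vector distance between the first vector (second text data)
    # and the second vector (target word) is less than a preset distance
    first_vector = toy_embedding(second_text_data)
    second_vector = toy_embedding(target_word)
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(first_vector, second_vector)))
    return distance < preset_distance
```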
Illustratively, the second text data associated with the second type of video data may be acquired based on any one or more of title data, description information, subtitle information, and voice information of the second type of video data.
For example, title data of the second type video data is determined as the second text data. Alternatively, the description information of the second type video data is determined as the second text data. Or, identifying the caption information of the second type of video data to obtain second text data. Or identifying the voice information of the second type video data to obtain second text data. It is understood that the manner of acquiring the second text data is similar to the manner of acquiring the first text data, and is not described herein again.
According to the embodiment of the disclosure, after the target words are obtained based on the first type of video data, the target words are matched with the second text data of the second type of video data, so that the target video data is selected from the plurality of second type of video data, the relevance between the target video data and the first type of video data is improved, and the obtained target video data is high-quality data at a high probability under the condition that the first type of video data is high-quality data.
Fig. 3 schematically illustrates a schematic diagram of a method of acquiring video data according to an embodiment of the present disclosure.
As shown in fig. 3, for the first type of video data 310, the first text data 311 corresponding to the first type of video data 310 is, for example, English data. The text type of the first text data 311 is converted from an English type to a Chinese type, thereby obtaining converted first text data 312.
After the converted first text data 312 is obtained, the converted first text data 312 is processed in a Sequence Labeling manner, so as to obtain a candidate word list 313, where the candidate word list 313 includes, for example, a plurality of candidate words and a word category corresponding to each candidate word. The sequence labeling method is, for example, a natural language processing technique.
At least one candidate word is selected as the target word 314 for a plurality of candidate words in the candidate word list 313 based on the word category corresponding to each candidate word. The word category corresponding to the target word 314 is, for example, a preset word category.
The target word 314 is, for example, a key word for the first type of video data 310, and the target word 314 characterizes the main content of the first type of video data 310. Accordingly, the target video data may be acquired based on the target word 314.
For example, for the plurality of second type video data 320, 330, 340 in the video library, the second type video data 320, 330, 340 are all, for example, Chinese videos. The second text data 321 corresponding to the second type video data 320, the second text data 331 corresponding to the second type video data 330, and the second text data 341 corresponding to the second type video data 340 are, for example, Chinese data. The target word 314 is matched with the second text data 321, with the second text data 331, and with the second text data 341. If it is determined that target word 314 matches second text data 331, the second type video data 330 corresponding to second text data 331 may be determined as the target video data. In one example, target word 314 matching second text data 331 means, for example, that second text data 331 contains target word 314.
According to the embodiments of the present disclosure, the target video data obtained based on the target words is relevant to the first type video data, and target video data of higher quality is obtained based on first type video data of higher quality.
FIG. 4A schematically shows a diagram of a candidate word list according to an embodiment of the present disclosure.
As shown in fig. 4A, the candidate word list 413A includes at least a plurality of candidate words and a word category corresponding to each candidate word, and the word category includes at least one of a first word category and a second word category, for example. The candidate word list 413A also includes other contents such as a word length.
For example, the converted first text data is processed in a sequence labeling manner, and a plurality of candidate words and a word category of each candidate word are obtained. The sequence labeling mode has, for example, a word segmentation function, a word classification function, a word semantic understanding function, and the like.
For example, "step by step" on the converted first text data. The golden potatoes are cut into classical mashed potatoes. The word segmentation processing is carried out to obtain a plurality of candidate words, namely ' one step ', ' ground ', '. "," will "," gold "," potato "," cut into "," classic "," mashed potato ",".
In one approach, each candidate word may be classified, resulting in a first word category corresponding to each candidate word. For example, a plurality of categories are set in advance, and each candidate word is predicted to belong to at least one of the plurality of categories. The preset plurality of categories include, for example, "people category", "works category", "objects category", "organizational category", "cultural category", "time category", "diet category", "scene event", "life category", "sensory characteristics", "numeric words", "auxiliary words", "preposition", "vocabulary words", "modifiers", and the like. The category of each candidate word may be predicted by a classification model to determine at least one first word category corresponding to each candidate word from a plurality of categories set in advance.
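To make this classification step concrete, the sketch below maps candidate words to first word categories through a small lookup table; in practice the table would be replaced by a trained classification model, and both the table entries and the category names are illustrative assumptions.

```python
# A sketch of predicting first word categories for candidate words; the lookup
# table stands in for a classification model over the preset categories.

PRESET_CATEGORIES = [
    "people", "works", "objects", "organization", "culture", "time", "diet",
    "scene_event", "life", "sensory_feature", "numeral", "auxiliary",
    "preposition", "modifier",
]

FIRST_CATEGORY_LOOKUP = {        # hypothetical model predictions
    "potato": ["diet"],
    "mashed potato": ["diet"],
    "cut into": ["scene_event"],
    "gold": ["sensory_feature"],
    "will": ["auxiliary"],
}

def predict_first_word_categories(candidate_word):
    categories = FIRST_CATEGORY_LOOKUP.get(candidate_word, [])
    assert all(c in PRESET_CATEGORIES for c in categories)
    return categories
```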
In another approach, each candidate word may be semantically understood, resulting in a second word category corresponding to each candidate word. For example, the second word category may be mined using a knowledge-graph technique for semantic understanding. Taking the candidate word "potato" as an example, semantic understanding of "potato" is performed to obtain a higher-level concept or higher-level attribute of potato, for example, the higher-level concept or higher-level attribute of "potato" is obtained as a "potato" category, and the higher-level concept or higher-level attribute of "potato" is obtained as a "food" category.
Illustratively, the word categories corresponding to the target words include at least one of: noun category, scene category, sensory feature category. For example, the word categories corresponding to the target word include a first word category and a second word category. The first word category corresponding to the target word includes, for example, at least one of: a first noun category, a first scene category, a first sensory characteristic category. The second word category corresponding to the target word includes, for example, at least one of: a second noun category, a second scene category, a second sensory characteristic category.
In one example, the candidate word and a first word category of the candidate word may be obtained, and the target word may then be determined from the candidate words based on the first word category. The first word category corresponding to the target word includes, for example, at least one of: a first noun category, a first scene category, a first sensory characteristic category. The first noun class includes, for example, "diet class". The first scene category, for example, characterizes action scenes or action events of the first type of video data, e.g., the first scene category indicates that the target word belongs to a verb, which may be represented as a scene or action event, as shown in fig. 4A, which may include "scene events". The first sensory characteristic category includes, for example, a color characteristic, a status characteristic, and the like, for example, the first sensory characteristic category indicates that the target word belongs to an adjective, the adjective includes a color adjective, a status adjective, and the like, and as shown in fig. 4A, the first sensory characteristic category may include "sensory characteristics".
In another approach, the candidate word and a second word category of the candidate word may be obtained, and the target word may then be determined from the candidate words based on the second word category. The second word category corresponding to the target word includes, for example, at least one of: a second noun category, a second scene category, a second sensory characteristic category. The second noun category includes, for example, "food", "potato", "mashed potato", and the like. The second scene category characterizes, for example, a domain scene, an action scene, or an action event to which the first type of video data belongs, for example, the second scene category indicates that the target word belongs to a verb, which may represent the domain scene, the action scene, or the action event, as shown in fig. 4A, and includes a "life class" characterizing the domain scene to which the first type of video data belongs, and a "scene event" characterizing the action scene or the action event of the first type of video data. The second sensory characteristic category includes, for example, a color characteristic, a status characteristic, and the like, for example, the second sensory characteristic category indicates that the target word belongs to an adjective, the adjective includes a color adjective, a status adjective, and the like, and as shown in fig. 4A, the second sensory characteristic category may include "sensory characteristics".
In some cases, there are cases where the recognition result of the first word category or the recognition result of the second word category is misrecognized or is missing, and therefore, the target word may be determined based on either one of the first word category and the second word category, that is, the determined first word category of the target word is a first preset category, or the determined second word category of the target word is a second preset category. The first preset category is, for example, a first noun category, a first scene category or a first sensory characteristic category, and the second preset category is, for example, a second noun category, a second scene category or a second sensory characteristic category.
In one approach, the target word may be determined based on a first word category, for example, when the first word category of the candidate word is a first noun category, a first scene category, or a first sensory characteristic category, the candidate word is regarded as the target word. As shown in fig. 4A, the first word category of the candidate words "potato" and "mashed potato" is the first noun category "diet category", and the candidate words "potato" and "mashed potato" can be the target words. Similarly, the first word category of the candidate word "cut" is the first scene category "scene event", and the candidate word "cut" may be the target word. Similarly, the first word category of the candidate word "gold" is the first sensory characteristic category "sensory characteristic", and the candidate word "gold" may be taken as the target word.
In another way, the target word may be determined based on the second word category, for example, when the second word category of the candidate word is the second noun category, the second scene category, or the second sensory characteristic category, the candidate word is determined as the target word. As shown in fig. 4A, the second word category of the candidate words "potato" and "mashed potato" includes the second noun category "food", "potato", and "mashed potato", and the candidate words "potato" and "mashed potato" can be targeted words. Similarly, the second word category of the candidate words "gold", "cut" includes the second scene category "life class", "scene event", and the candidate words "gold", "cut" may be the target words. Of course, the second word category of the candidate word "gold" may also include the second sensory characteristic category "sensory characteristic", and the candidate word "gold" may be taken as the target word. It will be appreciated that the second word category of one target word may include a plurality of categories, for example the second word category of the target word "gold" may include a second scene category "life category" and a second sensory characteristic category "sensory characteristics".
In another approach, M1 target words may first be determined from the plurality of candidate words based on the first word category. If the number of the M1 target words is small or does not meet the actual requirement, N1 target words may further be determined from the candidate words based on the second word category. Repeated words are then removed from the M1 target words and the N1 target words, and the remaining words are used as the final target words, where M1 and N1 are each, for example, an integer greater than 0.
In other cases, to improve the accuracy of the target word, the target word may be determined based on both the first word category and the second word category. That is, the first word category of the determined target words is a first preset category, and the second word category is a second preset category. The first preset category is, for example, a first noun category, a first scene category or a first sensory characteristic category, and the second preset category is, for example, a second noun category, a second scene category or a second sensory characteristic category.
Illustratively, when the first word category of a candidate word is the first noun category, the first scene category, or the first sensory characteristic category, and the second word category of the candidate word is the second noun category, the second scene category, or the second sensory characteristic category, the candidate word is determined as the target word. For example, M2 target words are determined from the plurality of candidate words based on the first word category, N2 target words are determined from the plurality of candidate words based on the second word category, and the target words that coincide between the M2 target words and the N2 target words are used as the final target words, where M2 and N2 are each, for example, an integer greater than 0. For example, if the M2 target words include "potato" and "mashed potato" and the N2 target words include "potato" and "gold", the coinciding word "potato" is used as the target word. The process of determining the M2 target words based on the first word category and determining the N2 target words based on the second word category is described above and is not repeated here.
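The selection strategies described so far can be sketched as set operations over per-candidate category sets. The preset category sets and the example data below loosely mirror the "potato" example but are otherwise assumptions of this sketch.

```python
# A sketch of determining target words from (word, first categories, second
# categories) triples: selection by either category alone, by their union with
# duplicates removed, or by their intersection.

FIRST_PRESET = {"diet", "scene_event", "sensory_feature"}
SECOND_PRESET = {"food", "life", "scene_event", "sensory_feature"}

def by_first(candidates):
    return {w for w, first, _ in candidates if first & FIRST_PRESET}

def by_second(candidates):
    return {w for w, _, second in candidates if second & SECOND_PRESET}

def union_targets(candidates):        # M1 plus N1 target words, duplicates removed
    return by_first(candidates) | by_second(candidates)

def intersection_targets(candidates):  # only words selected by both categories
    return by_first(candidates) & by_second(candidates)

candidates = [
    ("potato",        {"diet"},      {"food", "potato"}),
    ("mashed potato", {"diet"},      {"dish"}),
    ("gold",          {"modifier"},  {"sensory_feature"}),
    ("will",          {"auxiliary"}, set()),
]
print(union_targets(candidates))         # {'potato', 'mashed potato', 'gold'}
print(intersection_targets(candidates))  # {'potato'}
```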
In other cases, to reduce the amount of computation, the first target word may be determined from the candidate words based on the first word category. And under the condition that the number of the first target words is smaller than the preset number, determining second target words from the remaining candidate words based on the second word category, wherein the remaining candidate words are the words except the first target words in the candidate words.
For example, word segmentation processing is performed on the first text data to obtain a plurality of candidate words. And classifying each candidate word to obtain a first word category corresponding to each candidate word. Then, a first target word is determined from the plurality of candidate words based on a first word category, for example, a first noun category, a first scene category, and a first sensory characteristic category.
And if the number of the first target words is larger than or equal to the preset number, taking the first target words as final target words. If the number of the first target words is smaller than the preset number, for remaining candidate words except for the first target word in the plurality of candidate words, semantic understanding may be performed on each remaining candidate word to obtain a second word category corresponding to each remaining candidate word, and then, based on the second word category, a second target word is determined from the remaining candidate words, where the second word category of the second target word is, for example, a second noun category, a second scene category, and a second sensory characteristic category. And finally, determining the first target word and the second target word as final target words.
It is understood that the first word categories of the candidate words are determined first and the first target words are determined based on them; only when the number of the first target words is small are the second word categories of the remaining candidate words determined and the second target words determined from them. This reduces the amount of computation consumed in determining the target words.
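A sketch of this computation-saving order is given below. Here classify and understand are hypothetical callables that return the set of first word categories and the set of second word categories of a candidate word, respectively, and the preset category sets and preset number are example values only.

```python
# A sketch of the fallback strategy: determine first target words from the
# first word categories, and run semantic understanding on the remaining
# candidates only when fewer than a preset number of first target words exist.

FIRST_PRESET = {"diet", "scene_event", "sensory_feature"}
SECOND_PRESET = {"food", "life", "scene_event", "sensory_feature"}

def select_with_fallback(candidates, classify, understand, preset_number=3):
    first_targets = [w for w in candidates if classify(w) & FIRST_PRESET]
    if len(first_targets) >= preset_number:
        return first_targets
    remaining = [w for w in candidates if w not in first_targets]
    second_targets = [w for w in remaining if understand(w) & SECOND_PRESET]
    return first_targets + second_targets
```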
According to another example of the present disclosure, candidate words whose word category is the third word category may be deleted from the plurality of candidate words, and the remaining candidate words may be determined as target words. The third word category includes, for example, at least one of: a quantity word category, an assistant word category, a preposition word category, a modified word category, and an abstract category. Words of the third word category carry little semantic information and are generally unable to represent the content of the first type of video data, so the candidate words remaining after the deletion are more likely to represent the content of the first type of video data and are better suited to serve as target words.
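A minimal sketch of this deletion-based variant follows; the third word category labels are written as illustrative identifiers, not the labels used by any particular implementation.

```python
# Drop candidate words whose word category belongs to the third word category
# and keep the remaining candidate words as target words.

THIRD_CATEGORIES = {"quantity", "auxiliary", "preposition", "modifier", "abstract"}

def targets_by_deletion(candidates):
    # candidates: iterable of (word, word_category) pairs
    return [word for word, category in candidates if category not in THIRD_CATEGORIES]
```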
In the above manner, a plurality of target words including, for example, "gold", "potato", "cut", "mashed potato", and the like can be obtained. The plurality of target words may be matched with second text data of the second type of video data, and when the second text data of the second type of video data includes one or more target words, the second type of video data may be taken as the target video data.
FIG. 4B schematically shows a diagram of a candidate word list according to another embodiment of the disclosure.
As shown in fig. 4B, the candidate word list 413B includes at least a plurality of candidate words and a word category corresponding to each candidate word, and the word category includes at least one of a first word category and a second word category, for example. The candidate word list 413B also includes other contents such as a word length.
For example, word segmentation processing is carried out on the converted first text data "With the coming of spring, more and more people choose to go out for travel.", obtaining a plurality of candidate words, namely "along with", "coming of spring", "more and more", "person", "selection", "passed", "going out for travel", and so on.
Similar to the above, the target word is determined from the plurality of candidate words based on at least one of the first word category and the second word category. The determined target words include, for example, "coming of spring", "person", "selection", and "going out for travel". The "time category" and the "people category" in the first word categories may serve as first noun categories, and the "life class", "time phase", "role class", and "person" in the second word categories may serve as second noun categories.
Fig. 4C schematically shows a schematic diagram of a tree structure according to an embodiment of the present disclosure.
As shown in fig. 4C, the tree structure 450 includes, for example, P nodes corresponding to P categories, where P is an integer greater than 1 and each node corresponds to one category.
Taking the candidate word "potato" as an example, semantic understanding is performed on the candidate word to obtain a standard word "potato" corresponding to the candidate word "potato". For example, a search is performed in the knowledge base to find the standard word "potato" corresponding to the candidate word "potato". In other cases, the standard word is, for example, the same as the candidate word.
Next, a target branch structure 451 associated with the standard word is determined from the tree structure 450, e.g. the target branch structure 451 comprises the standard word "potato". The target branch structure 451 includes, for example, Q nodes corresponding to Q categories, Q being an integer equal to or less than P. The Q categories include, for example: "potatoes", "rhizomes", "vegetables" and "food". At least one of the Q categories is determined as the second word category, e.g., "food", "potato" is determined as the second word category for the candidate word "potato".
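For illustration, the sketch below encodes a toy version of the tree structure of FIG. 4C as child-to-parent links and walks the target branch structure upward. The knowledge-base normalization table and the tree contents are assumptions of this sketch, not the disclosure's actual knowledge graph.

```python
# A sketch of determining second word categories from a tree of P categories:
# normalize the candidate word to a standard word, then collect the Q
# categories on the branch from that word's node up to the root.

PARENT = {                 # child category -> parent category
    "potato": "rhizome",
    "rhizome": "vegetable",
    "vegetable": "food",
}

STANDARD_WORD = {"spud": "potato"}   # hypothetical knowledge-base normalization

def second_word_categories(candidate_word):
    standard = STANDARD_WORD.get(candidate_word, candidate_word)
    branch, node = [standard], standard
    while node in PARENT:
        node = PARENT[node]
        branch.append(node)
    return branch   # any of these categories may serve as the second word category

print(second_word_categories("potato"))  # ['potato', 'rhizome', 'vegetable', 'food']
```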
After the target video data is obtained, the target video data can be filtered based on the duration, text quality, subtitle content and the like of the video, and the filtered high-quality target video data is obtained.
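A small sketch of such filtering follows; the thresholds for duration and text length are placeholder values rather than values given by the disclosure.

```python
# Filter the obtained target video data by duration and by the amount of
# associated text (a rough proxy for text/subtitle quality).

def filter_target_videos(videos, min_seconds=10, max_seconds=600, min_text_chars=20):
    kept = []
    for video in videos:
        duration_ok = min_seconds <= video.get("duration", 0) <= max_seconds
        text_ok = len(video.get("second_text_data", "")) >= min_text_chars
        if duration_ok and text_ok:
            kept.append(video)
    return kept
```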
Video data, as an information carrier, includes multi-modal information such as picture information, text information (subtitle information, description information), and audio information. Video data plays an important role in modern life. With the development of deep learning, deep learning models play an increasingly important role in understanding video content, large-scale pre-training has become a research hotspot, and training deep learning models with video data carrying multi-modal information has become an important task. Compared with image data and text data, video data (including visual information and text information) contains richer dynamic information and can express more varied content. Therefore, how to obtain video data for training a deep learning model at low cost is an urgent problem to be solved.
In some cases, existing training data sets include a large amount of English video data but lack training samples of Chinese video.
In some cases, links to Chinese videos may be obtained through network technologies, and the video data is then cleaned by rules to obtain small-scale Chinese video data. However, this acquisition method is inefficient and requires tedious cleaning rules, so the category distribution of the acquired Chinese video data is not uniform, which affects the training precision of the deep learning model.
In order to obtain Chinese video data with uniform category distribution and high quality as training samples, embodiments of the present disclosure obtain Chinese video data (target video data) in the above manner based on English video data (first type video data) whose categories are uniform. That the categories of the English video data are uniform means, for example, that the English video data covers multiple categories, such as a life category, a sports category, and a news category, and that the data distribution over these categories is relatively even.
After the Chinese video data (target video data) is obtained, a deep learning model is trained using the Chinese video data as sample video data; a specific process is shown in fig. 5.
FIG. 5 schematically shows a flowchart of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 5, the training method 500 of the deep learning model according to the embodiment of the present disclosure may include operations S510 to S520, for example.
In operation S510, sample video data is acquired.
In operation S520, a deep learning model is trained using the sample video data.
Illustratively, the sample video data includes, for example, the above target video data.
It can be understood that, because the first type video data is of good quality and has a uniform category distribution, the category distribution of the target words obtained by processing the first type video data is uniform, and the categories of the sample video data obtained by searching the Chinese video library based on the target words are also uniform.
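For illustration, operations S510 and S520 are sketched below with PyTorch. The random tensors stand in for preprocessed features of the sample video data, and the two-layer network is a toy stand-in for a real video-language deep learning model, so every dimension and hyperparameter here is an assumption.

```python
# A toy training sketch for operations S510-S520; random tensors stand in for
# features extracted from the sample (target) video data.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# S510: acquire sample video data (placeholder visual/text feature pairs)
video_features = torch.randn(64, 128)
text_features = torch.randn(64, 128)
loader = DataLoader(TensorDataset(video_features, text_features),
                    batch_size=16, shuffle=True)

# S520: train a deep learning model using the sample video data
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    for video_batch, text_batch in loader:
        prediction = model(video_batch)
        loss = nn.functional.mse_loss(prediction, text_batch)  # toy alignment objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```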
Illustratively, the deep learning models include, for example, visual question-answer class models, visual search class models, generation class models, and the like.
For a visual question-answering model, question data and video data are input into the model, and the model acquires answer data from the video data and outputs the answer data automatically.
For visual search-type models, text data is entered into the model, which automatically matches video data associated with the text data. Alternatively, the video data is input into a model that automatically matches other video data that is similar.
For a generation-type model, text data is input into the model and video data is generated automatically. Alternatively, video data is input into the model and text data is generated automatically.
By using the method of the embodiments of the present disclosure, sample video data is mined, the scale of the sample video data is enlarged, both the quality and the quantity of the sample video data are ensured, and the fine-grained discrimination capability of the deep learning model is enhanced.
Fig. 6 schematically shows a block diagram of an apparatus for acquiring video data according to an embodiment of the present disclosure.
As shown in fig. 6, an apparatus 600 for acquiring video data of the disclosed embodiment includes, for example, a processing module 610, a determining module 620, and an acquiring module 630.
The processing module 610 may be configured to process first text data associated with a first type of video data to obtain candidate words and word categories corresponding to the candidate words. According to the embodiment of the present disclosure, the processing module 610 may perform, for example, the operation S210 described above with reference to fig. 2, which is not described herein again.
The determination module 620 may be used to determine a target word from the candidate words based on the word category. According to the embodiment of the present disclosure, the determining module 620 may perform, for example, the operation S220 described above with reference to fig. 2, which is not described herein again.
The obtaining module 630 may be configured to obtain target video data associated with the first type of video data from the second type of video data based on the target word. According to the embodiment of the present disclosure, the obtaining module 630 may, for example, perform operation S230 described above with reference to fig. 2, which is not described herein again.
According to an embodiment of the present disclosure, the processing module 610 includes: a conversion submodule and a processing submodule. The conversion submodule is used for converting the text type of the first text data from the first text type to the second text type to obtain converted first text data; and the processing submodule is used for processing the converted first text data in a sequence labeling mode to obtain candidate words and word categories corresponding to the candidate words.
According to an embodiment of the present disclosure, the obtaining module 630 includes: a first acquisition submodule and a second acquisition submodule. The first obtaining sub-module is used for obtaining second text data associated with second type video data, wherein the text type of the second text data is a second text type; and the second obtaining sub-module is used for obtaining the target video data from the second type video data based on the target words and the second text data, wherein the second text data corresponding to the target video data is matched with the target words.
According to an embodiment of the present disclosure, the word categories include a first word category; the processing module 610 includes: a first word segmentation submodule and a classification submodule. The first word segmentation submodule is used for performing word segmentation processing on the first text data to obtain candidate words; and the classification submodule is used for classifying the candidate words to obtain a first word category corresponding to the candidate words.
According to an embodiment of the present disclosure, the word categories include a second word category; the processing module 610 includes: a second word segmentation submodule and a semantic understanding submodule. The second word segmentation submodule is used for performing word segmentation processing on the first text data to obtain candidate words; and the semantic understanding submodule is used for performing semantic understanding on the candidate words to obtain a second word category corresponding to the candidate words.
According to an embodiment of the present disclosure, the semantic understanding submodule includes: a semantic understanding unit, a first determining unit, and a second determining unit. The semantic understanding unit is used for performing semantic understanding on the candidate words to obtain standard words corresponding to the candidate words; the first determining unit is used for determining a target branch structure associated with the standard words from a tree structure, wherein the tree structure includes P nodes corresponding to P categories, the target branch structure includes Q nodes corresponding to Q categories, P is an integer greater than 1, and Q is an integer less than or equal to P; and the second determining unit is used for determining at least one of the Q categories as the second word category.
According to an embodiment of the present disclosure, the determining module 620 includes: a first determination submodule, a second determination submodule, and a third determination submodule. A first determining sub-module, configured to determine a first target word from candidate words based on the first word category, where the candidate words include the first target word and remaining candidate words; a second determination submodule configured to determine, in response to determining that the number of the first target words is less than the preset number, a second target word from the remaining candidate words based on a second word category of the remaining candidate words; and the third determining submodule is used for determining the first target word and the second target word as the target words.
According to an embodiment of the present disclosure, the word category corresponding to the target word includes at least one of: noun category, scene category, sensory feature category.
According to an embodiment of the disclosure, the determining module 620 is further configured to: delete, from the candidate words, the candidate words whose word category is a third word category, and determine the remaining candidate words as the target words, wherein the third word category includes at least one of the following: a quantifier category, an auxiliary word category, a preposition category and a modifier category.
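A minimal sketch of this filtering step is shown below; the concrete category labels are assumptions used only to make the example runnable.

```python
# Illustrative removal of candidates belonging to the third word category.
THIRD_WORD_CATEGORIES = {"quantifier", "auxiliary word", "preposition", "modifier"}

def filter_candidates(candidates):
    # `candidates` is a list of (word, category) pairs; whatever survives the
    # filter is kept as a target word.
    return [w for w, cat in candidates if cat not in THIRD_WORD_CATEGORIES]

print(filter_candidates([
    ("three", "quantifier"), ("of", "preposition"), ("kitten", "noun"),
]))  # ['kitten']
```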
According to an embodiment of the present disclosure, the first text data is obtained by at least one of the following: determining title data of the first type of video data as the first text data; determining description information of the first type of video data as the first text data; identifying subtitle information of the first type of video data to obtain the first text data; and identifying voice information of the first type of video data to obtain the first text data.
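An illustrative way to assemble the first text data from the listed sources is sketched below; ocr_subtitles and transcribe_audio are placeholders for whatever subtitle recognition and speech recognition components are actually used, and the dictionary fields are assumed for the example.

```python
# Illustrative assembly of first text data from title, description,
# subtitle recognition, and speech recognition (placeholders).
def ocr_subtitles(video) -> str:
    return video.get("subtitle_text", "")   # assumed pre-extracted for the sketch

def transcribe_audio(video) -> str:
    return video.get("speech_text", "")     # assumed pre-extracted for the sketch

def build_first_text_data(video) -> str:
    parts = [
        video.get("title", ""),
        video.get("description", ""),
        ocr_subtitles(video),
        transcribe_audio(video),
    ]
    return " ".join(p for p in parts if p)

video = {"title": "Kitten plays", "description": "A kitten chases yarn"}
print(build_first_text_data(video))  # "Kitten plays A kitten chases yarn"
```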
According to an embodiment of the present disclosure, the first acquisition submodule includes at least one of: a first determining unit, a second determining unit, a first identifying unit and a second identifying unit. The first determining unit is configured to determine title data of the second type of video data as the second text data; the second determining unit is configured to determine description information of the second type of video data as the second text data; the first identifying unit is configured to identify subtitle information of the second type of video data to obtain the second text data; and the second identifying unit is configured to identify voice information of the second type of video data to obtain the second text data.
FIG. 7 schematically shows a block diagram of a training apparatus for deep learning models according to an embodiment of the present disclosure.
As shown in FIG. 7, the training apparatus 700 for deep learning models according to the embodiment of the present disclosure includes, for example, an acquisition module 710 and a training module 720.
The acquisition module 710 may be used to acquire sample video data. According to the embodiment of the present disclosure, the acquisition module 710 may, for example, perform operation S510 described above with reference to FIG. 5, which will not be repeated here.
The training module 720 may be used to train a deep learning model using the sample video data. According to an embodiment of the present disclosure, the training module 720 may, for example, perform operation S520 described above with reference to FIG. 5, which will not be repeated here.
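By way of a non-limiting sketch of operations S510 and S520, the snippet below runs a toy PyTorch training loop over acquired sample video data; the linear model, the random feature extractor, and the labels are placeholders, since the disclosure does not fix a particular deep learning architecture.

```python
# Toy training loop over sample video data (model and features are placeholders).
import torch
from torch import nn

def extract_features(sample):
    # Stand-in for a real video encoder producing a feature vector.
    return torch.randn(1, 16)

samples = [{"video": "v1", "label": 0}, {"video": "v2", "label": 1}]  # assumed samples
model = nn.Linear(16, 2)                       # placeholder deep learning model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for sample in samples:
        features = extract_features(sample)
        target = torch.tensor([sample["label"]])
        loss = loss_fn(model(features), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```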
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, application and other processing of the personal information involved are all in compliance with the relevant laws and regulations, necessary confidentiality measures are taken, and public order and good customs are not violated.
In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is acquired or collected.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
According to an embodiment of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform any one or more of the above-described method of acquiring video data and training method of the deep learning model.
According to an embodiment of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement any one or more of the above-described method of acquiring video data and training method of the deep learning model.
FIG. 8 is a block diagram of an electronic device for performing the method of acquiring video data and/or the training method of the deep learning model, for implementing embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device 800 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806 such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs any one or more of the methods and processes described above, such as the method of acquiring video data and the training method of the deep learning model. For example, in some embodiments, any one or more of the method of acquiring video data and the training method of the deep learning model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of any one or more of the above-described method of acquiring video data and training method of the deep learning model may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured by any other suitable means (e.g., by means of firmware) to perform any one or more of the method of acquiring video data and the training method of the deep learning model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, a special purpose computer, or another programmable data processing apparatus, such as the above apparatus for acquiring video data or the training apparatus for deep learning models, such that the program codes, when executed by the processor or controller, cause the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. A method of acquiring video data, comprising:
processing first text data associated with first type video data to obtain candidate words and word categories corresponding to the candidate words;
determining a target word from the candidate words based on the word category; and
acquiring target video data associated with the first type of video data from second type of video data based on the target words.
2. The method of claim 1, wherein the processing first text data associated with a first type of video data to obtain a candidate word and a word category corresponding to the candidate word comprises:
converting the text type of the first text data from a first text type to a second text type to obtain converted first text data; and
processing the converted first text data by means of sequence labeling to obtain the candidate words and the word categories corresponding to the candidate words.
3. The method of claim 2, wherein the obtaining target video data associated with the first type of video data from second type of video data based on the target word comprises:
acquiring second text data associated with the second type of video data, wherein the text type of the second text data is the second text type; and
acquiring the target video data from the second type of video data based on the target words and the second text data, wherein the second text data corresponding to the target video data is matched with the target words.
4. The method of claim 1, wherein the word categories include a first word category; the processing first text data associated with a first type of video data to obtain candidate words and word categories corresponding to the candidate words comprises:
performing word segmentation processing on the first text data to obtain the candidate words; and
classifying the candidate words to obtain the first word category corresponding to the candidate words.
5. The method of claim 1 or 4, wherein the word category comprises a second word category; the processing first text data associated with a first type of video data to obtain candidate words and word categories corresponding to the candidate words comprises:
performing word segmentation processing on the first text data to obtain the candidate words; and
performing semantic understanding on the candidate words to obtain the second word category corresponding to the candidate words.
6. The method of claim 5, wherein the semantically understanding the candidate word, resulting in the second word category corresponding to the candidate word comprises:
performing semantic understanding on the candidate words to obtain standard words corresponding to the candidate words;
determining a target branch structure associated with the standard words from a tree structure, wherein the tree structure comprises P nodes, the P nodes correspond to P categories, the target branch structure comprises Q nodes, the Q nodes correspond to Q categories, P is an integer greater than 1, and Q is an integer less than or equal to P; and
determining at least one of the Q categories as the second word category.
7. The method of claim 5, wherein the determining, based on the word category, the target word from the candidate words comprises:
determining a first target word from the candidate words based on the first word category, wherein the candidate words include the first target word and remaining candidate words;
in response to determining that the number of the first target words is less than a preset number, determining, based on the second word category of the remaining candidate words, a second target word from the remaining candidate words; and
determining the first target word and the second target word as the target words.
8. The method of any of claims 1-3, wherein the word category corresponding to the target word comprises at least one of: noun category, scene category, sensory feature category.
9. The method of any of claims 1-8, wherein the determining, based on the word category, the target word from the candidate words comprises:
deleting, from the candidate words, the candidate word whose word category is a third word category, and determining the remaining candidate words as the target word,
wherein the third word category includes at least one of: a quantifier category, an auxiliary word category, a preposition category, and a modifier category.
10. A method of training of a deep learning model, comprising:
acquiring sample video data; and
training a deep learning model by using the sample video data,
wherein the sample video data is obtained according to the method of any one of claims 1-9.
11. An apparatus for acquiring video data, comprising:
the processing module is used for processing first text data associated with the first type video data to obtain candidate words and word categories corresponding to the candidate words;
a determining module for determining a target word from the candidate words based on the word category; and
the acquisition module is used for acquiring target video data associated with the first type of video data from second type of video data based on the target words.
12. The apparatus of claim 11, wherein the processing module comprises:
the conversion submodule is used for converting the text type of the first text data from a first text type to a second text type to obtain converted first text data; and
the processing submodule is used for processing the converted first text data by means of sequence labeling to obtain the candidate words and the word categories corresponding to the candidate words.
13. The apparatus of claim 12, wherein the acquisition module comprises:
the first obtaining sub-module is used for obtaining second text data associated with the second type of video data, wherein the text type of the second text data is the second text type; and
the second obtaining sub-module is used for obtaining the target video data from the second type of video data based on the target words and the second text data, wherein the second text data corresponding to the target video data is matched with the target words.
14. The apparatus of claim 11, wherein the word categories include a first word category; the processing module comprises:
the first word segmentation sub-module is used for carrying out word segmentation processing on the first text data to obtain the candidate words; and
the classification submodule is used for classifying the candidate words to obtain the first word category corresponding to the candidate words.
15. The apparatus of claim 11 or 14, wherein the word category comprises a second word category; the processing module comprises:
the second word segmentation submodule is used for carrying out word segmentation processing on the first text data to obtain the candidate words; and
the semantic understanding submodule is used for performing semantic understanding on the candidate words to obtain the second word category corresponding to the candidate words.
16. The apparatus of claim 15, wherein the semantic understanding sub-module comprises:
the semantic understanding unit is used for carrying out semantic understanding on the candidate words to obtain standard words corresponding to the candidate words;
a first determining unit, configured to determine a target branch structure associated with the standard word from a tree structure, where the tree structure includes P nodes, where the P nodes correspond to P categories, the target branch structure includes Q nodes, where the Q nodes correspond to Q categories, P is an integer greater than 1, and Q is an integer less than or equal to P; and
a second determining unit, configured to determine at least one of the Q categories as the second word category.
17. The apparatus of claim 15, wherein the determining module comprises:
a first determining submodule, configured to determine a first target word from the candidate words based on the first word category, wherein the candidate words include the first target word and remaining candidate words;
a second determining submodule, configured to determine, in response to determining that the number of the first target words is less than a preset number, a second target word from the remaining candidate words based on the second word category of the remaining candidate words; and
a third determining submodule, configured to determine the first target word and the second target word as the target word.
18. The apparatus of any of claims 11-13, wherein the word category corresponding to the target word comprises at least one of: noun category, scene category, sensory feature category.
19. The apparatus of any of claims 11-18, wherein the determining module is further configured to:
deleting, from the candidate words, the candidate words whose word category is a third word category, and determining the remaining candidate words as the target words,
wherein the third word category includes at least one of: a quantifier category, an auxiliary word category, a preposition category, and a modifier category.
20. An apparatus of training of a deep learning model, comprising:
the acquisition module is used for acquiring sample video data; and
a training module for training a deep learning model by using the sample video data,
wherein the sample video data is obtained by the apparatus of any one of claims 11-19.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program/instructions stored on at least one of a readable storage medium and an electronic device, wherein the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-10.
CN202210796905.9A 2022-07-05 2022-07-05 Method for acquiring video data and training method and device of deep learning model Pending CN115098730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210796905.9A CN115098730A (en) 2022-07-05 2022-07-05 Method for acquiring video data and training method and device of deep learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210796905.9A CN115098730A (en) 2022-07-05 2022-07-05 Method for acquiring video data and training method and device of deep learning model

Publications (1)

Publication Number Publication Date
CN115098730A true CN115098730A (en) 2022-09-23

Family

ID=83297137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210796905.9A Pending CN115098730A (en) 2022-07-05 2022-07-05 Method for acquiring video data and training method and device of deep learning model

Country Status (1)

Country Link
CN (1) CN115098730A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination