WO2022134634A1 - Video processing method and electronic device - Google Patents

Video processing method and electronic device

Info

Publication number
WO2022134634A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
text information
video segment
network
segment
Prior art date
Application number
PCT/CN2021/114059
Other languages
English (en)
French (fr)
Inventor
高艳珺
陈昕
王华彦
Original Assignee
北京达佳互联信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Priority to EP21887878.3A (published as EP4047944A4)
Priority to US17/842,654 (published as US11651591B2)
Publication of WO2022134634A1

Classifications

    • H04N 21/8456: Structuring of content, e.g. decomposing content into time segments, by decomposing the content in the time domain
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 20/00: Machine learning
    • G06F 16/735: Querying of video data; filtering based on additional data, e.g. user or group profiles
    • G06F 16/7844: Retrieval of video data characterised by using metadata automatically derived from the content, e.g. original textual content or text extracted from visual content or transcript of audio data
    • G06V 20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • H04N 21/2353: Processing of additional data specifically adapted to content descriptors, e.g. coding, compressing or processing of metadata
    • H04N 21/26603: Channel or content management for automatically generating descriptors from content, e.g. when it is not made available by its provider, using content analysis techniques
    • H04N 21/84: Generation or processing of descriptive data, e.g. content descriptors

Definitions

  • the present disclosure relates to the technical field of machine learning, and in particular, to a video processing method and an electronic device.
  • Video timing annotation is an important process in tasks such as video processing and pattern recognition.
  • Video timing annotation refers to analyzing a video file, predicting the start time and end time that match given text information, and marking, according to the start time and end time, the video segment in the video file that matches the text information.
  • A video processing method is provided, comprising: obtaining a video file and first text information; inputting the video file and the first text information into a recognition network of a video processing model to obtain a first video segment matching the first text information; inputting the first video segment into a feature extraction network of the video processing model to obtain video features of the first video segment; inputting the video features of the first video segment into a translation network of the video processing model to obtain second text information of the first video segment, where the second text information is used to describe the video content of the first video segment; and outputting the first video segment and the second text information based on the video processing model.
  • A method for training a video processing model is provided, comprising: inputting a video sample into a recognition network of the video processing model to obtain a second video segment matching third text information, the third text information being marked in the video sample; determining a recognition loss parameter of the recognition network based on the second video segment and a third video segment marked in the video sample; determining, based on the second video segment and the third text information, a first similarity and video features of the second video segment, the first similarity indicating the similarity between the second video segment and the third text information; determining, based on the video features of the second video segment and the third text information, a translation quality parameter of a translation network of the video processing model, the translation quality parameter representing the quality with which the translation network translates video features into text information; and adjusting parameters of the video processing model based on the recognition loss parameter, the first similarity and the translation quality parameter.
  • A video processing apparatus is provided, comprising: an obtaining unit configured to obtain a video file and first text information; a timing labeling unit configured to input the video file and the first text information into a recognition network of a video processing model to obtain a first video segment matching the first text information; a feature extraction unit configured to input the first video segment into a feature extraction network of the video processing model to obtain video features of the first video segment; a visual text translation unit configured to input the video features of the first video segment into a translation network of the video processing model to obtain second text information of the first video segment, where the second text information is used to describe the video content of the first video segment; and an output unit configured to output the first video segment and the second text information based on the video processing model.
  • An apparatus for training a video processing model is provided, comprising: a time sequence labeling unit configured to input a video sample into a recognition network of the video processing model to obtain a second video segment matching third text information, the video sample being marked with the third text information; a second determining unit configured to determine a recognition loss parameter of the recognition network based on the second video segment and a third video segment marked in the video sample; a third determining unit configured to determine, based on the second video segment and the third text information, a first similarity and video features of the second video segment, the first similarity indicating the similarity between the second video segment and the third text information; a fourth determining unit configured to determine, based on the video features of the second video segment and the third text information, a translation quality parameter of a translation network of the video processing model, the translation quality parameter representing the quality with which the translation network translates video features into text information; and a parameter adjustment unit configured to adjust parameters of the video processing model based on the recognition loss parameter, the first similarity and the translation quality parameter.
  • An electronic device is provided, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps: obtaining a video file and first text information; inputting the video file and the first text information into a recognition network of a video processing model to obtain a first video segment matching the first text information; inputting the first video segment into a feature extraction network of the video processing model to obtain video features of the first video segment; inputting the video features of the first video segment into a translation network of the video processing model to obtain second text information of the first video segment, where the second text information is used to describe the video content of the first video segment; and outputting the first video segment and the second text information based on the video processing model.
  • An electronic device is provided, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps: inputting a video sample into a recognition network of a video processing model to obtain a second video segment matching third text information, the third text information being marked in the video sample; determining a recognition loss parameter of the recognition network based on the second video segment and a third video segment marked in the video sample; determining, based on the second video segment and the third text information, a first similarity and video features of the second video segment, the first similarity indicating the similarity between the second video segment and the third text information; determining, based on the video features of the second video segment and the third text information, a translation quality parameter of a translation network of the video processing model, the translation quality parameter representing the quality with which the translation network translates video features into text information; and adjusting parameters of the video processing model based on the recognition loss parameter, the first similarity and the translation quality parameter.
  • A computer-readable storage medium is provided. When instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the following steps: acquiring a video file and first text information; inputting the video file and the first text information into a recognition network of a video processing model to obtain a first video segment matching the first text information; inputting the first video segment into a feature extraction network of the video processing model to obtain video features of the first video segment; inputting the video features of the first video segment into a translation network of the video processing model to obtain second text information of the first video segment, where the second text information is used to describe the video content of the first video segment; and outputting the first video segment and the second text information based on the video processing model.
  • A computer-readable storage medium is provided. When instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the following steps: inputting a video sample into a recognition network of a video processing model to obtain a second video segment matching third text information, the third text information being marked in the video sample; determining a recognition loss parameter of the recognition network based on the second video segment and a third video segment marked in the video sample; determining, based on the second video segment and the third text information, a first similarity and video features of the second video segment; determining, based on the video features of the second video segment and the third text information, a translation quality parameter of a translation network of the video processing model, the translation quality parameter representing the quality with which the translation network translates video features into text information; and adjusting parameters of the video processing model based on the recognition loss parameter, the first similarity and the translation quality parameter.
  • A computer program product is provided, comprising computer instructions which, when executed by a processor, implement the following steps: acquiring a video file and first text information; inputting the video file and the first text information into a recognition network of a video processing model to obtain a first video segment matching the first text information; inputting the first video segment into a feature extraction network of the video processing model to obtain video features of the first video segment; inputting the video features of the first video segment into a translation network of the video processing model to obtain second text information of the first video segment, where the second text information is used to describe the video content of the first video segment; and outputting the first video segment and the second text information based on the video processing model.
  • A computer program product is provided, comprising computer instructions which, when executed by a processor, implement the following steps: inputting a video sample into a recognition network of a video processing model to obtain a second video segment matching third text information, the third text information being marked in the video sample; determining a recognition loss parameter of the recognition network based on the second video segment and a third video segment marked in the video sample; determining, based on the second video segment and the third text information, a first similarity and video features of the second video segment, the first similarity indicating the similarity between the second video segment and the third text information; determining, based on the video features of the second video segment and the third text information, a translation quality parameter of a translation network of the video processing model, the translation quality parameter representing the quality with which the translation network translates video features into text information; and adjusting parameters of the video processing model based on the recognition loss parameter, the first similarity and the translation quality parameter.
  • the video processing model provided by the embodiment of the present disclosure includes a recognition network, a feature extraction network, and a translation network.
  • the first video segment in the video file that matches the first text information can be identified based on the recognition network.
  • The second text information of the first video segment is obtained by translation based on the translation network. Therefore, the video processing model can output the first video segment and the second text information; that is, multiple output results of the video file are obtained based on one video processing model, which improves the diversity of video annotation results.
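  • To make the three-network architecture concrete, the following is a minimal sketch, in PyTorch-style Python, of how the recognition, feature extraction and translation networks could be composed at inference time. The class and attribute names (VideoProcessingModel, recognition_net, feature_extraction_net, translation_net) and the tensor layout are illustrative assumptions, not the implementation recited in this disclosure.

```python
# Hedged sketch of the disclosed three-network pipeline (names and shapes are assumptions).
import torch
import torch.nn as nn

class VideoProcessingModel(nn.Module):
    def __init__(self, recognition_net: nn.Module,
                 feature_extraction_net: nn.Module,
                 translation_net: nn.Module):
        super().__init__()
        self.recognition_net = recognition_net                  # locates the segment matching the query text
        self.feature_extraction_net = feature_extraction_net    # encodes a video segment into features
        self.translation_net = translation_net                  # decodes video features into text

    def forward(self, video_frames: torch.Tensor, first_text_tokens: torch.Tensor):
        # video_frames: (batch, num_frames, ...); first_text_tokens: tokenized first text information.
        # 1) Recognition network predicts frame indices of the first video segment.
        start_idx, end_idx = self.recognition_net(video_frames, first_text_tokens)
        first_video_segment = video_frames[:, start_idx:end_idx]

        # 2) Feature extraction network produces the video features of that segment.
        segment_features = self.feature_extraction_net(first_video_segment)

        # 3) Translation network generates the second text information describing the segment.
        second_text_tokens = self.translation_net(segment_features)

        # The model outputs both the located segment and its textual description.
        return first_video_segment, second_text_tokens
```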
  • FIG. 1 is a flowchart of a video processing method provided according to an exemplary embodiment
  • FIG. 2 is a flowchart of a training method for a video processing model provided according to an exemplary embodiment
  • FIG. 3 is a flowchart of a video processing method provided according to an exemplary embodiment
  • FIG. 4 is a flowchart of a video processing method provided according to an exemplary embodiment
  • FIG. 5 is a flowchart of a video processing method provided according to an exemplary embodiment
  • FIG. 6 is a block diagram of a video processing apparatus provided according to an exemplary embodiment
  • FIG. 7 is a schematic structural diagram of a terminal provided according to an exemplary embodiment
  • FIG. 8 is a schematic structural diagram of a server provided according to an exemplary embodiment.
  • the original video file is edited through video timing annotation to obtain a video segment matching the text information to be queried.
  • The electronic device receives the text information to be queried input by the user, identifies the video content in the video file according to the text information to be queried, and edits the identified video segment to obtain the edited video segment.
  • the original video file is identified through video timing annotation, and a video file matching the text information to be queried is obtained.
  • When performing a video search, the electronic device receives the text information to be queried input by the user, searches a plurality of video files according to the text information to be queried, obtains a video file containing a video segment matching the text information to be queried, and feeds back this video file.
  • When performing video timing annotation, the timing annotation model only has the function of timing annotation. Therefore, only a single video annotation result, that is, a video segment, can be obtained when video time series annotation is performed through the time series annotation model.
  • The recognition network, the feature extraction network and the translation network are combined in the video processing model. The recognition network can determine the first video segment in the video file that matches the first text information to be queried, the feature extraction network performs feature extraction on the first video segment, and the translation network performs visual text translation on the extracted video features to obtain the second text information of the first video segment. In this way, in the process of labeling the video file, both the labeled first video segment and the second text information corresponding to the first video segment can be obtained, so that through one video processing model, various output results of the video file can be obtained, and the diversity of the video annotation results is improved.
  • The recognition network, feature extraction network and translation network in the video processing model are jointly trained, which enriches the training parameters used to train the video processing model and further improves the accuracy of the video timing labeling of the video processing model.
  • FIG. 1 is a flowchart of a video processing method according to an exemplary embodiment. As shown in FIG. 1, the method is performed by an electronic device and includes the following steps.
  • step 101 a video file and first text information are acquired.
  • step 102 the video file and the first text information are input into the recognition network of the video processing model to obtain the first video segment matched with the first text information.
  • step 103 the first video segment is input to the feature extraction network of the video processing model to obtain video features of the first video segment.
  • step 104 the video features of the first video segment are input into the translation network of the video processing model to obtain second text information of the first video segment, where the second text information is used to describe the video content of the first video segment.
  • step 105 the first video segment and the second text information are output based on the video processing model.
  • Inputting the video file and the first text information into the recognition network of the video processing model to obtain the first video segment matching the first text information includes: calling the recognition network to extract the video features of the video file and the text features of the first text information respectively; determining, from the video features of the video file, a target video feature matching the text features; and determining the video segment corresponding to the target video feature as the first video segment.
  • The training method of the video processing model includes: inputting a video sample into the recognition network of the video processing model to obtain a second video segment matching third text information, the third text information being marked in the video sample; determining a recognition loss parameter of the recognition network based on the second video segment and a third video segment marked in the video sample; determining a first similarity and video features of the second video segment based on the second video segment and the third text information; determining a translation quality parameter of the translation network of the video processing model based on the video features of the second video segment and the third text information, the translation quality parameter representing the quality with which the translation network translates video features into text information; and adjusting the parameters of the video processing model based on the recognition loss parameter, the first similarity and the translation quality parameter.
  • Determining the recognition loss parameter of the recognition network based on the second video segment and the third video segment marked in the video sample includes: determining the start time and end time of the second video segment in the video sample and the start time and end time of the third video segment in the video sample; and determining the recognition loss parameter based on the recognition loss function, the start time and end time of the second video segment in the video sample, and the start time and end time of the third video segment in the video sample.
  • Determining the first similarity parameter and the video features of the second video segment based on the second video segment and the third text information includes:
  • the second video segment and the third text information are input into the feature extraction network of the video processing model, and the video feature of the second video segment and the text feature of the third text information are obtained;
  • the cosine similarity between the video feature of the second video segment and the text feature of the third text information is determined to obtain the first similarity.
  • Determining the translation quality parameter of the translation network of the video processing model based on the video features of the second video segment and the third text information includes: inputting the video features of the second video segment into the translation network to obtain fourth text information of the second video segment; determining a second similarity between the fourth text information and the third text information; and determining the second similarity as the translation quality parameter.
  • Adjusting the parameters of the video processing model based on the recognition loss parameter, the first similarity and the translation quality parameter includes: adjusting the network parameters of the recognition network, the feature extraction network and the translation network respectively, based on the recognition loss parameter, the first similarity and the translation quality parameter, until the recognition loss parameter is less than a first threshold, the first similarity is greater than a second threshold, and the translation quality parameter is greater than a third threshold, at which point the model training is completed.
  • the embodiment of the present disclosure provides a new video processing model
  • the video processing model includes a recognition network, a feature extraction network, and a translation network.
  • Based on the recognition network, the first video segment in the video file that matches the first text information can be identified, and the second text information of the first video segment can be obtained by translation based on the translation network. Therefore, the video processing model can output the first video segment and the second text information; that is, various output results of the video file are obtained based on one video processing model, which improves the diversity of video annotation results.
  • FIG. 2 is a flowchart of a training method for a video processing model provided according to an exemplary embodiment.
  • the model training of the video processing model to be trained is taken as an example for description.
  • The method is performed by an electronic device and includes the following steps.
  • step 201 a time series annotation model to be trained is determined.
  • the time series labeling model to be trained includes a time series labeling network to be trained, a feature extraction network to be trained and a visual text translation network to be trained.
  • the time series annotation model may be referred to as a video processing model
  • the time series annotation network may be referred to as a recognition network
  • the visual text translation network may be referred to as a translation network.
  • the structure of the video processing model is determined. For example, determine the network structure of the recognition network, the network structure of the feature extraction network and the network structure of the translation network, and the connection structure between the recognition network, the feature extraction network and the translation network.
  • the video processing model is a pipelined model training architecture, that is, the recognition network, the feature extraction network, and the translation network are constructed into a pipelined model training architecture.
  • the output of the recognition network is used as the input of the feature extraction network
  • the output of the feature extraction network is used as the input of the translation network. Therefore, after the recognition network obtains the output result, the output result can be directly input into the feature extraction network, and after the feature extraction network obtains the output result, the output result can be directly input into the translation network.
  • The recognition network, feature extraction network and translation network in the video processing model are constructed as a pipelined model training architecture, so that the output of the previous network can be directly used as the input of the following network. In this way, the recognition network, the feature extraction network and the translation network can be trained synchronously, which simplifies the model training process and improves the accuracy of model training.
  • The recognition network, the feature extraction network and the translation network are each a network of any structure designed by the developer. The structures of the recognition network, the feature extraction network and the translation network are not specifically limited in the embodiments of the present disclosure.
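  • As a hedged illustration of the pipelined architecture described above, in which the output of each network feeds the next, the sketch below shows a single training-time forward pass that yields every intermediate quantity the later training steps use. The encode_text helper and the time-to-frame rounding are assumptions made for illustration only.

```python
# Hypothetical single pass through the pipelined architecture during training.
def pipelined_forward(model, video_sample, third_text_tokens, fps: float = 1.0):
    # Recognition network: predicted start/end times of the second video segment
    # matching the third text information (kept as tensors so they can enter the loss).
    start, end = model.recognition_net(video_sample, third_text_tokens)

    # Crop the predicted segment; rounding to frame indices is a simplification
    # (a real implementation might use a differentiable soft mask instead).
    s, e = int(start.item() * fps), int(end.item() * fps)
    second_video_segment = video_sample[:, s:e]

    # Feature extraction network: video features of the segment and text features
    # of the annotated text (encode_text is a hypothetical helper).
    video_features = model.feature_extraction_net(second_video_segment)
    text_features = model.feature_extraction_net.encode_text(third_text_tokens)

    # Translation network: fourth text information generated from the video features.
    fourth_text_tokens = model.translation_net(video_features)

    return (start, end), video_features, text_features, fourth_text_tokens
```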
  • step 202 the video samples are input into the time series labeling network to be trained, and the video segments labeled by the time series labeling network to be trained are obtained.
  • the video segment labeled by the time series labeling network to be trained is the second video segment. That is, the video samples are input to the recognition network to obtain the second video segment matched with the third text information.
  • The video sample is a video sample marked with a video segment, and the video sample is also marked with text information matching the video segment. That is, the video segment marked in the video sample is the third video segment, the text information marked in the video sample is the third text information, and the third video segment matches the third text information. The third video segment is a sample video segment used by the user for training the video processing model, and the third text information is sample text information for training the video processing model.
  • the start time and the end time of the third video segment are marked in the video sample, and the video segment between the start time and the end time is the third video segment.
  • the third text information is a word, a keyword, a description text, an image, a video file, and the like. In this embodiment of the present disclosure, the third text information is not limited.
  • the video sample is input into the recognition network, and the video sample is marked based on the recognition network to obtain a second video segment predicted by the recognition network, and the second video segment matches the third text information.
  • the second video segment is a video segment predicted by the recognition network that matches the third text information.
  • the video features of the video samples are extracted, and the extracted video features are compared with the text features of the third text information, so as to obtain the predicted second video segment.
  • the process includes the following steps (1)-(3).
  • the video feature of the video sample and the text feature of the third text information are features of any type.
  • the video feature of the video sample and the text feature of the third text information are both vector features or matrix features.
  • the text features of the third text information are compared with the video features of the video samples one by one, and the video features matching the text features of the third text information are obtained.
  • Matching between a text feature and a video feature means that the text feature and the video feature are the same or similar.
  • the similarity between the text feature of the third text information and the video feature of the video sample is determined respectively, and the video feature with the highest similarity is determined as the video feature matching the text feature of the third text information.
  • the similarity between the text feature of the third text information and the video feature of the video sample is any type of similarity.
  • the similarity is cosine similarity or the like.
  • the video sample is divided into a plurality of video segments, each video segment has a corresponding video feature, the similarity between the text feature of the third text information and the video feature of each video segment is determined, and the The video feature with the highest similarity is determined as the video feature matching the text feature of the third text information.
  • the video segment corresponding to the video feature matched by the text feature of the third text information is determined as the second video segment matched by the third text information.
  • the start time and end time in the video sample of the video feature matched by the text feature of the third text information are determined, and the video content between the start time and the end time is determined as the second video segment.
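  • A rough sketch of steps (1)-(3): assuming per-segment video features and a text feature have already been extracted, the candidate segment whose video feature is most similar to the text feature is selected. The fixed candidate segmentation and the helper signature are assumptions for illustration, not the exact procedure recited in this disclosure.

```python
import torch
import torch.nn.functional as F

def select_matching_segment(segment_features: torch.Tensor,   # (num_segments, dim), one feature per candidate segment
                            text_feature: torch.Tensor,        # (dim,), text feature of the third text information
                            segment_bounds):                    # list of (start_time, end_time) per candidate segment
    # (2) Cosine similarity between the text feature and every candidate segment's video feature.
    sims = F.cosine_similarity(segment_features, text_feature.unsqueeze(0), dim=1)
    best = int(torch.argmax(sims))
    # (3) The segment with the highest similarity is taken as the segment matching the text information,
    # identified by its start and end time in the video sample.
    start_time, end_time = segment_bounds[best]
    return start_time, end_time, float(sims[best])
```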
  • Feature extraction is performed on the video samples and the text information, so that in the process of training the recognition network, the feature extraction network and the recognition network constrain each other. In this way, the two networks are trained in the same training process, which improves the training efficiency of the model and improves the adaptation between the recognition network and the feature extraction network, thereby improving the accuracy of the video processing model.
  • step 203 based on the video segment labeled by the time sequence labeling network to be trained and the video segment labeled in the video sample, a time sequence labeling loss parameter of the time sequence labeling network to be trained is determined. That is, the recognition loss parameter of the recognition network is determined based on the second video segment and the third video segment marked in the video samples.
  • the recognition loss parameter is the recognition loss parameter generated when the video processing model performs time series annotation on the video samples.
  • the recognition loss parameter is generated based on a time series annotation loss function, which may be called a recognition loss function.
  • the video features of the second video segment and the third video segment are respectively determined, and the video features of the second video segment and the video features of the third video segment are input into a recognition loss function, and the recognition loss function is based on the two The video features of the video segment determine this recognition loss parameter.
  • The start time and end time of the second video segment are determined, and the start time and end time of the third video segment are determined; based on the start times and end times of the two video segments, the recognition loss parameter is determined using the recognition loss function.
  • the process includes the following steps (4)-(6).
  • the second video segment marked by the identification network is determined, and the corresponding start time and end time of the second video segment in the video sample are determined.
  • The start time and the end time of the second video segment are recorded; in this step, the start time and end time of the second video segment are directly retrieved.
  • The third video segment marked in the video sample is determined based on the start time and the end time marked in the video sample; that is, the start time and the end time of the third video segment are marked in the video sample, so in this step the start time and end time marked in the video sample are directly obtained.
  • the sequence of acquiring the start time and the end time of the two video segments is not specifically limited.
  • the recognition loss parameter is determined based on the recognition loss function, the start time and end time of the second video segment in the video sample, and the start time and end time of the third video segment in the video sample.
  • The start time and end time of the two video segments are used as variable values of the recognition loss function; the recognition loss parameter is determined based on the difference between the start time of the second video segment and the start time of the third video segment, and the difference between the end time of the second video segment and the end time of the third video segment.
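  • The disclosure does not fix a specific recognition loss function; one plausible instantiation consistent with the description above (penalizing the start-time and end-time differences between the predicted and annotated segments) is an L1-style boundary loss, sketched here with illustrative names.

```python
import torch

def recognition_loss(pred_start: torch.Tensor, pred_end: torch.Tensor,
                     gt_start: torch.Tensor, gt_end: torch.Tensor) -> torch.Tensor:
    # Difference between the start time of the second (predicted) segment and the start time of
    # the third (annotated) segment, plus the corresponding end-time difference.
    # Any boundary regression loss (L1, smooth L1, IoU-based) could serve the same role.
    return (pred_start - gt_start).abs() + (pred_end - gt_end).abs()
```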
  • step 204 based on the video segment marked by the time series labeling network to be trained and the text information marked in the video sample, a first similarity parameter between the video segment marked by the time series labeling network to be trained and the text information marked in the video sample, and the video features of the video segment marked by the time series labeling network to be trained, are determined.
  • the first similarity parameter may be referred to as the first similarity. Based on the second video segment and the third text information, the first similarity and video features of the second video segment are determined.
  • the first similarity is the similarity between the text feature of the third text information and the video feature of the second video segment, that is, the first similarity indicates the similarity between the second video segment and the third text information.
  • the first similarity is determined according to any similarity determination method.
  • the video features of the second video segment and the text features of the third text information are both feature vectors, and the first similarity is a similarity determined based on a cosine similarity algorithm.
  • the process includes the following steps (7)-(8).
  • the video features of the second video segment and the text features of the third text information are extracted respectively.
  • The order of the process of extracting the video features of the second video segment and the process of extracting the text features of the third text information is not limited.
  • the cosine similarity between the video feature and the text feature is determined by a cosine similarity algorithm, and the obtained cosine similarity is determined as the first similarity.
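  • For reference, the cosine similarity between a video feature vector v and a text feature vector t follows the standard definition (the formula itself is not recited verbatim in this disclosure):

$$\mathrm{sim}(v, t) = \frac{v \cdot t}{\lVert v \rVert \, \lVert t \rVert}$$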
  • The video features of the second video segment and the text features of the third text information are extracted based on the feature extraction network, and the similarity between the two is then obtained, so that in the process of model training for the video processing model, the feature extraction network and the recognition network can be trained at the same time, thereby improving the training efficiency and accuracy of the video processing model.
  • the second video segment and the third text information are input to a feature extraction network, and the feature extraction network outputs the video feature and the first similarity of the second video segment.
  • step 205 a translation quality parameter of the visual text translation network to be trained is determined based on the video features of the video segment marked by the time series annotation network to be trained and the text information marked in the video sample. That is, based on the video features of the second video segment and the third text information, the translation quality parameter of the translation network is determined.
  • the translation quality parameter represents the quality of the translation network's translation of video features into textual information.
  • The video features of the second video segment are translated into text information describing the second video segment, the similarity between the translated text information and the third text information is obtained, and this similarity is determined as the translation quality parameter of the translation network.
  • The video features of the second video segment are input into the translation network, the video features are translated into text information based on the translation network, and the translation quality parameter is obtained based on the translated text information. The process includes the following steps (9)-(11).
  • (9) Input the video features of the second video segment into the translation network to obtain text information of the video samples. That is, the video features of the second video segment are input to the translation network to obtain fourth text information of the second video segment.
  • the video features are translated into text information based on the translation network to obtain text information for translating the second video segment.
  • the second similarity parameter may be referred to as the second similarity. That is, the second degree of similarity between the fourth text information and the third text information is determined.
  • text feature extraction is performed on the fourth text information and the third text information to obtain the text features of the fourth text information and the text features of the third text information, and the similarity between the two text features is determined , take the determined similarity as the second similarity.
  • the second similarity is determined according to any similarity determination method. For example, the similarity between the text features is determined based on a cosine similarity algorithm, and the similarity is determined as the second similarity.
  • The second similarity can indicate whether the translation network's translation of the second video segment is accurate.
  • The video features of the second video segment are translated through the translation network, and according to the similarity between the fourth text information obtained by translation and the third text information, the translation network and the recognition network can be trained at the same time during model training of the video processing model, thereby improving the training efficiency and accuracy of the video processing model.
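  • A hedged sketch of this translation quality computation: the translation network generates the fourth text information from the segment's video features, both texts are encoded, and their similarity becomes the translation quality parameter. The encode_text call is a hypothetical helper, and in practice a differentiable captioning loss (e.g. cross-entropy with teacher forcing) might stand in for this term during optimization.

```python
import torch.nn.functional as F

def translation_quality(model, video_features, third_text_tokens):
    # (9) Translate the video features of the second video segment into the fourth text information.
    fourth_text_tokens = model.translation_net(video_features)

    # (10) Encode both pieces of text into text features (encode_text is assumed, not from the disclosure).
    feat_fourth = model.feature_extraction_net.encode_text(fourth_text_tokens)
    feat_third = model.feature_extraction_net.encode_text(third_text_tokens)

    # (11) The second similarity between the two text features serves as the translation quality parameter.
    return F.cosine_similarity(feat_fourth, feat_third, dim=-1).mean()
```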
  • step 206 parameter adjustment is performed on the time-series annotation model to be trained based on the time-series annotation loss parameter, the first similarity parameter and the translation quality parameter to obtain the time-series annotation model. That is, the parameters of the video processing model are adjusted based on the recognition loss parameter, the first similarity, and the translation quality parameter.
  • In some embodiments, the feature extraction network and the translation network in the video processing model are already trained network models; in this step, parameter adjustment is performed on the recognition network based on the recognition loss parameter, the first similarity parameter and the translation quality parameter to obtain the video processing model.
  • Parameter adjustment is performed on the recognition network, the feature extraction network and the translation network in the time series annotation model at the same time. The process is as follows: based on the recognition loss parameter, the first similarity parameter and the translation quality parameter, the network parameters of the recognition network, feature extraction network and translation network to be trained are adjusted until the recognition loss parameter is less than the first threshold, the first similarity is greater than the second threshold, and the translation quality parameter is greater than the third threshold; after the model training is completed, the video processing model is obtained.
  • the first threshold, the second threshold and the third threshold are set as required, and in this embodiment of the present disclosure, the first threshold, the second threshold and the third threshold are not limited.
  • the model training is performed on various networks in the video processing model at the same time by using various parameters.
  • Different networks can constrain each other, so that multiple networks are trained in the same training process, which improves the training efficiency of the model and improves the adaptability of each network in the video processing model.
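  • Putting the three training signals together, the joint adjustment described here could be sketched as the loop below, reusing the pipelined_forward, recognition_loss and translation_quality sketches above. The weighted-sum objective and the optimizer choice are assumptions; the three stopping thresholds mirror the first, second and third thresholds mentioned in this disclosure.

```python
import torch
import torch.nn.functional as F

def train_jointly(model, data_loader, first_threshold, second_threshold, third_threshold,
                  num_epochs: int = 50, lr: float = 1e-4):
    # Adjust the recognition, feature extraction and translation networks together,
    # stopping once all three conditions described above are satisfied.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for video_sample, third_text, gt_start, gt_end in data_loader:
            (start, end), video_feat, text_feat, _ = pipelined_forward(model, video_sample, third_text)

            rec_loss = recognition_loss(start, end, gt_start, gt_end)              # recognition loss parameter
            first_sim = F.cosine_similarity(video_feat, text_feat, dim=-1).mean()  # first similarity
            quality = translation_quality(model, video_feat, third_text)           # translation quality parameter

            # Lower loss and higher similarities are desired, so the similarity terms are subtracted
            # (an assumed, equally weighted combination).
            total = rec_loss.mean() - first_sim - quality
            optimizer.zero_grad()
            total.backward()
            optimizer.step()

            if (rec_loss.mean().item() < first_threshold
                    and first_sim.item() > second_threshold
                    and quality.item() > third_threshold):
                return model  # model training is considered complete
    return model
```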
  • The recognition network to be trained, the feature extraction network to be trained and the translation network to be trained can also be trained separately, and the trained recognition network, feature extraction network and translation network can then be directly constructed into the video processing model.
  • In the process of training the video processing model, parameters output by the other networks are introduced, and the video processing model is trained according to the training parameters of the various networks in the video processing model, which enriches the training parameters used to train the video processing model and further improves the accuracy of the video time series labeling of the video processing model.
  • the embodiment of the present disclosure provides a new video processing model
  • The video processing model includes a recognition network, a feature extraction network and a translation network. In the process of time series labeling of a video file, the first video segment in the video file that matches the first text information can be determined based on the recognition network.
  • FIG. 4 is a flowchart of a video processing method provided according to an exemplary embodiment.
  • the time sequence labeling of the video file by using the video processing model is used as an example for description. As shown in Figure 4, the method includes the following steps.
  • step 401 the video file to be marked and the text information to be queried are acquired.
  • the text information to be queried is similar to the text information marked in the video sample, and details are not repeated here.
  • the text information to be queried may be referred to as the first text information, and the video file to be marked and the first text information to be queried are acquired.
  • the video file to be marked is a video file uploaded by a user, or the video file is a video file in a database.
  • the video file is not specifically limited.
  • The text information to be queried is the requirement for the video content to be retained when editing the video. The video file input by the user and the content requirements for editing the video file are received, and based on the content requirements, the video file is time-series marked.
  • That is, in the video editing scenario, the first text information indicates the video segment that needs to be clipped from the video file; the video file to be edited and the first text information are obtained, and the video file is time-series marked based on the first text information to obtain a video segment in the video file that matches the first text information.
  • The video file is a video file in the database to be queried; the electronic device receives the text information to be queried input by the user, and performs time sequence annotation on the video files in the database according to the text information, thereby determining the video file matching the text information to be queried. That is, in the video query scenario, the first text information indicates the target video file to be queried. The first text information and multiple candidate video files in the database are obtained, time sequence annotation is then performed on each candidate video file based on the first text information, and a candidate video file in which a video segment matching the first text information can be marked is determined as the target video file.
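  • In this query scenario, a hedged sketch of scanning candidate video files: each candidate is annotated against the first text information and kept if a sufficiently well-matching segment is found. The locate helper, its returned match score and the threshold are illustrative assumptions; the disclosure only requires that a candidate in which a matching segment can be marked is selected.

```python
def query_database(model, candidate_videos, first_text_tokens, match_threshold: float = 0.5):
    # candidate_videos: iterable of (video_id, video_frames) pairs from the database.
    results = []
    for video_id, video_frames in candidate_videos:
        # Hypothetical call: mark the segment matching the query text and score the match.
        start, end, score = model.recognition_net.locate(video_frames, first_text_tokens)
        if score >= match_threshold:
            results.append((video_id, start, end, score))
    # Return the matching target video files, best match first.
    return sorted(results, key=lambda r: r[3], reverse=True)
```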
  • step 402 feature extraction is performed on the video file and the text information to be queried respectively through the time series annotation network of the time series annotation model to obtain the video features of the video file and the text features of the text information to be queried. That is, the recognition network is called to extract the video features of the video file and the text features of the first text information, respectively.
  • This step is the same as step (1) in step 202, and will not be repeated here.
  • step 403 video features matching the text features of the text information to be queried are determined from the video features of the video file.
  • the video feature matched with the text feature of the text information to be queried may be called the target video feature, that is, the target video feature matching the text feature of the first text information is determined from the video features of the video file.
  • This step is the same as step (2) in step 202, and will not be repeated here.
  • step 404 the video segment corresponding to the video feature that matches the text feature of the text information to be queried is determined as the video segment that matches the text information to be queried. That is, the video segment corresponding to the target video feature is determined as the first video segment.
  • This step is the same as step (3) in step 202, and will not be repeated here.
  • step 405 the video segment matched by the text information to be queried is input into the feature extraction network of the time series labeling model to obtain the video feature of the video segment matched by the text information to be queried. That is, the first video segment is input to the feature extraction network to obtain video features of the first video segment.
  • This step is similar to the process of determining the video feature of the second video segment in step (7) in step 204, and will not be repeated here.
  • step 406 the video feature of the video segment matched with the text information to be queried is input into the translation network of the video processing model to obtain the text information of the video segment marked in the video file.
  • the video segment marked in the video file is the first video segment. That is, the video features of the first video segment are input into the translation network to obtain the second text information of the first video segment.
  • This step is the same as step (9) in step 205, and will not be repeated here.
  • step 407 the video segment matched with the text information to be queried and the text information marked in the video file are output through the time series annotation model. That is, the first video segment and the second text information are output based on the video processing model, where the second text information is used to describe the video content of the first video segment.
  • the video processing model outputs the first video segment and the second text information of the first video segment according to the output results of the multiple networks respectively.
  • the first text information and the second text information are the same or different, which is not limited in this embodiment of the present disclosure.
  • the target video is a video of a football game
  • the first text information is "goal"
  • the video segment of the "goal" in the target video and the second text information of the video segment can be determined based on the video processing model.
  • the second text information is a piece of content that describes the goal action in detail.
  • the recognition network, feature extraction network and translation network in the video processing model can also be used independently.
  • the usage of the network in the video processing model is not specifically limited.
  • the recognition network can be called separately to label the video files.
  • the feature extraction network is invoked to perform feature extraction on video files or text files.
  • the translation network is called to translate the video features, and the text information corresponding to the video file is obtained.
  • the embodiment of the present disclosure provides a new video processing model
  • The video processing model includes a recognition network, a feature extraction network and a translation network. In the process of time series labeling of a video file, the first video segment in the video file that matches the first text information can be determined based on the recognition network.
  • The electronic device obtains the target video to be searched and the keyword "diving", and inputs the target video and "diving" into the video processing model; the matched video segment is translated into corresponding description information, so that the video content related to "diving" in the target video is searched out.
  • A target video with a long duration is stored in the electronic device, and the user needs to clip a desired video segment from the target video. Using the video processing model provided by the embodiments of the present disclosure, the target video and the text description information of the desired video segment are input into the video processing model, and based on the video processing model, the video segment matching the text description information and the keyword corresponding to the video segment are output; the output keyword is used as the title of the video segment, so that clipping of the target video is implemented based on the video processing model.
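  • A brief usage sketch for this clipping scenario, assuming the inference pipeline shown earlier and a hypothetical tokenizer and save_clip helper: the matched segment is written out and the generated text is reused as its title.

```python
def clip_and_title(model, target_video, description_tokens, tokenizer, save_clip):
    # Locate the segment matching the text description and generate text describing it.
    segment, generated_tokens = model(target_video, description_tokens)

    # Use the generated text (e.g. a keyword) as the title of the clipped segment.
    title = tokenizer.decode(generated_tokens)
    save_clip(segment, filename=f"{title}.mp4")
    return title
```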
  • Fig. 6 is a block diagram of a video processing apparatus provided according to an exemplary embodiment.
  • the device includes:
  • an obtaining unit 601 configured to obtain a video file and first text information
  • the timing labeling unit 602 is configured to input the video file and the first text information into the recognition network of the video processing model to obtain the first video segment matched by the first text information;
  • a feature extraction unit 603 configured to input the first video segment into the feature extraction network of the video processing model to obtain the video features of the first video segment;
  • the visual text translation unit 604 is configured to input the video features of the first video segment into the translation network of the video processing model to obtain second text information of the first video segment, and the second text information is used to describe the video of the first video segment content;
  • the output unit 605 is configured to output the first video segment and the second text information based on the video processing model.
  • the timing labeling unit 602 includes:
  • a feature extraction subunit configured to call a recognition network, to extract the video feature of the video file and the text feature of the first text information respectively;
  • a first determining subunit configured to determine a target video feature matching the text feature from the video feature of the video file
  • the second determination subunit is configured to determine the video segment corresponding to the target video feature as the first video segment.
  • the apparatus further includes:
  • the timing labeling unit 602 is also configured to input the video sample into the recognition network of the video processing model to obtain a second video segment matched by the third text information, and the video sample is marked with the third text information;
  • a second determining unit configured to determine a recognition loss parameter of the recognition network based on the second video segment and the third video segment marked in the video sample;
  • a third determining unit configured to determine the first similarity and the video features of the second video segment based on the second video segment and the third text information, where the first similarity indicates the similarity between the second video segment and the third text information;
  • the fourth determining unit is configured to determine the translation quality parameter of the translation network of the video processing model based on the video feature of the second video segment and the third text information, and the translation quality parameter represents the quality of the translation network to translate the video feature into text information;
  • a parameter adjustment unit configured to adjust parameters of the video processing model based on the recognition loss parameter, the first similarity, and the translation quality parameter.
  • the second determining unit includes:
  • a third determining subunit configured to determine the start time and end time of the second video segment in the video sample, and the start time and end time of the third video segment in the video sample;
  • a loss parameter determination subunit configured to determine the recognition loss based on the recognition loss function, the start time and end time of the second video segment in the video sample, and the start time and end time of the third video segment in the video sample parameter.
  • the third determining unit includes:
  • the feature extraction unit 603 is configured to input the second video segment and the third text information into the feature extraction network of the video processing model to obtain the video feature of the second video segment and the text feature of the third text information;
  • the first similarity determination subunit is configured to determine the cosine similarity between the video feature of the second video segment and the text feature of the third text information to obtain the first similarity.
  • the fourth determining unit includes:
  • the visual text translation unit 604 is configured to input the video features of the second video segment into the translation network to obtain fourth text information of the second video segment;
  • a second similarity determination subunit configured to determine a second similarity between the fourth text information and the third text information;
  • the fourth determination subunit is configured to determine the second similarity as the translation quality parameter (an illustrative text-similarity sketch is given below).
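The second similarity between the generated fourth text information and the labeled third text information can be measured in several ways (for example, cosine similarity of their text features, as used elsewhere in this disclosure). The sketch below uses a simple token-overlap F1 score instead, purely so it runs without a feature extractor; it is an illustrative assumption, not the claimed measure.

```python
def text_similarity(generated: str, reference: str) -> float:
    """Token-overlap F1 between the fourth (generated) and third (labeled) text."""
    gen_tokens = set(generated.lower().split())
    ref_tokens = set(reference.lower().split())
    if not gen_tokens or not ref_tokens:
        return 0.0
    overlap = len(gen_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# The result would be used directly as the translation quality parameter.
print(text_similarity("a player shoots and scores a goal",
                      "the player scores a goal from outside the box"))
```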
  • the parameter adjustment unit is configured to adjust the network parameters of the recognition network, the feature extraction network, and the translation network, respectively, based on the recognition loss parameter, the first similarity, and the translation quality parameter, until the recognition loss parameter is less than a first threshold, the first similarity is greater than a second threshold, and the translation quality parameter is greater than a third threshold, at which point the model training is completed (a schematic training loop is sketched below).
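The stopping rule above reads naturally as a loop that keeps adjusting all three networks until every criterion clears its threshold. The sketch below assumes a `train_one_epoch` callable that updates the networks and returns the three averaged quantities, plus placeholder threshold values and an epoch cap; none of these specifics come from the disclosure.

```python
def train_until_thresholds(train_one_epoch,
                           loss_threshold: float = 0.1,
                           similarity_threshold: float = 0.8,
                           quality_threshold: float = 0.8,
                           max_epochs: int = 100) -> None:
    for epoch in range(max_epochs):
        # train_one_epoch is assumed to update the recognition, feature
        # extraction and translation networks and report averaged metrics.
        rec_loss, first_sim, trans_quality = train_one_epoch()
        print(f"epoch {epoch}: loss={rec_loss:.3f} "
              f"sim={first_sim:.3f} quality={trans_quality:.3f}")
        if (rec_loss < loss_threshold
                and first_sim > similarity_threshold
                and trans_quality > quality_threshold):
            print("model training completed")
            return
    print("reached max_epochs without meeting all thresholds")

# Dummy usage with a fake trainer whose metrics improve over time.
state = {"epoch": 0}
def fake_epoch():
    state["epoch"] += 1
    t = state["epoch"]
    return 1.0 / t, 1.0 - 0.5 / t, 1.0 - 0.6 / t

train_until_thresholds(fake_epoch)
```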
  • In some embodiments, the electronic device is a terminal or a server. In some embodiments, the electronic device is a terminal for performing the video processing method provided by the present disclosure.
  • FIG. 7 shows a structural block diagram of a terminal 700 provided by an exemplary embodiment of the present disclosure.
  • the terminal 700 is a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • Terminal 700 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the terminal 700 includes: a processor 701 and a memory 702 .
  • processor 701 includes one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 701 is implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 701 also includes a main processor and a co-processor.
  • the main processor is a processor for processing data in a wake-up state, also referred to as a CPU (Central Processing Unit, central processing unit).
  • a coprocessor is a low-power processor for processing data in a standby state.
  • the processor 701 is integrated with a GPU (Graphics Processing Unit), and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 701 further includes an AI (Artificial Intelligence) processor, where the AI processor is used to process computing operations related to machine learning.
  • memory 702 includes one or more computer-readable storage media that are non-transitory.
  • Memory 702 also includes high-speed random access memory, as well as non-volatile memory, such as one or more disk storage devices, flash storage devices.
  • a non-transitory computer-readable storage medium in the memory 702 is used to store at least one instruction, and the at least one instruction is executed by the processor 701 to implement the video processing method provided by the method embodiments of the present disclosure.
  • the terminal 700 may optionally further include: a peripheral device interface 703 and at least one peripheral device.
  • the processor 701, the memory 702 and the peripheral device interface 703 are connected through a bus or signal line.
  • Each peripheral device is connected to the peripheral device interface 703 through a bus, a signal line or a circuit board.
  • the peripheral device includes at least one of a radio frequency circuit 704 , a display screen 705 , a camera assembly 706 , an audio circuit 707 , a positioning assembly 708 and a power supply 709 .
  • the peripheral device interface 703 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 701 and the memory 702 .
  • in some embodiments, the processor 701, the memory 702, and the peripheral device interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral device interface 703 are implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 704 communicates with the communication network and other communication devices via electromagnetic signals.
  • the radio frequency circuit 704 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 704 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like.
  • radio frequency circuitry 704 communicates with other terminals via at least one wireless communication protocol.
  • the wireless communication protocol includes but is not limited to: the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks.
  • the radio frequency circuit 704 further includes a circuit related to NFC (Near Field Communication), which is not limited in the present disclosure.
  • the display screen 705 is used to display a UI (User Interface).
  • the UI includes graphics, text, icons, video, and any combination thereof.
  • when the display screen 705 is a touch display screen, the display screen 705 also has the ability to acquire touch signals on or above the surface of the display screen 705.
  • the touch signal is input to the processor 701 as a control signal for processing.
  • the display screen 705 is also used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • the display screen 705 is a flexible display screen disposed on a curved or folded surface of the terminal 700; the display screen 705 can even be configured as a non-rectangular irregular shape, that is, a shaped screen.
  • the display screen 705 is made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
  • the camera assembly 706 is used to capture images or video.
  • the camera assembly 706 includes a front camera and a rear camera.
  • the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal.
  • there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize the background blur function, or the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions.
  • camera assembly 706 also includes a flash.
  • the flash is a single color temperature flash, or, a dual color temperature flash. Dual color temperature flash refers to the combination of warm light flash and cold light flash, which is used for light compensation under different color temperatures.
  • the audio circuit 707 includes a microphone and a speaker.
  • the microphone is used to collect the sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 701 for processing, or to the radio frequency circuit 704 to realize voice communication.
  • the microphones are also array microphones or omnidirectional acquisition microphones.
  • the speaker is used to convert the electrical signal from the processor 701 or the radio frequency circuit 704 into sound waves.
  • the loudspeaker is a conventional thin-film loudspeaker, or alternatively, a piezoelectric ceramic loudspeaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as distance measurement.
  • the audio circuit 707 also includes a headphone jack.
  • the positioning component 708 is used to locate the current geographic location of the terminal 700 to implement navigation or LBS (Location Based Service).
  • the positioning component 708 is a positioning component based on the GPS (Global Positioning System) of the United States, the Beidou system of China, or the Galileo system of Russia.
  • the power supply 709 is used to power various components in the terminal 700 .
  • the power source 709 is alternating current, direct current, a disposable battery, or a rechargeable battery.
  • the rechargeable battery is a wired rechargeable battery or a wireless rechargeable battery. Wired rechargeable batteries are batteries that are charged through wired lines, and wireless rechargeable batteries are batteries that are charged through wireless coils.
  • the rechargeable battery is also used to support fast charging technology.
  • the terminal 700 also includes one or more sensors 710 .
  • the one or more sensors 710 include, but are not limited to, an acceleration sensor 711 , a gyro sensor 712 , a pressure sensor 713 , a fingerprint sensor 714 , an optical sensor 715 and a proximity sensor 716 .
  • the acceleration sensor 711 detects the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 700 .
  • the acceleration sensor 711 is used to detect the components of the gravitational acceleration on the three coordinate axes.
  • the processor 701 controls the display screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711 .
  • the acceleration sensor 711 is also used for game or user movement data collection.
  • the gyroscope sensor 712 detects the body direction and rotation angle of the terminal 700 , and the gyroscope sensor 712 cooperates with the acceleration sensor 711 to collect 3D actions of the user on the terminal 700 .
  • the processor 701 can implement the following functions according to the data collected by the gyro sensor 712 : motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 713 is disposed on the side frame of the terminal 700 and/or the lower layer of the display screen 705 .
  • the pressure sensor 713 can detect the user's holding signal of the terminal 700 , and the processor 701 performs left and right hand identification or shortcut operations according to the holding signal collected by the pressure sensor 713 .
  • when the pressure sensor 713 is disposed on the lower layer of the display screen 705, the processor 701 controls the operability controls on the UI according to the user's pressure operation on the display screen 705.
  • the operability controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.
  • the fingerprint sensor 714 is used to collect the user's fingerprint, and the processor 701 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 714 , or the fingerprint sensor 714 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings.
  • the fingerprint sensor 714 is disposed on the front, back, or side of the terminal 700 . In some embodiments, when the terminal 700 is provided with a physical button or a manufacturer's logo, the fingerprint sensor 714 is integrated with the physical button or the manufacturer's logo.
  • Optical sensor 715 is used to collect ambient light intensity.
  • the processor 701 controls the display brightness of the display screen 705 according to the ambient light intensity collected by the optical sensor 715 . Specifically, when the ambient light intensity is high, the display brightness of the display screen 705 is increased; when the ambient light intensity is low, the display brightness of the display screen 705 is decreased.
  • the processor 701 also dynamically adjusts the shooting parameters of the camera assembly 706 according to the ambient light intensity collected by the optical sensor 715 .
  • a proximity sensor 716 also called a distance sensor, is usually provided on the front panel of the terminal 700 .
  • the proximity sensor 716 is used to collect the distance between the user and the front of the terminal 700 .
  • when the proximity sensor 716 detects that the distance between the user and the front of the terminal 700 gradually decreases, the processor 701 controls the display screen 705 to switch from the bright screen state to the off screen state; when the proximity sensor 716 detects that the distance between the user and the front of the terminal 700 gradually increases, the processor 701 controls the display screen 705 to switch from the off screen state to the bright screen state.
  • those skilled in the art can understand that the structure shown in FIG. 7 does not constitute a limitation on the terminal 700, which can include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • in some embodiments, the electronic device is a server for performing the video processing method provided by the present disclosure.
  • FIG. 8 shows a structural block diagram of a server 800 provided by an exemplary embodiment of the present disclosure.
  • the server 800 may vary greatly due to differences in configuration or performance, and includes one or more processors (central processing units, CPU) 801 and one or more memories 802, where the memory 802 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 801 to implement the video processing method provided by the above method embodiments.
  • the server 800 also has components such as a wired or wireless network interface, a keyboard, and an input/output interface, and the server 800 also includes other components for implementing device functions, which will not be repeated here.
  • Embodiments of the present disclosure also provide an electronic device, including a processor and a memory for storing instructions executable by the processor, where the processor is configured to execute the instructions to implement the following steps: acquiring a video file and first text information; inputting the video file and the first text information into the recognition network of the video processing model to obtain the first video segment matching the first text information; inputting the first video segment into the feature extraction network of the video processing model to obtain the video features of the first video segment; inputting the video features of the first video segment into the translation network of the video processing model to obtain the second text information of the first video segment, where the second text information is used to describe the video content of the first video segment; and outputting the first video segment and the second text information based on the video processing model (an end-to-end sketch of these steps is given below).
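These steps form a three-stage pipeline: locate the segment, embed it, describe it. The sketch below wires the stages together around three placeholder callables (`recognition_net`, `feature_net`, `translation_net`) whose interfaces are assumptions made for illustration; the toy lambdas at the bottom only exist so the example runs end to end.

```python
from typing import Callable, List, Tuple

def process_video(video_file: str,
                  first_text: str,
                  recognition_net: Callable[[str, str], Tuple[float, float]],
                  feature_net: Callable[[str, float, float], List[float]],
                  translation_net: Callable[[List[float]], str]) -> dict:
    # Step 1: locate the first video segment matching the first text information.
    start, end = recognition_net(video_file, first_text)
    # Step 2: extract video features of that segment.
    segment_features = feature_net(video_file, start, end)
    # Step 3: translate the features into second text information describing the segment.
    second_text = translation_net(segment_features)
    # Output both results, as the video processing model does.
    return {"segment": (start, end), "description": second_text}

# Toy stand-ins so the sketch runs end to end.
result = process_video(
    "match.mp4", "goal",
    recognition_net=lambda video, text: (12.0, 26.0),
    feature_net=lambda video, s, e: [0.1, 0.4, 0.7],
    translation_net=lambda feats: "a player scores a goal",
)
print(result)
```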
  • in some embodiments, the processor is configured to execute the instructions to implement the video processing methods and the training methods of the video processing model provided by the other method embodiments described above.
  • Embodiments of the present disclosure also provide a computer-readable storage medium; when the instructions in the computer-readable storage medium are executed by the processor of an electronic device, the electronic device can perform the following steps: acquiring a video file and first text information; inputting the video file and the first text information into the recognition network of the video processing model to obtain the first video segment matching the first text information; inputting the first video segment into the feature extraction network of the video processing model to obtain the video features of the first video segment; inputting the video features of the first video segment into the translation network of the video processing model to obtain the second text information of the first video segment, where the second text information is used to describe the video content of the first video segment; and outputting the first video segment and the second text information based on the video processing model.
  • in some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device can perform the video processing methods and the training methods of the video processing model provided by the other method embodiments described above.
  • Embodiments of the present disclosure also provide a computer program product, including computer instructions, which, when executed by a processor, implement the following steps: acquiring a video file and first text information; inputting the video file and the first text information into the recognition network of the video processing model to obtain the first video segment matching the first text information; inputting the first video segment into the feature extraction network of the video processing model to obtain the video features of the first video segment; inputting the video features of the first video segment into the translation network of the video processing model to obtain the second text information of the first video segment, where the second text information is used to describe the video content of the first video segment; and outputting the first video segment and the second text information based on the video processing model.
  • in some embodiments, when the computer instructions are executed by the processor, the video processing methods and the training methods of the video processing model provided by the other method embodiments described above are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a video processing method and an electronic device. The method includes: inputting an acquired video file and first text information into a recognition network of a video processing model to obtain a first video segment matching the first text information; inputting the first video segment into a feature extraction network of the video processing model to obtain video features of the first video segment; inputting the video features of the first video segment into a translation network of the video processing model to obtain second text information of the first video segment; and outputting the first video segment and the second text information based on the video processing model.

Description

视频处理方法及电子设备
本公开基于申请日为2020年12月22日、申请号为202011526967.5的中国专利申请,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本公开作为参考。
技术领域
本公开涉及机器学习技术领域,特别涉及一种视频处理方法及电子设备。
背景技术
视频时序标注是视频处理、模式识别等任务中的一个重要过程。视频时序标注是指通过识别视频文件,从视频文件中预测出与文本信息匹配的起始时间和终止时间,根据该起始时间和终止时间,在该视频文件中标注出与该文本信息匹配的视频段。
发明内容
根据本公开实施例的一方面,提供了一种视频处理方法,所述方法包括:获取视频文件和第一文本信息;将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;基于所述视频处理模型输出所述第一视频段和所述第二文本信息。
根据本公开实施例的另一方面,提供了一种视频处理模型的训练方法,所述方法包括:将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数;基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相似度;基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
根据本公开实施例的另一方面,提供了一种视频处理装置,所述装置包括:获取单元,被配置为获取视频文件和第一文本信息;时序标注单元,被配置为将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;特征提取单元,被配置为将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;视觉文本翻译单元,被配置为将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;输出单元,被配置为基于所述视频处理模型输出所述第一视频段和所述第二文本信息。
根据本公开实施例的另一方面,提供了一种视频处理模型的训练装置,所述装置包括:时序标注单元,被配置为将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;第二确定单元,被配置为基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数;第三确定单元,被配置为基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相 似度;第四确定单元,被配置为基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;参数调整单元,被配置为基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
根据本公开实施例的另一方面,提供了一种电子设备,包括:处理器;用于存储所述处理器可执行指令的存储器;其中,所述处理器被配置为执行所述指令,以实现如下步骤:获取视频文件和第一文本信息;将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;基于所述视频处理模型输出所述第一视频段和所述第二文本信息。
根据本公开实施例的另一方面,提供了一种电子设备,包括:处理器;用于存储所述处理器可执行指令的存储器;其中,所述处理器被配置为执行所述指令,以实现如下步骤:将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数;基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相似度;基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
根据本公开实施例的另一方面,提供了一种计算机可读存储介质,当所述计算机可读存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行如下步骤:获取视频文件和第一文本信息;将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;基于所述视频处理模型输出所述第一视频段和所述第二文本信息。
根据本公开实施例的另一方面,提供了一种计算机可读存储介质,当所述计算机可读存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行如下步骤:将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数;基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相似度;基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
根据本公开实施例的另一方面,提供了一种计算机程序产品,包括计算机指令,所述计算机指令被处理器执行时实现如下步骤:获取视频文件和第一文本信息;将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;基于所述视频 处理模型输出所述第一视频段和所述第二文本信息。
根据本公开实施例的另一方面,提供了一种计算机程序产品,包括计算机指令,所述计算机指令被处理器执行时实现如下步骤:将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数;基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相似度;基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
本公开实施例提供的视频处理模型包括识别网络、特征提取网络和翻译网络,在对视频文件进行处理的过程中,能够基于识别网络识别出视频文件中与第一文本信息匹配的第一视频段,以及基于翻译网络翻译出第一视频段的第二文本信息,因此,对于该视频处理模型来说,能够输出第一视频段以及第二文本信息,即基于一个视频处理模型,得到视频文件的多种输出结果,提高了视频标注结果的多样性。
附图说明
图1为根据一示例性实施例提供的一种视频处理方法流程图;
图2为根据一示例性实施例提供的一种视频处理模型的训练方法流程图;
图3为根据一示例性实施例提供的一种视频处理方法流程图;
图4为根据一示例性实施例提供的一种视频处理方法流程图;
图5为根据一示例性实施例提供的一种视频处理方法流程图;
图6是根据一示例性实施例提供的一种视频处理装置的框图;
图7是根据一示例性实施例提供的一种终端的结构示意图;
图8是根据一示例性实施例提供的一种服务器的结构示意图。
具体实施方式
随着机器学习技术的发展,视频时序标注的应用场景越来越广泛。例如,视频时序标注应用在视频处理、模式识别等场景中。在一些实施例中,通过视频时序标注来剪辑原视频文件,得到与待查询的文本信息匹配的视频段。例如,在剪辑视频的过程中,电子设备接收用户输入的待查询的文本信息,根据该待查询的文本信息识别视频文件中的视频内容,将识别到的视频段剪辑出来,得到剪辑完成的视频段。在另一些实施例中,通过视频时序标注来识别原视频文件,得到与待查询的文本信息匹配的视频文件。例如,在进行视频搜索时,电子设备接收用户输入的待查询的文本信息,根据该待查询的文本信息搜索多个视频文件,得到包含与该待查询的文本信息匹配的视频段的视频文件,反馈该视频文件。
相关技术中,在进行视频时序标注时,时序标注模型只有时序标注的功能。因此,在通过时序标注模型进行视频时序标注时,只能得到单一的视频标注结果,即视频段。
相应的,在对视频文件进行视频处理之前,需要对待训练的视频处理模型进行模型训练,得到训练完成的视频处理模型。相关技术中,在对视频处理模型进行模型训练时,将视频样本输入至待训练的视频处理模型,基于视频处理模型产生的识别损失参数,调整视频处理模型的参数,直到完成模型训练,得到视频处理模型。这样模型训练的过程中,只将识别损失参数作为衡量视频处理模型是否训练完成的标准,使得模型训练的训练指标较为单一,在训练过程中出现特征提取不准确等问题的情况下,造成文本特征与视频文件的视频特征匹配度出现错误,导致训练得到的视频处理模型不准确。
在本公开实施例中,在视频处理模型中结合识别网络、特征提取网络和翻译网络,能够通过识别网络确定视频文件中与待查询的第一文本信息匹配的第一视频段,通过特征提取网 络对该第一视频段进行特征提取,通过翻译网络对提取的视频特征进行视觉文本翻译,得到该第一视频段的第二文本信息,使得在标注视频文件的过程中,能够得到标注的第一视频段以及该第一视频段对应的第二文本信息,从而实现通过一个视频处理模型,就能得到视频文件的多种输出结果,提高了视频标注结果的多样性。
并且,在训练视频处理模型的过程中,对视频处理模型中的识别网络、特征提取网络和翻译网络共同进行训练,丰富了训练视频处理模型的训练参数,进而提高了视频处理模型进行视频时序标注的准确率。
图1为根据一示例性实施例提供的一种视频处理方法流程图。如图1所示,该方法的执行主体为电子设备,包括以下步骤。
在步骤101中,获取视频文件和第一文本信息。
在步骤102中,将视频文件和第一文本信息输入至视频处理模型的识别网络,得到第一文本信息匹配的第一视频段。
在步骤103中,将第一视频段输入至视频处理模型的特征提取网络,得到第一视频段的视频特征。
在步骤104中,将第一视频段的视频特征输入至视频处理模型的翻译网络,得到第一视频段的第二文本信息,第二文本信息用于描述第一视频段的视频内容。
在步骤105中,基于视频处理模型输出第一视频段和第二文本信息。
在一些实施例中,将视频文件和第一文本信息输入至视频处理模型的识别网络,得到第一文本信息匹配的第一视频段,包括:
调用识别网络,分别提取视频文件的视频特征和第一文本信息的文本特征;
从视频文件的视频特征中确定与文本特征匹配的目标视频特征;
将目标视频特征对应的视频段,确定为第一视频段。
在一些实施例中,视频处理模型的训练方法包括:
将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,视频样本中标注有第三文本信息;
基于第二视频段和视频样本中标注的第三视频段,确定识别网络的识别损失参数;
基于第二视频段和第三文本信息,确定第一相似度和第二视频段的视频特征,第一相似度指示第二视频段和第三文本信息之间的相似度;
基于第二视频段的视频特征和第三文本信息,确定视频处理模型的翻译网络的翻译质量参数,翻译质量参数表征翻译网络将视频特征翻译为文本信息的质量;
基于识别损失参数、第一相似度和翻译质量参数,调整视频处理模型的参数。
在一些实施例中,基于第二视频段和视频样本中标注的第三视频段,确定识别网络的识别损失参数,包括:
确定第二视频段在视频样本中的起始时间和终止时间,以及第三视频段在视频样本中的起始时间和终止时间;
基于识别损失函数、第二视频段在视频样本中的起始时间和终止时间,以及第三视频段在视频样本中的起始时间和终止时间,确定识别损失参数。
在一些实施例中,基于第二视频段和第三文本信息,确定第一相似度参数和第二视频段的视频特征,包括:
将第二视频段和第三文本信息输入至视频处理模型的特征提取网络,得到第二视频段的视频特征和第三文本信息的文本特征;
确定第二视频段的视频特征和第三文本信息的文本特征之间的余弦相似度,得到第一相似度。
在一些实施例中,基于提取的第二视频段的视频特征和第三文本信息,确定视频处理模 型的翻译网络的翻译质量参数,包括:
将第二视频段的视频特征输入至翻译网络,得到第二视频段的第四文本信息;
确定第四文本信息与第三文本信息之间的第二相似度;
将第二相似度确定为翻译质量参数。
在一些实施例中,基于识别损失参数、第一相似度和翻译质量参数,调整视频处理模型的参数,包括:
基于识别损失参数、第一相似度和翻译质量参数,分别对识别网络、特征提取网络和翻译网络的网络参数进行调整,直到识别损失参数小于第一阈值,且第一相似度大于第二阈值,且翻译质量参数大于第三阈值,完成模型训练。
本公开实施例提供了一种新的视频处理模型,该视频处理模型包括识别网络、特征提取网络和翻译网络,在对视频文件进行处理的过程中,能够基于识别网络识别出视频文件中与第一文本信息匹配的第一视频段,以及基于翻译网络翻译出第一视频段的第二文本信息,因此,对于该视频处理模型来说,能够输出第一视频段以及第二文本信息,即基于一个视频处理模型,得到视频文件的多种输出结果,提高了视频标注结果的多样性。
在基于视频处理模型对待标注的视频文件进行标注之前,需要对待训练的视频处理模型进行模型训练,得到该视频处理模型。图2为根据一示例性实施例提供的一种视频处理模型的训练方法的流程图。在本公开实施例中,以对待训练的视频处理模型进行模型训练为例进行说明。如图2所示,该方法的执行主体为电子设备,包括以下步骤。
在步骤201中,确定待训练的时序标注模型。
其中,该待训练的时序标注模型包括待训练的时序标注网络、待训练的特征提取网络和待训练的视觉文本翻译网络。本公开实施例中,时序标注模型可称为视频处理模型,时序标注网络可称为识别网络,视觉文本翻译网络可称为翻译网络。
在本步骤中,确定视频处理模型的结构。例如,确定识别网络的网络结构、特征提取网络的网络结构和翻译网络的网络结构,以及识别网络、特征提取网络和翻译网络之间的连接结构。
在一些实施例中,视频处理模型为流水线式的模型训练架构,即将识别网络、特征提取网络和翻译网络构建为流水线式的模型训练架构。参见图3,将识别网络的输出作为特征提取网络的输入,将特征提取网络的输出作为翻译网络的输入。从而识别网络得到输出结果后,能够直接将输出结果输入至特征提取网络中,特征提取网络得到输出结果后,能够直接将输出结果输入至翻译网络中。
在本公开实施例中,将视频处理模型中识别网络、特征提取网络和翻译网络构建为流水线式的模型训练架构,使得能够直接将前一网络的输出作为后一网络的输入,从而使识别网络、特征提取网络和翻译网络能够同步训练,简化了模型训练的过程,提高了模型训练的准确性。
需要说明的一点是,该识别网络、特征提取网络和翻译网络分别为开发人员设计的任一结构的网络,在本公开实施例中,对识别网络、特征提取网络和翻译网络的结构不作具体限定。
在步骤202中,将视频样本输入至该待训练的时序标注网络,得到待训练的时序标注网络标注的视频段。
在一些实施例中,待训练的时序标注网络标注的视频段为第二视频段。也即是,将视频样本输入至识别网络,得到第三文本信息匹配的第二视频段。
其中,该视频样本为标注了视频段的视频样本,该视频样本还标注了视频段匹配的文本信息,即视频样本中标注的视频段为第三视频段,视频样本中标注的文本信息为第三文本信息,且第三视频段与第三文本信息匹配,该第三视频段为用户训练视频处理模型的样本视频 段,第三文本信息为用于训练视频处理模型的样本文本信息。在一些实施例中,视频样本中标注有第三视频段的起始时间和终止时间,该起始时间和终止时间之间的视频段即为第三视频段。需要说明的一点是,该第三文本信息为词语、关键字、描述文本、图像、视频文件等。在本公开实施例中,对该第三文本信息不作限定。
在本步骤中,将视频样本输入至识别网络中,基于识别网络对该视频样本进行标注,得到该识别网络预测的第二视频段,该第二视频段与第三文本信息匹配。其中,该第二视频段为识别网络预测的与第三文本信息匹配的视频段。
在本步骤中,基于识别网络,提取视频样本的视频特征,将提取出的视频特征与第三文本信息的文本特征进行对比,从而获取到预测到的第二视频段。该过程包括以下步骤(1)-(3)。
(1)基于识别网络,分别对该视频样本和该第三文本信息进行特征提取,得到该视频样本的视频特征和该第三文本信息的文本特征。
其中,该视频样本的视频特征和该第三文本信息的文本特征为任一类型的特征。例如,该视频样本的视频特征和该第三文本信息的文本特征均为向量特征或矩阵特征等。
(2)从该视频样本的视频特征中确定与该第三文本信息的文本特征匹配的视频特征。
在本步骤中,将该第三文本信息的文本特征与视频样本的视频特征逐一进行特征对比,得到与该第三文本信息的文本特征匹配的视频特征。其中,文本特征与视频特征匹配指文本特征与视频特征相同或者相似。
在一些实施例中,分别确定第三文本信息的文本特征与视频样本的视频特征的相似度,将相似度最高的视频特征,确定为与第三文本信息的文本特征匹配的视频特征。其中,该第三文本信息的文本特征与视频样本的视频特征的相似度为任一类型的相似度。例如,该相似度为余弦相似度等。
在一些实施例中,将视频样本划分为多个视频段,每个视频段具有对应的视频特征,分别确定第三文本信息的文本特征与每个视频段的视频特征之间的相似度,将相似度最高的视频特征,确定为与第三文本信息的文本特征匹配的视频特征。
(3)将该第三文本信息的文本特征匹配的视频特征对应的视频段,确定为该第三文本信息匹配的第二视频段。
在本步骤中,确定第三文本信息的文本特征匹配的视频特征在视频样本中的起始时间和终止时间,将该起始时间和终止时间之间的视频内容确定为第二视频段。
在本公开实施例中,基于视频处理模型中的识别网络,对视频样本和文本信息进行特征提取,从而在训练识别网络的过程中,通过特征提取网络与识别网络进行相互约束,从而在同一训练过程中训练两个网络,提高了模型的训练的效率,并且,提高了识别网络和特征提取网络的适配度,进而提高了视频处理模型的准确度。
在步骤203中,基于该待训练的时序标注网络标注的视频段和该视频样本中标注的视频段,确定该待训练的时序标注网络的时序标注损失参数。也即是,基于第二视频段和视频样本中标注的第三视频段,确定识别网络的识别损失参数。
其中,该识别损失参数为视频处理模型对视频样本进行时序标注时产生的识别损失参数。该识别损失参数基于时序标注损失函数生成,该时序标注损失函数可称为识别损失函数。
在一些实施例中,分别确定第二视频段和第三视频段的视频特征,将第二视频段的视频特征和第三视频段的视频特征输入至识别损失函数中,识别损失函数基于两个视频段的视频特征确定该识别损失参数。
在一些实施例中,确定第二视频段的起始时间和终止时间,以及,确定第三视频段的起始时间和终止时间;基于两个视频段的起始时间和终止时间,基于识别损失函数确定该识别损失参数。该过程包括以下步骤(4)-(6)。
(4)确定第二视频段在该视频样本中的起始时间和终止时间。
在本步骤中,确定识别网络标注的第二视频段,确定该第二视频段在视频样本中对应的起始时间和终止时间。
在一些实施例中,由于基于识别网络标注第二视频段的过程中,会记录第二视频段的起始时间和终止时间。在本步骤中,直接调用该第二视频段的起始时间和终止时间。
(5)确定第三视频段在该视频样本中的起始时间和终止时间。
在一些实施例中,基于在视频样本中标注的起始时间和终止时间,确定该视频样本中标注的第三视频段,即在视频样本中标注有第三视频段的起始时间和终止时间,则在本步骤中,直接获取该视频样本中标注的起始时间和终止时间。
需要说明的一点是,本公开实施例中,对获取两个视频段的起始时间和终止时间的先后顺序不作具体限定。
(6)将第二视频段在该视频样本中的起始时间和终止时间,以及第三视频段在该视频样本中的起始时间和终止时间输入至识别损失函数,得到该识别损失参数。也即是,基于识别损失函数、第二视频段在该视频样本中的起始时间和终止时间,以及第三视频段在该视频样本中的起始时间和终止时间,确定识别损失参数。
在本步骤中,将两个视频段的起始时间和终止时间作为识别损失函数的变量值,基于两个起始时间和终止时间之间的差异,即基于第二视频段的起始时间和第三视频段的起始时间之间的差异,以及第二视频段的终止时间和第三视频段的终止时间之间的差异,确定该识别损失参数。
在本公开实施例中,通过确定第二视频段的起始时间与第三视频段的起始时间是否匹配,以及确定第二视频段的终止时间和第三视频段的终止时间是否匹配,来调整识别网络的网络参数,提高了模型的训练效率和准确度。
在步骤204中,基于该待训练的时序标注网络标注的视频段和该视频样本中标注的文本信息,确定该待训练的时序标注网络标注的视频段和该视频样本中标注的文本信息之间的第一相似度参数和该待训练的时序标注网络标注的视频段的视频特征。
在本公开实施例中,第一相似度参数可称为第一相似度。基于第二视频段和第三文本信息,确定第一相似度和第二视频段的视频特征。
其中,该第一相似度为第三文本信息的文本特征与第二视频段的视频特征之间的相似度,即第一相似度指示第二视频段与第三文本信息之间的相似度。该第一相似度根据任一相似度确定方式确定。在一些实施例中,该第二视频段的视频特征和第三文本信息的文本特征均为特征向量,则该第一相似度为基于余弦相似度算法确定的相似度。相应的,该过程包括以下步骤(7)-(8)。
(7)将第二视频段和第三文本信息输入至特征提取网络,得到第二视频段的视频特征和第三文本信息的文本特征。
基于特征提取网络,分别提取第二视频段的视频特征和该第三文本信息的文本特征。其中,在本公开实施例中,对提取第二视频段的视频特征的过程与提取第三文本信息的文本的过程的先后顺序不作限定。
(8)确定第二视频段的视频特征和第三文本信息的文本特征之间的余弦相似度,得到该第一相似度。
在本步骤中,通过余弦相似度算法确定视频特征和文本特征之间的余弦相似度,将得到的余弦相似度确定为第一相似度。
在本公开实施例中,基于特征提取网络提取第二视频段的视频特征和第三文本信息的文本特征,进而得到二者的相似度,使得在对视频处理模型进行模型训练的过程中,能够将特征提取网络和识别网络同时进行模型训练,进而提高视频处理模型的训练效率和准确性。
本公开实施例中,将第二视频段和第三文本信息输入至特征提取网络,该特征提取网络输出第二视频段的视频特征和第一相似度。
在步骤205中,基于该待训练的时序标注网络标注的视频段的视频特征和该视频样本中标注的文本信息,确定该待训练的视觉文本翻译网络的翻译质量参数。也即是,基于第二视频段的视频特征和第三文本信息,确定翻译网络的翻译质量参数。
其中,该翻译质量参数表征翻译网络将视频特征翻译为文本信息的质量。
在一些实施例中,将第二视频段的视频特征翻译为描述该第二视频段的文本信息,获取该翻译的文本信息和第三文本信息的相似度,将该相似度确定为该翻译网络的翻译质量参数。其中,该相似度越高,翻译网络的翻译质量参数越高,即翻译网络翻译得到的文本信息越准确。
在本步骤中,将第二视频段的视频特征输入至翻译网络中,基于翻译网络将该视频特征翻译为文本信息,基于翻译的文本信息获取翻译质量参数,该过程包括以下步骤(9)-(11)。
(9)将第二视频段的视频特征输入至翻译网络,得到视频样本的文本信息。也即是,将第二视频段的视频特征输入至翻译网络,得到第二视频段的第四文本信息。
在本步骤中,基于翻译网络将视频特征翻译成文本信息,得到对第二视频段进行翻译的文本信息。
(10)确定该视频样本的文本信息与第三文本信息之间的第二相似度参数。
在一些实施例中,第二相似度参数可称为第二相似度。也即是,确定第四文本信息与第三文本信息之间的第二相似度。
在一些实施例中,对第四文本信息与第三文本信息进行文本特征提取,得到该第四文本信息的文本特征和第三文本信息的文本特征,确定这两个文本特征之间的相似度,将确定的相似度作为第二相似度。其中,该第二相似度根据任一相似度确定方式确定。例如,基于余弦相似度算法确定该文本特征之间的相似度,将该相似度确定为第二相似度。
(11)将该第二相似度确定为该翻译质量参数。
在本公开实施例中,由于第四文本信息和第三文本信息均对应于第二视频段,且第三文本信息是提前标注的、准确的文本信息,因此该第二相似度能够指示文本翻译网络对第二视频段的翻译是否准确。
通过翻译网络对第二视频段的视频特征进行翻译,根据翻译得到的第四文本信息和第三文本信息之间的相似度,使得在对视频处理模型进行模型训练的过程中能够将翻译网络和识别网络同时进行模型训练,进而提高视频处理模型的训练效率和准确性。
在步骤206中,基于该时序标注损失参数、该第一相似度参数和该翻译质量参数对该待训练的时序标注模型进行参数调整,得到该时序标注模型。也即是,基于识别损失参数、第一相似度和翻译质量参数,调整视频处理模型的参数。
在一些实施例中,该视频处理模型中的特征提取网络和翻译网络为已经训练好的网络模型,则在本步骤中,通过该识别损失参数、该第一相似度参数和该翻译质量参数对该识别网络进行参数调整,得到该视频处理模型。
在一些实施例中,同时对该时序标注模型中的识别网络、特征提取网络和翻译网络进行参数调整,该过程为:基于该识别损失参数、该第一相似度参数和该翻译质量参数,对该待训练的识别网络、特征提取网络和翻译网络的网络参数进行调整,直到该识别损失参数小于第一阈值,且该第一相似度大于第二阈值,且该翻译质量参数大于第三阈值,完成模型训练,得到该视频处理模型。
其中,该第一阈值、第二阈值和第三阈值根据需要进行设置,在本公开实施例中,对该第一阈值、第二阈值和第三阈值不作限定。
在本公开实施例中,通过多种参数分别对视频处理模型中的多种网络同时进行模型训练,在训练视频处理模型的过程中,使不同的网络之间能够相互约束,从而在同一训练过程中训练多个网络,提高了模型的训练效率,并且,提高了视频处理模型中各个网络的适配度。
需要说明的一点是,待训练的识别网络、待训练的特征提取网络和待训练的翻译网络还 能够分别进行模型训练,之后直接将训练完成的识别网络、特征提取网络和翻译网络构建为视频处理模型即可。
在本公开实施例中,通过在训练视频处理模型的过程中,引入其他网络输出的参数,根据视频处理模型中多种网络的训练参数对视频处理模型进行模型训练,从而丰富了训练视频处理模型的训练参数,进而提高了视频处理模型进行视频时序标注的准确率。
本公开实施例提供了一种新的视频处理模型,该视频处理模型包括识别网络、特征提取网络和翻译网络,在对视频文件进行时序标注的过程中,能够基于识别网络确定视频文件中与第一文本信息匹配的第一视频段,以及基于翻译网络翻译出第一视频段的第二文本信息,因此,对于该视频处理模型来说,能够输出第一视频段以及第二文本信息,即基于一个视频处理模型,得到视频文件的多种输出结果,提高了视频标注结果的多样性。
在完成模型训练后,能够基于训练完成的视频处理模型对待标注的视频文件进行时序标注。参见图4,图4为根据一示例性实施例提供的一种视频处理方法流程图。在本公开实施例中,以通过视频处理模型对视频文件进行时序标注为例进行说明。如图4所示,该方法包括以下步骤。
在步骤401中,获取待标注的视频文件和待查询的文本信息。
其中,该待查询的文本信息与视频样本中标注的文本信息相似,在此不再赘述。
在一些实施例中,待查询的文本信息可称为第一文本信息,则获取待标注的视频文件和待查询的第一文本信息。
该待标注的视频文件为用户上传的视频文件,或者,该视频文件为数据库中的视频文件。在本公开实施例中,对该视频文件不作具体限定。例如,该视频文件为需要进行剪辑的视频文件,则该待查询的文本信息为对剪辑视频时保留的视频内容的要求,接收用户输入的该视频文件,以及,对该视频文件进行剪辑的内容要求,基于该内容要求对该视频文件进行时序标注。也即是,在视频剪辑场景下,第一文本信息指示需要从视频文件中剪辑出的视频段,获取待剪辑的视频文本和该第一文本信息,后续即可基于该第一文本信息,对视频文件进行时序标注,得到该视频文件中与第一文本信息匹配的视频段。
又例如,该视频文件为查询数据库中的视频文件,接收用户输入的待查询的文本信息,根据该文本信息对数据库中的视频文件进行时序标注,从而确定待查询的文本信息匹配的视频文件。也即是,在视频查询场景下,第一文本信息指示待查询的目标视频文件,获取第一文本信息和数据库中的多个备选视频文件,后续即可基于该第一文本信息,分别对每个备选视频文件进行时序标注,将能够标注出与第一文本信息匹配的视频段的备选视频文件确定为目标视频文件。
在步骤402中,通过该时序标注模型的时序标注网络,分别对该视频文件和该待查询的文本信息进行特征提取,得到该视频文件的视频特征和该待查询的文本信息的文本特征。也即是,调用识别网络,分别提取视频文件的视频特征和第一文本信息的文本特征。
本步骤与步骤202中的步骤(1)同理,在此不再赘述。
在步骤403中,从该视频文件的视频特征中确定与该待查询的文本信息的文本特征匹配的视频特征。待查询的文本信息的文本特征匹配的视频特征可称为目标视频特征,也即是,从视频文件的视频特征中确定与第一文本信息的文本特征匹配的目标视频特征。
本步骤与步骤202中的步骤(2)同理,在此不再赘述。
在步骤404中,将该待查询的文本信息的文本特征匹配的视频特征对应的视频段,确定为该待查询的文本信息匹配的视频段。也即是,将目标视频特征对应的视频段,确定第一视频段。
本步骤与步骤202中的步骤(3)同理,在此不再赘述。
在步骤405中,将该待查询的文本信息匹配的视频段输入至该时序标注模型的特征提取 网络,得到该待查询的文本信息匹配的视频段的视频特征。也即是,将第一视频段输入至特征提取网络,得到第一视频段的视频特征。
本步骤与步骤204中的步骤(7)中确定第二视频段的视频特征的过程相似,在此不再赘述。
在步骤406中,将该待查询的文本信息匹配的视频段的视频特征输入至该视频处理模型的翻译网络,得到该视频文件中标注的视频段的文本信息。
其中,视频文件中标注的视频段即为第一视频段。也即是,将第一视频段的视频特征输入至翻译网络,得到该第一视频段的第二文本信息。
本步骤与步骤205中的步骤(9)同理,在此不再赘述。
在步骤407中,通过该时序标注模型输出该待查询的文本信息匹配的视频段和该视频文件中标注的文本信息。也即是,基于视频处理模型输出第一视频段和第二文本信息,该第二文本信息用于描述第一视频段的视频内容。
在本步骤中,参见图5,该视频处理模型分别根据多个网络的输出结果,输出第一视频段和该第一视频段的第二文本信息。
需要说明的一点是,上述实施例中第一文本信息和第二文本信息相同或者不同,本公开实施例对此不做限制。例如,目标视频为一段足球比赛的视频,第一文本信息为“进球”,则基于视频处理模型能够确定目标视频中“进球”的视频段和该视频段的第二文本信息,该第二文本信息为对进球动作进行详细描述的一段内容。
需要说明的一点是,该视频处理模型中的识别网络、特征提取网络和翻译网络还能够单独使用。在本公开实施例中,对该视频处理模型中的网络的使用方式不作具体限定。例如,在训练完成后,能够单独调用识别网络对视频文件进行时序标注。或者,调用特征提取网络对视频文件或文本文件进行特征提取。或者,调用翻译网络对视频特征进行翻译,得到视频文件对应的文本信息等。
本公开实施例提供了一种新的视频处理模型,该视频处理模型包括识别网络、特征提取网络和翻译网络,在对视频文件进行时序标注的过程中,能够基于识别网络确定视频文件中与第一文本信息匹配的第一视频段,以及基于翻译网络翻译出第一视频段的第二文本信息,因此,对于该视频处理模型来说,能够输出第一视频段以及第二文本信息,即基于一个视频处理模型,得到视频文件的多种输出结果,提高了视频标注结果的多样性。
上述实施例中所示的视频处理方法能够应用于多种场景下。
例如,应用于视频内容搜索场景下。
电子设备获取待搜索的目标视频和关键词“跳水”,将目标视频和“跳水”输入至视频处理模型,视频处理模型在目标视频中标注与“跳水”相关的视频段,将该视频段再翻译为对应的描述信息,从而该搜索出目标视频中与“跳水”相关的视频内容。
例如,应用于视频剪辑场景下。
电子设备中存储有一个时长较长的目标视频,用户需要从该目标视频中剪辑出想要的视频段,则能够采用本公开实施例提供的视频处理模型,将目标视频和对想要剪辑的视频段的文本描述信息输入至视频处理模型,基于该视频处理模型输出与文本描述信息匹配的视频段,以及该视频段对应的关键词,将输出的关键词作为视频段的标题,从而基于视频处理模型实现对目标视频的剪辑。
图6是根据一示例性实施例提供的一种视频处理装置的框图。参见图6,装置包括:
获取单元601,被配置为获取视频文件和第一文本信息;
时序标注单元602,被配置为将视频文件和第一文本信息输入至视频处理模型的识别网络,得到第一文本信息匹配的第一视频段;
特征提取单元603,被配置为将第一视频段输入至视频处理模型的特征提取网络,得到 第一视频段的视频特征;
视觉文本翻译单元604,被配置为将第一视频段的视频特征输入至视频处理模型的翻译网络,得到第一视频段的第二文本信息,第二文本信息用于描述第一视频段的视频内容;
输出单元605,被配置为基于视频处理模型输出第一视频段和第二文本信息。
在一些实施例中,该时序标注单元602包括:
特征提取子单元,被配置为调用识别网络,分别提取视频文件的视频特征和第一文本信息的文本特征;
第一确定子单元,被配置为从视频文件的视频特征中确定与文本特征匹配的目标视频特征;
第二确定子单元,被配置为将目标视频特征对应的视频段,确定为第一视频段。
在一些实施例中,该装置还包括:
该时序标注单元602,还被配置为将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,视频样本中标注有第三文本信息;
第二确定单元,被配置为基于第二视频段和视频样本中标注的第三视频段,确定识别网络的识别损失参数;
第三确定单元,被配置为基于第二视频段和第三文本信息,确定第一相似度和第二视频段的视频特征,第一相似度指示第二视频段和第三文本信息之间的相似度;
第四确定单元,被配置为基于第二视频段的视频特征和第三文本信息,确定视频处理模型的翻译网络的翻译质量参数,翻译质量参数表征翻译网络将视频特征翻译为文本信息的质量;
参数调整单元,被配置为基于识别损失参数、第一相似度和翻译质量参数,调整视频处理模型的参数。
在一些实施例中,该第二确定单元包括:
第三确定子单元,被配置为确定第二视频段在视频样本中的起始时间和终止时间,以及第三视频段在视频样本中的起始时间和终止时间;
损失参数确定子单元,被配置为基于识别损失函数、第二视频段在视频样本中的起始时间和终止时间,以及第三视频段在视频样本中的起始时间和终止时间,确定识别损失参数。
在一些实施例中,该第三确定单元包括:
该特征提取单元603,被配置为将第二视频段和第三文本信息输入至视频处理模型的特征提取网络,得到第二视频段的视频特征和第三文本信息的文本特征;
第一相似度确定子单元,被配置为确定第二视频段的视频特征和第三文本信息的文本特征之间的余弦相似度,得到第一相似度。
在一些实施例中,该第四确定单元包括:
该视觉文本翻译单元604,被配置为将第二视频段的视频特征输入至翻译网络,得到第二视频段的第四文本信息;
第二相似度确定子单元,被配置为确定第四文本信息与第三文本信息之间的第二相似度;
第四确定子单元,被配置为将第二相似度确定为翻译质量参数。
在一些实施例中,该参数调整单元,被配置为基于识别损失参数、第一相似度和翻译质量参数,分别对识别网络、特征提取网络和翻译网络的网络参数进行调整,直到识别损失参数小于第一阈值,且第一相似度大于第二阈值,且翻译质量参数大于第三阈值,完成模型训练。
电子设备为终端或服务器。在一些实施例中,电子设备为用于提供本公开所提供的视频处理方法的终端。图7示出了本公开一个示例性实施例提供的终端700的结构框图。在一些实施例中,该终端700是便携式移动终端,比如:智能手机、平板电脑、MP3(Moving Picture  Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)播放器、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端700还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端700包括有:处理器701和存储器702。
在一些实施例中,处理器701包括一个或多个处理核心,比如4核心处理器、8核心处理器等。在一些实施例中,处理器701采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。在一些实施例中,处理器701也包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器701集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器701还包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
在一些实施例中,存储器702包括一个或多个计算机可读存储介质,该计算机可读存储介质是非暂态的。存储器702还包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器702中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器701所执行以实现本公开中方法实施例提供的视频处理方法。
在一些实施例中,终端700还可选包括有:外围设备接口703和至少一个外围设备。在一些实施例中,处理器701、存储器702和外围设备接口703之间通过总线或信号线相连。各个外围设备通过总线、信号线或电路板与外围设备接口703相连。可选地,外围设备包括:射频电路704、显示屏705、摄像头组件706、音频电路707、定位组件708和电源709中的至少一种。
外围设备接口703可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器701和存储器702。在一些实施例中,处理器701、存储器702和外围设备接口703被集成在同一芯片或电路板上;在一些其他实施例中,处理器701、存储器702和外围设备接口703中的任意一个或两个在单独的芯片或电路板上实现,本实施例对此不加以限定。
射频电路704用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路704通过电磁信号与通信网络以及其他通信设备进行通信。射频电路704将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路704包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。在一些实施例中,射频电路704通过至少一种无线通信协议来与其它终端进行通信。该无线通信协议包括但不限于:万维网、城域网、内联网、各代移动通信网络(2G、3G、4G及5G)、无线局域网和/或WiFi(Wireless Fidelity,无线保真)网络。在一些实施例中,射频电路704还包括NFC(Near Field Communication,近距离无线通信)有关的电路,本公开对此不加以限定。
显示屏705用于显示UI(User Interface,用户界面)。在一些实施例中,该UI包括图形、文本、图标、视频及其它们的任意组合。当显示屏705是触摸显示屏时,显示屏705还具有采集在显示屏705的表面或表面上方的触摸信号的能力。在一些实施例中,该触摸信号作为控制信号输入至处理器701进行处理。此时,显示屏705还用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。在一些实施例中,显示屏705为一个,设置在终端700的前面板;在另一些实施例中,显示屏705为至少两个,分别设置在终端700的不同表面或呈折叠设计;在另一些实施例中,显示屏705是柔性显示屏,设置在终端700的弯曲表面上或折叠面上。 甚至,显示屏705还设置成非矩形的不规则图形,也即异形屏。在一些实施例中显示屏705采用LCD(Liquid Crystal Display,液晶显示屏)、OLED(Organic Light-Emitting Diode,有机发光二极管)等材质制备。
摄像头组件706用于采集图像或视频。可选地,摄像头组件706包括前置摄像头和后置摄像头。通常,前置摄像头设置在终端的前面板,后置摄像头设置在终端的背面。在一些实施例中,后置摄像头为至少两个,分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种,以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及VR(Virtual Reality,虚拟现实)拍摄功能或者其它融合拍摄功能。在一些实施例中,摄像头组件706还包括闪光灯。闪光灯是单色温闪光灯,或者,是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合,用于不同色温下的光线补偿。
在一些实施例中,音频电路707包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器701进行处理,或者输入至射频电路704以实现语音通信。在一些实施例中,出于立体声采集或降噪的目的,麦克风为多个,分别设置在终端700的不同部位。在一些实施例中,麦克风还是阵列麦克风或全向采集型麦克风。扬声器则用于将来自处理器701或射频电路704的电信号转换为声波。在一些实施例中,扬声器是传统的薄膜扬声器,或者,是压电陶瓷扬声器。当扬声器是压电陶瓷扬声器时,不仅能够将电信号转换为人类可听见的声波,也能够将电信号转换为人类听不见的声波以进行测距等用途。在一些实施例中,音频电路707还包括耳机插孔。
定位组件708用于定位终端700的当前地理位置,以实现导航或LBS(Location Based Service,基于位置的服务)。在一些实施例中,定位组件708是基于美国的GPS(Global Positioning System,全球定位系统)、中国的北斗系统或俄罗斯的伽利略系统的定位组件。
电源709用于为终端700中的各个组件进行供电。在一些实施例中,电源709是交流电、直流电、一次性电池或可充电电池。当电源709包括可充电电池时,该可充电电池是有线充电电池或无线充电电池。有线充电电池是通过有线线路充电的电池,无线充电电池是通过无线线圈充电的电池。该可充电电池还用于支持快充技术。
在一些实施例中,终端700还包括有一个或多个传感器710。该一个或多个传感器710包括但不限于:加速度传感器711、陀螺仪传感器712、压力传感器713、指纹传感器714、光学传感器715以及接近传感器716。
在一些实施例中,加速度传感器711检测以终端700建立的坐标系的三个坐标轴上的加速度大小。比如,加速度传感器711用于检测重力加速度在三个坐标轴上的分量。在一些实施例中,处理器701根据加速度传感器711采集的重力加速度信号,控制显示屏705以横向视图或纵向视图进行用户界面的显示。在一些实施例中,加速度传感器711还用于游戏或者用户的运动数据的采集。
在一些实施例中,陀螺仪传感器712检测终端700的机体方向及转动角度,陀螺仪传感器712与加速度传感器711协同采集用户对终端700的3D动作。处理器701根据陀螺仪传感器712采集的数据,能够实现如下功能:动作感应(比如根据用户的倾斜操作来改变UI)、拍摄时的图像稳定、游戏控制以及惯性导航。
在一些实施例中,压力传感器713设置在终端700的侧边框和/或显示屏705的下层。当压力传感器713设置在终端700的侧边框时,能够检测用户对终端700的握持信号,由处理器701根据压力传感器713采集的握持信号进行左右手识别或快捷操作。当压力传感器713设置在显示屏705的下层时,由处理器701根据用户对显示屏705的压力操作,实现对UI界面上的可操作性控件进行控制。可操作性控件包括按钮控件、滚动条控件、图标控件、菜单控件中的至少一种。
指纹传感器714用于采集用户的指纹,由处理器701根据指纹传感器714采集到的指纹识别用户的身份,或者,由指纹传感器714根据采集到的指纹识别用户的身份。在识别出用 户的身份为可信身份时,由处理器701授权该用户执行相关的敏感操作,该敏感操作包括解锁屏幕、查看加密信息、下载软件、支付及更改设置等。在一些实施例中,指纹传感器714被设置在终端700的正面、背面或侧面。在一些实施例中,当终端700上设置有物理按键或厂商Logo时,指纹传感器714与物理按键或厂商Logo集成在一起。
光学传感器715用于采集环境光强度。在一个实施例中,处理器701根据光学传感器715采集的环境光强度,控制显示屏705的显示亮度。具体地,当环境光强度较高时,调高显示屏705的显示亮度;当环境光强度较低时,调低显示屏705的显示亮度。在另一个实施例中,处理器701还根据光学传感器715采集的环境光强度,动态调整摄像头组件706的拍摄参数。
接近传感器716,也称距离传感器,通常设置在终端700的前面板。接近传感器716用于采集用户与终端700的正面之间的距离。在一个实施例中,当接近传感器716检测到用户与终端700的正面之间的距离逐渐变小时,由处理器701控制显示屏705从亮屏状态切换为息屏状态;当接近传感器716检测到用户与终端700的正面之间的距离逐渐变大时,由处理器701控制显示屏705从息屏状态切换为亮屏状态。
本领域技术人员能够理解,图7中示出的结构并不构成对终端700的限定,能够包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
在一些实施例中,电子设备为用于提供本公开所提供的视频处理方法的服务器。图8示出了本公开一个示例性实施例提供的服务器800的结构框图。在一些实施例中,该服务器800可因配置或性能不同而产生比较大的差异,包括一个或一个以上处理器(central processing units,CPU)801和一个或一个以上的存储器802,其中,所述存储器801中存储有至少一条指令,所述至少一条指令由所述处理器801加载并执行以实现上述各个方法实施例提供的目标对象的检索方法。当然,在一些实施例中,该服务器800还具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器800还包括其他用于实现设备功能的部件,在此不做赘述。
本公开实施例还提供了一种电子设备,包括处理器;用于存储处理器可执行指令的存储器;其中,处理器被配置为执行指令,以实现如下步骤:获取视频文件和第一文本信息;将视频文件和第一文本信息输入至视频处理模型的识别网络,得到第一文本信息匹配的第一视频段;将第一视频段输入至视频处理模型的特征提取网络,得到第一视频段的视频特征;将第一视频段的视频特征输入至视频处理模型的翻译网络,得到第一视频段的第二文本信息,第二文本信息用于描述第一视频段的视频内容;基于视频处理模型输出第一视频段和第二文本信息。
在一些实施例中,处理器被配置为执行指令,以实现上述方法实施例中的其他实施例提供的视频处理方法和视频处理模型的训练方法。
本公开实施例还提供了一种计算机可读存储介质,当该计算机可读存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行如下步骤:获取视频文件和第一文本信息;将视频文件和第一文本信息输入至视频处理模型的识别网络,得到第一文本信息匹配的第一视频段;将第一视频段输入至视频处理模型的特征提取网络,得到第一视频段的视频特征;将第一视频段的视频特征输入至视频处理模型的翻译网络,得到第一视频段的第二文本信息,第二文本信息用于描述第一视频段的视频内容;基于视频处理模型输出第一视频段和第二文本信息。
在一些实施例中,当该计算机可读存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行上述方法实施例中的其他实施例提供的视频处理方法和视频处理模型的训练方法。
本公开实施例还提供了一种计算机程序产品,包括计算机指令,该计算机指令被处理器执行时实现如下步骤:获取视频文件和第一文本信息;将视频文件和第一文本信息输入至视 频处理模型的识别网络,得到第一文本信息匹配的第一视频段;将第一视频段输入至视频处理模型的特征提取网络,得到第一视频段的视频特征;将第一视频段的视频特征输入至视频处理模型的翻译网络,得到第一视频段的第二文本信息,第二文本信息用于描述第一视频段的视频内容;基于视频处理模型输出第一视频段和第二文本信息。
在一些实施例中,当该计算机指令被处理器执行时实现上述方法实施例中的其他实施例提供的视频处理方法和视频处理模型的训练方法。
本公开所有实施例均可以单独被执行,也可以与其他实施例相结合被执行,均视为本公开要求的保护范围。

Claims (20)

  1. 一种视频处理方法,所述方法包括:
    获取视频文件和第一文本信息;
    将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;
    将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;
    将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到所述第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;
    基于所述视频处理模型输出所述第一视频段和所述第二文本信息。
  2. 根据权利要求1所述的方法,其中,所述将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段,包括:
    调用所述识别网络,分别提取所述视频文件的视频特征和所述第一文本信息的文本特征;
    从所述视频文件的视频特征中确定与所述文本特征匹配的目标视频特征;
    将所述目标视频特征对应的视频段,确定为所述第一视频段。
  3. 一种视频处理模型的训练方法,所述方法包括:
    将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;
    基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数;
    基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相似度;
    基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;
    基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
  4. 根据权利要求3所述的方法,其中,所述基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数,包括:
    确定所述第二视频段在所述视频样本中的起始时间和终止时间,以及所述第三视频段在所述视频样本中的起始时间和终止时间;
    基于识别损失函数、所述第二视频段在所述视频样本中的起始时间和终止时间,以及所述第三视频段在所述视频样本中的起始时间和终止时间,确定所述识别损失参数。
  5. 根据权利要求3所述的方法,其中,所述基于所述第二视频段和所述第三文本信息,确定第一相似度参数和所述第二视频段的视频特征,包括:
    将所述第二视频段和所述第三文本信息输入至所述视频处理模型的特征提取网络,得到所述第二视频段的视频特征和所述第三文本信息的文本特征;
    确定所述第二视频段的视频特征和所述第三文本信息的文本特征之间的余弦相似度,得到所述第一相似度。
  6. 根据权利要求3所述的方法,其中,所述基于所述第二视频段的视频特征和所述第三 文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,包括:
    将所述第二视频段的视频特征输入至所述翻译网络,得到所述第二视频段的第四文本信息;
    确定所述第四文本信息与所述第三文本信息之间的第二相似度;
    将所述第二相似度确定为所述翻译质量参数。
  7. 根据权利要求3所述的方法,其中,所述基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数,包括:
    基于所述识别损失参数、所述第一相似度和所述翻译质量参数,分别对所述识别网络、所述特征提取网络和所述翻译网络的网络参数进行调整,直到所述识别损失参数小于第一阈值,且所述第一相似度大于第二阈值,且所述翻译质量参数大于第三阈值,完成模型训练。
  8. 一种视频处理装置,所述装置包括:
    获取单元,被配置为获取视频文件和第一文本信息;
    时序标注单元,被配置为将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;
    特征提取单元,被配置为将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;
    视觉文本翻译单元,被配置为将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;
    输出单元,被配置为基于所述视频处理模型输出所述第一视频段和所述第二文本信息。
  9. 一种视频处理模型的训练装置,所述装置包括:
    时序标注单元,被配置为将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;
    第二确定单元,被配置为基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数;
    第三确定单元,被配置为基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相似度;
    第四确定单元,被配置为基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;
    参数调整单元,被配置为基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
  10. 一种电子设备,包括:
    处理器;
    用于存储所述处理器可执行指令的存储器;
    其中,所述处理器被配置为执行所述指令,以实现如下步骤:
    获取视频文件和第一文本信息;
    将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;
    将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;
    将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;
    基于所述视频处理模型输出所述第一视频段和所述第二文本信息。
  11. 根据权利要求10所述的电子设备,其中,所述处理器被配置为执行所述指令,以实现如下步骤:
    调用所述识别网络,分别提取所述视频文件的视频特征和所述第一文本信息的文本特征;
    从所述视频文件的视频特征中确定与所述文本特征匹配的目标视频特征;
    将所述目标视频特征对应的视频段,确定为所述第一视频段。
  12. 一种电子设备,包括:
    处理器;
    用于存储所述处理器可执行指令的存储器;
    其中,所述处理器被配置为执行所述指令,以实现如下步骤::
    将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;
    基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数;
    基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相似度;
    基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;
    基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
  13. 根据权利要求12所述的电子设备,其中,所述处理器被配置为执行所述指令,以实现如下步骤:
    确定所述第二视频段在所述视频样本中的起始时间和终止时间,以及所述第三视频段在所述视频样本中的起始时间和终止时间;
    基于识别损失函数、所述第二视频段在所述视频样本中的起始时间和终止时间,以及所述第三视频段在所述视频样本中的起始时间和终止时间,确定所述识别损失参数。
  14. 根据权利要求12所述的电子设备,其中,所述处理器被配置为执行所述指令,以实现如下步骤:
    将所述第二视频段和所述第三文本信息输入至所述视频处理模型的特征提取网络,得到所述第二视频段的视频特征和所述第三文本信息的文本特征;
    确定所述第二视频段的视频特征和所述第三文本信息的文本特征之间的余弦相似度,得到所述第一相似度。
  15. 根据权利要求12所述的电子设备,其中,所述处理器被配置为执行所述指令,以实现如下步骤:
    将所述第二视频段的视频特征输入至所述翻译网络,得到所述第二视频段的第四文本信息;
    确定所述第四文本信息与所述第三文本信息之间的第二相似度;
    将所述第二相似度确定为所述翻译质量参数。
  16. 根据权利要求12所述的电子设备,其中,所述处理器被配置为执行所述指令,以实现如下步骤:
    基于所述识别损失参数、所述第一相似度和所述翻译质量参数,分别对所述识别网络、所述特征提取网络和所述翻译网络的网络参数进行调整,直到所述识别损失参数小于第一阈值,且所述第一相似度大于第二阈值,且所述翻译质量参数大于第三阈值,完成模型训练。
  17. 一种计算机可读存储介质,当所述计算机可读存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行如下步骤:
    获取视频文件和第一文本信息;
    将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;
    将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;
    将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;
    基于所述视频处理模型输出所述第一视频段和所述第二文本信息。
  18. 一种计算机可读存储介质,当所述计算机可读存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行如下步骤:
    将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;
    基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损失参数;
    基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相似度;
    基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;
    基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
  19. 一种计算机程序产品,包括计算机指令,所述计算机指令被处理器执行时实现如下步骤:
    获取视频文件和第一文本信息;
    将所述视频文件和所述第一文本信息输入至视频处理模型的识别网络,得到所述第一文本信息匹配的第一视频段;
    将所述第一视频段输入至所述视频处理模型的特征提取网络,得到所述第一视频段的视频特征;
    将所述第一视频段的视频特征输入至所述视频处理模型的翻译网络,得到第一视频段的第二文本信息,所述第二文本信息用于描述所述第一视频段的视频内容;
    基于所述视频处理模型输出所述第一视频段和所述第二文本信息。
  20. 一种计算机程序产品,包括计算机指令,所述计算机指令被处理器执行时实现如下步骤:
    将视频样本输入至视频处理模型的识别网络,得到第三文本信息匹配的第二视频段,所述视频样本中标注有所述第三文本信息;
    基于所述第二视频段和所述视频样本中标注的第三视频段,确定所述识别网络的识别损 失参数;
    基于所述第二视频段和所述第三文本信息,确定第一相似度和所述第二视频段的视频特征,所述第一相似度指示所述第二视频段和所述第三文本信息之间的相似度;
    基于所述第二视频段的视频特征和所述第三文本信息,确定所述视频处理模型的翻译网络的翻译质量参数,所述翻译质量参数表征所述翻译网络将视频特征翻译为文本信息的质量;
    基于所述识别损失参数、所述第一相似度和所述翻译质量参数,调整所述视频处理模型的参数。
PCT/CN2021/114059 2020-12-22 2021-08-23 视频处理方法及电子设备 WO2022134634A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21887878.3A EP4047944A4 (en) 2020-12-22 2021-08-23 VIDEO PROCESSING METHOD AND ELECTRONIC DEVICE
US17/842,654 US11651591B2 (en) 2020-12-22 2022-06-16 Video timing labeling method, electronic device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011526967.5 2020-12-22
CN202011526967.5A CN112261491B (zh) 2020-12-22 2020-12-22 视频时序标注方法、装置、电子设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/842,654 Continuation US11651591B2 (en) 2020-12-22 2022-06-16 Video timing labeling method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2022134634A1 true WO2022134634A1 (zh) 2022-06-30

Family

ID=74225296

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/114059 WO2022134634A1 (zh) 2020-12-22 2021-08-23 视频处理方法及电子设备

Country Status (4)

Country Link
US (1) US11651591B2 (zh)
EP (1) EP4047944A4 (zh)
CN (1) CN112261491B (zh)
WO (1) WO2022134634A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112261491B (zh) 2020-12-22 2021-04-16 北京达佳互联信息技术有限公司 视频时序标注方法、装置、电子设备及存储介质
CN113553858B (zh) * 2021-07-29 2023-10-10 北京达佳互联信息技术有限公司 文本向量表征模型的训练和文本聚类
CN113590881B (zh) * 2021-08-09 2024-03-19 北京达佳互联信息技术有限公司 视频片段检索方法、视频片段检索模型的训练方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9129158B1 (en) * 2012-03-05 2015-09-08 Hrl Laboratories, Llc Method and system for embedding visual intelligence
CN105677735A (zh) * 2015-12-30 2016-06-15 腾讯科技(深圳)有限公司 一种视频搜索方法及装置
US20170300150A1 (en) * 2011-11-23 2017-10-19 Avigilon Fortress Corporation Automatic event detection, text generation, and use thereof
WO2018127627A1 (en) * 2017-01-06 2018-07-12 Nokia Technologies Oy Method and apparatus for automatic video summarisation
CN110321958A (zh) * 2019-07-08 2019-10-11 北京字节跳动网络技术有限公司 神经网络模型的训练方法、视频相似度确定方法
CN112261491A (zh) * 2020-12-22 2021-01-22 北京达佳互联信息技术有限公司 视频时序标注方法、装置、电子设备及存储介质

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030076413A1 (en) * 2001-10-23 2003-04-24 Takeo Kanade System and method for obtaining video of multiple moving fixation points within a dynamic scene
US20070106685A1 (en) * 2005-11-09 2007-05-10 Podzinger Corp. Method and apparatus for updating speech recognition databases and reindexing audio and video content using the same
US8005356B2 (en) * 2007-02-13 2011-08-23 Media Global Links Co., Ltd. Video transmission system of a ring network
US8756233B2 (en) * 2010-04-16 2014-06-17 Video Semantics Semantic segmentation and tagging engine
US20120207207A1 (en) * 2011-02-10 2012-08-16 Ofer Peer Method, system and associated modules for transmission of complimenting frames
US9064170B2 (en) * 2012-01-11 2015-06-23 Nokia Technologies Oy Method, apparatus and computer program product for estimating image parameters
EA201492098A1 (ru) * 2012-05-14 2015-04-30 Лука Россато Кодирование и декодирование на основании смешивания последовательностей выборок с течением времени
US9154761B2 (en) * 2013-08-19 2015-10-06 Google Inc. Content-based video segmentation
US9807291B1 (en) * 2014-01-29 2017-10-31 Google Inc. Augmented video processing
US9848132B2 (en) * 2015-11-24 2017-12-19 Gopro, Inc. Multi-camera time synchronization
US10229719B1 (en) * 2016-05-09 2019-03-12 Gopro, Inc. Systems and methods for generating highlights for a video
WO2018040059A1 (en) * 2016-09-02 2018-03-08 Microsoft Technology Licensing, Llc Clip content categorization
US10979761B2 (en) * 2018-03-14 2021-04-13 Huawei Technologies Co., Ltd. Intelligent video interaction method
CN109905772B (zh) * 2019-03-12 2022-07-22 腾讯科技(深圳)有限公司 视频片段查询方法、装置、计算机设备及存储介质
AU2020326739A1 (en) * 2019-08-08 2022-02-03 Dejero Labs Inc. Systems and methods for managing data packet communications
CN110751224B (zh) * 2019-10-25 2022-08-05 Oppo广东移动通信有限公司 视频分类模型的训练方法、视频分类方法、装置及设备
CN111222500B (zh) * 2020-04-24 2020-08-04 腾讯科技(深圳)有限公司 一种标签提取方法及装置
CN111914644B (zh) * 2020-06-30 2022-12-09 西安交通大学 一种基于双模态协同的弱监督时序动作定位方法及系统
CN111950393B (zh) * 2020-07-24 2021-05-04 杭州电子科技大学 一种基于边界搜索智能体的时序动作片段分割方法
US11128832B1 (en) * 2020-08-03 2021-09-21 Shmelka Klein Rule-based surveillance video retention system
CN112101329B (zh) * 2020-11-19 2021-03-30 腾讯科技(深圳)有限公司 一种基于视频的文本识别方法、模型训练的方法及装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170300150A1 (en) * 2011-11-23 2017-10-19 Avigilon Fortress Corporation Automatic event detection, text generation, and use thereof
US9129158B1 (en) * 2012-03-05 2015-09-08 Hrl Laboratories, Llc Method and system for embedding visual intelligence
CN105677735A (zh) * 2015-12-30 2016-06-15 腾讯科技(深圳)有限公司 一种视频搜索方法及装置
WO2018127627A1 (en) * 2017-01-06 2018-07-12 Nokia Technologies Oy Method and apparatus for automatic video summarisation
CN110321958A (zh) * 2019-07-08 2019-10-11 北京字节跳动网络技术有限公司 神经网络模型的训练方法、视频相似度确定方法
CN112261491A (zh) * 2020-12-22 2021-01-22 北京达佳互联信息技术有限公司 视频时序标注方法、装置、电子设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4047944A4 *

Also Published As

Publication number Publication date
CN112261491B (zh) 2021-04-16
US11651591B2 (en) 2023-05-16
EP4047944A1 (en) 2022-08-24
US20220327827A1 (en) 2022-10-13
CN112261491A (zh) 2021-01-22
EP4047944A4 (en) 2023-06-14

Similar Documents

Publication Publication Date Title
US11551726B2 (en) Video synthesis method terminal and computer storage medium
CN107885533B (zh) 管理组件代码的方法及装置
WO2022134634A1 (zh) 视频处理方法及电子设备
JP2021524957A (ja) 画像処理方法およびその、装置、端末並びにコンピュータプログラム
CN108304506B (zh) 检索方法、装置及设备
WO2019128593A1 (zh) 搜索音频的方法和装置
CN111382624A (zh) 动作识别方法、装置、设备及可读存储介质
WO2022057435A1 (zh) 基于搜索的问答方法及存储介质
CN111127509B (zh) 目标跟踪方法、装置和计算机可读存储介质
CN112052897B (zh) 多媒体数据拍摄方法、装置、终端、服务器及存储介质
CN110933468A (zh) 播放方法、装置、电子设备及介质
CN113918767A (zh) 视频片段定位方法、装置、设备及存储介质
CN109917988B (zh) 选中内容显示方法、装置、终端及计算机可读存储介质
CN109961802B (zh) 音质比较方法、装置、电子设备及存储介质
CN109547847B (zh) 添加视频信息的方法、装置及计算机可读存储介质
CN110991445A (zh) 竖排文字识别方法、装置、设备及介质
CN111753606A (zh) 一种智能模型的升级方法及装置
CN108831423B (zh) 提取音频数据中主旋律音轨的方法、装置、终端及存储介质
CN111611414A (zh) 车辆检索方法、装置及存储介质
CN113593521B (zh) 语音合成方法、装置、设备及可读存储介质
CN113361376B (zh) 获取视频封面的方法、装置、计算机设备及可读存储介质
CN114817709A (zh) 排序方法、装置、设备及计算机可读存储介质
CN113469322B (zh) 确定模型的可执行程序的方法、装置、设备及存储介质
CN114360494A (zh) 韵律标注方法、装置、计算机设备及存储介质
CN111063372B (zh) 确定音高特征的方法、装置、设备及存储介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021887878

Country of ref document: EP

Effective date: 20220512

NENP Non-entry into the national phase

Ref country code: DE