CN113591570A - Video processing method and device, electronic equipment and storage medium - Google Patents
Video processing method and device, electronic equipment and storage medium
- Publication number
- CN113591570A (application number CN202110721821.4A)
- Authority
- CN
- China
- Prior art keywords
- video
- feature
- boundary
- information
- time sequence
- Prior art date
- 2021-06-28
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/241: Pattern recognition; analysing; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N20/00: Machine learning
Abstract
The application discloses a video processing method, a video processing device, electronic equipment and a storage medium, relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and can be particularly applied in video analysis scenarios. The specific implementation scheme is as follows: acquiring a video; performing feature extraction on the video to acquire feature information of the video; calling a boundary prediction model to perform time sequence boundary prediction on the feature information so as to generate a time sequence boundary of the video; and segmenting the video according to the time sequence boundary to generate a video segment of the video. In this way, the accuracy of time sequence boundary prediction can be improved, and the recall rate of time sequence nominations in un-segmented videos can be increased.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, can be particularly applied in video analysis scenarios, and specifically relates to a video processing method and device, electronic equipment and a storage medium.
Background
Time sequence action positioning takes a section of un-segmented video as input and locates behavior segments according to the content of the video, where each behavior segment includes a start time and an end time, and the produced behavior segments are called time sequence nominations (proposals). Time sequence action positioning is one of the most important and challenging problems in the field of computer vision video understanding, owing to its huge application potential in video collection generation, video recommendation, retrieval and the like.
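For illustration only (the names below are not taken from the patent), a time sequence nomination produced by such a method can be represented simply by a start time, an end time, and a confidence score:

```python
# Minimal sketch of a temporal proposal (time sequence nomination); the class
# name and fields are illustrative assumptions, not terms from the patent.
from dataclasses import dataclass

@dataclass
class TemporalProposal:
    start: float       # start time of the behavior segment, in seconds
    end: float         # end time of the behavior segment, in seconds
    confidence: float  # nomination confidence score in [0, 1]

# e.g. a proposal covering seconds 12.0-18.5 of an un-segmented video
proposal = TemporalProposal(start=12.0, end=18.5, confidence=0.87)
```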
An important aspect of evaluating a time sequence action positioning method is the average recall rate. Most current methods aim to generate flexible and accurate timing boundaries and reliable nomination confidence scores.
In the related art, methods related to deep learning mainly fall into two categories:
first, methods based on regression over predefined anchor boxes, which generate a large number of candidate time sequence nominations that may contain behaviors and then select the correct candidate nominations through a classification task;
second, methods that model the time sequence relations between video frames, predict boundaries using local details around each boundary, and generate time sequence nominations by combining boundaries.
Disclosure of Invention
The application provides a video processing method, a video processing device, electronic equipment and a storage medium.
According to an aspect of the present application, there is provided a video processing method including:
acquiring a video;
performing feature extraction on the video to acquire feature information of the video;
calling a boundary prediction model to perform time sequence boundary prediction on the feature information so as to generate a time sequence boundary of the video; and
and segmenting the video according to the time sequence boundary to generate a video segment of the video.
According to another aspect of the present application, there is provided a video processing apparatus including:
the first acquisition module is used for acquiring a video;
the second acquisition module is used for extracting the characteristics of the video to acquire the characteristic information of the video;
the first generation module is used for calling a boundary prediction model to perform time sequence boundary prediction on the feature information so as to generate a time sequence boundary of the video; and
and the second generation module is used for segmenting the video according to the time sequence boundary so as to generate a video segment of the video.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a video processing method as described in embodiments of one aspect above.
According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing thereon a computer program for causing a computer to perform the video processing method according to the embodiment of the above aspect.
According to another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the video processing method of an embodiment of the above-described aspect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic flowchart of a video processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another video processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another video processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of global information for generating a video according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another video processing method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application; and
fig. 7 is a block diagram of an electronic device of a video processing method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A video processing method, an apparatus, an electronic device, and a storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it covers both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies include computer vision technology, speech recognition technology, natural language processing technology, deep learning, big data processing technology, knowledge graph technology, and the like.
Deep learning is a new research direction in the field of machine learning. Deep learning learns the intrinsic laws and representation levels of sample data, and the information obtained in the learning process is very helpful for interpreting data such as text, images, and sounds. Its ultimate goal is to enable machines to have human-like analysis and learning capabilities and to recognize data such as text, images, and sounds. Deep learning is a complex machine learning algorithm, and its achievements in speech and image recognition far exceed those of earlier related techniques.
Computer vision is the science of studying how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as identification, tracking, and measurement on a target, and further performs image processing so that the processed image is more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, the theories and techniques of computer vision research attempt to build artificial intelligence systems that can acquire "information" from images or multidimensional data. The information referred to here is information, in Shannon's sense, that can be used to help make a "decision". Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make an artificial system "perceive" from images or multidimensional data.
The video processing method provided in the embodiment of the present application may be executed by an electronic device, where the electronic device may be a Personal Computer (PC), a tablet Computer, a palmtop Computer, or the like, and is not limited herein.
In the embodiment of the application, the electronic device can be provided with a processing component, a storage component and a driving component. Optionally, the driving component and the processing component may be integrated, the storage component may store an operating system, an application program, or other program modules, and the processing component implements the video processing method provided in the embodiment of the present application by executing the application program stored in the storage component.
Fig. 1 is a schematic flowchart of a video processing method according to an embodiment of the present disclosure.
The video processing method of the embodiment of the application can be further executed by the video processing device provided by the embodiment of the application, and the device can be configured in electronic equipment to extract the characteristics of the acquired video, acquire the characteristic information of the video, call a boundary prediction model to perform time sequence boundary prediction on the characteristic information to generate a time sequence boundary of the video, and segment the video according to the time sequence boundary to generate a video segment of the video, so that the accuracy of the time sequence boundary prediction can be improved.
As a possible situation, the video processing method in the embodiment of the present application may also be executed at a server, where the server may be a cloud server, and the video processing method may be executed at a cloud end.
As shown in fig. 1, the video processing method may include the following steps.
Step 101, acquiring a video.
In the embodiment of the present application, the electronic device may obtain the video in multiple ways. First, the electronic device may obtain the video from a video providing device, for example, by downloading the video through a Uniform Resource Locator (URL) corresponding to the video, where the video providing device may include a digital versatile disc player, a video and audio compact disc player, a server, a USB disk, an intelligent hard disk, a mobile phone, and the like. Second, the electronic device may store videos and acquire the target video from the videos it stores. Third, the electronic device may shoot a video through a built-in camera. No limitation is imposed here.
The video may also be a video downloaded by the user through an associated video website, as one possibility.
It should be noted that the video described in this embodiment may be a target video on which a user wants to perform time sequence action positioning to produce behavior segments (i.e., video segments).
Step 102, performing feature extraction on the video to acquire feature information of the video.
In the embodiment of the application, the feature extraction can be performed on the video according to a preset feature extraction algorithm to obtain the feature information of the video, wherein the preset feature extraction algorithm can be calibrated according to the actual situation.
Specifically, after the electronic device acquires the video, feature extraction may be performed on the video according to a preset feature extraction algorithm to acquire feature information of the video. Wherein the feature information may be feature sequence information of the video.
As a possible scenario, after the electronic device acquires the video, feature extraction may be performed on the video through a feature extraction tool (e.g., a plug-in) to acquire feature information of the video.
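As a hedged sketch of what such a feature extraction step might look like (the patent does not name a specific extractor; the backbone, preprocessing, and feature dimension below are assumptions), a pretrained 2D backbone can be applied to sampled frames to obtain a feature sequence:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

def extract_features(frames, device="cpu"):
    """Sketch: turn a list of RGB frames (H x W x 3 uint8 arrays) into a
    (num_frames x 512) feature sequence. The patent only requires *some*
    feature extraction model; ResNet-18 and this preprocessing are assumptions."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()      # drop the classifier head, keep 512-d features
    backbone.eval().to(device)

    preprocess = T.Compose([
        T.ToPILImage(), T.Resize(256), T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in frames]).to(device)
        feats = backbone(batch)            # shape: (num_frames, 512)
    return feats
```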
Step 103, calling a boundary prediction model to perform time sequence boundary prediction on the feature information so as to generate a time sequence boundary of the video.
It should be noted that the boundary prediction model described in this embodiment may be trained in advance and pre-stored in a storage space of the electronic device to facilitate retrieval of the application, where the storage space is not limited to an entity-based storage space, such as a hard disk, and the storage space may also be a storage space of a network hard disk connected to the electronic device (cloud storage space).
The training and the generation of the boundary prediction model can be performed by a related server, the server can be a cloud server or a host of a computer, a communication connection is established between the server and the electronic device capable of executing the video processing method provided by the application embodiment, and the communication connection can be at least one of a wireless network connection and a wired network connection. The server can send the trained boundary prediction model to the electronic device so that the electronic device can call the trained boundary prediction model when needed, and therefore the computing stress of the electronic device is greatly reduced.
Specifically, after acquiring the feature information of the video, the electronic device may first use a boundary prediction model in its own storage space, and then input the feature information into the boundary prediction model, so as to perform time-series boundary prediction on the feature information through the boundary prediction model to obtain a time-series boundary of the video output (generated) by the boundary prediction model.
Step 104, segmenting the video according to the time sequence boundary to generate a video segment of the video.
In this embodiment, there may be multiple timing boundaries, and one video segment is determined by two timing boundaries, a start boundary and an end boundary, that is, the timing boundaries corresponding to the start time and the end time respectively. In other words, the number of timing boundaries may be multiple and may be even.
Specifically, after obtaining the timing boundary of the video, the electronic device may segment the video according to the timing boundary to generate a video segment of the video.
For example, assume there are multiple timing boundaries. After obtaining the multiple timing boundaries of the video, the electronic device may analyze them to determine multiple start-time/end-time boundary pairs, and may then segment the video according to each pair to generate multiple video segments of the video.
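As a hedged sketch of this pairing step (the patent does not spell out an exact matching rule; the greedy nearest-following-end rule below is an assumption), start and end timing boundaries can be paired into video segments as follows:

```python
def boundaries_to_segments(start_times, end_times):
    """Sketch: pair predicted start/end timing boundaries into (start, end)
    video segments. The greedy nearest-following-end rule is an assumption,
    not a rule stated in the patent."""
    segments = []
    ends = sorted(end_times)
    for s in sorted(start_times):
        # find the first end boundary that comes after this start boundary
        later = [e for e in ends if e > s]
        if later:
            segments.append((s, later[0]))
    return segments

# example: start boundaries at 3.0 s and 9.5 s, end boundaries at 7.2 s and 15.0 s
print(boundaries_to_segments([3.0, 9.5], [7.2, 15.0]))  # [(3.0, 7.2), (9.5, 15.0)]
```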
Therefore, relevant personnel can perform video collection generation, video recommendation, retrieval and the like based on the video processing method of the embodiment of the application.
In the embodiment of the application, a video is obtained, feature extraction is performed on the video to obtain feature information of the video, then a boundary prediction model is called to perform time sequence boundary prediction on the feature information to generate a time sequence boundary of the video, and finally the video is segmented according to the time sequence boundary to generate a video segment of the video. Therefore, the accuracy of time sequence boundary prediction can be improved, and the recall rate of time sequence nomination in the un-segmented video is increased.
To clearly illustrate the above embodiment, in an embodiment of the present application, as shown in fig. 2, performing feature extraction on the video to obtain feature information of the video may include the following steps.
Step 201, acquiring a feature extraction model.
It should be noted that the feature extraction model described in this embodiment may be trained in advance and pre-stored in the storage space of the electronic device to facilitate retrieval by the application.
Step 202, inputting the video to the feature extraction model.
Step 203, performing feature extraction on the video through the feature extraction model to acquire the feature information of the video.
Specifically, after acquiring a video, the electronic device may call (acquire) a feature extraction model from its own storage space, and input the video to the feature extraction model, where the feature extraction model performs feature extraction on the video, thereby outputting feature information of the video. Therefore, the accuracy of recognition can be improved by the aid of the feature extraction model to assist in extracting the video feature information.
Further, in an embodiment of the present application, the boundary prediction model may be a boundary prediction model based on a Transformer mechanism. As shown in fig. 3, it performs time sequence boundary prediction on the feature information to generate a time sequence boundary of the video through the following steps.
Step 301, generating a plurality of feature vectors according to the feature information.
Specifically, after the electronic device inputs the feature information of the video into the boundary prediction model, the boundary prediction model can map the feature information into different vectors (i.e., a plurality of feature vectors) through the self-attention mechanism inside the Transformer.
Step 302, generating global information of the video according to the plurality of feature vectors.
In the embodiment of the application, the plurality of feature vectors can be fused by calculating the similarity between different temporal positions, so that the global information of the video is obtained.
To clarify the above embodiment, in an embodiment of the present application, generating the global information of the video according to the plurality of feature vectors may include: acquiring the dimension of the first feature vector; and generating the global information of the video according to the first feature vector, the second feature vector, the third feature vector, and the dimension.
It should be noted that the first feature vector, the second feature vector, and the third feature vector described in this embodiment may be a Query vector, a Key vector, and a Value vector, respectively.
Specifically, referring to fig. 4, it is assumed that the feature vectors include a first feature vector (Query vector), a second feature vector (Key vector), and a third feature vector (Value vector). After the boundary prediction model obtains these feature vectors, the dimension of the first feature vector (Query vector) is obtained, the first feature vector (Query vector) and the second feature vector (Key vector) are multiplied to obtain a first intermediate value, the first intermediate value is divided by the dimension to obtain a second intermediate value, and the second intermediate value and the third feature vector (Value vector) are multiplied to obtain the global information of the video.
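A minimal sketch of this computation, assuming the Query, Key, and Value vectors form (T x d) matrices over T temporal positions; note that standard Transformer self-attention additionally applies a softmax and scales by the square root of the dimension, whereas the description above divides the product by the dimension itself, and the sketch follows the description:

```python
import torch

def global_information(query, key, value):
    """Sketch of the described computation: multiply the Query and Key vectors,
    divide by the dimension of the Query vector, then multiply by the Value vectors.
    query, key, value: tensors of shape (T, d) over T temporal positions."""
    d = query.shape[-1]                                   # dimension of the first (Query) feature vector
    first_intermediate = query @ key.transpose(-2, -1)    # (T, T) similarity between temporal positions
    second_intermediate = first_intermediate / d          # divide by the dimension
    return second_intermediate @ value                    # (T, d) global information

# standard scaled dot-product attention would instead be:
#   torch.softmax(query @ key.transpose(-2, -1) / d ** 0.5, dim=-1) @ value
```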
Step 303, generating a time sequence boundary of the video according to the global information.
Specifically, after the global information is calculated by the self-attention mechanism inside the Transformer, the boundary prediction model may perform time sequence boundary prediction according to the global information to generate the time sequence boundary of the video.
Therefore, the time sequence boundary prediction can be effectively carried out on the feature information of the video through the boundary prediction model, and the accuracy of the time sequence boundary prediction is improved.
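For illustration, a boundary prediction model based on the Transformer mechanism might be organized as in the sketch below; the layer sizes, number of heads, and per-time-step start/end score output are assumptions, since the patent only specifies that the model uses a Transformer to produce time sequence boundaries:

```python
import torch
import torch.nn as nn

class BoundaryPredictor(nn.Module):
    """Sketch of a Transformer-based boundary prediction model: a Transformer
    encoder over the feature sequence, followed by per-time-step start/end
    boundary scores. All hyper-parameters here are assumptions."""
    def __init__(self, feat_dim=512, num_layers=2, num_heads=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, 2)    # start score and end score per time step

    def forward(self, features):              # features: (batch, T, feat_dim)
        global_info = self.encoder(features)  # self-attention fuses all time steps
        return torch.sigmoid(self.head(global_info))  # (batch, T, 2) boundary scores
```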
Further, in one embodiment of the present application, as shown in fig. 5, the boundary prediction model may be generated by the following steps.
Step 501, obtaining a sample video, and obtaining labels of a plurality of sample video segments in the sample video.
In the embodiment of the present application, there are multiple ways to obtain the sample video, where the sample video may be obtained by downloading the video through a related video website, and the sample video may also be created artificially, for example, by taking a video through a video camera to obtain the sample video, which is not limited herein.
In this embodiment of the application, the tag may be marked in the sample video by a relevant person and pre-stored in a storage space of the electronic device, so as to facilitate retrieval of the application.
Step 502, performing feature extraction on the sample video to obtain sample feature information of the sample video.
In the embodiment of the present application, feature extraction may be performed on the sample video according to the preset feature extraction algorithm described above, so as to obtain the sample feature information of the sample video.
Specifically, after the sample video is acquired, the labels of the sample video segments in the sample video can be acquired from the storage space of the sample video. And then, extracting the characteristics of the sample video according to the preset characteristic extraction algorithm to obtain the sample characteristic information of the sample video.
As a possible scenario, after the labels of the plurality of sample video segments in the sample video are obtained, feature extraction may be further performed on the sample video according to the above feature extraction model to obtain sample feature information of the sample video.
As another possible scenario, after the labels of the plurality of sample video segments in the sample video are obtained, feature extraction may be performed on the sample video through a feature extraction tool (e.g., a plug-in) to obtain sample feature information of the sample video.
Step 503, inputting the sample feature information into the boundary prediction model to generate a predicted boundary score.
Step 504, generating a loss value according to the predicted boundary score and the labels, and training the boundary prediction model according to the loss value.
Specifically, after the sample feature information of the sample video is acquired, the sample feature information may be input into the boundary prediction model to generate a predicted boundary score; a loss value is then generated according to the predicted boundary score and the above-mentioned labels, and the boundary prediction model is trained according to the loss value (for example, stochastic gradient descent (SGD) is used to optimize the loss value, and the network weight layers and scaling parameters in the boundary prediction model are continuously updated until the loss value converges and training stops), so as to optimize the boundary prediction model and improve the accuracy of recognition.
It should be noted that the loss value described in this embodiment can be calculated by cross entropy.
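A hedged sketch of this training step, reusing the BoundaryPredictor sketch above; the binary form of the cross-entropy loss, the label encoding, and the optimizer settings are assumptions consistent with the description (a cross-entropy loss value optimized with SGD until convergence):

```python
import torch

def train_boundary_model(model, sample_features, boundary_labels,
                         epochs=10, lr=0.01):
    """model: e.g. the BoundaryPredictor sketch above (an assumption).
    sample_features: (N, T, feat_dim) sample feature information.
    boundary_labels: (N, T, 2) start/end boundary indicators in {0, 1},
    derived from the labelled sample video segments."""
    criterion = torch.nn.BCELoss()                           # cross entropy between scores and labels
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # stochastic gradient descent
    for _ in range(epochs):
        optimizer.zero_grad()
        predicted_scores = model(sample_features)            # predicted boundary scores
        loss = criterion(predicted_scores, boundary_labels)  # loss value from scores and labels
        loss.backward()
        optimizer.step()                                     # update weights according to the loss
    return model
```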
According to the video processing method, firstly, a video is obtained, feature extraction is carried out on the video to obtain feature information of the video, then, a boundary prediction model is called to carry out time sequence boundary prediction on the feature information to generate a time sequence boundary of the video, and finally, the video is segmented according to the time sequence boundary to generate a video segment of the video. Therefore, the accuracy of time sequence boundary prediction can be improved, and the recall rate of time sequence nomination in the un-segmented video is increased.
Fig. 6 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure.
The video processing device can be configured in the electronic equipment to extract the characteristics of the acquired video, acquire the characteristic information of the video, call the boundary prediction model to perform time sequence boundary prediction on the characteristic information to generate a time sequence boundary of the video, and segment the video according to the time sequence boundary to generate a video segment of the video, so that the accuracy of the time sequence boundary prediction can be improved.
As shown in fig. 6, the video processing apparatus 600 may include: a first acquisition module 610, a second acquisition module 620, a first generation module 630, and a second generation module 640.
The first obtaining module 610 is configured to obtain a video.
In this embodiment of the present application, the first obtaining module 610 may obtain the video in multiple ways. First, the first obtaining module 610 may obtain the video from a video providing device; for example, the electronic device may download the video through a Uniform Resource Locator (URL) corresponding to the video, where the video providing device may include a digital versatile disc player, a video and audio compact disc player, a server, a USB disk, an intelligent hard disk, a mobile phone, and the like. Second, the electronic device may store videos, and the first obtaining module 610 may obtain the target video from the videos stored in the electronic device. Third, the first obtaining module 610 may shoot a video through a camera built into the electronic device. No limitation is imposed here.
The video may also be a video downloaded by the user through an associated video website, as one possibility.
It should be noted that the video described in this embodiment may be a target video on which a user wants to perform time sequence action positioning to produce behavior segments (i.e., video segments).
The second obtaining module 620 is configured to perform feature extraction on the video to obtain feature information of the video.
In the embodiment of the application, the feature extraction can be performed on the video according to a preset feature extraction algorithm to obtain the feature information of the video, wherein the preset feature extraction algorithm can be calibrated according to the actual situation.
Specifically, after the first obtaining module 610 obtains the video, the second obtaining module 620 may perform feature extraction on the video according to a preset feature extraction algorithm to obtain feature information of the video. Wherein the feature information may be feature sequence information of the video.
As a possible scenario, after the video is acquired, the second acquiring module 620 may further perform feature extraction on the video through a feature extraction tool (e.g., a plug-in) to acquire feature information of the video.
The first generation module 630 is configured to invoke a boundary prediction model to perform temporal boundary prediction on the feature information to generate a temporal boundary of the video.
It should be noted that the boundary prediction model described in this embodiment may be trained in advance and pre-stored in a storage space of the electronic device to facilitate retrieval of the application, where the storage space is not limited to an entity-based storage space, such as a hard disk, and the storage space may also be a storage space of a network hard disk connected to the electronic device (cloud storage space).
The training and the generation of the boundary prediction model can be performed by a related server, the server can be a cloud server or a host of a computer, and a communication connection is established between the server and an electronic device which can be provided with the video processing device provided by the application embodiment, wherein the communication connection can be at least one of a wireless network connection and a wired network connection. The server can send the trained boundary prediction model to the electronic device so that the electronic device can call the trained boundary prediction model when needed, and therefore the computing stress of the electronic device is greatly reduced.
Specifically, after the second obtaining module 620 obtains the feature information of the video, the first generating module 630 may first obtain the boundary prediction model from its own storage space, and then input the feature information to the boundary prediction model, so as to perform the temporal boundary prediction on the feature information through the boundary prediction model to obtain the temporal boundary of the video output (generated) by the boundary prediction model.
The second generation module 640 is configured to segment the video according to the timing boundary to generate a video segment of the video.
In this embodiment, there may be multiple timing boundaries, and one video segment is determined by two timing boundaries, a start boundary and an end boundary, that is, the timing boundaries corresponding to the start time and the end time respectively. In other words, the number of timing boundaries may be multiple and may be even.
Specifically, after the first generation module 630 obtains the timing boundary of the video, the second generation module 640 may segment the video according to the timing boundary to generate a video segment of the video.
For example, assume there are multiple timing boundaries. After the first generation module 630 obtains the multiple timing boundaries of the video, the second generation module 640 may analyze them to determine multiple start-time/end-time boundary pairs, and may then segment the video according to each pair to generate multiple video segments of the video.
In the embodiment of the application, a video is obtained through a first obtaining module, feature extraction is carried out on the video through a second obtaining module so as to obtain feature information of the video, then a boundary prediction model is called through a first generating module to carry out time sequence boundary prediction on the feature information so as to generate a time sequence boundary of the video, and finally the video is segmented through a second generating module according to the time sequence boundary so as to generate a video segment of the video. Therefore, the accuracy of time sequence boundary prediction can be improved, and the recall rate of time sequence nomination in the un-segmented video is increased.
In an embodiment of the present application, the second obtaining module 620 is specifically configured to: acquiring a feature extraction model; inputting the video to a feature extraction model; and performing feature extraction on the video through a feature extraction model to obtain feature information of the video.
In an embodiment of the present application, the boundary prediction model may be a boundary prediction model based on a Transformer mechanism, and as shown in fig. 6, the first generating module 630 may include: a first generation unit 631, a second generation unit 632, and a third generation unit 633.
Wherein the first generating unit 631 is configured to generate a plurality of feature vectors from the feature information.
The second generating unit 632 is configured to generate global information of the video according to the plurality of feature vectors.
The third generating unit 633 is configured to generate a timing boundary of the video according to the global information.
In an embodiment of the present application, the plurality of feature vectors may include a first feature vector, a second feature vector, and a third feature vector, and the second generating unit 632 is specifically configured to: obtaining the dimension of the first feature vector; and generating global information of the video according to the first feature vector, the second feature vector, the third feature vector and the dimension.
In an embodiment of the present application, as shown in fig. 6, the video processing apparatus 600 may further include: a training module 650, wherein the training module 650 is configured to generate the boundary prediction model by: obtaining a sample video and obtaining labels of a plurality of sample video fragments in the sample video; performing feature extraction on the sample video to obtain sample feature information of the sample video; inputting sample feature information into a boundary prediction model to generate a predicted boundary score; and generating a loss value according to the predicted boundary score and the label, and training a boundary prediction model according to the loss value.
It should be noted that the foregoing explanation on the embodiment of the video processing method is also applicable to the video processing apparatus of this embodiment, and is not repeated here.
The video processing device of the embodiment of the application acquires a video through the first acquisition module, performs feature extraction on the video through the second acquisition module to acquire feature information of the video, calls the boundary prediction model through the first generation module to perform time sequence boundary prediction on the feature information to generate a time sequence boundary of the video, and finally segments the video through the second generation module according to the time sequence boundary to generate a video segment of the video. Therefore, the accuracy of time sequence boundary prediction can be improved, and the recall rate of time sequence nomination in the un-segmented video is increased.
In the technical solution of the present application, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (13)
1. A video processing method, comprising:
acquiring a video;
performing feature extraction on the video to acquire feature information of the video;
calling a boundary prediction model to perform time sequence boundary prediction on the feature information so as to generate a time sequence boundary of the video; and
and segmenting the video according to the time sequence boundary to generate a video segment of the video.
2. The method of claim 1, wherein the extracting the features of the video to obtain the feature information of the video comprises:
acquiring a feature extraction model;
inputting the video to the feature extraction model;
and performing feature extraction on the video through the feature extraction model to obtain feature information of the video.
3. The method of claim 1, wherein the boundary prediction model is a Transformer mechanism-based boundary prediction model that performs temporal boundary prediction on the feature information to generate temporal boundaries of the video by:
generating a plurality of feature vectors according to the feature information;
generating global information of the video according to the plurality of feature vectors;
and generating a time sequence boundary of the video according to the global information.
4. The method of claim 3, wherein the plurality of feature vectors includes a first feature vector, a second feature vector, and a third feature vector, the generating global information for the video from the plurality of feature vectors comprising:
obtaining the dimension of the first feature vector;
and generating global information of the video according to the first feature vector, the second feature vector, the third feature vector and the dimension.
5. The method of claim 1, wherein the boundary prediction model is generated by:
obtaining a sample video and obtaining labels of a plurality of sample video fragments in the sample video;
performing feature extraction on the sample video to obtain sample feature information of the sample video;
inputting the sample feature information into the boundary prediction model to generate a predicted boundary score;
and generating a loss value according to the predicted boundary score and the label, and training the boundary prediction model according to the loss value.
6. A video processing apparatus comprising:
the first acquisition module is used for acquiring a video;
the second acquisition module is used for extracting the characteristics of the video to acquire the characteristic information of the video;
the first generation module is used for calling a boundary prediction model to perform time sequence boundary prediction on the feature information so as to generate a time sequence boundary of the video; and
and the second generation module is used for segmenting the video according to the time sequence boundary so as to generate a video segment of the video.
7. The apparatus according to claim 6, wherein the second obtaining module is specifically configured to:
acquiring a feature extraction model;
inputting the video to the feature extraction model;
and performing feature extraction on the video through the feature extraction model to obtain feature information of the video.
8. The apparatus of claim 6, wherein the boundary prediction model is a Transformer mechanism-based boundary prediction model, and the first generation module comprises:
a first generating unit configured to generate a plurality of feature vectors according to the feature information;
a second generating unit, configured to generate global information of the video according to the plurality of feature vectors;
and the third generating unit is used for generating a time sequence boundary of the video according to the global information.
9. The apparatus according to claim 8, wherein the plurality of feature vectors includes a first feature vector, a second feature vector, and a third feature vector, and the second generating unit is specifically configured to:
obtaining the dimension of the first feature vector;
and generating global information of the video according to the first feature vector, the second feature vector, the third feature vector and the dimension.
10. The apparatus of claim 6, further comprising:
a training module to generate the boundary prediction model by:
obtaining a sample video and obtaining labels of a plurality of sample video fragments in the sample video;
performing feature extraction on the sample video to obtain sample feature information of the sample video;
inputting the sample feature information into the boundary prediction model to generate a predicted boundary score;
and generating a loss value according to the predicted boundary score and the label, and training the boundary prediction model according to the loss value.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video processing method of any of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the video processing method according to any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements a video processing method according to any one of claims 1-5.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110721821.4A | 2021-06-28 | 2021-06-28 | Video processing method and device, electronic equipment and storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN113591570A (en) | 2021-11-02 |
Family
ID=78244840
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110721821.4A (CN113591570A, Pending) | Video processing method and device, electronic equipment and storage medium | 2021-06-28 | 2021-06-28 |
Country Status (1)

| Country | Link |
|---|---|
| CN | CN113591570A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110263215A (en) * | 2019-05-09 | 2019-09-20 | 众安信息技术服务有限公司 | A kind of video feeling localization method and system |
CN110188733A (en) * | 2019-06-10 | 2019-08-30 | 电子科技大学 | Timing behavioral value method and system based on the region 3D convolutional neural networks |
CN110852256A (en) * | 2019-11-08 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Method, device and equipment for generating time sequence action nomination and storage medium |
CN112804558A (en) * | 2021-04-14 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Video splitting method, device and equipment |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114390365A (en) * | 2022-01-04 | 2022-04-22 | 京东科技信息技术有限公司 | Method and apparatus for generating video information |
CN114390365B (en) * | 2022-01-04 | 2024-04-26 | 京东科技信息技术有限公司 | Method and apparatus for generating video information |
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination