CN116866662A - Video identification method, device, computer equipment and storage medium - Google Patents

Video identification method, device, computer equipment and storage medium

Info

Publication number
CN116866662A
Authority
CN
China
Prior art keywords
frame
sampling
feature map
video
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310926660.1A
Other languages
Chinese (zh)
Inventor
唐小林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Insta360 Innovation Technology Co Ltd
Original Assignee
Insta360 Innovation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Insta360 Innovation Technology Co Ltd filed Critical Insta360 Innovation Technology Co Ltd
Priority to CN202310926660.1A priority Critical patent/CN116866662A/en
Publication of CN116866662A publication Critical patent/CN116866662A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a video identification method, a video identification device, computer equipment and a storage medium. The method comprises the following steps: obtaining a fusion feature map according to the target feature maps of the current N sampling frames of the video to be identified, and determining a highlight prediction result of the video clip according to the fusion feature map, the video clip comprising the current N sampling frames; then determining the highlight video clips of the video to be identified according to the highlight prediction results of the video clips of each time. The N target feature maps comprise the target feature maps of a target sampling frame and of adjacent sampling frames of the target sampling frame; the target feature maps of the adjacent sampling frames include the target feature map of at least one sampling frame among the previous N sampling frames; each target feature map is determined based on at least two different pieces of information of the sampling frame; and N is greater than or equal to 2. By adopting the method, the highlight video clips in a video can be identified efficiently, saving time and labor.

Description

Video identification method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of video technologies, and in particular, to a video identification method, apparatus, computer device, and storage medium.
Background
With the rapid development of internet technology and the popularization of video shooting devices, a large amount of video is generated every day. However, people usually focus only on highlight content of specific significance in videos, and how to determine highlight clips from a large number of videos has become an important research topic for those skilled in the art.
At present, highlight video clips are usually identified manually, but this approach is inefficient and wastes a great deal of manpower and time. Therefore, how to identify highlight video clips in videos in a time-saving and labor-saving manner is an important research topic for those skilled in the art.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video recognition method, apparatus, computer device, and storage medium that can save time and effort.
In a first aspect, the present application provides a video recognition method. The method comprises the following steps:
obtaining a fusion feature map according to the target feature maps of the current N sampling frames of the video to be identified; the N target feature maps comprise the target feature maps of a target sampling frame and of adjacent sampling frames of the target sampling frame, the target feature maps of the adjacent sampling frames comprise the target feature map of at least one sampling frame among the previous N sampling frames, each target feature map is determined based on at least two different pieces of information of the sampling frame, and N is greater than or equal to 2;
determining a highlight prediction result of the video clip according to the fusion feature map; the video clip comprises the current N sampling frames;
and determining the highlight video clips of the video to be identified according to the highlight prediction results of the video clips of each time.
In a second aspect, the application further provides a video identification device. The device comprises:
the acquisition module is used for obtaining a fusion feature map according to the target feature maps of the current N sampling frames of the video to be identified; the N target feature maps comprise the target feature maps of a target sampling frame and of adjacent sampling frames of the target sampling frame, the target feature maps of the adjacent sampling frames comprise the target feature map of at least one sampling frame among the previous N sampling frames, each target feature map is determined based on at least two different pieces of information of the sampling frame, and N is greater than or equal to 2;
the first determining module is used for determining a highlight prediction result of the video clip according to the fusion feature map; the video clip comprises the current N sampling frames;
and the second determining module is used for determining the highlight video clips of the video to be identified according to the highlight prediction results of the video clips of each time.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of any of the methods described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the methods described above.
According to the video identification method, device, computer equipment and storage medium, a fusion feature map is obtained from the target feature maps of the current N sampling frames of the video to be identified, and a highlight prediction result of the video clip is determined from the fusion feature map; the video clip comprises the current N sampling frames, and the highlight video clips of the video to be identified are then determined according to the highlight prediction results of the video clips of each time. Since each target feature map is determined based on at least two different pieces of information of a sampling frame, the determined target feature map carries relatively rich information. Because the N target feature maps comprise the target feature maps of the target sampling frame and of its adjacent sampling frames, and the target feature maps of the adjacent sampling frames include the target feature map of at least one sampling frame among the previous N sampling frames, determining the target feature maps realizes local early fusion. Obtaining the fusion feature map from the target feature maps of the current N sampling frames then realizes overall late fusion. By combining local early fusion with overall late fusion, the method balances feature-extraction efficiency and computational efficiency, and improves recognition accuracy. After the highlight prediction result of a video clip is obtained based on the fusion feature map, the highlight video clips of the video to be identified can be determined according to the highlight prediction results of the video clips, so that highlight video clips no longer need to be identified manually, saving time and labor.
Drawings
FIG. 1 is an application environment diagram of a video clip identification method in an embodiment of the present application;
FIG. 2 is a flow chart of a video recognition method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of obtaining a target feature map according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for determining an initial stitching feature map according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a process for obtaining an initial stitching feature map according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating another embodiment of determining an initial stitching feature map;
FIG. 7 is a schematic diagram of a process for obtaining an initial stitching feature map according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating another embodiment of determining an initial stitching feature map;
FIG. 9 is a schematic diagram of a process for obtaining an initial stitching feature map according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a method for determining a highlight video clip according to an embodiment of the present application;
FIG. 11 is a flowchart illustrating a method for determining a highlight prediction result according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a process for obtaining a highlight prediction result according to an embodiment of the present application;
FIG. 13 is a flowchart of a method for determining a highlight video clip according to an embodiment of the present application;
FIG. 14 is a schematic process diagram of a video recognition method according to an embodiment of the present application;
FIG. 15 is a block diagram of a video recognition apparatus according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Fig. 1 is an application environment diagram of a video clip identification method in an embodiment of the present application, and in an embodiment of the present application, a computer device is provided, where the computer device may be a terminal, and an internal structure diagram of the computer device may be as shown in fig. 1. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a video clip identification method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely a block diagram of part of the structure relevant to the present application and does not limit the computer device to which the present application may be applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
This embodiment is illustrated with the method applied to a terminal; it is understood that the method can also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the terminal and the server. The terminal may be, but is not limited to, a personal computer, a notebook computer, a smart phone, or a tablet computer. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
Fig. 2 is a flow chart of a video recognition method according to an embodiment of the present application, which can be applied to the computer device shown in fig. 1, and in one embodiment, as shown in fig. 2, the method includes the following steps:
s201, obtaining a fusion feature map according to a target feature map of N frame sampling frames of the current time of a video to be identified; the N target feature maps comprise target sampling frames and target feature maps of adjacent sampling frames of the target sampling frames, the target feature maps of the adjacent sampling frames comprise target feature maps of at least one sampling frame of the last N sampling frames, the target feature maps are determined based on at least two different information of the sampling frames, and N is more than or equal to 2.
In this embodiment, the video to be identified may be a video received by the computer device in real time, or may be a video stored in the computer device in advance. The computer device can sample the video to be identified at a preset sampling frequency to obtain the sampling frames of the video to be identified. The preset sampling frequency can be set as required, and this embodiment is not limited thereto. For example, assume that the video to be identified is 10 seconds long with 30 video frames per second, and that 10 sampling frames are sampled from every 30 video frames, so that 100 sampling frames are obtained from the video to be identified in total. For convenience of description, the 10 sampling frames within the 1st second are, in temporal order, sampling frame 1, sampling frame 2, sampling frame 3, …, sampling frame 10, where the time point of sampling frame 1 is earlier than the time points of sampling frame 2, sampling frame 3, …, sampling frame 10. The 10 sampling frames within the 2nd second are, in temporal order, sampling frame 11, sampling frame 12, sampling frame 13, …, sampling frame 20, and so on; that is, 100 sampling frames, namely sampling frame 1, sampling frame 2, sampling frame 3, …, sampling frame 100, are obtained from the video to be identified.
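The uniform sampling above can be sketched as follows (a hedged illustration only: the frame rate, the samples-per-second value, and the function name sample_frame_indices are assumptions for this example, not details fixed by the application):

def sample_frame_indices(total_frames: int, fps: int = 30, samples_per_second: int = 10):
    """Return indices of uniformly sampled frames (assumed sampling scheme).

    With a 10-second video at 30 fps and 10 samples per second this yields
    100 sampled frames, matching the example above.
    """
    step = fps // samples_per_second          # e.g. keep every 3rd video frame
    return list(range(0, total_frames, step))

# Example: 10 s * 30 fps = 300 video frames -> 100 sampling frames
indices = sample_frame_indices(total_frames=300)
assert len(indices) == 100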
The current N sampling frames may include a target sampling frame and adjacent sampling frames of the target sampling frame. Let N = 10. When the current time is the 1st time, the N sampling frames of the 1st time may include sampling frame 1, sampling frame 2, sampling frame 3, …, sampling frame 10. In the case where the current time is equal to 1, the target sampling frame may be any one of the N sampling frames. For example, when the current time is the first time and the target sampling frame is sampling frame 10, the adjacent sampling frames of sampling frame 10 include sampling frames 1 to 9. The N target feature maps then comprise the 10 target feature maps 1 to 10 corresponding to sampling frames 1 to 10.
In the case that the current time is greater than 1, the target sampling frame may be the frame following the last frame of the previous N sampling frames. For example, when the current time is the 2nd time, the target sampling frame may be sampling frame 11, and the computer device may take sampling frames 3 to 10 and sampling frame 12 as the adjacent sampling frames of sampling frame 11, so as to obtain the 10 target feature maps 3 to 12 corresponding to sampling frames 3 to 12; that is, the last frame of the 10 sampling frames of the 2nd time is sampling frame 12. Likewise, when the current time is the 3rd time, the target sampling frame may be sampling frame 13, that is, the frame following the last frame of the 10 sampling frames of the 2nd time. The computer device takes sampling frames 5 to 12 of the 2nd time and sampling frame 14 as the adjacent sampling frames of sampling frame 13, so as to obtain the target feature maps 5 to 14 corresponding to sampling frames 5 to 14.
Alternatively, when the current time is the 2nd time, the target sampling frame is sampling frame 11, and the computer device may take sampling frames 2 to 10 as the adjacent sampling frames of sampling frame 11, so as to obtain the 10 target feature maps 2 to 11 corresponding to sampling frames 2 to 11. When the current time is the 3rd time, the target sampling frame may be sampling frame 12, and the computer device may take sampling frames 3 to 11 as the adjacent sampling frames of sampling frame 12, so as to obtain the 10 target feature maps 3 to 12 corresponding to sampling frames 3 to 12, and so on, which will not be repeated here.
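The window bookkeeping just described can be sketched as follows (assumed helper frame_windows with a configurable stride; stride 2 reproduces the windows 1-10, 3-12, 5-14 of the first variant, and stride 1 reproduces the windows 1-10, 2-11, 3-12 of the second variant):

def frame_windows(total_sampled: int, n: int = 10, stride: int = 1):
    """Yield the sampling-frame numbers of each N-frame window (assumed sketch).

    Frame numbers are 1-based to match the description above.
    """
    start = 1
    while start + n - 1 <= total_sampled:
        yield list(range(start, start + n))
        start += stride

# Example with the 100 sampling frames from above and stride 2:
windows = list(frame_windows(100, n=10, stride=2))
# windows[0] == [1, ..., 10], windows[1] == [3, ..., 12], windows[2] == [5, ..., 14]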
The target feature map is determined based on at least two different pieces of information of the sampling frame. For example, the computer device may determine the target feature map based on at least two of the image information of the sampling frame, information obtained by convolving the image information, the association information between the target sampling frame and other video frames, or other feature information.
Still further, the computer device can obtain a fusion feature map from the target feature maps of the current N sampling frames of the video to be identified. Optionally, the computer device may perform fusion processing on the target feature maps of the current N sampling frames to obtain the fusion feature map.
Illustratively, when the current time is the first time, the computer device performs fusion processing on target feature maps 1 to 10 corresponding to sampling frames 1 to 10 to obtain fusion feature map 1. When the current time is the second time, the computer device performs fusion processing on target feature maps 3 to 12 corresponding to sampling frames 3 to 12 to obtain fusion feature map 2, and so on.
It can be seen that, when determining the target feature map of each of the N sampling frames, the computer device determines it according to at least two different pieces of information of the sampling frame, thereby realizing local early fusion of the N sampling frames. After determining the N target feature maps of the N sampling frames, the computer device obtains the fusion feature map from the N target feature maps, so that the process of obtaining the fusion feature map realizes overall late fusion of the N sampling frames.
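The application does not fix how the N target feature maps are combined into the fusion feature map; the PyTorch sketch below shows one hedged possibility (channel concatenation followed by a 1x1 convolution; the class name, channel count, and tensor shapes are assumptions):

import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Assumed late-fusion step: concatenate the N target feature maps
    along the channel axis and mix them with a 1x1 convolution."""

    def __init__(self, n_frames: int = 10, channels: int = 64):
        super().__init__()
        self.mix = nn.Conv2d(n_frames * channels, channels, kernel_size=1)

    def forward(self, target_feature_maps):  # list of N tensors [B, C, H, W]
        fused = torch.cat(target_feature_maps, dim=1)   # [B, N*C, H, W]
        return self.mix(fused)                          # fusion feature map

# Usage with dummy shapes (assumed): 10 target feature maps of size [1, 64, 28, 28]
maps = [torch.randn(1, 64, 28, 28) for _ in range(10)]
fusion_map = LateFusion()(maps)   # -> [1, 64, 28, 28]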
S202, determining a highlight prediction result of the video clip according to the fusion feature map; the video clip includes the current N sampling frames.
In this embodiment, after acquiring the fusion feature map, the computer device may determine the highlight prediction result of the current N sampling frames, that is, the highlight prediction result of the video clip. The highlight prediction result is used to indicate whether the video clip is a highlight or a non-highlight.
Alternatively, the computer device may input the fusion feature map into a trained prediction model, which outputs the highlight prediction result of the video clip. The prediction model may be determined according to a plurality of video clip samples and the highlight labels corresponding to the video clip samples, and may be a convolutional neural network (CNN), a recurrent neural network (RNN), or another deep learning or machine learning network.
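As a hedged stand-in for such a trained prediction model (the architecture, layer sizes, and class name are assumptions; the application only requires a model trained on video clip samples and their highlight labels), a small CNN head over the fusion feature map could look like this:

import torch
import torch.nn as nn

class HighlightPredictor(nn.Module):
    """Minimal assumed sketch of a prediction model: a small CNN head that
    maps the fusion feature map to a highlight probability."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),
        )

    def forward(self, fusion_map):                     # [B, C, H, W]
        return torch.sigmoid(self.head(fusion_map))    # highlight confidence in (0, 1)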
Continuing with the above example, when the current time is the first time, the video clip includes sampling frames 1 to 10, and the computer device inputs fusion feature map 1 into the trained prediction model to determine highlight prediction result 1 corresponding to sampling frames 1 to 10. Likewise, when the current time is the second time, the video clip includes sampling frames 3 to 12, and the computer device determines highlight prediction result 2 of sampling frames 3 to 12; when the current time is the third time, the video clip includes sampling frames 5 to 14, and the computer device determines highlight prediction result 3 of sampling frames 5 to 14, and so on.
And S203, determining the highlight video segments of the video to be identified according to the highlight prediction results of the video segments of each time.
In this embodiment, after determining the highlight prediction result of each video segment, the computer device may determine the highlight video segment of the video to be identified according to the highlight prediction result of each video segment.
Alternatively, the computer device may determine that video clips whose highlight prediction results are consecutively "highlight" form a highlight video clip of the video to be identified. Continuing with the above example, if the computer device determines that highlight prediction result 1 is non-highlight and highlight prediction results 2 and 3 are highlight, the computer device may determine sampling frame 3 from sampling frames 3 to 12 based on highlight prediction result 2, and sampling frame 14 from sampling frames 5 to 14 based on highlight prediction result 3, so as to determine that the content between sampling frame 3 and sampling frame 14 in the video to be identified is a highlight video clip.
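The merging rule described above can be sketched as follows (the data layout of the predictions is an assumption); the example reproduces the case in the text, where windows 3-12 and 5-14 are highlights and merge into one clip from sampling frame 3 to sampling frame 14:

def merge_highlight_segments(predictions):
    """predictions: list of (first_frame, last_frame, is_highlight) per window,
    in temporal order. Returns merged (start_frame, end_frame) highlight clips."""
    clips, current = [], None
    for first, last, is_highlight in predictions:
        if is_highlight:
            current = (first, last) if current is None else (current[0], last)
        elif current is not None:
            clips.append(current)
            current = None
    if current is not None:
        clips.append(current)
    return clips

# Window 1 (frames 1-10) non-highlight, windows 2 (3-12) and 3 (5-14) highlight:
print(merge_highlight_segments([(1, 10, False), (3, 12, True), (5, 14, True)]))
# [(3, 14)]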
According to the video identification method provided by this embodiment, a fusion feature map is obtained from the target feature maps of the current N sampling frames of the video to be identified, and the highlight prediction result of the video clip is determined from the fusion feature map; the video clip comprises the current N sampling frames, and the highlight video clips of the video to be identified are then determined according to the highlight prediction results of the video clips of each time. Since each target feature map is determined based on at least two different pieces of information of a sampling frame, the determined target feature map carries rich information; and because the N target feature maps comprise the target feature maps of the target sampling frame and of its adjacent sampling frames, and the target feature maps of the adjacent sampling frames include the target feature map of at least one sampling frame among the previous N sampling frames, determining the target feature maps realizes local early fusion. Obtaining the fusion feature map from the target feature maps of the current N sampling frames then realizes overall late fusion. By combining local early fusion with overall late fusion, the method balances feature-extraction efficiency and computational efficiency, and improves recognition accuracy. After the highlight prediction result of a video clip is obtained based on the fusion feature map, the highlight video clips of the video to be identified can be determined according to the highlight prediction results of the video clips, so that highlight video clips no longer need to be identified manually, saving time and labor.
Optionally, on the basis of the foregoing embodiment, the video recognition method further includes the following steps:
acquiring at least two different pieces of information of a first sampling frame and a second sampling frame based on the first sampling frame and the second sampling frame among the current N sampling frames;
wherein the first sampling frame and the second sampling frame are two sampling frames that are adjacent or separated by at least one sampling frame, and the at least two different pieces of information include at least two of the image information of the first sampling frame, the image information of the second sampling frame, and the association information between the first sampling frame and the second sampling frame.
In this embodiment, the computer device needs to acquire at least two different pieces of information of the first sampling frame and the second sampling frame. The first sample frame and the second sample frame are two adjacent sample frames or two sample frames separated by at least one frame sample frame. Taking the first time as an example, the computer device may take the sample frame 1 and the sample frame 2 as a first sample frame and a second sample frame, or may take the sample frame 1 and the sample frame 3 as a first sample frame and a second sample frame.
In the case that sampling frame 1 is the first sampling frame and sampling frame 2 is the second sampling frame, the computer device may determine target feature map 1 according to the image information of sampling frame 1 and the image information of sampling frame 2; the computer device may also determine target feature map 1 according to the image information of sampling frame 1 and the association information between sampling frame 1 and sampling frame 2. This embodiment is not limited in this respect, as long as the at least two different pieces of information of the first sampling frame and the second sampling frame include at least two of the image information of the first sampling frame, the image information of the second sampling frame, and the association information between the first sampling frame and the second sampling frame.
In this embodiment, the first sampling frame and the second sampling frame are two sampling frames that are adjacent or separated by at least one sampling frame, and the at least two different pieces of information include at least two of the image information of the first sampling frame, the image information of the second sampling frame, and the association information between the first sampling frame and the second sampling frame; thus, based on the first sampling frame and the second sampling frame among the current N sampling frames, the at least two different pieces of information of the first sampling frame and the second sampling frame can be obtained.
Optionally, on the basis of the above embodiment, the association information includes at least one of optical flow information, pixel difference information, and convolution characteristic information between the first sampling frame and the second sampling frame.
In this embodiment, the optical flow information may include, but is not limited to, an optical flow graph and an optical flow gradient. The convolution characteristic information may include characteristic information extracted from the first sample frame and characteristic information extracted from the second sample frame.
Continuing with sampling frame 1 as the first sampling frame and sampling frame 2 as the second sampling frame, the computer device may take the optical flow map between sampling frame 1 and sampling frame 2 as the association information between sampling frame 1 and sampling frame 2; the pixel difference information between sampling frame 1 and sampling frame 2 may also be used as the association information between sampling frame 1 and sampling frame 2. This embodiment is not limited in this respect, as long as the association information between the first sampling frame and the second sampling frame includes at least one of the optical flow information, the pixel difference information, and the convolution feature information between the first sampling frame and the second sampling frame.
In this embodiment, the association information includes at least one of the optical flow information, the pixel difference information, and the convolution feature information between the first sampling frame and the second sampling frame, so that the association information can embody the motion information between the first sampling frame and the second sampling frame, which improves the accuracy of the target feature map.
Fig. 3 is a schematic flow chart of obtaining a target feature map according to an embodiment of the present application, and referring to fig. 3, this embodiment relates to an alternative implementation manner of obtaining the target feature map. On the basis of the above embodiment, the above "obtaining at least two different information of the first sampling frame and the second sampling frame based on the first sampling frame and the second sampling frame in the N frame sampling frames of the current time" includes the following steps:
s301, determining an initial splicing characteristic diagram of a first sampling frame based on the first sampling frame and a second sampling frame; the time point of the first sampling frame is earlier than the time point of the second sampling frame, and the initial stitching feature map includes at least two different pieces of information.
In this embodiment, the time point of the first sampling frame is earlier than the time point of the second sampling frame. Optionally, the computer device may determine the initial stitching feature map of the first sample frame according to at least two of image information of the first sample frame, image information of the second sample frame, and association information between the first sample frame and the second sample frame. As such, the initial stitching profile includes at least two different pieces of information.
Taking the sample frame 1 as a first sample frame and the sample frame 2 as a second sample frame as an example, the computer device may splice the image information of the sample frame 1 and the image information of the sample frame 2 to obtain an initial spliced feature map of the sample frame 1. The computer device may also perform a stitching process on the image information of the sampling frame 1 and the association information between the sampling frame 1 and the sampling frame 2, to obtain an initial stitching feature map 1 of the sampling frame 1.
S302, inputting the initial stitching feature map to a backbone network in a highlight prediction network to obtain the target feature map; alternatively, taking the initial stitching feature map as the target feature map.
Continuing with the above example, after obtaining the initial stitching feature map of sampling frame 1, the computer device may directly use initial stitching feature map 1 of sampling frame 1 as target feature map 1 of sampling frame 1, or may input initial stitching feature map 1 of sampling frame 1 into a backbone network (backbone) in the highlight prediction network, so as to determine target feature map 1 of sampling frame 1 through the backbone.
In this embodiment, since the time point of the first sampling frame is earlier than the time point of the second sampling frame and the initial stitching feature map includes at least two different pieces of information, a target feature map determined based on at least two different pieces of information of the sampling frame can be obtained after determining the initial stitching feature map of the first sampling frame based on the first sampling frame and the second sampling frame. Inputting the initial stitching feature map into the backbone network of the highlight prediction network improves the accuracy of the target feature map, because the target feature map is obtained after the backbone network performs feature extraction on the initial stitching feature map. Taking the initial stitching feature map directly as the target feature map improves the speed of obtaining the target feature map.
Fig. 4 is a schematic flow chart of determining an initial stitching feature map according to an embodiment of the present application, and referring to fig. 4, this embodiment relates to an alternative implementation of determining an initial stitching feature map of a first sample frame. Based on the above embodiment, the step S301 of determining an initial stitching feature map of the first sampling frame based on the first sampling frame and the second sampling frame includes the following steps:
s401, optical flow information is obtained based on the first sampling frame and the second sampling frame, and features of the optical flow information are extracted to obtain a first feature map.
In this embodiment, the computer device may perform convolution processing on the optical flow information to obtain the first feature map. Taking sampling frame 1 as the first sampling frame and sampling frame 2 as the second sampling frame as an example, the computer device may obtain the optical flow information based on sampling frame 1 and sampling frame 2, and perform convolution processing on the optical flow information to extract the features of the optical flow information to obtain the first feature map 1.
S402, extracting the features of the first sampling frame to obtain a second feature map.
Optionally, the computer device may also perform convolution processing on the first sampling frame to obtain the second feature map. Continuing with the example above, the computer device also needs to extract features of sample frame 1 to obtain a second feature map 1.
S403, performing splicing treatment on the first feature map and the second feature map to obtain an initial spliced feature map; the initial stitching feature map includes optical flow information and target image information including image information of a first sample frame and/or image information of a second sample frame.
After S401 and S402, the computer device may perform a stitching process on the first feature map and the second feature map to obtain an initial stitched feature map. The computer device performs the splicing process on the first feature map 1 and the second feature map 1, that is, after the concat process, an initial spliced feature map corresponding to the sampling frame 1 is obtained.
Since the first feature map is obtained by extracting features of optical flow information between the first sampling frame and the second sampling frame, and the second feature map is obtained by extracting features of the first sampling frame, the initial stitching feature map includes the optical flow information and the target image information between the first sampling frame and the second sampling frame. The target image information includes image information of the first sampling frame and/or image information of the second sampling frame.
Fig. 5 is a schematic diagram of a process for obtaining an initial stitching feature map according to an embodiment of the present application. As shown in fig. 5, after obtaining optical flow information based on the first sampling frame and the second sampling frame, the computer device performs convolution processing on the optical flow information to extract features of the optical flow information to obtain a first feature map. And the computer equipment also carries out convolution processing on the first sampling frame with the earlier time point so as to extract the characteristics of the first sampling frame to obtain a second characteristic diagram. And finally, the computer equipment performs splicing processing on the first feature map and the second feature map to obtain an initial spliced feature map of the first sampling frame.
In this embodiment, optical flow information is obtained based on the first sampling frame and the second sampling frame, features of the optical flow information are extracted to obtain a first feature map, features of the first sampling frame are extracted to obtain a second feature map, and then the first feature map and the second feature map are subjected to splicing processing to obtain an initial spliced feature map. Because the initial stitching feature map comprises optical flow information and target image information, and the target image information comprises the image information of the first sampling frame and/or the image information of the second sampling frame, the obtained target feature map comprises two different information, and the accuracy of the fusion feature map determined based on the target feature map is improved.
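A minimal PyTorch sketch of the stitching path of FIG. 5 (the two-channel optical-flow layout, channel counts, and class name are assumptions): the optical flow and the earlier sampling frame each pass through their own convolution, and the two results are concatenated along the channel axis:

import torch
import torch.nn as nn

class FlowFrameStitch(nn.Module):
    """Assumed sketch of FIG. 5: conv(optical flow) ++ conv(first sampling frame)."""

    def __init__(self, out_channels: int = 32):
        super().__init__()
        self.flow_conv = nn.Conv2d(2, out_channels, kernel_size=3, padding=1)   # flow: (dx, dy)
        self.frame_conv = nn.Conv2d(3, out_channels, kernel_size=3, padding=1)  # RGB frame

    def forward(self, flow, first_frame):
        f1 = self.flow_conv(flow)            # first feature map
        f2 = self.frame_conv(first_frame)    # second feature map
        return torch.cat([f1, f2], dim=1)    # initial stitching feature map

# Dummy usage (shapes assumed): 224x224 inputs
stitch = FlowFrameStitch()(torch.randn(1, 2, 224, 224), torch.randn(1, 3, 224, 224))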
Fig. 6 is a schematic flow chart of yet another embodiment of determining an initial stitching feature map according to the present application, and referring to fig. 6, this embodiment relates to an alternative implementation of how to determine an initial stitching feature map of a first sample frame. Based on the above embodiment, the step S301 of determining an initial stitching feature map of the first sampling frame based on the first sampling frame and the second sampling frame includes the following steps:
s601, pixel difference information is obtained based on the first sampling frame and the second sampling frame, and features of the pixel difference information are extracted to obtain a third feature map.
In this embodiment, the computer device may perform convolution processing on the pixel difference information to obtain the third feature map. Continuing with sampling frame 1 as the first sampling frame and sampling frame 2 as the second sampling frame, the computer device may obtain the pixel difference information between sampling frame 1 and sampling frame 2 based on the two frames, and perform convolution processing on the pixel difference information to extract its features to obtain the third feature map 1.
Assuming that each of the sampling frame 1 and the sampling frame 2 includes the pixel point 1 to the pixel point 200, the computer device may determine a pixel difference between the pixel value of the pixel point 1 in the sampling frame 1 and the pixel value of the pixel point 1 in the sampling frame 2, a pixel difference between the pixel value of the pixel point 2 in the sampling frame 1 and the pixel value of the pixel point 2 in the sampling frame 2, and so on, the computer device may determine a pixel difference of 200 pixel points, that is, pixel difference information.
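The per-pixel difference described above can be illustrated with numpy as follows (the dtype handling and function name are assumptions; the application only requires a pixel-wise difference between the two sampling frames):

import numpy as np

def pixel_difference(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Per-pixel difference between two sampling frames of identical shape."""
    return frame_a.astype(np.int16) - frame_b.astype(np.int16)

# Example: two 8-bit grayscale frames with 200 pixels each (10 x 20)
frame_1 = np.random.randint(0, 256, (10, 20), dtype=np.uint8)
frame_2 = np.random.randint(0, 256, (10, 20), dtype=np.uint8)
diff = pixel_difference(frame_1, frame_2)   # 200 pixel differences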
S602, extracting the features of the first sampling frame to obtain a fourth feature map.
Optionally, the computer device may also perform convolution processing on the first sampling frame to obtain a fourth feature map. Continuing with the example above, the computer device also needs to extract features of sample frame 1 to obtain a fourth feature map 1.
S603, performing splicing processing on the third feature map and the fourth feature map to obtain an initial spliced feature map; the initial stitching feature map includes pixel difference information and target image information including image information of a first sample frame and/or image information of a second sample frame.
After S601 and S602, the computer device may perform a stitching process on the third feature map and the fourth feature map to obtain an initial stitched feature map. The computer device performs the splicing process on the third feature map 1 and the fourth feature map 1, that is, after the concat process, an initial spliced feature map 1 corresponding to the sampling frame 1 is obtained.
Fig. 7 is a schematic diagram of a process of obtaining an initial stitching feature map according to another embodiment of the present application. As shown in fig. 7, after obtaining the pixel difference information based on the first sampling frame and the second sampling frame, the computer device performs convolution processing on the pixel difference information to extract the features of the pixel difference information to obtain the third feature map. The computer device also performs convolution processing on the first sampling frame, whose time point is earlier, to extract the features of the first sampling frame to obtain the fourth feature map. Finally, the computer device performs stitching processing on the third feature map and the fourth feature map to obtain the initial stitching feature map of the first sampling frame.
In this embodiment, the pixel difference information is determined based on the first sampling frame and the second sampling frame, its features are extracted to obtain the third feature map, and the features of the first sampling frame are extracted to obtain the fourth feature map. Therefore, after the third feature map and the fourth feature map are stitched to obtain the initial stitching feature map, the initial stitching feature map includes the pixel difference information and the target image information, where the target image information includes the image information of the first sampling frame and/or the image information of the second sampling frame, and the target feature map obtained from the initial stitching feature map accordingly includes two different pieces of information.
Fig. 8 is a schematic flow chart of yet another embodiment of determining an initial stitching feature map according to the present application, and referring to fig. 8, this embodiment relates to an alternative implementation of how to determine an initial stitching feature map of a first sample frame. Based on the above embodiment, the step S301 of determining an initial stitching feature map of the first sampling frame based on the first sampling frame and the second sampling frame includes the following steps:
s801, carrying out convolution processing on the first sampling frame to obtain a fifth characteristic diagram.
S802, carrying out convolution processing on the second sampling frame to obtain a sixth feature map.
S803, performing splicing processing on the fifth feature map and the sixth feature map to obtain an initial spliced feature map; the initial stitching feature map includes convolution feature information and target image information including image information of the first sample frame and/or image information of the second sample frame.
Fig. 9 is a schematic diagram of a process of obtaining an initial stitching feature map according to another embodiment of the present application, and in this embodiment, taking a sample frame 1 as a first sample frame and a sample frame 2 as a second sample frame as an example, a computer device convolves the sample frame 1 to obtain a fifth feature map 1, convolves the sample frame 2 to obtain a sixth feature map 1, and stitches the fifth feature map 1 and the sixth feature map 1, that is, after concat processing, obtains an initial stitching feature map 1 corresponding to the sample frame 1.
It should be noted that, the convolution kernel and the step size in the above convolution process may be set according to requirements, for example, the convolution kernel is 3×3, and the step size is 1.
In this embodiment, convolution processing is performed on the first sampling frame to obtain the fifth feature map, and convolution processing is performed on the second sampling frame to obtain the sixth feature map. Therefore, after the fifth feature map and the sixth feature map are stitched to obtain the initial stitching feature map, the initial stitching feature map includes the convolution feature information and the target image information, where the target image information includes the image information of the first sampling frame and/or the image information of the second sampling frame, and the target feature map obtained from the initial stitching feature map accordingly includes two different pieces of information.
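A minimal PyTorch sketch of this variant (channel counts and the class name are assumptions; the 3x3 kernel with stride 1 follows the example above): each sampling frame is convolved separately and the two results are concatenated:

import torch
import torch.nn as nn

class TwoFrameStitch(nn.Module):
    """Assumed sketch of FIG. 9: conv(first frame) ++ conv(second frame)."""

    def __init__(self, out_channels: int = 32):
        super().__init__()
        # 3x3 kernel with stride 1, as in the example above
        self.conv_a = nn.Conv2d(3, out_channels, kernel_size=3, stride=1, padding=1)
        self.conv_b = nn.Conv2d(3, out_channels, kernel_size=3, stride=1, padding=1)

    def forward(self, first_frame, second_frame):
        fifth = self.conv_a(first_frame)          # fifth feature map
        sixth = self.conv_b(second_frame)         # sixth feature map
        return torch.cat([fifth, sixth], dim=1)   # initial stitching feature map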
Optionally, on the basis of the foregoing embodiment, the target sampling frame is an L-th sampling frame; the video identification method further comprises the following steps:
and in the case that L is greater than N, storing the target feature map of the L-th sampling frame and deleting the target feature map of the (L-N)-th sampling frame.
Fig. 10 is a schematic diagram of determining a highlight video clip according to an embodiment of the present application, as shown in fig. 10 (a). Illustratively, in the case that the current time is the first time, the initial stitching feature map at time t represents the initial stitching feature map of sampling frame 1, which is determined from sampling frame 1 and sampling frame 2. The initial stitching feature map at time t+1 represents the initial stitching feature map of sampling frame 2, which is determined from sampling frame 2 and sampling frame 3. The initial stitching feature map at time t+2 represents the initial stitching feature map of sampling frame 3, which is determined from sampling frame 3 and sampling frame 4.
Similarly, the initial stitching feature map at time t+L-2 represents the initial stitching feature map of sampling frame 9, which is determined from sampling frame 9 and sampling frame 10. The initial stitching feature map at time t+L-1 represents the initial stitching feature map of sampling frame 10, which is determined from sampling frame 10 and sampling frame 11.
The initial stitching feature maps are input into the backbone network of the highlight prediction network to obtain target feature maps 1 to 10 corresponding to sampling frames 1 to 10, respectively, and fusion feature map 1 can be obtained from target feature maps 1 to 10. Further, from fusion feature map 1, highlight prediction result 1 of video clip 1, which includes sampling frames 1 to 10, can be determined.
In the case that the current time is the second time, the initial stitching feature map at time t represents the initial stitching feature map of sampling frame 2. The initial stitching feature map at time t+1 represents the initial stitching feature map of sampling frame 3. The initial stitching feature map at time t+2 represents the initial stitching feature map of sampling frame 4. Similarly, the initial stitching feature map at time t+L-2 represents the initial stitching feature map of sampling frame 10, which is determined from sampling frame 10 and sampling frame 11, and the initial stitching feature map at time t+L-1 represents the initial stitching feature map of sampling frame 11.
The initial stitching feature maps are input into the backbone network of the highlight prediction network to obtain target feature maps 2 to 11 corresponding to sampling frames 2 to 11, respectively, and fusion feature map 2 can be obtained from target feature maps 2 to 11. Further, from fusion feature map 2, highlight prediction result 2 of video clip 2, which includes sampling frames 2 to 11, can be determined.
It can be seen that, in the case that L is greater than N, the target feature maps computed at earlier times are still used. Therefore, each time the computer device calculates a target feature map, in the case that L is greater than N, it stores the target feature map of the L-th sampling frame and deletes the target feature map of the (L-N)-th sampling frame.
With continued reference to fig. 10 (b), for example, when the current time is the second time, the target sampling frame is sampling frame 11 and L is greater than N. The computer device needs target feature maps 2 to 11 corresponding to sampling frames 2 to 11, and target feature maps 2 to 10 were already stored at the first time; that is, the computer device may reuse the target feature map at time t+1, the target feature map at time t+2, the target feature map at time t+3, …, and the target feature map at time t+L-1. Therefore, when determining the fusion feature map for the second time, the computer device can reuse target feature maps 2 to 10. In addition, target feature map 11 of sampling frame 11 may be stored, and target feature map 1 of sampling frame 1 may be deleted.
When the current time is the third time, the target sampling frame is sampling frame 12 and L is greater than N. The computer device needs target feature maps 3 to 12 corresponding to sampling frames 3 to 12, and target feature maps 3 to 11 were already stored at the second time; that is, the computer device may reuse the target feature map at time t+2, the target feature map at time t+3, …, and the target feature map at time t+L. Therefore, at the third time the computer device only needs to store target feature map 12 of sampling frame 12 and delete target feature map 2 of sampling frame 2, and can reuse target feature maps 3 to 11. And so on; this will not be repeated here.
In this embodiment, when L is greater than N, the target feature map of the L-th sampling frame is stored and the target feature map of the (L-N)-th sampling frame is deleted, so that target feature maps can be extracted effectively, the target feature maps of the sampling frames are independent of each other and can be reused for subsequent sampling frames, and the operation efficiency is improved. Moreover, the consumption of computing resources and the resource occupancy are reduced, which is friendly to deployment on mobile terminals.
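One hedged way to implement this store-and-delete reuse (the cache structure and class name are assumptions) is a dictionary keyed by sampling-frame number from which the (L-N)-th entry is evicted whenever the L-th entry is stored:

class TargetFeatureCache:
    """Assumed rolling cache: keep only the target feature maps that later
    windows can still reuse (store frame L, delete frame L - N)."""

    def __init__(self, n: int = 10):
        self.n = n
        self._maps = {}          # sampling-frame number -> target feature map

    def store(self, frame_number, feature_map):
        self._maps[frame_number] = feature_map
        stale = frame_number - self.n
        self._maps.pop(stale, None)      # delete the (L - N)-th frame, if cached

    def get_window(self, frame_numbers):
        return [self._maps[k] for k in frame_numbers]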
Optionally, on the basis of the above embodiment, the adjacent sampling frame of the target sampling frame includes a first N-1 frame sampling frame of the target sampling frame.
In this embodiment, taking n=10 as an example, when the current time is the second time, the target sampling frame is sampling frame 11, and the sampling frames adjacent to the target sampling frame include the first 9 sampling frames of sampling frame 11, that is, the sampling frames adjacent to sampling frame 11 include sampling frames 2 to 10.
When the current time is the third time, the target sampling frame is sampling frame 12, and the adjacent sampling frames of the target sampling frame include the first 9 sampling frames of sampling frame 12, that is, the adjacent sampling frames of sampling frame 12 include sampling frames 3 to 11.
That is, when the current time is the first time, the computer device determines the target feature map 1 to the target feature map 10 of the sampling frames 1 to 10; when the current time is the second time, the computer equipment determines the target characteristic diagrams 2-11 of the sampling frames 2-11; when the current time is the third time, the computer device determines the target feature map 3 to the target feature map 12 of the sampling frame 3 to the sampling frame 12, and the like.
In this embodiment, the adjacent sampling frames of the target sampling frame comprise the first N-1 sampling frames of the target sampling frame, so that local early fusion is realized and the first N-1 sampling frames of the target sampling frame can be reused, which reduces the storage space of the highlight prediction network, reduces the resource occupancy, and improves the identification efficiency.
Fig. 11 is a flowchart illustrating a method for determining a highlight prediction result according to an embodiment of the present application, and referring to fig. 11, this embodiment relates to an alternative implementation of how to determine a highlight prediction result of a video clip. Based on the above embodiment, the step S202 of determining the highlight prediction result of the video clip according to the fusion feature map includes the following steps:
S1101, inputting the fusion feature map to a first classification sub-network in the highlight prediction network to obtain action labels and action feature information of the video clips.
Highlight and non-highlight are each broad categories, and clustering at the feature level is more conducive to summarizing highlight and non-highlight content. Therefore, in this embodiment, the highlight prediction network may include a first classification sub-network and a second classification sub-network.
Further, after obtaining the fusion feature map, the computer device first inputs the fusion feature map to a first classification sub-network in the highlight prediction network to obtain action tags and action feature information of the video clips.
The action label is used to represent an action in the video clip, for example, action A. The action feature information is a feature determined by the first classification sub-network from the fusion feature map; for example, the first classification sub-network obtains the action feature information after performing feature extraction on the fusion feature map.
And S1102, inputting the action characteristic information into a second classification sub-network in the highlight prediction network to obtain the highlight label of the video clip.
In this embodiment, after the first classification sub-network obtains the action feature information, the action feature information is input to the second classification sub-network in the highlight prediction network, so that the second classification sub-network outputs the highlight label of the video clip.
Wherein the highlight label is used to indicate whether the video clip is a highlight. Alternatively, the second classification sub-network may output a confidence level indicating whether the video clip is a highlight, the confidence level being a number between 0 and 1, and the computer device may determine the highlight label of the video clip based on a preset threshold and the confidence level. For example, the preset threshold may be 0.5; when the confidence is greater than 0.5, the highlight label is determined to be "highlight", and when the confidence is not greater than 0.5, the highlight label is determined to be "non-highlight".
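A minimal sketch of the thresholding step just described; the function name is an assumption, and the 0.5 default simply mirrors the example threshold.

```python
def confidence_to_highlight_label(confidence: float, threshold: float = 0.5) -> str:
    """Map the second classification sub-network's confidence to a highlight label."""
    return "highlight" if confidence > threshold else "non-highlight"
```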
S1103, a highlight prediction result is determined based on the action tag and the highlight tag.
In this embodiment, after obtaining the action tag and the highlight tag, the computer device may determine a highlight prediction result based on the action tag and the highlight tag. In other words, because the highlight prediction result is determined from both tags, it can indicate not only whether the video clip is a highlight, but also for which action the video clip is determined to be a highlight, which makes the highlight determination more interpretable.
Illustratively, the action label of video segment 1 is "action B", the highlight label is "highlight", and the highlight prediction result of video segment 1 indicates that video segment 1 is a highlight, and the action corresponding to the highlight is action B.
Fig. 12 is a schematic diagram of a process of obtaining a highlight prediction result in an embodiment of the present application. Referring to Fig. 12, after the computer device obtains the target feature maps of the current N sampling frames and obtains the fused feature map of the current N sampling frames from those target feature maps, the computer device inputs the fused feature map to the first classification sub-network of the highlight prediction network, and the first classification sub-network determines the action feature information and the action label of the video clip from the fused feature map. Further, the computer device inputs the action feature information into the second classification sub-network of the highlight prediction network, which determines the highlight label of the video clip based on the action feature information.
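The two-stage cascade of Fig. 12 could be sketched as below. The module names, layer sizes, and the use of PyTorch are assumptions made for illustration; the patent does not specify the architectures of the two sub-networks.

```python
import torch
import torch.nn as nn

class FirstClassificationSubNetwork(nn.Module):
    """Predicts the action label and exposes the intermediate action feature information."""
    def __init__(self, in_dim: int, feat_dim: int, num_actions: int):
        super().__init__()
        self.feature_head = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, fused_feature_map: torch.Tensor):
        action_features = self.feature_head(fused_feature_map)
        action_logits = self.action_head(action_features)
        return action_logits, action_features

class SecondClassificationSubNetwork(nn.Module):
    """Predicts a highlight confidence from the action feature information."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.highlight_head = nn.Linear(feat_dim, 1)

    def forward(self, action_features: torch.Tensor):
        return torch.sigmoid(self.highlight_head(action_features))

# Forward pass as in Fig. 12 (shapes are illustrative only).
fused = torch.randn(1, 512)                       # fused feature map of the current N sampling frames
first_net = FirstClassificationSubNetwork(512, 256, num_actions=10)
second_net = SecondClassificationSubNetwork(256)
action_logits, action_features = first_net(fused)
highlight_confidence = second_net(action_features)
```

The key design point is that the second sub-network consumes the intermediate action features rather than the raw fused feature map, which is what lets the highlight decision build on the action classification.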
In this embodiment, the fused feature map is input to the first classification sub-network in the highlight prediction network to obtain the action label and the action feature information of the video clip, the action feature information is input to the second classification sub-network to obtain the highlight label of the video clip, and the highlight prediction result is determined based on the action label and the highlight label. On the one hand, the first classification sub-network classifies the action to determine the action label, and the second classification sub-network then determines whether that action is a highlight to determine the highlight label; this determination process is more logical and more conducive to aggregating features of the same class, which improves the classification capability of the highlight prediction network. Moreover, because the highlight prediction result is determined based on both the action label and the highlight label, the action corresponding to a clip judged to be a highlight is known, which makes the highlight determination more interpretable. On the other hand, only the video to be identified itself is needed, without relying on other information such as sound, browsing records or bullet comments, which broadens the application scenarios of video identification.
It will be appreciated that the highlight prediction network first needs to be trained before it can be used. First, the computer device needs to determine a sample data set, where the sample data set includes a plurality of video samples and the action tags corresponding to the video samples, and each action tag has an associated highlight tag, where the highlight tag is used to characterize whether the action corresponding to the action tag is a highlight action.
The computer device may provide an interactive interface and, in response to interactive operations on the interactive interface, determine the action tags corresponding to each video clip sample in the video samples, thereby completing the labeling. The interactive operations may include, but are not limited to, mouse clicks, keyboard selection, gesture triggering, and voice control. The user may select a start point and an end point of the video sample in the interactive interface and select at least one action tag from a plurality of candidate action tags; the computer device may then slice the video sample according to the start point and the end point to obtain a video clip sample, and automatically label the video clip sample with the selected action tags.
That is, the computer device defines the highlight tag of each action tag in advance. For example, a highlight tag of 0 indicates non-highlight and a highlight tag of 1 indicates highlight; the highlight tag corresponding to action A is 1 and the highlight tag corresponding to action B is 0. Further, the computer device may label the action tags corresponding to each video clip sample in the video samples; for example, video clip sample 1 is labeled with action A and action B.
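A small sketch of the predefined action-to-highlight mapping described above; the specific action names and values simply mirror the example and are otherwise assumptions.

```python
# 1 means the action is predefined as a highlight action, 0 means it is not.
ACTION_HIGHLIGHT_MAP = {
    "action A": 1,
    "action B": 0,
}

def highlight_tag_for_action(action_tag: str) -> int:
    """Look up the predefined highlight tag of an action tag (default: non-highlight)."""
    return ACTION_HIGHLIGHT_MAP.get(action_tag, 0)
```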
Optionally, the sample data set includes positive and negative sample data sets. To enhance the generalization ability of the prediction model, the video samples should also include disturbance scenes that are commonly encountered in daily use, such as camera shake without a highlight, a body part touching or coming close to the lens, or the lens facing the ground. In this way, the computer device can determine positive and negative sample data sets that cover a wide range of natural scenes.
Then, following the same principle as in the inference stage, the computer device may input the sample data set into the highlight prediction network, determine the N target feature maps of the N sampling frames of the current time in a video sample through the highlight prediction network, obtain a fused feature map from those N target feature maps, and then input the fused feature map into the first classification sub-network of the highlight prediction network to predict the action tag and the action feature information corresponding to the video clip sample, where the video clip sample includes the N sampling frames of the current time.
The computer device then inputs the action feature information into the second classification sub-network in the highlight prediction network to predict the highlight tag of the video clip sample. Further, the computer device may determine a highlight prediction result for each video clip sample based on the action tags and highlight tags predicted by the highlight prediction network.
Further, according to the highlight prediction result of each video clip sample and the actual labels of the video clip samples in the positive and negative sample data sets, the computer device can compute a loss function and perform back-propagation to train the highlight prediction network, and it can stop training when the loss function meets a preset condition.
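A minimal training-step sketch under the assumptions above, reusing first_net and second_net from the earlier sketch. The loss choices (cross-entropy for the action head, binary cross-entropy for the highlight head), the optimizer, and the learning rate are assumptions; the patent only states that a loss function is computed and back-propagated until a preset condition is met.

```python
import torch
import torch.nn as nn

criterion_action = nn.CrossEntropyLoss()       # supervises the predicted action tag
criterion_highlight = nn.BCELoss()             # supervises the predicted highlight tag
params = list(first_net.parameters()) + list(second_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

def training_step(fused_feature_map, action_target, highlight_target):
    """One optimization step; action_target holds class indices, highlight_target holds 0/1 floats."""
    action_logits, action_features = first_net(fused_feature_map)
    highlight_confidence = second_net(action_features).squeeze(-1)
    loss = (criterion_action(action_logits, action_target)
            + criterion_highlight(highlight_confidence, highlight_target))
    optimizer.zero_grad()
    loss.backward()                            # back-propagation of the combined loss
    optimizer.step()
    return loss.item()                         # monitored to decide when training stops
```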
Fig. 13 is a schematic flow chart of determining a highlight video according to an embodiment of the present application, and referring to fig. 13, this embodiment relates to an alternative implementation of how to determine a highlight video. On the basis of the above embodiment, the step S203 of determining a highlight video segment of the video to be identified according to the highlight prediction result of each video segment includes the following steps:
S1301, determining a start image frame and an end image frame of the highlight video clip according to the highlight prediction result of each video clip.
In this embodiment, it is assumed that the computer device determines 40 video clips, video clip 1 to video clip 40, from the video to be identified, where video clip 1 includes sampling frames 1 to 10, video clip 2 includes sampling frames 2 to 11, video clip 3 includes sampling frames 3 to 12, ……, and video clip 40 includes sampling frames 40 to 49.
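For orientation only, the sliding clips in this example (clip k covering sampling frames k through k+N-1) can be enumerated as below; the helper name is an assumption.

```python
def sliding_clips(num_sampling_frames: int, N: int = 10):
    """Yield (clip_index, first_frame, last_frame), 1-based, as in the example above."""
    for k in range(1, num_sampling_frames - N + 2):
        yield k, k, k + N - 1

# With 49 sampling frames and N=10 this yields clip 1 = frames 1-10 up to clip 40 = frames 40-49.
```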
Accordingly, the highlight prediction results of the video clips are such that video clip 1 to video clip 40 correspond to highlight prediction result 1 to highlight prediction result 40, respectively.
Based on the 40 highlight predictions, the computer device can determine the starting image frame and the ending image frame of the highlight video clip. It will be appreciated that the starting image frame and the ending image frame are used to locate a highlight video clip from the video to be identified.
Alternatively, the computer device may determine the starting image frame and the ending image frame based on a run of consecutive video clips whose highlight prediction results are highlight. For example, if highlight prediction results 10 to 20 among highlight prediction results 1 to 40 are all highlight, the computer device may use the first frame of video clip 10 as the starting image frame and the last frame of video clip 20 as the ending image frame. In some embodiments, the computer device may also take the sampling frame in the middle of video clip 10 as the starting image frame and the sampling frame in the middle of video clip 20 as the ending image frame.
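As an illustration of the run-based boundary selection just described, the sketch below finds maximal runs of consecutive highlight clips and returns their first and last clip indices; the label strings and the function name are assumptions.

```python
def find_highlight_runs(prediction_results):
    """Return (start_clip_index, end_clip_index) pairs for maximal runs of 'highlight'."""
    runs, run_start = [], None
    for i, result in enumerate(prediction_results):
        if result == "highlight" and run_start is None:
            run_start = i                     # a new run of highlight clips begins
        elif result != "highlight" and run_start is not None:
            runs.append((run_start, i - 1))   # the run ended at the previous clip
            run_start = None
    if run_start is not None:
        runs.append((run_start, len(prediction_results) - 1))
    return runs
```

The starting image frame can then be taken from the first clip of a run and the ending image frame from its last clip, using either the boundary frames or the middle sampling frames as in the example.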
Alternatively, the computer device may also select a first segment and a second segment from the video clips based on the highlight prediction result of each video clip. The computer device may then take the sampling frame at a specified position in the first segment as the starting image frame and the sampling frame at a specified position in the second segment as the ending image frame.
The first segment may be the video clip whose highlight prediction result is the first one that is highlight, or a video clip whose highlight prediction result is highlight while the highlight prediction result of the preceding video clip is non-highlight. The second segment may be the video clip whose highlight prediction result is the last one that is non-highlight, or a video clip whose highlight prediction result is non-highlight while the highlight prediction result of the preceding video clip is highlight.
Illustratively, assume that among video clips 1 to 40, the highlight prediction results of video clip 5, video clip 15, video clip 28, and video clip 40 are all non-highlight, and the highlight prediction results of the remaining video clips are all highlight. The computer device may, for the first time, take video clip 1 as first segment 1 and video clip 5 as second segment 1; for the second time, take video clip 6 as first segment 2 and video clip 15 as second segment 2; for the third time, take video clip 16 as first segment 3 and video clip 28 as second segment 3; and for the fourth time, take video clip 29 as first segment 4 and video clip 40 as second segment 4.
Further, taking the determination of a highlight video clip from first segment 1 and second segment 1 as an example, the computer device may take the sampling frame at the specified position in first segment 1 as the starting image frame and the sampling frame at the specified position in second segment 1 as the ending image frame. The specified position may be the position of any one of the sampling frames, for example, the middle position.
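A sketch of pairing first and second segments as in the example above and of picking the sampling frame at the specified (here, middle) position as a boundary frame; both helper names are assumptions.

```python
def pair_segments(prediction_results):
    """Pair each first segment (start of a highlight run) with the first non-highlight clip after it."""
    pairs, first = [], None
    for i, result in enumerate(prediction_results):
        prev = prediction_results[i - 1] if i > 0 else "non-highlight"
        if result == "highlight" and prev == "non-highlight":
            first = i                                  # candidate first segment
        if result == "non-highlight" and prev == "highlight" and first is not None:
            pairs.append((first, i))                   # (first segment, second segment)
            first = None
    return pairs

def middle_sampling_frame(clip_frames):
    """Sampling frame at the specified (middle) position of a clip."""
    return clip_frames[len(clip_frames) // 2]
```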
And S1302, determining a highlight video fragment from the video to be identified according to the starting image frame and the ending image frame.
In this embodiment, after determining the starting image frame and the ending image frame, the computer device may locate, in the video to be identified, the highlight video clip that starts at the starting image frame and ends at the ending image frame.
Illustratively, assuming that the sampling frame located at the specified position in first segment 1 is sampling frame 5 and the sampling frame located at the specified position in second segment 1 is sampling frame 10, the computer device regards the content from sampling frame 5 to sampling frame 10 in the video to be identified as one highlight video clip.
In this embodiment, the starting image frame and the ending image frame of the highlight video clip are determined according to the highlight prediction result of each video clip, and the highlight video clip is then determined from the video to be identified according to the starting image frame and the ending image frame. No manual identification is needed, which saves time and labor, and because the highlight video clip is determined according to the highlight prediction results of all the video clips, the accuracy of the highlight video clip is improved.
In order to more clearly describe the video recognition method of the present application, it is described herein with reference to fig. 14. Fig. 14 is a schematic process diagram of a video recognition method according to an embodiment of the present application, and as shown in fig. 14, a computer device may perform the video recognition method according to the following procedure.
S1401, determining an initial stitching feature map of the first sampling frame based on the first sampling frame and the second sampling frame among the N sampling frames of the current time. The computer device may determine the initial stitching feature map of the first sampling frame in the manner shown in Fig. 5 or in the manner shown in Fig. 7, and may determine the initial stitching feature maps corresponding to the N sampling frames respectively in the manner shown in Fig. 9. Illustratively, when the current time is the first time, the initial stitching feature map 1 of sampling frame 1 is determined according to sampling frame 1 and sampling frame 2, the initial stitching feature map 2 of sampling frame 2 is determined according to sampling frame 2 and sampling frame 3, and so on.
S1402, inputting the initial stitching feature maps to a backbone network in the highlight prediction network to obtain the target feature maps corresponding to the N sampling frames respectively; or, taking the initial stitching feature maps as the target feature maps corresponding to the N sampling frames respectively. It will be appreciated that the N target feature maps include the target feature map of the target sampling frame and the target feature maps of the adjacent sampling frames of the target sampling frame, the target feature maps of the adjacent sampling frames include the target feature map of at least one of the last N sampling frames, and the adjacent sampling frames of the target sampling frame include the first N-1 sampling frames of the target sampling frame.
S1403, obtaining the fusion feature map of the current time according to the target feature maps of the N sampling frames of the current time of the video to be identified; and, when L is greater than N, storing the target feature map of the L-th sampling frame and deleting the target feature map of the (L-N)-th sampling frame.
And S1404, inputting the fusion feature map of the current time into a first classification sub-network in the highlight prediction network to obtain action labels and action feature information of the video clips of the current time. Wherein the current video clip comprises the current N frames of sample frames.
And S1405, inputting the action characteristic information into a second classification sub-network in the highlight prediction network to obtain the highlight label of the video clip.
S1406, determining the highlight prediction result of the video segment of the current time based on the action label and the highlight label.
S1407, determining a start image frame and an end image frame of the highlight video segment according to the highlight prediction result of each video segment.
S1408, determining a highlight video clip from the video to be identified according to the start image frame and the end image frame.
S1401 to S1408 may refer to the above embodiments, and are not described here again.
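Purely as an orienting sketch, the flow S1401 to S1408 can be read as the loop below, reusing the FeatureMapCache sketched earlier. Every callable passed in (window sampler, stitching builder, backbone, fusion, sub-networks, decision, and boundary logic) is a hypothetical placeholder standing in for the corresponding step; none of these names is defined by the patent.

```python
def recognize_highlights(video, N, sample_windows, build_stitching_maps, backbone,
                         fuse, first_subnet, second_subnet, decide_result,
                         boundary_frames, slice_video):
    """Sketch of S1401-S1408 for one video to be identified."""
    cache = FeatureMapCache(N)                              # storage rule of S1403
    predictions = []
    for window in sample_windows(video, N):                 # N sampling frames of the current time
        stitched = build_stitching_maps(window)             # S1401: initial stitching feature maps
        target_maps = backbone(stitched)                    # S1402: target feature maps per sampling frame
        for frame_index, feature_map in target_maps:
            cache.add(frame_index, feature_map)             # S1403: store L-th, delete (L-N)-th
        fused = fuse(cache.current_window())                # S1403: fusion feature map of the current time
        action_logits, action_features = first_subnet(fused)        # S1404: action label and features
        highlight_confidence = second_subnet(action_features)       # S1405: highlight label
        predictions.append(decide_result(action_logits, highlight_confidence))  # S1406
    start_frame, end_frame = boundary_frames(predictions)   # S1407
    return slice_video(video, start_frame, end_frame)       # S1408
```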
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and the order of execution of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a video recognition device for realizing the video recognition method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the video recognition device provided below may refer to the limitation of the video recognition method hereinabove, and will not be repeated here.
Fig. 15 is a block diagram of a video recognition device according to an embodiment of the present application, and as shown in fig. 15, in an embodiment of the present application, there is provided a video recognition device 1500, including: a obtaining module 1501, a first determining module 1502 and a second determining module 1503, wherein:
an obtaining module 1501, configured to obtain a fusion feature map according to a target feature map of a current N-frame sampling frame of a video to be identified; the N target feature maps comprise target sampling frames and target feature maps of adjacent sampling frames of the target sampling frames, the target feature maps of the adjacent sampling frames comprise target feature maps of at least one sampling frame of the last N sampling frames, the target feature maps are determined based on at least two different information of the sampling frames, and N is more than or equal to 2.
A first determining module 1502, configured to determine a highlight prediction result of the video segment according to the fusion feature map; the video clip includes the current N frames of sample frames.
A second determining module 1503, configured to determine a highlight video segment of the video to be identified according to the highlight prediction result of each video segment.
According to the video identification device provided by the embodiment, a fusion feature map is obtained according to the target feature map of the N frame sampling frames of the current time of the video to be identified, and the highlight prediction result of the video fragment is determined according to the fusion feature map; the video clips comprise N frames of sampling frames of the current time, and then the highlight video clips of the video to be identified are determined according to highlight prediction results of the video clips of each time. Since the target feature map is determined based on at least two different information of the sampling frame, the determined target feature map is relatively rich in information. And because the N target feature maps comprise target sampling frames and target feature maps of adjacent sampling frames of the target sampling frames, and the target feature maps of the adjacent sampling frames comprise target feature maps of at least one frame of sampling frames in the last N frames of sampling frames, local early fusion can be realized by determining the target feature maps. And further, obtaining a fusion feature map according to the target feature map of the N frames of sampling frames of the current time, so that integral late fusion can be realized. The method can give consideration to the efficiency of feature extraction and the operation efficiency by means of local early fusion and integral late fusion, and improves the recognition accuracy. After the highlight prediction result of the video clips is obtained based on the fusion feature map, the highlight video clips of the video to be identified can be determined according to the highlight prediction result of each video clip, so that the highlight video clips in the video do not need to be manually identified, and time and labor are saved.
Optionally, the video recognition apparatus 1500 further includes:
the acquisition module is used for acquiring at least two different information of the first sampling frame and the second sampling frame based on the first sampling frame and the second sampling frame in the N frame sampling frames of the current time; the first sampling frame and the second sampling frame are two sampling frames adjacent to each other or separated by at least one frame sampling frame, and the at least two different information comprises at least two items of image information of the first sampling frame, image information of the second sampling frame and association information between the first sampling frame and the second sampling frame.
Optionally, the association information includes at least one of optical flow information, pixel difference information, convolution characteristic information between the first sample frame and the second sample frame.
Optionally, the acquiring module includes:
the first determining unit is used for determining an initial splicing characteristic diagram of the first sampling frame based on the first sampling frame and the second sampling frame; the time point of the first sampling frame is earlier than the time point of the second sampling frame, and the initial stitching feature map includes at least two different pieces of information.
The second determining unit is used for inputting the initial spliced characteristic diagram into a backbone network in the wonderful prediction network to obtain a target characteristic diagram; alternatively, the initial stitching feature map is taken as the target feature map.
Optionally, the first determining unit includes:
the first extraction subunit is configured to obtain optical flow information based on the first sampling frame and the second sampling frame, and extract features of the optical flow information to obtain a first feature map.
And the second extraction subunit is used for extracting the features of the first sampling frame to obtain a second feature map.
The first splicing subunit is used for carrying out splicing treatment on the first characteristic map and the second characteristic map to obtain an initial spliced characteristic map; the initial stitching feature map includes optical flow information and target image information including image information of a first sample frame and/or image information of a second sample frame.
Optionally, the first determining unit includes:
and the third extraction subunit is used for obtaining pixel difference information based on the first sampling frame and the second sampling frame, and extracting the characteristics of the pixel difference information to obtain a third characteristic diagram.
And the fourth extraction subunit is used for extracting the features of the first sampling frame to obtain a fourth feature map.
The second splicing subunit is used for carrying out splicing treatment on the third characteristic diagram and the fourth characteristic diagram to obtain an initial spliced characteristic diagram; the initial stitching feature map includes pixel difference information and target image information including image information of a first sample frame and/or image information of a second sample frame.
Optionally, the first determining unit includes:
and the first convolution subunit is used for carrying out convolution processing on the first sampling frame to obtain a fifth characteristic diagram.
And the second convolution subunit is used for carrying out convolution processing on the second sampling frame to obtain a sixth characteristic diagram.
The third splicing subunit is used for carrying out splicing treatment on the fifth characteristic diagram and the sixth characteristic diagram to obtain an initial spliced characteristic diagram; the initial stitching feature map includes convolution feature information and target image information including image information of the first sample frame and/or image information of the second sample frame.
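As one possible reading of the convolution-based stitching described by the first and second convolution subunits and the third splicing subunit, the sketch below convolves each of the two sampling frames and concatenates the resulting feature maps along the channel dimension; the layer shapes and the use of PyTorch are assumptions.

```python
import torch
import torch.nn as nn

conv_first = nn.Conv2d(3, 16, kernel_size=3, padding=1)    # convolution on the first sampling frame
conv_second = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # convolution on the second sampling frame

def initial_stitching_feature_map(first_frame: torch.Tensor, second_frame: torch.Tensor) -> torch.Tensor:
    """Concatenate the fifth and sixth feature maps into the initial stitching feature map."""
    fifth = conv_first(first_frame)                         # fifth feature map
    sixth = conv_second(second_frame)                       # sixth feature map
    return torch.cat([fifth, sixth], dim=1)                 # splicing along the channel dimension

# Example with 224x224 RGB sampling frames (sizes are illustrative only).
frame1 = torch.randn(1, 3, 224, 224)
frame2 = torch.randn(1, 3, 224, 224)
stitched = initial_stitching_feature_map(frame1, frame2)    # shape: (1, 32, 224, 224)
```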
Optionally, the target sampling frame is an L-th frame sampling frame; the video recognition apparatus 1500 further includes:
and the storage module is used for storing the target feature map of the L-th frame sampling frame and deleting the target feature map of the L-N-th frame sampling frame under the condition that L is larger than N.
Optionally, the adjacent sample frames of the target sample frame include the first N-1 frame sample frames of the target sample frame.
Optionally, the first determining module 1502 includes:
the first input unit is used for inputting the fusion feature map to a first classification sub-network in the highlight prediction network to obtain action labels and action feature information of the video clips.
And the second input unit is used for inputting the action characteristic information into a second classification sub-network in the highlight prediction network to obtain the highlight label of the video clip.
And a third determining unit for determining a highlight prediction result based on the action tag and the highlight tag.
Optionally, the second determining module 1503 includes:
a fourth determining unit for determining a start image frame and an end image frame of the highlight video clip according to the highlight prediction result of each video clip;
and a fifth determining unit for determining a highlight video clip from the video to be identified according to the start image frame and the end image frame.
The various modules in the video recognition device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
obtaining a fusion feature map according to the target feature map of the N frame sampling frames of the current time of the video to be identified; the N target feature maps comprise target sampling frames and target feature maps of adjacent sampling frames of the target sampling frames, the target feature maps of the adjacent sampling frames comprise target feature maps of at least one sampling frame of the last N sampling frames, the target feature maps are determined based on at least two different information of the sampling frames, and N is more than or equal to 2;
determining a highlight prediction result of the video clip according to the fusion feature map; the video clip comprises the N frames of sampling frames of the current time;
and determining the highlight video fragments of the video to be identified according to highlight prediction results of the video fragments of each time.
In one embodiment, the processor when executing the computer program further performs the steps of:
acquiring at least two different information of a first sampling frame and a second sampling frame based on the first sampling frame and the second sampling frame in the N frame sampling frames of the current time; the first sampling frame and the second sampling frame are two sampling frames adjacent to each other or separated by at least one frame of sampling frames, and the at least two different information comprises at least two items of image information of the first sampling frame, image information of the second sampling frame and association information between the first sampling frame and the second sampling frame.
In one embodiment, the processor when executing the computer program further performs the steps of:
the association information includes at least one of optical flow information, pixel difference information, convolution characteristic information between the first sample frame and the second sample frame.
In one embodiment, the processor when executing the computer program further performs the steps of:
determining an initial stitching feature map of the first sampled frame based on the first sampled frame and the second sampled frame; the time point of the first sampling frame is earlier than the time point of the second sampling frame, and the initial stitching feature map comprises the at least two different information; inputting the initial spliced feature map to a backbone network in a highlight prediction network to obtain the target feature map; or, taking the initial stitching feature map as the target feature map.
In one embodiment, the processor when executing the computer program further performs the steps of:
obtaining the optical flow information based on the first sampling frame and the second sampling frame, and extracting features of the optical flow information to obtain a first feature map; extracting the characteristics of the first sampling frame to obtain a second characteristic diagram; performing splicing processing on the first characteristic diagram and the second characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map includes the optical flow information and target image information including image information of the first sample frame and/or image information of the second sample frame.
In one embodiment, the processor when executing the computer program further performs the steps of:
obtaining the pixel difference information based on the first sampling frame and the second sampling frame, and extracting the characteristics of the pixel difference information to obtain a third characteristic diagram; extracting the characteristics of the first sampling frame to obtain a fourth characteristic diagram; performing splicing processing on the third characteristic diagram and the fourth characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map includes the pixel difference information and target image information including image information of the first sample frame and/or image information of the second sample frame.
In one embodiment, the processor when executing the computer program further performs the steps of:
carrying out convolution processing on the first sampling frame to obtain a fifth characteristic diagram; carrying out convolution processing on the second sampling frame to obtain a sixth feature map; performing splicing processing on the fifth characteristic diagram and the sixth characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map comprises the convolution feature information and target image information, wherein the target image information comprises the image information of the first sampling frame and/or the image information of the second sampling frame.
In one embodiment, the processor when executing the computer program further performs the steps of:
the target sampling frame is an L-th frame sampling frame; and under the condition that L is larger than N, storing the target feature map of the sampling frame of the L frame, and deleting the target feature map of the sampling frame of the L-N frame.
In one embodiment, the processor when executing the computer program further performs the steps of:
the adjacent sample frames of the target sample frame include the first N-1 frame sample frames of the target sample frame.
In one embodiment, the processor when executing the computer program further performs the steps of:
inputting the fusion feature map to a first classification sub-network in the highlight prediction network to obtain action labels and action feature information of the video clips; inputting the action characteristic information into a second classification sub-network in the highlight prediction network to obtain a highlight label of the video clip; the highlight prediction result is determined based on the action tag and the highlight tag.
In one embodiment, the processor when executing the computer program further performs the steps of:
determining a starting image frame and an ending image frame of the highlight video clip according to highlight prediction results of the video clip for each time; and determining the highlight video fragment from the video to be identified according to the starting image frame and the ending image frame.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
obtaining a fusion feature map according to the target feature map of the N frame sampling frames of the current time of the video to be identified; the N target feature maps comprise target sampling frames and target feature maps of adjacent sampling frames of the target sampling frames, the target feature maps of the adjacent sampling frames comprise target feature maps of at least one sampling frame of the last N sampling frames, the target feature maps are determined based on at least two different information of the sampling frames, and N is more than or equal to 2;
determining a highlight prediction result of the video clip according to the fusion feature map; the video clip comprises the N frames of sampling frames of the current time;
and determining the highlight video fragments of the video to be identified according to highlight prediction results of the video fragments of each time.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring at least two different information of a first sampling frame and a second sampling frame based on the first sampling frame and the second sampling frame in the N frame sampling frames of the current time; the first sampling frame and the second sampling frame are two sampling frames adjacent to each other or separated by at least one frame of sampling frames, and the at least two different information comprises at least two items of image information of the first sampling frame, image information of the second sampling frame and association information between the first sampling frame and the second sampling frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the association information includes at least one of optical flow information, pixel difference information, convolution characteristic information between the first sample frame and the second sample frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining an initial stitching feature map of the first sampled frame based on the first sampled frame and the second sampled frame; the time point of the first sampling frame is earlier than the time point of the second sampling frame, and the initial stitching feature map comprises the at least two different information; inputting the initial spliced feature map to a backbone network in a highlight prediction network to obtain the target feature map; or, taking the initial stitching feature map as the target feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
obtaining the optical flow information based on the first sampling frame and the second sampling frame, and extracting features of the optical flow information to obtain a first feature map; extracting the characteristics of the first sampling frame to obtain a second characteristic diagram; performing splicing processing on the first characteristic diagram and the second characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map includes the optical flow information and target image information including image information of the first sample frame and/or image information of the second sample frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
obtaining the pixel difference information based on the first sampling frame and the second sampling frame, and extracting the characteristics of the pixel difference information to obtain a third characteristic diagram; extracting the characteristics of the first sampling frame to obtain a fourth characteristic diagram; performing splicing processing on the third characteristic diagram and the fourth characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map includes the pixel difference information and target image information including image information of the first sample frame and/or image information of the second sample frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
carrying out convolution processing on the first sampling frame to obtain a fifth characteristic diagram; carrying out convolution processing on the second sampling frame to obtain a sixth feature map; performing splicing processing on the fifth characteristic diagram and the sixth characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map comprises the convolution feature information and target image information, wherein the target image information comprises the image information of the first sampling frame and/or the image information of the second sampling frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the target sampling frame is an L-th frame sampling frame; and under the condition that L is larger than N, storing the target feature map of the sampling frame of the L frame, and deleting the target feature map of the sampling frame of the L-N frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the adjacent sample frames of the target sample frame include the first N-1 frame sample frames of the target sample frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the fusion feature map to a first classification sub-network in the highlight prediction network to obtain action labels and action feature information of the video clips; inputting the action characteristic information into a second classification sub-network in the highlight prediction network to obtain a highlight label of the video clip; the highlight prediction result is determined based on the action tag and the highlight tag.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining a starting image frame and an ending image frame of the highlight video clip according to highlight prediction results of the video clip for each time; and determining the highlight video fragment from the video to be identified according to the starting image frame and the ending image frame.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
obtaining a fusion feature map according to the target feature map of the N frame sampling frames of the current time of the video to be identified; the N target feature maps comprise target sampling frames and target feature maps of adjacent sampling frames of the target sampling frames, the target feature maps of the adjacent sampling frames comprise target feature maps of at least one sampling frame of the last N sampling frames, the target feature maps are determined based on at least two different information of the sampling frames, and N is more than or equal to 2;
determining a highlight prediction result of the video clip according to the fusion feature map; the video clip comprises the N frames of sampling frames of the current time;
and determining the highlight video fragments of the video to be identified according to highlight prediction results of the video fragments of each time.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring at least two different information of a first sampling frame and a second sampling frame based on the first sampling frame and the second sampling frame in the N frame sampling frames of the current time; the first sampling frame and the second sampling frame are two sampling frames adjacent to each other or separated by at least one frame of sampling frames, and the at least two different information comprises at least two items of image information of the first sampling frame, image information of the second sampling frame and association information between the first sampling frame and the second sampling frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the association information includes at least one of optical flow information, pixel difference information, convolution characteristic information between the first sample frame and the second sample frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining an initial stitching feature map of the first sampled frame based on the first sampled frame and the second sampled frame; the time point of the first sampling frame is earlier than the time point of the second sampling frame, and the initial stitching feature map comprises the at least two different information; inputting the initial spliced feature map to a backbone network in a highlight prediction network to obtain the target feature map; or, taking the initial stitching feature map as the target feature map.
In one embodiment, the computer program when executed by the processor further performs the steps of:
obtaining the optical flow information based on the first sampling frame and the second sampling frame, and extracting features of the optical flow information to obtain a first feature map; extracting the characteristics of the first sampling frame to obtain a second characteristic diagram; performing splicing processing on the first characteristic diagram and the second characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map includes the optical flow information and target image information including image information of the first sample frame and/or image information of the second sample frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
obtaining the pixel difference information based on the first sampling frame and the second sampling frame, and extracting the characteristics of the pixel difference information to obtain a third characteristic diagram; extracting the characteristics of the first sampling frame to obtain a fourth characteristic diagram; performing splicing processing on the third characteristic diagram and the fourth characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map includes the pixel difference information and target image information including image information of the first sample frame and/or image information of the second sample frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
carrying out convolution processing on the first sampling frame to obtain a fifth characteristic diagram; carrying out convolution processing on the second sampling frame to obtain a sixth feature map; performing splicing processing on the fifth characteristic diagram and the sixth characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map comprises the convolution feature information and target image information, wherein the target image information comprises the image information of the first sampling frame and/or the image information of the second sampling frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the target sampling frame is an L-th frame sampling frame; and under the condition that L is larger than N, storing the target feature map of the sampling frame of the L frame, and deleting the target feature map of the sampling frame of the L-N frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
the adjacent sample frames of the target sample frame include the first N-1 frame sample frames of the target sample frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the fusion feature map to a first classification sub-network in the highlight prediction network to obtain action labels and action feature information of the video clips; inputting the action characteristic information into a second classification sub-network in the highlight prediction network to obtain a highlight label of the video clip; the highlight prediction result is determined based on the action tag and the highlight tag.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining a starting image frame and an ending image frame of the highlight video clip according to highlight prediction results of the video clip for each time; and determining the highlight video fragment from the video to be identified according to the starting image frame and the ending image frame.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (15)

1. A method of video recognition, the method comprising:
obtaining a fusion feature map according to the target feature map of the N frame sampling frames of the current time of the video to be identified; the N target feature maps comprise target sampling frames and target feature maps of adjacent sampling frames of the target sampling frames, the target feature maps of the adjacent sampling frames comprise target feature maps of at least one sampling frame of the last N sampling frames, the target feature maps are determined based on at least two different information of the sampling frames, and N is more than or equal to 2;
determining a highlight prediction result of the video clip according to the fusion feature map; the video clip comprises the N frames of sampling frames of the current time;
and determining the highlight video fragments of the video to be identified according to highlight prediction results of the video fragments of each time.
2. The method according to claim 1, wherein the method further comprises:
acquiring at least two different information of a first sampling frame and a second sampling frame based on the first sampling frame and the second sampling frame in the N frame sampling frames of the current time;
the first sampling frame and the second sampling frame are two sampling frames adjacent to each other or separated by at least one frame of sampling frames, and the at least two different information comprises at least two items of image information of the first sampling frame, image information of the second sampling frame and association information between the first sampling frame and the second sampling frame.
3. The method of claim 2, wherein the association information comprises at least one of optical flow information, pixel difference information, convolution characteristic information between the first and second sampled frames.
4. A method according to claim 3, wherein said obtaining at least two different information of said first and second sampled frames based on a first and second sampled frame of said current N-frame sampled frames comprises:
determining an initial stitching feature map of the first sampled frame based on the first sampled frame and the second sampled frame; the time point of the first sampling frame is earlier than the time point of the second sampling frame, and the initial stitching feature map comprises the at least two different information;
inputting the initial spliced feature map to a backbone network in a highlight prediction network to obtain the target feature map; or, taking the initial stitching feature map as the target feature map.
5. The method of claim 4, wherein the determining an initial stitching profile for the first sample frame based on the first sample frame and the second sample frame comprises:
obtaining the optical flow information based on the first sampling frame and the second sampling frame, and extracting features of the optical flow information to obtain a first feature map;
extracting the characteristics of the first sampling frame to obtain a second characteristic diagram;
performing splicing processing on the first characteristic diagram and the second characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map includes the optical flow information and target image information including image information of the first sample frame and/or image information of the second sample frame.
6. The method of claim 4, wherein the determining an initial stitching profile for the first sample frame based on the first sample frame and the second sample frame comprises:
obtaining the pixel difference information based on the first sampling frame and the second sampling frame, and extracting the characteristics of the pixel difference information to obtain a third characteristic diagram;
extracting the characteristics of the first sampling frame to obtain a fourth characteristic diagram;
performing splicing processing on the third characteristic diagram and the fourth characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map includes the pixel difference information and target image information including image information of the first sample frame and/or image information of the second sample frame.
7. The method of claim 4, wherein the determining an initial stitching profile for the first sample frame based on the first sample frame and the second sample frame comprises:
carrying out convolution processing on the first sampling frame to obtain a fifth characteristic diagram;
carrying out convolution processing on the second sampling frame to obtain a sixth feature map;
performing splicing processing on the fifth characteristic diagram and the sixth characteristic diagram to obtain the initial spliced characteristic diagram; the initial stitching feature map comprises the convolution feature information and target image information, wherein the target image information comprises the image information of the first sampling frame and/or the image information of the second sampling frame.
8. The method of any of claims 1-7, wherein the target sample frame is an L-th frame sample frame; the method further comprises the steps of:
and under the condition that L is larger than N, storing the target feature map of the sampling frame of the L frame, and deleting the target feature map of the sampling frame of the L-N frame.
9. The method of any one of claims 1 to 7, wherein the adjacent sampling frames of the target sampling frame comprise the N-1 sampling frames preceding the target sampling frame.
10. The method of any one of claims 2 to 7, wherein the determining a highlight prediction result of the video clip according to the fusion feature map comprises:
inputting the fusion feature map into a first classification sub-network of the highlight prediction network to obtain an action label and action feature information of the video clip;
inputting the action feature information into a second classification sub-network of the highlight prediction network to obtain a highlight label of the video clip;
determining the highlight prediction result based on the action label and the highlight label (an illustrative sketch follows).
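A hypothetical sketch of the two-stage head described in claim 10: the first sub-network yields an action label and intermediate action features, and the second sub-network maps those features to a highlight label. Layer widths, the number of action classes, and the sigmoid output are assumptions.

import torch
import torch.nn as nn

class ActionHead(nn.Module):
    # First classification sub-network: fusion feature map -> action label + action features.
    def __init__(self, in_channels: int = 64, num_actions: int = 10, feat_dim: int = 128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.feature = nn.Linear(in_channels, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_actions)

    def forward(self, fused: torch.Tensor):
        x = self.pool(fused).flatten(1)
        action_feat = torch.relu(self.feature(x))     # action feature information
        action_logits = self.classifier(action_feat)  # scores over action labels
        return action_logits, action_feat

class HighlightHead(nn.Module):
    # Second classification sub-network: action features -> highlight label.
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, action_feat: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.classifier(action_feat))

fused = torch.randn(1, 64, 16, 16)  # assumed shape of the fusion feature map
action_logits, action_feat = ActionHead()(fused)
highlight_score = HighlightHead()(action_feat)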
11. The method of claim 10, wherein the determining the highlight video clip of the video to be identified according to the highlight prediction results of the video clips of each time comprises:
determining a starting image frame and an ending image frame of the highlight video clip according to the highlight prediction result of the video clip of each time;
determining the highlight video clip from the video to be identified according to the starting image frame and the ending image frame (see the sketch below).
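A minimal sketch of claim 11's boundary selection: treat each round's prediction as a (start_frame, end_frame, score) triple and take the first and last segments whose score clears a threshold as the clip boundaries. The segment format and the threshold are assumptions.

from typing import List, Optional, Tuple

def locate_highlight(segments: List[Tuple[int, int, float]],
                     threshold: float = 0.5) -> Optional[Tuple[int, int]]:
    start: Optional[int] = None
    end: Optional[int] = None
    for seg_start, seg_end, score in segments:
        if score >= threshold:
            if start is None:
                start = seg_start  # starting image frame of the highlight clip
            end = seg_end          # keep extending to the last qualifying segment
    return None if start is None else (start, end)

# Example: three rounds of predictions over consecutive segments.
print(locate_highlight([(0, 30, 0.2), (30, 60, 0.8), (60, 90, 0.7)]))  # -> (30, 90)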
12. A video identification device, the device comprising:
an acquisition module, configured to acquire a fusion feature map according to the target feature maps of the N sampling frames of the current time of the video to be identified, wherein the N target feature maps comprise the target feature map of a target sampling frame and the target feature maps of adjacent sampling frames of the target sampling frame, the target feature maps of the adjacent sampling frames comprise the target feature map of at least one sampling frame among the N sampling frames of the last time, each target feature map is determined based on at least two kinds of different information of the corresponding sampling frame, and N is greater than or equal to 2;
a first determining module, configured to determine a highlight prediction result of a video clip according to the fusion feature map, wherein the video clip comprises the N sampling frames of the current time;
a second determining module, configured to determine the highlight video clip of the video to be identified according to the highlight prediction results of the video clips of each time (a skeletal sketch follows).
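The apparatus of claim 12 mirrors the method steps as three modules. A skeletal sketch, with the module internals left as injected callables since the claim does not fix them:

class VideoIdentificationDevice:
    # Three modules mirroring claim 12: acquisition, first determining, second determining.
    def __init__(self, acquire, predict, locate):
        self.acquire = acquire  # acquisition module: N target feature maps -> fusion feature map
        self.predict = predict  # first determining module: fusion feature map -> highlight prediction result
        self.locate = locate    # second determining module: all per-round results -> highlight clip bounds
        self.results = []

    def process_round(self, target_feature_maps, segment_bounds):
        fused = self.acquire(target_feature_maps)
        score = self.predict(fused)
        self.results.append((segment_bounds[0], segment_bounds[1], score))

    def highlight_clip(self):
        return self.locate(self.results)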
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer program.
14. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202310926660.1A 2023-07-26 2023-07-26 Video identification method, device, computer equipment and storage medium Pending CN116866662A (en)

Priority Applications (1)

Application Number: CN202310926660.1A
Publication: CN116866662A (en)
Priority Date: 2023-07-26
Filing Date: 2023-07-26
Title: Video identification method, device, computer equipment and storage medium

Publications (1)

Publication Number: CN116866662A
Publication Date: 2023-10-10

Family

ID=88223333

Country Status (1)

Country: CN (1)
Link: CN116866662A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination