CN109121022B - Method and apparatus for marking video segments - Google Patents

Info

Publication number
CN109121022B
Authority
CN
China
Prior art keywords
video
identification information
image
sample
segment
Legal status
Active
Application number
CN201811139639.2A
Other languages
Chinese (zh)
Other versions
CN109121022A (en)
Inventor
刘霄
杨凡
文石磊
柏提
李鑫
赵翔
李旭斌
丁二锐
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811139639.2A
Publication of CN109121022A
Application granted
Publication of CN109121022B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the application disclose a method and an apparatus for marking video segments. One embodiment of the method comprises: acquiring a video feature information sequence from a video to be marked; grouping a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence; for each video feature segment in the video feature segment sequence, importing the video feature segment into a pre-trained video tagging model to obtain identification information corresponding to the video feature segment; and in response to obtaining the identification information sequence corresponding to the video feature segment sequence, marking the video segments of the video to be marked by the identification information in the identification information sequence. This embodiment improves the efficiency and accuracy of marking video segments.

Description

Method and apparatus for marking video segments
Technical Field
Embodiments of the application relate to the field of computer technology, and in particular to a method and an apparatus for marking video segments.
Background
Video generally integrates image information, audio information and other information, and has become an important carrier through which users acquire information. Video websites can provide users with massive videos of various types and contents, so that users can acquire image, audio and other information at the same time through a video; this improves the effectiveness with which users acquire information and facilitates the spread of videos.
Disclosure of Invention
The embodiment of the application provides a method and a device for marking video clips.
In a first aspect, an embodiment of the present application provides a method for marking video segments, the method comprising: extracting, at set frame intervals, an image sequence and an audio information sequence corresponding to the image sequence from a video to be marked, and establishing a correspondence between each image in the image sequence and the audio information of the corresponding image in the audio information sequence to obtain a video feature information sequence, wherein the video feature information is used for characterizing image features and audio features of the video to be marked, an image feature being content contained in an image and an audio feature being specific audio information in the audio; grouping a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence; for each video feature segment in the video feature segment sequence, importing the video feature segment into a pre-trained video tagging model to obtain identification information corresponding to the video feature segment, wherein the video tagging model is used for matching identification information to a video feature segment, and the identification information is used for indicating that the video feature segment is in a start state, an intermediate state or a termination state of an event; and in response to obtaining the identification information sequence corresponding to the video feature segment sequence, marking the video segments of the video to be marked by the identification information in the identification information sequence.
In some embodiments, the video tagging model is trained by: obtaining a plurality of sample video feature segments and sample identification information corresponding to each sample video feature segment in the plurality of sample video feature segments; and training the video tagging model with each sample video feature segment in the plurality of sample video feature segments as input and the sample identification information corresponding to each sample video feature segment as output.
In some embodiments, the training, with each of the plurality of sample video feature segments as input and the sample identification information corresponding to each of the plurality of sample video feature segments as output, to obtain the video tagging model includes performing the following training steps: sequentially inputting each sample video feature segment of the plurality of sample video feature segments into an initialized video tagging model to obtain prediction identification information corresponding to each sample video feature segment; comparing the prediction identification information corresponding to each sample video feature segment with the sample identification information corresponding to that sample video feature segment to obtain the prediction accuracy of the initialized video tagging model; determining whether the prediction accuracy is greater than a preset accuracy threshold; and if so, taking the initialized video tagging model as the trained video tagging model.
In some embodiments, the training, with each of the plurality of sample video feature segments as input and the sample identification information corresponding to each of the plurality of sample video feature segments as output, to obtain the video tagging model further includes: in response to the accuracy being not greater than the preset accuracy threshold, adjusting the parameters of the initialized video tagging model and continuing to perform the training steps.
In some embodiments, the sequentially inputting each of the plurality of sample video feature segments into an initialized video tagging model to obtain the prediction identification information corresponding to each of the plurality of sample video feature segments includes: for the images contained in the sample video feature segment and the audio information corresponding to the images, performing image recognition on the images to obtain image content information corresponding to the images, and performing audio recognition on the audio information to obtain audio content information corresponding to the audio information; and determining the prediction identification information of the sample video feature segment to be the start state or the termination state of the event in response to the image content information of two adjacent frames of images in the image content information sequence being different and the audio content information corresponding to the two adjacent frames of images in the audio content information sequence being different.
In some embodiments, the determining that the prediction identification information of the sample video feature segment is the start state or the termination state of the event includes: in response to the image content information of the previous image of the two adjacent images including the specified image content, the audio content information corresponding to the previous image including the specified audio content, the next image not including the specified image content, and the audio content information corresponding to the next image not including the specified audio content, setting the prediction identification information of the sample video feature segment as the termination state of the event.
In some embodiments, the determining that the prediction identification information of the sample video feature segment is the start state or the termination state of the event includes: in response to the image content information of the next frame image of the two adjacent frames of images including the specified image content, the audio content information corresponding to the next frame image including the specified audio content, the previous frame image not including the specified image content, and the audio content information corresponding to the previous frame image not including the specified audio content, setting the prediction identification information of the sample video feature segment as the start state of the event.
In some embodiments, the marking the video segment of the video to be marked by the identification information in the identification information sequence includes: and for the identification information in the identification information sequence, when the identification information is in a starting state, marking the video characteristic segment corresponding to the identification information between the identification information and the identification information which is in a termination state next to the identification information as a target video segment.
In a second aspect, an embodiment of the present application provides an apparatus for marking video segments, including: a video feature information sequence acquisition unit, which comprises an information extraction subunit configured to extract, at set frame intervals, an image sequence and an audio information sequence corresponding to the image sequence from a video to be marked, and a video feature information sequence acquisition subunit configured to establish a correspondence between each image in the image sequence and the audio information of the corresponding image in the audio information sequence to obtain a video feature information sequence, wherein the video feature information is used for characterizing image features and audio features of the video to be marked, an image feature being content contained in an image and an audio feature being specific audio information in the audio; a video feature segment sequence acquisition unit configured to group a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence; an identification information acquisition unit configured to, for each video feature segment in the video feature segment sequence, import the video feature segment into a pre-trained video tagging model to obtain identification information corresponding to the video feature segment, wherein the video tagging model is used for matching identification information to a video feature segment, and the identification information is used for indicating that the video feature segment is in a start state, an intermediate state or a termination state of an event; and a marking unit configured to, in response to obtaining the identification information sequence corresponding to the video feature segment sequence, mark the video segments of the video to be marked by the identification information in the identification information sequence.
In some embodiments, the apparatus further comprises a videomark model training unit configured to train a videomark model, the videomark model training unit comprising: the system comprises a sample acquisition subunit, a video processing subunit and a video processing unit, wherein the sample acquisition subunit is configured to acquire a plurality of sample video characteristic fragments and sample identification information corresponding to each sample video characteristic fragment in the plurality of sample video characteristic fragments; and the video marking model training subunit is configured to take each sample video feature segment in the plurality of sample video feature segments as input, take the sample identification information corresponding to each sample video feature segment in the plurality of sample video feature segments as output, and train to obtain the video marking model.
In some embodiments, the videomark model training subunit includes: a video tagging model training module, configured to sequentially input each of the plurality of sample video feature segments to an initialized video tagging model, obtain prediction identification information corresponding to each of the plurality of sample video feature segments, compare the prediction identification information corresponding to each of the plurality of sample video feature segments with the sample identification information corresponding to the sample video feature segment, obtain a prediction accuracy of the initialized video tagging model, determine whether the prediction accuracy is greater than a preset accuracy threshold, and if so, take the initialized video tagging model as a trained video tagging model.
In some embodiments, the aforementioned video tag model training subunit further includes: and the parameter adjusting module is used for responding to the condition that the accuracy is not greater than the preset accuracy threshold, adjusting the parameters of the initialized video marker model and returning to execute the video marker model training module.
In some embodiments, the video tag model training module comprises: a content information obtaining sub-module, configured to perform image recognition on the image to obtain image content information corresponding to the image, and perform audio recognition on the audio information to obtain audio content information corresponding to the audio information, for the image contained in the sample video feature segment and the audio information corresponding to the image; and the prediction identification information setting sub-module is used for responding to the fact that the image content information of the two adjacent frames of images in the image content information sequence is different and the audio content information corresponding to the two adjacent frames of images in the audio content information sequence is different and is configured to determine the prediction identification information of the sample video characteristic segment as the starting state or the ending state of the event.
In some embodiments, the prediction identification information setting sub-module includes: and the first prediction identification information setting module is used for setting the prediction identification information of the sample video characteristic segment as the termination state of the event in response to the fact that the image content information of the previous frame image in the two adjacent frames of images comprises the specified image content, the audio content information corresponding to the previous frame image comprises the specified audio content, the next frame image does not comprise the specified image content, and the audio content information corresponding to the next frame image does not comprise the specified audio content.
In some embodiments, the prediction identification information setting sub-module includes: and the second prediction identification information setting module is used for responding to the fact that the image content information of the next frame image in the two adjacent frames of images comprises the specified image content, the audio content information corresponding to the next frame image comprises the specified audio content, the previous frame image does not comprise the specified image content, and the audio content information corresponding to the previous frame image does not comprise the specified audio content, and is configured to set the prediction identification information of the sample video characteristic segment as the starting state of the event.
In some embodiments, the marking unit includes: and the marking subunit is configured to mark, for the identification information in the identification information sequence, when the identification information is in a starting state, a video feature segment corresponding to the identification information between the identification information and the identification information which is in a terminating state next to the identification information as a target video segment.
In a third aspect, an embodiment of the present application provides a server, including: one or more processors; memory having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to perform the method for marking video segments of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium on which a computer program is stored which, when executed by a processor, implements the method for marking video segments of the first aspect.
The method and the apparatus for marking video segments provided by the embodiments of the application first acquire a video feature information sequence from a video to be marked; then group a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence; then import each video feature segment into a pre-trained video tagging model to obtain identification information corresponding to the video feature segment; and finally mark the video segments of the video to be marked by the identification information in the identification information sequence. The efficiency and accuracy of marking video segments are thereby improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for marking video segments in accordance with the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for marking video segments according to the present application;
FIG. 4 is a flow diagram for one embodiment of a video tagging model training method according to the present application;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for marking video segments in accordance with the present application;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which the method for marking video segments or the apparatus for marking video segments of the embodiments of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include video servers 101, 102, 103, a network 104, and a videomark server 105. Network 104 is the medium used to provide communication links between video servers 101, 102, 103 and videomark server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The video servers 101, 102, 103 interact with the videomark server 105 over the network 104 to receive or send messages or the like. The video servers 101, 102, 103 may have various video applications installed thereon, such as a video classification application, a video playing application, a video search application, a video screening application, a video recommendation application, and the like.
The video servers 101, 102, 103 may be hardware or software. When the video servers 101, 102, 103 are hardware, they may be various electronic devices having display screens and supporting video playback, including but not limited to tablet computers, laptop computers, desktop computers, and the like. When the video servers 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as a plurality of pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or software module, which is not specifically limited herein.
Videomark server 105 may be a server that provides various services, such as a server that marks videos on video servers 101, 102, 103. The video tagging server 105 may analyze and process the received data of the video to be tagged, so as to tag the video segments in the video to be tagged.
It should be noted that the method for marking video segments provided by the embodiment of the present application is generally performed by the videomark server 105, and accordingly, the apparatus for marking video segments is generally disposed in the videomark server 105.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or may be implemented as a single software or software module, and is not limited specifically herein.
It should be understood that the number of video servers, networks, and videomark servers in fig. 1 is illustrative only. There may be any number of video servers, networks, and videomark servers, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for marking video segments in accordance with the present application is shown. The method for marking video segments comprises the following steps:
step 201, acquiring a video characteristic information sequence from a video to be marked.
In this embodiment, an execution body of the method for marking video segments (e.g., the video tagging server 105 shown in fig. 1) may acquire the video to be marked from a video server through a wired or wireless connection. The video to be marked may relate to sports, concerts, dancing, basketball, football and the like, and may also be another type of video, which is not enumerated here one by one. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connections now known or developed in the future.
To promote a video, technicians at existing video websites may extract a specific segment of the video (for example, a singing or dancing segment within the whole video) and display it as a highlight of the video on the video browsing page, so that users can choose the video according to the highlight. In this way the highlight content of the video can be displayed and users can obtain videos effectively. However, the existing approach usually requires technicians to label the video segments manually, which makes labeling inefficient and prone to inaccuracy.
For this reason, after acquiring the video to be marked from the video servers 101, 102, 103, the executing body of the present application may acquire a video feature information sequence from the video to be marked. The video feature information may be used to characterize image features and audio features of the video to be marked. The image feature may be the content contained by the image (e.g., may be a person, dance, etc.); the audio feature may be specific audio information in the audio (e.g., may be singing, drum sound, etc.).
In some optional implementation manners of this embodiment, the obtaining a video feature information sequence from a video to be marked may include the following steps:
First, extracting, at set frame intervals, an image sequence and an audio information sequence corresponding to the image sequence from the video to be marked.
A video may be composed of temporally successive images and the corresponding audio information, and usually contains a large number of images. In order to mark video segments quickly, the execution subject may extract an image sequence from the video to be marked at set frame intervals. Similarly, the execution subject may extract, from the audio information of the video to be marked, an audio information sequence corresponding to the image sequence. That is, the image sequence and the audio information sequence are obtained at the same frequency or the same time points, so that the images in the image sequence and the audio information in the audio information sequence are in one-to-one correspondence.
Second, establishing a correspondence between each image in the image sequence and the audio information of the corresponding image in the audio information sequence to obtain a video feature information sequence.
As can be seen from the above description, the images in the image sequence and the audio information in the audio information sequence are in one-to-one correspondence. The execution subject may establish a correspondence between each image in the image sequence and the audio information of the corresponding image in the audio information sequence; that is, each image in the image sequence has audio information in the audio information sequence that corresponds to that image. The execution subject may combine the images and audio information, after the correspondence is established, into a video feature information sequence. The video feature information sequence can thus be regarded as video information extracted at set time intervals or at a set frequency.
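For illustration only, the following is a minimal sketch of this extraction-and-pairing step in Python, assuming the audio waveform has already been demuxed from the video by an external tool; the helper name, the fixed frame interval and the dictionary layout are illustrative assumptions rather than part of the application.
```python
import cv2   # assumed available; OpenCV is only one possible way to read frames

def extract_feature_sequence(video_path, audio, audio_rate, frame_interval=25):
    """Sample one image every `frame_interval` frames and pair it with the audio
    window covering the same span of time.  `audio` is assumed to be the video's
    waveform as a 1-D array, demuxed beforehand by an external tool."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    sequence = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_interval == 0:
            start = int(index / fps * audio_rate)              # first audio sample of this span
            length = int(frame_interval / fps * audio_rate)    # samples covered by the interval
            sequence.append({"image": frame,
                             "audio": audio[start:start + length],
                             "time": index / fps})
        index += 1
    cap.release()
    return sequence   # the "video feature information sequence"
```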
Step 202, grouping a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence.
Each piece of video feature information in the video feature information sequence usually contains video information for only a short period of time. In order to improve the accuracy and efficiency of marking video segments, the execution subject may group a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence. For example, the execution subject may divide the video feature information sequence into a video feature segment sequence by taking every 5 adjacent pieces of video feature information in the video feature information sequence as one group.
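A minimal sketch of this grouping step follows; the group size of 5 is only the example given above, and keeping a shorter trailing group is an assumption.
```python
def group_into_segments(feature_sequence, group_size=5):
    """Group `group_size` adjacent pieces of video feature information into one
    video feature segment; a shorter final group is kept as well."""
    return [feature_sequence[i:i + group_size]
            for i in range(0, len(feature_sequence), group_size)]

# e.g. 12 feature entries with group_size=5 yield segments of lengths 5, 5 and 2
```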
Step 203, for the video feature segment in the video feature segment sequence, importing the video feature segment into a pre-trained video tagging model to obtain identification information corresponding to the video feature segment.
After the video feature segment sequence is obtained, the execution subject can import the video feature segments in the video feature segment sequence into a pre-trained video tagging model to obtain identification information corresponding to each video feature segment. The video tagging model is used for matching identification information to a video feature segment, and the identification information may be used for indicating that the video feature segment is in a start state, an intermediate state or a termination state of an event. Here, an event may be regarded as a highlight video or a video of other content. For example, if the video to be marked is a movie, the event may be a car-racing segment, a fighting segment, or the like in the video to be marked. For different videos to be marked, the event may be a video of different content.
In some optional implementations of this embodiment, the video tag model is trained by the following steps:
the method comprises the steps of firstly, obtaining a plurality of sample video characteristic fragments and sample identification information corresponding to each sample video characteristic fragment in the plurality of sample video characteristic fragments.
In order to quickly obtain the identification information of the video feature segment through the video tagging model, the execution subject may first obtain a plurality of sample video feature segments and sample identification information corresponding to each of the plurality of sample video feature segments. The sample identification information may be set by a technician according to experience on the sample video feature segment.
And secondly, taking each sample video characteristic segment in the plurality of sample video characteristic segments as input, taking sample identification information corresponding to each sample video characteristic segment in the plurality of sample video characteristic segments as output, and training to obtain a video marking model.
The execution subject may obtain the video tagging model by training, through a plurality of intelligent algorithms (for example, a deep learning network, a neural network, or the like), with each of the plurality of sample video feature segments as an input, and with sample identification information corresponding to each of the plurality of sample video feature segments as an output.
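The application does not prescribe a data format for training; the sketch below shows one possible way of turning the sample video feature segments and their sample identification information into input/output pairs, where the label encoding and the per-frame feature vectors are assumptions.
```python
import numpy as np

STATE_TO_INDEX = {"start": 0, "intermediate": 1, "termination": 2}   # assumed label encoding

def build_training_pairs(sample_segments, sample_identifications):
    """Turn each sample video feature segment into one input feature vector and its
    manually set sample identification information into a class index (the output)."""
    inputs, outputs = [], []
    for segment, ident in zip(sample_segments, sample_identifications):
        # A segment is assumed to be a list of equal-length per-frame feature vectors;
        # concatenating them is only one simple way of forming a fixed-size model input.
        inputs.append(np.concatenate([np.asarray(f, dtype=np.float32).ravel()
                                      for f in segment]))
        outputs.append(STATE_TO_INDEX[ident])
    return np.stack(inputs), np.asarray(outputs, dtype=np.int64)
```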
Step 204, in response to obtaining the identification information sequence corresponding to the video feature segment sequence, marking the video segments of the video to be marked by the identification information in the identification information sequence.
Each video feature segment in the video feature segment sequence thus obtains corresponding identification information. Arranging the identification information in the order of the video feature segments in the video feature segment sequence yields the identification information sequence. The execution subject then marks the video segments of the video to be marked by the identification information, thereby obtaining the video corresponding to the event.
In some optional implementation manners of this embodiment, the marking a video segment of the video to be marked by the identification information in the identification information sequence may include:
and for the identification information in the identification information sequence, when the identification information is in a starting state, marking the video characteristic segment corresponding to the identification information between the identification information and the identification information which is in a termination state next to the identification information as a target video segment.
As described above, the video tagging model is used for matching identification information to a video feature segment, and the identification information is used for identifying that the video feature segment is in a start state, an intermediate state or a termination state of an event. In general, an event in a video to be marked is a complete process from a start state, through intermediate states, to a final termination state; that is, the identification information sequence contains at least one pair consisting of identification information in the start state and identification information in the termination state. The start and the end of an event are usually short, and the process in between usually lasts longer than the start state and the termination state, so a pair of start-state and termination-state identification information usually has identification information of a plurality of consecutive intermediate states between them. When a piece of identification information is in the start state, the execution subject marks the video feature segments corresponding to the identification information between that identification information and the next identification information in the termination state as a target video segment. After marking the video segment, the execution subject may send the target video segment or related information of the target video segment (for example, the time information of the target video segment in the video to be marked) to the corresponding video server. The video server may set display information for the video to be marked according to the target video segment or its related information, so that the highlight of the video to be marked is displayed to the user through the display information. The display information may be the target video segment itself or a partial video captured from the target video segment, as determined by actual needs.
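A minimal sketch of this pairing rule over the identification information sequence follows; the integer encoding of the three states is an assumption.
```python
START, INTERMEDIATE, TERMINATION = 0, 1, 2   # assumed encoding of the three states

def mark_target_segments(id_sequence):
    """Pair each start-state identification with the next termination-state
    identification and return the index ranges of the marked target segments."""
    targets, open_start = [], None
    for i, state in enumerate(id_sequence):
        if state == START and open_start is None:
            open_start = i
        elif state == TERMINATION and open_start is not None:
            targets.append((open_start, i))   # segment indices from start to end, inclusive
            open_start = None
    return targets

# e.g. [INTERMEDIATE, START, INTERMEDIATE, INTERMEDIATE, TERMINATION, INTERMEDIATE] -> [(1, 4)]
```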
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for marking a video segment according to the present embodiment. In the application scenario of fig. 3, the video tagging server 105 acquires the video to be tagged from the video server 103 through the network 104: and A, video. The videomark server 105 then obtains a sequence of video feature information from the a video. And grouping the adjacent video characteristic information with the set quantity in the video characteristic information sequence to obtain a video characteristic segment sequence. For the video feature segment in the video feature segment sequence, the video tagging server 105 imports the video feature segment into a video tagging model, and obtains identification information corresponding to the video feature segment. In response to obtaining the identification information sequence corresponding to the video feature segment sequence, the video tagging server 105 tags the video segment of the video to be tagged with the identification information in the identification information sequence, so as to obtain the video segment corresponding to the event in the a video.
The method provided by the above embodiment of the present application first obtains a video feature information sequence from the video to be marked; then groups a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence; then imports each video feature segment into a pre-trained video tagging model to obtain identification information corresponding to the video feature segment; and finally marks the video segments of the video to be marked by the identification information in the identification information sequence. The efficiency and accuracy of marking video segments are thereby improved.
With further reference to FIG. 4, a flow 400 of one embodiment of a video tagging model training method is shown. The process 400 of the video tag model training method includes the following steps:
step 401, obtaining a plurality of sample video feature fragments and sample identification information corresponding to each sample video feature fragment in the plurality of sample video feature fragments.
In this embodiment, an executing entity (for example, the video tag server 105 shown in fig. 1) on which the video tag model training method operates may obtain the plurality of sample video feature fragments and the sample identification information corresponding to each of the plurality of sample video feature fragments through a wired connection manner or a wireless connection manner.
In order to quickly obtain the identification information of the video feature segment through the video tagging model, the execution subject may first obtain a plurality of sample video feature segments and sample identification information corresponding to each of the plurality of sample video feature segments. The sample identification information may be set by a technician according to experience on the sample video feature segment.
Step 402, sequentially inputting each sample video feature segment of the plurality of sample video feature segments to an initialized video marker model, and obtaining prediction identification information corresponding to each sample video feature segment of the plurality of sample video feature segments.
Based on each of the plurality of sample video feature segments obtained in step 401, the execution subject may sequentially input each sample video feature segment into the initialized video tagging model to obtain the prediction identification information corresponding to each sample video feature segment. Here, the execution subject may input each sample video feature segment at the input side of the initialized video tagging model, have it processed in turn by the parameters of each layer of the initialized video tagging model, and obtain the output at the output side of the initialized video tagging model, the information output at the output side being the prediction identification information corresponding to that sample video feature segment. The initialized video tagging model may be pre-constructed by a technician. It may be an untrained video tagging model or a video tagging model whose training has not been completed; each layer of the initialized video tagging model is provided with initialization parameters, which can be continuously adjusted during the training of the video tagging model.
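As an illustration only, one possible initialized video tagging model is sketched below with PyTorch; the application does not fix an architecture, so the layer structure and dimensions are assumptions.
```python
import torch
import torch.nn as nn

class InitializedVideoTaggingModel(nn.Module):
    """One possible initialized model: it maps a segment's flattened image/audio
    feature vector to logits for the three identification states."""
    def __init__(self, feature_dim, hidden_dim=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),   # initialization parameters, adjusted during training
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),             # logits for start / intermediate / termination
        )

    def forward(self, segment_features):          # segment_features: (batch, feature_dim)
        return self.layers(segment_features)      # output side: prediction identification logits
```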
In some optional implementations of this embodiment, the sequentially inputting each of the plurality of sample video feature segments into the initialized video tagging model to obtain the prediction identification information corresponding to each of the plurality of sample video feature segments may include the following steps:
First, for the images contained in the sample video feature segment and the audio information corresponding to the images, performing image recognition on the images to obtain image content information corresponding to the images, and performing audio recognition on the audio information to obtain audio content information corresponding to the audio information.
The executing subject can perform image recognition on the image to obtain image content information (for example, the image content information can be vehicle, dance, etc.) corresponding to the image; then, the executing body performs audio recognition on the audio information to obtain audio content information (such as singing voice, engine voice, etc.) corresponding to the audio information.
And secondly, determining that the prediction identification information of the sample video characteristic segment is the starting state or the ending state of the event in response to the fact that the image content information of the two adjacent frames of images in the image content information sequence is different and the audio content information corresponding to the two adjacent frames of images in the audio content information sequence is different.
At the start and the end of an event, there is usually an accompanying change in the image and audio information. For example, suppose the video to be marked is an entertainment program in which the hosts first chat and a singer then begins to sing; the video may place the image of the singer in the middle of the frame at the moment the music starts. Therefore, the execution subject can perform image recognition on two adjacent frames of images in the image content information sequence of the sample video feature segment. When the image content information of the two adjacent frames of images in the image content information sequence is different and the audio content information corresponding to the two adjacent frames of images in the audio content information sequence is different, the execution subject may determine that the prediction identification information of the sample video feature segment is the start state or the termination state of the event.
Specifically, in response to that the image content information of the previous image in the two adjacent images includes the specified image content, the audio content information corresponding to the previous image includes the specified audio content, the next image does not include the specified image content, and the audio content information corresponding to the next image does not include the specified audio content, the prediction identification information of the sample video feature segment is set as the termination state of the event. And in response to that the image content information of the next frame image in the two adjacent frames of images comprises the specified image content, the audio content information corresponding to the next frame image comprises the specified audio content, the previous frame image does not comprise the specified image content, and the audio content information corresponding to the previous frame image does not comprise the specified audio content, setting the prediction identification information of the sample video feature segment as the starting state of the event. Wherein the specified image content may include at least one of: car images, airplane images. The specified audio content includes at least one of: music audio, collision audio. According to the actual video and event to be marked, the designated image content and the designated audio content may also be other contents, which are not described in detail herein.
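A minimal sketch of this adjacent-frame rule follows; the specified image content ("car") and specified audio content ("collision") are taken from the examples above, and representing the recognized content as sets of labels is an assumption.
```python
START, INTERMEDIATE, TERMINATION = 0, 1, 2   # assumed encoding of the three states

def predict_identification(image_contents, audio_contents,
                           specified_image="car", specified_audio="collision"):
    """Apply the adjacent-frame rule to one sample segment.  `image_contents[i]` /
    `audio_contents[i]` are assumed to be the sets of labels produced by separate
    image- and audio-recognition steps for the i-th frame of the segment."""
    for prev in range(len(image_contents) - 1):
        nxt = prev + 1
        prev_hit = (specified_image in image_contents[prev]
                    and specified_audio in audio_contents[prev])
        next_hit = (specified_image in image_contents[nxt]
                    and specified_audio in audio_contents[nxt])
        if prev_hit and not next_hit:
            return TERMINATION   # specified content disappears: termination state
        if next_hit and not prev_hit:
            return START         # specified content appears: start state
    return INTERMEDIATE          # no such transition: intermediate state
```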
Step 403, comparing the prediction identification information corresponding to each sample video feature segment in the plurality of sample video feature segments with the sample identification information corresponding to the sample video feature segment, so as to obtain the prediction accuracy of the initialized video marker model.
In this embodiment, based on the prediction identification information corresponding to each of the plurality of sample video feature segments obtained in step 402, the execution subject may compare the prediction identification information corresponding to each of the plurality of sample video feature segments with the sample identification information corresponding to the sample video feature segment, so as to obtain the prediction accuracy of the initialized video marker model.
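A minimal sketch of this accuracy computation:
```python
def prediction_accuracy(predicted_ids, sample_ids):
    """Fraction of sample video feature segments whose prediction identification
    information equals the corresponding sample identification information."""
    correct = sum(1 for p, s in zip(predicted_ids, sample_ids) if p == s)
    return correct / len(sample_ids)

# e.g. prediction_accuracy([0, 1, 2, 1], [0, 1, 1, 1]) == 0.75
```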
Step 404, determining whether the prediction accuracy is greater than a preset accuracy threshold.
In this embodiment, based on the prediction accuracy of the initialized videomark model obtained in step 403, the executing entity may compare the prediction accuracy of the initialized videomark model with a preset accuracy threshold. If the accuracy is greater than the preset accuracy threshold, go to step 405; if not, go to step 406.
And 405, taking the initialized video label model as a trained video label model.
In this embodiment, when the prediction accuracy of the initialized videomark model is greater than the preset accuracy threshold, it indicates that the training of the videomark model is completed, and at this time, the executing entity may use the initialized videomark model as the trained videomark model.
Step 406, adjusting the parameters of the initialized video markup model.
In this embodiment, in the case that the prediction accuracy of the initialized video tagging model is not greater than the preset accuracy threshold, the execution entity may adjust the parameters of the initialized video tagging model and return to step 402, until a video tagging model capable of matching the identification information corresponding to a video feature segment has been trained.
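For illustration, the loop of steps 402 to 406 might look as follows, assuming the initialized model is a PyTorch module producing three-class logits and that the inputs and sample identifications are tensors; cross-entropy loss and the Adam optimizer are only one possible way of adjusting the parameters and are not prescribed by the application.
```python
import torch
import torch.nn as nn

def train_video_tagging_model(model, inputs, sample_ids,
                              accuracy_threshold=0.9, lr=1e-3, max_rounds=1000):
    """`inputs` is a float tensor of segment feature vectors (batch, feature_dim);
    `sample_ids` is a 1-D int64 tensor of sample identification class indices."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(max_rounds):
        logits = model(inputs)                                                    # step 402
        accuracy = (logits.argmax(dim=1) == sample_ids).float().mean().item()     # step 403
        if accuracy > accuracy_threshold:                                         # steps 404-405
            return model                                                          # training complete
        optimizer.zero_grad()                                                     # step 406: adjust parameters
        loss_fn(logits, sample_ids).backward()
        optimizer.step()
    return model
```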
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for marking a video segment, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for marking a video clip of the present embodiment may include: a video feature information sequence acquisition unit 501, a video feature segment sequence acquisition unit 502, an identification information acquisition unit 503, and a marking unit 504. The video feature information sequence obtaining unit 501 is configured to obtain a video feature information sequence from a video to be marked, where the video feature information is used to represent image features and audio features of the video to be marked; the video feature segment sequence obtaining unit 502 is configured to group a set number of adjacent video feature information in the video feature information sequence to obtain a video feature segment sequence; an identification information obtaining unit 503, configured to, for a video feature segment in the video feature segment sequence, import the video feature segment into a pre-trained video tagging model to obtain identification information corresponding to the video feature segment, where the video tagging model is used to match identification information corresponding to the video feature segment, and the identification information is used to represent that the video feature segment is in a start state, an intermediate state, or an end state of an event; the marking unit 504, in response to obtaining the identification information sequence corresponding to the video feature segment sequence, is configured to mark the video segment of the video to be marked by the identification information in the identification information sequence.
In some optional implementations of this embodiment, the video feature information sequence obtaining unit 501 may include: an information extraction sub-unit (not shown in the figure) and a video feature information sequence acquisition sub-unit (not shown in the figure). The information extraction subunit is configured to set image frames at intervals, and respectively extract an image sequence and an audio information sequence corresponding to the image sequence from the video to be marked; the video characteristic information sequence obtaining subunit is configured to establish a corresponding relationship between an image in the image information sequence and audio information of a corresponding image in the audio information sequence, and obtain a video characteristic information sequence.
In some optional implementations of this embodiment, the apparatus 500 for tagging video segments may further include a video tagging model training unit (not shown in the figure) configured to train a video tagging model. The video tag model training unit may include: a sample acquisition subunit (not shown in the figure) and a videomark model training subunit (not shown in the figure). The sample acquiring subunit is configured to acquire a plurality of sample video feature fragments and sample identification information corresponding to each of the plurality of sample video feature fragments; the video marking model training subunit is configured to train to obtain a video marking model by taking each of the plurality of sample video feature segments as an input and taking sample identification information corresponding to each of the plurality of sample video feature segments as an output.
In some optional implementations of this embodiment, the aforementioned video tag model training subunit may include: a video tagging model training module (not shown in the figure) configured to sequentially input each of the plurality of sample video feature segments to an initialized video tagging model, obtain prediction identification information corresponding to each of the plurality of sample video feature segments, compare the prediction identification information corresponding to each of the plurality of sample video feature segments with the sample identification information corresponding to the sample video feature segment, obtain a prediction accuracy of the initialized video tagging model, determine whether the prediction accuracy is greater than a preset accuracy threshold, and if so, take the initialized video tagging model as a trained video tagging model.
In some optional implementations of this embodiment, the video tag model training subunit may further include: a parameter adjustment module (not shown) configured to adjust the parameters of the initialized videomark model in response to the accuracy not being greater than the predetermined accuracy threshold, and return to the videomark model training module.
In some optional implementations of this embodiment, the video tag model training module includes: a content information acquisition sub-module (not shown in the figure) and a prediction identification information setting sub-module (not shown in the figure). The content information acquisition sub-module is configured to perform image recognition on the image to obtain image content information corresponding to the image, and perform audio recognition on the audio information to obtain audio content information corresponding to the audio information, for the image contained in the sample video feature segment and the audio information corresponding to the image; and the prediction identification information setting sub-module is used for responding to the fact that the image content information of the two adjacent frames of images in the image content information sequence is different and the audio content information corresponding to the two adjacent frames of images in the audio content information sequence is different and is configured to determine the prediction identification information of the sample video characteristic segment as the starting state or the ending state of the event.
In some optional implementation manners of this embodiment, the prediction identification information setting sub-module may include: a first prediction identification information setting module (not shown in the figure), in response to that the image content information of the previous image of the two adjacent images includes the specified image content, the audio content information corresponding to the previous image includes the specified audio content, and the next image does not include the specified image content, and the audio content information corresponding to the next image does not include the specified audio content, configured to set the prediction identification information of the sample video feature segment as the termination state of the event.
In some optional implementation manners of this embodiment, the prediction identification information setting sub-module may include: and a second prediction identification information setting module (not shown in the figure), in response to that the image content information of the next image of the two adjacent images includes the designated image content, the audio content information corresponding to the next image includes the designated audio content, the previous image does not include the designated image content, and the audio content information corresponding to the previous image does not include the designated audio content, configured to set the prediction identification information of the sample video feature segment as the start state of the event.
In some optional implementations of the present embodiment, the marking unit 504 may include: and a marking subunit (not shown in the figure), configured to mark, for the identification information in the identification information sequence, when the identification information is in the start state, the video feature segment corresponding to the identification information between the identification information and the identification information that is in the end state next to the identification information as the target video segment.
The present embodiment further provides a server, including: one or more processors; a memory having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to perform the above-described method for marking video segments.
The present embodiment also provides a computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method for marking a video segment.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use in implementing a server (e.g., videomark server 105 of FIG. 1) of an embodiment of the present application is shown. The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in FIG. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor including a video feature information sequence acquisition unit, a video feature segment sequence acquisition unit, an identification information acquisition unit, and a marking unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the marking unit may also be described as a "unit for marking a target video segment".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire a video feature information sequence from a video to be marked, the video feature information being used for representing image features and audio features of the video to be marked; group a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence; for each video feature segment in the video feature segment sequence, import the video feature segment into a pre-trained video tag model to obtain identification information corresponding to the video feature segment, the video tag model being used for matching identification information to a video feature segment, and the identification information being used for representing that the video feature segment is in a starting state, an intermediate state, or a termination state of an event; and in response to obtaining the identification information sequence corresponding to the video feature segment sequence, mark a video segment of the video to be marked by the identification information in the identification information sequence.
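Purely as a reading aid, the four operations carried by such a program (acquiring the feature sequence, grouping, per-segment tagging, and marking) can be sketched as below. The model object, the feature representation, and the group size are assumptions rather than the implementation of this application; mark_target_segments refers to the earlier sketch:

```python
# Reading aid only: the four operations carried by the program, with the
# feature sequence and the pre-trained model supplied by the caller.
# model.predict(segment) is assumed to return 'start', 'middle' or 'end';
# mark_target_segments is the sketch given earlier.

def mark_video(feature_sequence, model, group_size=8):
    # group a set number of adjacent feature entries into feature segments
    segments = [feature_sequence[i:i + group_size]
                for i in range(0, len(feature_sequence), group_size)]
    # obtain identification information for each video feature segment
    identification_sequence = [model.predict(segment) for segment in segments]
    # mark target video segments from the identification information sequence
    return mark_target_segments(identification_sequence)
```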
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (18)

1. A method for marking a video segment, comprising:
extracting, at set image frame intervals, an image sequence and an audio information sequence corresponding to the image sequence from a video to be marked, and establishing a correspondence between each image in the image sequence and the audio information of the corresponding image in the audio information sequence, to obtain a video feature information sequence, wherein the video feature information is used for representing image features and audio features of the video to be marked, the image features being the content contained in the images and the audio features being specific audio information in the audio;
grouping a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence;
for the video feature segments in the video feature segment sequence, importing the video feature segments into a pre-trained video tag model to obtain identification information corresponding to the video feature segments, wherein the video tag model is used for matching the identification information corresponding to the video feature segments, and the identification information is used for representing that the video feature segments are in a starting state, an intermediate state or a termination state of an event;
and in response to obtaining the identification information sequence corresponding to the video feature segment sequence, marking a video segment of the video to be marked by the identification information in the identification information sequence.
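As an editorial illustration only (not claim language), the extraction and pairing step recited in claim 1 can be pictured as sampling one image every few frames and attaching the audio window covering the same time span. The accessors and parameters below are hypothetical:

```python
# Hypothetical sketch of the extraction step in claim 1: take one image
# every `interval` frames and pair it with the audio covering the same
# time span. image_for and audio_for are caller-supplied accessors; the
# frame rate and interval values are assumptions.

def build_feature_sequence(num_frames, fps, interval, image_for, audio_for):
    feature_sequence = []
    for i in range(0, num_frames, interval):
        start_s = i / fps
        end_s = min(i + interval, num_frames) / fps
        feature_sequence.append({
            "image": image_for(i),              # image feature of frame i
            "audio": audio_for(start_s, end_s), # audio for the same span
        })
    return feature_sequence
```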
2. The method of claim 1, wherein the video tag model is trained by:
obtaining a plurality of sample video feature segments and sample identification information corresponding to each sample video feature segment in the plurality of sample video feature segments;
and taking each sample video feature segment in the plurality of sample video feature segments as input, taking the sample identification information corresponding to each sample video feature segment in the plurality of sample video feature segments as output, and training to obtain the video tag model.
3. The method of claim 2, wherein the training to obtain the video tag model by taking each of the plurality of sample video feature segments as an input and the sample identification information corresponding to each of the plurality of sample video feature segments as an output comprises:
performing the following training step: sequentially inputting each sample video feature segment of the plurality of sample video feature segments to an initialized video tag model to obtain prediction identification information corresponding to each sample video feature segment; comparing the prediction identification information corresponding to each sample video feature segment with the sample identification information corresponding to that sample video feature segment to obtain the prediction accuracy of the initialized video tag model; determining whether the prediction accuracy is greater than a preset accuracy threshold; and if so, taking the initialized video tag model as the trained video tag model.
4. The method of claim 3, wherein the training to obtain the video tag model by taking each of the plurality of sample video feature segments as an input and the sample identification information corresponding to each of the plurality of sample video feature segments as an output further comprises:
and in response to the prediction accuracy being not greater than the preset accuracy threshold, adjusting parameters of the initialized video tag model and continuing to perform the training step.
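Read together, claims 3 and 4 describe an iterate-until-accurate training loop. The hypothetical sketch below shows only that control flow; the predict and adjust_parameters methods and the threshold value are assumed placeholders, not the claimed training procedure:

```python
# Hypothetical control flow of claims 3 and 4: evaluate the initialized
# model on all sample video feature segments, stop when the prediction
# accuracy exceeds a preset threshold, otherwise adjust parameters and
# repeat. model.predict and model.adjust_parameters are placeholders.

def train_video_tag_model(model, samples, accuracy_threshold=0.95,
                          max_rounds=100):
    """samples: list of (sample_video_feature_segment, sample_identification)."""
    for _ in range(max_rounds):
        correct = sum(model.predict(segment) == label
                      for segment, label in samples)
        accuracy = correct / len(samples)
        if accuracy > accuracy_threshold:
            return model                  # initialized model becomes the trained model
        model.adjust_parameters(samples)  # placeholder parameter update
    return model
```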
5. The method according to claim 3, wherein the sequentially inputting each of the plurality of sample video feature segments to an initialized video tag model to obtain the prediction identification information corresponding to each of the plurality of sample video feature segments comprises:
for each image contained in the sample video feature segment and the audio information corresponding to the image, performing image recognition on the image to obtain image content information corresponding to the image, and performing audio recognition on the audio information to obtain audio content information corresponding to the audio information;
and in response to the image content information of two adjacent frames of images in the image content information sequence being different and the audio content information corresponding to the two adjacent frames of images in the audio content information sequence being different, determining the prediction identification information of the sample video feature segment as the starting state or the termination state of the event.
6. The method of claim 5, wherein the determining the prediction identification information of the sample video feature segment as the starting state or the termination state of the event comprises:
and in response to the image content information of the previous frame of the two adjacent frames including the specified image content, the audio content information corresponding to the previous frame including the specified audio content, the next frame not including the specified image content, and the audio content information corresponding to the next frame not including the specified audio content, setting the prediction identification information of the sample video feature segment as the termination state of the event.
7. The method of claim 5, wherein the determining the prediction identification information of the sample video feature segment as the starting state or the termination state of the event comprises:
and in response to the image content information of the next frame of the two adjacent frames including the specified image content, the audio content information corresponding to the next frame including the specified audio content, the previous frame not including the specified image content, and the audio content information corresponding to the previous frame not including the specified audio content, setting the prediction identification information of the sample video feature segment as the starting state of the event.
8. The method according to any one of claims 1 to 7, wherein the marking a video segment of the video to be marked by the identification information in the identification information sequence comprises:
and for identification information in the identification information sequence that is in the starting state, marking the video feature segments corresponding to the identification information between that identification information and the next identification information that is in the termination state as a target video segment.
9. An apparatus for marking video segments, comprising:
a video feature information sequence acquisition unit, including an information extraction subunit and a video feature information sequence acquisition subunit, wherein the information extraction subunit is configured to extract, at set image frame intervals, an image sequence and an audio information sequence corresponding to the image sequence from a video to be marked, and the video feature information sequence acquisition subunit is configured to establish a correspondence between each image in the image sequence and the audio information of the corresponding image in the audio information sequence to obtain a video feature information sequence, the video feature information being used for representing image features and audio features of the video to be marked, the image features being the content contained in the images and the audio features being specific audio information in the audio;
a video feature segment sequence acquisition unit configured to group a set number of adjacent pieces of video feature information in the video feature information sequence to obtain a video feature segment sequence;
an identification information acquisition unit configured to import, for each video feature segment in the video feature segment sequence, the video feature segment into a pre-trained video tag model to obtain identification information corresponding to the video feature segment, the video tag model being used for matching identification information to a video feature segment, and the identification information being used for representing that the video feature segment is in a starting state, an intermediate state, or a termination state of an event;
and a marking unit configured to, in response to obtaining the identification information sequence corresponding to the video feature segment sequence, mark a video segment of the video to be marked by the identification information in the identification information sequence.
10. The apparatus of claim 9, wherein the apparatus further comprises a video tag model training unit configured to train the video tag model, the video tag model training unit comprising:
a sample obtaining subunit configured to obtain a plurality of sample video feature segments and sample identification information corresponding to each sample video feature segment in the plurality of sample video feature segments;
and a video tag model training subunit configured to take each sample video feature segment in the plurality of sample video feature segments as input, take the sample identification information corresponding to each sample video feature segment in the plurality of sample video feature segments as output, and train to obtain the video tag model.
11. The apparatus of claim 10, wherein the video tag model training subunit comprises:
a video tag model training module configured to: sequentially input each sample video feature segment of the plurality of sample video feature segments to an initialized video tag model to obtain prediction identification information corresponding to each sample video feature segment; compare the prediction identification information corresponding to each sample video feature segment with the sample identification information corresponding to that sample video feature segment to obtain the prediction accuracy of the initialized video tag model; determine whether the prediction accuracy is greater than a preset accuracy threshold; and if the prediction accuracy is greater than the preset accuracy threshold, use the initialized video tag model as the trained video tag model.
12. The apparatus of claim 11, wherein the video tag model training subunit further comprises:
a parameter adjustment module configured to, in response to the prediction accuracy being not greater than the preset accuracy threshold, adjust parameters of the initialized video tag model and return to the video tag model training module.
13. The apparatus of claim 11, wherein the video tag model training module comprises:
a content information acquisition sub-module configured to, for each image contained in the sample video feature segment and the audio information corresponding to the image, perform image recognition on the image to obtain image content information corresponding to the image, and perform audio recognition on the audio information to obtain audio content information corresponding to the audio information;
and a prediction identification information setting sub-module configured to, in response to the image content information of two adjacent frames of images in the image content information sequence being different and the audio content information corresponding to the two adjacent frames of images in the audio content information sequence being different, determine the prediction identification information of the sample video feature segment as the starting state or the termination state of the event.
14. The apparatus of claim 13, wherein the prediction identification information setting sub-module comprises:
and a first prediction identification information setting module configured to, in response to the image content information of the previous frame of the two adjacent frames including the specified image content, the audio content information corresponding to the previous frame including the specified audio content, the next frame not including the specified image content, and the audio content information corresponding to the next frame not including the specified audio content, set the prediction identification information of the sample video feature segment as the termination state of the event.
15. The apparatus of claim 13, wherein the prediction identification information setting sub-module comprises:
and a second prediction identification information setting module configured to, in response to the image content information of the next frame of the two adjacent frames including the specified image content, the audio content information corresponding to the next frame including the specified audio content, the previous frame not including the specified image content, and the audio content information corresponding to the previous frame not including the specified audio content, set the prediction identification information of the sample video feature segment as the starting state of the event.
16. The apparatus of any one of claims 9 to 15, wherein the marking unit comprises:
and a marking subunit configured to, for identification information in the identification information sequence that is in the starting state, mark the video feature segments corresponding to the identification information between that identification information and the next identification information that is in the termination state as a target video segment.
17. A server, comprising:
one or more processors;
a memory having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-8.
18. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN201811139639.2A 2018-09-28 2018-09-28 Method and apparatus for marking video segments Active CN109121022B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811139639.2A CN109121022B (en) 2018-09-28 2018-09-28 Method and apparatus for marking video segments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811139639.2A CN109121022B (en) 2018-09-28 2018-09-28 Method and apparatus for marking video segments

Publications (2)

Publication Number Publication Date
CN109121022A CN109121022A (en) 2019-01-01
CN109121022B true CN109121022B (en) 2020-05-05

Family

ID=64857057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811139639.2A Active CN109121022B (en) 2018-09-28 2018-09-28 Method and apparatus for marking video segments

Country Status (1)

Country Link
CN (1) CN109121022B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110225411A (en) * 2019-05-31 2019-09-10 北京奇艺世纪科技有限公司 The segment of programme televised live reviews method, system, computer equipment and medium
CN111159464B (en) * 2019-12-26 2023-12-15 腾讯科技(深圳)有限公司 Audio clip detection method and related equipment
CN111274443B (en) * 2020-01-10 2023-06-09 北京百度网讯科技有限公司 Video clip description generation method and device, electronic equipment and storage medium
CN112052357B (en) * 2020-04-15 2022-04-01 上海摩象网络科技有限公司 Video clip marking method and device and handheld camera
CN112052358A (en) * 2020-09-07 2020-12-08 北京字节跳动网络技术有限公司 Method, apparatus, electronic device and computer readable medium for displaying image
CN112423113A (en) * 2020-11-20 2021-02-26 广州欢网科技有限责任公司 Television program dotting method and device and electronic terminal
CN113099260B (en) * 2021-04-21 2022-02-01 北京沃东天骏信息技术有限公司 Live broadcast processing method, live broadcast platform, system, medium and electronic device
CN114339392B (en) * 2021-11-12 2023-09-12 腾讯科技(深圳)有限公司 Video editing method, device, computer equipment and storage medium
CN114245171B (en) * 2021-12-15 2023-08-29 百度在线网络技术(北京)有限公司 Video editing method and device, electronic equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9141860B2 (en) * 2008-11-17 2015-09-22 Liveclips Llc Method and system for segmenting and transmitting on-demand live-action video in real-time
EP3284261A4 (en) * 2015-04-16 2018-09-26 W.S.C. Sports Technologies Ltd. System and method for creating and distributing multimedia content
CN107483879B (en) * 2016-06-08 2020-06-09 中兴通讯股份有限公司 Video marking method and device and video monitoring method and system
CN107707931B (en) * 2016-08-08 2021-09-10 阿里巴巴集团控股有限公司 Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN108537139B (en) * 2018-03-20 2021-02-19 校宝在线(杭州)科技股份有限公司 Online video highlight analysis method based on bullet screen information

Also Published As

Publication number Publication date
CN109121022A (en) 2019-01-01

Similar Documents

Publication Publication Date Title
CN109121022B (en) Method and apparatus for marking video segments
CN110582025B (en) Method and apparatus for processing video
CN108989882B (en) Method and apparatus for outputting music pieces in video
CN109740018B (en) Method and device for generating video label model
CN109104620B (en) Short video recommendation method and device and readable medium
WO2019242222A1 (en) Method and device for use in generating information
CN108960316B (en) Method and apparatus for generating a model
CN109919244B (en) Method and apparatus for generating a scene recognition model
CN109308490B (en) Method and apparatus for generating information
CN107463700B (en) Method, device and equipment for acquiring information
CN109961032B (en) Method and apparatus for generating classification model
CN109582825B (en) Method and apparatus for generating information
CN110046571B (en) Method and device for identifying age
CN111897950A (en) Method and apparatus for generating information
CN108038172B (en) Search method and device based on artificial intelligence
CN109816023B (en) Method and device for generating picture label model
CN115801980A (en) Video generation method and device
CN116932919A (en) Information pushing method, device, electronic equipment and computer readable medium
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN108921138B (en) Method and apparatus for generating information
CN109816670B (en) Method and apparatus for generating image segmentation model
CN112148962B (en) Method and device for pushing information
CN112000842A (en) Video processing method and device
CN108664610B (en) Method and apparatus for processing data
KR102045347B1 (en) Surppoting apparatus for video making, and control method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant