CN115150636B - Video processing method, electronic device and storage medium


Info

Publication number
CN115150636B
CN115150636B
Authority
CN
China
Prior art keywords
segment
similarity
image
video
target
Prior art date
Legal status
Active
Application number
CN202110343767.4A
Other languages
Chinese (zh)
Other versions
CN115150636A (en)
Inventor
贾美娟
王硕
张晓星
Current Assignee
Hainan Liangxin Technology Co., Ltd.
Original Assignee
Hainan Liangxin Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Hainan Liangxin Technology Co., Ltd.
Priority to CN202110343767.4A
Publication of CN115150636A
Application granted
Publication of CN115150636B
Status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845: Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention provides a video processing method, apparatus, electronic device and storage medium. In the method, a video to be processed is divided into a plurality of first segments, where the video to be processed contains video content corresponding to a plurality of items and the similarity between the video images contained in each first segment meets a first preset similarity requirement. For each first segment, segment features of the first segment are acquired according to the audio information and/or video images contained in the first segment. According to the segment features of each first segment, first segments whose segment features have a similarity meeting a second preset similarity requirement are merged to obtain second segments, where different second segments correspond to the video content of different items. Compared with manual splitting, automatically splitting the video to be processed into second segments corresponding to different items reduces the implementation cost to a certain extent and improves the splitting efficiency.

Description

Video processing method, electronic device and storage medium
Technical Field
The present invention relates to the field of network technologies, and in particular to a video processing method, apparatus, electronic device and storage medium.
Background
With the continuous development of internet technology, the variety of video on the network has become increasingly rich. For example, videos obtained by continuously shooting a plurality of items have appeared on the network. In particular, on e-commerce platforms a host often explains and recommends a plurality of items to users through a live broadcast, which ultimately produces a live video in which the items are shot one after another. To improve video utilization, such a video often needs to be split into the video clips corresponding to each item, so that each clip can be used to recommend the corresponding item, for example as that item's display video.
In the prior art, this splitting is usually carried out manually, which is inefficient and costly to implement.
Disclosure of Invention
The present invention provides a video processing method, video processing apparatus, electronic device and storage medium to solve the problems of low splitting efficiency and high implementation cost.
In a first aspect, the present invention provides a video processing method, the method comprising:
dividing a video to be processed into a plurality of first segments, wherein the video to be processed contains video content corresponding to a plurality of items, and the similarity between the video images contained in each first segment meets a first preset similarity requirement;
for each first segment, acquiring segment features of the first segment according to the audio information and/or video images contained in the first segment;
merging, according to the segment features of each first segment, first segments whose segment features have a similarity meeting a second preset similarity requirement, to obtain second segments; different second segments correspond to the video content of different items.
In a second aspect, the present invention provides a video processing apparatus, the apparatus comprising:
a dividing module, configured to divide a video to be processed into a plurality of first segments, wherein the video to be processed contains video content corresponding to a plurality of items, and the similarity between the video images contained in each first segment meets a first preset similarity requirement;
a first acquisition module, configured to acquire, for each first segment, segment features of the first segment according to the audio information and/or video images contained in the first segment;
a merging module, configured to merge, according to the segment features of each first segment, first segments whose segment features have a similarity meeting a second preset similarity requirement, to obtain second segments; different second segments correspond to the video content of different items.
In a third aspect, the present invention provides an electronic device, comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the above video processing method when executing the program.
In a fourth aspect, the present invention provides a readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the above video processing method.
In the embodiment of the invention, a video to be processed is divided into a plurality of first segments, where the video to be processed contains video content corresponding to a plurality of items and the similarity between the video images contained in each first segment meets a first preset similarity requirement. For each first segment, segment features of the first segment are acquired according to the audio information and/or video images contained in the first segment. According to the segment features of each first segment, first segments whose segment features have a similarity meeting a second preset similarity requirement are merged to obtain second segments, where different second segments correspond to the video content of different items. Compared with manual splitting, automatically splitting the video to be processed into second segments corresponding to different items reduces the implementation cost to a certain extent and improves the splitting efficiency.
Drawings
To describe the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of the steps of a video processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the processing procedure of a segment classification module according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the processing procedure of a segment feature extraction module according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of video content localization according to an embodiment of the present invention;
Fig. 5 is a block diagram of a video processing apparatus according to an embodiment of the present invention;
Fig. 6 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of steps of a video processing method according to an embodiment of the present invention, where, as shown in fig. 1, the method may include:
step 101, dividing a video to be processed into a plurality of first fragments; the video to be processed comprises video contents corresponding to a plurality of objects, and the similarity between video images contained in the first segment meets a first preset similarity requirement.
In the embodiment of the invention, the video to be processed can be obtained by shooting a plurality of objects, and the video to be processed can be a live video, and for objects which are sequentially explained and displayed in the live broadcast, the recorded live video can comprise video contents corresponding to the objects. For example, a cell phone, crayfish, and a computer are illustrated in the live broadcast. The finally obtained video to be processed can sequentially comprise video clips corresponding to the mobile phone, video clips corresponding to the crayfish and video clips corresponding to the computer, and the contents of the 3 video clips sequentially represent video contents corresponding to the mobile phone, video contents corresponding to the crayfish and video contents corresponding to the computer. Of course, the video to be processed may be other types of video. Further, the video to be processed can be received and sent by the live video recording device, or can be actively downloaded from a network or manually input by a user. Of course, the video to be processed may be other types of video, such as a trailer, etc.
The first preset similarity requirement may be set according to actual requirements, which is not limited in the embodiment of the present invention. For example, the first preset similarity requirement may be that the similarity between the video images is greater than a first preset similarity threshold, or that a similarity relationship exists between the video images, and so on. Further, if the similarity between the video images meets the first preset similarity requirement, the video images may be divided into the same segment, so as to obtain the first segment.
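As an illustration only, and not part of the claimed method, one way such a threshold-based grouping might be realized is by comparing colour histograms of consecutive frames; OpenCV, the histogram settings and the 0.9 threshold below are all assumptions:

```python
import cv2

def split_into_first_segments(frames, sim_threshold=0.9):
    """Group consecutive frames into first segments while their colour-
    histogram similarity stays above sim_threshold (assumed value)."""
    def hist(img):
        h = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                         [0, 256, 0, 256, 0, 256])
        return cv2.normalize(h, h).flatten()

    segments, current = [], [frames[0]]
    prev = hist(frames[0])
    for frame in frames[1:]:
        cur = hist(frame)
        # HISTCMP_CORREL: 1.0 means identical histograms
        if cv2.compareHist(prev, cur, cv2.HISTCMP_CORREL) >= sim_threshold:
            current.append(frame)
        else:
            segments.append(current)   # similarity requirement broken: new segment
            current = [frame]
        prev = cur
    segments.append(current)
    return segments
```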
Step 102: for each first segment, acquire segment features of the first segment according to the audio information and/or video images contained in the first segment.
In the embodiment of the invention, the audio information contained in a first segment may be the audio carried by that segment. Because the audio usually explains the item, and different items have different characteristics, the audio differs when different items are explained; that is, the audio information reflects, to a certain extent, the characteristics of the item corresponding to the first segment. Determining the segment features in combination with the audio information therefore ensures their accuracy to a certain extent. Likewise, the appearance of different items tends to differ, so determining the segment features in combination with the video images also ensures their accuracy.
Step 103: merge, according to the segment features of each first segment, first segments whose segment features have a similarity meeting a second preset similarity requirement, to obtain second segments; different second segments correspond to the video content of different items.
In the embodiment of the present invention, the second preset similarity requirement may be set according to actual requirements and may differ from the first preset similarity requirement; for example, it may be stricter than the first preset similarity requirement, which is not limited in the embodiment of the present invention. For example, the second preset similarity requirement may be that the similarity between segment features is greater than a second preset similarity threshold. Further, first segments whose segment features have a similarity meeting the second preset similarity requirement may be merged into the same segment, thereby obtaining a second segment.
Because the video images within a first segment have a certain similarity, one first segment corresponds to the video content of one item. Accordingly, merging similar first segments combines the first segments that correspond to the same item, yielding second segments that correspond to the video content of different items. For example, assume there are 7 first segments: first segment A, first segment B, first segment C, first segment D, first segment E, first segment F and first segment G, where A and B correspond to the video content of a mobile phone, C and D to that of a computer, and E, F and G to that of crayfish. The similarity between the segment features of A and B meets the second preset similarity requirement, so A and B can be merged into second segment 1, which corresponds to the video content of the mobile phone. Likewise, C and D can be merged into second segment 2, corresponding to the computer, and E, F and G can be merged into second segment 3, corresponding to the crayfish.
The video processing method provided by the embodiment of the invention divides a video to be processed into a plurality of first segments, where the video to be processed contains video content corresponding to a plurality of items and the similarity between the video images contained in each first segment meets a first preset similarity requirement; acquires, for each first segment, segment features of the first segment according to the audio information and/or video images contained in the first segment; and merges, according to the segment features of each first segment, first segments whose segment features have a similarity meeting a second preset similarity requirement, to obtain second segments, where different second segments correspond to the video content of different items. Compared with manual splitting, automatically splitting the video to be processed into second segments corresponding to different items reduces the implementation cost to a certain extent and improves the splitting efficiency.
Meanwhile, in the embodiment of the invention, the video to be processed is first roughly split into first segments, which are then aggregated according to their segment features to obtain the second segments; this reduces the splitting cost to a certain extent and improves the video splitting efficiency.
Optionally, in the embodiment of the present invention, after the second segment is obtained, the following steps S21 to S22 may be performed:
Step S21: acquire a target image corresponding to the target item, and extract the image features of the target image to obtain target image features.
In the embodiment of the invention, the target item may be one or more of the items corresponding to the video to be processed. For example, if every item corresponding to the video to be processed is taken as a target item, the target item segment corresponding to each target item can subsequently be obtained, thereby determining which item each second segment specifically corresponds to.
Further, the target image corresponding to the target item may be an image recording the appearance of the target item, for example an image received from user input. The image features of the target image may be region of interest (ROI) features. For example, when extracting the image features of the target image, the features of the 5 stages of the target image may be extracted using a backbone network (e.g., ResNet101) of a preset network (e.g., a Mask-RCNN network). The features of the stages are then fused by a feature pyramid module, regions of interest are proposed by a region proposal network (e.g., an RPN) based on the fused features, and the features of the regions of interest are obtained through a preset module (e.g., an ROI Align module), yielding the target image features.
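For illustration, a minimal sketch of such an ROI feature extraction pipeline using torchvision's Mask-RCNN is given below. It is not the patent's implementation: torchvision ships a ResNet-50 FPN backbone rather than the ResNet101 named above, and the spatial pooling at the end is an illustrative choice:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def target_image_features(image):
    """image: 3xHxW float tensor in [0, 1]; returns one feature vector per ROI."""
    images, _ = model.transform([image], None)      # resize + normalise
    feats = model.backbone(images.tensors)          # backbone + feature pyramid (FPN)
    proposals, _ = model.rpn(images, feats)         # region proposal network
    roi_feats = model.roi_heads.box_roi_pool(       # ROI Align over FPN levels
        feats, proposals, images.image_sizes)       # -> [num_rois, 256, 7, 7]
    return roi_feats.mean(dim=(2, 3))               # pooled ROI feature vectors
```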
Step S22: determine the target item segment corresponding to the target item according to the target image features and the image features of the video images in each second segment; the target item segment is the second segment corresponding to the target item.
Specifically, if the item corresponding to a second segment is the target item, the image features of the video images in that second segment tend to be similar to the target image features. Therefore, for the target image features of any target image, the target item segment corresponding to the target item can be determined according to the similarity between the target image features and the image features of the video images in each second segment. For example, assume there are 3 target images: target image a corresponding to the mobile phone, target image b corresponding to the computer, and target image c corresponding to the crayfish. The image features of the video images in second segment 1 are closest to the target image features of target image a, so second segment 1 can be determined as the target item segment corresponding to the mobile phone. Likewise, the image features of the video images in second segment 2 are closest to those of target image b, so second segment 2 can be determined as the target item segment corresponding to the computer, and second segment 3 as the target item segment corresponding to the crayfish.
In the embodiment of the invention, by further acquiring the target image corresponding to the target item, extracting its image features to obtain the target image features, and then determining the target item segment according to the target image features, the item can be conveniently located in the video to be processed; that is, the item each second segment corresponds to is made explicit, so the target item segment can later be used to display the item, improving video utilization. For example, when the target item is listed on a network platform, the target item segment can serve as the item's display video, an image or clip can be cut from the target item segment as the display video's cover, and the main item information in the display video can be shown, increasing attractiveness and making it easier for users to choose items on the platform.
Optionally, in one implementation, determining the target item segment corresponding to the target item according to the target image features and the image features of the video images in each second segment may include the following steps:
Step S221: determine the first item category to which the target item belongs, and determine the second item category to which the item corresponding to each second segment belongs.
In this step, determining the first item category and determining the second item category may be performed simultaneously or in either order, which is not limited in the embodiment of the present invention.
Specifically, when determining the first item category, item classification may be performed based on the target image features extracted in the preceding steps to obtain the first item category. For example, regression may be performed on the target image according to the target image features to determine the item candidate box, i.e., the region where the target item is located in the target image; the regression operation may be implemented with the Mask-RCNN network described above. The item is then classified according to the content of the region where the target item is located, yielding the first item category. Of course, other algorithms may be used for classification, which is not limited in the embodiment of the present invention. When determining the second item category, classification may be performed according to the segment features of the second segment with a preset classification network, yielding the second item category of the second segment.
Step S222: determine candidate segments from the second segments according to the first item category and the second item category of each second segment; a candidate segment is a second segment whose second item category matches the first item category.
In this step, for any second segment, its second item category may be compared with the first item category; if the two categories are consistent, the second segment is taken as a candidate segment, and if they are not, the second segment is discarded. For example, if the first item category is electronics, the second item category of second segment 1 is electronics, that of second segment 2 is electronics, and that of second segment 3 is fresh produce, then second segment 3 is discarded and second segments 1 and 2 are taken as candidate segments.
Step S223: determine the target item segment from the candidate segments according to the target image features and the image features of the video images in the candidate segments.
In the embodiment of the invention, the first item category of the target item and the second item category of the item corresponding to each second segment are determined first; the second segments whose second item category matches the first item category are selected as candidate segments; and the target item segment is then determined from the candidate segments according to the target image features and the image features of the video images in the candidate segments. In this way, the second segments inconsistent with the target item's category, i.e., those that are most likely not the target item segment, are filtered out by category first, and only the second segments consistent with the category are kept as candidates from which the target item segment is selected. This reduces, to a certain extent, the amount of computation needed to determine the target item segment and thereby speeds up the determination.
Optionally, determining the target item segment from the candidate segments according to the target image features and the image features of the video images in the candidate segments may include the following steps:
Step S223a: for any candidate segment, calculate a first similarity between the target image features and the image features of each video image in the candidate segment.
In this step, the specific type of the first similarity may be set according to actual requirements; in one example, the first similarity may be a cosine similarity. In a concrete implementation, for a sequence set composed of candidate segments and their segment features, one candidate segment is selected from the set, and features are then taken image by image from the segment features of that candidate segment: for any video image in the candidate segment, its image features are obtained, the target image features and the image features of the video image are used as inputs of a preset similarity algorithm, and the output of the algorithm is taken as the first similarity. Alternatively, a distance between the target image features and the image features of the video image may be computed and the first similarity determined from that distance. The preset similarity algorithm may be set according to actual requirements, which is not limited in the embodiment of the present invention. Further, the image features of the video images in the candidate segment may be obtained in the same way as the target image features, or with another feature extraction method, which is not repeated here.
Step S223b: if the maximum of the first similarities is greater than a first preset similarity threshold, take the video image corresponding to that maximum as the reference video image.
In this step, the specific value of the first preset similarity threshold may be set according to actual requirements, which is not limited in the embodiment of the present invention. Further, the first similarities of the video images in the candidate segment may be compared to find the maximum first similarity, i.e., the maximum similarity, and it is then judged whether the maximum similarity is greater than the first preset similarity threshold. If it is, the video image corresponding to the maximum similarity is taken as the reference video image, and the subsequent steps further confirm whether the candidate segment is the target item segment. Otherwise, the candidate segment is most likely not the target item segment, so its examination can stop and the next candidate segment can be examined, until all candidate segments have been examined.
Step S223c: starting from the reference video image, determine, for the remaining video images in the candidate segment, the video images in which the target item appears according to the second similarity between the reference video image and each remaining video image, obtaining target video images.
In the embodiment of the invention, a tracking search may be performed forward and backward from the reference video image as the starting point, so as to determine the video images in which the target item appears among the remaining video images. The remaining video images are the video images in the candidate segment other than the reference video image.
Step S223d: if the target quantity ratio is greater than a preset ratio threshold, determine the candidate segment to be the target item segment; the target quantity ratio is the ratio of the number of target video images to the total number of video images in the candidate segment.
The preset ratio threshold may be set according to actual requirements, which is not limited in the embodiment of the present invention. Further, the number of determined target video images is counted, and the ratio between this number and the total number of video images contained in the candidate segment is computed to obtain the target quantity ratio. Finally, it is judged whether the target quantity ratio is greater than the preset ratio threshold: if so, the candidate segment is determined to be the target item segment; if not, examination moves on to the next candidate segment, until all candidate segments have been examined.
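As a sketch only, steps S223a to S223d can be glued together as below; the cosine similarity, both thresholds and the injected `track_fn` (assumed to return the indices of frames where the tracker still finds the item) are assumptions rather than values fixed by the method:

```python
import numpy as np

def is_target_item_segment(target_feat, frame_feats, track_fn,
                           sim_thresh=0.8, ratio_thresh=0.5):
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    sims = [cosine(target_feat, f) for f in frame_feats]   # S223a: per-image similarity
    best = int(np.argmax(sims))
    if sims[best] <= sim_thresh:                           # S223b: no reference image
        return False                                       # move on to next candidate
    tracked = track_fn(best)                               # S223c: track fore/backward
    ratio = len(tracked) / len(frame_feats)                # S223d: target quantity ratio
    return ratio > ratio_thresh
```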
Further, in one existing approach, the similarity between the target image features and the image features of the video images in a candidate segment is used directly as the selection criterion: whether the candidate segment is the target item segment is decided from that similarity alone, for example by taking the candidate segment as the target item segment whenever the maximum similarity is greater than a preset threshold. Because the size, viewing angle and position of an item vary widely across video clips, such pure feature-matching may assign a low similarity even when the item in the candidate segment is the same as the target item, leading to inaccurate detection. For example, if the target image shows the front of the target item while the candidate segment shows its side, the similarity between the front view and the side view is low, and the candidate segment is misjudged as a segment of a different item, causing false detection.
In the embodiment of the invention, when the candidate segment contains a video image whose maximum similarity is greater than the first preset similarity threshold, that video image is taken as the reference video image, the target video images showing the target item are searched among the remaining video images according to the second similarity, and the candidate segment is determined to be the target item segment when the proportion of target video images is greater than the preset ratio threshold. In other words, the method keeps tracking the remaining video images while computing the feature-matching similarity, which avoids false and missed detections to a certain extent. For example, since a video clip usually changes continuously over recording time, there may be a gradual transition in the clip from the front of the item to its side; by tracking the remaining video images, target video images recording the target item from different viewing angles, scales and positions can be found in the clip. Accordingly, deciding whether the candidate segment is the target item segment based on the proportion of target video images avoids false detection to a certain extent.
Optionally, determining in step S223c the video images in which the target item appears among the remaining video images, according to the second similarity between the reference video image and the remaining video images, may include the following substeps:
Substep (1): calculate the similarity between the region where the item is located in the reference video image and the region where the item is located in the remaining video image to obtain the second similarity, and calculate the similarity between the region where the item is located in the remaining video image and the region where the item is located in the previous target video image to obtain a third similarity.
Determining the second similarity and determining the third similarity may be performed simultaneously or in either order; for example, the second similarity may be determined first and then the third, or vice versa, which is not limited in the embodiment of the present invention.
Further, the region where the item is located in the reference video image may be taken as the target region. The features of the target region are convolved with the features of each search region in the remaining video image to be tracked to obtain the similarity between each search region and the target region, and the search region with the maximum similarity is mapped back to the remaining video image to obtain the second similarity; the search region with the maximum similarity is the region where the item is located in the remaining video image. The third similarity may be computed in the same way. Of course, the similarity may also be computed in other ways, for example by determining the item regions in the two images in advance and then directly computing the similarity between those regions. Compared with computing the similarity over the whole image, computing it over the regions where the items are located reduces interference from other regions to a certain extent, so the computed second and third similarities are more representative; meanwhile, since the other regions do not participate in the computation, the computational load is reduced and the computational efficiency improved.
Substep (2): determine a target similarity according to the second similarity and the third similarity; the target similarity is positively correlated with both the second similarity and the third similarity.
For example, the mean, sum or weighted sum of the second similarity and the third similarity may be computed to obtain the target similarity.
Substep (3): if the target similarity is greater than a second preset similarity threshold, take the remaining video image as a video image in which the target item appears.
The second preset similarity threshold may be set according to actual requirements, which is not limited in the embodiment of the present invention. In the embodiment of the invention, the third similarity between the remaining video image and the previous target video image is additionally considered when determining the target video images. This provides richer reference information for the determination, avoids to a greater extent the false detections caused by changes in the item's viewing angle within the candidate segment, and thus improves the accuracy of the determination to a certain extent.
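A sketch of substeps (1) to (3) follows. It simplifies substep (1): instead of sliding (convolving) the target-region features over search regions, it pools each region's features and takes a cosine similarity; the 0.7 threshold and the plain mean in substep (2) are likewise assumptions:

```python
import torch
import torch.nn.functional as F

def region_similarity(region_a, region_b):
    """Cosine similarity between spatially pooled region features (CxHxW)."""
    a = region_a.mean(dim=(1, 2))
    b = region_b.mean(dim=(1, 2))
    return F.cosine_similarity(a, b, dim=0).item()

def is_target_frame(ref_region, prev_target_region, cur_region, thresh=0.7):
    s2 = region_similarity(ref_region, cur_region)          # substep (1): 2nd similarity
    s3 = region_similarity(prev_target_region, cur_region)  # substep (1): 3rd similarity
    target_sim = (s2 + s3) / 2   # substep (2): mean is positively correlated with both
    return target_sim > thresh   # substep (3): compare with 2nd preset threshold
```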
Optionally, for any second segment, determining the second item category of the item corresponding to the second segment may specifically include:
Step S221a: adjust the text features of the second segment and the image features of the second segment to the same dimension.
Specifically, the first segments contained in the second segment, i.e., the first segments that were merged to obtain it, may be taken as target first segments. The text features of the target first segments are merged into the text features of the second segment, and their image features into the image features of the second segment. Further, the text features and image features of the second segment may be brought to the same number of dimensions, for example by adjusting the number of dimensions of the image features according to that of the text features so that the two are consistent.
Step S221b: determine a mean feature from the adjusted text features and image features.
For example, the adjusted text features and image features may be averaged to obtain the mean feature.
Step S221c: classify the mean feature with a preset classification network to determine the second item category.
The preset classification network may be selected according to actual requirements, which is not limited in the embodiment of the present invention. For example, the classification network may include a self-attention module and a fully connected layer; the self-attention module lets the network focus on the class-discriminative components of the mean feature, so that the subsequent fully connected layer can classify the mean feature more accurately.
The above determination of the second item category may be implemented, for example, by a segment classification module. Fig. 2 is a schematic diagram of the processing procedure of a segment classification module according to an embodiment of the present invention. As shown in Fig. 2, for any image features, the image features are first flattened, then resized so that their dimensions match those of the text features; the mean feature of the two is then processed by the self-attention module and the fully connected layer to obtain the second item category.
In the embodiment of the invention, adjusting the text features and image features to the same dimension before computing the mean feature improves, to a certain extent, the consistency of the computed mean feature and thereby the classification performance based on it.
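A PyTorch sketch of the Fig. 2 head is given below; all sizes, the number of attention heads and the class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SegmentClassifier(nn.Module):
    """Resize image features to the text dimension, average the two,
    then apply self-attention and a fully connected layer (Fig. 2)."""
    def __init__(self, img_dim=256, txt_dim=768, num_classes=20):
        super().__init__()
        self.resize = nn.Linear(img_dim, txt_dim)          # dimension alignment
        self.attn = nn.MultiheadAttention(txt_dim, num_heads=8, batch_first=True)
        self.fc = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):                 # [B, img_dim], [B, txt_dim]
        mean_feat = (self.resize(img_feat) + txt_feat) / 2 # mean feature (S221b)
        x = mean_feat.unsqueeze(1)                         # treat as length-1 sequence
        x, _ = self.attn(x, x, x)                          # self-attention module
        return self.fc(x.squeeze(1))                       # class logits (S221c)
```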
Optionally, when dividing the video to be processed into first segments, the keyframes contained in the video may be determined first; for any keyframe, the keyframe and its similar adjacent frames are divided into the same segment to obtain a first segment, where the similar adjacent frames are the video images between that keyframe and the next keyframe. Specifically, a keyframe may be an intra-coded picture (I-frame). The content of the video images between a keyframe and the next keyframe is usually quite similar to the content of the keyframe, so these images can directly be regarded as similar adjacent frames, and the image sequence between adjacent I-frames is divided into one segment. By exploiting this property of video coding, the first segments can be obtained simply by locating the keyframes, which improves dividing efficiency. Meanwhile, segmenting the video in this way preserves the integrity of item segments to a certain extent; compared with a single image, the item features contained in a segment are more complete, which facilitates feature matching against the target image of the target item and improves the accuracy of item content localization.
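As an illustration only, the keyframe-based division can be sketched with PyAV (an assumed dependency); `key_frame` marks I-frames for typical codecs:

```python
import av  # PyAV

def split_by_keyframes(path):
    """Each first segment is one keyframe plus all frames up to the next one."""
    segments, current = [], []
    with av.open(path) as container:
        for frame in container.decode(video=0):
            if frame.key_frame and current:   # a new I-frame closes the segment
                segments.append(current)
                current = []
            current.append(frame)
        if current:
            segments.append(current)
    return segments
```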
Of course, the first segments may also be obtained in other ways, for example by clustering the video images of the video to be processed and taking the images of one cluster as one first segment, or by computing the similarity between video images and dividing images whose similarity falls within a preset range into one first segment, which is not limited in the embodiment of the present invention.
Optionally, the segment features in the embodiment of the present invention may include text features and image features. Accordingly, acquiring the segment features of a first segment according to the audio information and/or video images contained in the first segment may include:
Step 1021: extract the image features of the video images in the first segment, and extract the text features of a target text; the target text is the text corresponding to the audio information contained in the first segment.
Specifically, the video image sequence in the first segment may be sampled and image feature extraction performed on the sampled video images to obtain the image features; alternatively, image feature extraction may be performed directly on all video images in the first segment. The image features of a video image may be extracted in the same way as the target image features described above, which is not repeated here. It should be noted that, in a practical application scenario, while the image features of the video images in the first segment are determined, the region where the item is located in each video image may also be detected. Accordingly, when the similarity is later computed based on the item regions in the video images, the detection results can be reused directly, improving computational efficiency. Meanwhile, detecting the item region in the video image determines the spatial position of the item in the video, which improves the item localization effect to a certain extent.
Further, when extracting the text features, the audio information contained in the first segment may be converted into the target text by a preset speech recognition algorithm (e.g., the DFCNN algorithm). The target text is then processed to obtain the text features. For example, the target text may be fed into a preset Bert model, and the output of the penultimate layer (a Transformer layer) of the model extracted to obtain the text features, which may be a sentence vector of the target text.
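For illustration, a sketch of the Bert sentence-vector extraction using the Hugging Face transformers library is given below; the model name and the mean pooling over tokens are assumptions, as the text only names "a preset Bert model" and its penultimate layer:

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed model
bert = BertModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def text_feature(target_text):
    inputs = tok(target_text, return_tensors="pt", truncation=True)
    out = bert(**inputs, output_hidden_states=True)
    penultimate = out.hidden_states[-2]        # output of the second-to-last layer
    return penultimate.mean(dim=1).squeeze(0)  # pooled into one sentence vector
```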
Step 1022, generating the segment feature according to the image feature and the text feature.
For example, the image feature and the text feature may be combined to obtain the segment feature.
In the embodiment of the invention, by combining the audio information and the video images, the image features of the video images and the text features of the text corresponding to the audio information in the first segment are extracted, and the features of the two modalities are fused to generate the segment features. The segment features can therefore characterize the first segment from multiple aspects to a certain extent, which improves their accuracy.
The above generation of segment features may be implemented, for example, by a segment feature extraction module. Fig. 3 is a schematic diagram of the processing procedure of a segment feature extraction module according to an embodiment of the present invention. As shown in Fig. 3, speech recognition may be performed on the audio information, followed by text feature extraction to obtain the text features. The video image sequence is sampled, item detection is performed on the sampled video images to filter out the images containing no item, and image feature extraction is performed on the remaining video images to obtain the image features. Finally, the segment features are generated from the text features and the image features.
Optionally, merging, according to the segment features of each first segment, the first segments whose segment features have a similarity meeting the second preset similarity requirement may include the following steps:
Step 1031: for any first segment, calculate a fourth similarity between the segment features of the first segment and the segment features of the next first segment.
Optionally, when calculating the fourth similarity, the text similarity between the text features of the first segment and those of the next first segment may be calculated, along with the image similarity between the image features of the two segments; the mean of the text similarity and the image similarity is then taken as the fourth similarity. By combining the text similarity and the image similarity of the first segments, the fourth similarity represents the similarity between the first segments more comprehensively, which improves its accuracy.
Specifically, the text features of the first segment and those of the next first segment may be brought to the same number of dimensions, and likewise the image features of the two segments. The cosine similarity between the text features is then calculated to obtain the text similarity, and the cosine similarity between the image features to obtain the image similarity.
Different first segments may contain different numbers of images and different numbers of sentences in their audio, so the text features and the image features of different first segments may have different dimensions. In the embodiment of the invention, first adjusting the two features to the same number of dimensions and then computing the similarity improves, to a certain extent, the accuracy of the resulting fourth similarity.
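A minimal sketch of the fourth similarity, assuming the features have already been brought to matching dimensions:

```python
import numpy as np

def fourth_similarity(txt_a, txt_b, img_a, img_b):
    """Mean of the text cosine similarity and the image cosine similarity."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return (cosine(txt_a, txt_b) + cosine(img_a, img_b)) / 2
```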
Step 1032: merge the first segment with the next first segment if the fourth similarity is greater than a third preset similarity threshold.
The third preset similarity threshold may be set according to actual requirements. If the fourth similarity is greater than the third preset similarity threshold, the two first segments correspond to the same item, so the current first segment and the next one may be merged; the merging operation may be implemented by a preset segment aggregation module. Further, after merging, or if the fourth similarity is not greater than the third preset similarity threshold, traversal continues with the next first segment, until no first segment has a fourth similarity with its successor greater than the third preset similarity threshold. In this way, adjacent first segments with high similarity are merged into one segment through continuous traversal. The traversal may run for multiple rounds, and the specific number of rounds may be determined according to the actual situation.
For example, assume there are 7 first segments: A, B, C, D, E, F and G, where A and B correspond to the video content of a mobile phone, C and D to that of a computer, and E, F and G to that of crayfish. Starting from A, the fourth similarity between A and the next first segment B is greater than the third preset similarity threshold, so A and B are merged; for C, the fourth similarity between C and the next first segment D is greater than the threshold, so C and D are merged; likewise, E and F are merged. Finally, for G there is no next first segment, so the first round of traversal ends, after which the first segments are: (A+B), (C+D), (E+F), G. Repeating the traversal, the second round merges (E+F) with G, yielding (A+B), (C+D) and (E+F+G). Since no first segment now has a fourth similarity with its successor greater than the third preset similarity threshold, the traversal stops, and (A+B), (C+D) and (E+F+G) are the second segments. Of course, in another implementation, a fourth similarity may be calculated, for each first segment, between its segment features and those of every other first segment, and the first segment merged with the other first segments whose fourth similarity is greater than the third preset similarity threshold to obtain a second segment.
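The repeated pairwise traversal of the example can be sketched as follows; averaging the merged segments' features is an assumption, since the text does not say how a merged segment's features are formed (segments are assumed to be lists and features numeric arrays):

```python
def merge_first_segments(segments, features, sim_fn, thresh):
    """Merge each segment with its successor while sim_fn exceeds thresh;
    repeat full passes until one pass merges nothing."""
    segs, feats = list(segments), list(features)
    merged = True
    while merged:                                  # traversal may take several rounds
        merged, i = False, 0
        while i < len(segs) - 1:
            if sim_fn(feats[i], feats[i + 1]) > thresh:
                segs[i] = segs[i] + segs[i + 1]            # concatenate the segments
                feats[i] = (feats[i] + feats[i + 1]) / 2   # assumed feature update
                del segs[i + 1], feats[i + 1]
                merged = True
            else:
                i += 1
    return segs
```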
When several first segments correspond to the same item, a single first segment cannot characterize the complete content; for example, because dividing first segments based on I-frames cannot perceive complete semantic information, several first segments may end up corresponding to the same item. The embodiment of the invention calculates the fourth similarity between the segment features of a first segment and those of the other first segments, and merges the first segment with those whose fourth similarity is greater than the third preset similarity threshold. In this way, the content integrity and accuracy of the finally split second segments are improved.
In one application scenario, with the continuous development of internet technology, video content on the network is becoming ever richer. For example, live-broadcast scenes generate a large number of live videos, and how to split them to improve their utilization has drawn much attention. In one existing implementation, during recording the host prompts the camera operator to start a separate take when beginning to introduce a new item and to stop that take when finishing the introduction; after the live broadcast ends, the individual clips shot in separate takes are uploaded for use. However, this approach is poorly automated, requires a large amount of manual coordination, and is inefficient.
Fig. 4 is a schematic diagram of a video content localization process according to an embodiment of the present invention. As shown in Fig. 4, the process may include a video segmentation and information acquisition stage, an item information acquisition stage, and a segment screening stage. The video segmentation and information acquisition stage takes the video images and audio information as input: the video is first roughly segmented to obtain the first segments, segment features are then extracted, and segment aggregation, i.e., merging, produces the second segments; the segments are then classified based on the second segments and their segment features to determine the second item categories. The item information acquisition stage takes the target image of the target item as input, performs image feature extraction, and classifies the item image based on the extracted target image features to determine the first item category. In the segment screening stage, category filtering is first performed based on the first and second item categories, the second segments of the matching category are taken as candidate segments, feature matching is performed based on the segment features and the target image features, and the target item segment is finally obtained. The embodiment of the invention thus splits and localizes the video automatically, without manual intervention, which improves efficiency.
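Finally, the segment screening stage of Fig. 4 can be expressed as a sketch over precomputed inputs; all inputs and `screen_fn` (e.g., `is_target_item_segment` above with a tracker bound via `functools.partial`) are injected rather than defined by the text:

```python
def screen_segments(second_segments, seg_categories, seg_frame_feats,
                    target_feats, target_categories, screen_fn):
    """Filter second segments by item category, then feature-match each
    target item against the remaining candidates."""
    located = {}
    for t_idx, (t_feat, t_cat) in enumerate(zip(target_feats, target_categories)):
        candidates = [i for i, c in enumerate(seg_categories) if c == t_cat]
        located[t_idx] = [second_segments[i] for i in candidates
                          if screen_fn(t_feat, seg_frame_feats[i])]
    return located
```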
Fig. 5 is a block diagram of a video processing apparatus according to an embodiment of the present invention, where the apparatus 20 may include:
a dividing module 201, configured to divide a video to be processed into a plurality of first segments; the video to be processed comprises video contents corresponding to a plurality of objects, and the similarity between video images contained in the first segment meets a first preset similarity requirement;
a first obtaining module 202, configured to obtain, for each of the first segments, segment features of the first segments according to audio information and/or video images included in the first segments;
the merging module 203 is configured to merge, according to the segment features of each first segment, the first segments whose similarity between the segment features meets a second preset similarity requirement, so as to obtain a second segment; the different second segments correspond to video content corresponding to different items.
Optionally, the apparatus 20 further includes:
the second acquisition module is used for acquiring a target image corresponding to the target object, extracting image features corresponding to the target image and obtaining target image features;
the determining module is used for determining a target object segment corresponding to the target object according to the target image feature and the image feature corresponding to the video image in each second segment; the target object segment is a second segment corresponding to the target object.
Optionally, the determining module is specifically configured to:
determining a first item category to which the target item belongs, and determining a second item category to which the item corresponding to each second segment belongs;
determining segments to be selected from the second segments according to the first item category and the second item category of each second segment; a segment to be selected is a second segment whose corresponding second item category matches the first item category;
and determining the target object segment from the segments to be selected according to the target image features and the image features of the video images in the segments to be selected.
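As a minimal sketch of this category-filtering step (the dataclass and its field names are illustrative assumptions, not the patented interface):

from dataclasses import dataclass

@dataclass
class SecondSegment:
    label: str     # illustrative identifier for the segment
    category: str  # second item category predicted for this segment

def select_candidates(segments: list[SecondSegment],
                      first_category: str) -> list[SecondSegment]:
    """Keep only the second segments whose second item category matches the
    first item category of the target object."""
    return [s for s in segments if s.category == first_category]

segments = [SecondSegment("A+B", "phone"),
            SecondSegment("C+D", "computer"),
            SecondSegment("E+F+G", "crayfish")]
print(select_candidates(segments, "crayfish"))
# [SecondSegment(label='E+F+G', category='crayfish')]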
Optionally, the determining module is further specifically configured to:
for any one of the segments to be selected, calculating a first similarity between the target image features and the image features of each video image in the segment to be selected;
if the maximum similarity among the first similarities is greater than a first preset similarity threshold, taking the video image corresponding to the maximum similarity as a reference video image;
starting from the reference video image, for any remaining video image in the segment to be selected, determining whether the target object appears in the remaining video image according to a second similarity between the reference video image and the remaining video image, so as to obtain the target video images;
and if the target quantity ratio is greater than a preset ratio threshold, determining the segment to be selected as the target object segment; the target quantity ratio is the ratio of the number of the target video images to the total number of video images in the segment to be selected.
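A compact sketch of this screening logic follows. It is an assumption-laden illustration: cosine similarity stands in for the first similarity, the reference-frame comparison collapses the sequential second/third-similarity test described below into a single check, and both threshold values are invented for the example.

import numpy as np

FIRST_SIM_THRESHOLD = 0.8  # hypothetical first preset similarity threshold
RATIO_THRESHOLD = 0.5      # hypothetical preset ratio threshold

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def is_target_segment(target_feat: np.ndarray,
                      frame_feats: list[np.ndarray]) -> bool:
    """Decide whether one candidate segment is the target object segment."""
    # First similarity between the target image and every frame of the segment.
    sims = [cosine(target_feat, f) for f in frame_feats]
    best = int(np.argmax(sims))
    if sims[best] <= FIRST_SIM_THRESHOLD:
        return False  # no frame is close enough to serve as a reference
    reference = frame_feats[best]
    # Count target video images, here judged directly against the reference.
    targets = sum(cosine(reference, f) > FIRST_SIM_THRESHOLD for f in frame_feats)
    return targets / len(frame_feats) > RATIO_THRESHOLD

rng = np.random.default_rng(0)
target = rng.normal(size=16)
frames = [target + rng.normal(scale=0.1, size=16) for _ in range(8)]
print(is_target_segment(target, frames))  # True for these near-duplicate frames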
Optionally, the determining module is further specifically configured to:
calculating the similarity between the area of the object in the reference video image and the area of the object in the remaining video image to obtain the second similarity, and calculating the similarity between the area of the object in the remaining video image and the area of the object in the previous target video image to obtain a third similarity;
determining a target similarity according to the second similarity and the third similarity; the target similarity is positively correlated with both the second similarity and the third similarity;
and if the target similarity is greater than a second preset similarity threshold, taking the remaining video image as a video image of the target object.
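Since the text only requires the target similarity to be positively correlated with the second and third similarities, a plain average is one admissible choice; the sketch below assumes that choice and an invented threshold value.

SECOND_SIM_THRESHOLD = 0.7  # hypothetical second preset similarity threshold

def target_similarity(second_sim: float, third_sim: float) -> float:
    # Any function increasing in both arguments satisfies the stated
    # positive-correlation requirement; an average is one simple option.
    return 0.5 * (second_sim + third_sim)

def shows_target(second_sim: float, third_sim: float) -> bool:
    """True when a remaining video image is judged to show the target object."""
    return target_similarity(second_sim, third_sim) > SECOND_SIM_THRESHOLD

print(shows_target(0.9, 0.8))  # True:  0.85 > 0.7
print(shows_target(0.9, 0.3))  # False: 0.60 <= 0.7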
Optionally, when dividing the video to be processed into the plurality of first segments, the dividing module 201 is specifically configured to:
determining key frames contained in the video to be processed;
For any key frame, dividing the key frame and similar adjacent frames of the key frame into the same segment to obtain the first segment;
wherein the similar adjacent frames are video images between the key frame and a next key frame.
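The rough split can be sketched with frame indices alone. The snippet assumes the key-frame (I-frame) positions have already been detected, e.g. by the decoder, and that the stream starts with a key frame; both are assumptions of this illustration.

def split_on_key_frames(num_frames: int, key_frames: list[int]) -> list[range]:
    """Group each key frame with the frames before the next key frame,
    yielding one first segment per key frame."""
    segments = []
    for i, start in enumerate(key_frames):
        end = key_frames[i + 1] if i + 1 < len(key_frames) else num_frames
        segments.append(range(start, end))
    return segments

# Frames 0..99 with key frames at 0, 30 and 70 yield three first segments.
print(split_on_key_frames(100, [0, 30, 70]))
# [range(0, 30), range(30, 70), range(70, 100)]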
Optionally, the first obtaining module 202 is specifically configured to:
extracting image features of the video images in the first segment, and extracting text features of a target text; the target text is the text corresponding to the audio information contained in the first segment;
and generating the segment feature according to the image features and the text features.
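One simple way to generate such a segment feature is to pool the per-frame image features and concatenate the result with the text feature; the pooling and concatenation here are assumptions of this sketch, as the text does not prescribe a particular fusion.

import numpy as np

def segment_feature(frame_feats: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Fuse image and text features into one segment feature.

    frame_feats: (num_frames, d_img) image features of the segment's frames.
    text_feat:   (d_txt,) feature of the text transcribed from the audio."""
    image_feat = frame_feats.mean(axis=0)           # pool frames to one vector
    return np.concatenate([image_feat, text_feat])  # shape (d_img + d_txt,)

frames = np.random.default_rng(0).normal(size=(30, 128))  # 30 frames, d_img=128
text = np.zeros(64)                                       # d_txt=64 placeholder
print(segment_feature(frames, text).shape)                # (192,)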
Optionally, the merging module 203 is specifically configured to:
for any one of the first segments, calculating a fourth similarity between segment features of the first segment and segment features of a next first segment;
and combining the first segment with the next first segment in the case that the fourth similarity is greater than a preset threshold.
Optionally, the merging module 203 is further specifically configured to: calculating the text similarity of the text features of the first segment and the text features of the next first segment, and calculating the image similarity of the image features of the first segment and the image features of the next first segment;
And determining the average value between the text similarity and the image similarity as the fourth similarity.
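Assuming cosine similarity as the underlying measure (the text does not fix one), the fourth similarity reduces to two similarities and their mean:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fourth_similarity(text_a: np.ndarray, image_a: np.ndarray,
                      text_b: np.ndarray, image_b: np.ndarray) -> float:
    """Mean of the text similarity and the image similarity of two segments."""
    return 0.5 * (cosine(text_a, text_b) + cosine(image_a, image_b))

rng = np.random.default_rng(0)
t1, t2 = rng.normal(size=64), rng.normal(size=64)
v1, v2 = rng.normal(size=128), rng.normal(size=128)
print(fourth_similarity(t1, v1, t2, v2))  # a value in [-1, 1]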
Optionally, the segment features include text features and image features; the determining module is further specifically configured to: for any one of the second segments, adjusting the text features of the second segment and the image features of the second segment to the same dimension;
determining a mean value characteristic according to the adjusted text characteristic and the adjusted image characteristic;
and classifying the mean value characteristics according to a preset classification network to determine the second object class.
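The dimension alignment, averaging, and classification can be sketched as below; the random projection matrices, the classifier weights, and the label set are placeholders for a trained "preset classification network".

import numpy as np

rng = np.random.default_rng(0)

D = 256                                   # common dimension after adjustment
W_TXT = rng.normal(size=(64, D)) * 0.1    # placeholder projection for text
W_IMG = rng.normal(size=(128, D)) * 0.1   # placeholder projection for image
W_CLS = rng.normal(size=(D, 3))           # placeholder classifier head
CATEGORIES = ["phone", "computer", "crayfish"]  # illustrative label set

def classify_second_segment(text_feat: np.ndarray, image_feat: np.ndarray) -> str:
    t = text_feat @ W_TXT          # adjust text feature to dimension D
    v = image_feat @ W_IMG         # adjust image feature to dimension D
    mean_feat = 0.5 * (t + v)      # mean feature of the two modalities
    logits = mean_feat @ W_CLS     # stand-in for the classification network
    return CATEGORIES[int(np.argmax(logits))]

print(classify_second_segment(rng.normal(size=64), rng.normal(size=128)))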
The video processing apparatus provided by the embodiment of the invention can divide the video to be processed into a plurality of first segments, where the video to be processed includes video content corresponding to a plurality of objects and the similarity between video images contained in a first segment meets a first preset similarity requirement; for each first segment, acquire segment features of the first segment according to the audio information and/or video images it contains; and, according to the segment features of each first segment, merge first segments whose similarity between segment features meets a second preset similarity requirement to obtain second segments, where different second segments correspond to video content of different items. Compared with manual splitting, automatically splitting the video to be processed into second segments corresponding to different items can reduce the implementation cost to a certain extent and improve the splitting efficiency.
The present invention also provides an electronic device, see Fig. 6, comprising: a processor 301, a memory 302, and a computer program 3021 stored on the memory and executable on the processor, wherein the processor implements the video processing method of the foregoing embodiments when executing the program.
The present invention also provides a readable storage medium; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the video processing method of the foregoing embodiments.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided to disclose the enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be construed as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in a video processing device according to the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention may also be implemented as an apparatus or device program for performing part or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order. These words may be interpreted as names.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (9)

1. A method of video processing, the method comprising: dividing a video to be processed into a plurality of first segments, wherein the video to be processed comprises video content corresponding to a plurality of objects, and the similarity between video images contained in a first segment meets a first preset similarity requirement; for each first segment, acquiring segment features of the first segment according to audio information and/or video images contained in the first segment; and merging, according to the segment features of each first segment, first segments whose similarity between segment features meets a second preset similarity requirement to obtain second segments, wherein different second segments correspond to video content corresponding to different objects;
The method further comprises the steps of: acquiring a target image corresponding to a target object, and extracting image features corresponding to the target image to obtain target image features; determining a target object segment corresponding to the target object according to the target image features and the image features corresponding to the video images in the second segments; the target object segment is a second segment corresponding to the target object;
the determining the target object segment corresponding to the target object according to the target image features and the image features corresponding to the video images in each second segment comprises: determining a first item category to which the target object belongs, and determining a second item category to which the object corresponding to each second segment belongs; determining segments to be selected from the second segments according to the first item category and the second item category of each second segment, wherein a segment to be selected is a second segment whose corresponding second item category matches the first item category; and determining the target object segment from the segments to be selected according to the target image features and the image features of the video images in the segments to be selected;
the determining the target object segment from the segments to be selected according to the target image features and the image features of the video images in the segments to be selected comprises: for any one of the segments to be selected, calculating a first similarity between the target image features and the image features of each video image in the segment to be selected; if the maximum similarity among the first similarities is greater than a first preset similarity threshold, taking the video image corresponding to the maximum similarity as a reference video image; starting from the reference video image, for any remaining video image in the segment to be selected, determining whether the target object appears in the remaining video image according to a second similarity between the reference video image and the remaining video image, so as to obtain the target video images; and if the target quantity ratio is greater than a preset ratio threshold, determining the segment to be selected as the target object segment, wherein the target quantity ratio is the ratio of the number of the target video images to the total number of video images in the segment to be selected.
2. The method of claim 1, wherein determining the video image of the target object that appears in the remaining video image based on the second similarity between the reference video image and the remaining video image comprises: calculating the similarity between the area of the object in the reference video image and the area of the object in the remaining video image to obtain the second similarity, and calculating the similarity between the area of the object in the remaining video image and the area of the object in the previous target video image to obtain a third similarity; determining a target similarity according to the second similarity and the third similarity, wherein the target similarity is positively correlated with both the second similarity and the third similarity; and if the target similarity is greater than a second preset similarity threshold, taking the remaining video image as a video image of the target object.
3. The method according to claim 1 or 2, wherein the dividing the video to be processed into a plurality of first segments comprises: determining key frames contained in the video to be processed; for any key frame, dividing the key frame and similar adjacent frames of the key frame into the same segment to obtain the first segment; wherein the similar adjacent frames are video images between the key frame and a next key frame.
4. The method according to claim 1, wherein the obtaining the segment features of the first segment according to the audio information and/or the video images contained in the first segment comprises: extracting image features of the video images in the first segment, and extracting text features of a target text, the target text being the text corresponding to the audio information contained in the first segment; and generating the segment feature according to the image features and the text features.
5. The method according to claim 1, wherein the merging, according to the segment characteristics of each of the first segments, the first segments whose similarity between the segment characteristics satisfies the second preset similarity requirement includes: for any one of the first segments, calculating a fourth similarity between segment features of the first segment and segment features of a next first segment; and combining the first segment with the next first segment in the case that the fourth similarity is greater than a preset threshold.
6. The method of claim 5, wherein the segment features include text features and image features; the computing a fourth similarity between the segment features of the first segment and segment features of a next first segment comprises: calculating the text similarity of the text features of the first segment and the text features of the next first segment, and calculating the image similarity of the image features of the first segment and the image features of the next first segment; and determining the average value between the text similarity and the image similarity as the fourth similarity.
7. The method of claim 1, wherein the segment features include text features and image features; the determining the second article category to which the article corresponding to each second segment belongs includes: for any one of the second segments, adjusting the text features of the second segment and the image features of the second segment to the same dimension; determining a mean value characteristic according to the adjusted text characteristic and the adjusted image characteristic; and classifying the mean value characteristics according to a preset classification network to determine the second object class.
8. An electronic device, comprising: processor, memory and computer program stored on the memory and executable on the processor, characterized in that the processor implements the video processing method according to one or more of claims 1-7 when executing the program.
9. A readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video processing method of one or more of claims 1-7.
CN202110343767.4A 2021-03-30 2021-03-30 Video processing method, electronic device and storage medium Active CN115150636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110343767.4A CN115150636B (en) 2021-03-30 2021-03-30 Video processing method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN115150636A CN115150636A (en) 2022-10-04
CN115150636B (en) 2023-11-14

Family

ID=83404697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110343767.4A Active CN115150636B (en) 2021-03-30 2021-03-30 Video processing method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115150636B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291910A (en) * 2017-06-26 2017-10-24 图麟信息科技(深圳)有限公司 A kind of video segment structuralized query method, device and electronic equipment
CN111400543A (en) * 2020-03-20 2020-07-10 腾讯科技(深圳)有限公司 Audio segment matching method, device, equipment and storage medium
CN111787356A (en) * 2020-07-09 2020-10-16 易视腾科技股份有限公司 Target video clip extraction method and device
CN111797850A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN112055225A (en) * 2019-06-06 2020-12-08 阿里巴巴集团控股有限公司 Live broadcast video interception, commodity information generation and object information generation methods and devices

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190214055A1 (en) * 2018-01-09 2019-07-11 Splic, Inc. Methods and systems for creating seamless interactive video content
US10880434B2 (en) * 2018-11-05 2020-12-29 Nice Ltd Method and system for creating a fragmented video recording of events on a screen using serverless computing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Segmentation of broadcast news videos by topic; Rajab Davudov et al.; 2010 IEEE 18th Signal Processing and Communications Applications Conference; full text *
Research on content-based video structuring methods; Yuan Zhiyun (袁祉赟); China Master's Theses Full-text Database, Information Science and Technology, No. 2; full text *
Video person re-identification based on non-local attention and multi-feature fusion; Liu Ziyan (刘紫燕) et al.; Journal of Computer Applications, No. 2; full text *


Similar Documents

Publication Publication Date Title
US20210012094A1 (en) Two-stage person searching method combining face and appearance features
CN109389086B (en) Method and system for detecting unmanned aerial vehicle image target
CN107358141B (en) Data identification method and device
CN107633023B (en) Image duplicate removal method and device
CN110263215B (en) Video emotion positioning method and system
CN109508406B (en) Information processing method and device and computer readable storage medium
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN107315795B (en) The instance of video search method and system of joint particular persons and scene
CN101013444A (en) Method and apparatus for adaptively generating abstract of football video
CN110730381A (en) Method, device, terminal and storage medium for synthesizing video based on video template
CN109815823B (en) Data processing method and related product
KR101820456B1 (en) Method And Apparatus for Generating Depth MAP
CN113920585A (en) Behavior recognition method and device, equipment and storage medium
CN115150636B (en) Video processing method, electronic device and storage medium
CN113569687A (en) Scene classification method, system, equipment and medium based on double-flow network
JP2018206292A (en) Video summary creation device and program
CN111832351A (en) Event detection method and device and computer equipment
CN113259734B (en) Intelligent broadcasting guide method, device, terminal and storage medium for interactive scene
Wang et al. Visual saliency based aerial video summarization by online scene classification
WO2007036892A1 (en) Method and apparatus for long term memory model in face detection and recognition
Roig et al. Multi-modal pyramid feature combination for human action recognition
Quenot et al. Rushes summarization by IRIM consortium: redundancy removal and multi-feature fusion
Kannappan et al. Human consistency evaluation of static video summaries
Guilluy et al. Feature trajectories selection for video stabilization
US20230316678A1 (en) Augmented reality guiding method, apparatus, electronic device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant