WO2022247849A1

WO2022247849A1 - Multimedia data processing method and apparatus, and device and storage medium

Info

Publication number: WO2022247849A1
Application number: PCT/CN2022/094878
Authority: WO
Inventors: 郐洪楠; 刘伟科; 韩卫召; 沈俊杰
Original assignee: 北京沃东天骏信息技术有限公司
Priority date: 2021-05-25
Filing date: 2022-05-25
Publication date: 2022-12-01
Also published as: CN113377713A

Abstract

Disclosed in the present application is a multimedia data processing method. The method comprises: on the basis of a label to be deleted, determining multimedia data to be processed, wherein a label of said multimedia data comprises the label to be deleted; identifying the content of said multimedia data, and determining a fragment to be deleted from said multimedia data, wherein the content of said fragment comprises an object corresponding to the label to be deleted; and clipping said fragment from said multimedia data, so as to obtain target multimedia data. Further disclosed in the present application are a multimedia data processing apparatus, a device, and a storage medium.

Description

Multimedia data processing method and apparatus, device, and storage medium

Cross References to Related Applications

This application is based on a Chinese patent application with application number 202110569790.5 and a filing date of May 25, 2021, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated into this application by reference.

Technical field

The present application relates to the field of computer technology, and relates to, but is not limited to, a multimedia data processing method, apparatus, device, and storage medium.

Background

Today, with the rapid development of live broadcast business, building a content security platform and reducing negative impact are the basic conditions for the development of live broadcast business.

At present, historical live videos with content security issues are handled manually. If, in the live broadcast information registered by a merchant, a live video resource is found to contain a product that needs to be removed from the shelf, the entire live video resource has to be deleted.

However, this solution has the following technical problem: consumers occasionally look through historical live video resources to check the order details at that time, such as whether the number of gifts is consistent with the live broadcast. Directly removing the entire live video resource also deletes the video content that consumers want to browse, thereby causing unnecessary deletion of video content.

Summary of the invention

In order to solve at least one problem in the related art, this application provides a multimedia data processing method, device, equipment, and storage medium, which can meet the actual needs of users and automatically delete multimedia data segments that need to be deleted in multimedia data.

The technical scheme of the present application is realized like this:

In a first aspect, an embodiment of the present application provides a multimedia data processing method, the method comprising:

Determining the multimedia data to be processed based on the label to be deleted, wherein the label of the multimedia data to be processed includes the label to be deleted;

Identifying the content of the multimedia data to be processed, and determining the segment to be deleted in the multimedia data to be processed, wherein the content of the segment to be deleted includes the object corresponding to the label to be deleted;

Cutting out the segment to be deleted from the multimedia data to be processed to obtain target multimedia data.

In a second aspect, an embodiment of the present application provides a multimedia data processing device, including:

The determining unit is configured to determine the multimedia data to be processed based on the label to be deleted; the label of the multimedia data to be processed includes the label to be deleted;

The identification unit is configured to identify the content of the multimedia data to be processed, and determine the segment to be deleted in the multimedia data to be processed, and the content of the segment to be deleted includes the object corresponding to the label to be deleted;

The clipping unit is configured to clip the segment to be deleted from the multimedia data to be processed to obtain target multimedia data.

In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above multimedia data processing method when running the computer program.

In a fourth aspect, the embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps in the above multimedia data processing method are implemented.

In the embodiments of the present application, a multimedia data processing method, apparatus, device, and storage medium are provided, including: determining the multimedia data to be processed based on the tag to be deleted, wherein the tag of the multimedia data to be processed includes the tag to be deleted; identifying the content of the multimedia data to be processed, and determining the segment to be deleted in the multimedia data to be processed, wherein the content of the segment to be deleted includes the object corresponding to the tag to be deleted; and cutting out the segment to be deleted from the multimedia data to be processed to obtain the target multimedia data. In this way, from multimedia data whose tag includes the tag to be deleted, only the segments whose content includes the object corresponding to the tag to be deleted are removed, and the segments that do not involve the tag to be deleted are retained. Therefore, when a product is removed from the shelf, there is no need to delete the video involving the product as a whole, which avoids the impact of the removed product on the normal display of other products.

Brief description of the drawings

FIG. 1 is a schematic diagram of an optional architecture of a multimedia data processing system provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of an optional architecture of a multimedia data processing system provided in an embodiment of the present application;

FIG. 3 is an optional schematic flowchart of a multimedia data processing method provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of an optional effect of multimedia data clipping provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an optional multimedia data processing system provided by an embodiment of the present application;

FIG. 6 is an optional schematic flowchart of a method for processing multimedia data at a merchant end provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of optional reception of live file information provided by the embodiment of the present application;

FIG. 8 is an optional flow diagram of the merchant-end live broadcast process provided by the embodiment of the present application;

FIG. 9 is an optional schematic flowchart of a method for processing multimedia data at a merchant end provided by an embodiment of the present application;

FIG. 10 is a schematic diagram of optional segmentation of multimedia data provided by the embodiment of the present application;

FIG. 11 is a schematic diagram of an optional clipping effect of multimedia data provided by an embodiment of the present application;

FIG. 12 is a schematic structural diagram of an optional multimedia data processing device provided in an embodiment of the present application;

FIG. 13 is a schematic structural diagram of an optional electronic device provided by an embodiment of the present application.

Detailed description

In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the specific technical solutions of the application will be further described in detail below in conjunction with the drawings in the embodiments of the present application. The following examples are used to illustrate the present application, but not to limit the scope of the present application.

Embodiments of the present application may provide a multimedia data processing method and system, and a storage medium. In practical applications, the multimedia data processing method can be implemented by a multimedia data processing system, and each functional entity in the multimedia data processing system can be implemented cooperatively by the hardware resources of an electronic device (such as a terminal device or a server), such as computing resources (for example, a processor) and communication resources (for example, resources that support communication in various ways such as optical cable and cellular).

The multimedia data processing method of the embodiments of the present application can be applied to the multimedia data processing system shown in FIG. 1, which includes a client 10 and a server 20. The client 10 interacts with the user through an input device and receives the label to be deleted that the user inputs. The input device includes a display, a mouse, a keyboard, and other devices capable of receiving information input by the user.

In an example, the client 10 and the server 20 are located on different physical entities, in which case the server 20 can communicate with the client 10 through the network 30.

In one example, as shown in FIG. 2, the multimedia data processing system also includes a multimedia data acquisition terminal 40. The multimedia data acquisition terminal 40 can collect multimedia data through a data acquisition device and send the collected multimedia data to the server 20. Data acquisition devices include cameras, microphones, and other equipment capable of data collection.

The client 10 receives, through the input device, the label to be deleted input by the user, and sends the label to be deleted to the server 20. The server 20 determines the multimedia data to be processed based on the label to be deleted, wherein the label of the multimedia data to be processed includes the label to be deleted; identifies the content of the multimedia data to be processed and determines the segment to be deleted in the multimedia data to be processed, wherein the content of the segment to be deleted includes the object corresponding to the label to be deleted; and cuts out the segment to be deleted from the multimedia data to be processed to obtain the target multimedia data.

In practical applications, the client 10 can be an operation terminal that operates and manages the videos stored in the server; the multimedia data acquisition terminal 40 can be a live broadcast terminal that runs a live broadcast application with which users perform live broadcasts; and the server 20 is the server that provides services for the live broadcast application.

In combination with the above multimedia data processing system, this embodiment proposes a multimedia data processing method that can meet the actual needs of users and automatically delete multimedia data segments that need to be deleted in the multimedia data.

Next, with reference to the multimedia data processing system shown in FIG. 1 or FIG. 2, various embodiments of the multimedia data processing method, apparatus, device, and storage medium provided by the embodiments of the present application will be described.

An embodiment of the present application provides a multimedia data processing method. The functions realized by the method can be realized by calling the program codes by the processor in the electronic device, and of course the program codes can be stored in the computer storage medium. It can be seen that the electronic device at least includes a processor and a storage medium.

Fig. 3 is a schematic diagram of the implementation flow of a multimedia data processing method in the embodiment of the present application. As shown in Fig. 3, the method may include the following steps:

S301. The server determines multimedia data to be processed based on the tag to be deleted; the tag of the multimedia data to be processed includes the tag to be deleted.

When the user needs to delete the multimedia data related to a certain object, the user can input the tag to be deleted on the client and perform a masking operation that triggers the video deletion function of the client. When the client receives the masking operation, it generates a masking instruction and sends the masking instruction to the server. The masking instruction carries the tag to be deleted.

After receiving the masking instruction, the server parses the masking instruction to obtain the tag to be deleted.

After the server determines the tag to be deleted, it determines the multimedia data to be processed. The multimedia data to be processed may be the multimedia data designated by the user through the client, or may be selected from at least one piece of multimedia data. Here, the file type of the multimedia data may be video, audio and other data that lasts for a period of time.

When the multimedia data to be processed is multimedia data specified by the user through the client, the server judges whether the label of the specified multimedia data includes the label to be deleted; if it does, the specified multimedia data is determined as the multimedia data to be processed.

In an example, the specified multimedia data is multimedia data A, and the tags of multimedia data A include tag 1, tag 2, tag 3, and tag 4. When the tag to be deleted is tag 1, multimedia data A is multimedia data to be processed; when the tag to be deleted is tag 5, multimedia data A is not multimedia data to be processed.

When the multimedia data to be processed is selected from at least two pieces of multimedia data, the multimedia data whose tag includes the tag to be deleted among the at least two pieces of multimedia data is determined as the multimedia data to be processed.

In an example, the at least two pieces of multimedia data include multimedia data A, multimedia data B, and multimedia data C, and the label to be deleted is label 2. The label of multimedia data A includes label 2, so multimedia data A is multimedia data to be processed; the label of multimedia data B does not include label 2, so multimedia data B is not multimedia data to be processed; the label of multimedia data C includes label 2, so multimedia data C is multimedia data to be processed. In this case, the multimedia data to be processed includes multimedia data A and multimedia data C.

In the embodiments of the present application, a label list may be established in the server. The label list records the association between the identifier of each piece of multimedia data and its labels. Based on the label list, the server can determine the multimedia data whose labels include the label to be deleted. The labels in the label list may be input by the user.

In one example, the multimedia data to be processed is the live video of a live broadcast. Before the live broadcast, the live broadcast terminal can receive the live file information input by the user, and the live file information can include the labels of this live broadcast. The live broadcast terminal sends the live file information to the server, and after the live video is generated from the user's live broadcast, the server establishes an association between the live file information and the video ID of the generated live video.

In the embodiments of this application, the tags of the multimedia data to be processed represent the commodities involved in its content. In this case, the multimedia data to be processed can be associated with multiple commodity links. Each commodity link points to a commodity purchase page that includes product information, so the product information of each product can be obtained based on its commodity link. Here, the product information obtained through the commodity links is used as the tags of the multimedia data to be processed.

In an example, the product links associated with the multimedia data to be processed include: product links of product 1, product links of product 2 and product links of product 3, and the tags of the multimedia data to be processed include: product 1, product 2 and product 3 .

In the embodiment of the present application, when the multimedia data to be processed is a video, the multimedia data to be processed can be a video downloaded from the network side, such as: a video of a TV series, or a live video formed by a video stream uploaded by a user, such as: User A's live video.

S302. The server identifies the content of the multimedia data to be processed, and determines a segment to be deleted in the multimedia data to be processed, where the content of the segment to be deleted includes an object corresponding to the tag to be deleted.

After the server determines the multimedia data to be processed, it retrieves the multimedia data to be processed and identifies, in its content, the object corresponding to the label to be deleted, that is, the object to be deleted. In the embodiments of the present application, the tag to be deleted may represent any object that can appear in the multimedia data, such as a person, a commodity, or a building. In an example, when the tag to be deleted represents a product, the tag to be deleted may be product information such as the name of the product or its stock keeping unit (SKU).

Here, the object to be deleted is represented differently in multimedia data of different file types. When the file type of the multimedia data is video, the object to be deleted appears as image content in the video; when the file type of the multimedia data is audio, the object to be deleted appears as audio content in the audio.

The segment to be deleted is a segment of the multimedia data whose content includes the object to be deleted. In one example, the object to be deleted is person A, the multimedia data to be processed is video A, and the duration of video A is T1. If the video content within the time period [t1, t2] of video A includes person A, the segment to be deleted is the video of the time period [t1, t2], where t1 is greater than or equal to 0 and t2 is less than or equal to T1. In another example, the object to be deleted is person A, the multimedia data to be processed is audio B, and the duration of audio B is T2. If the audio content within the time period [t3, t4] of audio B includes person A, the segment to be deleted is the audio segment of the time period [t3, t4], where t3 is greater than or equal to 0 and t4 is less than or equal to T2.

In the embodiment of the present application, for an object to be deleted, one or more segments to be deleted may be included in one piece of multimedia data to be processed.

In the embodiment of the present application, the multimedia data to be processed may be stored in the server, or may be stored in a storage terminal corresponding to the server. At this time, the server retrieves the multimedia data to be processed from the storage terminal.

S303. The server cuts the segment to be deleted from the multimedia data to be processed to which the segment to be deleted belongs, to obtain target multimedia data.

After the server determines the segment to be deleted, it cuts the segment to be deleted out of the multimedia data to be processed.

In an example, the multimedia data to be processed is video A, the duration of video A is T1, and the segment to be deleted is the video of the time period [t1, t2]. The video of the time period [t1, t2] is cut from video A to obtain the target video, and three cases arise. When t1 is 0 and t2 is less than T1, the video content of the target video is the video content of video A during [t2, T1]. When t1 is greater than 0 and t2 is equal to T1, the video content of the target video is the video content of video A during [0, t1]. When t1 is greater than 0 and t2 is less than T1, the video content of the target video is the video content of video A during the two periods [0, t1] and [t2, T1].
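The three cases above reduce to computing which time intervals of the source remain after the segments to be deleted are removed. The following is a minimal sketch of that bookkeeping only, not the application's implementation; the function name `remaining_intervals` and the representation of segments as `(start, end)` pairs are assumptions made for illustration, and the actual cutting and re-joining of media is outside its scope.

```python
def remaining_intervals(duration, cuts):
    """Return the (start, end) intervals kept after removing every cut.

    duration: total length T1 of the source, in any time unit.
    cuts: iterable of (start, end) segments to be deleted (hypothetical format).
    """
    kept = []
    cursor = 0  # start of the next interval that may be kept
    for start, end in sorted(cuts):
        if not (0 <= start < end <= duration):
            raise ValueError("each cut must satisfy 0 <= start < end <= duration")
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)  # also handles overlapping cuts
    if cursor < duration:
        kept.append((cursor, duration))
    return kept


# The three cases from the text, with T1 = 100:
print(remaining_intervals(100, [(0, 30)]))    # cut at the beginning
print(remaining_intervals(100, [(40, 100)]))  # cut at the end
print(remaining_intervals(100, [(40, 60)]))   # cut in the middle
```

Accepting a list of cuts rather than a single pair is a design choice that also covers the case, mentioned earlier, in which one piece of multimedia data contains several segments to be deleted for the same object.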

The multimedia data processing method provided in the embodiment of the present application can be applied to the following scenarios:

Scenario 1. A live video is a live video of user A selling commodities, as shown in (a) of FIG. 4. The commodity sold in video segment 401, in the time period from T0 to T1, is commodity A; the commodity sold in video segment 402, from T1 to T2, is commodity B; the commodity sold in video segment 403, from T2 to T3, is commodity C; and the commodity sold in video segment 404, from T3 to T4, is commodity D. If commodity B needs to be removed from the shelves, video segment 402 is deleted from this video as the segment to be deleted, and the resulting target video is as shown in (b) of FIG. 4.

Scenario 2. Audio 1 is the interview audio of user A, user B, user C, and user D. When user B does not meet the interview conditions and user B's interview content needs to be deleted from the audio, the segments corresponding to user B's interview are removed from this audio, and only the interviews with user A, user C, and user D remain.

In the embodiments of the present application, a multimedia data processing method is provided, which determines the multimedia data to be processed based on the label to be deleted, wherein the label of the multimedia data to be processed includes the label to be deleted; identifies the content of the multimedia data to be processed and determines the segment to be deleted in the multimedia data to be processed, wherein the content of the segment to be deleted includes the object corresponding to the label to be deleted; and cuts the segment to be deleted out of the multimedia data to be processed to obtain the target multimedia data. In this way, from multimedia data whose label includes the label to be deleted, only the segments whose content includes the object corresponding to the label to be deleted are removed, and the segments that do not involve the label to be deleted are retained. There is therefore no need, when a certain product is removed from the shelf, to delete the video involving the product as a whole, which avoids the impact of the removed product on the normal display of other products.

In some embodiments, the implementation of S301 determining the multimedia data to be processed based on the tag to be deleted includes:

S3011. Obtain a tag of each of the at least two original multimedia data;

S3012. Compare the tag of each original multimedia data with the tag to be deleted;

S3013. Determine the original multimedia data whose tag includes the tag to be deleted as the multimedia data to be processed.

In the embodiment of the present application, the masking instruction sent by the client may be a one-key masking instruction, and at this time, the server retrieves all multimedia data to be processed from at least two original multimedia data.

When retrieving multimedia data to be processed among at least two original multimedia data, the server acquires tags of each original multimedia data, wherein one original multimedia data includes one or more tags.

In the embodiment of the present application, the tag of the original multimedia data may be input by the user, or may be obtained by the server from the product link of the original multimedia data.

In the case that the tag of the original multimedia data is input by the user, when the user uploads the original multimedia data through the multimedia data collection terminal, the user inputs the tag of the multimedia data in the multimedia data collection terminal, so that the tag and the original multimedia data are sent together to the server.

In one example, the multimedia data collection terminal is the user's live broadcast terminal. Before the live broadcast, the user inputs the live file information of this live broadcast, which may include detailed information about the live broadcast such as the live broadcast title, live broadcast time, home page picture, and shopping cart product list. The shopping cart product list includes the product information of the products to be sold in the live broadcast.

After the server obtains the tags of each original multimedia data, it performs the following processing for each original multimedia data: match the tags of the original multimedia data with the tag to be deleted, and determine the original multimedia data whose tags include the tag to be deleted as multimedia data to be processed.

In one example, the at least two original multimedia data include multimedia data A, multimedia data B, and multimedia data C, and the tag to be deleted is tag 2. The tags of multimedia data A include tag 1, tag 2, and tag 3; the tags of multimedia data B include tag 4 and tag 5; and the tags of multimedia data C include tag 2, tag 5, and tag 6. The multimedia data to be processed therefore includes multimedia data A and multimedia data C.
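Steps S3011 to S3013 amount to a membership test over each item's tag set. A minimal sketch follows, assuming the server keeps the tags in a mapping from a multimedia identifier to a set of tags; the mapping layout and the function name `select_media_to_process` are hypothetical and are not taken from the application.

```python
def select_media_to_process(media_tags, tag_to_delete):
    """Return the identifiers of all items whose tags include tag_to_delete.

    media_tags: dict mapping a media identifier to its set of tags
                (an assumed stand-in for the server-side tag list).
    """
    return [
        media_id
        for media_id, tags in media_tags.items()
        if tag_to_delete in tags
    ]


# The example from the text: tag 2 selects multimedia data A and C.
media_tags = {
    "A": {"tag 1", "tag 2", "tag 3"},
    "B": {"tag 4", "tag 5"},
    "C": {"tag 2", "tag 5", "tag 6"},
}
print(select_media_to_process(media_tags, "tag 2"))
```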

In some embodiments, before S301 determines the multimedia data to be processed based on the tag to be deleted, the following steps are also implemented:

Obtain at least one commodity link under the account to which the original multimedia data belongs;

The commodity information targeted by each commodity link in the at least one commodity link is determined as a tag of the original multimedia data.

In the embodiment of the present application, if the original multimedia data satisfies the set condition, the label of the original multimedia data can be determined from the product link under the account to which the original multimedia data belongs.

Here, setting conditions may include at least one of the following:

Condition 1. The interval between the generation time of the original multimedia data and the current time is less than a set duration;

Condition 2. The original multimedia data is the latest multimedia data under the account to which the original multimedia data belongs.

In Condition 1, the set duration may be 24 hours.

The server determines the account to which the multimedia data to be processed belongs; the account may be the account used when uploading the multimedia data to be processed. The server determines the product links under the account, and uses the product information of the products targeted by the product links as the tags of the multimedia data to be processed.
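The condition check and the derivation of tags from product links can be sketched as follows. This is an illustration under stated assumptions: the text does not say whether the two conditions must hold together, so the sketch treats either one as sufficient; the lookup table `product_info_by_link` stands in for fetching the product purchase page; and all names are hypothetical.

```python
from datetime import datetime, timedelta


def meets_set_conditions(generated_at, now, is_latest_for_account,
                         max_age=timedelta(hours=24)):
    """Condition 1: generated less than max_age ago.
    Condition 2: latest item under its account.
    Assumption: satisfying either condition is enough."""
    return (now - generated_at) < max_age or is_latest_for_account


def tags_from_product_links(links, product_info_by_link):
    """Use the product information behind each link as a tag.

    product_info_by_link is a hypothetical lookup that replaces
    retrieving the purchase page a link points to."""
    return [product_info_by_link[link]
            for link in links
            if link in product_info_by_link]
```

For example, with `now = datetime(2022, 5, 25, 12, 0)`, an item generated at midnight the same day satisfies Condition 1 with the 24-hour setting, while an item generated two days earlier qualifies only if it is the latest one under its account.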

In some embodiments, the file type of the multimedia data to be processed includes video, and the implementation of S302, identifying the content of the multimedia data to be processed and determining the segment to be deleted in the multimedia data to be processed, includes:

S3021. Extract the video frame sequence of the multimedia data to be processed;

S3022. Obtain a reference image corresponding to the tag to be deleted;

S3023. Match the reference image with the video frames in the video frame sequence to obtain a matching result;

S3024. According to the matching result, determine the trimming start frame and the trimming end frame corresponding to the trimming start frame; the video frames between the trimming start frame and the trimming end frame constitute the segment to be deleted.

In the embodiment of the present application, the reference image may be the main product image of the object corresponding to the tag to be deleted. Among them, the product main image is the main image displayed to the user on the product detail page, which can directly display the product.

After the server acquires the reference image, it matches the video frames in the video frame sequence with the reference image. The matching result includes first video frames and second video frames: a first video frame is a video frame whose content includes the object corresponding to the tag to be deleted, and a second video frame is a video frame whose content does not include the object corresponding to the tag to be deleted.

In the embodiments of the present application, the image similarity between a video frame and the reference image can be calculated. A video frame whose image similarity with the reference image is greater than the similarity threshold is determined as a first video frame, and a video frame whose image similarity with the reference image is less than or equal to the similarity threshold is determined as a second video frame. The image similarity can be represented by the Hamming distance between images; the embodiments of the present application impose no limitation on how the image similarity is represented.

In the embodiments of the present application, a plurality of consecutive first video frames constitute a segment to be deleted. The earliest video frame in the segment to be deleted is the clipping start video frame, and the last video frame in the segment to be deleted is the clipping end video frame. In addition, in the video frame sequence, the video frame immediately preceding the clipping start video frame is a second video frame, and the video frame immediately following the clipping end video frame is a second video frame.

The server can determine the image similarity between every video frame and the reference image. When the image similarity between a video frame and the reference image is greater than the similarity threshold, the video frame is a first video frame; when the image similarity between a video frame and the reference image is less than or equal to the similarity threshold, the video frame is a second video frame.
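The classification rule and the grouping of consecutive first video frames into segments can be sketched as a single pass over per-frame similarity scores. The sketch assumes the similarities have already been computed and that a higher value means a closer match; the function name `segments_to_delete` is hypothetical.

```python
def segments_to_delete(similarities, threshold):
    """Group consecutive frames with similarity > threshold into segments.

    Returns a list of (clip_start_index, clip_end_index) pairs, both inclusive.
    Frames with similarity <= threshold are treated as second video frames.
    """
    segments = []
    start = None  # index of the clipping start frame of the open segment
    for index, similarity in enumerate(similarities):
        is_first_frame = similarity > threshold
        if is_first_frame and start is None:
            start = index
        elif not is_first_frame and start is not None:
            segments.append((start, index - 1))
            start = None
    if start is not None:  # the video ends inside a segment
        segments.append((start, len(similarities) - 1))
    return segments


print(segments_to_delete([0.1, 0.9, 0.95, 0.2, 0.8, 0.8], threshold=0.5))
```

By construction, the frame before each returned start index and the frame after each returned end index are second video frames, which matches the boundary property described above.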

The server can also estimate the time for which a product is broadcast and periodically search the video for the video segment of the product, thereby reducing the workload of image recognition. For example, if the video frame at T1 is detected as a first video frame, the server detects whether the video frame at T1+t1 is a first video frame. If the video frame at T1+t1 is not a first video frame, the server goes back and detects whether the video frame at T1+t1-t2 (where t2 is less than t1) is a first video frame. If the video frame at T1+t1 is a first video frame, the server detects whether the video frame immediately after it is a first video frame: if it is not, the video frame at T1+t1 is determined to be the last video frame of the segment to be deleted; if it is, the server continues by detecting whether the video frame at T1+2*t1 is a first video frame, and so on until the last video frame of the segment to be deleted is detected. Here, the earliest video frame of the segment to be deleted may be detected with the same method as the last video frame of the segment to be deleted.
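The strided search for the last frame of a segment can be sketched as below. Several points are assumptions rather than statements from the text: time is expressed as frame indices, the segment is taken to be contiguous, and after a step back lands on a first video frame the sketch scans forward one frame at a time to the boundary (the text does not specify what follows the step back). The names `find_last_frame`, `stride` (t1), and `backoff` (t2) are hypothetical.

```python
def find_last_frame(is_first_frame, start, total_frames, stride, backoff):
    """Locate the last frame of the segment that contains frame `start`.

    is_first_frame: callable(index) -> bool, the (expensive) image check.
    start: index of a frame already known to be a first video frame.
    stride: forward jump t1; backoff: backward step t2, with t2 < t1.
    """
    if not 0 < backoff < stride:
        raise ValueError("expected 0 < backoff < stride")
    current = start  # invariant: `current` is a first video frame
    while current < total_frames - 1:
        probe = min(current + stride, total_frames - 1)
        if is_first_frame(probe):
            # Check the immediately following frame, as in the text.
            if probe == total_frames - 1 or not is_first_frame(probe + 1):
                return probe
            current = probe
            continue
        # Overshot the segment: step back by `backoff` until inside it again.
        while probe > current and not is_first_frame(probe):
            probe = max(probe - backoff, current)
        # Assumed refinement: walk forward frame by frame to the boundary.
        while probe + 1 < total_frames and is_first_frame(probe + 1):
            probe += 1
        return probe
    return current


# Frames 10..57 show the product in a 100-frame video.
print(find_last_frame(lambda i: 10 <= i <= 57, start=10,
                      total_frames=100, stride=20, backoff=5))
```

The benefit is that only a handful of frames are passed to the image check instead of every frame in the segment; the cost is that a short gap inside the segment that falls between two probes would go unnoticed.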

In some embodiments, the implementation of S3023, matching the reference image with the video frames in the video frame sequence to obtain the matching result, includes: for each video frame in the video frame sequence, determining the reference area of the content object included in the video frame, and cutting the reference area out of the video frame to obtain an image to be matched; determining the content similarity between the image to be matched and the reference image; determining the video frame to which an image to be matched whose similarity is greater than the set similarity threshold belongs as a first video frame, the content of which includes the object corresponding to the label to be deleted; and determining the video frame to which an image to be matched whose similarity is less than or equal to the similarity threshold belongs as a second video frame, the content of which does not include the object corresponding to the label to be deleted.

A target detection model is set in the server, and the server performs the following processing on each video frame for which it needs to judge whether the frame is a first video frame:

The video frame is used as the input of the target detection model to obtain the position of the content object included in the video frame, as output by the target detection model. Based on that position, the area where the content object is located, that is, the reference area, is cut out from the video frame to obtain the image to be matched of the video frame. The server then calculates the image similarity between the image to be matched and the reference image, which serves as the image similarity between the video frame and the reference image.
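Cutting out the reference area can be illustrated with a small sketch. It assumes, purely for illustration, that a frame is a list of pixel rows and that the detection model reports the position as a box (x0, y0, x1, y1) with exclusive right and bottom edges; crop_region is a hypothetical helper, not a function of any detection library.

```python
from typing import List, Sequence, Tuple, TypeVar

Pixel = TypeVar("Pixel")


def crop_region(
    frame: Sequence[Sequence[Pixel]],
    box: Tuple[int, int, int, int],
) -> List[List[Pixel]]:
    """Cut the box (x0, y0, x1, y1) out of the frame, clamped to the frame borders."""
    x0, y0, x1, y1 = box
    height = len(frame)
    width = len(frame[0]) if height else 0
    # Clamp so that a box partly outside the frame still yields a valid crop.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [list(row[x0:x1]) for row in frame[y0:y1]]
```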

In the embodiment of the present application, the algorithm adopted by the target detection model in the server may include target detection algorithms such as the faster region-based convolutional neural network (Faster R-CNN) and the single-shot multi-box detector (Single Shot MultiBox Detector, SSD). The embodiment of the present application does not limit the target detection algorithm adopted by the target detection model in any way.

In the embodiment of the present application, the server can also set an image segmentation model, and determine the reference area from the video frame based on the image segmentation model. The image segmentation algorithm adopted by the image segmentation model may include image segmentation algorithms such as region growing, mean value iterative segmentation, and maximum entropy segmentation. In the embodiment of the present application, the image segmentation algorithm adopted by the image segmentation model is not limited in any way.

In the embodiment of the present application, when calculating the image similarity between the image to be matched and the reference image, the image to be matched and the reference image are each used as the target image for the following processing: the target image is reduced to a set size, grayscale processing is performed on the reduced target image, and the hash value of the grayscale target image is calculated. The similarity between the hash value of the image to be matched and the hash value of the reference image is then calculated to obtain the image similarity between the image to be matched and the reference image. In an example, the set size is 9*8.
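The reduction and grayscale steps can be sketched in plain Python as below. Several choices here are assumptions made only for illustration: the image is a list of rows of (R, G, B) tuples, the reduction uses nearest-neighbour sampling, and the grayscale conversion uses the ITU-R BT.601 luma weights. The function names shrink and to_gray are hypothetical.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]


def shrink(image: Sequence[Sequence[RGB]], width: int = 9, height: int = 8) -> List[List[RGB]]:
    """Reduce the image to width*height pixels by nearest-neighbour sampling."""
    src_h = len(image)
    src_w = len(image[0])
    return [
        [image[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def to_gray(image: Sequence[Sequence[RGB]]) -> List[List[int]]:
    """Convert RGB pixels to gray levels using the BT.601 luma weights."""
    return [
        [round(0.299 * red + 0.587 * green + 0.114 * blue) for red, green, blue in row]
        for row in image
    ]
```

Applying to_gray(shrink(image)) follows the order given in the text: reduce first, then convert to grayscale.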

In some embodiments, the implementation of S3024 determining the clipping start point and the clipping end point corresponding to the clipping start point according to the matching result includes:

A video frame that itself is a first video frame but whose adjacent previous frame is a second video frame is determined as the clipping start frame; the content of the first video frame includes the object corresponding to the label to be deleted, and the content of the second video frame does not include the object corresponding to the label to be deleted;

A video frame that itself is a first video frame but whose adjacent next frame is a second video frame is determined as the clipping end frame; no second video frame lies between the clipping start frame and the clipping end frame.
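The two rules above amount to finding maximal runs of first video frames. A minimal sketch, assuming the matching result is available as one boolean per frame (True for a first video frame); the name find_clip_ranges is hypothetical, and treating a run that touches the very first or very last frame of the video as starting or ending there is an assumption for the boundary case where no adjacent frame exists.

```python
from typing import List, Sequence, Tuple


def find_clip_ranges(is_first: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) index pairs, both inclusive, for each segment to be deleted."""
    ranges: List[Tuple[int, int]] = []
    start = None
    for index, flag in enumerate(is_first):
        if flag and start is None:
            # Previous frame is a second video frame (or absent): clipping start frame.
            start = index
        elif not flag and start is not None:
            # This frame is a second video frame, so the previous one is the clipping end frame.
            ranges.append((start, index - 1))
            start = None
    if start is not None:
        # The run extends to the end of the video.
        ranges.append((start, len(is_first) - 1))
    return ranges
```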

In some embodiments, the implementation of S303, cutting the segment to be deleted from the multimedia data to be processed to which the segment to be deleted belongs to obtain the target multimedia data, includes:

Splicing the video frame preceding the clipping start frame with the video frame following the clipping end frame, so that the segment before the segment to be deleted and the segment after the segment to be deleted are merged to obtain the target multimedia data.

After the server determines the segment to be deleted, the second video frames in the video frame sequence are spliced together based on continuity, that is, consecutive second video frames are spliced into a segment to be retained. The last video frame of the segment to be retained before the segment to be deleted is the video frame preceding the clipping start frame of the segment to be deleted, and the first video frame of the segment to be retained after the segment to be deleted is the video frame following the clipping end frame of the segment to be deleted. The video frame preceding the clipping start frame of the segment to be deleted is spliced with the video frame following the clipping end frame, so that the segment to be retained before the segment to be deleted is merged with the segment to be retained after it. Here, for all segments to be deleted in the multimedia data to be processed, the video frame preceding the clipping start frame and the video frame following the clipping end frame are spliced to obtain the target video.

In the embodiment of the present application, the segment to be deleted may be located at the start or the end of the multimedia data to be processed, or in the middle of the multimedia data to be processed. Here, splicing of the video frame preceding the clipping start frame with the video frame following the clipping end frame may be performed only for segments to be deleted located in the middle of the multimedia data to be processed.
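The splicing of retained frames can be illustrated as follows. The sketch assumes the decoded frames are held in a list and that each segment to be deleted is given as an inclusive (start_frame, end_frame) index pair; the name splice is hypothetical. Segments at the start or the end of the video need no special case, because the slice before or after them is simply empty.

```python
from typing import Iterable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def splice(frames: Sequence[Frame], clip_ranges: Iterable[Tuple[int, int]]) -> List[Frame]:
    """Drop every (start, end) range, inclusive, and join what remains in order."""
    kept: List[Frame] = []
    cursor = 0
    for start, end in sorted(clip_ranges):
        # Frames from the cursor up to the frame preceding the clipping start frame are retained.
        kept.extend(frames[cursor:start])
        # Resume at the frame following the clipping end frame.
        cursor = max(cursor, end + 1)
    kept.extend(frames[cursor:])
    return kept
```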

In some embodiments, after cutting the segment to be deleted from the multimedia data to be processed to which the segment to be deleted belongs in S303 to obtain the target multimedia data, the following steps are further implemented:

Acquiring a storage path of the multimedia data to be processed corresponding to the target multimedia data;

storing the target multimedia data in the storage path to replace unprocessed multimedia data corresponding to the target multimedia data.

The server replaces the stored multimedia data to be processed with the target multimedia data obtained after processing the multimedia data to be processed.

In the embodiment of the present application, when the multimedia data to be processed is stored in a storage terminal other than the server, the data volume of the target multimedia data can be checked. When the data volume of the target multimedia data is greater than the set data volume, the target multimedia data can be divided into multiple data blocks, and the multiple data blocks are uploaded to the storage terminal. The storage terminal then splices the multiple data blocks to obtain the target multimedia data, and replaces the original multimedia data to be processed with the target multimedia data.
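A minimal sketch of the block-wise upload logic, with the network transfer itself left out. The function names, the max_size parameter standing in for the set data volume, and the fixed block_size are all hypothetical and chosen only for illustration.

```python
from typing import List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Divide the data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[offset:offset + block_size] for offset in range(0, len(data), block_size)]


def plan_upload(data: bytes, max_size: int, block_size: int) -> List[bytes]:
    """Upload as a single block unless the data volume exceeds the set data volume."""
    return [data] if len(data) <= max_size else split_blocks(data, block_size)


def reassemble(blocks: List[bytes]) -> bytes:
    """What the storage terminal does: splice the blocks back in upload order."""
    return b"".join(blocks)
```

Because the blocks are produced and spliced in the same order, reassemble(plan_upload(data, max_size, block_size)) returns the original data.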

In the following, taking video as an example of multimedia data, the multimedia data processing method provided in the embodiment of the present application will be further described.

As shown in FIG. 5 , the multimedia data processing system provided by the embodiment of the present application includes: a merchant terminal 501 , a server terminal 502 and an operator terminal 503 .

In the embodiment of the present application, as shown in FIG. 5 , the merchant terminal 501 is used to establish live file information, generate a video data stream of a live video file, and send the live file information and video data stream to the server 502 . Wherein, as shown in Figure 6, the merchant terminal 501 performs the following processing:

S6011. The merchant terminal receives the live file information filled in by the merchant, and sends the live file information to the server.

Here, the live archive information is stored in the live archive database of the server 502 .

In the embodiment of this application, as shown in Figure 7, the merchant terminal 501 provides a live broadcast management background 701, through which the merchant can fill in, and thereby enter, the live file information 702. The live file information 702 may include detailed live broadcast information such as a title 7021, a live broadcast time 7022, a home page picture 7023, and a shopping cart product list 7024, where the shopping cart product list 7024 includes information about the products to be sold in the live broadcast. The shopping cart product list 7024 can be used as a basis for retrieving videos to be processed.

S6012. The merchant end broadcasts live.

Here, the video stream of the live broadcast process is sent to the server 502, and the server generates a live video file based on the received live video stream, and stores it in the live video library of the server 502.

When the merchant is live broadcasting, it can execute the process shown in Figure 8:

S801. The merchant terminal collects image data;

The merchant end can collect image data through the image acquisition device.

S802. The merchant terminal performs image processing on the collected image data;

Wherein, the image processing may include: beautification, filter and other processing.

S803. The merchant end compresses the image data that has undergone image processing.

The merchant end encodes and compresses the image data after image processing.

S804. The merchant end transmits the compressed image data to the server end in the form of a video stream.

Among them, the merchant side uploads the compressed image data to the server side through the Real Time Messaging Protocol (RTMP).

Wherein, an association relationship is established between the live archive information and the video storage information of the live video information in the live archive database (not shown) of the server 502 . Wherein, the video storage information includes: a video name, a video storage address, and the like.

The operation terminal 503 performs the following processing: the operation terminal receives the SKU of the commodity that needs to be blocked.

When the video data of a product needs to be blocked, the user enters the SKU of the product in the operation terminal 503 and clicks the one-key shielding function. The operation terminal 503 then receives the SKU of the product input by the user and enables the automatic video shielding function. Enabling the automatic video shielding function triggers the sending of a shielding instruction to the server 502, and the shielding instruction includes the received SKU of the product to be shielded. In this way, as shown in FIG. 5, the operation terminal 503 sends the SKU of the product to be shielded to the server 502.

After the server 502 receives the masking instruction, as shown in FIG. 9 , it performs the following processing:

S5021. The server retrieves the video to be processed based on the SKU of the commodity that needs to be blocked.

The server 502 searches the live broadcast archives based on the SKU of the commodity that needs to be masked, and retrieves the videos to be processed that need to be masked.

Here, the video list to be processed may be generated based on video information of the videos to be processed.

The server 502 uses the SKU to be blocked, as input at the operation terminal 503, to perform an exact match against the shopping cart product list of each live broadcast archive. Videos whose list does not contain the SKU do not need to be processed, and videos whose list contains the SKU need to be processed. The server extracts the to-be-processed videos containing the SKU into the to-be-processed video library, and uses the videos in that library as the input of video clipping, in which their similarity with the main image of the SKU that needs to be masked, that is, the reference image, is identified and detected.
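The exact-match retrieval can be sketched as below. The LiveArchive record and its field names are hypothetical stand-ins for the live archive information and its associated video storage information; only the matching rule (the SKU must appear verbatim in the shopping cart product list) comes from the text.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class LiveArchive:
    """Hypothetical record: one live broadcast archive and its stored video."""
    title: str
    video_path: str
    cart_skus: List[str] = field(default_factory=list)


def select_videos_to_process(archives: Iterable[LiveArchive], sku: str) -> List[str]:
    """Return the storage paths of videos whose shopping cart list contains the SKU exactly."""
    # `in` on a list compares whole strings, so "1001" does not match "10010".
    return [archive.video_path for archive in archives if sku in archive.cart_skus]
```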

S5022. The server determines, based on the SKUs of the commodities that need to be shielded, the main image of the SKUs of the commodities that need to be shielded.

When the operation terminal 503 sends the SKU to be masked to the server 502, the server 502 queries the main website for the main picture of the product according to the SKU, and uses it as the reference picture for image similarity recognition on the video.

S5023. The server performs image similarity recognition on each frame of the video to be processed based on the main image of the SKU of the product that needs to be blocked, and identifies the video segments in which the SKU needs to be blocked.

S5024. The server clips the identified video segment that needs to be masked out of the video to be processed; this video segment is the segment to be deleted.

The server sequentially edits the data frames of the identified video segments that need to be masked in each video to be processed in the video list to be processed. For each video to be processed, all the data frames in the video segment to be masked are clipped, and after the clipping is completed, the video is merged to form a complete video, that is, the target video.

S5025. The server saves the target video.

Upload the processed video back to the original video link in the live video library.

In S5023, the server uses an image similarity recognition algorithm to identify the similarity between the SKU main image of the product to be masked and each frame of the video to be processed.

The image similarity recognition algorithm in the embodiment of the present application can be completed by the target detection model faster-rcnn and the dHash algorithm.

Here, all product pictures in the video are identified by faster-rcnn, and the pictures are cut and saved as input pictures for the similarity recognition algorithm. The identification of product pictures includes the following steps:

S5231A, pre-scale the image of size P*Q to size M*N through faster-rcnn;

S5232A. Extract the feature map of the input image through the feature extraction layer of the faster-rcnn.

The scaled image is input to the feature extraction layer (Conv layers), which includes a convolution (conv) layer, an activation (relu) layer, and a pooling (pooling) layer. The feature extraction layer is used to extract feature maps of the input image.

S5233A. Determine candidate regions in the feature map of the input image through the region candidate network of faster-rcnn.

The region candidate network layer uses softmax to judge whether the anchors in the feature map belong to the foreground or the background, and then uses the bounding box regression algorithm (bounding box regression) to modify the anchor points to obtain accurate candidate regions (proposals).

S5233A. Through the region of interest (ROI) pooling layer of the faster-rcnn, based on the candidate region and the feature map of the input image, the feature of the candidate region is obtained.

Here, the ROI pooling layer uses proposals to extract candidate region features (proposal feature maps) from feature maps.

S5234A. Using the classification layer of faster-rcnn to classify each candidate region based on the feature of the candidate region.

Here, the classification layer may include a fully connected layer and a softmax layer, through which the extracted candidate region features are used to classify each candidate region, and the candidate regions classified as SKUs that need to be masked are identified.

S5235A. Determine the position of the proposal feature maps whose type is SKU through faster-rcnn.

The final precise position of the detection frame of type SKU is obtained through the bounding box regression algorithm again, that is, the image coordinates of each SKU.
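The text above does not state how the bounding box regression output is parameterized. The sketch below therefore assumes the centre/size offset form commonly used with Faster R-CNN, in which a box (x, y, w, h) is predicted relative to an anchor or proposal as offsets (tx, ty, tw, th). The function names are hypothetical, and this is an illustration of the decoding arithmetic only, not of the network itself.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]


def decode_box(anchor: Box, deltas: Box) -> Box:
    """Apply regression offsets (tx, ty, tw, th) to an anchor given as (cx, cy, w, h)."""
    cx_a, cy_a, w_a, h_a = anchor
    tx, ty, tw, th = deltas
    # Centre shifts are scaled by the anchor size; sizes are scaled exponentially.
    return (cx_a + w_a * tx, cy_a + h_a * ty, w_a * math.exp(tw), h_a * math.exp(th))


def to_corners(box: Box) -> Box:
    """Convert (cx, cy, w, h) to image coordinates (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

The corner form is what a cropping step would consume as the image coordinates of each detected SKU.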

The video frame is cropped at the image coordinates obtained by faster-rcnn to obtain the images to be matched, and each image to be matched is compared with the SKU main image by dHash similarity. For one image to be matched, the similarity comparison includes the following steps:

S5231B, reducing the picture to 72 pixels;

Here, the image is reduced to 9*8, that is, 72 pixels.

S5232B. Perform grayscale processing on the reduced image;

S5233B. Comparing the left and right pixels of each row, and calculating the hash value of the picture;

Here, for each row, the differences between adjacent pixels are calculated to obtain 8 difference values; over the 8 rows, 64 difference values are obtained, which form the hash value.

S5234B. Calculate the Hamming distance between the image to be matched and the main image of the SKU.

The Hamming distance between the image to be matched and the main image of the SKU is calculated from the hash value of the image to be matched and the hash value of the main image of the SKU. Here, the calculated Hamming distance is used as the measure of similarity between the image to be matched and the main image of the SKU: the smaller the Hamming distance, the more similar the two pictures are, and the larger the Hamming distance, the less similar they are.
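The hash and distance steps can be sketched as follows, taking as input an image that has already been reduced to 9*8 and converted to grayscale (8 rows of 9 gray levels). One assumption is labelled here: each of the 64 differences is reduced to a single bit that is 1 when the left pixel is brighter than its right neighbour, as in the common dHash formulation. The function names are hypothetical.

```python
from typing import List, Sequence


def dhash_bits(gray: Sequence[Sequence[int]]) -> List[int]:
    """Compare each pixel with its right neighbour: 9 pixels per row give 8 bits, 8 rows give 64."""
    return [
        1 if row[col] > row[col + 1] else 0
        for row in gray
        for col in range(len(row) - 1)
    ]


def hamming_distance(bits_a: Sequence[int], bits_b: Sequence[int]) -> int:
    """Number of positions at which the two hash values differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hash values must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))
```

A distance of 0 means the two hash values are identical, and 64 means every bit differs.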

At this time, the above image similarity recognition algorithm is applied to each frame in the video, and any frame containing a picture whose similarity exceeds 90% is regarded as a data frame that needs to be masked.
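The text does not say how a Hamming distance is turned into a percentage. One plausible mapping, assumed here purely for illustration, is the fraction of the 64 hash bits that agree; under that assumption a similarity above 90% corresponds to a Hamming distance of at most 6. The names similarity_from_distance and needs_masking are hypothetical.

```python
HASH_BITS = 64  # 8 rows * 8 differences per row


def similarity_from_distance(distance: int, n_bits: int = HASH_BITS) -> float:
    """Assumed mapping: share of hash bits on which the two images agree."""
    if not 0 <= distance <= n_bits:
        raise ValueError("distance must lie between 0 and n_bits")
    return 1.0 - distance / n_bits


def needs_masking(distance: int, threshold: float = 0.9) -> bool:
    """A frame is masked when the similarity is strictly above the threshold."""
    return similarity_from_distance(distance) > threshold
```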

In S5024, the data frames are recorded according to time intervals, for example [t1, t2], [t3, t4] and so on. Time t1 represents the first frame in which the product to be blocked appears, and t2 represents the last frame in which the product appears in this continuous time period. Similarly, [t3,t4] is the start and end time when the commodity appears in the next time interval.

In an example, as shown in FIG. 10, the identified segments 1001 to be deleted whose content is the input SKU include segment 1, segment 2, segment 3, and segment 4, where t1 and t2 are respectively the times at which the first frame and the last frame of segment 1 appear in the video to be processed, t3 and t4 are respectively the times at which the first frame and the last frame of segment 2 appear in the video to be processed, t5 and t6 are respectively the times at which the first frame and the last frame of segment 3 appear in the video to be processed, and t7 and t8 are respectively the times at which the first frame and the last frame of segment 4 appear in the video to be processed.

In S5024, cut and merge the video to be processed.

In the embodiment of the present application, the video to be processed is stored on the disk after data compression, and is stored in the form of a binary file.

In one example, as shown in Fig. 11, the duration of the video to be processed is t4, and the segments to be deleted that contain the product to be blocked include segment 1 with a time range of [t1, t2] and segment 2 with a time range of [t3, t4]. The binary encoding of segment 1 is 10111, and the encoding of segment 2 is 11011. Here, based on the video cutting instruction, two video files V1 and V2 are obtained from the video frames in the time ranges [t0, t1) and (t2, t3) respectively, where the duration of video file V1 is tv1 and its encoding is 110111, and the duration of video file V2 is tv2 and its encoding is 011. Video file V1 and video file V2 are merged to obtain a new video file V3, whose encoding is 110111011.
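The example above can be mirrored in a short sketch: the time ranges to keep are the complement of the ranges to delete, and the merged file is the concatenation of the kept pieces. The function names are hypothetical, times are plain numbers, the distinction between open and closed interval ends is ignored, and the bit strings stand in for the encoded video data exactly as in the example.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[float, float]


def kept_intervals(duration: float, deleted: Iterable[Interval]) -> List[Interval]:
    """Return the time ranges that remain after removing the deleted ranges from [0, duration]."""
    kept: List[Interval] = []
    cursor = 0.0
    for start, end in sorted(deleted):
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        kept.append((cursor, duration))
    return kept


def merge_files(encodings: Iterable[str]) -> str:
    """Concatenate the encodings of the cut files in order, as V1 and V2 form V3."""
    return "".join(encodings)
```

With a duration of 40 and deleted ranges (10, 20) and (30, 40), the kept ranges are (0, 10) and (20, 30), matching the roles of V1 and V2 in the example.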

In S5025, the compressed data can be cut into data packets, which are uploaded sequentially in segments. After the uploaded data packets are spliced, the complete video file is stored at the storage path of the original video resource.

The multimedia data processing method provided by the embodiment of the present application has the following characteristics:

1. By performing image recognition on each frame of the live broadcast data, the product list of the live broadcast is identified, and the existing data information (the SKU information filled in by the merchant) is used to automatically retrieve the videos that need to be blocked from the video library. The videos that need to be processed are cut, merged, and uploaded automatically, with no manual search required, and this automation ensures the safety of live content. It is an automated video shielding solution that solves the problem that a huge volume of historical live video resources cannot be blocked automatically, by product, with one click.

2. With the image recognition algorithm, the entire video resource is not blindly taken offline. Instead, the video clips that need to be taken offline are located accurately, all video clips that need to be blocked are taken offline, and the original video data is replaced after the video is cut and merged.

FIG. 12 is a schematic structural diagram of a multimedia data processing device according to an embodiment of the present application. As shown in FIG. 12 , the multimedia data processing device 1200 includes:

The determining unit 1201 is configured to determine the multimedia data to be processed based on the label to be deleted; the label of the multimedia data to be processed includes the label to be deleted;

The identification unit 1202 is configured to identify the content of the multimedia data to be processed, and determine the segment to be deleted in the multimedia data to be processed, the content of the segment to be deleted includes the object corresponding to the label to be deleted;

The clipping unit 1203 is configured to clip the segment to be deleted from the multimedia data to be processed to obtain target multimedia data.

In some embodiments, the determining unit 1201 is further configured to:

Obtaining tags of each of the at least two original multimedia data;

comparing the label of each original multimedia data with the label to be deleted;

Determining the original multimedia data whose tag includes the tag to be deleted as the multimedia data to be processed.

In some embodiments, the device 1200 also includes: a tag acquisition unit configured to:

In some embodiments, the identification unit 1202 is further configured to:

When the file type of the multimedia data to be processed includes video, extracting the video frame sequence of the multimedia data to be processed;

Acquiring a reference image corresponding to the label to be deleted;

matching the reference image with the video frames in the sequence of video frames to obtain a matching result;

According to the matching result, determine the trimming start frame and the trimming end frame corresponding to the trimming start frame; the video frames between the trimming start frame and the trimming end frame constitute the segment to be deleted.

In some embodiments, the identifying unit 1202 is further configured to:

For each video frame in the video frame sequence, determine a reference area of a content object included in the video frame, and cut out the reference area from the video frame to obtain an image to be matched;

determining the content similarity between the image to be matched and the reference image;

Determining the video frame to which an image to be matched whose content similarity is greater than the set similarity threshold belongs as a first video frame; the content of the first video frame includes the object corresponding to the label to be deleted;

Determining the video frame to which an image to be matched whose content similarity is less than or equal to the similarity threshold belongs as a second video frame; the content of the second video frame does not include the object corresponding to the label to be deleted.

In some embodiments, the identifying unit 1202 is further configured to:

In some embodiments, the cropping unit 1203 is further configured to:

In some embodiments, the device 1200 further includes: a replacement unit configured to:

It should be noted that each logic unit included in the multimedia data processing device provided in the embodiment of the present application can be realized by a processor in an electronic device; of course, it can also be realized by a specific logic circuit. In the process of implementation, the processor can be a central processing unit (CPU, Central Processing Unit), a microprocessor (MPU, Micro Processor Unit), a digital signal processor (DSP, Digital Signal Processor), a field programmable gate array (FPGA, Field-Programmable Gate Array), etc.

The description of the above system embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. For technical details not disclosed in the system embodiments of the present application, please refer to the description of the method embodiments of the present application for understanding.

It should be noted that, in the embodiment of the present application, if the above multimedia data processing method is implemented in the form of software function modules and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present application, or the part that contributes to the related technologies, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read Only Memory, ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.

An embodiment of the present application also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and operable on the processor, where the processor implements the steps of the above-mentioned multimedia data processing method when running the computer program.

Correspondingly, the embodiments of the present application provide a storage medium, that is, a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the multimedia data processing method provided in the foregoing embodiments is implemented.

It should be pointed out here that: the description of the above storage medium embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. For technical details not disclosed in the storage medium embodiments of the present application, please refer to the description of the method embodiments of the present application for understanding.

It should be noted that FIG. 13 is a schematic diagram of a hardware entity of an electronic device according to an embodiment of the present application. As shown in FIG. 13, the electronic device 1300 includes: a processor 1301, at least one communication bus 1302, at least one external communication interface 1304, and a memory 1305. The communication bus 1302 is configured to realize connection and communication between these components. In an example, the electronic device 1300 further includes a user interface 1303, where the user interface 1303 may include a display screen, and the external communication interface 1304 may include a standard wired interface and a wireless interface.

The memory 1305 is configured to store instructions and applications executable by the processor 1301, and can also cache data to be processed or already processed by the processor 1301 and the various modules in the electronic device (for example, image data, audio data, voice communication data, and video communication data). It can be realized by flash memory (FLASH) or random access memory (Random Access Memory, RAM).

It should be understood that reference throughout the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Thus, appearances of "in one embodiment" or "in some embodiments" throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application. The serial numbers of the above embodiments of the present application are for description only, and do not represent the advantages and disadvantages of the embodiments.

It should be noted that, in this document, the terms "comprising", "including", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus comprising a set of elements includes not only those elements but also other elements not expressly listed, or elements inherent in the process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.

In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and in actual implementation there may be other division methods: multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.

The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in each embodiment of the present application can all be integrated into one processing unit, or each unit can be used as a single unit, or two or more units can be integrated into one unit; the above-mentioned integrated unit can be realized in the form of hardware, or in the form of hardware plus software functional units.

Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions, and the aforementioned program can be stored in a computer-readable storage medium. When the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes removable storage devices, read-only memory (Read Only Memory, ROM), magnetic disks, optical disks, and other media that can store program code.

Alternatively, if the above-mentioned integrated units of the present application are realized in the form of software function modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present application, or the part that contributes to the related technologies, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as removable storage devices, ROMs, magnetic disks, or optical disks.

The above describes only embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any changes or substitutions that a person familiar with the technical field can readily conceive of within the technical scope disclosed in the present application should be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application should be determined by the scope of protection of the claims.

Claims

1. A multimedia data processing method, the method comprising:

determining multimedia data to be processed based on a label to be deleted, wherein a label of the multimedia data to be processed includes the label to be deleted;

identifying the content of the multimedia data to be processed, and determining a segment to be deleted in the multimedia data to be processed, wherein the content of the segment to be deleted includes an object corresponding to the label to be deleted; and

cutting the segment to be deleted out of the multimedia data to be processed to obtain target multimedia data.
2. The method according to claim 1, wherein said determining the multimedia data to be processed based on the label to be deleted comprises:

obtaining a label of each of at least two pieces of original multimedia data;

comparing the label of each piece of original multimedia data with the label to be deleted; and

determining the original multimedia data whose label includes the label to be deleted as the multimedia data to be processed.
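By way of non-limiting illustration, the label comparison recited above could be sketched as follows. The function name `select_media_to_process`, the dictionary layout, and the sample identifiers are assumptions introduced for this example only; the application does not prescribe any particular data structure.

```python
from typing import Dict, Iterable, List


def select_media_to_process(
    media_labels: Dict[str, Iterable[str]], label_to_delete: str
) -> List[str]:
    """Return the IDs of original media whose labels include label_to_delete.

    media_labels maps a (hypothetical) media ID to that media's labels.
    """
    return [
        media_id
        for media_id, labels in media_labels.items()
        if label_to_delete in set(labels)
    ]


# Hypothetical library of three original recordings and their labels.
library = {
    "live_001": ["phone", "earbuds"],
    "live_002": ["kettle"],
    "live_003": ["earbuds", "charger"],
}
selected = select_media_to_process(library, "earbuds")
```

Here `selected` holds only the recordings that carry the label to be deleted, so later content recognition is limited to those recordings.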
3. The method according to claim 2, wherein the method further comprises:

obtaining at least one commodity link under the account to which the original multimedia data belongs; and

determining the commodity information targeted by each commodity link in the at least one commodity link as a label of the original video data.
4. The method according to any one of claims 1 to 3, wherein the file type of the multimedia data to be processed comprises video, and said identifying the content of the multimedia data to be processed and determining the segment to be deleted in the multimedia data to be processed comprises:

extracting a video frame sequence of the multimedia data to be processed;

acquiring a reference image corresponding to the label to be deleted;

matching the reference image with the video frames in the video frame sequence to obtain a matching result; and

determining, according to the matching result, a clipping start frame and a clipping end frame corresponding to the clipping start frame, wherein the video frames between the clipping start frame and the clipping end frame constitute the segment to be deleted.
5. The method according to claim 4, wherein said matching the reference image with the video frames in the video frame sequence to obtain a matching result comprises:

for each video frame in the video frame sequence, determining a reference area of a content object included in the video frame, and cutting the reference area out of the video frame to obtain an image to be matched;

determining the content similarity between the image to be matched and the reference image;

determining, as a first video frame, the video frame to which an image to be matched whose content similarity is greater than a set similarity threshold belongs, wherein the content of the first video frame includes the object corresponding to the label to be deleted; and

determining, as a second video frame, the video frame to which an image to be matched whose content similarity is less than or equal to the similarity threshold belongs, wherein the content of the second video frame does not include the object corresponding to the label to be deleted.
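The threshold comparison recited above can be illustrated with a minimal sketch. It assumes the content similarity of each frame's image to be matched has already been computed by some measure, which the claim leaves open; `classify_frames` and the sample scores are hypothetical.

```python
from typing import List


def classify_frames(similarities: List[float], threshold: float) -> List[bool]:
    """Classify frames by content similarity.

    True  -> "first video frame" (similarity strictly greater than threshold).
    False -> "second video frame" (similarity less than or equal to threshold).
    """
    return [score > threshold for score in similarities]


# Hypothetical per-frame similarity scores against the reference image.
flags = classify_frames([0.10, 0.92, 0.80, 0.95, 0.30], threshold=0.80)
```

Note that a score exactly equal to the threshold (the third frame here) falls on the "second video frame" side, matching the "less than or equal to" wording.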
6. The method according to claim 4 or 5, wherein said determining, according to the matching result, the clipping start frame and the clipping end frame corresponding to the clipping start frame comprises:

determining, as the clipping start frame, a video frame that itself is a first video frame and whose adjacent previous frame is a second video frame, wherein the content of the first video frame includes the object corresponding to the label to be deleted, and the content of the second video frame does not include the object corresponding to the label to be deleted; and

determining, as the clipping end frame, a video frame that itself is a first video frame and whose adjacent next frame is a second video frame, wherein no second video frame lies between the clipping start frame and the clipping end frame.
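One possible reading of the start-frame and end-frame rule above is sketched below. The input is a per-frame flag list (True for a first video frame, False for a second video frame); `find_segments` is a hypothetical name. The claim does not say how the first and last frames of the video are handled, so this sketch assumes, purely for illustration, that a missing neighbour is treated as a second video frame.

```python
from typing import List, Optional, Tuple


def find_segments(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return (start_index, end_index) pairs, inclusive, of segments to delete.

    A start frame is a first video frame whose previous frame is a second
    video frame; an end frame is a first video frame whose next frame is a
    second video frame. Missing neighbours at the video boundaries are
    treated as second video frames (an assumption of this sketch).
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, is_first in enumerate(flags):
        prev_is_first = flags[i - 1] if i > 0 else False
        next_is_first = flags[i + 1] if i + 1 < len(flags) else False
        if is_first and not prev_is_first:
            start = i  # clipping start frame
        if is_first and not next_is_first:
            segments.append((start, i))  # clipping end frame closes the run
    return segments


segments = find_segments([False, True, True, False, True])
```

Each returned pair encloses an unbroken run of first video frames, so no second video frame lies between a start frame and its end frame.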
7. The method according to claim 4, wherein said cutting the segment to be deleted out of the multimedia data to be processed to obtain the target multimedia data comprises:

splicing the video frame preceding the clipping start frame to the video frame following the clipping end frame, so that the segment before the segment to be deleted and the segment after the segment to be deleted are merged to obtain the target multimedia data.
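The splicing step above can be sketched on a plain list standing in for the decoded frame sequence. This is an illustration only: `cut_segments` is a hypothetical helper, and a real implementation would operate on encoded video, where audio tracks and keyframe alignment also have to be handled.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def cut_segments(
    frames: Sequence[Frame], segments: Sequence[Tuple[int, int]]
) -> List[Frame]:
    """Drop every inclusive (start, end) segment and join what remains."""
    kept: List[Frame] = []
    cursor = 0  # index of the next frame that has not been examined yet
    for start, end in sorted(segments):
        kept.extend(frames[cursor:start])  # frames before this segment
        cursor = max(cursor, end + 1)      # resume after the clipping end frame
    kept.extend(frames[cursor:])           # frames after the last segment
    return kept


# Letters stand in for frames; frames 1-2 and frame 4 are to be deleted.
target = cut_segments(list("abcdefg"), [(1, 2), (4, 4)])
```

The frame just before each clipping start frame ends up directly adjacent to the frame just after the corresponding clipping end frame.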
8. The method according to any one of claims 1 to 7, wherein the method further comprises:

acquiring the storage path of the multimedia data to be processed corresponding to the target multimedia data; and

storing the target multimedia data at the storage path to replace the unprocessed multimedia data corresponding to the target multimedia data.
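A minimal sketch of the replacement step above, assuming the storage path is a path on a local file system. Writing to a temporary file and then calling `os.replace` is a design choice of this example, not something the application requires; `replace_in_place` is a hypothetical name.

```python
import os
import tempfile


def replace_in_place(storage_path: str, target_bytes: bytes) -> None:
    """Store the target data at storage_path, replacing the unprocessed file."""
    directory = os.path.dirname(os.path.abspath(storage_path))
    # Write next to the original so the final rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(target_bytes)
        os.replace(tmp_path, storage_path)  # overwrites the existing file
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the path is unchanged, anything that already refers to the recording by that path would pick up the target multimedia data without being updated.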
9. A multimedia data processing device, the device comprising:

a determining unit configured to determine multimedia data to be processed based on a label to be deleted, wherein a label of the multimedia data to be processed includes the label to be deleted;

an identification unit configured to identify the content of the multimedia data to be processed and determine a segment to be deleted in the multimedia data to be processed, wherein the content of the segment to be deleted includes an object corresponding to the label to be deleted; and

a clipping unit configured to cut the segment to be deleted out of the multimedia data to be processed to obtain target multimedia data.
10. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when the processor runs the computer program, the steps of the multimedia data processing method according to any one of claims 1 to 8 are implemented.
11. A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is run by a processor, the multimedia data processing method according to any one of claims 1 to 8 is implemented.