CN108307229B - Video and audio data processing method and device


Info

Publication number
CN108307229B
Authority
CN
China
Prior art keywords
audio
content
video
sub
objects
Prior art date
Legal status
Active
Application number
CN201810107188.8A
Other languages
Chinese (zh)
Other versions
CN108307229A (en)
Inventor
徐常亮
李尉冉
傅丕毅
张云远
Current Assignee
Xinhua Zhiyun Technology Co ltd
Original Assignee
Xinhua Zhiyun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xinhua Zhiyun Technology Co ltd filed Critical Xinhua Zhiyun Technology Co ltd
Priority to CN201810107188.8A priority Critical patent/CN108307229B/en
Publication of CN108307229A publication Critical patent/CN108307229A/en
Application granted granted Critical
Publication of CN108307229B publication Critical patent/CN108307229B/en


Classifications

    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/439: Processing of audio elementary streams
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    All of the above fall under H (Electricity); H04 (Electric communication technique); H04N (Pictorial communication, e.g. television); H04N21/00 (Selective content distribution, e.g. interactive television or video on demand [VOD]); H04N21/40 (Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]); and H04N21/43 (Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream).

Abstract

The scheme first divides an audio-video data object into a plurality of sub-objects, then extracts video feature information about the video content in each sub-object and audio feature information about the audio content in each sub-object, and finally determines a content tag for each sub-object according to the video feature information and the audio feature information. The specific content contained in each sub-object of the audio-video data object can be determined through the content tags, and the associations among the content tags can represent the association relationships among the various parts of the content, so that the audio and video content in the audio-video data object can be effectively applied and unified scheduling and use of audio-video data can be realized.

Description

Video and audio data processing method and device
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to a method and an apparatus for processing audio and video data.
Background
With the development of intelligent devices and audio/video technologies, audio-video data objects containing both audio content and video content, such as movies and television shows, are generated and distributed at a greatly increased rate. However, these data objects are generally independent of one another, and there is no unified method or channel for identifying and applying their content. Existing technology mainly relies on video/audio fingerprints and corresponding audio/video libraries to identify video or audio, which makes it difficult to determine the association relationships among the content actually contained in a video/audio data object, so that the video and audio content in the data object cannot be effectively applied.
Summary of the application
An object of the present application is to provide a method and apparatus for processing video and audio data, which are used for solving the problem that it is difficult to determine the association relationship between the content specifically included in the video and audio data object in the prior art.
In order to achieve the above objective, the present application provides a method for processing audio-visual data, which includes:
dividing the video and audio data object into a plurality of sub-objects;
extracting video characteristic information about video content in the sub-object and audio characteristic information about audio content in the sub-object;
and determining the content label of each sub-object according to the video characteristic information and the audio characteristic information.
Based on another aspect of the present application, there is also provided an apparatus for processing audio-visual data, the apparatus including:
the segmentation module is used for segmenting the video and audio data object into a plurality of sub-objects;
the feature extraction module is used for extracting video feature information about video content in the sub-object and audio feature information about audio content in the sub-object;
and the classifying and matching module is used for determining the content label of each sub-object according to the video characteristic information and the audio characteristic information.
In addition, the application also provides a processing device of video and audio data, wherein the device comprises:
a processor; and
one or more machine readable media having machine readable instructions stored thereon, which when executed by the processor, cause the device to perform the aforementioned method of processing audiovisual data.
In the processing scheme of the video and audio data, the video and audio data object is first divided into a plurality of sub-objects; then video feature information about the video content in each sub-object and audio feature information about the audio content in each sub-object are extracted; and then a content tag is determined for each sub-object according to the video feature information and the audio feature information. The specific content contained in each sub-object of the video and audio data object can be determined through the content tags, and the associations among the content tags can represent the association relationships among the various parts of the content, so that the audio and video content in the video and audio data object can be effectively applied and unified scheduling and use of video and audio data are achieved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
fig. 1 shows a process flow chart of a processing method of audio-visual data provided in an embodiment of the present application;
fig. 2 is a schematic overall flow chart of processing an audio-visual data object by using the method provided in the embodiment of the present application;
fig. 3 is a schematic structural diagram of an audio-visual data processing device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of another audio-visual data processing device according to an embodiment of the present application;
the same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of the present application, the terminals and the devices of the service network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device.
The embodiment of the application provides a processing method of audio-video data, which can effectively apply the audio and video content in an audio-video data object, determine the specific content contained in each sub-object of the audio-video data object, and realize unified scheduling and use of audio-video data. The execution subject of the method may be user equipment, a network device, a device formed by integrating user equipment and a network device through a network, or an application program running on such a device. The user equipment includes, but is not limited to, various terminal devices such as computers, mobile phones and tablet computers; the network device includes, but is not limited to, implementations such as a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud is composed of a large number of hosts or web servers based on cloud computing, which is a kind of distributed computing: a virtual computer composed of a group of loosely coupled computers.
Fig. 1 shows a processing method of audio-visual data provided in an embodiment of the present application, where the method includes the following steps:
in step S101, the video and audio data object is divided into a plurality of sub-objects. The video and audio data object in the embodiment of the present application refers to a file or a data stream containing video and audio data, and the specific content thereof may be a movie, a television show, or the like. The sub-object refers to a part of content of the video-audio data object, for example, for a movie with a duration of 120 minutes, the sub-object may be divided into a plurality of segments according to the duration, and each segment is a sub-object.
In some embodiments of the present application, when the audio-visual data object is segmented, the audio-visual data object may be clustered by means of space-time slicing (space-temporal slice), that is, according to the video content in the audio-visual data object, the audio-visual data object is clustered by means of space-time slicing, and based on the clustering result, a plurality of sub-objects are determined. The space-time slicing refers to an image formed by pixel strips at the same position in continuous frames of a video image sequence according to time sequence, and because images with similar contents have certain visual similarity, the video and audio data objects are segmented in a space-time slicing clustering mode, so that the video and audio data in each sub-object can be segmented to belong to similar contents.
For example, a picture in a video includes 3 parts of content, the first part is a picture of two people in an indoor scene, the second part is a picture of a landscape scene in an outdoor scene, and the third part is a picture of an explosion of the outdoor scene. Because the three images have great difference in vision, the video segment can be accurately divided into three parts by a space-time slicing clustering mode, the video frame contained in each part is a clustering result, and the video and the audio corresponding to the video frame are sub-objects.
In an actual scene, because the actual situation of each picture is more complex, errors may occur in the clustering result based on the space-time slicing, for example, the picture of a first part about two people conversations in an indoor scene may be greatly changed due to the movement of people, so that the picture content of a certain part is divided into two clustering results, or the pictures of a second part and a third part may be divided into one clustering result. Therefore, when a plurality of sub-objects are determined based on the clustering results, the clustering results can be dynamically adjusted according to the similarity among the clustering results, and the plurality of sub-objects are determined. For example, by setting a dynamic threshold, the similarity threshold during clustering can be dynamically adjusted, so that the preliminary clustering results are combined or continuously split, and the final clustering result is more accurate.
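As a concrete illustration, the following is a minimal sketch of spatio-temporal-slice segmentation with a dynamically adjusted threshold. The single middle pixel row, the colour-histogram comparison, the threshold-relaxation rule and the segment-length limits are illustrative assumptions, not the clustering procedure defined by the patent.

```python
# Minimal sketch (assumptions noted above): segment a video into sub-objects
# by clustering spatio-temporal slices with a dynamically adjusted threshold.
import cv2
import numpy as np

def slice_features(video_path, row_frac=0.5):
    """Take one pixel strip per frame and describe it with a colour histogram."""
    cap = cv2.VideoCapture(video_path)
    feats = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        strip = frame[int(frame.shape[0] * row_frac)][None, :, :]  # 1 x W x 3 strip
        hist = cv2.calcHist([strip], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        feats.append(cv2.normalize(hist, hist).flatten())
    cap.release()
    return np.array(feats, dtype=np.float32)

def segment(feats, base_thresh=0.6, min_len=25, max_len=3000):
    """Greedy clustering of consecutive strips; the similarity threshold is
    relaxed over time and reset after each cut, so segments are neither
    over-split nor allowed to grow without bound."""
    bounds, start, thresh = [0], 0, base_thresh
    for i in range(1, len(feats)):
        sim = cv2.compareHist(feats[i - 1], feats[i], cv2.HISTCMP_CORREL)
        if (sim < thresh and i - start >= min_len) or i - start > max_len:
            bounds.append(i)            # visual break: close the current sub-object
            start, thresh = i, base_thresh
        else:
            thresh *= 0.999             # slowly relax to avoid over-splitting
    bounds.append(len(feats))
    return list(zip(bounds[:-1], bounds[1:]))  # (start_frame, end_frame) pairs
```

Each returned frame range corresponds to one sub-object; the audio samples covering the same time span belong to the same sub-object.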
Step S102, extracting video characteristic information about video content in the sub-object and audio characteristic information about audio content in the sub-object.
When processing the video portion, processing is performed based on the video content in each sub-object. For example, for a movie that has been divided into a plurality of segments, feature extraction is performed on the video content of each segment to acquire its feature information. In some embodiments of the present application, key frames may be extracted from the video content of the sub-object and then processed to obtain their video feature information, which serves as the video feature information about the video content in the sub-object.
A key frame is a frame in which a key action of image motion or change occurs, and it can reflect what the video image sequence actually expresses. For example, for video content about an explosion, the key frames may be the frame representing the cause of the explosion (such as the moment of impact), the frame when the explosion flame appears, the frame when the flame is largest, and the frame when the flame disappears. Since key frames better reflect the actual meaning of the video content, using the video feature information of the key frames as the video feature information about the video content in the sub-object reduces the amount of processing and improves the processing speed.
The video feature information can be image features such as texture, color, shape or spatial relationship, and in an actual scene, one or more image features suitable for a current scene can be selected as the video feature information according to scene requirements so as to improve processing accuracy. The acquired video characteristic information may be recorded in the form of a multi-dimensional vector set.
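As an illustration of this step, the sketch below selects key frames by frame-difference peaks and builds a colour-plus-texture feature vector for each of them. The difference-based key-frame rule and the particular features (colour histogram, gradient-orientation histogram) are assumptions chosen for the example, not the specific features prescribed by the patent.

```python
# Sketch (illustrative assumptions noted above): key-frame selection and
# video feature extraction for one sub-object.
import cv2
import numpy as np

def key_frames(frames, k=5):
    """Return the k frames whose difference from the previous frame is largest."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    diffs = [0.0] + [float(np.mean(cv2.absdiff(a, b)))
                     for a, b in zip(grays[:-1], grays[1:])]
    idx = sorted(np.argsort(diffs)[-k:])
    return [frames[i] for i in idx]

def video_features(frame):
    """Concatenate a colour histogram with a coarse texture descriptor."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    hist = cv2.normalize(hist, hist).flatten()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)
    texture, _ = np.histogram(ang, bins=16, weights=mag, density=True)
    return np.concatenate([hist, texture])  # one multi-dimensional feature vector
```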
When processing the audio portion, processing may likewise be based on the audio content in each sub-object. For example, for a movie that has been divided into a plurality of segments, feature extraction is performed on the audio content of each segment to acquire its feature information. For a general audio-visual data object, the audio content includes various types, such as human voice, sound effects, environmental sounds and background music. Taking the video content of two people conversing in an indoor scene as an example, the corresponding audio content may include the speech of the two people, their footsteps while walking, the sound of vehicles outside the room, background music and so on, and these types of audio correspond to different waveforms in different wavebands. Thus, in some embodiments of the present application, when extracting audio features, waveform recognition may be performed in different wavebands to extract different types of audio sets from the audio content of the sub-object, where an audio set may be a human voice/sound effect set, an environmental sound set, a background music set, or the like. For each of these audio sets, the audio feature information therein may be extracted separately as the audio feature information about the audio content in the sub-object. The acquired audio feature information may be recorded in the form of a multi-dimensional vector set.
In an actual scenario, when the audio content in a sub-object is processed, the audio content may first be separated from the sub-object. Meanwhile, in order to improve the accuracy of audio feature extraction, noise reduction processing can be performed on the audio content of the sub-object before waveform recognition is performed in the different wavebands.
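One way to approximate this step is sketched below: a crude spectral gate for noise reduction followed by band-pass filtering into coarse wavebands standing in for background music, voice and effects. The band edges and the spectral-gating rule are illustrative assumptions only.

```python
# Sketch (assumptions noted above): noise reduction followed by separation of
# the audio of one sub-object into coarse waveband sets.
import numpy as np
from scipy.signal import butter, sosfilt, stft, istft

def noise_reduce(signal, sr, gate_db=-40):
    """Crude spectral gate: zero out STFT bins far below the peak magnitude."""
    _, _, spec = stft(signal, fs=sr)
    mag = np.abs(spec)
    spec[mag < 10 ** (gate_db / 20) * mag.max()] = 0
    _, cleaned = istft(spec, fs=sr)
    return cleaned

def band_sets(signal, sr):
    """Split the signal into low/mid/high bands as rough proxies for
    background music, voice, and effects/ambience.
    Assumes sr >= 24 kHz so the upper band edge stays below Nyquist."""
    bands = {'background': (20, 250), 'voice': (250, 4000), 'effects': (4000, 12000)}
    out = {}
    for name, (lo, hi) in bands.items():
        sos = butter(4, [lo, hi], btype='bandpass', fs=sr, output='sos')
        out[name] = sosfilt(sos, signal)
    return out  # each entry is one "audio set" to be described by its own features
```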
In step S103, the content tag of each sub-object is determined according to the video feature information and the audio feature information. A content tag is information representing the video and audio content actually contained in a sub-object, and it can describe that content from various dimensions according to the user's requirements, for example, the content itself, the scene, or the corresponding emotion.
In some embodiments of the present application, identification of content tags may be accomplished by deep learning. Before the audio-visual data are processed, a deep learning model may be constructed and trained with audio content and video content that have already been labeled with content tags as the training set, so that the deep learning model can be used to identify the content tags of sub-objects. For example, if the scheme provided in the embodiment of the present application is required to identify whether a segment in a certain movie contains content related to an explosion, various videos and audio related to explosions may be provided as the training set, where the training set includes the video feature information of those videos and the audio feature information of those audio samples, and their content tags have been labeled as 'explosion'. Provided that there are enough training samples, the deep learning model can take video feature information or audio feature information that has not yet been labeled, determine whether its content tag should be 'explosion', and thereby determine the content to which the movie segment corresponds.
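The deep learning model is not specified further in the text; the sketch below shows one plausible form, a small fully connected classifier over the concatenated video and audio feature vectors, trained on clips whose tags are already labeled. The network shape, optimizer and training loop are assumptions.

```python
# Sketch (assumed architecture): classify concatenated video + audio feature
# vectors into content tags such as "explosion" or "dialogue".
import torch
import torch.nn as nn

class TagClassifier(nn.Module):
    def __init__(self, feat_dim, num_tags):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_tags))

    def forward(self, x):
        return self.net(x)

def train(model, loader, epochs=10):
    """loader yields (feature_vector, tag_index) pairs built from labelled clips."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, tags in loader:
            opt.zero_grad()
            loss_fn(model(feats), tags).backward()
            opt.step()
    return model

# Inference on an unlabelled sub-object (video_vec / audio_vec from the sketches above):
# x = torch.tensor(np.concatenate([video_vec, audio_vec]), dtype=torch.float32)
# tag = TAG_NAMES[model(x[None]).argmax().item()]
```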
In another embodiment of the present application, after the content tags of the sub-objects are determined, the sub-objects in the audio-visual data object may be categorized according to their content tags to generate categorized object sets. For example, for a movie, all segments about explosions may be categorized into a set of explosion segments, and all segments about character fights may likewise be categorized into a separate set.
In an actual scene, the categorization of the sub-objects can be based on external input or on preset categorization conditions. For example, keywords input by a user can be obtained, and the matching content tags are selected according to the keywords so as to obtain a suitable content set. Taking a movie as an example, if a trailer of the movie needs to be generated, the movie may first be divided into a plurality of segments using the scheme provided in the embodiment of the present application, and a content tag is then generated for each segment. The user can input keywords according to actual needs to select the segments required for generating the trailer; for example, if the user wants a trailer in a particular style, the segments whose content tags match that style can be selected as the material for the trailer and grouped into a segment set. Similarly, if the user wants a trailer containing relatively more fighting content, the segments corresponding to that content tag may be selected.
For the audio content and the video content, tags can also be set separately, that is, the tags can be divided into video content tags and audio content tags, which correspond to each other and are associated with the sub-objects obtained by segmenting the audio-visual data object. Therefore, when categorizing based on content tags, the video content and/or the audio content of the sub-objects in the audio-visual data object can be categorized by audio, by video, or by both together, according to the video content tags and/or audio content tags of the sub-objects, so as to obtain the video content set and/or audio content set required by the user.
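A bare-bones illustration of this categorization and keyword-driven selection is given below. The sub-object dictionary layout, the plain substring match and the decision to hand the selected time ranges to an external tool such as ffmpeg for actual trailer assembly are all assumptions made for the example.

```python
# Sketch (assumptions noted above): group tagged sub-objects into sets and
# pick trailer material by keyword.
from collections import defaultdict

def categorize(sub_objects):
    """sub_objects: list of dicts like {'start': 12.0, 'end': 19.5, 'tags': ['explosion']}."""
    sets = defaultdict(list)
    for so in sub_objects:
        for tag in so['tags']:
            sets[tag].append(so)          # one categorized object set per content tag
    return sets

def trailer_edit_list(sub_objects, keywords):
    """Return (start, end) ranges whose tags match any keyword, in play order."""
    picked = [so for so in sub_objects
              if any(kw in tag for tag in so['tags'] for kw in keywords)]
    return sorted((so['start'], so['end']) for so in picked)

# e.g. trailer_edit_list(subs, ['explosion', 'fight']) -> [(12.0, 19.5), (88.2, 95.0)]
# The returned ranges can then be cut and concatenated into the trailer file.
```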
Fig. 2 is a schematic overall flow chart of processing an audio-visual data object by using the method provided in the embodiment of the present application, where the overall flow chart includes the following processing steps:
s201, first, the video content is divided into a plurality of sub-objects.
S202, extracting video features of the segmented video content to obtain video feature information.
S203, meanwhile, separating the audio from the video to obtain the audio content corresponding to the segmented video.
S204, noise reduction is carried out on the audio content, and noise is eliminated.
S205, recognizing waveforms in different wave bands, and separating different types of audio, such as separating human voice/sound effects and the like.
S206, extracting audio features from each type of audio to obtain the audio feature information.
S207, inputting the video characteristic information and the audio characteristic information into the deep learning model for processing.
S208, identifying content labels according to the processing result of deep learning, and classifying the content labels into a plurality of video content sets and audio content sets.
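Taken together, steps S201 to S208 might be chained as in the sketch below, which reuses the hypothetical helpers from the earlier sketches (segment, key_frames, video_features, noise_reduce, band_sets, TagClassifier); all of these names, and the simplified frame-rate/sample-rate bookkeeping, are assumptions of this example.

```python
# Orchestration sketch for S201-S208 using the hypothetical helpers above.
import numpy as np
import torch

def process(frames, fps, audio, sr, slice_feats, model, tag_names):
    sub_ranges = segment(slice_feats)                               # S201
    tagged = []
    for f0, f1 in sub_ranges:
        kf = key_frames(frames[f0:f1])                              # S202
        v_vec = np.mean([video_features(f) for f in kf], axis=0)
        a0, a1 = int(f0 / fps * sr), int(f1 / fps * sr)             # S203: matching audio
        clean = noise_reduce(audio[a0:a1], sr)                      # S204
        bands = band_sets(clean, sr)                                # S205
        a_vec = np.concatenate([[b.mean(), b.std()] for b in bands.values()])  # S206 (toy features)
        x = torch.tensor(np.concatenate([v_vec, a_vec]), dtype=torch.float32)
        tag = tag_names[model(x[None]).argmax().item()]             # S207-S208
        tagged.append({'start': f0 / fps, 'end': f1 / fps, 'tags': [tag]})
    return tagged  # feed into categorize() / trailer_edit_list() from the earlier sketch
```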
Based on the same inventive concept, an embodiment of the present application further provides a device for processing video and audio data. The method corresponding to the device is the method in the foregoing embodiment, and its principle of solving the problem is similar to that of the method.
The embodiment of the application provides processing equipment for audio-video data, which can effectively apply the audio and video content in an audio-video data object, determine the specific content contained in each sub-object of the audio-video data object, and realize unified scheduling and use of audio-video data. The device may be implemented as user equipment, a network device, a device formed by integrating user equipment and a network device through a network, or an application program running on such a device. The user equipment includes, but is not limited to, various terminal devices such as computers, mobile phones and tablet computers; the network device includes, but is not limited to, implementations such as a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud is composed of a large number of hosts or web servers based on cloud computing, which is a kind of distributed computing: a virtual computer composed of a group of loosely coupled computers.
Fig. 3 shows an apparatus for processing audio-visual data according to an embodiment of the present application, where the apparatus includes a segmentation module 310, a feature extraction module 320, and a classification matching module 330. The segmentation module 310 is configured to divide the video and audio data object into a plurality of sub-objects. The video and audio data object in the embodiment of the present application refers to a file or a data stream containing video and audio data, and its specific content may be a movie, a television show, or the like. A sub-object is a part of the content of the video and audio data object; for example, a movie with a duration of 120 minutes may be divided into a plurality of segments according to duration, and each segment is a sub-object.
In some embodiments of the present application, when the segmentation module 310 segments the audio-visual data object, it may cluster the object by means of spatio-temporal slices, that is, the audio-visual data object is clustered by spatio-temporal slicing according to the video content it contains, and a plurality of sub-objects are determined based on the clustering result. A spatio-temporal slice is an image formed, in time order, by pixel strips taken from the same position in consecutive frames of a video image sequence. Because frames with similar content have a certain visual similarity, segmenting the audio-visual data object by spatio-temporal slice clustering ensures that the video and audio data within each sub-object belong to similar content.
For example, suppose the pictures in a video comprise three parts: the first part shows two people conversing in an indoor scene, the second part shows a landscape in an outdoor scene, and the third part shows an explosion in an outdoor scene. Because the three kinds of pictures differ greatly in appearance, spatio-temporal slice clustering can accurately divide the video segment into three parts; the video frames contained in each part form one clustering result, and the video and audio corresponding to those frames constitute a sub-object.
In an actual scene, because the actual situation of each picture is more complex, the clustering results based on spatio-temporal slices may contain errors. For example, the first part, about two people conversing indoors, may change greatly because the people move, so that the content of one part is split into two clustering results, or the pictures of the second and third parts may be merged into a single clustering result. Therefore, when determining the sub-objects from the clustering results, the clustering results can be dynamically adjusted according to the similarity among them. For example, by setting a dynamic threshold, the similarity threshold used during clustering can be adjusted so that the preliminary clustering results are merged or further split, making the final clustering result more accurate.
The feature extraction module 320 is configured to extract video feature information about the video content in the sub-object and audio feature information about the audio content in the sub-object. Since it involves the processing of both video and audio, the feature extraction module may include a video feature extraction sub-module and an audio feature extraction sub-module.
When processing the video portion, processing is performed based on the video content in each sub-object. For example, for a movie that has been divided into a plurality of segments, feature extraction is performed on the video content of each segment to acquire its feature information. In some embodiments of the present application, key frames may be extracted from the video content of the sub-object and then processed to obtain their video feature information, which serves as the video feature information about the video content in the sub-object.
A key frame is a frame in which a key action of image motion or change occurs, and it can reflect what the video image sequence actually expresses. For example, for video content about an explosion, the key frames may be the frame representing the cause of the explosion (such as the moment of impact), the frame when the explosion flame appears, the frame when the flame is largest, and the frame when the flame disappears. Since key frames better reflect the actual meaning of the video content, using the video feature information of the key frames as the video feature information about the video content in the sub-object reduces the amount of processing and improves the processing speed.
The video feature information can be image features such as texture, color, shape or spatial relationship, and in an actual scene, one or more image features suitable for a current scene can be selected as the video feature information according to scene requirements so as to improve processing accuracy. The acquired video characteristic information may be recorded in the form of a multi-dimensional vector set.
When processing the audio portion, processing may likewise be based on the audio content in each sub-object. For example, for a movie that has been divided into a plurality of segments, feature extraction is performed on the audio content of each segment to acquire its feature information. For a general audio-visual data object, the audio content includes various types, such as human voice, sound effects, environmental sounds and background music. Taking the video content of two people conversing in an indoor scene as an example, the corresponding audio content may include the speech of the two people, their footsteps while walking, the sound of vehicles outside the room, background music and so on, and these types of audio correspond to different waveforms in different wavebands. Thus, in some embodiments of the present application, when extracting audio features, waveform recognition may be performed in different wavebands to extract different types of audio sets from the audio content of the sub-object, where an audio set may be a human voice/sound effect set, an environmental sound set, a background music set, or the like. For each of these audio sets, the audio feature information therein may be extracted separately as the audio feature information about the audio content in the sub-object. The acquired audio feature information may be recorded in the form of a multi-dimensional vector set.
In an actual scenario, the device provided in the embodiment of the present application may further include an audio/video separation module, a noise reduction module, and the like. The audio/video separation module is configured to separate the audio content from the sub-object when the audio content in the sub-object is to be processed, and the noise reduction module is configured to perform noise reduction processing on the audio content of the sub-object before waveform recognition is performed in the different wavebands, so as to improve the accuracy of audio feature extraction.
The classification matching module 330 determines the content tag of each sub-object according to the video feature information and the audio feature information. A content tag is information representing the video and audio content actually contained in a sub-object, and it can describe that content from various dimensions according to the user's requirements, for example, the content itself, the scene, or the corresponding emotion.
In some embodiments of the present application, the classification matching module 330 may accomplish the identification of content tags by deep learning. Before the audio-visual data are processed, a deep learning model may be constructed and trained with audio content and video content that have already been labeled with content tags as the training set, so that the deep learning model can be used to identify the content tags of sub-objects. For example, if the scheme provided in the embodiment of the present application is required to identify whether a segment in a certain movie contains content related to an explosion, various videos and audio related to explosions may be provided as the training set, where the training set includes the video feature information of those videos and the audio feature information of those audio samples, and their content tags have been labeled as 'explosion'. Provided that there are enough training samples, the deep learning model can take video feature information or audio feature information that has not yet been labeled, determine whether its content tag should be 'explosion', and thereby determine the content to which the movie segment corresponds.
In another embodiment of the present application, after the content tags of the sub-objects are determined, the classification matching module 330 may categorize the sub-objects in the audio-visual data object according to their content tags to generate categorized object sets. For example, for a movie, all segments about explosions may be categorized into a set of explosion segments, and all segments about character fights may likewise be categorized into a separate set.
In an actual scene, the categorization of the sub-objects can be based on external input or on preset categorization conditions. For example, keywords input by a user can be obtained, and the matching content tags are selected according to the keywords so as to obtain a suitable content set. Taking a movie as an example, if a trailer of the movie needs to be generated, the movie may first be divided into a plurality of segments using the scheme provided in the embodiment of the present application, and a content tag is then generated for each segment. The user can input keywords according to actual needs to select the segments required for generating the trailer; for example, if the user wants a trailer in a particular style, the segments whose content tags match that style can be selected as the material for the trailer and grouped into a segment set. Similarly, if the user wants a trailer containing relatively more fighting content, the segments corresponding to that content tag may be selected.
For the audio content and the video content, tags can also be set separately, that is, the tags can be divided into video content tags and audio content tags, which correspond to each other and are associated with the sub-objects obtained by segmenting the audio-visual data object. Therefore, when categorizing based on content tags, the video content and/or the audio content of the sub-objects in the audio-visual data object can be categorized by audio, by video, or by both together, according to the video content tags and/or audio content tags of the sub-objects, so as to obtain the video content set and/or audio content set required by the user.
In summary, in the processing scheme of the audio-video data provided by the application, the audio-video data object is first divided into a plurality of sub-objects; then video feature information about the video content in each sub-object and audio feature information about the audio content in each sub-object are extracted; and then a content tag is determined for each sub-object according to the video feature information and the audio feature information. The specific content contained in each sub-object of the audio-video data object can be determined through the content tags, and the associations among the content tags can represent the association relationships among the various parts of the content, so that the audio and video content in the audio-video data object can be effectively applied and unified scheduling and use of audio-video data can be realized.
Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Program instructions for invoking the methods of the present application may be stored in fixed or removable recording media and/or transmitted via a data stream in a broadcast or other signal bearing medium and/or stored within a working memory of a computer device operating according to the program instructions. Here, one embodiment according to the present application includes an apparatus as shown in fig. 4, which includes one or more machine-readable media 410 storing machine-readable instructions and a processor 420 for executing the machine-readable instructions, wherein the machine-readable instructions, when executed by the processor, cause the apparatus to perform methods and/or aspects based on the foregoing embodiments according to the present application.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present application may be executed by a processor to implement the above steps or functions. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (15)

1. A processing method of video and audio data, wherein the method includes:
performing space-time slice clustering on the video and audio data objects according to video content in the video and audio data objects;
determining a plurality of sub-objects based on the clustering result;
extracting a key frame from the video content of the sub-object, wherein the key frame is a frame in which a key action in image motion or change is located;
acquiring video characteristic information of the key frames as video characteristic information about video content in the sub-objects;
waveform identification is carried out in different wave bands, and different types of audio sets are extracted from the audio content of the sub-object;
respectively extracting the audio characteristic information in the audio set as the audio characteristic information about the audio content in the sub-object;
determining the content label of each sub-object according to the video characteristic information and the audio characteristic information;
and determining matched content labels according to the input keywords, and generating trailers of the video and audio data objects based on the sub-objects corresponding to the content labels.
2. The method of claim 1, wherein determining a plurality of sub-objects based on the clustering result comprises:
and dynamically adjusting the clustering results according to the similarity among the clustering results to determine a plurality of sub-objects.
3. The method of claim 1, wherein, before waveform recognition is performed at different wavebands and different types of audio sets are extracted from the audio content of the sub-object, the method further comprises:
and carrying out noise reduction processing on the audio content of the sub-object.
4. The method of claim 1, wherein prior to extracting the audio feature information about the audio content in the sub-object, further comprising:
the audio content is separated from the sub-objects.
5. The method of claim 1, wherein determining the content tag for each sub-object based on the video feature information and audio feature information comprises:
and inputting the video characteristic information and the audio characteristic information into a deep learning model to obtain the content label of each sub-object, wherein the deep learning model is obtained by training based on the audio content and the video content marked with the content label.
6. The method of claim 1, wherein the method further comprises:
and classifying the sub-objects in the video and audio data object according to the content labels of the sub-objects to generate a classified object set.
7. The method of claim 6, wherein the content tags include a video content tag and an audio content tag;
classifying the sub-objects in the video and audio data object according to the content labels of the sub-objects to obtain a classified object set, wherein the classifying comprises the following steps:
and classifying the video content and/or the audio content of the sub-object in the video and audio data object according to the video content tag and/or the audio content tag of the sub-object to obtain a video content set and/or a video content set.
8. An apparatus for processing audio-visual data, wherein the apparatus comprises:
the segmentation module is used for carrying out space-time slice clustering on the video data objects according to the video content in the video data objects; determining a plurality of sub-objects based on the clustering result;
the feature extraction module is used for extracting key frames from the video content of the sub-object, wherein the key frames are frames in which key actions in image motion or change are located; acquiring video characteristic information of the key frames as video characteristic information about video content in the sub-objects; waveform recognition is carried out in different wave bands, and different types of audio sets are extracted from the audio content of the sub-object; respectively extracting the audio characteristic information in the audio set as the audio characteristic information about the audio content in the sub-object;
and the classifying and matching module is used for determining the content label of each sub-object according to the video characteristic information and the audio characteristic information, determining the matched content label according to the input keywords, and generating the trailer of the video and audio data object based on the sub-object corresponding to the content label.
9. The apparatus of claim 8, wherein the partitioning module is configured to dynamically adjust the clustering results according to a similarity between the clustering results, and determine a plurality of sub-objects.
10. The apparatus of claim 8, wherein the apparatus further comprises:
the noise reduction module is used for carrying out waveform recognition on different wave bands and carrying out noise reduction processing on the audio contents of the sub-objects before extracting different types of audio sets from the audio contents of the sub-objects.
11. The apparatus of claim 8, wherein the apparatus further comprises:
and the audio and video separation module is used for separating the audio content from the sub-objects.
12. The apparatus of claim 8, wherein determining the content tag for each sub-object from the video feature information and audio feature information comprises:
and inputting the video characteristic information and the audio characteristic information into a deep learning model to obtain the content label of each sub-object, wherein the deep learning model is obtained by training based on the audio content and the video content marked with the content label.
13. The apparatus of claim 8, wherein the categorization match module is further configured to categorize the sub-objects in the audio-visual data object according to content tags of the sub-objects, generating a categorized object set.
14. The device of claim 13, wherein the content tags include a video content tag and an audio content tag;
the classifying and matching module is used for classifying the video content and/or the audio content of the sub-object in the video and audio data object according to the video content tag and/or the audio content tag of the sub-object to obtain a video content set and/or a video content set.
15. An apparatus for processing audio-visual data, wherein the apparatus comprises:
a processor; and
one or more machine-readable media having machine-readable instructions stored thereon, which when executed by the processor, cause the apparatus to perform the method of any of claims 1-7.
CN201810107188.8A 2018-02-02 2018-02-02 Video and audio data processing method and device Active CN108307229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810107188.8A CN108307229B (en) 2018-02-02 2018-02-02 Video and audio data processing method and device

Publications (2)

Publication Number Publication Date
CN108307229A CN108307229A (en) 2018-07-20
CN108307229B true CN108307229B (en) 2023-12-22

Family

ID=62850942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810107188.8A Active CN108307229B (en) 2018-02-02 2018-02-02 Video and audio data processing method and device

Country Status (1)

Country Link
CN (1) CN108307229B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101920B (en) * 2018-08-07 2021-06-25 石家庄铁道大学 Video time domain unit segmentation method
CN109120996B (en) * 2018-08-31 2021-08-13 深圳市万普拉斯科技有限公司 Video information identification method, storage medium and computer equipment
CN109257622A (en) * 2018-11-01 2019-01-22 广州市百果园信息技术有限公司 A kind of audio/video processing method, device, equipment and medium
CN109587568A (en) * 2018-11-01 2019-04-05 北京奇艺世纪科技有限公司 Video broadcasting method, device, computer readable storage medium
CN110234038B (en) * 2019-05-13 2020-02-14 特斯联(北京)科技有限公司 User management method based on distributed storage
CN110324726B (en) * 2019-05-29 2022-02-18 北京奇艺世纪科技有限公司 Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium
CN110213670B (en) * 2019-05-31 2022-01-07 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and storage medium
CN110677716B (en) * 2019-08-20 2022-02-01 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110930997B (en) * 2019-12-10 2022-08-16 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN111008287B (en) * 2019-12-19 2023-08-04 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium
CN113163272B (en) * 2020-01-07 2022-11-25 海信集团有限公司 Video editing method, computer device and storage medium
CN111770375B (en) 2020-06-05 2022-08-23 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
CN113095231B (en) * 2021-04-14 2023-04-18 上海西井信息科技有限公司 Video identification method, system, device and storage medium based on classified object

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195038B2 (en) * 2008-10-24 2012-06-05 At&T Intellectual Property I, L.P. Brief and high-interest video summary generation
US8769584B2 (en) * 2009-05-29 2014-07-01 TVI Interactive Systems, Inc. Methods for displaying contextually targeted content on a connected television
US9313535B2 (en) * 2011-02-03 2016-04-12 Ericsson Ab Generating montages of video segments responsive to viewing preferences associated with a video terminal
US10134440B2 (en) * 2011-05-03 2018-11-20 Kodak Alaris Inc. Video summarization using audio and visual cues
US9667937B2 (en) * 2013-03-14 2017-05-30 Centurylink Intellectual Property Llc Auto-summarizing video content system and method
US11055340B2 (en) * 2013-10-03 2021-07-06 Minute Spoteam Ltd. System and method for creating synopsis for multimedia content

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6829781B1 (en) * 2000-05-24 2004-12-07 At&T Corp. Network-based service to provide on-demand video summaries of television programs
CN100538698C (en) * 2004-01-14 2009-09-09 三菱电机株式会社 Summary transcriber and summary reproducting method
CN1938714A (en) * 2004-03-23 2007-03-28 英国电讯有限公司 Method and system for semantically segmenting scenes of a video sequence
KR20040041127A (en) * 2004-04-23 2004-05-14 학교법인 한국정보통신학원 An intelligent agent system for providing viewer-customized video skims in digital TV broadcasting
JP2010039877A (en) * 2008-08-07 2010-02-18 Nippon Telegr & Teleph Corp <Ntt> Apparatus and program for generating digest content
CN103299324A (en) * 2010-11-11 2013-09-11 谷歌公司 Learning tags for video annotation using latent subtags
US9002175B1 (en) * 2013-03-13 2015-04-07 Google Inc. Automated video trailer creation
CN103854014A (en) * 2014-02-25 2014-06-11 中国科学院自动化研究所 Terror video identification method and device based on sparse representation of context
CN107077595A (en) * 2014-09-08 2017-08-18 谷歌公司 Selection and presentation representative frame are for video preview
US9635337B1 (en) * 2015-03-27 2017-04-25 Amazon Technologies, Inc. Dynamically generated media trailers
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105611413A (en) * 2015-12-24 2016-05-25 小米科技有限责任公司 Method and device for adding video clip class markers
CN106649713A (en) * 2016-12-21 2017-05-10 中山大学 Movie visualization processing method and system based on content
CN106779073A (en) * 2016-12-27 2017-05-31 西安石油大学 Media information sorting technique and device based on deep neural network
CN106878632A (en) * 2017-02-28 2017-06-20 北京知慧教育科技有限公司 A kind for the treatment of method and apparatus of video data
CN107436921A (en) * 2017-07-03 2017-12-05 李洪海 Video data handling procedure, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NVPS: a multimodal news video processing system; Xie Yuxiang, Luan Xidao, Wu Lingda, Lao Songyang; Journal of the China Society for Scientific and Technical Information (04); full text *
Personalized video summary using visual semantic annotations and automatic speech transcriptions;B.L. Tseng 等;《IEEE》;全文 *
Research on emotion-based video summarization; Lan Yijie; China Excellent Master's Theses Electronic Journal; full text *

Also Published As

Publication number Publication date
CN108307229A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108307229B (en) Video and audio data processing method and device
Afouras et al. Self-supervised learning of audio-visual objects from video
Chung et al. Out of time: automated lip sync in the wild
EP2641401B1 (en) Method and system for video summarization
Ejaz et al. Efficient visual attention based framework for extracting key frames from videos
US7555149B2 (en) Method and system for segmenting videos using face detection
Hong et al. Video accessibility enhancement for hearing-impaired users
EP3813376A1 (en) System and method for generating localized contextual video annotation
CN108833973A (en) Extracting method, device and the computer equipment of video features
US20140181668A1 (en) Visual summarization of video for quick understanding
US20110304774A1 (en) Contextual tagging of recorded data
US11057457B2 (en) Television key phrase detection
CN110914872A (en) Navigating video scenes with cognitive insights
US20120033949A1 (en) Video Skimming Methods and Systems
El Khoury et al. Audiovisual diarization of people in video content
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN109408672B (en) Article generation method, article generation device, server and storage medium
CN108615532B (en) Classification method and device applied to sound scene
CN112565885B (en) Video segmentation method, system, device and storage medium
Coutrot et al. An audiovisual attention model for natural conversation scenes
US10904476B1 (en) Techniques for up-sampling digital media content
CN111836118B (en) Video processing method, device, server and storage medium
CN113343831A (en) Method and device for classifying speakers in video, electronic equipment and storage medium
CN113923504B (en) Video preview moving picture generation method and device
Li et al. What's making that sound?

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant