CN108307229B - Video and audio data processing method and device


Info

Publication number
CN108307229B
Authority
CN
China
Prior art keywords
audio
content
video
sub
objects
Prior art date
Legal status
Active
Application number
CN201810107188.8A
Other languages
Chinese (zh)
Other versions
CN108307229A (en)
Inventor
徐常亮
李尉冉
傅丕毅
张云远
Current Assignee
Xinhua Zhiyun Technology Co ltd
Original Assignee
Xinhua Zhiyun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xinhua Zhiyun Technology Co ltd filed Critical Xinhua Zhiyun Technology Co ltd
Priority to CN201810107188.8A priority Critical patent/CN108307229B/en
Publication of CN108307229A publication Critical patent/CN108307229A/en
Application granted granted Critical
Publication of CN108307229B publication Critical patent/CN108307229B/en


Classifications

    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/439: Processing of audio elementary streams
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    All of the above fall under H (Electricity); H04 (Electric communication technique); H04N (Pictorial communication, e.g. television); H04N21/00 (Selective content distribution, e.g. interactive television or video on demand [VOD]); H04N21/40 (Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]); and H04N21/43 (Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream).

Abstract

The scheme first divides an audio-video data object into a plurality of sub-objects, then extracts video feature information about the video content in each sub-object and audio feature information about the audio content in each sub-object, and finally determines a content tag for each sub-object according to the video feature information and the audio feature information. The specific content contained in each sub-object of the audio-video data object can be determined through the content tags, and the associations among the content tags can represent the association relationships among the various parts of the content, so that the audio and video content in the audio-video data object can be effectively applied and unified scheduling and use of audio-video data can be realized.

Description

Video and audio data processing method and device
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to a method and an apparatus for processing audio and video data.
Background
With the development of intelligent devices and audio/video technologies, audio-video data objects containing both audio content and video content, such as movies and television shows, are generated and distributed at a greatly increased rate. However, these data objects are generally independent of one another, and there is no unified method or channel for identifying and applying their content. Existing technology mainly relies on video/audio fingerprints and corresponding audio/video libraries to identify video or audio, which makes it difficult to determine the association relationships among the content actually contained in a video/audio data object, so that the video and audio content in the data object cannot be effectively applied.
Summary of the application
An object of the present application is to provide a method and apparatus for processing video and audio data, which are used for solving the problem that it is difficult to determine the association relationship between the content specifically included in the video and audio data object in the prior art.
In order to achieve the above objective, the present application provides a method for processing audio-visual data, which includes:
dividing the video and audio data object into a plurality of sub-objects;
extracting video characteristic information about video content in the sub-object and audio characteristic information about audio content in the sub-object;
and determining the content label of each sub-object according to the video characteristic information and the audio characteristic information.
Based on another aspect of the present application, there is also provided an apparatus for processing audio-visual data, the apparatus including:
the segmentation module is used for segmenting the video and audio data object into a plurality of sub-objects;
the feature extraction module is used for extracting video feature information about video content in the sub-object and audio feature information about audio content in the sub-object;
and the classifying and matching module is used for determining the content label of each sub-object according to the video characteristic information and the audio characteristic information.
In addition, the application also provides a processing device of video and audio data, wherein the device comprises:
a processor; and
one or more machine readable media having machine readable instructions stored thereon, which when executed by the processor, cause the device to perform the aforementioned method of processing audiovisual data.
In the processing scheme of the video and audio data, the video and audio data object is first divided into a plurality of sub-objects; then video feature information about the video content in each sub-object and audio feature information about the audio content in each sub-object are extracted; and then a content tag is determined for each sub-object according to the video feature information and the audio feature information. The specific content contained in each sub-object of the video and audio data object can be determined through the content tags, and the associations among the content tags can represent the association relationships among the various parts of the content, so that the audio and video content in the video and audio data object can be effectively applied and unified scheduling and use of video and audio data are achieved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
fig. 1 shows a process flow chart of a processing method of audio-visual data provided in an embodiment of the present application;
fig. 2 is a schematic overall flow chart of processing an audio-visual data object by using the method provided in the embodiment of the present application;
fig. 3 is a schematic structural diagram of an audio-visual data processing device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of another audio-visual data processing device according to an embodiment of the present application;
the same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of the present application, the terminals and the devices of the service network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device.
The embodiment of the application provides a processing method of audio-video data, which can effectively apply the audio and video content in an audio-video data object, determine the specific content contained in each sub-object of the audio-video data object, and realize unified scheduling and use of audio-video data. The execution subject of the method may be user equipment, a network device, a device formed by integrating user equipment and a network device through a network, or an application program running on such a device. The user equipment includes, but is not limited to, various terminal devices such as computers, mobile phones and tablet computers; the network device includes, but is not limited to, implementations such as a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud is composed of a large number of hosts or web servers based on cloud computing, which is a kind of distributed computing: a virtual computer composed of a group of loosely coupled computers.
Fig. 1 shows a processing method of audio-visual data provided in an embodiment of the present application, where the method includes the following steps:
in step S101, the video and audio data object is divided into a plurality of sub-objects. The video and audio data object in the embodiment of the present application refers to a file or a data stream containing video and audio data, and the specific content thereof may be a movie, a television show, or the like. The sub-object refers to a part of content of the video-audio data object, for example, for a movie with a duration of 120 minutes, the sub-object may be divided into a plurality of segments according to the duration, and each segment is a sub-object.
In some embodiments of the present application, when the audio-visual data object is segmented, the audio-visual data object may be clustered by means of space-time slicing (space-temporal slice), that is, according to the video content in the audio-visual data object, the audio-visual data object is clustered by means of space-time slicing, and based on the clustering result, a plurality of sub-objects are determined. The space-time slicing refers to an image formed by pixel strips at the same position in continuous frames of a video image sequence according to time sequence, and because images with similar contents have certain visual similarity, the video and audio data objects are segmented in a space-time slicing clustering mode, so that the video and audio data in each sub-object can be segmented to belong to similar contents.
For example, a picture in a video includes 3 parts of content, the first part is a picture of two people in an indoor scene, the second part is a picture of a landscape scene in an outdoor scene, and the third part is a picture of an explosion of the outdoor scene. Because the three images have great difference in vision, the video segment can be accurately divided into three parts by a space-time slicing clustering mode, the video frame contained in each part is a clustering result, and the video and the audio corresponding to the video frame are sub-objects.
In an actual scene, because the actual situation of each picture is more complex, errors may occur in the clustering result based on the space-time slicing, for example, the picture of a first part about two people conversations in an indoor scene may be greatly changed due to the movement of people, so that the picture content of a certain part is divided into two clustering results, or the pictures of a second part and a third part may be divided into one clustering result. Therefore, when a plurality of sub-objects are determined based on the clustering results, the clustering results can be dynamically adjusted according to the similarity among the clustering results, and the plurality of sub-objects are determined. For example, by setting a dynamic threshold, the similarity threshold during clustering can be dynamically adjusted, so that the preliminary clustering results are combined or continuously split, and the final clustering result is more accurate.
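As a concrete illustration, the following is a minimal sketch of spatio-temporal-slice segmentation with a dynamically adjusted threshold. The single middle pixel row, the colour-histogram comparison, the threshold-relaxation rule and the segment-length limits are illustrative assumptions, not the clustering procedure defined by the patent.

```python
# Minimal sketch (assumptions noted above): segment a video into sub-objects
# by clustering spatio-temporal slices with a dynamically adjusted threshold.
import cv2
import numpy as np

def slice_features(video_path, row_frac=0.5):
    """Take one pixel strip per frame and describe it with a colour histogram."""
    cap = cv2.VideoCapture(video_path)
    feats = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        strip = frame[int(frame.shape[0] * row_frac)][None, :, :]  # 1 x W x 3 strip
        hist = cv2.calcHist([strip], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        feats.append(cv2.normalize(hist, hist).flatten())
    cap.release()
    return np.array(feats, dtype=np.float32)

def segment(feats, base_thresh=0.6, min_len=25, max_len=3000):
    """Greedy clustering of consecutive strips; the similarity threshold is
    relaxed over time and reset after each cut, so segments are neither
    over-split nor allowed to grow without bound."""
    bounds, start, thresh = [0], 0, base_thresh
    for i in range(1, len(feats)):
        sim = cv2.compareHist(feats[i - 1], feats[i], cv2.HISTCMP_CORREL)
        if (sim < thresh and i - start >= min_len) or i - start > max_len:
            bounds.append(i)            # visual break: close the current sub-object
            start, thresh = i, base_thresh
        else:
            thresh *= 0.999             # slowly relax to avoid over-splitting
    bounds.append(len(feats))
    return list(zip(bounds[:-1], bounds[1:]))  # (start_frame, end_frame) pairs
```

Each returned frame range corresponds to one sub-object; the audio samples covering the same time span belong to the same sub-object.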
Step S102, extracting video characteristic information about video content in the sub-object and audio characteristic information about audio content in the sub-object.
When processing the video portion, processing is performed based on the video content in each sub-object. For example, for a movie that has been divided into a plurality of segments, feature extraction is performed on the video content of each segment to acquire its feature information. In some embodiments of the present application, key frames may be extracted from the video content of the sub-object and then processed to obtain their video feature information, which serves as the video feature information about the video content in the sub-object.
A key frame is a frame in which a key action of image motion or change occurs, and it can reflect what the video image sequence actually expresses. For example, for video content about an explosion, the key frames may be the frame representing the cause of the explosion (such as the moment of impact), the frame when the explosion flame appears, the frame when the flame is largest, and the frame when the flame disappears. Since key frames better reflect the actual meaning of the video content, using the video feature information of the key frames as the video feature information about the video content in the sub-object reduces the amount of processing and improves the processing speed.
The video feature information can be image features such as texture, color, shape or spatial relationship, and in an actual scene, one or more image features suitable for a current scene can be selected as the video feature information according to scene requirements so as to improve processing accuracy. The acquired video characteristic information may be recorded in the form of a multi-dimensional vector set.
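As an illustration of this step, the sketch below selects key frames by frame-difference peaks and builds a colour-plus-texture feature vector for each of them. The difference-based key-frame rule and the particular features (colour histogram, gradient-orientation histogram) are assumptions chosen for the example, not the specific features prescribed by the patent.

```python
# Sketch (illustrative assumptions noted above): key-frame selection and
# video feature extraction for one sub-object.
import cv2
import numpy as np

def key_frames(frames, k=5):
    """Return the k frames whose difference from the previous frame is largest."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    diffs = [0.0] + [float(np.mean(cv2.absdiff(a, b)))
                     for a, b in zip(grays[:-1], grays[1:])]
    idx = sorted(np.argsort(diffs)[-k:])
    return [frames[i] for i in idx]

def video_features(frame):
    """Concatenate a colour histogram with a coarse texture descriptor."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    hist = cv2.normalize(hist, hist).flatten()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)
    texture, _ = np.histogram(ang, bins=16, weights=mag, density=True)
    return np.concatenate([hist, texture])  # one multi-dimensional feature vector
```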
When processing the audio portion, processing may likewise be based on the audio content in each sub-object. For example, for a movie that has been divided into a plurality of segments, feature extraction is performed on the audio content of each segment to acquire its feature information. For a general audio-visual data object, the audio content includes various types, such as human voice, sound effects, environmental sounds and background music. Taking the video content of two people conversing in an indoor scene as an example, the corresponding audio content may include the speech of the two people, their footsteps while walking, the sound of vehicles outside the room, background music and so on, and these types of audio correspond to different waveforms in different wavebands. Thus, in some embodiments of the present application, when extracting audio features, waveform recognition may be performed in different wavebands to extract different types of audio sets from the audio content of the sub-object, where an audio set may be a human voice/sound effect set, an environmental sound set, a background music set, or the like. For each of these audio sets, the audio feature information therein may be extracted separately as the audio feature information about the audio content in the sub-object. The acquired audio feature information may be recorded in the form of a multi-dimensional vector set.
In an actual scenario, when the audio content in a sub-object is processed, the audio content may first be separated from the sub-object. Meanwhile, in order to improve the accuracy of audio feature extraction, noise reduction processing can be performed on the audio content of the sub-object before waveform recognition is performed in the different wavebands.
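One way to approximate this step is sketched below: a crude spectral gate for noise reduction followed by band-pass filtering into coarse wavebands standing in for background music, voice and effects. The band edges and the spectral-gating rule are illustrative assumptions only.

```python
# Sketch (assumptions noted above): noise reduction followed by separation of
# the audio of one sub-object into coarse waveband sets.
import numpy as np
from scipy.signal import butter, sosfilt, stft, istft

def noise_reduce(signal, sr, gate_db=-40):
    """Crude spectral gate: zero out STFT bins far below the peak magnitude."""
    _, _, spec = stft(signal, fs=sr)
    mag = np.abs(spec)
    spec[mag < 10 ** (gate_db / 20) * mag.max()] = 0
    _, cleaned = istft(spec, fs=sr)
    return cleaned

def band_sets(signal, sr):
    """Split the signal into low/mid/high bands as rough proxies for
    background music, voice, and effects/ambience.
    Assumes sr >= 24 kHz so the upper band edge stays below Nyquist."""
    bands = {'background': (20, 250), 'voice': (250, 4000), 'effects': (4000, 12000)}
    out = {}
    for name, (lo, hi) in bands.items():
        sos = butter(4, [lo, hi], btype='bandpass', fs=sr, output='sos')
        out[name] = sosfilt(sos, signal)
    return out  # each entry is one "audio set" to be described by its own features
```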
In step S103, the content tag of each sub-object is determined according to the video feature information and the audio feature information. A content tag is information representing the video and audio content actually contained in a sub-object, and it can describe that content from various dimensions according to the user's requirements, for example, the content itself, the scene, or the corresponding emotion.
In some embodiments of the present application, identification of content tags may be accomplished by deep learning. Before the audio-visual data are processed, a deep learning model may be constructed and trained with audio content and video content that have already been labeled with content tags as the training set, so that the deep learning model can be used to identify the content tags of sub-objects. For example, if the scheme provided in the embodiment of the present application is required to identify whether a segment in a certain movie contains content related to an explosion, various videos and audio related to explosions may be provided as the training set, where the training set includes the video feature information of those videos and the audio feature information of those audio samples, and their content tags have been labeled as 'explosion'. Provided that there are enough training samples, the deep learning model can take video feature information or audio feature information that has not yet been labeled, determine whether its content tag should be 'explosion', and thereby determine the content to which the movie segment corresponds.
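The deep learning model is not specified further in the text; the sketch below shows one plausible form, a small fully connected classifier over the concatenated video and audio feature vectors, trained on clips whose tags are already labeled. The network shape, optimizer and training loop are assumptions.

```python
# Sketch (assumed architecture): classify concatenated video + audio feature
# vectors into content tags such as "explosion" or "dialogue".
import torch
import torch.nn as nn

class TagClassifier(nn.Module):
    def __init__(self, feat_dim, num_tags):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_tags))

    def forward(self, x):
        return self.net(x)

def train(model, loader, epochs=10):
    """loader yields (feature_vector, tag_index) pairs built from labelled clips."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, tags in loader:
            opt.zero_grad()
            loss_fn(model(feats), tags).backward()
            opt.step()
    return model

# Inference on an unlabelled sub-object (video_vec / audio_vec from the sketches above):
# x = torch.tensor(np.concatenate([video_vec, audio_vec]), dtype=torch.float32)
# tag = TAG_NAMES[model(x[None]).argmax().item()]
```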
In another embodiment of the present application, after the content tags of the sub-objects are determined, the sub-objects in the audio-visual data object may be categorized according to their content tags to generate categorized object sets. For example, for a movie, all segments about explosions may be categorized into a set of explosion segments, and all segments about character fights may likewise be categorized into a separate set.
In an actual scene, the categorization of the sub-objects can be based on external input or on preset categorization conditions. For example, keywords input by a user can be obtained, and the matching content tags are selected according to the keywords so as to obtain a suitable content set. Taking a movie as an example, if a trailer of the movie needs to be generated, the movie may first be divided into a plurality of segments using the scheme provided in the embodiment of the present application, and a content tag is then generated for each segment. The user can input keywords according to actual needs to select the segments required for generating the trailer; for example, if the user wants a trailer in a particular style, the segments whose content tags match that style can be selected as the material for the trailer and grouped into a segment set. Similarly, if the user wants a trailer containing relatively more fighting content, the segments corresponding to that content tag may be selected.
For the audio content and the video content, tags can also be set separately, that is, the tags can be divided into video content tags and audio content tags, which correspond to each other and are associated with the sub-objects obtained by segmenting the audio-visual data object. Therefore, when categorizing based on content tags, the video content and/or the audio content of the sub-objects in the audio-visual data object can be categorized by audio, by video, or by both together, according to the video content tags and/or audio content tags of the sub-objects, so as to obtain the video content set and/or audio content set required by the user.
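A bare-bones illustration of this categorization and keyword-driven selection is given below. The sub-object dictionary layout, the plain substring match and the decision to hand the selected time ranges to an external tool such as ffmpeg for actual trailer assembly are all assumptions made for the example.

```python
# Sketch (assumptions noted above): group tagged sub-objects into sets and
# pick trailer material by keyword.
from collections import defaultdict

def categorize(sub_objects):
    """sub_objects: list of dicts like {'start': 12.0, 'end': 19.5, 'tags': ['explosion']}."""
    sets = defaultdict(list)
    for so in sub_objects:
        for tag in so['tags']:
            sets[tag].append(so)          # one categorized object set per content tag
    return sets

def trailer_edit_list(sub_objects, keywords):
    """Return (start, end) ranges whose tags match any keyword, in play order."""
    picked = [so for so in sub_objects
              if any(kw in tag for tag in so['tags'] for kw in keywords)]
    return sorted((so['start'], so['end']) for so in picked)

# e.g. trailer_edit_list(subs, ['explosion', 'fight']) -> [(12.0, 19.5), (88.2, 95.0)]
# The returned ranges can then be cut and concatenated into the trailer file.
```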
Fig. 2 is a schematic overall flow chart of processing an audio-visual data object by using the method provided in the embodiment of the present application, where the overall flow chart includes the following processing steps:
s201, first, the video content is divided into a plurality of sub-objects.
S202, extracting video features of the segmented video content to obtain video feature information.
S203, meanwhile, separating the audio from the video to obtain the audio content corresponding to the segmented video.
S204, noise reduction is carried out on the audio content, and noise is eliminated.
S205, recognizing waveforms in different wave bands, and separating different types of audio, such as separating human voice/sound effects and the like.
S206, extracting audio features from each type of audio to obtain the audio feature information.
S207, inputting the video characteristic information and the audio characteristic information into the deep learning model for processing.
S208, identifying content labels according to the processing result of deep learning, and classifying the content labels into a plurality of video content sets and audio content sets.
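Taken together, steps S201 to S208 might be chained as in the sketch below, which reuses the hypothetical helpers from the earlier sketches (segment, key_frames, video_features, noise_reduce, band_sets, TagClassifier); all of these names, and the simplified frame-rate/sample-rate bookkeeping, are assumptions of this example.

```python
# Orchestration sketch for S201-S208 using the hypothetical helpers above.
import numpy as np
import torch

def process(frames, fps, audio, sr, slice_feats, model, tag_names):
    sub_ranges = segment(slice_feats)                               # S201
    tagged = []
    for f0, f1 in sub_ranges:
        kf = key_frames(frames[f0:f1])                              # S202
        v_vec = np.mean([video_features(f) for f in kf], axis=0)
        a0, a1 = int(f0 / fps * sr), int(f1 / fps * sr)             # S203: matching audio
        clean = noise_reduce(audio[a0:a1], sr)                      # S204
        bands = band_sets(clean, sr)                                # S205
        a_vec = np.concatenate([[b.mean(), b.std()] for b in bands.values()])  # S206 (toy features)
        x = torch.tensor(np.concatenate([v_vec, a_vec]), dtype=torch.float32)
        tag = tag_names[model(x[None]).argmax().item()]             # S207-S208
        tagged.append({'start': f0 / fps, 'end': f1 / fps, 'tags': [tag]})
    return tagged  # feed into categorize() / trailer_edit_list() from the earlier sketch
```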
Based on the same inventive concept, an embodiment of the present application further provides a device for processing video and audio data. The method corresponding to the device is the method in the foregoing embodiment, and its principle of solving the problem is similar to that of the method.
The embodiment of the application provides processing equipment for audio-video data, which can effectively apply the audio and video content in an audio-video data object, determine the specific content contained in each sub-object of the audio-video data object, and realize unified scheduling and use of audio-video data. The device may be implemented as user equipment, a network device, a device formed by integrating user equipment and a network device through a network, or an application program running on such a device. The user equipment includes, but is not limited to, various terminal devices such as computers, mobile phones and tablet computers; the network device includes, but is not limited to, implementations such as a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud is composed of a large number of hosts or web servers based on cloud computing, which is a kind of distributed computing: a virtual computer composed of a group of loosely coupled computers.
Fig. 3 shows an apparatus for processing audio-visual data according to an embodiment of the present application, where the apparatus includes a segmentation module 310, a feature extraction module 320, and a classification matching module 330. The segmentation module 310 is configured to divide the video and audio data object into a plurality of sub-objects. The video and audio data object in the embodiment of the present application refers to a file or a data stream containing video and audio data, and its specific content may be a movie, a television show, or the like. A sub-object is a part of the content of the video and audio data object; for example, a movie with a duration of 120 minutes may be divided into a plurality of segments according to duration, and each segment is a sub-object.
In some embodiments of the present application, when the segmentation module 310 segments the audio-visual data object, it may cluster the object by means of spatio-temporal slices, that is, the audio-visual data object is clustered by spatio-temporal slicing according to the video content it contains, and a plurality of sub-objects are determined based on the clustering result. A spatio-temporal slice is an image formed, in time order, by pixel strips taken from the same position in consecutive frames of a video image sequence. Because frames with similar content have a certain visual similarity, segmenting the audio-visual data object by spatio-temporal slice clustering ensures that the video and audio data within each sub-object belong to similar content.
For example, suppose the pictures in a video comprise three parts: the first part shows two people conversing in an indoor scene, the second part shows a landscape in an outdoor scene, and the third part shows an explosion in an outdoor scene. Because the three kinds of pictures differ greatly in appearance, spatio-temporal slice clustering can accurately divide the video segment into three parts; the video frames contained in each part form one clustering result, and the video and audio corresponding to those frames constitute a sub-object.
In an actual scene, because the actual situation of each picture is more complex, the clustering results based on spatio-temporal slices may contain errors. For example, the first part, about two people conversing indoors, may change greatly because the people move, so that the content of one part is split into two clustering results, or the pictures of the second and third parts may be merged into a single clustering result. Therefore, when determining the sub-objects from the clustering results, the clustering results can be dynamically adjusted according to the similarity among them. For example, by setting a dynamic threshold, the similarity threshold used during clustering can be adjusted so that the preliminary clustering results are merged or further split, making the final clustering result more accurate.
The feature extraction module 320 is configured to extract video feature information about the video content in the sub-object and audio feature information about the audio content in the sub-object. Since it involves the processing of both video and audio, the feature extraction module may include a video feature extraction sub-module and an audio feature extraction sub-module.
When processing the video portion, processing is performed based on the video content in each sub-object. For example, for a movie that has been divided into a plurality of segments, feature extraction is performed on the video content of each segment to acquire its feature information. In some embodiments of the present application, key frames may be extracted from the video content of the sub-object and then processed to obtain their video feature information, which serves as the video feature information about the video content in the sub-object.
A key frame is a frame in which a key action of image motion or change occurs, and it can reflect what the video image sequence actually expresses. For example, for video content about an explosion, the key frames may be the frame representing the cause of the explosion (such as the moment of impact), the frame when the explosion flame appears, the frame when the flame is largest, and the frame when the flame disappears. Since key frames better reflect the actual meaning of the video content, using the video feature information of the key frames as the video feature information about the video content in the sub-object reduces the amount of processing and improves the processing speed.
The video feature information can be image features such as texture, color, shape or spatial relationship, and in an actual scene, one or more image features suitable for a current scene can be selected as the video feature information according to scene requirements so as to improve processing accuracy. The acquired video characteristic information may be recorded in the form of a multi-dimensional vector set.
When processing the audio portion, processing may likewise be based on the audio content in each sub-object. For example, for a movie that has been divided into a plurality of segments, feature extraction is performed on the audio content of each segment to acquire its feature information. For a general audio-visual data object, the audio content includes various types, such as human voice, sound effects, environmental sounds and background music. Taking the video content of two people conversing in an indoor scene as an example, the corresponding audio content may include the speech of the two people, their footsteps while walking, the sound of vehicles outside the room, background music and so on, and these types of audio correspond to different waveforms in different wavebands. Thus, in some embodiments of the present application, when extracting audio features, waveform recognition may be performed in different wavebands to extract different types of audio sets from the audio content of the sub-object, where an audio set may be a human voice/sound effect set, an environmental sound set, a background music set, or the like. For each of these audio sets, the audio feature information therein may be extracted separately as the audio feature information about the audio content in the sub-object. The acquired audio feature information may be recorded in the form of a multi-dimensional vector set.
In an actual scenario, the device provided in the embodiment of the present application may further include an audio/video separation module, a noise reduction module, and the like. The audio/video separation module is configured to separate the audio content from the sub-object when the audio content in the sub-object is to be processed, and the noise reduction module is configured to perform noise reduction processing on the audio content of the sub-object before waveform recognition is performed in the different wavebands, so as to improve the accuracy of audio feature extraction.
The classification matching module 330 determines the content tag of each sub-object according to the video feature information and the audio feature information. A content tag is information representing the video and audio content actually contained in a sub-object, and it can describe that content from various dimensions according to the user's requirements, for example, the content itself, the scene, or the corresponding emotion.
In some embodiments of the present application, the classification matching module 330 may accomplish the identification of content tags by deep learning. Before the audio-visual data are processed, a deep learning model may be constructed and trained with audio content and video content that have already been labeled with content tags as the training set, so that the deep learning model can be used to identify the content tags of sub-objects. For example, if the scheme provided in the embodiment of the present application is required to identify whether a segment in a certain movie contains content related to an explosion, various videos and audio related to explosions may be provided as the training set, where the training set includes the video feature information of those videos and the audio feature information of those audio samples, and their content tags have been labeled as 'explosion'. Provided that there are enough training samples, the deep learning model can take video feature information or audio feature information that has not yet been labeled, determine whether its content tag should be 'explosion', and thereby determine the content to which the movie segment corresponds.
In another embodiment of the present application, after the content tags of the sub-objects are determined, the classification matching module 330 may categorize the sub-objects in the audio-visual data object according to their content tags to generate categorized object sets. For example, for a movie, all segments about explosions may be categorized into a set of explosion segments, and all segments about character fights may likewise be categorized into a separate set.
In an actual scene, the categorization of the sub-objects can be based on external input or on preset categorization conditions. For example, keywords input by a user can be obtained, and the matching content tags are selected according to the keywords so as to obtain a suitable content set. Taking a movie as an example, if a trailer of the movie needs to be generated, the movie may first be divided into a plurality of segments using the scheme provided in the embodiment of the present application, and a content tag is then generated for each segment. The user can input keywords according to actual needs to select the segments required for generating the trailer; for example, if the user wants a trailer in a particular style, the segments whose content tags match that style can be selected as the material for the trailer and grouped into a segment set. Similarly, if the user wants a trailer containing relatively more fighting content, the segments corresponding to that content tag may be selected.
For the audio content and the video content, tags can also be set separately, that is, the tags can be divided into video content tags and audio content tags, which correspond to each other and are associated with the sub-objects obtained by segmenting the audio-visual data object. Therefore, when categorizing based on content tags, the video content and/or the audio content of the sub-objects in the audio-visual data object can be categorized by audio, by video, or by both together, according to the video content tags and/or audio content tags of the sub-objects, so as to obtain the video content set and/or audio content set required by the user.
In summary, in the processing scheme of the audio-video data provided by the application, the audio-video data object is first divided into a plurality of sub-objects; then video feature information about the video content in each sub-object and audio feature information about the audio content in each sub-object are extracted; and then a content tag is determined for each sub-object according to the video feature information and the audio feature information. The specific content contained in each sub-object of the audio-video data object can be determined through the content tags, and the associations among the content tags can represent the association relationships among the various parts of the content, so that the audio and video content in the audio-video data object can be effectively applied and unified scheduling and use of audio-video data can be realized.
Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Program instructions for invoking the methods of the present application may be stored in fixed or removable recording media and/or transmitted via a data stream in a broadcast or other signal bearing medium and/or stored within a working memory of a computer device operating according to the program instructions. Here, one embodiment according to the present application includes an apparatus as shown in fig. 4, which includes one or more machine-readable media 410 storing machine-readable instructions and a processor 420 for executing the machine-readable instructions, wherein the machine-readable instructions, when executed by the processor, cause the apparatus to perform methods and/or aspects based on the foregoing embodiments according to the present application.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present application may be executed by a processor to implement the above steps or functions. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (15)

1. A processing method of video and audio data, wherein the method includes:
performing space-time slice clustering on the video and audio data objects according to video content in the video and audio data objects;
determining a plurality of sub-objects based on the clustering result;
extracting a key frame from the video content of the sub-object, wherein the key frame is a frame in which a key action in image motion or change is located;
acquiring video characteristic information of the key frames as video characteristic information about video content in the sub-objects;
waveform identification is carried out in different wave bands, and different types of audio sets are extracted from the audio content of the sub-object;
respectively extracting the audio characteristic information in the audio set as the audio characteristic information about the audio content in the sub-object;
determining the content label of each sub-object according to the video characteristic information and the audio characteristic information;
and determining matched content labels according to the input keywords, and generating trailers of the video and audio data objects based on the sub-objects corresponding to the content labels.
2. The method of claim 1, wherein determining a plurality of sub-objects based on the clustering result comprises:
and dynamically adjusting the clustering results according to the similarity among the clustering results to determine a plurality of sub-objects.
3. The method of claim 1, wherein, before waveform recognition is performed at different wavebands and different types of audio sets are extracted from the audio content of the sub-object, the method further comprises:
and carrying out noise reduction processing on the audio content of the sub-object.
4. The method of claim 1, wherein prior to extracting the audio feature information about the audio content in the sub-object, further comprising:
the audio content is separated from the sub-objects.
5. The method of claim 1, wherein determining the content tag for each sub-object based on the video feature information and audio feature information comprises:
and inputting the video characteristic information and the audio characteristic information into a deep learning model to obtain the content label of each sub-object, wherein the deep learning model is obtained by training based on the audio content and the video content marked with the content label.
6. The method of claim 1, wherein the method further comprises:
and classifying the sub-objects in the video and audio data object according to the content labels of the sub-objects to generate a classified object set.
7. The method of claim 6, wherein the content tags include a video content tag and an audio content tag;
classifying the sub-objects in the video and audio data object according to the content labels of the sub-objects to obtain a classified object set, wherein the classifying comprises the following steps:
and classifying the video content and/or the audio content of the sub-object in the video and audio data object according to the video content tag and/or the audio content tag of the sub-object to obtain a video content set and/or a video content set.
8. An apparatus for processing audio-visual data, wherein the apparatus comprises:
the segmentation module is used for carrying out space-time slice clustering on the video data objects according to the video content in the video data objects; determining a plurality of sub-objects based on the clustering result;
the feature extraction module is used for extracting key frames from the video content of the sub-object, wherein the key frames are frames in which key actions in image motion or change are located; acquiring video characteristic information of the key frames as video characteristic information about video content in the sub-objects; waveform recognition is carried out in different wave bands, and different types of audio sets are extracted from the audio content of the sub-object; respectively extracting the audio characteristic information in the audio set as the audio characteristic information about the audio content in the sub-object;
and the classifying and matching module is used for determining the content label of each sub-object according to the video characteristic information and the audio characteristic information, determining the matched content label according to the input keywords, and generating the trailer of the video and audio data object based on the sub-object corresponding to the content label.
9. The apparatus of claim 8, wherein the partitioning module is configured to dynamically adjust the clustering results according to a similarity between the clustering results, and determine a plurality of sub-objects.
10. The apparatus of claim 8, wherein the apparatus further comprises:
the noise reduction module is used for carrying out waveform recognition on different wave bands and carrying out noise reduction processing on the audio contents of the sub-objects before extracting different types of audio sets from the audio contents of the sub-objects.
11. The apparatus of claim 8, wherein the apparatus further comprises:
and the audio and video separation module is used for separating the audio content from the sub-objects.
12. The apparatus of claim 8, wherein determining the content tag for each sub-object from the video feature information and audio feature information comprises:
and inputting the video characteristic information and the audio characteristic information into a deep learning model to obtain the content label of each sub-object, wherein the deep learning model is obtained by training based on the audio content and the video content marked with the content label.
13. The apparatus of claim 8, wherein the categorization match module is further configured to categorize the sub-objects in the audio-visual data object according to content tags of the sub-objects, generating a categorized object set.
14. The device of claim 13, wherein the content tags include a video content tag and an audio content tag;
the classifying and matching module is used for classifying the video content and/or the audio content of the sub-object in the video and audio data object according to the video content tag and/or the audio content tag of the sub-object to obtain a video content set and/or a video content set.
15. An apparatus for processing audio-visual data, wherein the apparatus comprises:
a processor; and
one or more machine-readable media having machine-readable instructions stored thereon, which when executed by the processor, cause the apparatus to perform the method of any of claims 1-7.
CN201810107188.8A 2018-02-02 2018-02-02 Video and audio data processing method and device Active CN108307229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810107188.8A CN108307229B (en) 2018-02-02 2018-02-02 Video and audio data processing method and device

Publications (2)

Publication Number Publication Date
CN108307229A CN108307229A (en) 2018-07-20
CN108307229B true CN108307229B (en) 2023-12-22

Family

ID=62850942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810107188.8A Active CN108307229B (en) 2018-02-02 2018-02-02 Video and audio data processing method and device

Country Status (1)

Country Link
CN (1) CN108307229B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101920B (en) * 2018-08-07 2021-06-25 石家庄铁道大学 Video time domain unit segmentation method
CN109120996B (en) * 2018-08-31 2021-08-13 深圳市万普拉斯科技有限公司 Video information identification method, storage medium and computer equipment
CN109257622A (en) * 2018-11-01 2019-01-22 广州市百果园信息技术有限公司 A kind of audio/video processing method, device, equipment and medium
CN109587568A (en) * 2018-11-01 2019-04-05 北京奇艺世纪科技有限公司 Video broadcasting method, device, computer readable storage medium
CN110234038B (en) * 2019-05-13 2020-02-14 特斯联(北京)科技有限公司 User management method based on distributed storage
CN110324726B (en) * 2019-05-29 2022-02-18 北京奇艺世纪科技有限公司 Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium
CN110213670B (en) * 2019-05-31 2022-01-07 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and storage medium
CN110677716B (en) * 2019-08-20 2022-02-01 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110930997B (en) * 2019-12-10 2022-08-16 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN111008287B (en) * 2019-12-19 2023-08-04 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium
CN113163272B (en) * 2020-01-07 2022-11-25 海信集团有限公司 Video editing method, computer device and storage medium
CN111770375B (en) 2020-06-05 2022-08-23 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
CN113095231B (en) * 2021-04-14 2023-04-18 上海西井信息科技有限公司 Video identification method, system, device and storage medium based on classified object

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195038B2 (en) * 2008-10-24 2012-06-05 At&T Intellectual Property I, L.P. Brief and high-interest video summary generation
US8769584B2 (en) * 2009-05-29 2014-07-01 TVI Interactive Systems, Inc. Methods for displaying contextually targeted content on a connected television
US9313535B2 (en) * 2011-02-03 2016-04-12 Ericsson Ab Generating montages of video segments responsive to viewing preferences associated with a video terminal
US10134440B2 (en) * 2011-05-03 2018-11-20 Kodak Alaris Inc. Video summarization using audio and visual cues
US9667937B2 (en) * 2013-03-14 2017-05-30 Centurylink Intellectual Property Llc Auto-summarizing video content system and method
US11055340B2 (en) * 2013-10-03 2021-07-06 Minute Spoteam Ltd. System and method for creating synopsis for multimedia content

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6829781B1 (en) * 2000-05-24 2004-12-07 At&T Corp. Network-based service to provide on-demand video summaries of television programs
CN100538698C (en) * 2004-01-14 2009-09-09 三菱电机株式会社 Summary transcriber and summary reproducting method
CN1938714A (en) * 2004-03-23 2007-03-28 英国电讯有限公司 Method and system for semantically segmenting scenes of a video sequence
KR20040041127A (en) * 2004-04-23 2004-05-14 학교법인 한국정보통신학원 An intelligent agent system for providing viewer-customized video skims in digital TV broadcasting
JP2010039877A (en) * 2008-08-07 2010-02-18 Nippon Telegr & Teleph Corp <Ntt> Apparatus and program for generating digest content
CN103299324A (en) * 2010-11-11 2013-09-11 谷歌公司 Learning tags for video annotation using latent subtags
US9002175B1 (en) * 2013-03-13 2015-04-07 Google Inc. Automated video trailer creation
CN103854014A (en) * 2014-02-25 2014-06-11 中国科学院自动化研究所 Terror video identification method and device based on sparse representation of context
CN107077595A (en) * 2014-09-08 2017-08-18 谷歌公司 Selection and presentation representative frame are for video preview
US9635337B1 (en) * 2015-03-27 2017-04-25 Amazon Technologies, Inc. Dynamically generated media trailers
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105611413A (en) * 2015-12-24 2016-05-25 小米科技有限责任公司 Method and device for adding video clip class markers
CN106649713A (en) * 2016-12-21 2017-05-10 中山大学 Movie visualization processing method and system based on content
CN106779073A (en) * 2016-12-27 2017-05-31 西安石油大学 Media information sorting technique and device based on deep neural network
CN106878632A (en) * 2017-02-28 2017-06-20 北京知慧教育科技有限公司 A kind for the treatment of method and apparatus of video data
CN107436921A (en) * 2017-07-03 2017-12-05 李洪海 Video data handling procedure, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NVPS: a multimodal news video processing system; Xie Yuxiang, Luan Xidao, Wu Lingda, Lao Songyang; Journal of the China Society for Scientific and Technical Information (04); full text *
Personalized video summary using visual semantic annotations and automatic speech transcriptions;B.L. Tseng 等;《IEEE》;全文 *
Research on emotion-based video summarization; Lan Yijie; China Excellent Master's Theses Electronic Journal; full text *

Also Published As

Publication number Publication date
CN108307229A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108307229B (en) Video and audio data processing method and device
Afouras et al. Self-supervised learning of audio-visual objects from video
Chung et al. Out of time: automated lip sync in the wild
EP2641401B1 (en) Method and system for video summarization
Ejaz et al. Efficient visual attention based framework for extracting key frames from videos
US7555149B2 (en) Method and system for segmenting videos using face detection
Hong et al. Video accessibility enhancement for hearing-impaired users
EP3813376A1 (en) System and method for generating localized contextual video annotation
CN108833973A (en) Extracting method, device and the computer equipment of video features
US20140181668A1 (en) Visual summarization of video for quick understanding
US20110304774A1 (en) Contextual tagging of recorded data
US11057457B2 (en) Television key phrase detection
CN110914872A (en) Navigating video scenes with cognitive insights
US20120033949A1 (en) Video Skimming Methods and Systems
El Khoury et al. Audiovisual diarization of people in video content
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN109408672B (en) Article generation method, article generation device, server and storage medium
CN108615532B (en) Classification method and device applied to sound scene
CN112565885B (en) Video segmentation method, system, device and storage medium
Coutrot et al. An audiovisual attention model for natural conversation scenes
US10904476B1 (en) Techniques for up-sampling digital media content
CN111836118B (en) Video processing method, device, server and storage medium
CN113343831A (en) Method and device for classifying speakers in video, electronic equipment and storage medium
CN113923504B (en) Video preview moving picture generation method and device
Li et al. What's making that sound?

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant