CN111767805A - Multi-mode data automatic cleaning and labeling method and system - Google Patents
- Publication number
- CN111767805A (application number CN202010525080.8A)
- Authority
- CN
- China
- Prior art keywords
- information
- image frame
- audio
- face image
- effective
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
Abstract
The invention provides a method and a system for automatically cleaning and labeling multi-modal data, which combine the two complementary kinds of information in video, images and audio, so that they cooperate to accomplish the automatic cleaning and labeling of video multi-modal data.
Description
Technical Field
The invention relates to the technical field of multi-modal data processing, in particular to an automatic multi-modal data cleaning and labeling method and system.
Background
Video data is multi-modal data: it simultaneously contains data of two single modalities, namely image information and audio information. At present, image information and audio information in video data are processed independently of each other. However, because the two are correlated in time and in scene content, such independent processing cannot achieve mutual cooperative labeling of image pictures and audio information in a multi-modal scene such as video, cannot provide diversified labeling information, and cannot use the multi-modal information to improve labeling performance in a single modal dimension.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a multi-modal data automatic cleaning and labeling method and system. The method and system obtain, through face information analysis, effective face image frames meeting a preset quality condition together with their corresponding timestamp information; obtain audio feature information and speaker identity determination information through audio information analysis; and label the speaker's speaking state and/or speaking starting point information in the video through synthesis and decision processing. The method and system thus combine the two complementary kinds of information in video, images and audio, which cooperate to accomplish automatic cleaning and labeling of video multi-modal data. In particular, picture segments containing frontal face information can be cleaned and labeled out of massive unlabeled video multi-modal data, while the speaking state and speaking starting point information of the person in each segment are labeled at the same time, realizing diversified labeling of multi-modal information and improving labeling performance.
The invention provides a multi-modal data automatic cleaning and labeling method, which is characterized by comprising the following steps:
step S1, a human face information analysis step, which is used for carrying out human face recognition on picture components in the video so as to obtain an effective human face image frame meeting the preset quality condition and corresponding timestamp information;
step S2, an audio information analysis step, which is used for analyzing the audio component in the video to obtain the audio characteristic information and the speaker identity determination information;
step S3, a synthesis and decision processing step, which is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information;
further, in step S1, the human face information analyzing step specifically includes,
step S101, separating the picture component from the video and acquiring the face feature information to be recognized, wherein the face feature information comprises facial feature information (eyes, eyebrows, nose, mouth, ears) and/or face contour information;
step S102, decomposing the picture components into a plurality of image frames, and identifying and selecting the human face image frames containing the human face characteristic information from the plurality of image frames;
step S103, judging whether the human face image frame meets a preset image resolution condition and/or an image tone condition, if so, determining the corresponding human face image frame as an effective human face image frame;
step S104, acquiring time axis information corresponding to the picture components decomposed into the plurality of image frames, and extracting timestamp information corresponding to each effective human face image frame from the time axis information;
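The face information analysis steps above (S102 to S104) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `has_face` flag stands in for the output of an upstream face detector, the resolution threshold is an assumed stand-in for the preset quality condition, and timestamps are derived from the frame index and frame rate as the time axis information.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    index: int       # position in the decoded frame sequence
    width: int
    height: int
    has_face: bool   # result of an upstream face detector (assumed)

def valid_face_frames(frames, fps, min_w=224, min_h=224):
    """Keep frames that contain a face and meet a (hypothetical)
    resolution condition; attach a timestamp derived from the frame
    index and the frame rate (the time axis information)."""
    kept = []
    for f in frames:
        if f.has_face and f.width >= min_w and f.height >= min_h:
            kept.append((f.index, f.index / fps))  # (frame index, seconds)
    return kept
```

For a 25 fps video, frame 50 would map to timestamp 2.0 s; the tone condition of step S103 would slot into the same filter.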
further, in the step S2, the audio information analyzing step specifically includes,
step S201, separating the audio component from the video;
step S202, performing VAD voice activation detection on the audio component to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
step S203, carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information of the audio component;
step S204, obtaining speaker identity determination information corresponding to the character voice information according to all the voiceprint identification information;
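The VAD idea in step S202 can be illustrated with a minimal energy-based sketch. A production VAD (and the VPR voiceprint stage) would use trained models; the frame length and energy threshold here are assumed, illustrative values only.

```python
def simple_vad(samples, frame_len=160, energy_thresh=0.01):
    """Energy-based voice activity detection over fixed-length frames.

    Returns one boolean per frame: True where the mean energy suggests
    speech, False for silence or low-level environmental sound."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        flags.append(energy >= energy_thresh)
    return flags
```

Consecutive `True` frames would then be merged into the character voice segments that step S203 passes to voiceprint recognition.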
further, in the step S3, the integrating and deciding process step includes,
step S301, judging whether the effective human face image frame obtained in the human face information analysis step meets the requirement of preset human face information quantity, and giving candidate labels of the effective human face image frame according to the judgment result;
step S302, according to the timestamp information and the audio characteristic information, determining matched audio characteristic information in a time period in which the effective human face image frame is positioned, and according to the matched audio characteristic information, marking the speaking state of the speaker and/or the speaking starting point information;
further, in the step S301, determining whether the effective face image frame obtained in the face information analysis step meets the preset face information amount requirement, and giving a candidate label to the effective face image frame according to the determination result, specifically includes,
step S3011, calculating an actual face feature information amount corresponding to the effective face image frame, comparing the actual face feature information amount with a preset face feature information threshold amount, if the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement, otherwise, determining that the effective face image frame does not meet the preset face information amount requirement;
step S3012, performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement;
further, in the step S302, determining, according to the timestamp information and the audio feature information, the matching audio feature information in the time period in which each effective face image frame is located, and labeling the speaker's speaking state and/or speaking starting point information according to the matching audio feature information, specifically includes,
step S3021, determining a coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
step S3022, determining audio feature information corresponding to the effective face image frame in the time period according to the coexistence time period, and using the audio feature information as the matching audio feature information;
step S3023, judging whether the matching audio feature information belongs to the character voice information and whether the speaker corresponding to the character voice information is consistent with the person in the effective face image frame, and if both conditions hold, labeling the speaker's speaking state and/or speaking starting point information.
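Steps S3021 to S3023 can be sketched in miniature as an interval-intersection check. This is an illustrative reduction under stated assumptions: spans are (start, end) times in seconds, `speaker_ids` carries one voiceprint-derived identity per speech span, and `face_id` is the identity of the person in the effective face image frames.

```python
def label_speaking(face_spans, speech_spans, face_id, speaker_ids):
    """Intersect the time spans of effective face image frames with the
    speech spans from VAD, and emit a (start, end) speaking label only
    when the voiceprint-derived speaker identity matches the identity
    of the person shown in the frames."""
    labels = []
    for fs, fe in face_spans:
        for (ss, se), spk in zip(speech_spans, speaker_ids):
            start, end = max(fs, ss), min(fe, se)  # coexistence time period
            if start < end and spk == face_id:
                labels.append((start, end))
    return labels
```

The start of each emitted label is the speaking starting point; its extent is the labeled speaking state interval.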
The invention further provides a multi-modal data automatic cleaning and labeling system, which is characterized in that:
the multi-modal data automatic cleaning and labeling system comprises a face information analysis module, an audio information analysis module and a synthesis and decision processing module; wherein,
the face information analysis module is used for carrying out face recognition on picture components in the video so as to obtain effective face image frames meeting preset quality conditions and corresponding timestamp information;
the audio information analysis module is used for analyzing audio components in the video so as to obtain audio characteristic information and speaker identity determination information;
the synthesis and decision processing module is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information;
further, the face information analysis module comprises a face image frame picking submodule, an effective face image frame determining submodule and a timestamp information extraction submodule; wherein,
the human face image frame picking submodule is used for identifying and picking human face image frames containing the human face characteristic information from a plurality of image frames corresponding to the image components;
the effective human face image frame determining submodule is used for determining the human face image frame meeting the preset image resolution condition and/or the image tone condition as an effective human face image frame;
the timestamp information extraction submodule is used for extracting timestamp information corresponding to each effective face image frame from the time axis information corresponding to the picture component decomposed into the plurality of image frames;
further, the audio information analysis module comprises a VAD voice activation detection submodule, a VPR voiceprint recognition processing submodule and a speaker identity determination information generation submodule; wherein,
the VAD voice activation detection submodule is used for carrying out VAD voice activation detection on the audio component so as to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
the VPR voiceprint recognition processing submodule is used for carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information about the audio component;
the speaker identity confirming information generating submodule is used for generating speaker identity confirming information corresponding to the character voice information according to all the voiceprint recognition information;
furthermore, the synthesis and decision processing module comprises an image frame candidate labeling submodule and a speaking state/speaking starting point labeling submodule; wherein,
the image frame candidate labeling submodule is used for giving out candidate labels of the effective face image frame according to the judgment result of whether the effective face image frame meets the requirement of the preset face information amount;
the speaking state/speaking starting point labeling submodule is used for labeling the speaking state and/or the speaking starting point information of the speaker according to the timestamp information and the audio characteristic information;
further, the image frame candidate labeling submodule comprises an actual face feature information amount calculating unit, a face feature information amount comparing unit and a candidate labeling execution unit; wherein,
the actual face feature information amount calculating unit is used for calculating actual face feature information amount corresponding to the effective face image frame;
the face characteristic information quantity comparison unit is used for comparing the actual face characteristic information quantity with a preset face characteristic information threshold quantity, and when the actual face characteristic information quantity is greater than or equal to the preset face characteristic information threshold quantity, determining that the effective face image frame meets the preset face information quantity requirement, otherwise, determining that the effective face image frame does not meet the preset face information quantity requirement;
the candidate labeling execution unit is used for performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement;
further,
the speaking state/speaking starting point labeling submodule comprises a coexistence time period determining unit, a matching audio feature information determining unit and a labeling operation unit; wherein,
the coexistence time period determining unit is used for determining the coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
the matching audio characteristic information determining unit is used for determining corresponding audio characteristic information in a time period in which the effective human face image frame is positioned according to the coexistence time period, and the audio characteristic information is used as the matching audio characteristic information;
and the marking operation unit is used for marking the speaking state of the speaker and/or the speaking starting point information when the matched audio information belongs to the person voice information and the speaker corresponding to the person voice information is consistent with the person in the effective human face image frame.
Compared with the prior art, the multi-modal data automatic cleaning and labeling method and system obtain, through face information analysis, effective face image frames meeting a preset quality condition together with their corresponding timestamp information; obtain audio feature information and speaker identity determination information through audio information analysis; and label the speaker's speaking state and/or speaking starting point information in the video through synthesis and decision processing. The method and system thus combine the two complementary kinds of information in video, images and audio, which cooperate to accomplish automatic cleaning and labeling of video multi-modal data. In particular, picture segments containing frontal face information can be cleaned and labeled out of massive unlabeled video multi-modal data, while the speaking state and speaking starting point information of the person in each segment are labeled at the same time, realizing diversified labeling of multi-modal information and improving labeling performance.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flow chart of an automatic multi-modal data cleaning and labeling method provided by the present invention.
FIG. 2 is a schematic structural diagram of the multi-modal data automatic cleaning and labeling system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of an automatic multi-modal data cleaning and labeling method according to an embodiment of the present invention. The multi-modal data automatic cleaning and labeling method comprises the following steps:
step S1, a human face information analysis step, which is used for carrying out human face recognition on picture components in the video so as to obtain an effective human face image frame meeting the preset quality condition and corresponding timestamp information;
step S2, an audio information analysis step, which is used for analyzing the audio component in the video to obtain the audio characteristic information and the speaker identity determination information;
step S3, a step of synthesis and decision processing, which is used to label the speaking status and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
Different from the prior-art approach of processing the image picture information and the audio information of multi-modal video data separately, the multi-modal data automatic cleaning and labeling method labels the image picture information and the audio information cooperatively, using timestamp information as the associating element, thereby realizing complementary cooperative cleaning and labeling of multi-modal video data and improving the processing quality of multi-modal video data in multi-modal scenes.
Preferably, in the step S1, the face information analyzing step specifically includes,
step S101, separating the picture component from the video and obtaining the face feature information to be recognized, wherein the face feature information comprises facial feature information (eyes, eyebrows, nose, mouth, ears) and/or face contour information;
step S102, decomposing the picture component into a plurality of image frames, and identifying and selecting the human face image frame containing the human face characteristic information from the plurality of image frames;
step S103, judging whether the human face image frame meets a preset image resolution condition and/or an image tone condition, if so, determining the corresponding human face image frame as an effective human face image frame;
and step S104, acquiring time axis information corresponding to the picture component decomposed into the plurality of image frames, and extracting time stamp information corresponding to each effective human face image frame from the time axis information.
By using the facial feature information and/or the face contour information as the face feature information, the accuracy and efficiency of decomposing the picture component into a plurality of image frames and selecting the face image frames among them can be improved, effectively avoiding erroneous image frame division.
Preferably, in the step S2, the audio information analyzing step specifically includes,
step S201, separating the audio component from the video;
step S202, performing VAD voice activation detection on the audio component to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
step S203, carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information of the audio component;
step S204, obtaining the speaker identity confirming information corresponding to the character voice information according to all the voiceprint recognition information.
The audio component is analyzed through VAD voice activation detection and VPR voiceprint recognition processing respectively, and the accuracy of obtaining the audio characteristic information and the voiceprint recognition information can be guaranteed.
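The speaker identity determination of step S204 is commonly realized by comparing voiceprint embeddings; the sketch below shows one such scheme under stated assumptions. The embeddings would come from a trained speaker model (not shown), and the 0.7 decision threshold is an assumed illustrative value, not a figure from the patent.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify_speaker(segment_emb, enrolled, thresh=0.7):
    """Compare a speech segment's voiceprint embedding against enrolled
    speaker embeddings; return the best-matching speaker id, or None
    when no similarity reaches the (assumed) decision threshold."""
    best_id, best_score = None, thresh
    for spk_id, emb in enrolled.items():
        score = cosine(segment_emb, emb)
        if score >= best_score:
            best_id, best_score = spk_id, score
    return best_id
```

Returning `None` for below-threshold segments lets the synthesis step discard audio that cannot be attributed to the person in the effective face image frames.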
Preferably, in the step S3, the integrating and deciding process step includes,
step S301, judging whether the effective human face image frame obtained in the human face information analysis step meets the requirement of the preset human face information amount, and giving candidate labels of the effective human face image frame according to the judgment result;
step S302, according to the timestamp information and the audio characteristic information, determining the matching audio characteristic information in the time period of the effective human face image frame, and according to the matching audio characteristic information, marking the speaking state of the speaker and/or the speaking starting point information.
The above process realizes diversified and dynamic candidate labeling of the effective face image frames, improving the applicability and effectiveness of labeling in different multi-modal scenes. In addition, it ensures that the matching audio feature information is precisely matched to the effective face image frames, thereby guaranteeing the accuracy of the labeled speaking state and/or speaking starting point information.
Preferably, in the step S301, determining whether the effective face image frame obtained in the face information analysis step meets the preset face information amount requirement, and giving a candidate label to the effective face image frame according to the determination result, specifically includes,
step S3011, calculating an actual face feature information amount corresponding to the effective image frame of the face, comparing the actual face feature information amount with a preset face feature information threshold amount, if the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, determining that the effective image frame of the face meets the preset face information amount requirement, otherwise, determining that the effective image frame of the face does not meet the preset face information amount requirement;
step S3012, performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement.
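Steps S3011 and S3012 reduce to a per-frame threshold comparison, sketched below. The patent does not specify how the face feature information amount is measured; here it is assumed to be a count of detected facial landmarks, and the threshold value is purely illustrative.

```python
def candidate_labels(feature_counts, threshold):
    """Compare the actual face feature information amount per frame
    (assumed here to be a count of detected landmarks) against the
    preset threshold and emit qualified/unqualified candidate labels."""
    return {
        frame_id: ("qualified" if count >= threshold else "unqualified")
        for frame_id, count in feature_counts.items()
    }
```

A frame at exactly the threshold is labeled qualified, matching the "greater than or equal to" condition of step S3011.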
Preferably, in the step S302, determining, according to the timestamp information and the audio feature information, the matching audio feature information in the time period in which each effective face image frame is located, and labeling the speaker's speaking state and/or speaking starting point information according to the matching audio feature information, specifically includes,
step S3021, determining a coexistence time period of the valid face image frame and the audio feature information according to the timestamp information;
step S3022, determining audio feature information corresponding to the valid face image frame in the time period according to the coexistence time period, and using the determined audio feature information as the matching audio feature information;
step S3023, judging whether the matching audio feature information belongs to the character voice information and whether the speaker corresponding to the character voice information is consistent with the person in the effective face image frame, and if both conditions hold, labeling the speaker's speaking state and/or speaking starting point information.
Fig. 2 is a schematic structural diagram of an automatic multi-modal data cleaning and labeling system according to an embodiment of the present invention. The multi-modal data automatic cleaning and labeling system comprises a face information analysis module, an audio information analysis module and a synthesis and decision processing module; wherein,
the face information analysis module is used for carrying out face recognition on picture components in the video so as to obtain effective face image frames meeting preset quality conditions and corresponding timestamp information;
the audio information analysis module is used for analyzing audio components in the video so as to obtain audio characteristic information and speaker identity determination information;
the synthesis and decision processing module is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
Different from the prior-art practice of processing the image picture information and the audio information of multi-modal video data separately, the multi-modal data automatic cleaning and labeling system uses the timestamp information as the associating element to cooperatively label the image picture information and the audio information, thereby realizing complementary cooperative cleaning and labeling of the multi-modal video data and improving the multi-modal scene processing quality of the multi-modal video data.
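The cooperation of the three modules around timestamp information might be organized as sketched below; the module interfaces and the `AnnotatedVideo` container are illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedVideo:
    """Collects the outputs the three modules contribute (illustrative structure)."""
    face_frames: list = field(default_factory=list)     # effective face frames + timestamps
    audio_features: list = field(default_factory=list)  # speech / environmental-sound segments
    speaker_ids: dict = field(default_factory=dict)     # segment index -> speaker identity
    labels: list = field(default_factory=list)          # final cooperative labels

def run_pipeline(video, face_module, audio_module, decision_module):
    """Timestamps are the associating element: each module's output carries time
    information so the synthesis and decision module can align image and audio."""
    result = AnnotatedVideo()
    result.face_frames = face_module(video)                          # face information analysis
    result.audio_features, result.speaker_ids = audio_module(video)  # audio information analysis
    result.labels = decision_module(result)                          # synthesis and decision
    return result
```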
Preferably, the face information analysis module comprises a face image frame picking submodule, an effective face image frame determining submodule and a timestamp information extraction submodule; wherein:
the face image frame picking submodule is used for identifying and picking the face image frames containing the face feature information from the plurality of image frames corresponding to the picture components;
the effective face image frame determining submodule is used for determining the face image frames meeting the preset image resolution condition and/or image tone condition as effective face image frames;
the timestamp information extraction submodule is used for extracting the timestamp information corresponding to each effective face image frame from the time axis information corresponding to the plurality of image frames decomposed from the picture components.
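The picking and determining submodules amount to a filter over decomposed frames; a minimal sketch follows, in which the thresholds and the per-frame dictionary fields are assumptions for illustration, not values from the patent.

```python
def select_valid_face_frames(frames, min_width=64, min_height=64, min_brightness=40):
    """Sketch of the effective-face-image-frame determination: keep only frames
    whose detected face region meets assumed resolution and tone (brightness)
    conditions.

    frames: list of dicts with 'index', 'face_box' ((w, h) of the detected face,
    or None when no face feature information was found) and 'mean_brightness'
    (0-255). Returns the indices of the effective face image frames.
    """
    valid = []
    for f in frames:
        if f["face_box"] is None:
            continue  # picking submodule: frame contains no face feature information
        w, h = f["face_box"]
        # resolution condition and/or image tone condition
        if w >= min_width and h >= min_height and f["mean_brightness"] >= min_brightness:
            valid.append(f["index"])
    return valid
```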
By using the facial features information and/or the face contour information as the face feature information, the accuracy and efficiency of identifying and picking the face image frames from the plurality of image frames of the picture component can be improved, thereby effectively avoiding erroneous image frame selection.
Preferably, the audio information analysis module comprises a VAD voice activation detection submodule, a VPR voiceprint recognition processing submodule and a speaker identity determination information generation submodule; wherein:
the VAD voice activation detection submodule is used for performing VAD voice activation detection on the audio component so as to obtain the audio feature information of the audio component, wherein the audio feature information comprises person voice information and environmental sound information;
the VPR voiceprint recognition processing submodule is used for performing VPR voiceprint recognition processing on the audio component so as to obtain all the voiceprint recognition information of the audio component;
the speaker identity determination information generation submodule is used for generating the speaker identity determination information corresponding to the person voice information according to all the voiceprint recognition information.
Analyzing the audio component through VAD voice activation detection and VPR voiceprint recognition processing, respectively, guarantees the accuracy of the obtained audio feature information and voiceprint recognition information.
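The patent names VAD generically; the minimal energy-threshold detector below is an illustrative stand-in for a production detector, separating person voice (speech-like frames) from environmental sound or silence. The frame length and threshold are assumed values.

```python
import math

def simple_energy_vad(samples, frame_len=160, threshold=0.02):
    """Minimal energy-based VAD sketch (not the specific detector of the patent).

    samples: sequence of floats in [-1, 1]. Returns one boolean per full frame:
    True = speech-like (person voice), False = environmental sound / silence.
    """
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        # root-mean-square energy of the frame
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        flags.append(rms > threshold)
    return flags
```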
Preferably, the synthesis and decision processing module comprises an image frame candidate labeling submodule and a speaking state/speaking starting point labeling submodule; wherein:
the image frame candidate labeling submodule is used for giving out candidate labels of the effective face image frame according to the judgment result of whether the effective face image frame meets the requirement of the preset face information amount;
the speaking state/speaking starting point labeling submodule is used for labeling the speaking state and/or the speaking starting point information of the speaker according to the timestamp information and the audio characteristic information.
The image frame candidate labeling submodule can realize diversified and dynamic candidate labeling of the effective face image frames, thereby improving the labeling applicability and effectiveness for different multi-modal scenes; in addition, the speaking state/speaking starting point labeling submodule ensures that the matching audio feature information is accurately aligned with the effective face image frames, thereby ensuring the accuracy of the labeling time of the speaking state and/or speaking starting point information of the speaker.
Preferably, the image frame candidate labeling submodule comprises an actual face feature information amount calculating unit, a face feature information amount comparing unit and a candidate labeling execution unit; wherein:
the actual face feature information amount calculating unit is used for calculating the actual face feature information amount corresponding to the effective face image frame;
the face feature information amount comparing unit is used for comparing the actual face feature information amount with a preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement when the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, and otherwise determining that the effective face image frame does not meet the preset face information amount requirement;
the candidate labeling execution unit is used for performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement.
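The three units above might be sketched as follows; measuring the "information amount" as a count of detected facial landmarks, and the threshold value, are assumptions made for illustration only.

```python
def face_info_amount(landmarks):
    """Illustrative 'actual face feature information amount': the number of
    facial landmarks detected in the effective face image frame."""
    return len(landmarks)

def candidate_label(landmarks, threshold=5):
    """Comparing unit + candidate labeling execution unit: 'qualified' when the
    actual amount reaches the preset threshold amount, otherwise 'unqualified'."""
    if face_info_amount(landmarks) >= threshold:
        return "qualified"
    return "unqualified"
```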
Preferably, the speaking state/speaking starting point labeling submodule comprises a coexistence time period determining unit, a matching audio feature information determining unit and a labeling operation unit; wherein:
the coexistence time period determining unit is used for determining the coexistence time period of the effective face image frame and the audio feature information according to the timestamp information;
the matching audio feature information determining unit is used for determining, according to the coexistence time period, the corresponding audio feature information in the time period in which the effective face image frame is located, and using it as the matching audio feature information;
the labeling operation unit is used for labeling the speaking state of the speaker and/or the speaking starting point information when the matching audio feature information belongs to the person voice information and the speaker corresponding to the person voice information is consistent with the person in the effective face image frame.
As can be seen from the above embodiments, the multi-modal data automatic cleaning and labeling method and system combine two complementary kinds of information in a video, image and audio, so that they cooperate to complete the automatic cleaning and labeling of the multi-modal video data. Specifically, picture segments containing frontal face information of a person can be cleaned and labeled out of a large amount of unlabeled multi-modal video data, while the speaking state and speaking starting point information of the person in those picture segments are labeled at the same time, thereby realizing diversified labeling of the multi-modal information and improving its labeling performance.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. The method for automatically cleaning and labeling the multi-modal data is characterized by comprising the following steps of:
step S1, a human face information analysis step, which is used for carrying out human face recognition on picture components in the video so as to obtain an effective human face image frame meeting the preset quality condition and corresponding timestamp information;
step S2, an audio information analysis step, which is used for analyzing the audio component in the video to obtain the audio characteristic information and the speaker identity determination information;
and step S3, a step of synthesis and decision processing, which is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
2. The method for automatically cleansing and labeling multimodal data as recited in claim 1, wherein: in step S1, the human face information analyzing step specifically includes,
step S101, separating the picture components from the video and acquiring the face feature information to be recognized, wherein the face feature information comprises facial features information and/or face contour information;
step S102, decomposing the picture components into a plurality of image frames, and identifying and selecting the human face image frames containing the human face characteristic information from the plurality of image frames;
step S103, judging whether the human face image frame meets a preset image resolution condition and/or an image tone condition, if so, determining the corresponding human face image frame as an effective human face image frame;
and step S104, acquiring time axis information corresponding to the picture components decomposed into the plurality of image frames, and extracting time stamp information corresponding to each effective human face image frame from the time axis information.
3. The method for automatically cleansing and labeling multimodal data as recited in claim 1, wherein: in step S2, the audio information analyzing step specifically includes,
step S201, separating the audio component from the video;
step S202, performing VAD voice activation detection on the audio component so as to obtain the audio feature information of the audio component, wherein the audio feature information comprises person voice information and environmental sound information;
step S203, performing VPR voiceprint recognition processing on the audio component so as to obtain all the voiceprint recognition information of the audio component;
and step S204, obtaining the speaker identity determination information corresponding to the person voice information according to all the voiceprint recognition information.
4. The method for automatically cleansing and labeling multimodal data as recited in claim 1, wherein: in the step S3, the synthesis and decision processing step specifically includes,
step S301, judging whether the effective human face image frame obtained in the human face information analysis step meets the requirement of preset human face information quantity, and giving candidate labels of the effective human face image frame according to the judgment result;
step S302, according to the timestamp information and the audio characteristic information, determining matched audio characteristic information in a time period of the effective human face image frame, and according to the matched audio characteristic information, marking the speaking state of the speaker and/or the speaking starting point information.
5. The method for automatically cleansing and labeling multimodal data as recited in claim 4, wherein: in step S301, it is determined whether the effective face image frame obtained in the face information analysis step meets the requirement of a preset face information amount, and the candidate labeling of the effective face image frame according to the determination result specifically includes,
step S3011, calculating an actual face feature information amount corresponding to the effective face image frame, comparing the actual face feature information amount with a preset face feature information threshold amount, if the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement, otherwise, determining that the effective face image frame does not meet the preset face information amount requirement;
step S3012, performing qualified candidate labeling on the effective face image frames meeting the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames not meeting the preset face information amount requirement;
alternatively,
in the step S302, determining, according to the timestamp information and the audio feature information, the matching audio feature information in the time period in which each effective face image frame is located, and labeling the speaking state of the speaker and/or the speaking starting point information according to the matching audio feature information specifically includes,
step S3021, determining a coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
step S3022, determining audio feature information corresponding to the effective face image frame in the time period according to the coexistence time period, and using the audio feature information as the matching audio feature information;
step S3023, determining whether the matching audio feature information belongs to the person voice information and whether the speaker corresponding to the person voice information is consistent with the person in the effective face image frame, and if both conditions hold, labeling the speaking state of the speaker and/or the speaking starting point information.
6. The multi-modal data automatic cleaning and labeling system is characterized in that:
the multi-modal data automatic cleaning and labeling system comprises a face information analysis module, an audio information analysis module and a synthesis and decision processing module; wherein the content of the first and second substances,
the face information analysis module is used for carrying out face recognition on picture components in the video so as to obtain effective face image frames meeting preset quality conditions and corresponding timestamp information;
the audio information analysis module is used for analyzing audio components in the video so as to obtain audio characteristic information and speaker identity determination information;
the synthesis and decision processing module is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
7. The multi-modal data automatic cleaning and labeling system of claim 6, wherein: the face information analysis module comprises a face image frame picking submodule, an effective face image frame determining submodule and a timestamp information extraction submodule; wherein:
the face image frame picking submodule is used for identifying and picking the face image frames containing the face feature information from the plurality of image frames corresponding to the picture components;
the effective face image frame determining submodule is used for determining the face image frames meeting the preset image resolution condition and/or image tone condition as effective face image frames;
the timestamp information extraction submodule is used for extracting the timestamp information corresponding to each effective face image frame from the time axis information corresponding to the plurality of image frames decomposed from the picture components.
8. The multi-modal data automatic cleaning and labeling system of claim 6, wherein: the audio information analysis module comprises a VAD voice activation detection submodule, a VPR voiceprint recognition processing submodule and a speaker identity determination information generation submodule; wherein:
the VAD voice activation detection submodule is used for performing VAD voice activation detection on the audio component so as to obtain the audio feature information of the audio component, wherein the audio feature information comprises person voice information and environmental sound information;
the VPR voiceprint recognition processing submodule is used for performing VPR voiceprint recognition processing on the audio component so as to obtain all the voiceprint recognition information of the audio component;
and the speaker identity determination information generation submodule is used for generating the speaker identity determination information corresponding to the person voice information according to all the voiceprint recognition information.
9. The multi-modal data automatic cleaning and labeling system of claim 6, wherein: the synthesis and decision processing module comprises an image frame candidate labeling submodule and a speaking state/speaking starting point labeling submodule; wherein:
the image frame candidate labeling submodule is used for giving out candidate labels of the effective face image frame according to the judgment result of whether the effective face image frame meets the requirement of the preset face information amount;
and the speaking state/speaking starting point labeling submodule is used for labeling the speaking state and/or the speaking starting point information of the speaker according to the timestamp information and the audio characteristic information.
10. The multi-modal data automatic cleaning and labeling system of claim 6, wherein: the image frame candidate labeling submodule comprises an actual face feature information amount calculating unit, a face feature information amount comparing unit and a candidate labeling execution unit; wherein:
the actual face feature information amount calculating unit is used for calculating the actual face feature information amount corresponding to the effective face image frame;
the face feature information amount comparing unit is used for comparing the actual face feature information amount with a preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement when the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, and otherwise determining that the effective face image frame does not meet the preset face information amount requirement;
the candidate labeling execution unit is used for performing qualified candidate labeling on the effective face image frames meeting the preset face information amount requirement and performing unqualified candidate labeling on the effective face image frames not meeting the preset face information amount requirement;
alternatively,
the speaking state/speaking starting point labeling submodule comprises a coexistence time period determining unit, a matching audio feature information determining unit and a labeling operation unit; wherein:
the coexistence time period determining unit is used for determining the coexistence time period of the effective face image frame and the audio feature information according to the timestamp information;
the matching audio feature information determining unit is used for determining, according to the coexistence time period, the corresponding audio feature information in the time period in which the effective face image frame is located, and using it as the matching audio feature information;
and the labeling operation unit is used for labeling the speaking state of the speaker and/or the speaking starting point information when the matching audio feature information belongs to the person voice information and the speaker corresponding to the person voice information is consistent with the person in the effective face image frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010525080.8A CN111767805A (en) | 2020-06-10 | 2020-06-10 | Multi-mode data automatic cleaning and labeling method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111767805A true CN111767805A (en) | 2020-10-13 |
Family
ID=72720474
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010525080.8A Pending CN111767805A (en) | 2020-06-10 | 2020-06-10 | Multi-mode data automatic cleaning and labeling method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111767805A (en) |
2020-06-10: Application CN202010525080.8A filed in CN; publication CN111767805A; legal status: active, Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018113526A1 (en) * | 2016-12-20 | 2018-06-28 | 四川长虹电器股份有限公司 | Face recognition and voiceprint recognition-based interactive authentication system and method |
CN109660744A (en) * | 2018-10-19 | 2019-04-19 | 深圳壹账通智能科技有限公司 | The double recording methods of intelligence, equipment, storage medium and device based on big data |
CN109671438A (en) * | 2019-01-28 | 2019-04-23 | 武汉恩特拉信息技术有限公司 | It is a kind of to provide the device and method of ancillary service using voice |
CN110909613A (en) * | 2019-10-28 | 2020-03-24 | Oppo广东移动通信有限公司 | Video character recognition method and device, storage medium and electronic equipment |
CN110881115A (en) * | 2019-12-24 | 2020-03-13 | 新华智云科技有限公司 | Strip splitting method and system for conference video |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117174092A (en) * | 2023-11-02 | 2023-12-05 | 北京语言大学 | Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis |
CN117174092B (en) * | 2023-11-02 | 2024-01-26 | 北京语言大学 | Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11315366B2 (en) | Conference recording method and data processing device employing the same | |
WO2021082941A1 (en) | Video figure recognition method and apparatus, and storage medium and electronic device | |
US7920761B2 (en) | Multimodal identification and tracking of speakers in video | |
US20190259388A1 (en) | Speech-to-text generation using video-speech matching from a primary speaker | |
CN111050201B (en) | Data processing method and device, electronic equipment and storage medium | |
CN110598008B (en) | Method and device for detecting quality of recorded data and storage medium | |
US11954912B2 (en) | Method for cutting video based on text of the video and computing device applying method | |
CN111901627B (en) | Video processing method and device, storage medium and electronic equipment | |
CN109714608A (en) | Video data handling procedure, device, computer equipment and storage medium | |
CN113242361A (en) | Video processing method and device and computer readable storage medium | |
CN114598933B (en) | Video content processing method, system, terminal and storage medium | |
CN111767805A (en) | Multi-mode data automatic cleaning and labeling method and system | |
CN111626061A (en) | Conference record generation method, device, equipment and readable storage medium | |
CN114239610A (en) | Multi-language speech recognition and translation method and related system | |
CN115497017A (en) | Broadcast television news stripping method and device based on artificial intelligence | |
EP4068282A1 (en) | Method for processing conference data and related device | |
CN111221987A (en) | Hybrid audio tagging method and apparatus | |
CN111161710A (en) | Simultaneous interpretation method and device, electronic equipment and storage medium | |
CN115831124A (en) | Conference record role separation system and method based on voiceprint recognition | |
CN113411517B (en) | Video template generation method and device, electronic equipment and storage medium | |
CN115499677A (en) | Audio and video synchronization detection method and device based on live broadcast | |
CN114155845A (en) | Service determination method and device, electronic equipment and storage medium | |
CN113194333A (en) | Video clipping method, device, equipment and computer readable storage medium | |
CN114125365A (en) | Video conference method, device and readable storage medium | |
CN111985400A (en) | Face living body identification method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||