CN111767805A - Multi-mode data automatic cleaning and labeling method and system - Google Patents
- Publication number
- CN111767805A (application number CN202010525080.8A)
- Authority
- CN
- China
- Prior art keywords
- information
- image frame
- audio
- face image
- effective
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
Abstract
The invention provides a method and a system for automatically cleaning and labeling multi-modal data, which combine the two complementary kinds of information in video, images and audio, so that they cooperate to accomplish the automatic cleaning and labeling of video multi-modal data.
Description
Technical Field
The invention relates to the technical field of multi-modal data processing, in particular to an automatic multi-modal data cleaning and labeling method and system.
Background
Video data is multi-modal data: it simultaneously contains data of two single modalities, namely image information and audio information. At present, image information and audio information in video data are processed independently of each other. However, because the two are correlated in time and in scene content, such independent processing cannot achieve mutual cooperative labeling of image pictures and audio information in a multi-modal scene such as video, cannot provide diversified labeling information, and cannot use the multi-modal information to improve labeling performance in a single modal dimension.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a multi-modal data automatic cleaning and labeling method and system. The method and system obtain, through face information analysis, effective face image frames meeting a preset quality condition together with their corresponding timestamp information; obtain audio feature information and speaker identity determination information through audio information analysis; and label the speaker's speaking state and/or speaking starting point information in the video through synthesis and decision processing. The method and system thus combine the two complementary kinds of information in video, images and audio, which cooperate to accomplish automatic cleaning and labeling of video multi-modal data. In particular, picture segments containing frontal face information can be cleaned and labeled out of massive unlabeled video multi-modal data, while the speaking state and speaking starting point information of the person in each segment are labeled at the same time, realizing diversified labeling of multi-modal information and improving labeling performance.
The invention provides a multi-modal data automatic cleaning and labeling method, which is characterized by comprising the following steps:
step S1, a human face information analysis step, which is used for carrying out human face recognition on picture components in the video so as to obtain an effective human face image frame meeting the preset quality condition and corresponding timestamp information;
step S2, an audio information analysis step, which is used for analyzing the audio component in the video to obtain the audio characteristic information and the speaker identity determination information;
step S3, a synthesis and decision processing step, which is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information;
further, in step S1, the human face information analyzing step specifically includes,
step S101, separating the picture component from the video and acquiring the face feature information to be recognized, wherein the face feature information comprises facial feature information (eyes, eyebrows, nose, mouth, ears) and/or face contour information;
step S102, decomposing the picture components into a plurality of image frames, and identifying and selecting the human face image frames containing the human face characteristic information from the plurality of image frames;
step S103, judging whether the human face image frame meets a preset image resolution condition and/or an image tone condition, if so, determining the corresponding human face image frame as an effective human face image frame;
step S104, acquiring time axis information corresponding to the picture components decomposed into the plurality of image frames, and extracting timestamp information corresponding to each effective human face image frame from the time axis information;
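The face information analysis steps above (S102 to S104) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `has_face` flag stands in for the output of an upstream face detector, the resolution threshold is an assumed stand-in for the preset quality condition, and timestamps are derived from the frame index and frame rate as the time axis information.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    index: int       # position in the decoded frame sequence
    width: int
    height: int
    has_face: bool   # result of an upstream face detector (assumed)

def valid_face_frames(frames, fps, min_w=224, min_h=224):
    """Keep frames that contain a face and meet a (hypothetical)
    resolution condition; attach a timestamp derived from the frame
    index and the frame rate (the time axis information)."""
    kept = []
    for f in frames:
        if f.has_face and f.width >= min_w and f.height >= min_h:
            kept.append((f.index, f.index / fps))  # (frame index, seconds)
    return kept
```

For a 25 fps video, frame 50 would map to timestamp 2.0 s; the tone condition of step S103 would slot into the same filter.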
further, in the step S2, the audio information analyzing step specifically includes,
step S201, separating the audio component from the video;
step S202, performing VAD voice activation detection on the audio component to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
step S203, carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information of the audio component;
step S204, obtaining speaker identity determination information corresponding to the character voice information according to all the voiceprint identification information;
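The VAD idea in step S202 can be illustrated with a minimal energy-based sketch. A production VAD (and the VPR voiceprint stage) would use trained models; the frame length and energy threshold here are assumed, illustrative values only.

```python
def simple_vad(samples, frame_len=160, energy_thresh=0.01):
    """Energy-based voice activity detection over fixed-length frames.

    Returns one boolean per frame: True where the mean energy suggests
    speech, False for silence or low-level environmental sound."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        flags.append(energy >= energy_thresh)
    return flags
```

Consecutive `True` frames would then be merged into the character voice segments that step S203 passes to voiceprint recognition.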
further, in the step S3, the integrating and deciding process step includes,
step S301, judging whether the effective human face image frame obtained in the human face information analysis step meets the requirement of preset human face information quantity, and giving candidate labels of the effective human face image frame according to the judgment result;
step S302, according to the timestamp information and the audio characteristic information, determining matched audio characteristic information in a time period in which the effective human face image frame is positioned, and according to the matched audio characteristic information, marking the speaking state of the speaker and/or the speaking starting point information;
further, in the step S301, determining whether the effective face image frame obtained in the face information analysis step meets the preset face information amount requirement, and giving a candidate label to the effective face image frame according to the determination result, specifically includes,
step S3011, calculating an actual face feature information amount corresponding to the effective face image frame, comparing the actual face feature information amount with a preset face feature information threshold amount, if the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement, otherwise, determining that the effective face image frame does not meet the preset face information amount requirement;
step S3012, performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement;
further, in the step S302, determining, according to the timestamp information and the audio feature information, the matching audio feature information in the time period in which each effective face image frame is located, and labeling the speaker's speaking state and/or speaking starting point information according to the matching audio feature information, specifically includes,
step S3021, determining a coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
step S3022, determining audio feature information corresponding to the effective face image frame in the time period according to the coexistence time period, and using the audio feature information as the matching audio feature information;
step S3023, judging whether the matching audio feature information belongs to the character voice information and whether the speaker corresponding to the character voice information is consistent with the person in the effective face image frame, and if both conditions hold, labeling the speaker's speaking state and/or speaking starting point information.
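Steps S3021 to S3023 can be sketched in miniature as an interval-intersection check. This is an illustrative reduction under stated assumptions: spans are (start, end) times in seconds, `speaker_ids` carries one voiceprint-derived identity per speech span, and `face_id` is the identity of the person in the effective face image frames.

```python
def label_speaking(face_spans, speech_spans, face_id, speaker_ids):
    """Intersect the time spans of effective face image frames with the
    speech spans from VAD, and emit a (start, end) speaking label only
    when the voiceprint-derived speaker identity matches the identity
    of the person shown in the frames."""
    labels = []
    for fs, fe in face_spans:
        for (ss, se), spk in zip(speech_spans, speaker_ids):
            start, end = max(fs, ss), min(fe, se)  # coexistence time period
            if start < end and spk == face_id:
                labels.append((start, end))
    return labels
```

The start of each emitted label is the speaking starting point; its extent is the labeled speaking state interval.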
The invention further provides a multi-modal data automatic cleaning and labeling system, which is characterized in that:
the multi-modal data automatic cleaning and labeling system comprises a face information analysis module, an audio information analysis module and a synthesis and decision processing module; wherein,
the face information analysis module is used for carrying out face recognition on picture components in the video so as to obtain effective face image frames meeting preset quality conditions and corresponding timestamp information;
the audio information analysis module is used for analyzing audio components in the video so as to obtain audio characteristic information and speaker identity determination information;
the synthesis and decision processing module is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information;
further, the face information analysis module comprises a face image frame picking submodule, an effective face image frame determining submodule and a timestamp information extraction submodule; wherein,
the human face image frame picking submodule is used for identifying and picking human face image frames containing the human face characteristic information from a plurality of image frames corresponding to the image components;
the effective human face image frame determining submodule is used for determining the human face image frame meeting the preset image resolution condition and/or the image tone condition as an effective human face image frame;
the timestamp information extraction submodule is used for extracting timestamp information corresponding to each effective face image frame from the time axis information corresponding to the picture component decomposed into the plurality of image frames;
further, the audio information analysis module comprises a VAD voice activation detection submodule, a VPR voiceprint recognition processing submodule and a speaker identity determination information generation submodule; wherein,
the VAD voice activation detection submodule is used for carrying out VAD voice activation detection on the audio component so as to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
the VPR voiceprint recognition processing submodule is used for carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information about the audio component;
the speaker identity confirming information generating submodule is used for generating speaker identity confirming information corresponding to the character voice information according to all the voiceprint recognition information;
furthermore, the synthesis and decision processing module comprises an image frame candidate labeling submodule and a speaking state/speaking starting point labeling submodule; wherein,
the image frame candidate labeling submodule is used for giving out candidate labels of the effective face image frame according to the judgment result of whether the effective face image frame meets the requirement of the preset face information amount;
the speaking state/speaking starting point labeling submodule is used for labeling the speaking state and/or the speaking starting point information of the speaker according to the timestamp information and the audio characteristic information;
further, the image frame candidate labeling submodule comprises an actual face feature information amount calculating unit, a face feature information amount comparing unit and a candidate labeling execution unit; wherein,
the actual face feature information amount calculating unit is used for calculating actual face feature information amount corresponding to the effective face image frame;
the face characteristic information quantity comparison unit is used for comparing the actual face characteristic information quantity with a preset face characteristic information threshold quantity, and when the actual face characteristic information quantity is greater than or equal to the preset face characteristic information threshold quantity, determining that the effective face image frame meets the preset face information quantity requirement, otherwise, determining that the effective face image frame does not meet the preset face information quantity requirement;
the candidate labeling execution unit is used for performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement;
further,
the speaking state/speaking starting point labeling submodule comprises a coexistence time period determining unit, a matching audio feature information determining unit and a labeling operation unit; wherein,
the coexistence time period determining unit is used for determining the coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
the matching audio characteristic information determining unit is used for determining corresponding audio characteristic information in a time period in which the effective human face image frame is positioned according to the coexistence time period, and the audio characteristic information is used as the matching audio characteristic information;
and the marking operation unit is used for marking the speaking state of the speaker and/or the speaking starting point information when the matched audio information belongs to the person voice information and the speaker corresponding to the person voice information is consistent with the person in the effective human face image frame.
Compared with the prior art, the multi-modal data automatic cleaning and labeling method and system obtain, through face information analysis, effective face image frames meeting a preset quality condition together with their corresponding timestamp information; obtain audio feature information and speaker identity determination information through audio information analysis; and label the speaker's speaking state and/or speaking starting point information in the video through synthesis and decision processing. The method and system thus combine the two complementary kinds of information in video, images and audio, which cooperate to accomplish automatic cleaning and labeling of video multi-modal data. In particular, picture segments containing frontal face information can be cleaned and labeled out of massive unlabeled video multi-modal data, while the speaking state and speaking starting point information of the person in each segment are labeled at the same time, realizing diversified labeling of multi-modal information and improving labeling performance.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flow chart of an automatic multi-modal data cleaning and labeling method provided by the present invention.
FIG. 2 is a schematic structural diagram of the multi-modal data automatic cleaning and labeling system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of an automatic multi-modal data cleaning and labeling method according to an embodiment of the present invention. The multi-modal data automatic cleaning and labeling method comprises the following steps:
step S1, a human face information analysis step, which is used for carrying out human face recognition on picture components in the video so as to obtain an effective human face image frame meeting the preset quality condition and corresponding timestamp information;
step S2, an audio information analysis step, which is used for analyzing the audio component in the video to obtain the audio characteristic information and the speaker identity determination information;
step S3, a step of synthesis and decision processing, which is used to label the speaking status and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
Different from the prior-art approach of processing the image picture information and the audio information of multi-modal video data separately, the multi-modal data automatic cleaning and labeling method labels the image picture information and the audio information cooperatively, using timestamp information as the associating element, thereby realizing complementary cooperative cleaning and labeling of multi-modal video data and improving the processing quality of multi-modal video data in multi-modal scenes.
Preferably, in the step S1, the face information analyzing step specifically includes,
step S101, separating the picture component from the video and obtaining the face feature information to be recognized, wherein the face feature information comprises facial feature information (eyes, eyebrows, nose, mouth, ears) and/or face contour information;
step S102, decomposing the picture component into a plurality of image frames, and identifying and selecting the human face image frame containing the human face characteristic information from the plurality of image frames;
step S103, judging whether the human face image frame meets a preset image resolution condition and/or an image tone condition, if so, determining the corresponding human face image frame as an effective human face image frame;
and step S104, acquiring time axis information corresponding to the picture component decomposed into the plurality of image frames, and extracting time stamp information corresponding to each effective human face image frame from the time axis information.
By using the facial feature information and/or the face contour information as the face feature information, the accuracy and efficiency of decomposing the picture component into a plurality of image frames and selecting the face image frames among them can be improved, effectively avoiding erroneous image frame division.
Preferably, in the step S2, the audio information analyzing step specifically includes,
step S201, separating the audio component from the video;
step S202, performing VAD voice activation detection on the audio component to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
step S203, carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information of the audio component;
step S204, obtaining the speaker identity confirming information corresponding to the character voice information according to all the voiceprint recognition information.
The audio component is analyzed through VAD voice activation detection and VPR voiceprint recognition processing respectively, and the accuracy of obtaining the audio characteristic information and the voiceprint recognition information can be guaranteed.
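The speaker identity determination of step S204 is commonly realized by comparing voiceprint embeddings; the sketch below shows one such scheme under stated assumptions. The embeddings would come from a trained speaker model (not shown), and the 0.7 decision threshold is an assumed illustrative value, not a figure from the patent.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify_speaker(segment_emb, enrolled, thresh=0.7):
    """Compare a speech segment's voiceprint embedding against enrolled
    speaker embeddings; return the best-matching speaker id, or None
    when no similarity reaches the (assumed) decision threshold."""
    best_id, best_score = None, thresh
    for spk_id, emb in enrolled.items():
        score = cosine(segment_emb, emb)
        if score >= best_score:
            best_id, best_score = spk_id, score
    return best_id
```

Returning `None` for below-threshold segments lets the synthesis step discard audio that cannot be attributed to the person in the effective face image frames.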
Preferably, in the step S3, the integrating and deciding process step includes,
step S301, judging whether the effective human face image frame obtained in the human face information analysis step meets the requirement of the preset human face information amount, and giving candidate labels of the effective human face image frame according to the judgment result;
step S302, according to the timestamp information and the audio characteristic information, determining the matching audio characteristic information in the time period of the effective human face image frame, and according to the matching audio characteristic information, marking the speaking state of the speaker and/or the speaking starting point information.
The above process realizes diversified and dynamic candidate labeling of the effective face image frames, improving the applicability and effectiveness of labeling in different multi-modal scenes. In addition, it ensures that the matching audio feature information is precisely matched to the effective face image frames, thereby guaranteeing the accuracy of the labeled speaking state and/or speaking starting point information.
Preferably, in the step S301, determining whether the effective face image frame obtained in the face information analysis step meets the preset face information amount requirement, and giving a candidate label to the effective face image frame according to the determination result, specifically includes,
step S3011, calculating an actual face feature information amount corresponding to the effective image frame of the face, comparing the actual face feature information amount with a preset face feature information threshold amount, if the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, determining that the effective image frame of the face meets the preset face information amount requirement, otherwise, determining that the effective image frame of the face does not meet the preset face information amount requirement;
step S3012, performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement.
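Steps S3011 and S3012 reduce to a per-frame threshold comparison, sketched below. The patent does not specify how the face feature information amount is measured; here it is assumed to be a count of detected facial landmarks, and the threshold value is purely illustrative.

```python
def candidate_labels(feature_counts, threshold):
    """Compare the actual face feature information amount per frame
    (assumed here to be a count of detected landmarks) against the
    preset threshold and emit qualified/unqualified candidate labels."""
    return {
        frame_id: ("qualified" if count >= threshold else "unqualified")
        for frame_id, count in feature_counts.items()
    }
```

A frame at exactly the threshold is labeled qualified, matching the "greater than or equal to" condition of step S3011.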
Preferably, in the step S302, determining, according to the timestamp information and the audio feature information, the matching audio feature information in the time period in which each effective face image frame is located, and labeling the speaker's speaking state and/or speaking starting point information according to the matching audio feature information, specifically includes,
step S3021, determining a coexistence time period of the valid face image frame and the audio feature information according to the timestamp information;
step S3022, determining audio feature information corresponding to the valid face image frame in the time period according to the coexistence time period, and using the determined audio feature information as the matching audio feature information;
step S3023, judging whether the matching audio feature information belongs to the character voice information and whether the speaker corresponding to the character voice information is consistent with the person in the effective face image frame, and if both conditions hold, labeling the speaker's speaking state and/or speaking starting point information.
Fig. 2 is a schematic structural diagram of an automatic multi-modal data cleaning and labeling system according to an embodiment of the present invention. The multi-modal data automatic cleaning and labeling system comprises a face information analysis module, an audio information analysis module and a synthesis and decision processing module; wherein,
the face information analysis module is used for carrying out face recognition on picture components in the video so as to obtain effective face image frames meeting preset quality conditions and corresponding timestamp information;
the audio information analysis module is used for analyzing audio components in the video so as to obtain audio characteristic information and speaker identity determination information;
the synthesis and decision processing module is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
Different from the prior-art practice of processing the image picture information and the audio information of multi-modal video data separately, the multi-modal data automatic cleaning and labeling system uses the timestamp information as the associating element to cooperatively label the image picture information and the audio information, thereby realizing complementary cooperative cleaning and labeling of the multi-modal video data and improving the multi-modal scene processing quality of the multi-modal video data.
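The cooperation of the three modules around timestamp information might be organized as sketched below; the module interfaces and the `AnnotatedVideo` container are illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedVideo:
    """Collects the outputs the three modules contribute (illustrative structure)."""
    face_frames: list = field(default_factory=list)     # effective face frames + timestamps
    audio_features: list = field(default_factory=list)  # speech / environmental-sound segments
    speaker_ids: dict = field(default_factory=dict)     # segment index -> speaker identity
    labels: list = field(default_factory=list)          # final cooperative labels

def run_pipeline(video, face_module, audio_module, decision_module):
    """Timestamps are the associating element: each module's output carries time
    information so the synthesis and decision module can align image and audio."""
    result = AnnotatedVideo()
    result.face_frames = face_module(video)                          # face information analysis
    result.audio_features, result.speaker_ids = audio_module(video)  # audio information analysis
    result.labels = decision_module(result)                          # synthesis and decision
    return result
```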
Preferably, the face information analysis module comprises a face image frame picking submodule, an effective face image frame determining submodule and a timestamp information extraction submodule; wherein:
the face image frame picking submodule is used for identifying and picking the face image frames containing the face feature information from the plurality of image frames corresponding to the picture components;
the effective face image frame determining submodule is used for determining the face image frames meeting the preset image resolution condition and/or image tone condition as effective face image frames;
the timestamp information extraction submodule is used for extracting the timestamp information corresponding to each effective face image frame from the time axis information corresponding to the plurality of image frames decomposed from the picture components.
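The picking and determining submodules amount to a filter over decomposed frames; a minimal sketch follows, in which the thresholds and the per-frame dictionary fields are assumptions for illustration, not values from the patent.

```python
def select_valid_face_frames(frames, min_width=64, min_height=64, min_brightness=40):
    """Sketch of the effective-face-image-frame determination: keep only frames
    whose detected face region meets assumed resolution and tone (brightness)
    conditions.

    frames: list of dicts with 'index', 'face_box' ((w, h) of the detected face,
    or None when no face feature information was found) and 'mean_brightness'
    (0-255). Returns the indices of the effective face image frames.
    """
    valid = []
    for f in frames:
        if f["face_box"] is None:
            continue  # picking submodule: frame contains no face feature information
        w, h = f["face_box"]
        # resolution condition and/or image tone condition
        if w >= min_width and h >= min_height and f["mean_brightness"] >= min_brightness:
            valid.append(f["index"])
    return valid
```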
By using the facial features information and/or the face contour information as the face feature information, the accuracy and efficiency of identifying and picking the face image frames from the plurality of image frames of the picture component can be improved, thereby effectively avoiding erroneous image frame selection.
Preferably, the audio information analysis module comprises a VAD voice activation detection submodule, a VPR voiceprint recognition processing submodule and a speaker identity determination information generation submodule; wherein:
the VAD voice activation detection submodule is used for performing VAD voice activation detection on the audio component so as to obtain the audio feature information of the audio component, wherein the audio feature information comprises person voice information and environmental sound information;
the VPR voiceprint recognition processing submodule is used for performing VPR voiceprint recognition processing on the audio component so as to obtain all the voiceprint recognition information of the audio component;
the speaker identity determination information generation submodule is used for generating the speaker identity determination information corresponding to the person voice information according to all the voiceprint recognition information.
Analyzing the audio component through VAD voice activation detection and VPR voiceprint recognition processing, respectively, guarantees the accuracy of the obtained audio feature information and voiceprint recognition information.
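The patent names VAD generically; the minimal energy-threshold detector below is an illustrative stand-in for a production detector, separating person voice (speech-like frames) from environmental sound or silence. The frame length and threshold are assumed values.

```python
import math

def simple_energy_vad(samples, frame_len=160, threshold=0.02):
    """Minimal energy-based VAD sketch (not the specific detector of the patent).

    samples: sequence of floats in [-1, 1]. Returns one boolean per full frame:
    True = speech-like (person voice), False = environmental sound / silence.
    """
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        # root-mean-square energy of the frame
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        flags.append(rms > threshold)
    return flags
```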
Preferably, the synthesis and decision processing module comprises an image frame candidate labeling submodule and a speaking state/speaking starting point labeling submodule; wherein:
the image frame candidate labeling submodule is used for giving out candidate labels of the effective face image frame according to the judgment result of whether the effective face image frame meets the requirement of the preset face information amount;
the speaking state/speaking starting point labeling submodule is used for labeling the speaking state and/or the speaking starting point information of the speaker according to the timestamp information and the audio characteristic information.
The image frame candidate labeling submodule can realize diversified and dynamic candidate labeling of the effective face image frames, thereby improving the labeling applicability and effectiveness for different multi-modal scenes; in addition, the speaking state/speaking starting point labeling submodule ensures that the matching audio feature information is accurately aligned with the effective face image frames, thereby ensuring the accuracy of the labeling time of the speaking state and/or speaking starting point information of the speaker.
Preferably, the image frame candidate labeling submodule comprises an actual face feature information amount calculating unit, a face feature information amount comparing unit and a candidate labeling execution unit; wherein:
the actual face feature information amount calculating unit is used for calculating the actual face feature information amount corresponding to the effective face image frame;
the face feature information amount comparing unit is used for comparing the actual face feature information amount with a preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement when the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, and otherwise determining that the effective face image frame does not meet the preset face information amount requirement;
the candidate labeling execution unit is used for performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement.
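The three units above might be sketched as follows; measuring the "information amount" as a count of detected facial landmarks, and the threshold value, are assumptions made for illustration only.

```python
def face_info_amount(landmarks):
    """Illustrative 'actual face feature information amount': the number of
    facial landmarks detected in the effective face image frame."""
    return len(landmarks)

def candidate_label(landmarks, threshold=5):
    """Comparing unit + candidate labeling execution unit: 'qualified' when the
    actual amount reaches the preset threshold amount, otherwise 'unqualified'."""
    if face_info_amount(landmarks) >= threshold:
        return "qualified"
    return "unqualified"
```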
Preferably, the speaking state/speaking starting point labeling submodule comprises a coexistence time period determining unit, a matching audio feature information determining unit and a labeling operation unit; wherein:
the coexistence time period determining unit is used for determining the coexistence time period of the effective face image frame and the audio feature information according to the timestamp information;
the matching audio feature information determining unit is used for determining, according to the coexistence time period, the corresponding audio feature information in the time period in which the effective face image frame is located, and using it as the matching audio feature information;
the labeling operation unit is used for labeling the speaking state of the speaker and/or the speaking starting point information when the matching audio feature information belongs to the person voice information and the speaker corresponding to the person voice information is consistent with the person in the effective face image frame.
As can be seen from the above embodiments, the multi-modal data automatic cleaning and labeling method and system combine two complementary kinds of information in a video, image and audio, so that they cooperate to complete the automatic cleaning and labeling of the multi-modal video data. Specifically, picture segments containing frontal face information of a person can be cleaned and labeled out of a large amount of unlabeled multi-modal video data, while the speaking state and speaking starting point information of the person in those picture segments are labeled at the same time, thereby realizing diversified labeling of the multi-modal information and improving its labeling performance.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. The method for automatically cleaning and labeling the multi-modal data is characterized by comprising the following steps of:
step S1, a human face information analysis step, which is used for carrying out human face recognition on picture components in the video so as to obtain an effective human face image frame meeting the preset quality condition and corresponding timestamp information;
step S2, an audio information analysis step, which is used for analyzing the audio component in the video to obtain the audio characteristic information and the speaker identity determination information;
and step S3, a step of synthesis and decision processing, which is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
2. The method for automatically cleansing and labeling multimodal data as recited in claim 1, wherein: in step S1, the human face information analyzing step specifically includes,
step S101, separating the picture components from the video and acquiring the face feature information to be recognized, wherein the face feature information comprises facial features information and/or face contour information;
step S102, decomposing the picture components into a plurality of image frames, and identifying and selecting the human face image frames containing the human face characteristic information from the plurality of image frames;
step S103, judging whether the human face image frame meets a preset image resolution condition and/or an image tone condition, if so, determining the corresponding human face image frame as an effective human face image frame;
and step S104, acquiring time axis information corresponding to the picture components decomposed into the plurality of image frames, and extracting time stamp information corresponding to each effective human face image frame from the time axis information.
3. The method for automatically cleansing and labeling multimodal data as recited in claim 1, wherein: in step S2, the audio information analyzing step specifically includes,
step S201, separating the audio component from the video;
step S202, performing VAD voice activation detection on the audio component so as to obtain the audio feature information of the audio component, wherein the audio feature information comprises person voice information and environmental sound information;
step S203, performing VPR voiceprint recognition processing on the audio component so as to obtain all the voiceprint recognition information of the audio component;
and step S204, obtaining the speaker identity determination information corresponding to the person voice information according to all the voiceprint recognition information.
4. The method for automatically cleansing and labeling multimodal data as recited in claim 1, wherein: in the step S3, the synthesis and decision processing step specifically includes,
step S301, judging whether the effective human face image frame obtained in the human face information analysis step meets the requirement of preset human face information quantity, and giving candidate labels of the effective human face image frame according to the judgment result;
step S302, according to the timestamp information and the audio characteristic information, determining matched audio characteristic information in a time period of the effective human face image frame, and according to the matched audio characteristic information, marking the speaking state of the speaker and/or the speaking starting point information.
5. The method for automatically cleansing and labeling multimodal data as recited in claim 4, wherein: in step S301, it is determined whether the effective face image frame obtained in the face information analysis step meets the requirement of a preset face information amount, and the candidate labeling of the effective face image frame according to the determination result specifically includes,
step S3011, calculating an actual face feature information amount corresponding to the effective face image frame, comparing the actual face feature information amount with a preset face feature information threshold amount, if the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement, otherwise, determining that the effective face image frame does not meet the preset face information amount requirement;
step S3012, performing qualified candidate labeling on the effective face image frames meeting the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames not meeting the preset face information amount requirement;
alternatively,
in the step S302, determining, according to the timestamp information and the audio feature information, the matching audio feature information in the time period in which each effective face image frame is located, and labeling the speaking state of the speaker and/or the speaking starting point information according to the matching audio feature information specifically includes,
step S3021, determining a coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
step S3022, determining audio feature information corresponding to the effective face image frame in the time period according to the coexistence time period, and using the audio feature information as the matching audio feature information;
step S3023, determining whether the matching audio feature information belongs to the person voice information and whether the speaker corresponding to the person voice information is consistent with the person in the effective face image frame, and if both conditions hold, labeling the speaking state of the speaker and/or the speaking starting point information.
6. The multi-modal data automatic cleaning and labeling system is characterized in that:
the multi-modal data automatic cleaning and labeling system comprises a face information analysis module, an audio information analysis module and a synthesis and decision processing module; wherein the content of the first and second substances,
the face information analysis module is used for carrying out face recognition on picture components in the video so as to obtain effective face image frames meeting preset quality conditions and corresponding timestamp information;
the audio information analysis module is used for analyzing audio components in the video so as to obtain audio characteristic information and speaker identity determination information;
the synthesis and decision processing module is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
7. The multi-modal data automatic cleaning and labeling system of claim 6, wherein: the face information analysis module comprises a face image frame picking submodule, an effective face image frame determining submodule and a timestamp information extraction submodule; wherein:
the face image frame picking submodule is used for identifying and picking the face image frames containing the face feature information from the plurality of image frames corresponding to the picture components;
the effective face image frame determining submodule is used for determining the face image frames meeting the preset image resolution condition and/or image tone condition as effective face image frames;
the timestamp information extraction submodule is used for extracting the timestamp information corresponding to each effective face image frame from the time axis information corresponding to the plurality of image frames decomposed from the picture components.
8. The multi-modal data automatic cleaning and labeling system of claim 6, wherein: the audio information analysis module comprises a VAD voice activation detection submodule, a VPR voiceprint recognition processing submodule and a speaker identity determination information generation submodule; wherein:
the VAD voice activation detection submodule is used for performing VAD voice activation detection on the audio component so as to obtain the audio feature information of the audio component, wherein the audio feature information comprises person voice information and environmental sound information;
the VPR voiceprint recognition processing submodule is used for performing VPR voiceprint recognition processing on the audio component so as to obtain all the voiceprint recognition information of the audio component;
and the speaker identity determination information generation submodule is used for generating the speaker identity determination information corresponding to the person voice information according to all the voiceprint recognition information.
9. The multi-modal data automatic cleaning and labeling system of claim 6, wherein: the synthesis and decision processing module comprises an image frame candidate labeling submodule and a speaking state/speaking starting point labeling submodule; wherein:
the image frame candidate labeling submodule is used for giving out candidate labels of the effective face image frame according to the judgment result of whether the effective face image frame meets the requirement of the preset face information amount;
and the speaking state/speaking starting point labeling submodule is used for labeling the speaking state and/or the speaking starting point information of the speaker according to the timestamp information and the audio characteristic information.
10. The multi-modal data automatic cleaning and labeling system of claim 6, wherein: the image frame candidate labeling submodule comprises an actual face feature information amount calculating unit, a face feature information amount comparing unit and a candidate labeling execution unit; wherein:
the actual face feature information amount calculating unit is used for calculating the actual face feature information amount corresponding to the effective face image frame;
the face feature information amount comparing unit is used for comparing the actual face feature information amount with a preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement when the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, and otherwise determining that the effective face image frame does not meet the preset face information amount requirement;
the candidate labeling execution unit is used for performing qualified candidate labeling on the effective face image frames meeting the preset face information amount requirement and performing unqualified candidate labeling on the effective face image frames not meeting the preset face information amount requirement;
alternatively,
the speaking state/speaking starting point labeling submodule comprises a coexistence time period determining unit, a matching audio feature information determining unit and a labeling operation unit; wherein:
the coexistence time period determining unit is used for determining the coexistence time period of the effective face image frame and the audio feature information according to the timestamp information;
the matching audio feature information determining unit is used for determining, according to the coexistence time period, the corresponding audio feature information in the time period in which the effective face image frame is located, and using it as the matching audio feature information;
and the labeling operation unit is used for labeling the speaking state of the speaker and/or the speaking starting point information when the matching audio feature information belongs to the person voice information and the speaker corresponding to the person voice information is consistent with the person in the effective face image frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010525080.8A CN111767805A (en) | 2020-06-10 | 2020-06-10 | Multi-mode data automatic cleaning and labeling method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111767805A true CN111767805A (en) | 2020-10-13 |
Family
ID=72720474
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010525080.8A Pending CN111767805A (en) | 2020-06-10 | 2020-06-10 | Multi-mode data automatic cleaning and labeling method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111767805A (en) |
2020-06-10: Application CN202010525080.8A filed in CN; publication CN111767805A; legal status: active, Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018113526A1 (en) * | 2016-12-20 | 2018-06-28 | 四川长虹电器股份有限公司 | Face recognition and voiceprint recognition-based interactive authentication system and method |
CN109660744A (en) * | 2018-10-19 | 2019-04-19 | 深圳壹账通智能科技有限公司 | The double recording methods of intelligence, equipment, storage medium and device based on big data |
CN109671438A (en) * | 2019-01-28 | 2019-04-23 | 武汉恩特拉信息技术有限公司 | It is a kind of to provide the device and method of ancillary service using voice |
CN110909613A (en) * | 2019-10-28 | 2020-03-24 | Oppo广东移动通信有限公司 | Video character recognition method and device, storage medium and electronic equipment |
CN110881115A (en) * | 2019-12-24 | 2020-03-13 | 新华智云科技有限公司 | Strip splitting method and system for conference video |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117174092A (en) * | 2023-11-02 | 2023-12-05 | 北京语言大学 | Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis |
CN117174092B (en) * | 2023-11-02 | 2024-01-26 | 北京语言大学 | Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11315366B2 (en) | Conference recording method and data processing device employing the same | |
WO2021082941A1 (en) | Video figure recognition method and apparatus, and storage medium and electronic device | |
US7920761B2 (en) | Multimodal identification and tracking of speakers in video | |
US20190259388A1 (en) | Speech-to-text generation using video-speech matching from a primary speaker | |
CN111050201B (en) | Data processing method and device, electronic equipment and storage medium | |
CN110598008B (en) | Method and device for detecting quality of recorded data and storage medium | |
US11954912B2 (en) | Method for cutting video based on text of the video and computing device applying method | |
CN111901627B (en) | Video processing method and device, storage medium and electronic equipment | |
CN109714608A (en) | Video data handling procedure, device, computer equipment and storage medium | |
CN113242361A (en) | Video processing method and device and computer readable storage medium | |
CN114598933B (en) | Video content processing method, system, terminal and storage medium | |
CN111767805A (en) | Multi-mode data automatic cleaning and labeling method and system | |
CN111626061A (en) | Conference record generation method, device, equipment and readable storage medium | |
CN114239610A (en) | Multi-language speech recognition and translation method and related system | |
CN115497017A (en) | Broadcast television news stripping method and device based on artificial intelligence | |
EP4068282A1 (en) | Method for processing conference data and related device | |
CN111221987A (en) | Hybrid audio tagging method and apparatus | |
CN111161710A (en) | Simultaneous interpretation method and device, electronic equipment and storage medium | |
CN115831124A (en) | Conference record role separation system and method based on voiceprint recognition | |
CN113411517B (en) | Video template generation method and device, electronic equipment and storage medium | |
CN115499677A (en) | Audio and video synchronization detection method and device based on live broadcast | |
CN114155845A (en) | Service determination method and device, electronic equipment and storage medium | |
CN113194333A (en) | Video clipping method, device, equipment and computer readable storage medium | |
CN114125365A (en) | Video conference method, device and readable storage medium | |
CN111985400A (en) | Face living body identification method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||