CN111767805A - Multi-mode data automatic cleaning and labeling method and system


Info

Publication number
CN111767805A
Authority
CN
China
Prior art keywords: information, image frame, audio, face image, effective
Prior art date
Legal status: Pending
Application number
CN202010525080.8A
Other languages
Chinese (zh)
Inventor
刘青松 (Liu Qingsong)
胡炳然 (Hu Bingran)
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202010525080.8A
Publication of CN111767805A
Status: Pending

Classifications

    • G06V 40/168: Human faces - Feature extraction; Face representation
    • G06F 16/735: Information retrieval of video data - Querying - Filtering based on additional data, e.g. user or group profiles
    • G06F 16/7834: Information retrieval of video data - Retrieval characterised by metadata automatically derived from the content, using audio features
    • G06F 16/7847: Information retrieval of video data - Retrieval characterised by metadata automatically derived from the content, using low-level visual features of the video content
    • G06V 40/172: Human faces - Classification, e.g. identification
    • G10L 17/00: Speaker identification or verification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a method and a system for automatically cleaning and labeling multi-modal data, which combine the two complementary kinds of information in video, images and audio, so that they cooperate to complete the automatic cleaning and labeling of multi-modal video data.

Description

Multi-mode data automatic cleaning and labeling method and system
Technical Field
The invention relates to the technical field of multi-modal data processing, in particular to an automatic multi-modal data cleaning and labeling method and system.
Background
Video data is multi-modal data: it simultaneously contains the data of two single modalities, image information and audio information. At present, video-processing pipelines handle the image information and the audio information independently. However, because the two modalities are correlated in time and in scene content, such independent processing cannot cooperatively label the image pictures and the audio in a multi-modal scene such as video, cannot provide diversified labeling information, and cannot use the multi-modal information to improve the labeling performance of any single modal dimension.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method and system for automatically cleaning and labeling multi-modal data. The method and system obtain effective human face image frames meeting a preset quality condition, together with their corresponding timestamp information, through human face information analysis; obtain audio characteristic information and speaker identity determination information through audio information analysis; and label the speaking state and/or speaking starting point information of the speaker in the video through synthesis and decision processing. The method and system thereby combine the two complementary kinds of information in video, images and audio, so that they cooperate to complete the automatic cleaning and labeling of the multi-modal video data. In particular, picture segments containing frontal face information of a person can be cleaned and labeled out of massive unlabeled multi-modal video data, while the speaking state and speaking starting point information of the person in each segment are labeled at the same time, realizing diversified labeling of the multi-modal information and improving its labeling performance.
The invention provides a multi-modal data automatic cleaning and labeling method, which is characterized by comprising the following steps:
step S1, a human face information analysis step, which is used for carrying out human face recognition on picture components in the video so as to obtain an effective human face image frame meeting the preset quality condition and corresponding timestamp information;
step S2, an audio information analysis step, which is used for analyzing the audio component in the video to obtain the audio characteristic information and the speaker identity determination information;
step S3, a synthesis and decision processing step, which is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information;
further, in step S1, the human face information analyzing step specifically includes,
step S101, separating the picture components from the video and acquiring face feature information to be recognized, wherein the face feature information comprises facial five-sense-organ information and/or face contour information;
step S102, decomposing the picture components into a plurality of image frames, and identifying and selecting the human face image frames containing the human face characteristic information from the plurality of image frames;
step S103, judging whether the human face image frame meets a preset image resolution condition and/or an image tone condition, if so, determining the corresponding human face image frame as an effective human face image frame;
step S104, acquiring time axis information corresponding to the picture components decomposed into the plurality of image frames, and extracting timestamp information corresponding to each effective human face image frame from the time axis information;
further, in the step S2, the audio information analyzing step specifically includes,
step S201, separating the audio component from the video;
step S202, performing VAD voice activity detection on the audio component to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
step S203, carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information of the audio component;
step S204, obtaining speaker identity determination information corresponding to the character voice information according to all the voiceprint identification information;
further, in the step S3, the synthesis and decision processing step specifically includes,
step S301, judging whether the effective human face image frame obtained in the human face information analysis step meets the requirement of preset human face information quantity, and giving candidate labels of the effective human face image frame according to the judgment result;
step S302, according to the timestamp information and the audio characteristic information, determining matched audio characteristic information in a time period in which the effective human face image frame is positioned, and according to the matched audio characteristic information, marking the speaking state of the speaker and/or the speaking starting point information;
further, in the step S301, judging whether the effective face image frame obtained in the face information analysis step meets the preset face information amount requirement and giving candidate labels to the effective face image frame according to the judgment result specifically includes,
step S3011, calculating an actual face feature information amount corresponding to the effective face image frame, comparing the actual face feature information amount with a preset face feature information threshold amount, if the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement, otherwise, determining that the effective face image frame does not meet the preset face information amount requirement;
step S3012, performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement;
alternatively,
in the step S302, the matching audio feature information in the time period in which each effective human face image frame is located is determined according to the timestamp information and the audio feature information, and the speaking state of the speaker and/or the speaking starting point information is labeled according to the matching audio feature information, which specifically includes,
step S3021, determining a coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
step S3022, determining audio feature information corresponding to the effective face image frame in the time period according to the coexistence time period, and using the audio feature information as the matching audio feature information;
step S3023, judging whether the matching audio information belongs to the character voice information and whether the speaker corresponding to the character voice information is consistent with the person in the effective human face image frame, and if both are true, labeling the speaking state of the speaker and/or the speaking starting point information.
The invention further provides a multi-modal data automatic cleaning and labeling system, which is characterized in that:
the multi-modal data automatic cleaning and labeling system comprises a face information analysis module, an audio information analysis module and a synthesis and decision processing module; wherein the content of the first and second substances,
the face information analysis module is used for carrying out face recognition on picture components in the video so as to obtain effective face image frames meeting preset quality conditions and corresponding timestamp information;
the audio information analysis module is used for analyzing audio components in the video so as to obtain audio characteristic information and speaker identity determination information;
the synthesis and decision processing module is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information;
further, the face information analysis module comprises a face image frame picking submodule, an effective face image frame determining submodule and a timestamp information extraction submodule; wherein:
the human face image frame picking submodule is used for identifying and picking human face image frames containing the human face characteristic information from a plurality of image frames corresponding to the image components;
the effective human face image frame determining submodule is used for determining the human face image frame meeting the preset image resolution condition and/or the image tone condition as an effective human face image frame;
the timestamp information extraction submodule is used for extracting the timestamp information corresponding to each effective human face image frame from the time axis information corresponding to the picture components decomposed into the plurality of image frames;
further, the audio information analysis module comprises a VAD voice activity detection submodule, a VPR voiceprint recognition processing submodule and a speaker identity determination information generation submodule; wherein:
the VAD voice activity detection submodule is used for carrying out VAD voice activity detection on the audio component so as to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
the VPR voiceprint recognition processing submodule is used for carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information about the audio component;
the speaker identity confirming information generating submodule is used for generating speaker identity confirming information corresponding to the character voice information according to all the voiceprint recognition information;
furthermore, the synthesis and decision processing module comprises an image frame candidate labeling submodule and a speaking state/speaking starting point labeling submodule; wherein:
the image frame candidate labeling submodule is used for giving out candidate labels of the effective face image frame according to the judgment result of whether the effective face image frame meets the requirement of the preset face information amount;
the speaking state/speaking starting point labeling submodule is used for labeling the speaking state and/or the speaking starting point information of the speaker according to the timestamp information and the audio characteristic information;
further, the image frame candidate labeling submodule comprises an actual face feature information amount calculating unit, a face feature information amount comparing unit and a candidate labeling execution unit; wherein:
the actual face feature information amount calculating unit is used for calculating actual face feature information amount corresponding to the effective face image frame;
the face characteristic information quantity comparison unit is used for comparing the actual face characteristic information quantity with a preset face characteristic information threshold quantity, and when the actual face characteristic information quantity is greater than or equal to the preset face characteristic information threshold quantity, determining that the effective face image frame meets the preset face information quantity requirement, otherwise, determining that the effective face image frame does not meet the preset face information quantity requirement;
the candidate labeling execution unit is used for performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement;
alternatively,
the speaking state/speaking starting point labeling submodule comprises a coexistence time period determining unit, a matching audio characteristic information determining unit and a labeling operation unit; wherein:
the coexistence time period determining unit is used for determining the coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
the matching audio characteristic information determining unit is used for determining corresponding audio characteristic information in a time period in which the effective human face image frame is positioned according to the coexistence time period, and the audio characteristic information is used as the matching audio characteristic information;
and the labeling operation unit is used for labeling the speaking state of the speaker and/or the speaking starting point information when the matching audio information belongs to the character voice information and the speaker corresponding to the character voice information is consistent with the person in the effective human face image frame.
Compared with the prior art, the multi-modal data automatic cleaning and labeling method and system obtain the effective human face image frames meeting the preset quality condition, together with their corresponding timestamp information, through human face information analysis; obtain the audio characteristic information and the speaker identity determination information through audio information analysis; and label the speaking state and/or speaking starting point information of the speaker in the video through synthesis and decision processing. The method and system thereby combine the two complementary kinds of information in video, images and audio, so that they cooperate to complete the automatic cleaning and labeling of the multi-modal video data. In particular, picture segments containing frontal face information of a person can be cleaned and labeled out of massive unlabeled multi-modal video data, while the speaking state and speaking starting point information of the person in each segment are labeled at the same time, realizing diversified labeling of the multi-modal information and improving its labeling performance.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flow chart of an automatic multi-modal data cleaning and labeling method provided by the present invention.
FIG. 2 is a schematic structural diagram of the multi-modal data automatic cleaning and labeling system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of an automatic multi-modal data cleaning and labeling method according to an embodiment of the present invention. The multi-modal data automatic cleaning and labeling method comprises the following steps:
step S1, a human face information analysis step, which is used for carrying out human face recognition on picture components in the video so as to obtain an effective human face image frame meeting the preset quality condition and corresponding timestamp information;
step S2, an audio information analysis step, which is used for analyzing the audio component in the video to obtain the audio characteristic information and the speaker identity determination information;
step S3, a step of synthesis and decision processing, which is used to label the speaking status and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
Different from the prior-art approach of separately processing the image picture information and the audio information of multi-modal video data, the multi-modal data automatic cleaning and labeling method cooperatively labels the image picture information and the audio information by taking the timestamp information as the associating element, thereby realizing complementary, cooperative cleaning and labeling of the multi-modal video data and improving its processing quality in multi-modal scenes.
Preferably, in the step S1, the face information analyzing step specifically includes,
step S101, separating the picture component from the video and obtaining face feature information to be recognized, wherein the face feature information comprises facial five-sense-organ information and/or face contour information;
step S102, decomposing the picture component into a plurality of image frames, and identifying and selecting the human face image frame containing the human face characteristic information from the plurality of image frames;
step S103, judging whether the human face image frame meets a preset image resolution condition and/or an image tone condition, if so, determining the corresponding human face image frame as an effective human face image frame;
and step S104, acquiring time axis information corresponding to the picture component decomposed into the plurality of image frames, and extracting time stamp information corresponding to each effective human face image frame from the time axis information.
Using the facial five-sense-organ information and/or the face contour information as the face feature information improves the accuracy and efficiency of picking the human face image frames out of the plurality of image frames of the picture component, effectively avoiding wrong image frame selection.
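As one concrete illustration of steps S101 to S104 (not prescribed by the patent), the Python sketch below uses OpenCV's stock Haar-cascade face detector to pick face image frames out of a decoded video, applies a minimum face size as a stand-in for the preset quality condition, and reads each kept frame's timestamp from the decoder's time axis. The detector choice, the size threshold and the sampling stride are assumptions of this sketch.

    # Hypothetical sketch of steps S101-S104 (assumed detector and thresholds).
    import cv2

    def extract_valid_face_frames(video_path, min_face_px=80, frame_stride=5):
        """Return (frame, timestamp_sec) pairs for frames holding a usable face."""
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        cap = cv2.VideoCapture(video_path)
        valid, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % frame_stride == 0:  # subsample the picture component
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                faces = detector.detectMultiScale(gray, scaleFactor=1.1,
                                                  minNeighbors=5)
                # Stand-in for the preset quality condition: a face size floor.
                if any(w >= min_face_px and h >= min_face_px
                       for (_, _, w, h) in faces):
                    ts = cap.get(cv2.CAP_PROP_POS_MSEC) / 1000.0  # time axis, s
                    valid.append((frame, ts))
            index += 1
        cap.release()
        return valid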
Preferably, in the step S2, the audio information analyzing step specifically includes,
step S201, separating the audio component from the video;
step S202, performing VAD voice activity detection on the audio component to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
step S203, carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information of the audio component;
step S204, obtaining the speaker identity confirming information corresponding to the character voice information according to all the voiceprint recognition information.
Analyzing the audio component through VAD voice activity detection and through VPR voiceprint recognition processing, respectively, ensures the accuracy of the obtained audio characteristic information and voiceprint recognition information.
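To make the VAD stage of step S202 concrete, here is a minimal sketch built on the open-source webrtcvad package; the 16 kHz mono 16-bit PCM input format, the 30 ms frame length and the aggressiveness mode are assumptions of this sketch, not requirements of the patent. The spans it returns approximate the character voice information, with everything outside them treated as environmental sound.

    # Hypothetical VAD sketch for step S202 (assumes 16 kHz mono 16-bit PCM).
    import webrtcvad

    def speech_segments(pcm_bytes, sample_rate=16000, frame_ms=30, mode=2):
        """Return (start_sec, end_sec) spans the VAD classifies as speech."""
        vad = webrtcvad.Vad(mode)  # 0 = least aggressive ... 3 = most aggressive
        frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
        segments, start = [], None
        n_frames = len(pcm_bytes) // frame_bytes
        for i in range(n_frames):
            frame = pcm_bytes[i * frame_bytes:(i + 1) * frame_bytes]
            t = i * frame_ms / 1000.0
            if vad.is_speech(frame, sample_rate):
                if start is None:
                    start = t  # candidate speaking starting point
            elif start is not None:
                segments.append((start, t))
                start = None
        if start is not None:
            segments.append((start, n_frames * frame_ms / 1000.0))
        return segments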
Preferably, in the step S3, the synthesis and decision processing step specifically includes,
step S301, judging whether the effective human face image frame obtained in the human face information analysis step meets the requirement of the preset human face information amount, and giving candidate labels of the effective human face image frame according to the judgment result;
step S302, according to the timestamp information and the audio characteristic information, determining the matching audio characteristic information in the time period of the effective human face image frame, and according to the matching audio characteristic information, marking the speaking state of the speaker and/or the speaking starting point information.
The above process realizes diversified and dynamic candidate labeling of the effective face image frames, improving the applicability and effectiveness of the labeling in different multi-modal scenes; in addition, it ensures that the matching audio characteristic information is precisely aligned with the effective human face image frames, thereby ensuring accurate timing for the labels of the speaker's speaking state and/or speaking starting point information.
Preferably, in the step S301, judging whether the effective face image frame obtained in the face information analysis step meets the preset face information amount requirement and giving candidate labels to the effective face image frame according to the judgment result specifically includes,
step S3011, calculating the actual face feature information amount corresponding to the effective face image frame and comparing it with the preset face feature information threshold amount; if the actual amount is greater than or equal to the threshold amount, determining that the effective face image frame meets the preset face information amount requirement, otherwise determining that it does not;
step S3012, performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement.
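One possible reading of steps S3011 and S3012, sketched below, takes the number of detected facial landmarks as the actual face feature information amount; both that choice and the threshold value are illustrative assumptions rather than anything fixed by the patent.

    # Hypothetical sketch of steps S3011-S3012. Treating the landmark count as
    # the "face feature information amount" is an assumption of this sketch.
    FEATURE_THRESHOLD = 48  # assumed preset face feature information threshold

    def candidate_label(landmarks):
        """Qualified/unqualified candidate label for one effective face frame."""
        return "qualified" if len(landmarks) >= FEATURE_THRESHOLD else "unqualified"

For example, a frame carrying a standard 68-point landmark set would be labeled qualified under this assumed threshold, while a frame where occlusion leaves only a handful of detectable landmarks would be labeled unqualified.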
Preferably, in the step S302, determining the matching audio feature information in the time period in which each effective human face image frame is located according to the timestamp information and the audio characteristic information, and labeling the speaking state of the speaker and/or the speaking starting point information according to the matching audio characteristic information, specifically includes,
step S3021, determining the coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
step S3022, determining, according to the coexistence time period, the audio feature information corresponding to the effective human face image frame in that time period, and using it as the matching audio feature information;
step S3023, judging whether the matching audio information belongs to the character voice information and whether the speaker corresponding to the character voice information is consistent with the person in the effective human face image frame, and if both are true, labeling the speaking state of the speaker and/or the speaking starting point information.
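The decision logic of steps S3021 to S3023 amounts to an interval check between face-frame timestamps and detected speech segments, followed by an identity comparison. The sketch below assumes the data shapes produced by the earlier sketches and represents identities as opaque labels; it is an illustration, not the patent's specified implementation.

    # Hypothetical sketch of steps S3021-S3023: interval matching plus an
    # identity consistency check. Data shapes follow the earlier sketches.
    def label_speaking(face_frames, speech_segments, face_id, speaker_id):
        """face_frames: [(frame, ts_sec)]; speech_segments: [(start, end)].
        Returns (speaking timestamps, speaking starting point) or (None, None)."""
        # S3023: the recognized speaker must match the person in the frames.
        if face_id != speaker_id:
            return None, None
        # S3021-S3022: frames whose timestamps fall inside any speech segment
        # carry matching audio feature information.
        speaking = [ts for _, ts in face_frames
                    if any(s <= ts <= e for s, e in speech_segments)]
        onset = min(speaking) if speaking else None  # speaking starting point
        return speaking, onset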
Fig. 2 is a schematic structural diagram of an automatic multi-modal data cleaning and labeling system according to an embodiment of the present invention. The multi-modal data automatic cleaning and labeling system comprises a face information analysis module, an audio information analysis module and a synthesis and decision processing module; wherein:
the face information analysis module is used for carrying out face recognition on picture components in the video so as to obtain effective face image frames meeting preset quality conditions and corresponding timestamp information;
the audio information analysis module is used for analyzing audio components in the video so as to obtain audio characteristic information and speaker identity determination information;
the synthesis and decision processing module is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
Different from the prior-art approach of separately processing the image picture information and the audio information of multi-modal video data, the multi-modal data automatic cleaning and labeling system cooperatively labels the image picture information and the audio information by taking the timestamp information as the associating element, thereby realizing complementary, cooperative cleaning and labeling of the multi-modal video data and improving its processing quality in multi-modal scenes.
Preferably, the face information analysis module comprises a face image frame picking submodule, an effective face image frame determining submodule and a timestamp information extraction submodule; wherein:
the human face image frame picking submodule is used for identifying and picking the human face image frame containing the human face characteristic information from a plurality of image frames corresponding to the image components;
the effective human face image frame determining submodule is used for determining the human face image frame meeting the preset image resolution condition and/or the image tone condition as an effective human face image frame;
the timestamp information extraction submodule is used for extracting the timestamp information corresponding to each effective human face image frame from the time axis information corresponding to the plurality of image frames decomposed from the picture components.
Using the facial five-sense-organ information and/or the face contour information as the face feature information improves the accuracy and efficiency of picking the human face image frames out of the plurality of image frames, effectively avoiding wrong image frame selection.
Preferably, the audio information analysis module comprises a VAD voice activity detection submodule, a VPR voiceprint recognition processing submodule and a speaker identity determination information generation submodule; wherein:
the VAD voice activity detection submodule is used for carrying out VAD voice activity detection on the audio component so as to obtain audio characteristic information related to the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
the VPR voiceprint recognition processing submodule is used for carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information about the audio component;
the speaker identity confirming information generating submodule is used for generating speaker identity confirming information corresponding to the character voice information according to all the voiceprint recognition information.
Analyzing the audio component through VAD voice activity detection and through VPR voiceprint recognition processing, respectively, ensures the accuracy of the obtained audio characteristic information and voiceprint recognition information.
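For the voiceprint side, a common pattern (assumed here, not specified by the patent) is to compare a speaker embedding of the utterance against enrolled reference embeddings by cosine similarity; the speaker identity determination information is then the best-matching enrolled identity, if any. The embedding model itself and the acceptance threshold below are assumptions of this sketch.

    # Hypothetical sketch of speaker identity determination via voiceprint
    # embeddings. How the embeddings are produced is assumed out of scope.
    import numpy as np

    def identify_speaker(utterance_emb, enrolled, threshold=0.75):
        """enrolled: dict of speaker name -> reference embedding (1-D ndarray).
        Returns the best-matching name, or None if no score clears threshold."""
        if not enrolled:
            return None
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        scores = {n: cosine(utterance_emb, ref) for n, ref in enrolled.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] >= threshold else None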
Preferably, the synthesis and decision processing module comprises an image frame candidate labeling submodule and a speaking state/speaking starting point labeling submodule; wherein:
the image frame candidate labeling submodule is used for giving out candidate labels of the effective face image frame according to the judgment result of whether the effective face image frame meets the requirement of the preset face information amount;
the speaking state/speaking starting point labeling submodule is used for labeling the speaking state and/or the speaking starting point information of the speaker according to the timestamp information and the audio characteristic information.
The image frame candidate labeling submodule realizes diversified and dynamic candidate labeling of the effective face image frames, improving the applicability and effectiveness of the labeling in different multi-modal scenes; in addition, the speaking state/speaking starting point labeling submodule ensures that the matching audio characteristic information is precisely aligned with the effective human face image frames, thereby ensuring accurate timing for the labels of the speaker's speaking state and/or speaking starting point information.
Preferably, the image frame candidate labeling submodule comprises an actual face feature information amount calculating unit, a face feature information amount comparing unit and a candidate labeling execution unit; wherein:
the actual face characteristic information quantity calculating unit is used for calculating the actual face characteristic information quantity corresponding to the effective face image frame;
the face characteristic information quantity comparison unit is used for comparing the actual face characteristic information quantity with a preset face characteristic information threshold quantity, and determining that the effective face image frame meets the preset face information quantity requirement when the actual face characteristic information quantity is greater than or equal to the preset face characteristic information threshold quantity, or else, determining that the effective face image frame does not meet the preset face information quantity requirement;
the candidate labeling execution unit is used for performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement.
Preferably, the speaking state/speaking starting point labeling submodule comprises a coexistence time period determining unit, a matching audio characteristic information determining unit and a labeling operation unit; wherein:
the coexistence time period determination unit is used for determining the coexistence time period of the effective human face image frame and the audio characteristic information according to the timestamp information;
the matching audio characteristic information determining unit is used for determining corresponding audio characteristic information in a time period in which the effective human face image frame is positioned according to the coexistence time period, and the audio characteristic information is used as the matching audio characteristic information;
the labeling operation unit is used for labeling the speaking state of the speaker and/or the speaking starting point information when the matching audio information belongs to the character voice information and the speaker corresponding to the character voice information is consistent with the person in the effective human face image frame.
As can be seen from the above embodiments, the multi-modal data automatic cleaning and labeling method and system combine the two complementary kinds of information in video, images and audio, so that they cooperate to complete the automatic cleaning and labeling of the multi-modal video data. In particular, picture segments containing frontal face information of a person can be cleaned and labeled out of massive unlabeled multi-modal video data, while the speaking state and speaking starting point information of the person in each segment are labeled at the same time, realizing diversified labeling of the multi-modal information and improving its labeling performance.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. The method for automatically cleaning and labeling the multi-modal data is characterized by comprising the following steps of:
step S1, a human face information analysis step, which is used for carrying out human face recognition on picture components in the video so as to obtain an effective human face image frame meeting the preset quality condition and corresponding timestamp information;
step S2, an audio information analysis step, which is used for analyzing the audio component in the video to obtain the audio characteristic information and the speaker identity determination information;
and step S3, a step of synthesis and decision processing, which is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
2. The method for automatically cleaning and labeling multi-modal data as recited in claim 1, wherein: in the step S1, the human face information analyzing step specifically includes,
step S101, separating the picture components from the video and acquiring face feature information to be recognized, wherein the face feature information comprises facial five-sense-organ information and/or face contour information;
step S102, decomposing the picture components into a plurality of image frames, and identifying and selecting the human face image frames containing the human face characteristic information from the plurality of image frames;
step S103, judging whether the human face image frame meets a preset image resolution condition and/or an image tone condition, if so, determining the corresponding human face image frame as an effective human face image frame;
and step S104, acquiring time axis information corresponding to the picture components decomposed into the plurality of image frames, and extracting time stamp information corresponding to each effective human face image frame from the time axis information.
3. The method for automatically cleaning and labeling multi-modal data as recited in claim 1, wherein: in the step S2, the audio information analyzing step specifically includes,
step S201, separating the audio component from the video;
step S202, performing VAD voice activity detection on the audio component to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
step S203, carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information of the audio component;
and step S204, obtaining speaker identity determination information corresponding to the character voice information according to all the voiceprint identification information.
4. The method for automatically cleaning and labeling multi-modal data as recited in claim 1, wherein: in the step S3, the synthesis and decision processing step specifically includes,
step S301, judging whether the effective human face image frame obtained in the human face information analysis step meets the requirement of preset human face information quantity, and giving candidate labels of the effective human face image frame according to the judgment result;
step S302, according to the timestamp information and the audio characteristic information, determining matched audio characteristic information in a time period of the effective human face image frame, and according to the matched audio characteristic information, marking the speaking state of the speaker and/or the speaking starting point information.
5. The method for automatically cleaning and labeling multi-modal data as recited in claim 4, wherein: in the step S301, judging whether the effective face image frame obtained in the face information analysis step meets the preset face information amount requirement and giving candidate labels to the effective face image frame according to the judgment result specifically includes,
step S3011, calculating an actual face feature information amount corresponding to the effective face image frame, comparing the actual face feature information amount with a preset face feature information threshold amount, if the actual face feature information amount is greater than or equal to the preset face feature information threshold amount, determining that the effective face image frame meets the preset face information amount requirement, otherwise, determining that the effective face image frame does not meet the preset face information amount requirement;
step S3012, performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement, and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement;
alternatively,
in the step S302, the matching audio feature information in the time period in which each effective face image frame is located is determined according to the timestamp information and the audio feature information, and the speaking state of the speaker and/or the speaking starting point information is labeled according to the matching audio feature information, which specifically includes,
step S3021, determining a coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
step S3022, determining audio feature information corresponding to the effective face image frame in the time period according to the coexistence time period, and using the audio feature information as the matching audio feature information;
step S3023, judging whether the matching audio information belongs to the character voice information and whether the speaker corresponding to the character voice information is consistent with the person in the effective human face image frame, and if both are true, labeling the speaking state of the speaker and/or the speaking starting point information.
6. The multi-modal data automatic cleaning and labeling system is characterized in that:
the multi-modal data automatic cleaning and labeling system comprises a face information analysis module, an audio information analysis module and a synthesis and decision processing module; wherein:
the face information analysis module is used for carrying out face recognition on picture components in the video so as to obtain effective face image frames meeting preset quality conditions and corresponding timestamp information;
the audio information analysis module is used for analyzing audio components in the video so as to obtain audio characteristic information and speaker identity determination information;
the synthesis and decision processing module is used for labeling the speaking state and/or the speaking starting point information of the speaker in the video according to the human face image frame, the timestamp information, the audio characteristic information and the speaker identity determination information.
7. The multi-modal data automatic cleaning and labeling system of claim 6, wherein the face information analysis module comprises a face image frame picking submodule, an effective face image frame determining submodule and a timestamp information extraction submodule, and wherein:
the human face image frame picking submodule is used for identifying and picking human face image frames containing the human face characteristic information from a plurality of image frames corresponding to the image components;
the effective human face image frame determining submodule is used for determining the human face image frame meeting the preset image resolution condition and/or the image tone condition as an effective human face image frame;
the timestamp information extraction submodule is used for extracting the timestamp information corresponding to each effective human face image frame from the time axis information corresponding to the picture components decomposed into the plurality of image frames.
8. The multi-modal data automatic cleaning and labeling system of claim 6, wherein the audio information analysis module comprises a VAD voice activity detection submodule, a VPR voiceprint recognition processing submodule and a speaker identity determination information generation submodule, and wherein:
the VAD voice activity detection submodule is used for carrying out VAD voice activity detection on the audio component so as to obtain audio characteristic information about the audio component, wherein the audio characteristic information comprises character voice information and environmental sound information;
the VPR voiceprint recognition processing submodule is used for carrying out VPR voiceprint recognition processing on the audio component so as to obtain all voiceprint recognition information about the audio component;
and the speaker identity confirming information generating submodule is used for generating speaker identity confirming information corresponding to the character voice information according to all the voiceprint recognition information.
9. The multi-modal data automatic cleaning and labeling system of claim 6, wherein the synthesis and decision processing module comprises an image frame candidate labeling submodule and a speaking state/speaking starting point labeling submodule, and wherein:
the image frame candidate labeling submodule is used for giving out candidate labels of the effective face image frame according to the judgment result of whether the effective face image frame meets the requirement of the preset face information amount;
and the speaking state/speaking starting point labeling submodule is used for labeling the speaking state and/or the speaking starting point information of the speaker according to the timestamp information and the audio characteristic information.
10. The multi-modal data automatic cleaning and labeling system of claim 6, wherein the image frame candidate labeling submodule comprises an actual human face characteristic information quantity calculating unit, a human face characteristic information quantity comparing unit and a candidate labeling execution unit, and wherein:
the actual face feature information amount calculating unit is used for calculating actual face feature information amount corresponding to the effective face image frame;
the face characteristic information quantity comparison unit is used for comparing the actual face characteristic information quantity with a preset face characteristic information threshold quantity, and when the actual face characteristic information quantity is greater than or equal to the preset face characteristic information threshold quantity, determining that the effective face image frame meets the preset face information quantity requirement, otherwise, determining that the effective face image frame does not meet the preset face information quantity requirement;
the candidate labeling execution unit is used for performing qualified candidate labeling on the effective face image frames that meet the preset face information amount requirement and performing unqualified candidate labeling on the effective face image frames that do not meet the preset face information amount requirement;
alternatively,
the speaking state/speaking starting point labeling submodule comprises a coexistence time period determining unit, a matching audio characteristic information determining unit and a labeling operation unit; wherein:
the coexistence time period determining unit is used for determining the coexistence time period of the effective human face image frame and the audio feature information according to the timestamp information;
the matching audio characteristic information determining unit is used for determining corresponding audio characteristic information in a time period in which the effective human face image frame is positioned according to the coexistence time period, and the audio characteristic information is used as the matching audio characteristic information;
and the labeling operation unit is used for labeling the speaking state of the speaker and/or the speaking starting point information when the matching audio information belongs to the character voice information and the speaker corresponding to the character voice information is consistent with the person in the effective human face image frame.
CN202010525080.8A, filed 2020-06-10 (priority date 2020-06-10): Multi-mode data automatic cleaning and labeling method and system. Published as CN111767805A; pending.

Priority Applications (1)

Application Number: CN202010525080.8A
Priority Date: 2020-06-10
Filing Date: 2020-06-10
Title: Multi-mode data automatic cleaning and labeling method and system

Applications Claiming Priority (1)

Application Number: CN202010525080.8A
Priority Date: 2020-06-10
Filing Date: 2020-06-10
Title: Multi-mode data automatic cleaning and labeling method and system

Publications (1)

Publication Number: CN111767805A
Publication Date: 2020-10-13

Family

ID=72720474

Family Applications (1)

Application Number: CN202010525080.8A
Title: Multi-mode data automatic cleaning and labeling method and system
Status: Pending (published as CN111767805A)
Priority Date: 2020-06-10
Filing Date: 2020-06-10

Country Status (1)

Country Link
CN (1) CN111767805A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018113526A1 (en) * 2016-12-20 2018-06-28 四川长虹电器股份有限公司 Face recognition and voiceprint recognition-based interactive authentication system and method
CN109660744A (en) * 2018-10-19 2019-04-19 深圳壹账通智能科技有限公司 The double recording methods of intelligence, equipment, storage medium and device based on big data
CN109671438A (en) * 2019-01-28 2019-04-23 武汉恩特拉信息技术有限公司 It is a kind of to provide the device and method of ancillary service using voice
CN110909613A (en) * 2019-10-28 2020-03-24 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN110881115A (en) * 2019-12-24 2020-03-13 新华智云科技有限公司 Strip splitting method and system for conference video

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174092A (en) * 2023-11-02 2023-12-05 北京语言大学 Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis
CN117174092B (en) * 2023-11-02 2024-01-26 北京语言大学 Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination