CN111767793A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN111767793A
Authority
CN
China
Prior art keywords
initial
voice
current
data
image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010450209.3A
Other languages
Chinese (zh)
Inventor
郭莉莉
杨琳
王旭阳
徐培来
柳杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN202010450209.3A
Publication of CN111767793A
Legal status: Pending

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00: Pattern recognition
                    • G06F18/20: Analysing
                        • G06F18/22: Matching criteria, e.g. proximity measures
            • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
                        • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
                            • G06V40/161: Detection; Localisation; Normalisation
                                • G06V40/166: Detection; Localisation; Normalisation using acquisition arrangements
                            • G06V40/168: Feature extraction; Face representation
                            • G06V40/172: Classification, e.g. identification
                    • G06V40/20: Movements or behaviour, e.g. gesture recognition
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00: Speech recognition
                    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
                    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
                    • G10L15/24: Speech recognition using non-acoustical features

Abstract

The invention discloses a data processing method and device. The method comprises: performing feature recognition on acquired original image data and original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object; performing face tracking on current image data, and judging whether the face tracking result of the current object matches the initial feature image data of the initial object; if the face tracking result of the current object does not match the initial feature image data of the initial object, performing voice tracking on the current object, and judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object; and if the voice tracking result of the current object matches the initial feature voice data of the initial object, performing voice recognition on the current voice data of the current object. Embodiments of the invention improve the fluency and accuracy of voice recognition.

Description

Data processing method and device
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a data processing method and apparatus.
Background
With the development of technology, speech recognition and image recognition are widely applied across industries. However, image recognition accuracy suffers under the influence of illumination, face angle, occlusion, and the like, while speech recognition accuracy suffers under the influence of noise, multiple speakers, and the like. In practical applications, therefore, speech and images are combined, which improves the recognition accuracy of each modality and meets real-time application requirements. In the prior art, however, speech recognition depends heavily on lip-movement detection within image recognition: if face detection fails, speech recognition stops, making the recognized speech discontinuous. How to use audio and video information to perform speech recognition more accurately is therefore a key research focus in current image and speech recognition systems.
Disclosure of Invention
To effectively overcome the above defects in the prior art, an embodiment of the present invention provides a data processing method, including: acquiring original image data and original voice data; performing feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object; performing face tracking on current image data, and judging whether the face tracking result of the current object matches the initial feature image data of the initial object; if the face tracking result of the current object does not match the initial feature image data of the initial object, performing voice tracking on the current object, and judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object; and if the voice tracking result of the current object matches the initial feature voice data of the initial object, performing voice recognition on the current voice data of the current object.
In an implementation, if the face tracking result of the current object matches the initial feature image data of the initial object, voice tracking is performed on the current object in single-person mode or multi-person mode according to the face tracking result of the current object.
In an embodiment, the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object includes: if only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, performing voice recognition on the single-person voice data of the current object and storing the single-person voice data.
In an embodiment, the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object further includes: if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, performing voice separation on the current voice data of the current objects to obtain the separated voice data corresponding to each current object, and performing voice recognition on the separated voice data.
In an embodiment, the performing voice separation on the current voice data of the current object includes: performing voice separation on the current voice data of the current object according to the initial feature voice data and historical single-person voice data.
In a possible embodiment, the current voice data of the current object is separated by beamforming, with the initial feature voice data and historical single-person voice data serving as reference data.
In an embodiment, before the judging whether the face tracking result of the current object matches the initial feature image data of the initial object, the method further comprises: establishing an object classification model according to the initial feature image data and/or the initial feature voice data. The judging whether the face tracking result of the current object matches the initial feature image data of the initial object then includes: judging, according to the object classification model, whether the face tracking result of the current object matches the initial feature image data of the initial object; and the judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object includes: judging, according to the object classification model, whether the voice tracking result of the current object matches the initial feature voice data of the initial object.
In an embodiment, the initial feature image data includes at least initial lip-movement feature data corresponding to the initial object, and the judging whether the face tracking result of the current object matches the initial feature image data of the initial object includes: judging whether the face tracking result of the current object matches the initial lip-movement feature data of the initial object.
In an embodiment, the method further comprises: if the voice tracking result of the current object does not match the initial feature voice data of the initial object, ending voice recognition of the current voice data.
Another aspect of the embodiments of the present invention provides a data processing apparatus, including: an acquisition module, configured to acquire original image data and original voice data; a feature recognition module, configured to perform feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object; a face tracking module, configured to perform face tracking on current image data and judge whether the face tracking result of the current object matches the initial feature image data of the initial object; a first voice tracking module, configured to, if the face tracking result of the current object does not match the initial feature image data of the initial object, perform voice tracking on the current object and judge whether the voice tracking result of the current object matches the initial feature voice data of the initial object; and a voice recognition module, configured to perform voice recognition on the current voice data of the current object if the voice tracking result of the current object matches the initial feature voice data of the initial object.
According to the data processing method and device provided by the embodiments of the present invention, when the image detection result is abnormal during voice recognition, that is, when it shows that the current object does not match the initial object, whether to continue voice recognition is decided by judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object. This solves the prior-art problem that voice recognition, being heavily dependent on lip-movement detection in image recognition, stops as soon as face detection fails and therefore becomes discontinuous. Voice recognition can thus be performed more accurately using audio and video information, which improves the fluency, integrity, and accuracy of voice recognition and effectively improves the user experience.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Fig. 1 is a schematic flowchart of an implementation of a data processing method according to an embodiment of the present invention;
Fig. 2 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
Fig. 3 is a block diagram of another data processing apparatus according to an embodiment of the present invention;
Fig. 4 is a block diagram of another data processing apparatus according to an embodiment of the present invention;
Fig. 5 is a block diagram of another data processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of methods, apparatuses, or devices consistent with certain aspects of the specification, as detailed in the appended claims.
Referring to fig. 1, an embodiment of the present invention provides a data processing method, including:
Step 101, acquiring original image data and original voice data;
Step 102, performing feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object;
Step 103, performing face tracking on the current image data, and judging whether the face tracking result of the current object matches the initial feature image data of the initial object;
Step 104, if the face tracking result of the current object does not match the initial feature image data of the initial object, performing voice tracking on the current object, and judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object;
Step 105, if the voice tracking result of the current object matches the initial feature voice data of the initial object, performing voice recognition on the current voice data of the current object.
In step 101, the original image data and original voice data are acquired by a camera or similar equipment and include an image and voice of the object to be recognized. In step 102, feature recognition is performed on the acquired original image data and original voice data respectively; specifically, this may be face and lip-movement detection on the original image data, voiceprint recognition on the original voice data, and similar feature recognition methods, so that the recognized initial feature image data and initial feature voice data respectively contain the image features and voice features of the initial object, and the identity of the initial object can be confirmed from them. The initial feature image data may be face feature data and/or lip-movement feature data corresponding to the initial object; other feature recognition methods may of course also be used in step 102, and the specific recognition method and the specific content of the initial feature image data are not limited here, as long as the recognized initial feature image data and initial feature voice data can confirm the identity of the initial object.

Step 103 then performs face tracking detection on the current image data and judges whether the face tracking result of the current object matches the initial feature image data of the initial object. The current image data can likewise be transmitted by the camera or similar equipment. Since the initial feature image data can confirm the identity of the initial object from the image perspective, judging whether the face tracking result of the current object matches the initial feature image data reveals whether the current object matches the initial object from the image perspective. Here, the current object includes the object corresponding to the current image data and the object corresponding to the current voice data.

If the face tracking result of the current object does not match the initial feature image data of the initial object, that is, the current object and the initial object do not match from the image perspective, the object corresponding to the current image data is not the initial object. Voice tracking may then be performed on the current object through step 104, and whether the voice tracking result of the current object matches the initial feature voice data of the initial object is judged, which reveals whether the current object matches the initial object from the voice perspective, that is, whether the object corresponding to the current voice data is the initial object. If the voice tracking result matches the initial feature voice data of the initial object, voice recognition is performed on the current voice data of the current object through step 105.
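To make the control flow of steps 101 to 105 concrete, the sketch below walks through the two matching decisions over precomputed feature embeddings. It is a minimal illustration only: the embedding representation, the cosine-similarity matching rule, the threshold, and the recognize callable are assumptions introduced here, not details fixed by this disclosure.

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity between two 1-D feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def process(face_emb, voice_emb, init_face_emb, init_voice_emb,
            recognize, threshold=0.75):
    """Decision flow of steps 103-105. face_emb/voice_emb are features
    tracked from the current image/audio (face_emb may be None if no
    face was detected); init_* come from the initial feature data."""
    if face_emb is not None and cos_sim(face_emb, init_face_emb) >= threshold:
        # Image confirms the initial object: keep recognizing.
        return recognize()
    if cos_sim(voice_emb, init_voice_emb) >= threshold:
        # Face lost or changed, but the voiceprint still matches,
        # so recognition continues (steps 104-105).
        return recognize()
    return None  # neither modality matches: stop and treat audio as noise
```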
In the embodiment of the invention, when the image detection result is abnormal during voice recognition, that is, when it shows that the current object does not match the initial object, whether to continue voice recognition is decided by judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object. This solves the prior-art problem that voice recognition, being heavily dependent on lip-movement detection in image recognition, stops as soon as face detection fails and therefore becomes discontinuous. Voice recognition can thus be performed more accurately using audio and video information, which improves the fluency, integrity, and accuracy of voice recognition and effectively improves the user experience.
In an embodiment, the method further comprises: if the face tracking result of the current object matches the initial feature image data of the initial object, performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object.
In the embodiment of the invention, when the face tracking result of the current object matches the initial feature image data of the initial object, the current object may include one person or several persons. Whether to start multi-person-mode voice tracking can therefore be determined from the face tracking result of the current object, specifically from whether more than one of the current objects shows lip movement, and voice tracking recognition in multi-person mode is then performed according to reference data such as the lip position and direction of each of the current objects, which improves voice recognition accuracy in the multi-speaker case. A mode-selection sketch follows below.
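The following sketch illustrates one way such a mode decision could be made from the face tracking result; the record fields (embedding, lips_moving), the similarity rule, and the threshold are hypothetical stand-ins for whatever the tracker actually produces.

```python
import numpy as np

def select_mode(tracked_faces, init_face_emb, threshold=0.75):
    """tracked_faces: list of dicts with assumed fields 'embedding'
    (1-D np.ndarray) and 'lips_moving' (bool) from a face tracker.
    Returns 'multi' when more than one matched face shows lip movement."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    matched = [f for f in tracked_faces
               if cos(f["embedding"], init_face_emb) >= threshold]
    speaking = [f for f in matched if f["lips_moving"]]
    return "multi" if len(speaking) > 1 else "single"
```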
In an embodiment, the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object comprises: if only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, performing voice recognition on the single-person voice data of the current object and storing the single-person voice data.
In the embodiment of the invention, when only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, the current object is a single speaker, and the current single-person voice data can be confirmed as that speaker's voice data. Voice recognition can therefore be performed directly on the single-person voice data of the current object, and the data is stored; the stored single-person voice data can also serve later as historical single-person voice data for voice separation, improving the accuracy of voice recognition.
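A minimal sketch of this single-person path follows; the asr callable and the in-memory history list are illustrative assumptions standing in for a real recognizer and storage.

```python
class SingleModeRecognizer:
    """Single-person mode: recognize directly and archive each segment
    as historical single-person voice data for later voice separation."""
    def __init__(self, asr):
        self.asr = asr       # assumed speech-recognition callable
        self.history = []    # stored historical single-person voice data

    def handle(self, audio_segment):
        text = self.asr(audio_segment)    # only one speaker: recognize directly
        self.history.append(audio_segment)
        return text
```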
In an embodiment, the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object further comprises:
if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, performing voice separation on the current voice data of the current objects to obtain the separated voice data corresponding to each current object, and performing voice recognition on the separated voice data.
In the embodiment of the invention, if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, the current object consists of several speakers. Voice separation must then be performed on the current voice data of these speakers to obtain the separated voice data corresponding to each current object, so that voice recognition can subsequently be performed on the separated voice data, improving voice recognition accuracy in the multi-speaker case.
In an embodiment, the performing voice separation on the current voice data of the current object includes: performing voice separation on the current voice data of the current object according to the initial feature voice data and historical single-person voice data. In the embodiment of the invention, voice separation can be realized by comparing the features of the current voice data against the initial feature voice data and against the historical single-person voice data, where the historical single-person voice data is previously stored voice data corresponding to a single speaker, for example voice data stored in single-person mode. Using the initial feature voice data and the historical single-person voice data together as reference data effectively improves the accuracy of voice separation in the multi-speaker case; a segment-assignment sketch follows below.
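The sketch below shows one crude way feature comparison against reference data could drive separation: short segments of the mixture are assigned to the nearest reference speaker embedding. The embed function and the hard segment-assignment strategy are assumptions for illustration; practical separation would operate at the signal level.

```python
import numpy as np

def assign_segments(segments, ref_embs, embed):
    """segments: short audio chunks cut from the mixed recording.
    ref_embs: speaker embeddings built from the initial feature voice
    data and the historical single-person voice data. embed: assumed
    speaker-embedding function mapping a chunk to a 1-D vector."""
    per_speaker = {i: [] for i in range(len(ref_embs))}
    for seg in segments:
        e = embed(seg)
        sims = [float(np.dot(e, r) /
                      (np.linalg.norm(e) * np.linalg.norm(r) + 1e-8))
                for r in ref_embs]
        per_speaker[int(np.argmax(sims))].append(seg)  # nearest reference wins
    return per_speaker
```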
In a possible embodiment, the current voice data of the current object is separated by beamforming, with the initial feature voice data and historical single-person voice data serving as reference data. In the embodiment of the invention, in the multi-speaker case, using the historical single-person voice data stored in single-speaker mode, among other data, as reference signals for beamforming-based voice separation greatly improves the accuracy of the separation.
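As background for the beamforming step, the sketch below implements only the basic delay-and-sum spatial filter; the steering delays are assumed to come from an external source localizer (for example the tracked lip position), and the reference-signal-driven adaptive design this embodiment describes (an MVDR- or GSC-style beamformer, for instance) is not shown.

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """mic_signals: (n_mics, n_samples) array of microphone channels.
    delays: per-microphone integer sample delays steering the beam
    toward one speaker. Aligning and averaging the channels boosts the
    target direction and attenuates the other speakers."""
    aligned = np.stack([np.roll(sig, -d) for sig, d in zip(mic_signals, delays)])
    return aligned.mean(axis=0)
```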
In an embodiment, before the judging whether the face tracking result of the current object matches the initial feature image data of the initial object, the method further comprises:
establishing an object classification model according to the initial feature image data and/or the initial feature voice data;
the judging whether the face tracking result of the current object matches the initial feature image data of the initial object includes:
judging, according to the object classification model, whether the face tracking result of the current object matches the initial feature image data of the initial object;
the judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object includes:
judging, according to the object classification model, whether the voice tracking result of the current object matches the initial feature voice data of the initial object.
In the embodiment of the invention, the object classification model is established according to the initial feature image data and/or the initial feature voice data, and whether the face tracking result and the voice tracking result of the current object match the corresponding data of the initial object is judged according to the object classification model. In this way, the feature data within the initial feature image data and the initial feature voice data can be extracted more accurately, more accurate judgment results can be obtained, and the speed of voice recognition can be improved. The object classification model may be obtained by training a neural network on sample image data or sample voice data separately, or by training on sample image data and sample voice data simultaneously as the classification basis.
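A toy stand-in for such a model is sketched below: it wraps enrolled face and voice embeddings and answers both matching questions with a shared cosine rule. The embeddings are assumed to come from pretrained networks, and the fusion and threshold choices are illustrative, not prescribed by the embodiment.

```python
import numpy as np

class ObjectClassificationModel:
    """Enrolled-object matcher over face and/or voice embeddings."""
    def __init__(self, face_ref=None, voice_ref=None, threshold=0.75):
        self.face_ref = face_ref      # from the initial feature image data
        self.voice_ref = voice_ref    # from the initial feature voice data
        self.threshold = threshold

    @staticmethod
    def _cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def match_face(self, emb):
        return (self.face_ref is not None
                and self._cos(emb, self.face_ref) >= self.threshold)

    def match_voice(self, emb):
        return (self.voice_ref is not None
                and self._cos(emb, self.voice_ref) >= self.threshold)
```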
In an embodiment, the initial feature image data includes at least initial lip-movement feature data corresponding to the initial object;
the judging whether the face tracking result of the current object matches the initial feature image data of the initial object includes:
judging whether the face tracking result of the current object matches the initial lip-movement feature data of the initial object.
In the embodiment of the invention, when face tracking is performed during voice recognition, the displayed image ranges vary greatly and the background of the displayed content is noisy, so the speaker's lip action must be taken as the key point of image detection. The initial feature image data therefore includes at least the initial lip-movement feature data corresponding to the initial object, and the face tracking result of the current object matching the initial feature image data of the initial object means that it matches at least the initial lip-movement feature data of the initial object; voice tracking in single-person mode or multi-person mode can then be performed according to the face tracking result of the current object.
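One simple way to compare lip action against an enrolled lip-movement feature is to reduce each to a mouth-opening time series and correlate them, as sketched below; the landmark indices, the correlation measure, and the threshold are all assumptions introduced for illustration.

```python
import numpy as np

def mouth_opening_trace(landmark_frames, upper=3, lower=9):
    """landmark_frames: per-frame arrays of (x, y) lip landmarks from an
    assumed detector; the upper/lower lip point indices are hypothetical."""
    return np.array([np.linalg.norm(f[upper] - f[lower]) for f in landmark_frames])

def lips_match(current_frames, initial_trace, threshold=0.5):
    """Normalized cross-correlation between the current mouth-opening
    trace and the enrolled one, as a crude matching criterion."""
    cur = mouth_opening_trace(current_frames)
    n = min(len(cur), len(initial_trace))
    a = (cur[:n] - cur[:n].mean()) / (cur[:n].std() + 1e-8)
    b = (initial_trace[:n] - initial_trace[:n].mean()) / (initial_trace[:n].std() + 1e-8)
    return float(np.mean(a * b)) >= threshold
```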
In an embodiment, the method further comprises: if the voice tracking result of the current object does not match the initial feature voice data of the initial object, ending voice recognition of the current voice data. In the embodiment of the present invention, if the face tracking result of the current object does not match the initial feature image data of the initial object and the voice tracking result of the current object also does not match the initial feature voice data of the initial object, then neither the object corresponding to the current image data nor the object corresponding to the current voice data matches the initial object; that is, the face in the current image is not the initial object, and the current speaker is a different speaker rather than the initial object. The current voice data should then be regarded as noise, and voice recognition of the current voice data stops.
Referring to fig. 2, another embodiment of the present invention provides a data processing apparatus, including:
an obtaining module 201, configured to obtain original image data and original voice data;
a feature recognition module 202, configured to perform feature recognition on the original image data and the original voice data, respectively, to obtain initial feature image data and initial feature voice data corresponding to an initial object;
a face tracking module 203, configured to perform face tracking on current image data and determine whether a face tracking result of the current object matches initial feature image data of an initial object;
the first voice tracking module 204 is configured to, if the face tracking result of the current object does not match the initial feature image data of the initial object, perform voice tracking on the current object, and determine whether the voice tracking result of the current object matches the initial feature voice data of the initial object;
and the voice recognition module 205 is configured to perform voice recognition on the current voice data of the current object if the voice tracking result of the current object matches the initial characteristic voice data of the initial object.
In the embodiment of the present invention, the original image data and original voice data in the acquisition module 201 are acquired by a camera or similar equipment and include an image and voice of the object to be recognized. The feature recognition module 202 performs feature recognition on the acquired original image data and original voice data respectively; specifically, this may be face and lip-movement detection on the original image data, voiceprint recognition on the original voice data, and similar feature recognition methods, so that the recognized initial feature image data and initial feature voice data respectively contain the image features and voice features of the initial object, and the identity of the initial object can be confirmed from them. The initial feature image data may be face feature data and/or lip-movement feature data corresponding to the initial object; other feature recognition methods may of course also be used, and the specific recognition method and the specific content of the initial feature image data are not limited here, as long as the recognized initial feature image data and initial feature voice data can confirm the identity of the initial object.

The face tracking module 203 then performs face tracking detection on the current image data and judges whether the face tracking result of the current object matches the initial feature image data of the initial object. The current image data can likewise be transmitted by the camera or similar equipment, and since the initial feature image data can confirm the identity of the initial object from the image perspective, this judgment reveals whether the current object matches the initial object from the image perspective; here, the current object includes the object corresponding to the current image data and the object corresponding to the current voice data. If the face tracking result of the current object does not match the initial feature image data, that is, the object corresponding to the current image data is not the initial object, the first voice tracking module 204 performs voice tracking on the current object and judges whether the voice tracking result of the current object matches the initial feature voice data of the initial object, which reveals whether the object corresponding to the current voice data is the initial object. If the voice tracking result matches the initial feature voice data of the initial object, the voice recognition module 205 performs voice recognition on the current voice data of the current object.
In the embodiment of the invention, when the image detection result is abnormal during voice recognition, that is, when it shows that the current object does not match the initial object, whether to continue voice recognition is decided by judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object. This solves the prior-art problem that voice recognition, being heavily dependent on lip-movement detection in image recognition, stops as soon as face detection fails and therefore becomes discontinuous. Voice recognition can thus be performed more accurately using audio and video information, which improves the fluency, integrity, and accuracy of voice recognition and effectively improves the user experience.
Referring to fig. 3, in an implementation manner, the apparatus further includes:
a second voice tracking module 301, configured to perform voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object if the face tracking result of the current object matches the initial feature image data of the initial object.
In the embodiment of the present invention, the acquisition module 201 acquires the original image data and original voice data, the feature recognition module 202 performs feature recognition on them respectively to obtain the initial feature image data and initial feature voice data corresponding to the initial object, and the face tracking module 203 performs face tracking on the current image data and judges whether the face tracking result of the current object matches the initial feature image data of the initial object. When the face tracking result of the current object matches the initial feature image data of the initial object, the current object may include one person or several persons, so the second voice tracking module 301 can determine from the face tracking result of the current object whether to start multi-person-mode voice tracking, specifically from whether more than one of the current objects shows lip movement, and then performs voice tracking recognition in multi-person mode according to reference data such as the lip position and direction of each of the current objects, which improves voice recognition accuracy in the multi-speaker case.
Referring to fig. 4, in an implementation, the second voice tracking module 301 includes:
a single voice recognition module 302, configured to perform voice recognition on the single-person voice data of the current object and store the single-person voice data if only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object.
In the embodiment of the invention, when only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, the current object is a single speaker, and the current single-person voice data can be confirmed as that speaker's voice data. Voice recognition can therefore be performed directly on the single-person voice data of the current object, and the data is stored; the stored single-person voice data can also serve later as historical single-person voice data for voice separation, improving the accuracy of voice recognition.
Referring to fig. 4, in an implementation manner, the second voice tracking module 301 further includes:
a multi-person voice recognition module 303, configured to, if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, perform voice separation on the current voice data of the current objects to obtain the separated voice data corresponding to each current object, and perform voice recognition on the separated voice data.
In the embodiment of the invention, if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, the current object consists of several speakers. Voice separation must then be performed on the current voice data of these speakers to obtain the separated voice data corresponding to each current object, so that voice recognition can subsequently be performed on the separated voice data, improving voice recognition accuracy in the multi-speaker case.
In one embodiment, the multi-person speech recognition module 303 includes:
a voice separation unit, configured to perform voice separation on the current voice data of the current object according to the initial feature voice data and historical single-person voice data.
In the embodiment of the invention, voice separation can be realized by comparing the features of the current voice data against the initial feature voice data and against the historical single-person voice data, where the historical single-person voice data is previously stored voice data corresponding to a single speaker, for example voice data stored in single-person mode. Using the initial feature voice data and the historical single-person voice data together as reference data effectively improves the accuracy of voice separation in the multi-speaker case.
In a possible embodiment, the current voice data of the current object is separated by beamforming, with the initial feature voice data and historical single-person voice data serving as reference data. In the embodiment of the invention, in the multi-speaker case, using the historical single-person voice data stored in single-speaker mode, among other data, as reference signals for beamforming-based voice separation greatly improves the accuracy of the separation.
Referring to fig. 5, in an implementation manner, the apparatus further includes:
a model building module 401, configured to build an object classification model according to the initial feature image data and/or the initial feature voice data;
the face tracking module 203 includes:
a first matching unit, configured to judge, according to the object classification model, whether the face tracking result of the current object matches the initial feature image data of the initial object;
the first voice tracking module 204 includes:
a second matching unit, configured to judge, according to the object classification model, whether the voice tracking result of the current object matches the initial feature voice data of the initial object.
In the embodiment of the invention, the object classification model is established according to the initial feature image data and/or the initial feature voice data, and whether the face tracking result and the voice tracking result of the current object match the corresponding data of the initial object is judged according to the object classification model. In this way, the feature data within the initial feature image data and the initial feature voice data can be extracted more accurately, more accurate judgment results can be obtained, and the speed of voice recognition can be improved. The object classification model may be obtained by training a neural network on sample image data or sample voice data separately, or by training on sample image data and sample voice data simultaneously as the classification basis.
In one embodiment, the initial feature image data includes at least initial lip movement feature data corresponding to the initial object;
the face tracking module 203 includes:
a lip movement judging unit, configured to judge whether the face tracking result of the current object matches the initial lip-movement feature data of the initial object.
In the embodiment of the invention, when face tracking is performed during voice recognition, the displayed image ranges vary greatly and the background of the displayed content is noisy, so the speaker's lip action must be taken as the key point of image detection. The initial feature image data therefore includes at least the initial lip-movement feature data corresponding to the initial object, and the face tracking result of the current object matching the initial feature image data of the initial object means that it matches at least the initial lip-movement feature data of the initial object; voice tracking in single-person mode or multi-person mode can then be performed according to the face tracking result of the current object.
In one embodiment, the apparatus further comprises:
an ending module, configured to end voice recognition of the current voice data if the voice tracking result of the current object does not match the initial feature voice data of the initial object. In the embodiment of the present invention, if the face tracking result of the current object does not match the initial feature image data of the initial object and the voice tracking result of the current object also does not match the initial feature voice data of the initial object, then neither the object corresponding to the current image data nor the object corresponding to the current voice data matches the initial object; that is, the face in the current image is not the initial object, and the current speaker is a different speaker rather than the initial object. The current voice data should then be regarded as noise, and voice recognition of the current voice data stops.
In the embodiments of the present invention, the order in which the steps are performed may be changed without affecting the purpose of the implementation.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be subject to the protection scope of the claims.

Claims (10)

1. A data processing method, comprising:
acquiring original image data and original voice data;
performing feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object;
performing face tracking on current image data, and judging whether a face tracking result of a current object matches the initial feature image data of the initial object;
if the face tracking result of the current object does not match the initial feature image data of the initial object, performing voice tracking on the current object, and judging whether a voice tracking result of the current object matches the initial feature voice data of the initial object;
and if the voice tracking result of the current object matches the initial feature voice data of the initial object, performing voice recognition on current voice data of the current object.
2. The method of claim 1, further comprising:
if the face tracking result of the current object matches the initial feature image data of the initial object, performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object.
3. The method of claim 2, wherein the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object comprises:
if only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, performing voice recognition on the single-person voice data of the current object and storing the single-person voice data.
4. The method of claim 3, wherein the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object further comprises:
if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, performing voice separation on the current voice data of the current objects to obtain the separated voice data corresponding to each current object, and performing voice recognition on the separated voice data.
5. The method of claim 4, wherein the performing voice separation on the current voice data of the current object comprises:
performing voice separation on the current voice data of the current object according to the initial feature voice data and historical single-person voice data.
6. The method of claim 5, wherein the current voice data of the current object is separated by beamforming, with the initial feature voice data and historical single-person voice data serving as reference data.
7. The method according to any one of claims 1-6, wherein before the judging whether the face tracking result of the current object matches the initial feature image data of the initial object, the method further comprises:
establishing an object classification model according to the initial feature image data and/or the initial feature voice data;
the judging whether the face tracking result of the current object matches the initial feature image data of the initial object comprises:
judging, according to the object classification model, whether the face tracking result of the current object matches the initial feature image data of the initial object;
the judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object comprises:
judging, according to the object classification model, whether the voice tracking result of the current object matches the initial feature voice data of the initial object.
8. The method according to any one of claims 1-6, wherein the initial feature image data comprises at least initial lip-movement feature data corresponding to the initial object;
the judging whether the face tracking result of the current object matches the initial feature image data of the initial object comprises:
judging whether the face tracking result of the current object matches the initial lip-movement feature data of the initial object.
9. The method according to any one of claims 1-6, further comprising:
if the voice tracking result of the current object does not match the initial feature voice data of the initial object, ending voice recognition of the current voice data.
10. A data processing apparatus, comprising:
an acquisition module, configured to acquire original image data and original voice data;
a feature recognition module, configured to perform feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object;
a face tracking module, configured to perform face tracking on current image data and judge whether a face tracking result of a current object matches the initial feature image data of the initial object;
a first voice tracking module, configured to, if the face tracking result of the current object does not match the initial feature image data of the initial object, perform voice tracking on the current object and judge whether a voice tracking result of the current object matches the initial feature voice data of the initial object;
and a voice recognition module, configured to perform voice recognition on current voice data of the current object if the voice tracking result of the current object matches the initial feature voice data of the initial object.
CN202010450209.3A (filed 2020-05-25, priority 2020-05-25): Data processing method and device, published as CN111767793A, legal status Pending

Priority Applications (1)

Application Number: CN202010450209.3A; Priority Date: 2020-05-25; Filing Date: 2020-05-25; Title: Data processing method and device

Applications Claiming Priority (1)

Application Number: CN202010450209.3A; Priority Date: 2020-05-25; Filing Date: 2020-05-25; Title: Data processing method and device

Publications (1)

CN111767793A, published 2020-10-13

Family

ID=72719512

Family Applications (1)

CN202010450209.3A (CN111767793A): Data processing method and device; priority date 2020-05-25, filing date 2020-05-25

Country Status (1)

CN: CN111767793A

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945672A (en) * 2012-09-29 2013-02-27 深圳市国华识别科技开发有限公司 Voice control system for multimedia equipment, and voice control method
CN103811003A (en) * 2012-11-13 2014-05-21 联想(北京)有限公司 Voice recognition method and electronic equipment
CN104049721A (en) * 2013-03-11 2014-09-17 联想(北京)有限公司 Information processing method and electronic equipment
CN108098767A (en) * 2016-11-25 2018-06-01 北京智能管家科技有限公司 A kind of robot awakening method and device
CN106599866A (en) * 2016-12-22 2017-04-26 上海百芝龙网络科技有限公司 Multidimensional user identity identification method
CN107393527A (en) * 2017-07-17 2017-11-24 广东讯飞启明科技发展有限公司 The determination methods of speaker's number
CN111081234A (en) * 2018-10-18 2020-04-28 珠海格力电器股份有限公司 Voice acquisition method, device, equipment and storage medium
CN111128178A (en) * 2019-12-31 2020-05-08 上海赫千电子科技有限公司 Voice recognition method based on facial expression analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Liu Le et al., "Voiceprint Recognition: A Contact-Free, Occlusion-Tolerant Identity Authentication Method", China Security Protection Technology and Application, no. 1, pp. 33-40. *
Zheng Fang et al., "A Survey of Biometric Recognition Technologies", Journal of Information Security Research, vol. 2, no. 1, pp. 12-26. *
Hao Min et al., "Voice Tracking Based on Cluster Analysis and Speaker Recognition", Computer and Modernization, no. 4, pp. 7-18. *

Similar Documents

Publication Publication Date Title
US11398235B2 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals based on horizontal and pitch angles and distance of a sound source relative to a microphone array
US9899025B2 (en) Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities
CN110808048B (en) Voice processing method, device, system and storage medium
WO2019080639A1 (en) Object identifying method, computer device and computer readable storage medium
CN111833899B (en) Voice detection method based on polyphonic regions, related device and storage medium
JP4729927B2 (en) Voice detection device, automatic imaging device, and voice detection method
KR102230667B1 (en) Method and apparatus for speaker diarisation based on audio-visual data
CN110545396A (en) Voice recognition method and device based on positioning and denoising
CN108229441B (en) Classroom teaching automatic feedback system and feedback method based on image and voice analysis
CN105554443B (en) The localization method and device in abnormal sound source in video image
CN111034222A (en) Sound collecting device, sound collecting method, and program
US10964326B2 (en) System and method for audio-visual speech recognition
CN113157246B (en) Volume adjusting method and device, electronic equipment and storage medium
US9165182B2 (en) Method and apparatus for using face detection information to improve speaker segmentation
KR20070061207A (en) Apparatus and method for detecting of speech block and system for speech recognition
CN110544479A (en) Denoising voice recognition method and device
CN110750152A (en) Human-computer interaction method and system based on lip action
CN110544491A (en) Method and device for real-time association of speaker and voice recognition result thereof
CN110503957A (en) A kind of audio recognition method and device based on image denoising
Arslan et al. Performance of deep neural networks in audio surveillance
CN109997186B (en) Apparatus and method for classifying acoustic environments
CN111767793A (en) Data processing method and device
US11107476B2 (en) Speaker estimation method and speaker estimation device
CN112015364A (en) Method and device for adjusting pickup sensitivity
Gurban et al. Multimodal speaker localization in a probabilistic framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination