CN111767793A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN111767793A
Authority
CN
China
Prior art keywords
initial
voice
current
data
image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010450209.3A
Other languages
Chinese (zh)
Inventor
郭莉莉
杨琳
王旭阳
徐培来
柳杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN202010450209.3A
Publication of CN111767793A
Legal status: Pending

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00: Pattern recognition
                    • G06F18/20: Analysing
                        • G06F18/22: Matching criteria, e.g. proximity measures
            • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
                        • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
                            • G06V40/161: Detection; Localisation; Normalisation
                                • G06V40/166: Detection; Localisation; Normalisation using acquisition arrangements
                            • G06V40/168: Feature extraction; Face representation
                            • G06V40/172: Classification, e.g. identification
                    • G06V40/20: Movements or behaviour, e.g. gesture recognition
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00: Speech recognition
                    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
                    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
                    • G10L15/24: Speech recognition using non-acoustical features

Abstract

The invention discloses a data processing method and device. The method comprises: performing feature recognition on acquired original image data and original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object; performing face tracking on current image data, and judging whether the face tracking result of the current object matches the initial feature image data of the initial object; if the face tracking result of the current object does not match the initial feature image data of the initial object, performing voice tracking on the current object, and judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object; and if the voice tracking result of the current object matches the initial feature voice data of the initial object, performing voice recognition on the current voice data of the current object. Embodiments of the invention improve the fluency and accuracy of voice recognition.

Description

Data processing method and device
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a data processing method and apparatus.
Background
With the development of technology, speech recognition and image recognition are widely applied across industries. However, image recognition accuracy suffers under the influence of illumination, face angle, occlusion, and the like, while speech recognition accuracy suffers under the influence of noise, multiple speakers, and the like. In practical applications, therefore, speech and images are combined, which improves the recognition accuracy of each modality and meets real-time application requirements. In the prior art, however, speech recognition depends heavily on lip-movement detection within image recognition: if face detection fails, speech recognition stops, making the recognized speech discontinuous. How to use audio and video information to perform speech recognition more accurately is therefore a key research focus in current image and speech recognition systems.
Disclosure of Invention
To effectively overcome the above defects in the prior art, an embodiment of the present invention provides a data processing method, including: acquiring original image data and original voice data; performing feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object; performing face tracking on current image data, and judging whether the face tracking result of the current object matches the initial feature image data of the initial object; if the face tracking result of the current object does not match the initial feature image data of the initial object, performing voice tracking on the current object, and judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object; and if the voice tracking result of the current object matches the initial feature voice data of the initial object, performing voice recognition on the current voice data of the current object.
In an implementation, if the face tracking result of the current object matches the initial feature image data of the initial object, voice tracking is performed on the current object in single-person mode or multi-person mode according to the face tracking result of the current object.
In an embodiment, the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object includes: if only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, performing voice recognition on the single-person voice data of the current object and storing the single-person voice data.
In an embodiment, the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object further includes: if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, performing voice separation on the current voice data of the current objects to obtain the separated voice data corresponding to each current object, and performing voice recognition on the separated voice data.
In an embodiment, the performing voice separation on the current voice data of the current object includes: performing voice separation on the current voice data of the current object according to the initial feature voice data and historical single-person voice data.
In a possible embodiment, the current voice data of the current object is separated by beamforming, with the initial feature voice data and historical single-person voice data serving as reference data.
In an embodiment, before the judging whether the face tracking result of the current object matches the initial feature image data of the initial object, the method further comprises: establishing an object classification model according to the initial feature image data and/or the initial feature voice data. The judging whether the face tracking result of the current object matches the initial feature image data of the initial object then includes: judging, according to the object classification model, whether the face tracking result of the current object matches the initial feature image data of the initial object; and the judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object includes: judging, according to the object classification model, whether the voice tracking result of the current object matches the initial feature voice data of the initial object.
In an embodiment, the initial feature image data includes at least initial lip-movement feature data corresponding to the initial object, and the judging whether the face tracking result of the current object matches the initial feature image data of the initial object includes: judging whether the face tracking result of the current object matches the initial lip-movement feature data of the initial object.
In an embodiment, the method further comprises: if the voice tracking result of the current object does not match the initial feature voice data of the initial object, ending voice recognition of the current voice data.
Another aspect of the embodiments of the present invention provides a data processing apparatus, including: an acquisition module, configured to acquire original image data and original voice data; a feature recognition module, configured to perform feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object; a face tracking module, configured to perform face tracking on current image data and judge whether the face tracking result of the current object matches the initial feature image data of the initial object; a first voice tracking module, configured to, if the face tracking result of the current object does not match the initial feature image data of the initial object, perform voice tracking on the current object and judge whether the voice tracking result of the current object matches the initial feature voice data of the initial object; and a voice recognition module, configured to perform voice recognition on the current voice data of the current object if the voice tracking result of the current object matches the initial feature voice data of the initial object.
According to the data processing method and device provided by the embodiments of the present invention, when the image detection result is abnormal during voice recognition, that is, when it shows that the current object does not match the initial object, whether to continue voice recognition is decided by judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object. This solves the prior-art problem that voice recognition, being heavily dependent on lip-movement detection in image recognition, stops as soon as face detection fails and therefore becomes discontinuous. Voice recognition can thus be performed more accurately using audio and video information, which improves the fluency, integrity, and accuracy of voice recognition and effectively improves the user experience.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Fig. 1 is a schematic flowchart of an implementation of a data processing method according to an embodiment of the present invention;
Fig. 2 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
Fig. 3 is a block diagram of another data processing apparatus according to an embodiment of the present invention;
Fig. 4 is a block diagram of another data processing apparatus according to an embodiment of the present invention;
Fig. 5 is a block diagram of another data processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of methods, apparatuses, or devices consistent with certain aspects of the specification, as detailed in the appended claims.
Referring to fig. 1, an embodiment of the present invention provides a data processing method, including:
Step 101, acquiring original image data and original voice data;
Step 102, performing feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object;
Step 103, performing face tracking on the current image data, and judging whether the face tracking result of the current object matches the initial feature image data of the initial object;
Step 104, if the face tracking result of the current object does not match the initial feature image data of the initial object, performing voice tracking on the current object, and judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object;
Step 105, if the voice tracking result of the current object matches the initial feature voice data of the initial object, performing voice recognition on the current voice data of the current object.
In step 101, the original image data and original voice data are acquired by a camera or similar equipment and include an image and voice of the object to be recognized. In step 102, feature recognition is performed on the acquired original image data and original voice data respectively; specifically, this may be face and lip-movement detection on the original image data, voiceprint recognition on the original voice data, and similar feature recognition methods, so that the recognized initial feature image data and initial feature voice data respectively contain the image features and voice features of the initial object, and the identity of the initial object can be confirmed from them. The initial feature image data may be face feature data and/or lip-movement feature data corresponding to the initial object; other feature recognition methods may of course also be used in step 102, and the specific recognition method and the specific content of the initial feature image data are not limited here, as long as the recognized initial feature image data and initial feature voice data can confirm the identity of the initial object.

Step 103 then performs face tracking detection on the current image data and judges whether the face tracking result of the current object matches the initial feature image data of the initial object. The current image data can likewise be transmitted by the camera or similar equipment. Since the initial feature image data can confirm the identity of the initial object from the image perspective, judging whether the face tracking result of the current object matches the initial feature image data reveals whether the current object matches the initial object from the image perspective. Here, the current object includes the object corresponding to the current image data and the object corresponding to the current voice data.

If the face tracking result of the current object does not match the initial feature image data of the initial object, that is, the current object and the initial object do not match from the image perspective, the object corresponding to the current image data is not the initial object. Voice tracking may then be performed on the current object through step 104, and whether the voice tracking result of the current object matches the initial feature voice data of the initial object is judged, which reveals whether the current object matches the initial object from the voice perspective, that is, whether the object corresponding to the current voice data is the initial object. If the voice tracking result matches the initial feature voice data of the initial object, voice recognition is performed on the current voice data of the current object through step 105.
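To make the control flow of steps 101 to 105 concrete, the sketch below walks through the two matching decisions over precomputed feature embeddings. It is a minimal illustration only: the embedding representation, the cosine-similarity matching rule, the threshold, and the recognize callable are assumptions introduced here, not details fixed by this disclosure.

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity between two 1-D feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def process(face_emb, voice_emb, init_face_emb, init_voice_emb,
            recognize, threshold=0.75):
    """Decision flow of steps 103-105. face_emb/voice_emb are features
    tracked from the current image/audio (face_emb may be None if no
    face was detected); init_* come from the initial feature data."""
    if face_emb is not None and cos_sim(face_emb, init_face_emb) >= threshold:
        # Image confirms the initial object: keep recognizing.
        return recognize()
    if cos_sim(voice_emb, init_voice_emb) >= threshold:
        # Face lost or changed, but the voiceprint still matches,
        # so recognition continues (steps 104-105).
        return recognize()
    return None  # neither modality matches: stop and treat audio as noise
```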
In the embodiment of the invention, when the image detection result is abnormal during voice recognition, that is, when it shows that the current object does not match the initial object, whether to continue voice recognition is decided by judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object. This solves the prior-art problem that voice recognition, being heavily dependent on lip-movement detection in image recognition, stops as soon as face detection fails and therefore becomes discontinuous. Voice recognition can thus be performed more accurately using audio and video information, which improves the fluency, integrity, and accuracy of voice recognition and effectively improves the user experience.
In an embodiment, the method further comprises: if the face tracking result of the current object matches the initial feature image data of the initial object, performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object.
In the embodiment of the invention, when the face tracking result of the current object matches the initial feature image data of the initial object, the current object may include one person or several persons. Whether to start multi-person-mode voice tracking can therefore be determined from the face tracking result of the current object, specifically from whether more than one of the current objects shows lip movement, and voice tracking recognition in multi-person mode is then performed according to reference data such as the lip position and direction of each of the current objects, which improves voice recognition accuracy in the multi-speaker case. A mode-selection sketch follows below.
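The following sketch illustrates one way such a mode decision could be made from the face tracking result; the record fields (embedding, lips_moving), the similarity rule, and the threshold are hypothetical stand-ins for whatever the tracker actually produces.

```python
import numpy as np

def select_mode(tracked_faces, init_face_emb, threshold=0.75):
    """tracked_faces: list of dicts with assumed fields 'embedding'
    (1-D np.ndarray) and 'lips_moving' (bool) from a face tracker.
    Returns 'multi' when more than one matched face shows lip movement."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    matched = [f for f in tracked_faces
               if cos(f["embedding"], init_face_emb) >= threshold]
    speaking = [f for f in matched if f["lips_moving"]]
    return "multi" if len(speaking) > 1 else "single"
```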
In an embodiment, the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object comprises: if only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, performing voice recognition on the single-person voice data of the current object and storing the single-person voice data.
In the embodiment of the invention, when only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, the current object is a single speaker, and the current single-person voice data can be confirmed as that speaker's voice data. Voice recognition can therefore be performed directly on the single-person voice data of the current object, and the data is stored; the stored single-person voice data can also serve later as historical single-person voice data for voice separation, improving the accuracy of voice recognition.
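A minimal sketch of this single-person path follows; the asr callable and the in-memory history list are illustrative assumptions standing in for a real recognizer and storage.

```python
class SingleModeRecognizer:
    """Single-person mode: recognize directly and archive each segment
    as historical single-person voice data for later voice separation."""
    def __init__(self, asr):
        self.asr = asr       # assumed speech-recognition callable
        self.history = []    # stored historical single-person voice data

    def handle(self, audio_segment):
        text = self.asr(audio_segment)    # only one speaker: recognize directly
        self.history.append(audio_segment)
        return text
```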
In an embodiment, the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object further comprises:
if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, performing voice separation on the current voice data of the current objects to obtain the separated voice data corresponding to each current object, and performing voice recognition on the separated voice data.
In the embodiment of the invention, if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, the current object consists of several speakers. Voice separation must then be performed on the current voice data of these speakers to obtain the separated voice data corresponding to each current object, so that voice recognition can subsequently be performed on the separated voice data, improving voice recognition accuracy in the multi-speaker case.
In an embodiment, the performing voice separation on the current voice data of the current object includes: performing voice separation on the current voice data of the current object according to the initial feature voice data and historical single-person voice data. In the embodiment of the invention, voice separation can be realized by comparing the features of the current voice data against the initial feature voice data and against the historical single-person voice data, where the historical single-person voice data is previously stored voice data corresponding to a single speaker, for example voice data stored in single-person mode. Using the initial feature voice data and the historical single-person voice data together as reference data effectively improves the accuracy of voice separation in the multi-speaker case; a segment-assignment sketch follows below.
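The sketch below shows one crude way feature comparison against reference data could drive separation: short segments of the mixture are assigned to the nearest reference speaker embedding. The embed function and the hard segment-assignment strategy are assumptions for illustration; practical separation would operate at the signal level.

```python
import numpy as np

def assign_segments(segments, ref_embs, embed):
    """segments: short audio chunks cut from the mixed recording.
    ref_embs: speaker embeddings built from the initial feature voice
    data and the historical single-person voice data. embed: assumed
    speaker-embedding function mapping a chunk to a 1-D vector."""
    per_speaker = {i: [] for i in range(len(ref_embs))}
    for seg in segments:
        e = embed(seg)
        sims = [float(np.dot(e, r) /
                      (np.linalg.norm(e) * np.linalg.norm(r) + 1e-8))
                for r in ref_embs]
        per_speaker[int(np.argmax(sims))].append(seg)  # nearest reference wins
    return per_speaker
```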
In a possible embodiment, the current voice data of the current object is separated by beamforming, with the initial feature voice data and historical single-person voice data serving as reference data. In the embodiment of the invention, in the multi-speaker case, using the historical single-person voice data stored in single-speaker mode, among other data, as reference signals for beamforming-based voice separation greatly improves the accuracy of the separation.
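As background for the beamforming step, the sketch below implements only the basic delay-and-sum spatial filter; the steering delays are assumed to come from an external source localizer (for example the tracked lip position), and the reference-signal-driven adaptive design this embodiment describes (an MVDR- or GSC-style beamformer, for instance) is not shown.

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """mic_signals: (n_mics, n_samples) array of microphone channels.
    delays: per-microphone integer sample delays steering the beam
    toward one speaker. Aligning and averaging the channels boosts the
    target direction and attenuates the other speakers."""
    aligned = np.stack([np.roll(sig, -d) for sig, d in zip(mic_signals, delays)])
    return aligned.mean(axis=0)
```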
In an embodiment, before the judging whether the face tracking result of the current object matches the initial feature image data of the initial object, the method further comprises:
establishing an object classification model according to the initial feature image data and/or the initial feature voice data;
the judging whether the face tracking result of the current object matches the initial feature image data of the initial object includes:
judging, according to the object classification model, whether the face tracking result of the current object matches the initial feature image data of the initial object;
the judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object includes:
judging, according to the object classification model, whether the voice tracking result of the current object matches the initial feature voice data of the initial object.
In the embodiment of the invention, the object classification model is established according to the initial feature image data and/or the initial feature voice data, and whether the face tracking result and the voice tracking result of the current object match the corresponding data of the initial object is judged according to the object classification model. In this way, the feature data within the initial feature image data and the initial feature voice data can be extracted more accurately, more accurate judgment results can be obtained, and the speed of voice recognition can be improved. The object classification model may be obtained by training a neural network on sample image data or sample voice data separately, or by training on sample image data and sample voice data simultaneously as the classification basis.
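A toy stand-in for such a model is sketched below: it wraps enrolled face and voice embeddings and answers both matching questions with a shared cosine rule. The embeddings are assumed to come from pretrained networks, and the fusion and threshold choices are illustrative, not prescribed by the embodiment.

```python
import numpy as np

class ObjectClassificationModel:
    """Enrolled-object matcher over face and/or voice embeddings."""
    def __init__(self, face_ref=None, voice_ref=None, threshold=0.75):
        self.face_ref = face_ref      # from the initial feature image data
        self.voice_ref = voice_ref    # from the initial feature voice data
        self.threshold = threshold

    @staticmethod
    def _cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def match_face(self, emb):
        return (self.face_ref is not None
                and self._cos(emb, self.face_ref) >= self.threshold)

    def match_voice(self, emb):
        return (self.voice_ref is not None
                and self._cos(emb, self.voice_ref) >= self.threshold)
```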
In an embodiment, the initial feature image data includes at least initial lip-movement feature data corresponding to the initial object;
the judging whether the face tracking result of the current object matches the initial feature image data of the initial object includes:
judging whether the face tracking result of the current object matches the initial lip-movement feature data of the initial object.
In the embodiment of the invention, when face tracking is performed during voice recognition, the displayed image ranges vary greatly and the background of the displayed content is noisy, so the speaker's lip action must be taken as the key point of image detection. The initial feature image data therefore includes at least the initial lip-movement feature data corresponding to the initial object, and the face tracking result of the current object matching the initial feature image data of the initial object means that it matches at least the initial lip-movement feature data of the initial object; voice tracking in single-person mode or multi-person mode can then be performed according to the face tracking result of the current object.
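One simple way to compare lip action against an enrolled lip-movement feature is to reduce each to a mouth-opening time series and correlate them, as sketched below; the landmark indices, the correlation measure, and the threshold are all assumptions introduced for illustration.

```python
import numpy as np

def mouth_opening_trace(landmark_frames, upper=3, lower=9):
    """landmark_frames: per-frame arrays of (x, y) lip landmarks from an
    assumed detector; the upper/lower lip point indices are hypothetical."""
    return np.array([np.linalg.norm(f[upper] - f[lower]) for f in landmark_frames])

def lips_match(current_frames, initial_trace, threshold=0.5):
    """Normalized cross-correlation between the current mouth-opening
    trace and the enrolled one, as a crude matching criterion."""
    cur = mouth_opening_trace(current_frames)
    n = min(len(cur), len(initial_trace))
    a = (cur[:n] - cur[:n].mean()) / (cur[:n].std() + 1e-8)
    b = (initial_trace[:n] - initial_trace[:n].mean()) / (initial_trace[:n].std() + 1e-8)
    return float(np.mean(a * b)) >= threshold
```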
In an embodiment, the method further comprises: if the voice tracking result of the current object does not match the initial feature voice data of the initial object, ending voice recognition of the current voice data. In the embodiment of the present invention, if the face tracking result of the current object does not match the initial feature image data of the initial object and the voice tracking result of the current object also does not match the initial feature voice data of the initial object, then neither the object corresponding to the current image data nor the object corresponding to the current voice data matches the initial object; that is, the face in the current image is not the initial object, and the current speaker is a different speaker rather than the initial object. The current voice data should then be regarded as noise, and voice recognition of the current voice data stops.
Referring to fig. 2, another embodiment of the present invention provides a data processing apparatus, including:
an obtaining module 201, configured to obtain original image data and original voice data;
a feature recognition module 202, configured to perform feature recognition on the original image data and the original voice data, respectively, to obtain initial feature image data and initial feature voice data corresponding to an initial object;
a face tracking module 203, configured to perform face tracking on current image data and determine whether a face tracking result of the current object matches initial feature image data of an initial object;
the first voice tracking module 204 is configured to, if the face tracking result of the current object does not match the initial feature image data of the initial object, perform voice tracking on the current object, and determine whether the voice tracking result of the current object matches the initial feature voice data of the initial object;
and the voice recognition module 205 is configured to perform voice recognition on the current voice data of the current object if the voice tracking result of the current object matches the initial characteristic voice data of the initial object.
In the embodiment of the present invention, the original image data and original voice data in the acquisition module 201 are acquired by a camera or similar equipment and include an image and voice of the object to be recognized. The feature recognition module 202 performs feature recognition on the acquired original image data and original voice data respectively; specifically, this may be face and lip-movement detection on the original image data, voiceprint recognition on the original voice data, and similar feature recognition methods, so that the recognized initial feature image data and initial feature voice data respectively contain the image features and voice features of the initial object, and the identity of the initial object can be confirmed from them. The initial feature image data may be face feature data and/or lip-movement feature data corresponding to the initial object; other feature recognition methods may of course also be used, and the specific recognition method and the specific content of the initial feature image data are not limited here, as long as the recognized initial feature image data and initial feature voice data can confirm the identity of the initial object.

The face tracking module 203 then performs face tracking detection on the current image data and judges whether the face tracking result of the current object matches the initial feature image data of the initial object. The current image data can likewise be transmitted by the camera or similar equipment, and since the initial feature image data can confirm the identity of the initial object from the image perspective, this judgment reveals whether the current object matches the initial object from the image perspective; here, the current object includes the object corresponding to the current image data and the object corresponding to the current voice data. If the face tracking result of the current object does not match the initial feature image data, that is, the object corresponding to the current image data is not the initial object, the first voice tracking module 204 performs voice tracking on the current object and judges whether the voice tracking result of the current object matches the initial feature voice data of the initial object, which reveals whether the object corresponding to the current voice data is the initial object. If the voice tracking result matches the initial feature voice data of the initial object, the voice recognition module 205 performs voice recognition on the current voice data of the current object.
In the embodiment of the invention, when the image detection result is abnormal during voice recognition, that is, when it shows that the current object does not match the initial object, whether to continue voice recognition is decided by judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object. This solves the prior-art problem that voice recognition, being heavily dependent on lip-movement detection in image recognition, stops as soon as face detection fails and therefore becomes discontinuous. Voice recognition can thus be performed more accurately using audio and video information, which improves the fluency, integrity, and accuracy of voice recognition and effectively improves the user experience.
Referring to fig. 3, in an implementation manner, the apparatus further includes:
a second voice tracking module 301, configured to perform voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object if the face tracking result of the current object matches the initial feature image data of the initial object.
In the embodiment of the present invention, the acquisition module 201 acquires the original image data and original voice data, the feature recognition module 202 performs feature recognition on them respectively to obtain the initial feature image data and initial feature voice data corresponding to the initial object, and the face tracking module 203 performs face tracking on the current image data and judges whether the face tracking result of the current object matches the initial feature image data of the initial object. When the face tracking result of the current object matches the initial feature image data of the initial object, the current object may include one person or several persons, so the second voice tracking module 301 can determine from the face tracking result of the current object whether to start multi-person-mode voice tracking, specifically from whether more than one of the current objects shows lip movement, and then performs voice tracking recognition in multi-person mode according to reference data such as the lip position and direction of each of the current objects, which improves voice recognition accuracy in the multi-speaker case.
Referring to fig. 4, in an implementation, the second voice tracking module 301 includes:
a single voice recognition module 302, configured to perform voice recognition on the single-person voice data of the current object and store the single-person voice data if only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object.
In the embodiment of the invention, when only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, the current object is a single speaker, and the current single-person voice data can be confirmed as that speaker's voice data. Voice recognition can therefore be performed directly on the single-person voice data of the current object, and the data is stored; the stored single-person voice data can also serve later as historical single-person voice data for voice separation, improving the accuracy of voice recognition.
Referring to fig. 4, in an implementation manner, the second voice tracking module 301 further includes:
a multi-person voice recognition module 303, configured to, if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, perform voice separation on the current voice data of the current objects to obtain the separated voice data corresponding to each current object, and perform voice recognition on the separated voice data.
In the embodiment of the invention, if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, the current object consists of several speakers. Voice separation must then be performed on the current voice data of these speakers to obtain the separated voice data corresponding to each current object, so that voice recognition can subsequently be performed on the separated voice data, improving voice recognition accuracy in the multi-speaker case.
In one embodiment, the multi-person speech recognition module 303 includes:
a voice separation unit, configured to perform voice separation on the current voice data of the current object according to the initial feature voice data and historical single-person voice data.
In the embodiment of the invention, voice separation can be realized by comparing the features of the current voice data against the initial feature voice data and against the historical single-person voice data, where the historical single-person voice data is previously stored voice data corresponding to a single speaker, for example voice data stored in single-person mode. Using the initial feature voice data and the historical single-person voice data together as reference data effectively improves the accuracy of voice separation in the multi-speaker case.
In a possible embodiment, the current voice data of the current object is separated by beamforming, with the initial feature voice data and historical single-person voice data serving as reference data. In the embodiment of the invention, in the multi-speaker case, using the historical single-person voice data stored in single-speaker mode, among other data, as reference signals for beamforming-based voice separation greatly improves the accuracy of the separation.
Referring to fig. 5, in an implementation manner, the apparatus further includes:
a model building module 401, configured to build an object classification model according to the initial feature image data and/or the initial feature voice data;
the face tracking module 203 includes:
a first matching unit, configured to judge, according to the object classification model, whether the face tracking result of the current object matches the initial feature image data of the initial object;
the first voice tracking module 204 includes:
a second matching unit, configured to judge, according to the object classification model, whether the voice tracking result of the current object matches the initial feature voice data of the initial object.
In the embodiment of the invention, the object classification model is established according to the initial feature image data and/or the initial feature voice data, and whether the face tracking result and the voice tracking result of the current object match the corresponding data of the initial object is judged according to the object classification model. In this way, the feature data within the initial feature image data and the initial feature voice data can be extracted more accurately, more accurate judgment results can be obtained, and the speed of voice recognition can be improved. The object classification model may be obtained by training a neural network on sample image data or sample voice data separately, or by training on sample image data and sample voice data simultaneously as the classification basis.
In one embodiment, the initial feature image data includes at least initial lip movement feature data corresponding to the initial object;
the face tracking module 203 includes:
a lip movement judging unit, configured to judge whether the face tracking result of the current object matches the initial lip-movement feature data of the initial object.
In the embodiment of the invention, when face tracking is performed during voice recognition, the displayed image ranges vary greatly and the background of the displayed content is noisy, so the speaker's lip action must be taken as the key point of image detection. The initial feature image data therefore includes at least the initial lip-movement feature data corresponding to the initial object, and the face tracking result of the current object matching the initial feature image data of the initial object means that it matches at least the initial lip-movement feature data of the initial object; voice tracking in single-person mode or multi-person mode can then be performed according to the face tracking result of the current object.
In one embodiment, the apparatus further comprises:
an ending module, configured to end voice recognition of the current voice data if the voice tracking result of the current object does not match the initial feature voice data of the initial object. In the embodiment of the present invention, if the face tracking result of the current object does not match the initial feature image data of the initial object and the voice tracking result of the current object also does not match the initial feature voice data of the initial object, then neither the object corresponding to the current image data nor the object corresponding to the current voice data matches the initial object; that is, the face in the current image is not the initial object, and the current speaker is a different speaker rather than the initial object. The current voice data should then be regarded as noise, and voice recognition of the current voice data stops.
In the embodiments of the present invention, the order in which the steps are performed may be changed without affecting the purpose of the implementation.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be subject to the protection scope of the claims.

Claims (10)

1. A data processing method, comprising:
acquiring original image data and original voice data;
performing feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object;
performing face tracking on current image data, and judging whether a face tracking result of a current object matches the initial feature image data of the initial object;
if the face tracking result of the current object does not match the initial feature image data of the initial object, performing voice tracking on the current object, and judging whether a voice tracking result of the current object matches the initial feature voice data of the initial object;
and if the voice tracking result of the current object matches the initial feature voice data of the initial object, performing voice recognition on current voice data of the current object.
2. The method of claim 1, further comprising:
if the face tracking result of the current object matches the initial feature image data of the initial object, performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object.
3. The method of claim 2, wherein the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object comprises:
if only one piece of object data matching the initial feature image data of the initial object exists in the face tracking result of the current object, performing voice recognition on the single-person voice data of the current object and storing the single-person voice data.
4. The method of claim 3, wherein the performing voice tracking on the current object in single-person mode or multi-person mode according to the face tracking result of the current object further comprises:
if multiple pieces of object data matching the initial feature image data of the initial object exist in the face tracking result of the current object, performing voice separation on the current voice data of the current objects to obtain the separated voice data corresponding to each current object, and performing voice recognition on the separated voice data.
5. The method of claim 4, wherein the performing voice separation on the current voice data of the current object comprises:
performing voice separation on the current voice data of the current object according to the initial feature voice data and historical single-person voice data.
6. The method of claim 5, wherein the current voice data of the current object is separated by beamforming, with the initial feature voice data and historical single-person voice data serving as reference data.
7. The method according to any one of claims 1-6, wherein before the judging whether the face tracking result of the current object matches the initial feature image data of the initial object, the method further comprises:
establishing an object classification model according to the initial feature image data and/or the initial feature voice data;
the judging whether the face tracking result of the current object matches the initial feature image data of the initial object comprises:
judging, according to the object classification model, whether the face tracking result of the current object matches the initial feature image data of the initial object;
the judging whether the voice tracking result of the current object matches the initial feature voice data of the initial object comprises:
judging, according to the object classification model, whether the voice tracking result of the current object matches the initial feature voice data of the initial object.
8. The method according to any one of claims 1-6, wherein the initial feature image data comprises at least initial lip-movement feature data corresponding to the initial object;
the judging whether the face tracking result of the current object matches the initial feature image data of the initial object comprises:
judging whether the face tracking result of the current object matches the initial lip-movement feature data of the initial object.
9. The method according to any one of claims 1-6, further comprising:
if the voice tracking result of the current object does not match the initial feature voice data of the initial object, ending voice recognition of the current voice data.
10. A data processing apparatus, comprising:
an acquisition module, configured to acquire original image data and original voice data;
a feature recognition module, configured to perform feature recognition on the original image data and the original voice data respectively to obtain initial feature image data and initial feature voice data corresponding to an initial object;
a face tracking module, configured to perform face tracking on current image data and judge whether a face tracking result of a current object matches the initial feature image data of the initial object;
a first voice tracking module, configured to, if the face tracking result of the current object does not match the initial feature image data of the initial object, perform voice tracking on the current object and judge whether a voice tracking result of the current object matches the initial feature voice data of the initial object;
and a voice recognition module, configured to perform voice recognition on current voice data of the current object if the voice tracking result of the current object matches the initial feature voice data of the initial object.
CN202010450209.3A (filed 2020-05-25, priority 2020-05-25): Data processing method and device, published as CN111767793A, legal status Pending

Priority Applications (1)

Application Number: CN202010450209.3A; Priority Date: 2020-05-25; Filing Date: 2020-05-25; Title: Data processing method and device

Applications Claiming Priority (1)

Application Number: CN202010450209.3A; Priority Date: 2020-05-25; Filing Date: 2020-05-25; Title: Data processing method and device

Publications (1)

CN111767793A, published 2020-10-13

Family

ID=72719512

Family Applications (1)

CN202010450209.3A (CN111767793A): Data processing method and device; priority date 2020-05-25, filing date 2020-05-25

Country Status (1)

CN: CN111767793A

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945672A (en) * 2012-09-29 2013-02-27 深圳市国华识别科技开发有限公司 Voice control system for multimedia equipment, and voice control method
CN103811003A (en) * 2012-11-13 2014-05-21 联想(北京)有限公司 Voice recognition method and electronic equipment
CN104049721A (en) * 2013-03-11 2014-09-17 联想(北京)有限公司 Information processing method and electronic equipment
CN108098767A (en) * 2016-11-25 2018-06-01 北京智能管家科技有限公司 A kind of robot awakening method and device
CN106599866A (en) * 2016-12-22 2017-04-26 上海百芝龙网络科技有限公司 Multidimensional user identity identification method
CN107393527A (en) * 2017-07-17 2017-11-24 广东讯飞启明科技发展有限公司 The determination methods of speaker's number
CN111081234A (en) * 2018-10-18 2020-04-28 珠海格力电器股份有限公司 Voice acquisition method, device, equipment and storage medium
CN111128178A (en) * 2019-12-31 2020-05-08 上海赫千电子科技有限公司 Voice recognition method based on facial expression analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Liu Le et al., "Voiceprint Recognition: A Contact-Free, Occlusion-Tolerant Identity Authentication Method", China Security Protection Technology and Application, no. 1, pp. 33-40. *
Zheng Fang et al., "A Survey of Biometric Recognition Technologies", Journal of Information Security Research, vol. 2, no. 1, pp. 12-26. *
Hao Min et al., "Voice Tracking Based on Cluster Analysis and Speaker Recognition", Computer and Modernization, no. 4, pp. 7-18. *

Similar Documents

Publication Publication Date Title
US11398235B2 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals based on horizontal and pitch angles and distance of a sound source relative to a microphone array
US9899025B2 (en) Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities
CN110808048B (en) Voice processing method, device, system and storage medium
WO2019080639A1 (en) Object identifying method, computer device and computer readable storage medium
CN111833899B (en) Voice detection method based on polyphonic regions, related device and storage medium
JP4729927B2 (en) Voice detection device, automatic imaging device, and voice detection method
KR102230667B1 (en) Method and apparatus for speaker diarisation based on audio-visual data
CN110545396A (en) Voice recognition method and device based on positioning and denoising
CN108229441B (en) Classroom teaching automatic feedback system and feedback method based on image and voice analysis
CN105554443B (en) The localization method and device in abnormal sound source in video image
CN111034222A (en) Sound collecting device, sound collecting method, and program
US10964326B2 (en) System and method for audio-visual speech recognition
CN113157246B (en) Volume adjusting method and device, electronic equipment and storage medium
US9165182B2 (en) Method and apparatus for using face detection information to improve speaker segmentation
KR20070061207A (en) Apparatus and method for detecting of speech block and system for speech recognition
CN110544479A (en) Denoising voice recognition method and device
CN110750152A (en) Human-computer interaction method and system based on lip action
CN110544491A (en) Method and device for real-time association of speaker and voice recognition result thereof
CN110503957A (en) A kind of audio recognition method and device based on image denoising
Arslan et al. Performance of deep neural networks in audio surveillance
CN109997186B (en) Apparatus and method for classifying acoustic environments
CN111767793A (en) Data processing method and device
US11107476B2 (en) Speaker estimation method and speaker estimation device
CN112015364A (en) Method and device for adjusting pickup sensitivity
Gurban et al. Multimodal speaker localization in a probabilistic framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination