CN114556469A - Data processing method and device, electronic equipment and storage medium

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN114556469A
CN114556469A (Application No. CN201980100983.7A)
Authority
CN
China
Prior art keywords
person
target
data
video data
audio data
Prior art date
Legal status
Pending
Application number
CN201980100983.7A
Other languages
Chinese (zh)
Inventor
赵亮
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd and Shenzhen Huantai Technology Co Ltd
Publication of CN114556469A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A data processing method, apparatus, server and storage medium. The method includes: acquiring target data (301), the target data including video data and audio data; performing a person recognition operation for a target person using the target data, the target person corresponding to a speaker in the target data (302); and adding a mark for the target person in the video data, the mark and the recognition text determined based on the audio data being presented when the video data is played (303).

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to simultaneous interpretation technology, and in particular, to a data processing method, apparatus, electronic device, and storage medium.
Background
With the continuous development and maturity of Artificial Intelligence (AI) technology, products that apply artificial intelligence to solve common problems in daily life keep emerging. Machine simultaneous interpretation (also called AI simultaneous interpretation) combines technologies such as Automatic Speech Recognition (ASR), Machine Translation (MT) and speech synthesis (Text To Speech, TTS) to replace or partially replace human interpreters, realizing Simultaneous Interpretation (SI), and is widely applied in scenes such as conferences and interview programs.
In the related machine simultaneous interpretation system, speech is automatically recognized through speech recognition technology, the recognized source-language text is translated into target-language text through machine translation technology, and the translated result is directly displayed on a screen; the target-language text can also be converted into speech through speech synthesis technology and then broadcast. However, only the speech content of the speaker is displayed and played synchronously; the viewer cannot determine who is speaking while watching, and therefore finds it difficult to understand the speech content in combination with the identity of the speaker.
Disclosure of Invention
In order to solve related technical problems, embodiments of the present application provide a data processing method, an apparatus, an electronic device, and a storage medium.
The embodiment of the application provides a data processing method, which comprises the following steps:
acquiring target data; the target data includes video data and audio data;
performing a person recognition operation for a target person using the target data, the target person corresponding to a speaker in the target data;
adding a mark aiming at the target person in the video data; the marker and the recognized text determined based on the audio data are presented while the video data is played.
An embodiment of the present application further provides a data processing apparatus, including:
an acquisition unit configured to acquire target data; the target data includes video data and audio data;
a first processing unit configured to perform a person recognition operation for a target person corresponding to a speaker in the target data using the target data;
a second processing unit configured to add a tag for the target person in the video data; the marker and the recognized text determined based on the audio data are presented while the video data is played.
An embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any of the data processing methods when executing the program.
The embodiment of the present application further provides a storage medium, on which computer instructions are stored, and the instructions, when executed by a processor, implement the steps of any of the data processing methods described above.
The data processing method, the data processing device, the electronic equipment and the storage medium provided by the embodiment of the application acquire target data; the target data includes video data and audio data; performing a person recognition operation for a target person using the target data, the target person corresponding to a speaker in the target data; adding a mark aiming at the target person in the video data; the mark and the identification text determined based on the audio data are presented when the video data are played; therefore, the target person corresponding to the recognition text can be determined, the content spoken by the target person (namely, the recognition text) is correspondingly marked for the target person, so that a user can conveniently correspond the speaker (namely, the target person) to the content spoken by the speaker, and the content spoken by the target person is understood by combining the identity of the target person, so that the user can be accurately helped to understand the content, and the user experience is improved.
Drawings
FIG. 1 is a schematic diagram of a system architecture for simultaneous interpretation in the related art;
FIG. 2 is a flow chart of a simultaneous interpretation method in the related art;
FIG. 3 is a schematic flow chart illustrating a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic illustration of a marker of an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and specific embodiments.
FIG. 1 is a schematic diagram of a system architecture for simultaneous interpretation in the related art; as shown in FIG. 1, the system may include: a machine simultaneous interpretation server, a voice processing server, a viewer mobile terminal, a Personal Computer (PC) client, and a display screen.
In practical application, a lecturer can give a conference speech through the PC client. During the speech, the PC client collects the lecturer's audio data and sends the collected audio data to the machine simultaneous interpretation server, and the machine simultaneous interpretation server recognizes the audio data through the voice processing server to obtain a recognition result. The machine simultaneous interpretation server can send the recognition result to the PC client, and the PC client projects the recognition result onto the display screen; the recognition result can also be sent to the viewer mobile terminal (specifically, the recognition result in the language required by the user is sent) and displayed to the user, so that the speech content of the speaker is translated into the language required by the user and displayed.
The recognition result may include at least one of: a recognition text in the same language as the audio data (denoted as a first recognition text), a translation text in another language obtained by translating the first recognition text (denoted as a second recognition text), and audio data generated based on the second recognition text.
The process in which the system collects audio data, recognizes the audio data to obtain a recognition result, and sends the recognition result to the PC client is described in detail with reference to the simultaneous interpretation method shown in FIG. 2.
The PC client can be provided with a voice acquisition module (such as a microphone), collects the audio data of the speaker during the speech through the voice acquisition module, and sends the collected audio data to the machine simultaneous interpretation server.
The machine simultaneous interpretation server performs speech recognition on the audio data through the voice processing server to obtain source-language text, namely the first recognition text, and performs machine translation on the source-language text to obtain target-language text, namely the second recognition text;
the machine simultaneous interpretation server projects the recognition result onto the display screen through the PC client for display; here, the recognition result may include at least one of the first recognition text and the second recognition text;
the machine simultaneous interpretation server can also send the recognition result to the viewer mobile terminal and display it on the screen of the viewer mobile terminal;
the machine simultaneous interpretation server can also perform speech synthesis on the second recognition text through the voice processing server, and play the synthesized audio data through the earphone of the viewer mobile terminal.
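This related-art flow can be summarized by the following illustrative Python sketch; the three service functions are stubs standing in for the speech processing services and are assumptions made only for illustration, not part of any disclosed implementation:

```python
# Illustrative sketch of the related-art flow (ASR -> MT -> optional TTS).
# The three service functions are stubs standing in for the voice processing
# server; a real system would call actual speech services here.

def asr_recognize(audio: bytes, lang: str) -> str:
    return "<source-language text>"          # stub for speech recognition

def machine_translate(text: str, src: str, dst: str) -> str:
    return "<target-language text>"          # stub for machine translation

def synthesize_speech(text: str, lang: str) -> bytes:
    return b"<synthesized audio>"            # stub for speech synthesis

def handle_speech_segment(audio: bytes, src_lang: str, dst_lang: str) -> dict:
    first_text = asr_recognize(audio, src_lang)                       # first recognition text
    second_text = machine_translate(first_text, src_lang, dst_lang)   # second recognition text
    tts_audio = synthesize_speech(second_text, dst_lang)              # audio for mobile playback
    # The recognition result sent to the PC client / viewer mobile terminal.
    return {"first_text": first_text, "second_text": second_text, "tts_audio": tts_audio}

print(handle_speech_segment(b"<pcm audio>", "zh", "en"))
```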
The above simultaneous interpretation method can realize speech recognition and translation. However, in multi-person scenes such as interviews and meetings, each speaker has a different identity background and role (such as interviewer and interviewee), and the speech content usually expresses the speaker's own viewpoint and position; if the simultaneously interpreted speech content can be clearly associated with the corresponding speaker, understanding the content in combination with the speaker's role and position is greatly aided. However, the above simultaneous interpretation method cannot provide the correspondence between the speaker and the speech content, and requires the audience to analyze and determine this correspondence themselves, which increases the difficulty of understanding.
Based on this, in various embodiments of the present application, target data is obtained; the target data includes video data and audio data; performing a person recognition operation for a target person using the target data, the target person corresponding to a speaker in the target data; adding a mark aiming at the target person in the video data; the mark and the identification text determined based on the audio data are presented when the video data are played; in this way, the target person (i.e., the speaker) corresponding to the recognition text (which may be the first recognition text or the second recognition text) can be determined, and the content (i.e., the recognition text) spoken by the target person is correspondingly marked on the target person, so that the user can conveniently correspond the target person to the content spoken by the target person and understand the content spoken by the target person according to the identity of the target person, thereby accurately helping the user understand the content and improving the user experience.
Fig. 3 is a schematic flow chart of the data processing method according to the embodiment of the present application; as shown in fig. 3, the method includes:
step 301: acquiring target data; the target data includes video data and audio data;
here, the video data may be presented while the audio data is being played, that is, the audio data and the video data may be played simultaneously in a simultaneous interpretation scene.
Step 302: performing a person recognition operation for a target person using the target data;
performing corresponding person identification operation by using the target data to determine a target person corresponding to the target data and a target position of the target person in the video data;
wherein the target person corresponds to a speaker in the target data.
Step 303: adding a mark aiming at the target person in the video data;
wherein the mark and the recognized text determined based on the audio data are presented while the video data is played.
Specifically, the presentation of the mark and the identification text determined based on the audio data during the playing of the video data means that the mark and the identification text determined based on the audio data are presented in the video data while the video data is being played.
Namely, the data processing method can be applied to the scene of simultaneous interpretation, and particularly can be applied to the scene of simultaneous interpretation of multiple persons, such as interview, multi-person conference and the like.
Specifically, in the simultaneous interpretation scenario, when the speaker speaks, the first terminal (e.g., the PC client shown in fig. 1) acquires the content in real time by using the voice acquisition module (e.g., the microphone array), and then obtains the audio data. The first terminal and the server for realizing simultaneous interpretation can establish communication connection, the first terminal sends the acquired audio data to the server for realizing simultaneous interpretation, and the server can acquire the audio data in real time. Meanwhile, the second terminal shoots the video aiming at the speaker in real time by using a video acquisition module (such as a camera), namely acquires video data. The second terminal and the server for realizing simultaneous interpretation can establish communication connection, the second terminal sends the collected video data to the server for realizing simultaneous interpretation, and the server can acquire the video data in real time. The audio data are collected through the first terminal, the video data are collected through the second terminal, and the server can obtain the target data.
The simultaneous interpretation scene may adopt a system architecture shown in fig. 1, and the data processing method according to the embodiment of the present application may be applied to an electronic device, where the electronic device may be a device newly added to the system architecture shown in fig. 1, or may be a device (e.g., a machine simultaneous interpretation server, a voice processing server, and a viewer mobile terminal) in the system architecture shown in fig. 1, so as to implement the method according to the embodiment of the present application. The electronic device may be a server, a terminal held by a user, or the like.
Specifically, in practical application, the electronic device may be a server, and the server acquires target data, processes the target data by using the data processing method provided in the embodiment of the present application, and obtains a processing result; the processing result can be displayed by a terminal held by a user through a human-computer interaction interface of the terminal; the processing result can be projected to a display screen by the server and displayed through the display screen. The processing result may include video data to which a flag is added.
The electronic device may also be a server having or connected to a human-computer interaction interface, and the processing result may be displayed by the human-computer interaction interface of the server.
Here, the server may be a server newly added to the system architecture of FIG. 1 and used to implement the method of the present application (i.e., the method shown in FIG. 3), or the voice processing server and the machine simultaneous interpretation server in the architecture of FIG. 1 may be improved so as to implement the method of the present application.
The electronic equipment can also be a terminal held by a user, the terminal held by the user can acquire target data, the target data is processed by using the data processing method provided by the embodiment of the application to obtain a processing result, and the processing result is displayed through a human-computer interaction interface of the electronic equipment.
Here, the terminal held by the user may be a terminal that is newly added to the system architecture of fig. 1 and can implement the method of the present application, or may be a terminal that is modified from the mobile terminal of the viewer in the system architecture of fig. 1 to implement the method of the present application. Here, the terminal held by the user may be a PC, a mobile phone, or the like.
It should be noted that the server or the terminal held by the user may also be provided with or connected to a voice acquisition module and a video acquisition module, and the target data may be obtained by acquiring audio data and video data through the voice acquisition module and the video acquisition module provided or connected to the server or the terminal.
In the embodiment of the present application, the data processing method is applied to simultaneous interpretation scenarios such as interviews and conferences, and the audio data and the video data relate to the same interview or the same conference.
The audio data changes as the interview or conference progresses, and the recognition text changes accordingly. In a simultaneous interpretation scene in which multiple persons participate, the target person (i.e., the speaker) may also change, so the recognition text corresponding to the target person continuously changes as the audio data changes.
And during actual application, determining a target person corresponding to the target data and a target position of the target person in the video data by using the person identification operation.
Specifically, the person identification operation includes:
identifying the target person according to the audio data;
identifying a target position of the target person according to the target data;
the adding of the mark aiming at the target person in the video data comprises the following steps:
and adding marks aiming at the target person in the video data according to the target position.
In practical application, the target person and the target position can be obtained by carrying out person identification on the video data and the audio data collected by the microphone.
Specifically, the identifying the target person according to the audio data includes:
determining a target voiceprint characteristic corresponding to the audio data, inquiring a voiceprint database according to the target voiceprint characteristic, and determining a target person corresponding to the audio data; the voiceprint database comprises at least one voiceprint model and a person corresponding to the at least one voiceprint model;
the identifying the target position of the target person according to the target data includes:
determining a facial image corresponding to the target person from an image database, performing image recognition on at least one frame of image in the video data according to the facial image, and determining the target position of the target person in the video data.
Here, the audio data may be acquired by a microphone.
Here, the voiceprint database includes at least one voiceprint model and personal information corresponding to the at least one voiceprint model;
the image database comprises at least one person face image and person information corresponding to the at least one person face image;
the personal information includes: identity (ID, Identity Document); the same person adopts the same ID in the voiceprint database and the image database; therefore, the image database is queried according to the determined target person, and the facial image corresponding to the target person can be determined from the image database.
Here, the electronic device may perform voiceprint recognition on the audio data by using a preset voiceprint recognition model to obtain a corresponding target voiceprint feature; and querying a voiceprint database according to the target voiceprint characteristics, and determining a speaker corresponding to the audio data, namely a target person.
The voiceprint recognition model can be obtained by training a specific neural network in advance, and the voiceprint recognition model is used for determining a speaker corresponding to the audio data.
Here, the electronic device may perform image recognition on at least one frame of image in the video data by using a preset image recognition model, and determine the position in the video data of the person matching the facial image of the target person, that is, the speaker (the target person).
The image recognition model may be obtained by training a specific neural network in advance, and may be used to recognize the video data and locate the person corresponding to the facial image of the target person, that is, the speaker, in the video data.
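For illustration only, querying the voiceprint database for the target person could be sketched as follows; the stub feature vectors, the cosine-similarity matching rule and the threshold value are assumptions of this sketch and are not limited by the embodiment:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two voiceprint feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def identify_target_person(audio_feature: np.ndarray,
                           voiceprint_db: dict,
                           threshold: float = 0.7):
    """voiceprint_db maps person ID -> enrolled voiceprint model (feature vector).
    Returns (person_id, score) of the best match, or (None, score) if no entry
    exceeds the similarity threshold. The threshold value is an assumption."""
    best_id, best_score = None, -1.0
    for person_id, model in voiceprint_db.items():
        score = cosine_similarity(audio_feature, model)
        if score > best_score:
            best_id, best_score = person_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score

# Example with dummy 4-dimensional features standing in for real voiceprints.
db = {"person_A": np.array([0.9, 0.1, 0.0, 0.2]),
      "person_B": np.array([0.1, 0.8, 0.3, 0.0])}
target_feature = np.array([0.85, 0.15, 0.05, 0.18])   # from the current audio
print(identify_target_person(target_feature, db))      # -> ('person_A', ...)
```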
In practical applications, in order to determine the target person in the video data, an image database may be obtained in advance to determine the target person according to the image in the video data.
Based on this, in an embodiment, the method may further include:
acquiring at least one facial image and the person corresponding to each of the at least one facial image (recorded with the ID of the person);
performing image recognition on each facial image, and extracting the facial image features of each person;
and storing the ID of each person in an image database in correspondence with the corresponding facial image features.
The image database may store therein: the ID of the person, the corresponding facial image characteristics.
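An illustrative sketch of such an image database is given below; the feature extractor is a stub standing in for a real face recognition model and is only an assumption:

```python
import numpy as np

class ImageDatabase:
    """In-memory image database: person ID -> facial image feature vector.
    The feature extractor below is a stub standing in for a real face
    recognition model; it is an assumption, not part of the disclosure."""

    def __init__(self):
        self._features = {}

    @staticmethod
    def extract_face_feature(face_image: np.ndarray) -> np.ndarray:
        # Stub: a real system would run a face-recognition network here.
        flat = face_image.astype(np.float32).ravel()
        return flat / (np.linalg.norm(flat) + 1e-9)

    def enroll(self, person_id: str, face_image: np.ndarray) -> None:
        # Store the person ID together with the corresponding facial feature.
        self._features[person_id] = self.extract_face_feature(face_image)

    def get_feature(self, person_id: str) -> np.ndarray:
        return self._features[person_id]

# Enrollment example with dummy 8x8 grayscale "face images".
db = ImageDatabase()
db.enroll("person_A", np.random.rand(8, 8))
db.enroll("person_B", np.random.rand(8, 8))
```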
Here, a first spatial coordinate system may be set in advance for the video data, and the position of the target person in the first spatial coordinate system may be determined as the target position based on the position of the target person in the corresponding frame image in the video data.
Specifically, each frame of image in the video data is an image with the same shape (such as a square), the first spatial coordinate system and the image have a mapping relation, and the target position of the target person in the first spatial coordinate system can be determined based on the corresponding position of the target person in the image; that is, the target position of the target person in the video data may be understood as the position of the target person in the first spatial coordinate system.
In practical application, in order to determine a target person corresponding to audio data, a voiceprint database may be obtained in advance, so as to determine the target person according to a target voiceprint feature corresponding to the audio data.
Based on this, in an embodiment, the method further comprises:
collecting the sound of at least one person and the ID corresponding to each person in the at least one person;
performing voiceprint recognition on the voice of each person in the at least one person to determine a voiceprint model of the voice of each person;
and storing the ID of each person and the corresponding voiceprint model in a voiceprint database in a corresponding mode.
The voiceprint database may store: the person's ID, the corresponding voiceprint model.
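Analogously, the voiceprint database could be enrolled as sketched below; using the normalized mean of per-utterance features as the voiceprint model is one common choice and an assumption of this sketch:

```python
import numpy as np

def build_voiceprint_database(enrollment: dict) -> dict:
    """enrollment maps person ID -> list of per-utterance voiceprint feature
    vectors (stub features here; a real system would use a voiceprint
    recognition model). The voiceprint model stored per person is the
    L2-normalized mean of the enrollment features (an assumption of this
    sketch, not a requirement of the embodiment)."""
    db = {}
    for person_id, utterance_features in enrollment.items():
        mean = np.mean(np.stack(utterance_features), axis=0)
        db[person_id] = mean / (np.linalg.norm(mean) + 1e-9)
    return db

# Example: two enrolled persons, three dummy utterances each.
rng = np.random.default_rng(0)
enrollment = {
    "person_A": [rng.random(16) for _ in range(3)],
    "person_B": [rng.random(16) for _ in range(3)],
}
voiceprint_db = build_voiceprint_database(enrollment)
```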
In practical application, the target person and the target position can be obtained by carrying out person identification on the audio data acquired by the microphone array.
Specifically, the identifying the target person according to the audio data includes:
determining a target voiceprint characteristic corresponding to the audio data;
inquiring a voiceprint database according to the target voiceprint characteristics, and determining a target person corresponding to the audio data; the voiceprint database comprises at least one voiceprint feature and a person corresponding to each voiceprint feature in the at least one voiceprint feature;
the identifying the target position of the target person according to the target data includes:
determining a sound source position corresponding to the audio data;
and determining the target position of the target person in the video data according to the sound source position.
Here, the sound source position may be determined in conjunction with a position of a voice collecting module that collects audio data.
The voice collecting module can be a microphone array with multiple channels, so that the sound source position, namely the position of a target person (speaker) in a real scene, can be determined by applying a sound source positioning technology. Specifically, determining a sound source position corresponding to the audio data includes:
acquiring sound data of each channel in a microphone array (with multiple channels);
and performing position calculation according to the sound data of each sound channel to obtain the sound source position.
Here, any localization algorithm that performs position calculation based on the sound data of each channel (i.e., any sound source localization technique) may be used, for example a multi-element array localization algorithm, without limitation.
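As an illustration of one such technique, the following sketch estimates the direction of arrival for a two-microphone sub-array using the GCC-PHAT time-delay method; the sampling rate, microphone spacing and test signal are arbitrary assumptions:

```python
import numpy as np

def gcc_phat_delay(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Estimate the time delay (seconds) of sig relative to ref via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-15), n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)

def direction_of_arrival(delay: float, mic_distance: float, c: float = 343.0) -> float:
    """Convert the inter-microphone delay into an arrival angle in degrees
    (0 degrees = broadside). mic_distance is the microphone spacing in metres."""
    value = np.clip(delay * c / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(value)))

# Synthetic example: a broadband signal reaches mic2 five samples after mic1.
fs, d = 16000, 0.2                       # sampling rate (Hz), mic spacing (m)
rng = np.random.default_rng(1)
source = rng.standard_normal(1600)       # 0.1 s broadband test signal
mic1 = source
mic2 = np.roll(source, 5)                # mic2 hears it 5 samples later
tau = gcc_phat_delay(mic2, mic1, fs)
print(direction_of_arrival(tau, d))      # roughly 32 degrees for these values
```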
Specifically, the determining the target position of the target person in the video data according to the sound source position includes:
determining that the sound source position corresponds to a target position in the video data based on the position mapping relationship; the position mapping relationship represents a position mapping relationship between a real scene and a scene in the video data.
In practical applications, a position mapping relationship may be constructed in advance in order to determine a corresponding position of a target person in video data, i.e., a target position, according to a position of the target person in a real scene (i.e., a sound source position).
Based on this, in an embodiment, the method further comprises:
constructing a second space coordinate system according to the real scene corresponding to the video data;
and establishing a corresponding relationship between the second spatial coordinate system and the first spatial coordinate system (namely the first spatial coordinate system established for the video data) to obtain the position mapping relationship.
The sound source position represents the position of the target person in the real scene; based on the position mapping relationship, the position in the video data corresponding to the sound source position is determined, that is, the target position of the target person in the video data is determined.
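One simple way to realize such a position mapping relationship is to calibrate a 2D affine transform between coordinates of the real scene (the second spatial coordinate system) and coordinates of the video frame (the first spatial coordinate system); the calibration points in the sketch below are invented for illustration:

```python
import numpy as np

def fit_affine_mapping(scene_pts: np.ndarray, image_pts: np.ndarray) -> np.ndarray:
    """Least-squares 2D affine transform mapping real-scene coordinates
    (second spatial coordinate system) to video-frame coordinates
    (first spatial coordinate system). scene_pts, image_pts: (N, 2) arrays
    of corresponding calibration points, N >= 3."""
    n = scene_pts.shape[0]
    A = np.hstack([scene_pts, np.ones((n, 1))])      # (N, 3)
    # Solve A @ M ~= image_pts for the 3x2 affine parameter matrix M.
    M, *_ = np.linalg.lstsq(A, image_pts, rcond=None)
    return M

def map_to_video(sound_source_xy: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Map a sound source position in the real scene into the video frame."""
    return np.hstack([sound_source_xy, 1.0]) @ M

# Made-up calibration: four floor positions (metres) and where they appear (pixels).
scene = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])
image = np.array([[120.0, 600.0], [1160.0, 600.0], [1000.0, 180.0], [280.0, 180.0]])
M = fit_affine_mapping(scene, image)
print(map_to_video(np.array([2.0, 1.5]), M))   # approximate target position in pixels
```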
In practical application, people can be identified through the video data and the audio data collected through the microphone array, and a target person and a target position are obtained.
Specifically, the person identification operation includes:
identifying the target person and the target position of the target person according to the video data and the audio data;
the adding of the mark aiming at the target person in the video data comprises the following steps:
and adding marks aiming at the target person in the video data according to the target position.
Here, the identifying the target person and the target position of the target person from the video data and the audio data includes:
performing image recognition on at least one frame of image in the video data, and determining at least one first person and the position of the at least one first person in the video data;
determining a target voiceprint characteristic corresponding to the audio data; inquiring a voiceprint database according to the target voiceprint characteristics, and determining a second person corresponding to the audio data; determining a sound source position corresponding to the audio data according to the audio data;
mapping the character position and the sound source position of the at least one first character to a preset spatial coordinate system;
determining a first person position and a first sound source position which are mapped to the preset space coordinate system and meet a preset distance condition, determining a target person by using a first person corresponding to the first person position and a second person corresponding to the first sound source position, and taking the position of the target person corresponding to the preset space coordinate system as a target position.
Here, the first persona characterizes a persona of the video data presentation; the second person characterizes a speaker corresponding to the audio data.
Here, image recognition is performed on at least one frame of image in the video data to determine the at least one first person and the position of the at least one first person in the video data; reference may be made to the person recognition operation performed using the video data described above, and details are not repeated here.
Here, the determining a target voiceprint feature corresponding to the audio data; querying a voiceprint database according to the target voiceprint characteristics, and determining a second person corresponding to the audio data; determining a sound source position corresponding to the audio data according to the audio data, which may refer to the above-mentioned human recognition operation performed by using the audio data, and is not described herein again.
That is, in the embodiment of the present application, the person recognition result obtained from the video data (i.e., the positions of the first person and the first person in the video data) and the person recognition result obtained using the audio data (i.e., the positions of the sound sources corresponding to the second person and the second person) are combined to determine the target person and the target position. Here, by combining the person recognition result obtained from the video data with the person recognition result obtained from the audio data, it is possible to reduce the error of person recognition and improve the accuracy of recognition (specifically, improve the accuracy of target person recognition and the accuracy of target position recognition).
Here, the first person position and the first sound source position that are mapped to the preset spatial coordinate system and meet the preset distance condition include:
the first person position and the first sound source position that coincide in the preset spatial coordinate system, or the first person position and the first sound source position whose distance from each other is smaller than a preset distance threshold.
That is, after the position of the at least one first person and the position of the sound source are mapped to the same coordinate system, a first person position (one position in the position of the at least one first person) and a first sound source position (if only one sound source position is obtained, the first sound source position is the calculated sound source position; if different sound source positions are calculated by using different positioning algorithms, the first sound source position is one of the different sound source positions) at the same position (or at a similar position) are determined;
the first person and the second person at the same position (or at similar positions) are then used to determine the target person. Here, the proximity means that the distance between the position of the person and the position of the sound source is smaller than a preset distance threshold.
Here, the distance between the position of the person mapped to the predetermined spatial coordinate system and the position of the sound source does not meet the predetermined condition, i.e., the positions are not coincident or close, which may be caused by the audio data and the video data being played out of synchronization, and therefore, the audio data and the video data may be synchronized first. Specifically, the audio data may correspond to a first time axis, and the video data may correspond to a second time axis, and the method may further include: aligning audio data and video data according to the first timeline and the second timeline such that the audio data and the video data are synchronized. After determining the positions of a first person and a corresponding person, a second person and a corresponding sound source for the synchronized audio data and video data, mapping the positions of the person and the sound source to the same coordinate system (such as a preset spatial coordinate system), and selecting one person from the first person and the second person mapped to the same position (or similar positions) in the preset spatial coordinate system as a target person.
In addition, each position on the preset space coordinate system can also correspond to time information, and the time information represents the speaking time information of the person at the corresponding position; correspondingly, for the first person position and the first sound source position which are mapped to the preset space coordinate system and meet the preset distance condition, whether the time corresponding to the first person position and the time corresponding to the first sound source position coincide or not can be further determined. Namely, the method may include determining positions of a first person and a corresponding person, a second person and a corresponding sound source for audio data and video data, mapping the positions of the person and the sound source to a preset spatial coordinate system, and selecting one person as a target person from the first person and the second person which are mapped to the same position (or similar positions) in the preset spatial coordinate system and correspond to time coincidence; therefore, the error of character recognition can be reduced, and the accuracy of target character recognition and the accuracy of target position recognition can be improved.
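For illustration, matching a person position to the sound source position under the preset distance condition could be sketched as follows; the positions, the 0.5 threshold and the assumption that the audio data and video data are already synchronized are all illustrative:

```python
import numpy as np

def match_person_to_sound_source(first_persons, sound_source_pos,
                                 max_distance: float = 0.5):
    """first_persons: list of (person_id, position) with positions already
    mapped into the preset spatial coordinate system and assumed synchronized
    in time with the audio. sound_source_pos: sound source position in the
    same coordinate system. Returns the (person_id, position) pair that meets
    the preset distance condition, or None if no person is close enough.
    The 0.5 threshold is an illustrative assumption."""
    best = None
    best_dist = float("inf")
    for person_id, pos in first_persons:
        dist = float(np.linalg.norm(np.asarray(pos) - np.asarray(sound_source_pos)))
        if dist < best_dist:
            best, best_dist = (person_id, pos), dist
    if best is not None and best_dist <= max_distance:
        return best            # the first person position meeting the condition
    return None

# Example: two persons detected in the video, one sound source localized.
persons = [("person_A", (1.0, 2.0)), ("person_B", (3.2, 0.8))]
source = (3.0, 1.0)
print(match_person_to_sound_source(persons, source))   # -> ('person_B', (3.2, 0.8))
```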
Here, the preset spatial coordinate system may adopt a newly-established spatial coordinate system, and mapping the person position and the sound source position to the preset spatial coordinate system may refer to mapping the person position and the sound source position to the newly-established spatial coordinate system.
It should be noted that, the newly-built spatial coordinate system has a corresponding relationship with a real scene, so that the sound source position can be mapped into the newly-built spatial coordinate system. The newly-built space coordinate system has a corresponding relation with scenes in the video data, so that the positions of people can be mapped into the newly-built space coordinate system. And taking the corresponding position of the target person in the preset space coordinate system as a target position, wherein the target position is also the position of the corresponding target person in the video data.
The preset spatial coordinate system may also adopt the first spatial coordinate system set for the video data, in which case mapping the person position and the sound source position to the preset spatial coordinate system means mapping the sound source position to the first spatial coordinate system. Correspondingly, taking the position of the target person in the preset spatial coordinate system as the target position means taking the position of the target person in the first spatial coordinate system as the target position.
Specifically, the determining a target person by using a first person corresponding to the first person position and a second person corresponding to the first sound source position includes:
if the first person corresponding to the first person position is the same as the second person corresponding to the first sound source position, taking the first person corresponding to the first person position (i.e. the second person corresponding to the first sound source position) as the target person;
if a first person corresponding to the first person position is different from a second person corresponding to the first sound source position, acquiring a first weight and a second weight; the first weight represents the credibility of the target person determined according to the video data, and the second weight represents the credibility of the target person determined according to the audio data; weighting the recognition result of the first person corresponding to the first person position according to the first weight, and weighting the recognition result of the second person corresponding to the first sound source position according to the second weight; and selecting one person from a first person corresponding to the first person position and a second person corresponding to the first sound source position as a target person according to the weighting processing result.
The recognition result of the first person corresponding to the first person position may include: person ID, result value (likelihood of characterizing the respective person);
the recognition result of the second person corresponding to the first sound source position may include: person ID, result value.
Here, the first weight and the second weight may be determined in advance by a developer (specifically, a plurality of tests may be performed in advance on the accuracy of image recognition and voiceprint recognition, and the test results are determined) and stored in the electronic device.
Here, it is determined whether the first person and the second person are the same, that is, it is determined whether the person ID of the first person and the person ID of the second person are the same.
For example, if the first person is person a and the second person is also person a, person a is directly taken as the target person.
If the first person is person A and the second person is person B, the first weight and the second weight need to be obtained; the recognition result of person A (including the ID of person A and the corresponding result value) is weighted according to the first weight, and the recognition result of person B (including the ID of person B and the corresponding result value) is weighted according to the second weight, so as to obtain a weighting processing result for person A and a weighting processing result for person B; the two are compared, and if the weighting processing result for person A is greater than that for person B, person A is taken as the target person; otherwise, person B is taken as the target person.
Here, it should be considered that performing image recognition on at least one frame of image in the video data may yield a plurality of recognition results for a given person; likewise, performing voiceprint recognition on the audio data may also yield a plurality of recognition results.
In this case, the weighting of the first person according to the first weight, the weighting of the second person according to the second weight, and the selecting of one person from the first person and the second person as the target person according to the weighting processing result include:
respectively carrying out weighting processing on each first recognition result in the at least two first recognition results according to the first weight to obtain a weighting processing result aiming at each first recognition result;
respectively performing weighting processing on each second recognition result in the at least two second recognition results according to the second weight to obtain a weighting processing result aiming at each second recognition result;
if it is determined that the at least two first recognition results and the at least two second recognition results do not contain the same person (i.e., each person ID in the at least two first recognition results differs from each person ID in the at least two second recognition results), selecting, as the target person, the person with the largest weighting processing result from the at least two first recognition results and the at least two second recognition results according to the weighting processing result of each first recognition result and the weighting processing result of each second recognition result;
if it is determined that the at least two first recognition results and the at least two second recognition results have the same person (i.e., if some person ID exists in the at least two first recognition results and is the same as some person ID existing in the at least two second recognition results), adding the weighting processing results for the same person to obtain a weighting processing result for each person; and selecting the person with the largest weighting processing result from the at least two first recognition results and the at least two second recognition results as the target person according to the weighting processing result of each person.
The first recognition result is obtained for a certain person by performing image recognition on at least one frame of image in the video data; each first recognition result includes: person ID, corresponding result value.
The second identification result refers to a result obtained by identifying the voice print of the audio data; each second recognition result includes: person ID, corresponding result value.
For example, by performing image recognition on at least one frame of image in the video data, a plurality of recognition results may be obtained for a given person image, such as: person A and person B are recognized, with a probability of a1% that person A is the target person (i.e., one first recognition result) and a probability of b% that person B is the target person (i.e., another first recognition result); here, a1% + b% may equal 1;
by performing voiceprint recognition on the audio data, a plurality of recognition results may likewise be obtained, such as: person A and person C are recognized, with a probability of a2% that person A is the target person (i.e., one second recognition result) and a probability of c% that person C is the target person (i.e., another second recognition result); here, a2% + c% may equal 1;
assuming that the first weight is x, the second weight is y, and x + y = 1,
the weighting processing result for the recognition result of each person is as follows:
weighting processing result for the recognition result of person A: a1% × x + a2% × y;
weighting processing result for the recognition result of person B: b% × x;
weighting processing result for the recognition result of person C: c% × y;
the person with the largest weighting processing result among person A, person B, and person C is selected as the target person.
The likelihoods above characterize recognition results obtained by recognizing a person image (specifically, a facial image) with the image recognition model; for example, the image recognition model recognizes a certain person image and determines that the likelihood of the person image being person A is a1%, where a1% represents that the similarity between the person image and the facial image of person A used by the image recognition model is a1%;
similarly, for a recognition result obtained by recognizing speech with the voiceprint recognition model, for example, the voiceprint recognition model recognizes the audio data and determines that the likelihood that the audio data corresponds to person C is c%, where c% represents that the similarity between the voiceprint feature corresponding to the audio data and the voiceprint model of person C used by the voiceprint recognition model is c%.
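The weighting scheme above can be expressed compactly; the sketch below merges the image-based and voiceprint-based candidates using the weights x and y, with the concrete numbers chosen only for illustration:

```python
def fuse_recognition_results(image_results, voice_results, x, y):
    """image_results / voice_results: dict person_id -> result value (likelihood)
    from image recognition and voiceprint recognition respectively.
    x: first weight (credibility of video-based recognition),
    y: second weight (credibility of audio-based recognition), with x + y = 1.
    Returns (target_person_id, fused score); scores for the same person ID
    are added, matching the scheme described above."""
    fused = {}
    for person_id, value in image_results.items():
        fused[person_id] = fused.get(person_id, 0.0) + x * value
    for person_id, value in voice_results.items():
        fused[person_id] = fused.get(person_id, 0.0) + y * value
    target = max(fused, key=fused.get)
    return target, fused[target]

# Example matching the person A / B / C case above (illustrative numbers only).
image_results = {"A": 0.60, "B": 0.40}    # a1% and b%
voice_results = {"A": 0.70, "C": 0.30}    # a2% and c%
print(fuse_recognition_results(image_results, voice_results, x=0.5, y=0.5))
# -> ('A', 0.65): a1% * x + a2% * y is the largest weighting processing result
```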
In practical applications, in order to enable a user to intuitively know the content spoken by a speaker, a corresponding relationship between the speech content and the corresponding speaker needs to be established and displayed.
Based on this, in step 303, the adding a mark for the target person in the video data includes one of the following:
presenting a mark in a floating window or a fixed window at a position corresponding to the target position in the video data (such as the upper left corner, the upper right corner, etc.); the recognition text is added in the mark;
adding a preset mark for the target person at a position corresponding to the target position in the video data (such as above, below, the upper left corner, the upper right corner, etc.); the preset mark includes one of the following: a preset icon, a preset image, and an outline frame for the target person; and the recognition text is presented at a preset position of the video data.
Here, the recognition text is obtained by performing text recognition on the audio data corresponding to the target person; the recognition text changes according to the change of the audio data corresponding to the target person.
Here, the floating window represents a window that is movable in the video data being played; it should be noted that the movable range of the floating window may be limited to a certain range around the target person, so that the user can correspond the target person to the recognition text even after moving the floating window.
The fixed window represents a window that is fixed in position relative to the target person in the video data being played.
In the embodiment of the present application, the mark may be presented in any form of window, for example any of the marks in FIG. 4 may be used. The arrow of the window points to the corresponding speaker, i.e., the target person, and the simultaneously interpreted content, i.e., the recognition text, is presented in the window.
Here, the recognition text may be text obtained based on audio data; the language corresponding to the recognition text may be the same as the language corresponding to the audio data, or may be different from the language corresponding to the audio data, that is, the recognition text may also be a text of another language translated from a text of the same language as the audio data.
Here, the preset icon may adopt: triangle, five-pointed star, circle and other graphic marks;
the preset image can adopt the following steps: a facial image of the target person, a person identity image (such as an image representing the identity of the host, the interviewee, etc.), etc.;
the outline frame can be drawn by adopting lines such as a dotted line, a solid line and the like, for example, the outline frame can be a dotted line frame drawn for the target person;
in the case of adding a preset mark to a target person, the recognition text can be presented at a preset position of the video data; the preset position for presenting the recognition text can be preset, such as a position below the video.
Here, the mark may also be added for the displayed recognition text. Specifically, adding a mark for the target person in the video data includes:
presenting an identification text at a preset position of the video data;
presenting an identifier of the target person at the head of the recognition text; the identifier may be at least one of: a facial image, a person name, and a person identity (e.g., host, interviewee).
In this way, the user can also associate the target person with the content spoken by the target person through the recognition text and the identifier of the person corresponding to the recognition text.
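As an illustration of adding such a mark, the following sketch draws an outline frame at the target position and places an identifier together with the recognition text next to it using OpenCV; the colours, font and layout are arbitrary choices, not requirements of this embodiment:

```python
import cv2
import numpy as np

def add_mark(frame: np.ndarray, target_box, recognized_text: str,
             identifier: str) -> np.ndarray:
    """Draw an outline frame around the target person and a text label above it.
    target_box: (x, y, w, h) of the target position in the frame.
    identifier: e.g. the person's name or role, shown before the text."""
    x, y, w, h = target_box
    out = frame.copy()
    cv2.rectangle(out, (x, y), (x + w, y + h), (0, 255, 0), 2)        # outline frame
    label = f"{identifier}: {recognized_text}"
    (tw, th), baseline = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 1)
    top = max(y - th - baseline - 4, 0)
    cv2.rectangle(out, (x, top), (x + tw + 4, top + th + baseline + 4),
                  (0, 255, 0), -1)                                    # label background
    cv2.putText(out, label, (x + 2, top + th + 2),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 1)
    return out

# Example on a blank frame; in practice the frame comes from the video data.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
marked = add_mark(frame, (200, 150, 120, 200), "Hello everyone", "Host")
cv2.imwrite("marked_frame.png", marked)
```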
It should be understood that the order of the steps (e.g., determining the first person, determining the second person, determining the position of the person corresponding to the first person, determining the position of the sound source corresponding to the second person, etc.) described in the above embodiments does not mean the order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not limit the implementation processes of the embodiments of the present application in any way.
The data processing method provided by the embodiment of the application acquires target data; the target data includes video data and audio data; performing a person recognition operation for a target person using the target data, the target person corresponding to a speaker in the target data; adding a mark aiming at the target person in the video data; the mark and the identification text determined based on the audio data are presented when the video data are played; therefore, the target person corresponding to the identification text can be determined, the identification text is marked correspondingly to the target person, the user can conveniently correspond the target person to the content spoken by the target person, the content spoken by the target person is understood by combining the identity of the target person, the user can be accurately helped to understand the content spoken by the target person, and the user experience is improved.
Fig. 5 is another schematic flow chart of a data processing method according to an embodiment of the present application; as shown in fig. 5, the data processing method includes:
step 501: the audio acquisition module acquires audio data.
Step 502: performing voiceprint recognition on the audio data, and determining a first speaker according to the target voiceprint characteristics obtained through recognition; and carrying out sound source positioning on the audio data to determine the position of a sound source.
Here, the determining the first speaker according to the recognized voiceprint includes:
and querying a voiceprint database according to the target voiceprint characteristics, and determining a first speaker corresponding to the audio data.
Here, the voiceprint database may adopt the voiceprint database in the method shown in fig. 3; the voiceprint database includes at least one voiceprint model and persona information corresponding to the at least one voiceprint model. The method for constructing the voiceprint database may adopt the same process as that in the method shown in fig. 3, and details are not described here.
Here, the audio acquisition module may be a microphone array; the sound source localization of audio data includes: and calculating the position of the sound source according to the multi-channel audio data of the microphone array. The sound source position is the position of the first speaker in the real scene.
Here, the first speaker corresponds to a second person in the method of fig. 3; the identification operation in step 502 may refer to the person identification operation for the audio data in the method shown in fig. 3, which is not described herein again.
Step 503: the video acquisition module acquires video data.
Here, the video capture module may be a camera, and the video capture module captures video data, including: shooting a video at a specific position in a corresponding real scene through a camera; the captured video includes at least one speaker.
Step 504: and performing face recognition on at least one frame of image in the video data, and determining at least one second speaker and the position of a person corresponding to the at least one second speaker.
Here, a preset image recognition model may be used to perform image recognition on at least one frame of image in the video data, and determine at least one second speaker; and determining the position of the person of the at least one second speaker in the video data.
Specifically, a first spatial coordinate system may be preset for the video data, and a position of the second speaker in the first spatial coordinate system may be determined as the position of the person according to a corresponding position of the second speaker in the video data.
Each frame of image in the video data is an image with the same shape (such as square), the first space coordinate system and the image have a mapping relation, and the position of the second speaker in the first space coordinate system can be determined based on the position of the second speaker in the image.
Here, the face recognition and tracking (i.e., determining the second speaker and the position of the person corresponding to the second speaker in real time) can be achieved by continuously recognizing each frame of image in the video data.
It should be noted that, after the person and the position of the person are identified by using the image recognition (specifically, face recognition), the person can be tracked by using a Tracking Technique (Tracking Technique), and even if the face image is lost (such as when the head is lowered or turned), the position of the person can be updated synchronously. Specifically, after the target person and the target position are determined in the partial frame of the video data, the target person may move, and the corresponding mark (such as a preset icon and an outline frame for the target person) of the target person should change along with the change of the target position of the target person; here, the target person can be tracked in the subsequent image frame of the video data by using the target tracking technology, the changed target position is determined in real time, and the mark for the target person is correspondingly added, so that the effect that the mark changes along with the change of the target position of the target person is finally realized.
Here, the second speaker corresponds to the first person in the method of fig. 3; the identification operation in step 504 may refer to the person identification operation for the video data in the method shown in fig. 3, which is not described herein again.
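A very simple form of the tracking mentioned above is nearest-neighbour association of face detections across frames, falling back to the last known position when the face is temporarily lost; the sketch below is a toy illustration of that idea, not the tracking technique actually employed:

```python
import numpy as np

class SimpleTracker:
    """Toy position tracker: keeps the last known position of the target person
    and updates it from the nearest new face detection, provided that detection
    lies within max_jump pixels. The 80-pixel value is an arbitrary assumption."""

    def __init__(self, initial_position, max_jump: float = 80.0):
        self.position = np.asarray(initial_position, dtype=float)
        self.max_jump = max_jump

    def update(self, detections) -> np.ndarray:
        """detections: list of (x, y) face positions in the current frame.
        Returns the updated target position (unchanged if the face is lost)."""
        if detections:
            pts = np.asarray(detections, dtype=float)
            dists = np.linalg.norm(pts - self.position, axis=1)
            i = int(np.argmin(dists))
            if dists[i] <= self.max_jump:
                self.position = pts[i]
        return self.position

# Example: the face is detected, then lost for one frame, then detected again.
tracker = SimpleTracker((320, 240))
print(tracker.update([(325, 244), (600, 100)]))   # follows the nearby detection
print(tracker.update([]))                         # face lost: position is kept
print(tracker.update([(331, 250)]))               # recovered detection
```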
Step 505: and determining a target speaker and a target position according to the first speaker, the sound source position, the second speaker and the person position.
Here, the target speaker is determined by aligning the sound source position with the person position; on this basis, the speaker information obtained through voiceprint recognition (i.e., the speaker ID) is combined and weighted, so that a more accurate and stable speaker ID is obtained.
Determining a target speaker and a target position according to the first speaker, the sound source position, the second speaker and the person position, comprising:
mapping the positions of the persons corresponding to the at least two speakers and the position of the sound source corresponding to the first speaker to a preset spatial coordinate system;
determining a first person position and a first sound source position which are mapped to the preset space coordinate system and meet a preset distance condition, determining a target speaker by using a first speaker corresponding to the first sound source position and a second speaker corresponding to the first person position, and taking the position of the target speaker corresponding to the preset space coordinate system as a target position.
Here, determining a target speaker using a first speaker corresponding to the first sound source position and a second speaker corresponding to the first person position includes: if the first speaker corresponding to the first sound source position is the same as the second speaker corresponding to the first person position, taking the first speaker corresponding to the first sound source position as the target speaker; if the first speaker corresponding to the first sound source position is different from the second speaker corresponding to the first person position, acquiring a first weight and a second weight; the first weight characterizes a trustworthiness of a target speaker determined from the video data, the second weight characterizes a trustworthiness of a target speaker determined from audio data; weighting the recognition result of the first speaker corresponding to the first sound source position according to the second weight, and weighting the recognition result of the second speaker corresponding to the first person position according to the first weight; and selecting one speaker from a first speaker corresponding to the first sound source position and a second speaker corresponding to the first person position as a target speaker according to a weighting processing result.
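A minimal sketch of the weighted selection just described follows; the particular weight values and per-candidate confidence scores are assumptions introduced for illustration, since this application only specifies that the two recognition results are weighted and one speaker is selected.

```python
def fuse_speaker_ids(first_speaker, second_speaker,
                     audio_score=1.0, video_score=1.0,
                     first_weight=0.6, second_weight=0.4):
    """Select the target speaker from the audio-derived candidate (first
    speaker, via voiceprint and sound source position) and the video-derived
    candidate (second speaker, via face recognition and person position)."""
    if first_speaker == second_speaker:
        return first_speaker
    # The candidates disagree: weight each recognition result by the
    # trustworthiness of its modality and keep the better weighted score.
    audio_weighted = second_weight * audio_score  # second weight: trust in audio
    video_weighted = first_weight * video_score   # first weight: trust in video
    return second_speaker if video_weighted >= audio_weighted else first_speaker
```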
Here, the corresponding position of the target speaker in the preset spatial coordinate system is the target position. The preset spatial coordinate system may be a newly-established spatial coordinate system, and mapping the person position and the sound source position to the preset spatial coordinate system may refer to mapping the person position and the sound source position to the newly-established spatial coordinate system. The newly-established spatial coordinate system has a corresponding relationship with the scene in the video data, so the person position can be mapped into it; it also has a corresponding relationship with the real scene, so the sound source position can be mapped into it.
The preset spatial coordinate system may also be the spatial coordinate system constructed for the video data (i.e., the first spatial coordinate system), in which case the person position already represents the position of the second speaker in the preset spatial coordinate system. Mapping the sound source position to the preset spatial coordinate system then includes: determining, based on the position mapping relationship, the position in the video data to which the sound source position corresponds, namely the position corresponding to the first speaker.
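The sketch below illustrates one way such a mapping and the preset distance condition could look, under the simplifying assumptions that the microphone array and the camera are co-located and that only the horizontal direction matters; neither assumption comes from this application.

```python
def map_sound_source_to_frame(azimuth_deg, camera_fov_deg=90.0):
    """Project a sound-source azimuth (degrees; 0 = camera axis, positive to
    the right) onto a horizontal frame position normalized to [0, 1]."""
    half_fov = camera_fov_deg / 2.0
    clipped = max(-half_fov, min(half_fov, azimuth_deg))
    return 0.5 + clipped / camera_fov_deg


def match_person_to_source(person_positions, source_x, max_distance=0.1):
    """Return the person whose mapped position lies closest to the mapped
    sound source position, provided the preset distance condition is met."""
    best_id, best_distance = None, max_distance
    for person_id, (x, _y) in person_positions.items():
        distance = abs(x - source_x)
        if distance <= best_distance:
            best_id, best_distance = person_id, distance
    return best_id
```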
Here, the operation of step 505 may refer to the person recognition operation performed by using the video data and the audio data in the method shown in fig. 3, which is not described herein again.
Step 506: perform voice recognition on the audio data to obtain source language text.
Step 507: perform machine translation on the source language text to obtain target language text.
Step 508: display the target language text through a screen.
Here, presenting the target language text through a screen includes:
adding a mark for the target speaker in the played video data according to the target position, wherein the text corresponding to the content spoken by the speaker is presented in the mark.
Here, the mark is presented as a window when the video data is played, and the recognized text determined based on the audio data is presented within the window.
During playback of the video data, the text corresponding to the spoken content is displayed for the corresponding speaker, so that what each speaker says is presented in conversational form.
The window may be a floating window or a fixed window with distinct directivity that indicates the speaker of the corresponding content; a directional dialog window may be as shown in fig. 4.
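Purely as an illustration of such a directional dialog window, the following OpenCV-based sketch draws a filled box with the recognized text and a pointer line toward the speaker; it assumes `frame` is a BGR image array and `target_pos` is a position normalized to [0, 1], and it is not the rendering method of this application.

```python
import cv2

def draw_speaker_window(frame, target_pos, text, color=(255, 255, 255)):
    """Draw a simple directional dialog window near the target position and
    render the recognized/translated text inside it."""
    height, width = frame.shape[:2]
    x, y = int(target_pos[0] * width), int(target_pos[1] * height)
    box_w, box_h = 260, 60
    top_left = (max(0, x - box_w // 2), max(0, y - box_h - 40))
    bottom_right = (top_left[0] + box_w, top_left[1] + box_h)
    cv2.rectangle(frame, top_left, bottom_right, color, thickness=-1)
    # A short connector line gives the window its directivity toward the speaker.
    cv2.line(frame, (x, bottom_right[1]), (x, y), color, 2)
    cv2.putText(frame, text, (top_left[0] + 10, top_left[1] + 35),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 1, cv2.LINE_AA)
    return frame
```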
Step 509: synthesize speech according to the target language text, and play the synthesized speech.
Here, synthesizing speech according to the target language text and playing the synthesized speech includes: performing speech synthesis on the target language text (namely the text of another language obtained by translating the text corresponding to the audio data) to obtain synthesized speech, i.e., the translated speech corresponding to the audio data; when the video data is played, the translated speech can be played at the same time.
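Steps 506 to 509 can be read as one pipeline pass per audio segment; the sketch below strings them together with stand-in engine objects (`asr`, `translator`, `tts` and `player` are placeholders for whatever recognition, translation, synthesis and playback components an implementation actually uses).

```python
def interpret_segment(audio_segment, asr, translator, tts, player):
    """One simultaneous-interpretation pass over an audio segment."""
    source_text = asr.transcribe(audio_segment)      # step 506: speech recognition
    target_text = translator.translate(source_text)  # step 507: machine translation
    synthesized = tts.synthesize(target_text)        # step 509: speech synthesis
    player.play(synthesized)                         # played alongside the video
    return source_text, target_text                  # target text shown per step 508
```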
The data processing method provided by the embodiment of the application combines the audio data and the video data to perform person recognition and determine the speaker (namely the target person), which eliminates recognition noise, improves recognition accuracy and the robustness of a system using the method, reduces application limitations and widens the application range. In addition, the correspondence between each speaker and the simultaneously interpreted content (namely what the speaker says) is displayed to the user, which lowers the difficulty of understanding the content, and the conversational, dynamic presentation improves the sensory experience of the audience.
The data processing method according to this embodiment may be applied to an electronic device, specifically the electronic device to which the method shown in fig. 3 applies; in actual application, the related operations of the electronic device may refer to the description of the method in fig. 3 and are not repeated here.
It should be noted that the audio acquisition module and the video acquisition module may be connected to a server or may be modules provided by the server itself; of course, they may also be separate independent modules that acquire the corresponding data and then transmit it to the server, which performs the corresponding data processing.
In order to implement the data processing method of the embodiment of the application, the embodiment of the application also provides a data processing device. FIG. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present application; as shown in fig. 6, the data processing apparatus includes:
an acquisition unit 61 configured to acquire target data; the target data includes video data and audio data;
a first processing unit 62 configured to perform a person recognition operation for a target person corresponding to a speaker in the target data using the target data;
a second processing unit 63 configured to add a tag for the target person in the video data; the marker and the recognized text determined based on the audio data are presented while the video data is played.
In one embodiment, the first processing unit 62 is configured to identify the target person according to the audio data; identifying a target position of the target person according to the target data;
the second processing unit 63 is configured to add a mark for the target person in the video data according to the target position.
In an embodiment, the first processing unit 62 is configured to determine a target voiceprint feature corresponding to the audio data;
inquiring a voiceprint database according to the target voiceprint characteristics, and determining a target person corresponding to the audio data; the voiceprint database comprises at least one voiceprint model and character information corresponding to the at least one voiceprint model;
determining a facial image corresponding to the target person from an image database, performing image recognition on at least one frame of image in the video data according to the facial image, and determining a target position of the target person in the video data; the image database includes a face image of at least one person and person information corresponding to the face image of the at least one person.
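A minimal sketch of the voiceprint-database query described in this and the following embodiments is given below; representing each voiceprint model as an embedding vector and matching by cosine similarity is only one common realization assumed for the example, not a requirement of this application.

```python
import numpy as np

def query_voiceprint_db(target_feature, voiceprint_db, threshold=0.75):
    """Look up the person whose stored voiceprint model best matches the
    target voiceprint feature; return None when no match clears the threshold.
    `voiceprint_db` maps person information (e.g. a person ID) to a vector."""
    best_id, best_score = None, threshold
    target = np.asarray(target_feature, dtype=float)
    for person_id, model in voiceprint_db.items():
        model = np.asarray(model, dtype=float)
        score = float(target @ model /
                      (np.linalg.norm(target) * np.linalg.norm(model) + 1e-12))
        if score >= best_score:
            best_id, best_score = person_id, score
    return best_id
```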
In an embodiment, the first processing unit 62 is configured to determine a target voiceprint feature corresponding to the audio data;
inquiring a voiceprint database according to the target voiceprint characteristics, and determining a target person corresponding to the audio data; the voiceprint database comprises at least one voiceprint model and character information corresponding to the at least one voiceprint model;
determining a sound source position corresponding to the audio data;
determining that the sound source position corresponds to a target position in the video data based on a position mapping relationship; the position mapping relationship represents a position mapping relationship between a real scene and a scene in the video data.
In one embodiment, the first processing unit 62 is configured to identify the target person and the target position of the target person according to the video data and the audio data;
the second processing unit 63 is configured to add a mark for the target person in the video data according to the target position.
In an embodiment, the first processing unit 62 is configured to perform image recognition on at least one image in the video data, and determine at least one first person in the video data and a person position of the at least one first person in the video data;
determining a target voiceprint characteristic corresponding to the audio data; inquiring a voiceprint database according to the target voiceprint characteristics, and determining a second person corresponding to the audio data; determining a sound source position corresponding to the audio data according to the audio data;
mapping the position of the person corresponding to the at least one first person and the position of the sound source to a preset spatial coordinate system; the preset space coordinate system has a position mapping relation with a real scene, and the preset space coordinate system has a position mapping relation with the scene in the video data;
determining a first person position and a first sound source position which are mapped to the preset space coordinate system and meet a preset distance condition, determining a target person by using a first person corresponding to the first person position and a second person corresponding to the first sound source position, and taking the corresponding position of the target person in the preset space coordinate system as a target position;
the voiceprint database comprises at least one voiceprint model and character information corresponding to the at least one voiceprint model; the image database includes a face image of at least one person and person information corresponding to the face image of the at least one person.
Here, the audio data is acquired by a microphone array.
In an embodiment, the first processing unit 62 is configured to, if a first person corresponding to the first person position is the same as a second person corresponding to the first sound source position, regard the first person corresponding to the first person position as the target person;
if a first person corresponding to the first person position is different from a second person corresponding to the first sound source position, acquiring a first weight and a second weight; the first weight represents the credibility of the target person determined according to the video data, and the second weight represents the credibility of the target person determined according to the audio data; weighting the recognition result of the first person corresponding to the first person position according to the first weight, and weighting the recognition result of the second person corresponding to the first sound source position according to the second weight; and selecting one person from a first person corresponding to the first person position and a second person corresponding to the first sound source position as a target person according to a weighting processing result.
In one embodiment, the first processing unit 62 is configured to acquire sound data of each channel in the microphone array; and performing position calculation according to the sound data of each sound channel to obtain the sound source position.
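As one common way to perform the position calculation from per-channel sound data (an assumption for illustration; this application does not prescribe the algorithm), a GCC-PHAT delay estimate between two microphone channels can be converted into a sound-source azimuth:

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the time delay (in seconds) of `sig` relative to `ref` using
    the GCC-PHAT cross-correlation of two microphone channels."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(fs)

def azimuth_from_delay(delay_s, mic_spacing_m, speed_of_sound=343.0):
    """Convert the inter-microphone delay into a sound-source azimuth in degrees."""
    sin_theta = np.clip(delay_s * speed_of_sound / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```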
In an embodiment, the second processing unit 63 is configured to add a mark for the target person in the video data by one of the following methods:
presenting a mark in the form of a floating window or a fixed window at a position corresponding to the target position in the video data; the recognition text is added in the mark;
adding a preset mark aiming at a target person at a position corresponding to the target position in the video data; the preset mark comprises one of the following: presetting an icon, presetting an image and aiming at an outline frame of a target person; and the identification text is presented at a preset position of the video data.
In practical applications, the first processing unit 62 and the second processing unit 63 may be implemented by a processor in the electronic device (e.g., a server or a terminal), such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field Programmable Gate Array (FPGA). The obtaining unit 61 may be implemented by a communication interface in the electronic device.
It should be noted that: the apparatus provided in the foregoing embodiment is only exemplified by the division of each program module when performing data processing, and in practical applications, the above processing may be distributed to different program modules according to needs, that is, the internal structure of the terminal is divided into different program modules to complete all or part of the above-described processing. In addition, the apparatus provided in the above embodiments and the data processing method embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Based on the hardware implementation of the above apparatus, an electronic device is further provided in the embodiments of the present application. Fig. 7 is a schematic diagram of a hardware structure of the electronic device according to an embodiment of the present application. As shown in fig. 7, the electronic device 70 includes a memory 73, a processor 72, and a computer program stored in the memory 73 and executable on the processor 72; when the processor 72 in the electronic device executes the program, the method provided by one or more technical solutions on the electronic device side is implemented.
In particular, the processor 72 located at the electronic device 70, when executing the program, implements: acquiring target data; the target data includes video data and audio data; performing a person recognition operation for a target person using the target data, the target person corresponding to a speaker in the target data; adding a mark aiming at the target person in the video data; the marker and the recognized text determined based on the audio data are presented while the video data is played.
It should be noted that, the specific steps implemented when the processor 72 located in the electronic device 70 executes the program have been described in detail above, and are not described herein again.
It is to be understood that the electronic device also includes a communication interface 71; the various components in the electronic device are coupled together by a bus system 74. It will be appreciated that the bus system 74 is configured to enable connected communication between these components. The bus system 74 includes a power bus, a control bus, a status signal bus, and the like, in addition to the data bus.
It will be appreciated that the memory 73 in this embodiment can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk memory or tape memory. The volatile memory may be Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present application are intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present application may be applied to the processor 72, or implemented by the processor 72. The processor 72 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by an integrated logic circuit of hardware in the processor 72 or by instructions in the form of software. The processor 72 described above may be a general purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 72 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium located in the memory; the processor 72 reads the information in the memory and completes the steps of the above method in combination with its hardware.
The embodiment of the application also provides a storage medium, in particular a computer storage medium, and more particularly a computer readable storage medium. The storage medium has computer instructions stored thereon, and when the instructions are executed by a processor, the steps of the data processing method described above are implemented.
In the several embodiments provided in the present application, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may serve separately as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The technical means described in the embodiments of the present application may be arbitrarily combined without conflict.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (12)

  1. A data processing method, comprising:
    acquiring target data; the target data includes video data and audio data;
    performing a person recognition operation for a target person using the target data, the target person corresponding to a speaker in the target data;
    adding a mark aiming at the target person in the video data; the marker and the recognized text determined based on the audio data are presented while the video data is played.
  2. The method of claim 1,
    the person recognition operation includes:
    identifying the target person according to the audio data;
    identifying a target position of the target person according to the target data;
    the adding of the mark for the target person in the video data comprises:
    and adding marks aiming at the target person in the video data according to the target position.
  3. The method of claim 2,
    the identifying the target person according to the audio data comprises:
    determining a target voiceprint characteristic corresponding to the audio data;
    inquiring a voiceprint database according to the target voiceprint characteristics, and determining a target person corresponding to the audio data; the voiceprint database comprises at least one voiceprint model and character information corresponding to the at least one voiceprint model;
    the identifying the target position of the target person according to the target data includes:
    determining a facial image corresponding to the target person from an image database, performing image recognition on at least one frame of image in the video data according to the facial image, and determining a target position of the target person in the video data; the image database includes a face image of at least one person and person information corresponding to the face image of the at least one person.
  4. The method of claim 2,
    the identifying the target person according to the audio data comprises:
    determining a target voiceprint characteristic corresponding to the audio data;
    querying a voiceprint database according to the target voiceprint characteristics, and determining a target person corresponding to the audio data; the voiceprint database comprises at least one voiceprint model and character information corresponding to the at least one voiceprint model;
    the identifying the target position of the target person according to the target data comprises:
    determining a sound source position corresponding to the audio data;
    determining that the sound source position corresponds to a target position in the video data based on a position mapping relationship; the position mapping relationship represents a position mapping relationship between a real scene and a scene in the video data.
  5. The method of claim 1, wherein the person identification operation comprises:
    identifying the target person and the target position of the target person according to the video data and the audio data;
    the adding of the mark aiming at the target person in the video data comprises the following steps:
    and adding a mark aiming at the target person in the video data according to the target position.
  6. The method of claim 5, wherein identifying the target person and the target location of the target person based on the video data and the audio data comprises:
    performing image recognition on at least one frame of image in the video data, and determining at least one first person in the video data and the person position of the at least one first person in the video data;
    determining a target voiceprint characteristic corresponding to the audio data; inquiring a voiceprint database according to the target voiceprint characteristics, and determining a second person corresponding to the audio data; determining a sound source position corresponding to the audio data according to the audio data;
    mapping the character position corresponding to the at least one first character and the sound source position to a preset spatial coordinate system; the preset space coordinate system has a position mapping relation with a real scene, and the preset space coordinate system has a position mapping relation with the scene in the video data;
    determining a first person position and a first sound source position which are mapped to the preset space coordinate system and meet a preset distance condition, determining a target person by using a first person corresponding to the first person position and a second person corresponding to the first sound source position, and taking the corresponding position of the target person in the preset space coordinate system as a target position;
    the voiceprint database comprises at least one voiceprint model and person information corresponding to the at least one voiceprint model;
    the image database includes a face image of at least one person and person information corresponding to the face image of the at least one person.
  7. The method of claim 6, wherein the determining a target person using a first person corresponding to the first person position and a second person corresponding to the first sound source position comprises:
    if a first person corresponding to the first person position is the same as a second person corresponding to the first sound source position, taking the first person corresponding to the first person position as the target person;
    if a first person corresponding to the first person position is different from a second person corresponding to the first sound source position, acquiring a first weight and a second weight; the first weight represents the credibility of the target person determined according to the video data, and the second weight represents the credibility of the target person determined according to the audio data; weighting the recognition result of the first person corresponding to the first person position according to the first weight, and weighting the recognition result of the second person corresponding to the first sound source position according to the second weight; and selecting one person from a first person corresponding to the first person position and a second person corresponding to the first sound source position as a target person according to a weighting processing result.
  8. The method according to claim 4 or 6, wherein the determining a sound source position corresponding to the audio data comprises:
    acquiring sound data of each sound channel in a microphone array;
    and performing position calculation according to the sound data of each sound channel to obtain the sound source position.
  9. The method of claim 1, wherein adding a tag to the video data for the target person comprises one of:
    presenting a mark in the form of a floating window or a fixed window at a position corresponding to the target position in the video data; the recognition text is added in the mark;
    adding a preset mark aiming at a target person at a position corresponding to the target position in the video data; the preset mark comprises one of the following: presetting an icon, presetting an image and aiming at an outline frame of a target person; and the identification text is presented at a preset position of the video data.
  10. A data processing apparatus, comprising:
    an acquisition unit configured to acquire target data; the target data includes video data and audio data;
    a first processing unit configured to perform a person recognition operation for a target person corresponding to a speaker in the target data using the target data;
    a second processing unit configured to add a tag for the target person in the video data; the marker and the recognized text determined based on the audio data are presented while the video data is played.
  11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1 to 9 when executing the program.
  12. A storage medium having stored thereon computer instructions, which when executed by a processor, carry out the steps of the method of any one of claims 1 to 9.
CN201980100983.7A 2019-12-20 2019-12-20 Data processing method and device, electronic equipment and storage medium Pending CN114556469A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/127087 WO2021120190A1 (en) 2019-12-20 2019-12-20 Data processing method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN114556469A true CN114556469A (en) 2022-05-27

Family

ID=76478195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980100983.7A Pending CN114556469A (en) 2019-12-20 2019-12-20 Data processing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114556469A (en)
WO (1) WO2021120190A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113573096A (en) * 2021-07-05 2021-10-29 维沃移动通信(杭州)有限公司 Video processing method, video processing device, electronic equipment and medium
CN115691538A (en) * 2021-07-29 2023-02-03 华为技术有限公司 Video processing method and electronic equipment
CN114299944B (en) * 2021-12-08 2023-03-24 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium
CN115022733B (en) * 2022-06-17 2023-09-15 中国平安人寿保险股份有限公司 Digest video generation method, digest video generation device, computer device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512348B (en) * 2016-01-28 2019-03-26 北京旷视科技有限公司 For handling the method and apparatus and search method and device of video and related audio
CN106328146A (en) * 2016-08-22 2017-01-11 广东小天才科技有限公司 Video subtitle generation method and apparatus
CN106340294A (en) * 2016-09-29 2017-01-18 安徽声讯信息技术有限公司 Synchronous translation-based news live streaming subtitle on-line production system
US20180174587A1 (en) * 2016-12-16 2018-06-21 Kyocera Document Solution Inc. Audio transcription system
CN110035326A (en) * 2019-04-04 2019-07-19 北京字节跳动网络技术有限公司 Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment

Also Published As

Publication number Publication date
WO2021120190A1 (en) 2021-06-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination