WO2022062471A1 - Audio data processing method, device and system

Info

Publication number
WO2022062471A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
speaker
conference
voiceprint feature
information
Application number
PCT/CN2021/098297
Other languages
French (fr)
Chinese (zh)
Inventor
张鹏 (Zhang Peng)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022062471A1

Classifications

    • G06V 40/10: Recognition of biometric, human-related or animal-related patterns in image or video data; human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 15/25: Speech recognition using the position of the lips, movement of the lips or face analysis
    • G10L 15/26: Speech-to-text systems
    • G10L 17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/22: Interactive procedures; man-machine interfaces
    • H04N 7/15: Conference systems

Definitions

  • the present application relates to the field of communications, and in particular, to a method, device and system for processing audio data.
  • If a voice file contains only one person's voice, the audio data of the entire file can be sent directly to the voiceprint recognition system for identification. If the file contains multiple voices, it must first be segmented, and voiceprint recognition is then performed on each piece of audio data.
  • Existing voiceprint recognition systems usually require more than 10 seconds of audio data, and the longer the sample, the higher the accuracy, so segments cannot be made too short. However, since video conferences contain many free-conversation scenes, a long segment of audio data may contain the speech of several people, which makes the recognition result unreliable.
  • Moreover, the premise of the above solution is that conference participants must register their voiceprints in the voiceprint recognition system in advance. The sound pickup channel, however, strongly affects voiceprint features, and because there are many kinds of pickup channels, it is difficult to guarantee the accuracy of voiceprint recognition for sounds collected through different channels.
  • Embodiments of the present application provide an audio data processing method, device, and system, which are used to accurately classify conference audio data.
  • In a first aspect, an embodiment of the present application provides a method for processing audio data, which specifically includes: a conference record processing device obtains audio data of a first conference site, sound source location information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained by a portrait recognition method and the speaker's speaking time information; the conference record processing device then performs voice segmentation on the audio data to obtain first segmented audio data; finally, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
  • The audio data and its corresponding sound source location information can be packaged into an audio code stream, in which additional field information carries the sound source location information corresponding to the audio data. The audio data processing method may be applied to local or remote conference scenarios, where at least one conference site participates in the conference. Based on the above solution, the additional field information may further include time information of the audio data and site identification information of the first conference site, among other information.
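  • As an illustration of this packaging step, below is a minimal sketch that frames an encoded audio payload with additional field information; the JSON-header framing and the field names (site_id, azimuth_deg, timestamp) are our own assumptions, not the patent's format.

```python
# A minimal sketch of packaging audio data together with additional field
# information (sound source location, time, site identifier). The framing
# scheme and field names are illustrative assumptions.
import json
import struct

def pack_audio_frame(pcm: bytes, site_id: str, azimuth_deg: float,
                     timestamp: str) -> bytes:
    """Prepend a length-prefixed metadata header to an encoded audio payload."""
    extra_fields = json.dumps({
        "site_id": site_id,          # identifies the first conference site
        "azimuth_deg": azimuth_deg,  # sound source location for this payload
        "timestamp": timestamp,      # time information of the audio data
    }).encode("utf-8")
    return struct.pack(">I", len(extra_fields)) + extra_fields + pcm

def unpack_audio_frame(frame: bytes):
    """Recover the additional field information and the audio payload."""
    (hdr_len,) = struct.unpack(">I", frame[:4])
    meta = json.loads(frame[4:4 + hdr_len].decode("utf-8"))
    return meta, frame[4 + hdr_len:]

frame = pack_audio_frame(b"\x00\x01" * 160, "site-01", 30.0, "00:00:15")
meta, payload = unpack_audio_frame(frame)
print(meta["azimuth_deg"], len(payload))
```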
  • Portrait recognition methods include face recognition and human body attribute recognition.
  • The speaker corresponding to facial features is obtained through face recognition, while human body attribute recognition identifies the user's overall clothing or physical features to obtain the speaker corresponding to those physical features or to the appearance of the user's clothing.
  • The speaker identity information may be user identity information (such as the speaker's employee number in the company, or an ID number or phone number registered in the company's internal database) or user body attribute identification information (for example, that in the current meeting the user wears a white top and black trousers, or that the user has a visible mark on the arm).
  • the speaking time information may be a period of time or two time points.
  • For example, the speaking time information is the 30 seconds from 00:00:15 to 00:00:45 after the current conference starts; alternatively, the speaking time information only includes the two time points "00:00:15" and "00:00:45".
  • It can be understood that, in the embodiments of the present application, times in the form "00:00:00" follow the rule "hours:minutes:seconds", so the time point "00:00:15" denotes 15 seconds after the start of the meeting.
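  • As a small illustration, this timing rule can be parsed as follows (the helper name is our own):

```python
# A small helper for the "hours:minutes:seconds" timing rule described above;
# purely illustrative.
def offset_seconds(hms: str) -> int:
    """Convert an "HH:MM:SS" offset from the conference start into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

start, end = offset_seconds("00:00:15"), offset_seconds("00:00:45")
print(end - start)  # 30 seconds of speaking time
```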
  • In the embodiments of the present application, the conference record processing device obtains an identity recognition result indicating the correspondence between speaker identity information and speaking time information, and then combines that result with voiceprint features to further identify the audio data, so that voice data can be accurately classified without pre-registering users' voiceprint features.
  • The operation by which the conference record processing device determines the speaker corresponding to the first segmented audio data according to its voiceprint feature and the identity recognition result may be as follows:
  • If the identity recognition result indicates unique speaker identity information for the first segmented audio data, the conference record processing device determines the speaker directly from that information. That is, if the identity recognition result obtained for the first segmented audio data indicates that its only speaker is user01, with corresponding voiceprint feature VP01, the conference record processing device determines that the speaker of the first segmented audio data is user01.
  • Alternatively, the conference record processing device compares the voiceprint feature of the first segmented audio data with that of second segmented audio data, where the second segmented audio data is also obtained by the conference record processing device through voice segmentation of the audio data and corresponds to unique speaker identity information. If the voiceprint feature of the first segmented audio data is consistent with that of the second segmented audio data, the conference record processing device determines the speaker corresponding to the first segmented audio data from the speaker identity information corresponding to the second segmented audio data.
  • the speaker identity information of the second segment of audio data has been determined to be user02, the corresponding voiceprint feature is VP02, the voiceprint feature of the first segment of audio data is VP02, and the corresponding speaker identity information includes user03 and user02;
  • The above shows that the voiceprint features of the first and second segmented audio data are both VP02, and the second segmented audio data establishes that the speaker corresponding to VP02 is user02; it can therefore be determined that the speaker of the first segmented audio data is also user02.
  • In another case, the conference record processing device determines the speaker corresponding to the first segmented audio data from the speaker identity information and voiceprint feature of the first segmented audio data together with the speaker identity information and voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to at least two pieces of speaker identity information. That is, the conference record processing device can comprehensively determine the speaker of each segmented audio data from the voiceprint features of multiple segmented audio data and their corresponding speaker identity information.
  • the speaker identity information of the second segment of audio data has been determined as user02 and user03, the corresponding voiceprint feature is VP02, the voiceprint feature of the first segment of audio data is VP03, and the corresponding speaker identity information includes user03 and user02 , the voiceprint feature of the third segment of audio data is VP03, and the corresponding speaker identity information is user03 and user01.
  • The voiceprint features of the first and third segmented audio data are both VP03, and the speaker identity sets corresponding to these two segments have a unique intersection, namely user03, who is therefore determined as their speaker (see the sketch below).
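  • The following minimal sketch illustrates this intersection rule with the example's values; the helper name and data layout are ours, not the patent's.

```python
# Segments that share a voiceprint feature are attributed to the single
# identity common to their candidate speaker sets (unique intersection).
from collections import defaultdict

segments = [
    # (segment id, voiceprint feature, candidate speaker identities)
    ("seg1", "VP03", {"user03", "user02"}),
    ("seg2", "VP02", {"user02", "user03"}),
    ("seg3", "VP03", {"user03", "user01"}),
]

def resolve_by_intersection(segments):
    by_voiceprint = defaultdict(list)
    for seg_id, vp, candidates in segments:
        by_voiceprint[vp].append((seg_id, candidates))
    speakers = {}
    for vp, segs in by_voiceprint.items():
        common = set.intersection(*(cands for _, cands in segs))
        if len(common) == 1:          # unique intersection -> speaker resolved
            speaker = common.pop()
            for seg_id, _ in segs:
                speakers[seg_id] = speaker
    return speakers

print(resolve_by_intersection(segments))  # seg1 and seg3 resolve to user03
```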
  • In yet another case, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site. The long-term voiceprint feature record includes the historical voiceprint feature record of the first conference site, which indicates the correspondence between voiceprint features, speakers, and channel identifiers.
  • When the conference record processing device determines the speaker corresponding to the first segmented audio data according to its voiceprint feature, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the conference record processing device compares the voiceprint feature of a first speaker in the current conference of the first conference site with that speaker's voiceprint feature in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are consistent, the conference record processing device determines that the long-term voiceprint feature record is usable, and it then determines the speaker corresponding to the first segmented audio data by comparing the voiceprint feature of the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the first conference site.
  • the combination of short-term processing and long-term processing can improve the accuracy of audio data classification as much as possible.
  • Further, the conference record processing device compares the voiceprint feature of the first speaker in the current conference of the first conference site with that speaker's voiceprint feature in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference. If the two voiceprint features are inconsistent, the conference record processing device registers the voiceprint feature, the channel identifier, and the speaker corresponding to the voiceprint feature from the current conference of the first conference site, and updates the long-term voiceprint feature record.
  • In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, so that the long-term voiceprint feature record remains usable.
  • The corresponding voiceprint features and speakers are registered dynamically, so registration is no longer limited to voiceprint features tied to fixed channel identifiers, which effectively enables accurate classification of the audio data.
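  • A hedged sketch of this validate-then-update logic follows; the record layout and the cosine-similarity threshold are illustrative assumptions, not the patent's method.

```python
# Validate the long-term voiceprint feature record against a confirmed speaker
# of the current conference; re-register and update the record on mismatch.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# long-term record: channel id -> (speaker, voiceprint feature vector)
long_term = {"ch-01": ("user02", [0.9, 0.1, 0.2])}

def check_and_update(channel, speaker, current_vp, record, threshold=0.8):
    entry = record.get(channel)
    if entry and entry[0] == speaker and cosine(entry[1], current_vp) >= threshold:
        return True   # record is consistent and therefore usable
    # inconsistent: register voiceprint, channel id and speaker; update record
    record[channel] = (speaker, current_vp)
    return False

usable = check_and_update("ch-01", "user02", [0.88, 0.12, 0.21], long_term)
print(usable, long_term["ch-01"][0])
```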
  • Optionally, the conference record processing device can acquire voiceprint identification information for the voiceprint feature of the first segmented audio data, and then establish a correspondence between that voiceprint identification information and the speaker corresponding to the first segmented audio data. A one-to-one correspondence between voiceprint features and speakers facilitates subsequent classification and processing of audio data.
  • the conference recording processing device may be a recording and broadcasting server or a functional module integrated in the multi-point control unit. Therefore, the audio code stream can be forwarded by the multipoint control unit to the conference record processing device, and the identification result is sent to the conference record processing device by the video conference terminal.
  • Optionally, the audio code stream is forwarded to the conference record processing apparatus by the multipoint control unit after conference-site selection, which avoids unnecessary data transmission and reduces the burden on the network.
  • the specific operation of the conference recording processing apparatus for performing voice segmentation on the audio data may be as follows: the conference recording processing apparatus performs voice segmentation on the audio data according to the sound source location information and the human voice detection technology. This allows for more precise segmentation of audio data.
  • In a second aspect, an embodiment of the present application provides a method for processing audio data, which includes: a video conference terminal performs sound source localization on audio data of a first conference site to obtain sound source location information corresponding to the audio data; the video conference terminal obtains an identity recognition result according to the sound source location and a portrait recognition method, where the identity recognition result indicates the correspondence between speaker identity information and speaking time information; and the video conference terminal sends the identity recognition result, the audio data, and the sound source location information corresponding to the audio data to the conference record processing device.
  • In the embodiments of the present application, the video conference terminal acquires the speaker's image information by performing sound source localization on the audio data, obtains through portrait recognition of that image information an identity recognition result giving the correspondence between speaker identity information and speaking time information, and then sends the identity recognition result to the conference record processing device. The conference record processing device combines the identity recognition result with voiceprint features to further identify the audio data, so that accurate classification of speech data can be achieved without pre-registering users' voiceprint features.
  • The specific process by which the video conference terminal performs portrait recognition may be as follows: the video conference terminal obtains the portrait information corresponding to the sound source location; it performs image recognition on the portrait information to obtain face information and/or body attribute information; it determines the speaker identity information according to the face information and/or body attribute information; and it establishes a correspondence between the speaking time information and the speaker identity information to obtain the identity recognition result.
  • The speaker identity information may be user identity information (such as the speaker's employee number in the company, or an ID number or phone number registered in the company's internal database) or user body attribute identification information (for example, that in the current meeting the user wears a white top and black trousers, or that there is a visible mark on the user's arm).
  • The speaking time information may be a period of time or two time points. For example, the speaking time information is the 30 seconds from 00:00:15 to 00:00:45 after the current conference starts; alternatively, it only includes the two time points "00:00:15" and "00:00:45". As above, times in the form "00:00:00" follow the rule "hours:minutes:seconds", so "00:00:15" denotes 15 seconds after the start of the meeting.
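  • The identity recognition result can be pictured as a list of correspondences between speaker identity information and speaking time information, as in this minimal sketch; the field names are illustrative, not taken from the patent.

```python
# An illustrative data structure for the identity recognition result: each
# entry maps speaker identity information to a speaking time span.
from dataclasses import dataclass

@dataclass
class IdentityEntry:
    speaker: str      # user identity or body-attribute identification
    start: str        # "HH:MM:SS" offset from the conference start
    end: str

identification_result = [
    IdentityEntry("user01", "00:00:15", "00:00:45"),
    IdentityEntry("white top, black trousers", "00:00:50", "00:01:10"),
]
for entry in identification_result:
    print(entry.speaker, entry.start, entry.end)
```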
  • the video conference terminal may also be used as the conference record processing device to implement the method of the first aspect, as follows:
  • The video conference terminal acquires the audio data of the current conference site and detects segmented audio data in it according to the sound source location and human voice detection; it then acquires the voiceprint feature of the segmented audio data and determines the speaker corresponding to the segmented audio data from the voiceprint feature and the identity recognition result.
  • In the embodiments of the present application, the video conference terminal acquires an identity recognition result indicating the correspondence between speaker identity information and speaking time information, and then combines that result with voiceprint features to further identify the audio data. In this way, accurate classification of voice data can be achieved without pre-registering users' voiceprint features.
  • the operation of the video conference terminal to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result may be as follows:
  • If the identity recognition result indicates unique speaker identity information, the video conference terminal determines the speaker corresponding to the first segmented audio data directly from that information. That is, if the identity recognition result obtained for the first segmented audio data indicates that its only speaker is user01, with corresponding voiceprint feature VP01, the video conference terminal determines that the speaker of the first segmented audio data is user01.
  • Alternatively, the video conference terminal compares the voiceprint feature of the first segmented audio data with that of second segmented audio data, where the second segmented audio data is obtained by the video conference terminal through voice segmentation of the audio data and corresponds to unique speaker identity information. If the voiceprint feature of the first segmented audio data is consistent with that of the second segmented audio data, the video conference terminal determines the speaker corresponding to the first segmented audio data from the speaker identity information corresponding to the second segmented audio data.
  • the speaker identity information of the second segment of audio data has been determined to be user02, the corresponding voiceprint feature is VP02, the voiceprint feature of the first segment of audio data is VP02, and the corresponding speaker identity information includes user03 and user02;
  • The above shows that the voiceprint features of the first and second segmented audio data are both VP02, and the second segmented audio data establishes that the speaker corresponding to VP02 is user02; it can therefore be determined that the speaker of the first segmented audio data is also user02.
  • In another case, the video conference terminal determines the speaker corresponding to the first segmented audio data from the speaker identity information and voiceprint feature of the first segmented audio data together with those of second segmented audio data, where the second segmented audio data is obtained by the video conference terminal through voice segmentation of the audio data and corresponds to at least two pieces of speaker identity information. That is, the video conference terminal can comprehensively determine the speaker of each segmented audio data from the voiceprint features of multiple segmented audio data and their corresponding speaker identity information.
  • the speaker identity information of the second segment of audio data has been determined as user02 and user03, the corresponding voiceprint feature is VP02, the voiceprint feature of the first segment of audio data is VP03, and the corresponding speaker identity information includes user03 and user02 , the voiceprint feature of the third segment of audio data is VP03, and the corresponding speaker identity information is user03 and user01.
  • The voiceprint features of the first and third segmented audio data are both VP03, and the speaker identity sets corresponding to these two segments have a unique intersection, namely user03, who is therefore determined as their speaker.
  • In yet another case, the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site. The long-term voiceprint feature record includes the historical voiceprint feature record of the first conference site, which indicates the correspondence between voiceprint features, speakers, and channel identifiers.
  • When the video conference terminal determines the speaker corresponding to the first segmented audio data according to its voiceprint feature, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the video conference terminal compares the voiceprint feature of a first speaker in the current conference of the first conference site with that speaker's voiceprint feature in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference. If the two voiceprint features are consistent, the video conference terminal determines that the long-term voiceprint feature record is usable, and it then determines the speaker corresponding to the first segmented audio data by comparing the voiceprint feature of the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the conference site.
  • the combination of short-term processing and long-term processing can improve the accuracy of audio data classification as much as possible.
  • Further, the video conference terminal compares the voiceprint feature of the first speaker in the current conference of the first conference site with that speaker's voiceprint feature in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference. If the two voiceprint features are inconsistent, the video conference terminal registers the voiceprint feature, the channel identifier, and the speaker corresponding to the voiceprint feature from the current conference of the first conference site, and updates the long-term voiceprint feature record.
  • In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, so that the long-term voiceprint feature record remains usable.
  • The corresponding voiceprint features and speakers are registered dynamically, so registration is no longer limited to voiceprint features tied to fixed channel identifiers, which effectively enables accurate classification of the audio data.
  • Optionally, the video conference terminal can acquire voiceprint identification information for the voiceprint feature of the first segmented audio data, and then establish a correspondence between that voiceprint identification information and the speaker corresponding to the first segmented audio data. A one-to-one correspondence between voiceprint features and speakers facilitates subsequent classification and processing of audio data.
  • the present application provides a conference record processing device, which has a function of implementing the behavior of the conference record processing device in the first aspect.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the apparatus includes units or modules for performing the steps of the above first aspect.
  • the device includes: an acquisition module for acquiring audio data of the first venue, sound source location information corresponding to the audio data, and an identity recognition result, where the identity recognition result is used to indicate the speaker identity information obtained by the portrait recognition method The corresponding relationship with the speaking time information of the speaker; the processing module is used to perform voice segmentation on the audio data to obtain the first segmented audio data of the audio data; according to the voiceprint feature of the first segmented audio data and the identification result to determine the speaker corresponding to the first segment of audio data.
  • it also includes a storage module for storing necessary program instructions and data of the conference record processing device.
  • the apparatus includes: a processor and a transceiver, where the processor is configured to support the conference record processing apparatus to perform corresponding functions in the method provided in the first aspect.
  • The transceiver is used to support communication between the conference record processing apparatus and other devices in the conference system, for example receiving the audio data and the identity recognition result involved in the above method as sent by the video conference terminal.
  • the apparatus may further include a memory, which is used for coupling with the processor, and which stores necessary program instructions and data of the conference record processing apparatus.
  • When the device is a chip in a conference record processing device, the chip includes a processing module and a transceiver module.
  • the transceiver module may be, for example, an input/output interface, pin or circuit on the chip, and transmits the received audio data and identification result of the first conference venue to other chips or modules coupled to the chip.
  • The processing module can be, for example, a processor, where the processor is configured to perform voice segmentation on the audio data to obtain the first segmented audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
  • the processing module can execute the computer-executed instructions stored in the storage unit, so as to support the conference record processing apparatus to execute the method provided in the first aspect.
  • the storage unit can be a storage unit in the chip, such as a register, a cache, etc.
  • The storage unit can also be a storage unit located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
  • the apparatus includes: a processor, a radio frequency circuit and an antenna.
  • The processor is used to control the functions of each circuit part and to determine the speaker corresponding to the first segmented audio data; the result then undergoes analog conversion, filtering, amplification, and up-conversion in the radio frequency circuit before being sent through the antenna to the automatic speech recognition server. Optionally, the device further includes a memory that stores necessary program instructions and data of the conference record processing device.
  • The device includes a communication interface and a logic circuit, where the communication interface is used to acquire an audio code stream and an identity recognition result of the first conference site. The audio code stream includes audio data and additional field information, the additional field information includes the sound source location information corresponding to the audio data, and the identity recognition result indicates the correspondence between speaker identity information obtained by a portrait recognition method and the speaker's speaking time information. The logic circuit is used to perform voice segmentation on the audio data to obtain the first segmented audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
  • The processor mentioned in any of the above may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling program execution of the audio data processing methods of the above aspects.
  • an embodiment of the present application provides a video conference device, the device having a function of implementing the behavior of the video conference terminal in the second aspect.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the apparatus includes units or modules for performing the steps of the second aspect above.
  • the device includes: a processing module configured to perform sound source localization on the audio data of the first conference venue to obtain sound source orientation information corresponding to the audio data; obtain an identity recognition result according to the sound source orientation and the portrait recognition method, The identification result is used to indicate the correspondence between speaker identification information and speaking time information;
  • the sending module is used for sending the identification result, the audio data and the sound source position information corresponding to the audio data to the conference record processing device.
  • it also includes a storage module for storing necessary program instructions and data of the video conference device.
  • the apparatus includes: a processor and a transceiver, where the processor is configured to support the video conference apparatus to perform corresponding functions in the method provided in the second aspect.
  • The transceiver is used to support communication between the video conference device and the other devices in the conference system, and to send the audio code stream and the identity recognition result to the conference record processing device.
  • the apparatus may further include a memory, which is used for coupling with the processor, and which stores necessary program instructions and data of the video conference apparatus.
  • When the device is a chip in a video conference device, the chip includes a processing module and a transceiver module. The processing module is configured to perform sound source localization on the audio data to obtain the sound source location information corresponding to the audio data, and to obtain the identity recognition result according to the sound source location and the portrait recognition method, where the identity recognition result indicates the correspondence between speaker identity information and speaking time information.
  • the transceiver module may be, for example, an input/output interface, a pin or a circuit on the chip, and the configuration information is transmitted to other chips or modules coupled to the chip.
  • the processing module can execute the computer-executed instructions stored in the storage unit, so as to support the video conference device to perform the method provided in the second aspect.
  • the storage unit can be a storage unit in the chip, such as a register, a cache, etc.
  • The storage unit can also be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, and the like.
  • the apparatus includes: a processor, a baseband circuit, a radio frequency circuit and an antenna.
  • the processor is used to control the functions of each circuit, and the baseband circuit is used to generate data packets containing audio code streams and identification results.
  • the device further includes a memory, which stores necessary program instructions and data of the video conference device.
  • the apparatus includes: a communication interface and a logic circuit.
  • The logic circuit is used to perform sound source localization on the audio data of the first conference site to obtain the sound source location information corresponding to the audio data, and to obtain the identity recognition result according to the sound source location and the portrait recognition method, where the identity recognition result indicates the correspondence between speaker identity information and speaking time information. The communication interface is used to send the identity recognition result to the conference record processing device and to send the audio data to the multipoint control unit.
  • the processor mentioned in any of the above may be a CPU, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of programs of the audio data processing methods in the above aspects.
  • an embodiment of the present application provides a computer-readable storage medium, where computer instructions are stored in the computer storage medium, and the computer instructions are used to execute the method in any possible implementation manner of any one of the foregoing aspects.
  • the embodiments of the present application provide a computer program product including instructions, which, when executed on a computer, cause the computer to execute the method in any one of the foregoing aspects.
  • The present application provides a chip system. The chip system includes a processor for supporting a conference record processing device or a video conference device to implement the functions involved in the above aspects, for example generating or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory for storing necessary program instructions and data of the conference record processing device or the video conference device, so as to realize the function of any one of the above aspects.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • an embodiment of the present application provides a conference system, which includes the conference record processing device and the video conference device according to the above aspect.
  • FIG. 1A is a schematic diagram of an embodiment of a conference system architecture in an embodiment of the present application.
  • FIG. 1B is a schematic diagram of another embodiment of a conference system architecture in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an embodiment of a method for processing audio data in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a scene in which a video conference terminal collects image information in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another embodiment of a method for processing audio data in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another embodiment of a method for processing audio data in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another embodiment of a method for processing audio data in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an embodiment of a conference record processing apparatus in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of another embodiment of a conference record processing apparatus in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an embodiment of a video conference terminal in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another embodiment of a video conference terminal in an embodiment of the present application.
  • The naming or numbering of steps in this application does not mean that the steps must be executed in the temporal or logical order indicated by that naming or numbering; the execution order of named or numbered process steps may be changed according to the technical purpose, as long as the same or similar technical effects can be achieved.
  • The division of units in this application is a logical division, and other division methods may be used in practical applications; for example, multiple units may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, and indirect coupling or communication connection between units may be electrical or take other similar forms, which is not restricted in this application.
  • Units or sub-units described as separate components may or may not be physically separated, may or may not be physical units, and may be distributed over multiple circuit units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this application.
  • the technical solutions of the embodiments of the present invention can be applied to local conference or remote conference scenarios.
  • the specific system architecture of the embodiment of the present invention may include a plurality of video conference terminals, a multipoint control unit, a recording server, and an automatic speech recognition (Automatic Speech Recognition, ASR) server.
  • As shown in FIG. 1A, each of the multiple video conference terminals collects conference audio data and image information of the participants, and identifies the speaker among the participants through the image information to obtain an identity recognition result. The video conference terminal then sends the audio data and the identity recognition result to the recording and broadcasting server.
  • the recording and broadcasting server classifies the audio data according to the identification result and the direction of the sound source and sends it to the ASR server.
  • the ASR server outputs conference records through the voice transcription function.
  • In another system architecture, the function of the recording and broadcasting server is integrated into the multipoint control unit (equivalent to the conference record processing module in FIG. 1B).
  • Each of the multiple video conference terminals (video conference terminal 01 to video conference terminal 03 shown in FIG. 1B) collects conference audio data and image information of the participants, and identifies the speaker among the participants through the image information to obtain an identity recognition result.
  • the video conference terminal sends the audio data and the identification result to the multipoint control unit, and the conference record processing module in the multipoint control unit classifies the audio data and sends it to the ASR server. Finally, the ASR server outputs the meeting record through the voice transcription function.
  • An embodiment of the audio data processing method in the embodiment of the present application includes:
  • the video conference terminal collects audio data.
  • a conference may include multiple sites, each site corresponds to at least one video conference terminal, and each site has at least one participant.
  • For ease of description, one video conference terminal in a conference site is taken as an example.
  • the video conference terminal uses the microphone to pick up the audio data of each speaker in real time.
  • the video conference terminal acquires the sound source bearing of the audio data.
  • the video conference terminal can acquire the sound source azimuth corresponding to the audio data, and establish a corresponding relationship between the audio data and the sound source azimuth.
  • For example, the sound source of the audio data collected by the video conference terminal from 00:00:15 to 00:00:30 after the conference starts is located about 30 degrees east of the video conference terminal.
  • the sound source localization is allowed to have errors, so the sound source bearing can be a range value. For example, if the sound source is located at 30 degrees east, the specific range may be 28 degrees east to 32 degrees east.
  • the video conference terminal can obtain the sound source orientation of the audio data in the following possible implementation manners:
  • an array microphone is deployed on the video conference terminal, and the sound source azimuth of the audio data is determined through sound beam information picked up by the array microphone.
  • Alternatively, the venue additionally deploys a device or system dedicated to sound source localization, which serves as a calibration reference point to determine the sound source location of the audio data; the sound source location is then sent to the video conference terminal.
  • the sound source localization may adopt the above solution or any other possible implementation manner, as long as the sound source orientation of the audio data can be obtained, and the specific solution is not limited here.
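  • For the array-microphone implementation, one classical approach (not specified by the patent) estimates the direction of arrival from the inter-microphone time delay; the two-microphone sketch below only illustrates the principle, and real systems use more microphones and robust correlation.

```python
# Direction-of-arrival estimate from the time delay (TDOA) between two
# microphones, via cross-correlation. Purely illustrative.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def doa_two_mics(sig_left, sig_right, mic_distance, sample_rate):
    """Return the bearing (degrees from broadside) of the dominant source."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)     # delay in samples
    delay = lag / sample_rate                        # delay in seconds
    # clamp to the physically possible range before taking arcsin
    ratio = np.clip(delay * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(ratio))

# synthetic check: the same chirp arriving 5 samples later at the right mic
fs = 16000
t = np.arange(fs // 10) / fs
chirp = np.sin(2 * np.pi * 500 * t * (1 + t))
left = np.concatenate([chirp, np.zeros(5)])
right = np.concatenate([np.zeros(5), chirp])
print(round(doa_two_mics(left, right, 0.2, fs), 1))
```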
  • the video conference terminal performs voice segmentation on the audio data through human voice detection to obtain segmented audio data.
  • the video conference terminal performs voice segmentation on the received audio data according to human voice detection to obtain different segmented audio data.
  • Specifically, the video conference terminal can distinguish the previous voice segment from the next one by the interval of a silent segment, or determine through an algorithm whether a segment is human voice or non-human voice and split the surrounding human-voice segments at the non-human-voice portions. For example, the video conference terminal collects audio data from 00:00:15 to 00:00:30 after the conference starts, detects silence from 00:00:30 to 00:00:32, collects audio data from 00:00:32 to 00:00:45, and detects silence from 00:00:45 to 00:00:50.
  • Then the video conference terminal can treat the audio data collected from 00:00:15 to 00:00:30 as one segmented audio data, and the audio data collected from 00:00:32 to 00:00:45 as the next segmented audio data.
  • It can be understood that times in the form "00:00:00" follow the rule "hours:minutes:seconds", so the time point "00:00:15" denotes 15 seconds after the start of the meeting.
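  • A minimal energy-threshold sketch of this silence-based splitting follows; the frame size and threshold are illustrative, and a production system would use a real voice activity detector.

```python
# Split audio at silent intervals: frames whose energy stays below a threshold
# for long enough end the current segment.
import numpy as np

def split_on_silence(samples, sample_rate, frame_ms=20, energy_thresh=1e-3,
                     min_gap_frames=5):
    frame = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame
    voiced = [np.mean(samples[i*frame:(i+1)*frame] ** 2) > energy_thresh
              for i in range(n_frames)]
    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:      # long enough silence ends a segment
                segments.append((start * frame, (i - gap + 1) * frame))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame, n_frames * frame))
    return segments  # (start_sample, end_sample) pairs

fs = 16000
speech = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
audio = np.concatenate([speech, np.zeros(fs // 2), speech])
print(split_on_silence(audio, fs))  # two segments separated by the silence
```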
  • the video conference terminal collects image information within the azimuth range of the sound source according to the azimuth of the sound source.
  • the video conference terminal determines an image information collection area of the video conference terminal according to the sound source azimuth corresponding to the audio data obtained in step 202, and then collects image information in the image information collection area.
  • the video conference terminal may collect the image information in the form of capturing a photo, or may capture a picture frame corresponding to the audio data in the video data as the image information, and the specific form is not limited here.
  • the camera of the video conference terminal can be fixed or can be deployed to be rotatable, and the specific situation is not limited here.
  • the video conference terminal acquires images within the fixed shooting range, and then calculates and extracts image information corresponding to the audio data according to the sound source orientation.
  • the video conference terminal can adjust the shooting range of the camera according to the direction of the sound source, so as to obtain image information corresponding to the audio data.
  • the video conference terminal is located above the conference screen, and the participants are located on both sides of the conference table.
  • The video conference terminal can obtain image information within a certain angle range according to the sound source location. When the image information of speaker 1 is collected according to speaker 1's sound source localization, the image information area contains only speaker 1; when the image information of speaker 2 is collected according to speaker 2's sound source localization, the image information area includes speaker 1 and another participant.
  • the video conference terminal performs portrait recognition on the image information to obtain an identity recognition result.
  • the video conference terminal performs face recognition and human body attribute recognition on the image information to obtain an identity recognition result, and the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information.
  • The speaker corresponding to facial features is obtained through face recognition, while human body attribute recognition identifies the user's overall clothing or physical features to obtain the speaker corresponding to those physical features or to the appearance of the user's clothing.
  • The speaker identity information may be user identity information (such as the speaker's employee number in the company, or an ID number or phone number registered in the company's internal database) or user body attribute identification information (for example, that in the current meeting the user wears a white top and black trousers, or that the user has a visible mark on the arm).
  • the speaking time information may be a period of time or two time points.
  • For example, the speaking time information is the 30 seconds from 00:00:15 to 00:00:45 after the current conference starts; alternatively, it only includes the two time points "00:00:15" and "00:00:45".
  • The specific operation by which the video conference terminal acquires the speaker's identity information may be as follows: if the image information contains a clear, identifiable face, the video conference terminal uses face recognition technology to identify the face in the image information and compares it with a stored face database to determine the user identity information corresponding to that face; if the face information in the image fails to meet the recognition requirements (for example, the facial features are insufficient for face recognition, or no face appears in the image), the video conference terminal can perform human body attribute recognition to obtain body attribute information and determine the user's body attribute identification information from it.
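  • This fallback can be pictured as follows; every function here is a stand-in for a real face or body-attribute model, and the database layout is an assumption for illustration.

```python
# Try face recognition against a stored face database first; fall back to
# human body attribute recognition when no usable face is found.
from typing import Optional

FACE_DB = {"face-emb-001": "user01"}   # face embedding id -> user identity

def detect_face(image) -> Optional[str]:
    """Stand-in detector: return a face embedding id, or None if no clear face."""
    return image.get("face")

def match_face(face_emb: str) -> Optional[str]:
    return FACE_DB.get(face_emb)

def body_attributes(image) -> str:
    """Stand-in body-attribute recognizer (clothing, visible marks, ...)."""
    return image.get("clothes", "unknown attributes")

def identify_speaker(image) -> str:
    face = detect_face(image)
    if face is not None:
        user = match_face(face)
        if user is not None:
            return user                 # user identity information
    return body_attributes(image)       # user body attribute identification

print(identify_speaker({"face": "face-emb-001"}))                 # -> user01
print(identify_speaker({"clothes": "white top, black trousers"}))
```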
  • the video conference terminal packages the audio data and the corresponding sound source azimuth into an audio code stream and sends it to the multipoint control unit, and sends the identification result to the recording and broadcasting server.
  • the video conference terminal packages the audio data and the sound source azimuth corresponding to the audio data into an audio code stream and sends it to the multipoint control unit.
  • the video conference terminal encodes the audio data into an audio code stream, and then adds additional field information to the corresponding audio code stream, and uses the additional field information to indicate sound source location information corresponding to the audio data.
  • the identity recognition result obtained by the video conferencing terminal itself through the portrait recognition can be directly sent to the recording server.
  • the multipoint control unit sends the audio stream sent by the video conference terminal to the recording and broadcasting server.
  • After receiving the audio code stream sent by the video conference terminal, the multipoint control unit determines the conference site to which the video conference terminal belongs according to the identifier assigned to that terminal, adds the conference site identifier to the audio code stream, and sends the audio code stream to the recording and broadcasting server.
  • the multipoint control unit may filter the audio data of each conference site, and then select the audio data of one or more conference sites to send to the recording and broadcasting server.
  • For example, the multipoint control unit can compare the volume levels of the audio data of each conference site and select, for forwarding, audio data whose volume is greater than a preset threshold; alternatively, it can determine through an algorithm which audio data has a voice duration exceeding a preset threshold and forward that. The specific filtering conditions are not limited here. This reduces the amount of processing and thus speeds it up.
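  • A minimal sketch of the volume-based variant of this selection follows; the stream structure and threshold value are illustrative assumptions.

```python
# Forward only the conference sites whose audio volume (RMS) exceeds a
# preset threshold.
import numpy as np

def select_sites(site_streams, volume_thresh=0.01):
    """site_streams: dict of site id -> PCM samples; return sites to forward."""
    selected = []
    for site_id, samples in site_streams.items():
        rms = float(np.sqrt(np.mean(np.square(samples))))
        if rms > volume_thresh:        # loud enough to be worth forwarding
            selected.append(site_id)
    return selected

streams = {
    "site-01": 0.2 * np.random.randn(16000),    # active speech
    "site-02": 0.001 * np.random.randn(16000),  # near silence
}
print(select_sites(streams))  # likely ["site-01"]
```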
  • the recording and broadcasting server decodes the audio stream to obtain audio data, and performs voice segmentation on the audio data to obtain the segmented audio data.
  • Specifically, the recording and broadcasting server can decode the audio code stream to obtain the audio data and the conference site identifier, and then store the audio data according to the site identifier. At the same time, the recording and broadcasting server performs voice segmentation on the audio data according to the sound source location of the audio data and human voice detection technology, thereby obtaining segmented audio data. It can be understood that, in this embodiment, by segmenting according to both sound source location and human voice detection, the recording and broadcasting server can classify the audio data reported by the video conference terminal at a finer granularity.
  • For example, if the video conference terminal detects human voice from 00:00:15 to 00:00:30, it treats the audio data collected in that interval as a single segmented audio data. If, however, a speaker at sound source location 1 actually speaks from 00:00:15 to 00:00:25 and another speaker at sound source location 2 speaks from 00:00:25 to 00:00:30, then when the recording and broadcasting server segments again according to sound source location and human voice detection, the audio data can be divided into two segments.
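  • This direction-based re-segmentation could look like the sketch below, which cuts a voiced interval wherever the per-frame azimuth jumps; the azimuth track and jump threshold are assumed inputs, not the patent's algorithm.

```python
# Cut a voiced segment wherever the sound source azimuth jumps, so each
# resulting segment comes from a single direction.
def split_by_azimuth(frames, max_jump_deg=10.0):
    """frames: list of (time_s, azimuth_deg); return [(start_s, end_s), ...]."""
    segments, start = [], frames[0][0]
    for (t_prev, az_prev), (t, az) in zip(frames, frames[1:]):
        if abs(az - az_prev) > max_jump_deg:   # speaker change by direction
            segments.append((start, t))
            start = t
    segments.append((start, frames[-1][0]))
    return segments

# azimuth 1 (~30 deg) for 15..25 s, azimuth 2 (~-40 deg) for 25..30 s
track = [(s, 30.0) for s in range(15, 25)] + [(s, -40.0) for s in range(25, 31)]
print(split_by_azimuth(track))  # -> [(15, 25), (25, 30)]
```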
  • the recording and broadcasting server extracts the voiceprint feature of the segmented audio data.
• the recording and broadcasting server extracts voiceprint features from the segmented audio data using techniques such as voiceprint clustering, and marks each segment with a voiceprint identifier.
• in an exemplary solution, suppose the recording and broadcasting server divides the audio data into 10 segments and the duration of 8 of those segments satisfies the minimum length required for voiceprint recognition; the recording and broadcasting server then extracts voiceprint features from each of the eight segments and marks them with voiceprint identifiers (voiceprint 1 to voiceprint 8), as sketched below.
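A sketch of this labeling step, assuming a voiceprint embedding extractor `extract_embedding` supplied elsewhere and a 10-second minimum taken from the background discussion; the segment format and names are hypothetical:

```python
MIN_VOICEPRINT_SECONDS = 10.0  # assumed minimum length for reliable recognition

def label_voiceprints(segments, extract_embedding):
    """Assign voiceprint identifiers to segments long enough to be recognized.

    segments: list of dicts like {"start": 12.0, "end": 27.5, "samples": ...};
    extract_embedding: callable mapping raw samples to an embedding vector
    (e.g. the front end of a voiceprint clustering system).
    """
    voiceprints = {}
    next_id = 1
    for seg in segments:
        if seg["end"] - seg["start"] < MIN_VOICEPRINT_SECONDS:
            seg["voiceprint_id"] = None  # too short: no voiceprint extracted
            continue
        embedding = extract_embedding(seg["samples"])
        seg["voiceprint_id"] = f"voiceprint {next_id}"  # voiceprint 1 ... 8
        voiceprints[seg["voiceprint_id"]] = embedding
        next_id += 1
    return voiceprints
```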
  • the recording and broadcasting server determines the speaker identity of the segmented audio data according to the identification result and the voiceprint feature of the segmented audio data.
  • the recording and broadcasting server integrates and analyzes the received identification result and the voiceprint feature of the segmented audio data to determine the speaker identity of the segmented audio data.
• if the identification result indicates that the first segmented audio data corresponds to unique speaker identity information, the conference record processing device determines the speaker corresponding to the first segmented audio data according to that speaker identity information.
• if the identification result indicates that the first segmented audio data corresponds to at least two speaker identity information, the conference record processing device compares the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to unique speaker identity information; if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
• alternatively, if the identification result indicates that the first segmented audio data corresponds to at least two speaker identity information, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to at least two speaker identity information. That is, the conference record processing device can jointly determine the speaker corresponding to each segment from the voiceprint features of multiple segments and their corresponding speaker identity information, as sketched after this paragraph.
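The candidate-narrowing logic described above can be sketched as follows; the segment structure is an assumption, and the three example segments mirror the user02/user03 example discussed later in this document:

```python
def resolve_speakers(segments):
    """segments: list of dicts {"voiceprint": str, "candidates": set of IDs},
    where candidates come from the portrait-based identification result.
    Returns a mapping from segment index to the resolved speaker (or None)."""
    vp_to_speaker = {}
    by_vp = {}
    for seg in segments:
        by_vp.setdefault(seg["voiceprint"], []).append(seg["candidates"])
    # Pass 1: a segment whose identification result names a unique speaker
    # fixes the speaker of its voiceprint directly.
    for seg in segments:
        if len(seg["candidates"]) == 1:
            vp_to_speaker[seg["voiceprint"]] = next(iter(seg["candidates"]))
    # Pass 2: segments sharing a voiceprint must share a speaker, so the
    # intersection of their candidate sets may already be unique.
    for vp, cand_sets in by_vp.items():
        if vp not in vp_to_speaker:
            common = set.intersection(*cand_sets)
            if len(common) == 1:
                vp_to_speaker[vp] = next(iter(common))
    # Pass 3: eliminate candidates already bound to a different voiceprint,
    # repeating until no further voiceprint can be resolved.
    speaker_to_vp = {s: v for v, s in vp_to_speaker.items()}
    changed = True
    while changed:
        changed = False
        for vp, cand_sets in by_vp.items():
            if vp in vp_to_speaker:
                continue
            remaining = {s for s in set.intersection(*cand_sets)
                         if speaker_to_vp.get(s, vp) == vp}
            if len(remaining) == 1:
                speaker = next(iter(remaining))
                vp_to_speaker[vp] = speaker
                speaker_to_vp[speaker] = vp
                changed = True
    return {i: vp_to_speaker.get(seg["voiceprint"])
            for i, seg in enumerate(segments)}

# VP03 appears with candidate sets {user03, user02} and {user03, user01},
# whose intersection is user03; VP02 then loses user03 by elimination and
# resolves to user02.
segments = [
    {"voiceprint": "VP02", "candidates": {"user02", "user03"}},
    {"voiceprint": "VP03", "candidates": {"user03", "user02"}},
    {"voiceprint": "VP03", "candidates": {"user03", "user01"}},
]
print(resolve_speakers(segments))  # -> {0: 'user02', 1: 'user03', 2: 'user03'}
```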
  • both the first segmented audio data and the second segmented audio data are obtained by the conference recording processing apparatus through voice segmentation.
• for details, refer to the record of a current meeting shown in Table 1:
• the recording and broadcasting server can integrate the voiceprint feature of a segment with the voiceprint features of segments whose speakers have already been identified and with the identification results, and obtain the speaker corresponding to the segment by analysis.
• for example, the identification result shows that the user identity ID is User03, the user body attribute ID is body04, and the voiceprint feature is VP04.
• this can occur when the speaker indicated by body04 is looking down reading a manuscript while User03 is facing the camera of the video conference terminal, so body04 and User03 cannot be distinguished when the image information is collected.
• since the voiceprint feature corresponding to User03 is VP03, the speaker of the content shown in line 4 can be determined to be body04 rather than User03, and the voiceprint feature corresponding to body04 is VP04; the unique speaker can thus be determined.
• in other cases, the voiceprint features alone cannot distinguish the candidates, so the speaker cannot be uniquely determined from a single line.
• for example, the voiceprint features of lines 10 and 11 are both VP07, but the corresponding speaker identity sets have a unique intersection, User07.
• this can occur when the speaker indicated by User07 spoke during both time periods indicated by lines 10 and 11: during the period of line 10, User08 was also facing the video conference terminal and could not be distinguished from User07 when the image information was collected, and during the period of line 11, User06 was facing the camera of the video conference terminal and likewise could not be distinguished from User07. Therefore, combining the contents of lines 10 and 11, it can be inferred that the speaker corresponding to voiceprint feature VP07 is User07.
• the recording server can also compare the voiceprint features and identification results of the current conference with the long-term voiceprint feature record of the conference site for further judgment. That is, the recording server compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker whose correspondence with segmented audio data in the current conference of the first conference site has already been determined; if the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with that speaker's voiceprint feature in the long-term voiceprint feature record, the recording server compares the voiceprint feature corresponding to the first segmented audio data with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
• refer to the exemplary long-term voiceprint feature record shown in Table 2:
• the recording server can compare the most recent voiceprint features of User01 in conference room Site01, for example the voiceprint features of User01 in conferences Conf01 and Conf02 held in Site01. If the comparison shows that the difference between User01's voiceprint features in the two conferences meets the threshold requirement, the recording server can determine that the channels of the two conferences in conference room Site01 are consistent, and therefore that the long-term voiceprint feature record can be used for reference.
• suppose that for one segment the candidate speakers are User05 and User08, while for another the candidate speakers are User05, User06, and User07; once the speaker of voiceprint feature VP05 is determined, it can be determined that the speaker corresponding to voiceprint feature VP06 in Table 1 is User06.
• the recording server also compares the voiceprint features of User01 in Conf01 and Conf03. If the comparison shows that the difference between User01's voiceprint features in the two conferences does not meet the threshold requirement, the recording server can register the voiceprint features and speaker information of Conf03 and update the long-term voiceprint feature record; the specific form can be as shown in rows 8 to 10 of Table 2. It can be understood that a change of the channel corresponding to a conference may be a change of the conference room or a change of the devices involved in the conference. A compare-and-update sketch follows.
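A minimal compare-and-update sketch for the long-term record; the cosine similarity measure, the 0.75 threshold, and the record layout are assumptions, since the patent only requires that the difference "meet a threshold":

```python
import math

SIMILARITY_THRESHOLD = 0.75  # assumed; the patent does not fix a value

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def check_and_update_long_term_record(record, site_id, conf_id, speaker, embedding):
    """record: {(site_id, speaker): {"channel_id": ..., "embedding": [...]}}.

    If the speaker's current-conference voiceprint matches the stored one,
    the channel is consistent and the long-term record can be used for
    reference; otherwise the new voiceprint is registered under the current
    conference's channel identifier and the record is updated."""
    key = (site_id, speaker)
    entry = record.get(key)
    if entry is not None and cosine(entry["embedding"], embedding) >= SIMILARITY_THRESHOLD:
        return True   # channel consistent: long-term record usable
    # Channel changed (different room or devices): register and update.
    record[key] = {"channel_id": conf_id, "embedding": embedding}
    return False
```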
• the conference record processing device can perform the long-term analysis (that is, the analysis illustrated with Table 2) after the short-term analysis (that is, the analysis illustrated with Table 1), or perform the long-term analysis first and the short-term analysis afterwards; as long as the audio data can be distinguished in the end, the specific order of operations is not limited here.
  • the recording and broadcasting server sends the audio data and the classification result of the audio data to the ASR server.
• after the recording server completes the matching of the audio data with the speakers, it sends the classification result and the audio data to the ASR server.
  • the ASR server outputs the audio data as text.
  • the video conference terminal collects corresponding image information according to sound source localization, and performs preliminary portrait recognition on the image information to obtain an identification result.
  • the identification result is combined with the voiceprint feature to further identify the audio data, so that accurate classification of the voice data can be achieved without pre-registering the user's voiceprint feature.
  • An embodiment of the audio data processing method in the embodiment of the present application includes:
  • 401-405 are the same as 201-205 in the foregoing embodiment, and are not repeated here.
  • the video conference terminal sends the audio code stream and the identification result to the multipoint control unit.
• the difference is that in this step the video conference terminal also sends the identification result to the multipoint control unit.
  • the multipoint control unit decodes the audio code stream to obtain audio data, and performs voice segmentation on the audio data to obtain the segmented audio data.
• after acquiring the audio code stream, the multipoint control unit determines the conference site to which the video conference terminal belongs according to the conference identifier allocated to the video conference terminal, decodes the audio code stream to obtain the audio data, and then stores the audio data according to the site identifier. At the same time, the multipoint control unit performs voice segmentation on the audio data according to the sound source orientation of the audio data and human voice detection technology.
• for 408-410, refer to 209-211; the difference is that 408-410 are implemented by the multipoint control unit, while 209-211 are implemented by the recording and broadcasting server.
  • An embodiment of the audio data processing method in the embodiment of the present application includes:
• 501-502 are the same as 201-202 in the above-mentioned embodiment, and are not repeated here.
  • the video conference terminal performs voice segmentation on the audio data through human voice detection and sound source localization to obtain segmented audio data.
  • 504-505 are the same as 204-205 in the foregoing embodiment, and are not repeated here.
• 506-508 are similar to 209-211; the difference is that steps 506-508 are executed by the video conference terminal, while steps 209-211 are executed by the recording server.
  • the ASR server outputs the audio data as text.
• the video conference terminal collects the corresponding image information according to sound source localization and performs preliminary portrait recognition on the image information to obtain an identity recognition result; the video conference terminal then combines the identity recognition result with the voiceprint feature to further identify the audio data, so that accurate classification of the voice data can be achieved without pre-registering users' voiceprint features.
  • An embodiment of the audio data processing method in the embodiment of the present application includes:
• the conference record processing device obtains the audio data of the first conference site, the sound source position information corresponding to the audio data, and the identification result, where the identification result is used to indicate the correspondence between the speaker identity information obtained by the portrait recognition method and the speaker's speaking time information.
  • the conference record processing apparatus may be the recording server in the method embodiment shown in FIG. 2, the multipoint control unit in the method embodiment shown in FIG. 4, or the video conference terminal in the method embodiment shown in FIG. 5.
• when the conference record processing device is the recording server in the method embodiment shown in FIG. 2, the conference record processing device receives the audio data sent by the multipoint control unit and the sound source orientation information corresponding to the audio data.
  • the audio data and the sound source position information corresponding to the audio data may be packaged to generate an audio code stream and additional field information, wherein the additional field information includes the sound source position information corresponding to the audio data.
  • the video conference terminal encodes the audio data into an audio code stream, and then adds additional field information to the corresponding audio code stream, and uses the additional field information to indicate sound source location information corresponding to the audio data.
• the video conference terminal sends the audio code stream to the multipoint control unit; after receiving the audio code stream, the multipoint control unit determines the conference site to which the video conference terminal belongs according to the conference ID assigned to the video conference terminal, adds the site identifier to the audio code stream, and sends the audio code stream to the recording and broadcasting server.
  • the multipoint control unit may filter the audio data of each conference site, and then select the audio data of one or more conference sites to send to the recording and broadcasting server.
• the multipoint control unit can compare the volume levels of the audio data of each conference site and select, for forwarding, audio data whose volume is greater than a preset threshold; or it can use an algorithm to select, for forwarding, audio data whose speech duration exceeds a preset threshold. The specific filtering conditions are not limited here. Filtering reduces the amount of data to be processed and thus speeds up processing.
  • the identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition, and is directly sent by the video conference terminal to the recording server.
• when the conference record processing device is the multipoint control unit in the method embodiment shown in FIG. 4, the conference record processing device receives the audio data sent by the video conference terminal and the sound source orientation information corresponding to the audio data.
  • the audio data and the sound source position information corresponding to the audio data may be packaged to generate an audio code stream and additional field information, wherein the additional field information includes the sound source position information corresponding to the audio data.
  • the video conference terminal encodes the audio data into an audio code stream, and then adds additional field information to the corresponding audio code stream, and uses the additional field information to indicate sound source location information corresponding to the audio data. Then the video conference terminal sends the audio code stream to the multipoint control unit.
  • the identity recognition result is obtained by the video conference terminal according to sound source localization and human image recognition, and is sent by the video conference terminal to the multipoint control unit.
• when the conference record processing device is the video conference terminal in the method embodiment shown in FIG. 5, the conference record processing device directly collects the audio data of the current conference through the microphone and obtains the sound source position information corresponding to the audio data through sound source localization technology. The identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition.
  • the conference record processing apparatus performs voice segmentation on the audio data to obtain the first segmented audio data of the audio data.
  • the conference record processing device segments the audio data according to the sound source orientation information and the human voice detection method to obtain a plurality of segmented audio data of the audio data.
  • the conference recording processing apparatus determines a speaker corresponding to the first segment of audio data according to the voiceprint feature of the first segment of audio data and the identification result.
  • the conference record processing apparatus can execute the method shown in step 210 in FIG. 2 or step 409 in FIG. 4 or step 507 in FIG. 5 to obtain the speaker corresponding to the audio data, and details are not repeated here.
• the conference record processing device obtains an identification result used to indicate the correspondence between speaker identity information and speaking time information, and then combines the identification result with the voiceprint feature to further identify the audio data; in this way, accurate classification of voice data can be achieved without pre-registering users' voiceprint features.
  • the audio data processing method in the embodiment of the present application is described above, and the conference record processing apparatus and the video conference terminal in the embodiment of the present application are described below.
  • the apparatus 700 for processing conference records includes: an acquisition module 701 and a processing module 702 , wherein the acquisition module 701 and the processing module 702 are connected through a bus.
• the conference record processing apparatus 700 may be the recording server in the method embodiment shown in FIG. 2 above, the multipoint control unit in the method embodiment shown in FIG. 4 above, or the video conference terminal in the method embodiment shown in FIG. 5 above; it may also be configured as one or more chips within any of the above devices.
  • the meeting record processing apparatus 700 may be used to execute part or all of the functions of the above-mentioned devices.
• the acquisition module 701 acquires the audio data of the first conference site, the sound source location information corresponding to the audio data, and the identification result, where the identification result is used to indicate the correspondence between the speaker identity information obtained by the portrait recognition method and the speaker's speaking time information; the processing module 702 performs voice segmentation on the audio data to obtain the first segmented audio data of the audio data, and determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identification result.
  • the audio data is included in an audio code stream, and the audio code stream further includes additional field information, where the additional field information includes sound source position information corresponding to the audio data.
• the processing module 702 is specifically configured to determine the speaker corresponding to the first segmented audio data according to the speaker identity information if the identification result indicates that the first segmented audio data corresponds to unique speaker identity information.
• the processing module 702 is specifically configured to: if the identification result indicates that the first segmented audio data corresponds to at least two speaker identity information, compare the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to unique speaker identity information; and, if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, determine the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
• the processing module 702 is specifically configured to: if the identification result indicates that the first segmented audio data corresponds to at least two speaker identity information, determine the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to at least two speaker identity information.
• the processing module 702 is specifically configured to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record includes the historical voiceprint feature record of the first conference site, which is used to indicate the correspondence among voiceprint features, speakers, and channel identifiers.
• the processing module 702 is specifically configured to compare the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, where the first speaker is a speaker already determined in the current conference of the first conference site; if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the processing module compares the voiceprint feature corresponding to the first segmented audio data and the identification result with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
• the processing module 702 is further configured to compare the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, where the first speaker is a speaker already determined in the current conference of the first conference site; if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is inconsistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the processing module registers the voiceprint feature, the channel identifier, and the speaker corresponding to the voiceprint feature in the current conference of the first conference site, and updates the long-term voiceprint feature record.
  • the conference record processing apparatus 700 further includes a storage module, which is coupled with the processing module, so that the processing module can execute the computer execution instructions stored in the storage module to implement the functions of the conference record processing apparatus in the above method embodiments.
• the optional storage module included in the conference record processing apparatus 700 may be a storage unit within the chip, such as a register or a cache; the storage module may also be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, or the like.
  • FIG. 8 shows a schematic structural diagram of a conference record processing apparatus 800 in the above embodiment.
• the conference record processing apparatus 800 may be configured as the recording server in the method embodiment shown in FIG. 2 or the recording server shown in FIG. 4 above.
  • the conference record processing apparatus 800 may include: a processor 802 , a computer-readable storage medium/memory 803 , a transceiver 804 , an input device 805 and an output device 806 , and a bus 801 .
• the processor, the transceiver, the computer-readable storage medium, and the like are connected through the bus.
  • the embodiments of the present application do not limit the specific connection medium between the above components.
• the transceiver 804 obtains the audio data of the first conference site, the sound source location information corresponding to the audio data, and the identification result, where the identification result is used to indicate the correspondence between the speaker identity information obtained by the portrait recognition method and the speaker's speaking time information;
• the processor 802 performs voice segmentation on the audio data to obtain the first segmented audio data of the audio data, and determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identification result.
  • the processor 802 may include a baseband circuit, for example, may modulate and process audio data, and generate an audio code stream.
  • the transceiver 804 may include a radio frequency circuit, so as to modulate and amplify the audio code stream and send it to the corresponding device in the conference system.
• the processor 802 may run an operating system to control the functions of the various devices and components.
  • the transceiver 804 may include a baseband circuit and a radio frequency circuit.
  • the audio code stream or the identification result may be processed by the baseband circuit and the radio frequency circuit and then sent to the corresponding device in the conference system.
  • the transceiver 804 and the processor 802 can implement the corresponding steps in any of the foregoing embodiments in FIG. 2 to FIG. 6 , and details are not repeated here.
  • FIG. 8 only shows a simplified design of the conference record processing device.
• the conference record processing device may include any number of transceivers, processors, memories, and the like, and all conference record processing devices that can implement the present application fall within its protection scope.
• the processor 802 involved in the above-mentioned apparatus 800 may be a general-purpose processor, such as a CPU, a network processor (NP), or a microprocessor; it may also be an ASIC, or one or more integrated circuits for controlling the execution of programs of the solution of the present application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a controller/processor may also be a combination that implements computing functions, such as a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, and the like. Processors typically perform logical and arithmetic operations based on program instructions stored in memory.
  • the above-mentioned bus 801 may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (EISA for short) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 8, but it does not mean that there is only one bus or one type of bus.
  • the computer-readable storage medium/memory 803 mentioned above may also store an operating system and other application programs.
  • the program may include program code, and the program code includes computer operation instructions.
  • the above-mentioned memory may be ROM, other types of static storage devices that can store static information and instructions, RAM, other types of dynamic storage devices that can store information and instructions, disk storage, and the like.
  • Memory 803 may be a combination of the above-described storage types.
  • the above-mentioned computer-readable storage medium/memory may be in the processor, outside the processor, or distributed over multiple entities including the processor or processing circuit.
  • the computer-readable storage medium/memory described above may be embodied in a computer program product.
  • a computer program product may include a computer-readable medium in packaging materials.
• the embodiments of the present application also provide a general-purpose processing system, commonly referred to as a chip, which includes one or more microprocessors that provide the processor functions and an external memory that provides at least a part of the storage medium, all connected together with other supporting circuits through an external bus architecture.
• when the instructions stored in the memory are executed, the processor is caused to execute part or all of the steps of the methods in the embodiments of FIG. 2 to FIG. 6 above, and/or other processes of the technology described in the present application.
  • the steps of the methods or algorithms described in conjunction with the disclosure of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions.
• the software instructions can be composed of corresponding software modules, and the software modules can be stored in a RAM, a flash memory, a ROM, an EPROM, an EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
• the processor and storage medium may reside in an ASIC. Alternatively, the ASIC may be located in the conference record processing device.
  • the processor and the storage medium may also exist in the conference record processing apparatus as discrete components.
  • the video conference terminal 900 includes: a processing module 901 and a sending module 902, wherein the processing module 901 and the sending module 902 are connected through a bus.
  • the video conference terminal 900 may be the video conference terminal in the foregoing method embodiments, or may be configured as one or more chips in the foregoing video conference terminal.
  • the video conference terminal 900 may be used to perform part or all of the functions of the above-mentioned video conference terminal.
• the processing module 901 performs sound source localization on the audio data of the first conference site to obtain the sound source position information corresponding to the audio data, and obtains an identification result according to the sound source position and the portrait recognition method, where the identification result is used to indicate the correspondence between speaker identity information and speaking time information; the sending module 902 sends the identity recognition result, the audio data, and the sound source orientation information corresponding to the audio data to the conference record processing device.
• the processing module 901 is specifically used to obtain the portrait information corresponding to the sound source orientation; perform image recognition on the portrait information to obtain face information and/or body attribute information; determine the identity information of the speaker according to the face information and/or the body attribute information; and obtain the identity recognition result by establishing a correspondence between the time information of the speaker and the identity information of the speaker, as sketched below.
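A sketch of how the identification result might be assembled; `capture_portrait`, `recognize_faces`, and `recognize_bodies` stand in for the terminal's camera and recognition pipeline and are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class IdentificationResult:
    start_s: float                                   # speaking time: start
    end_s: float                                     # speaking time: end
    speakers: Set[str] = field(default_factory=set)  # user IDs and/or body-attribute IDs

def build_identification_result(azimuth_deg, start_s, end_s,
                                capture_portrait, recognize_faces, recognize_bodies):
    """Collect the portrait at the sound source azimuth, run face and body
    attribute recognition, and bind the identities to the speaking time."""
    portrait = capture_portrait(azimuth_deg)  # image region around the source
    ids = set(recognize_faces(portrait)) | set(recognize_bodies(portrait))
    return IdentificationResult(start_s=start_s, end_s=end_s, speakers=ids)
```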
  • the video conference terminal 900 further includes a storage module, which is coupled with the processing module, so that the processing module can execute the computer execution instructions stored in the storage module to implement the functions of the video conference terminal in the above method embodiments.
• the optional storage module included in the video conference terminal 900 may be an in-chip storage unit, such as a register or a cache; the storage module may also be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, or the like.
  • FIG. 10 shows a schematic structural diagram of a video conference terminal 1000 in the above-mentioned embodiment, and the video conference terminal 1000 may be configured as the aforementioned video conference terminal.
• the video conference terminal 1000 may include: a processor 1002, a computer-readable storage medium/memory 1003, a transceiver 1004, an input device 1005 and an output device 1006, and a bus 1001, where the processor, the transceiver, the computer-readable storage medium, and the like are connected through the bus.
  • the embodiments of the present application do not limit the specific connection medium between the above components.
  • the processor 1002 performs sound source localization on the audio data of the first venue to obtain sound source location information corresponding to the audio data; and obtains an identification result according to the sound source location and the portrait recognition method. The result is used to indicate the correspondence between speaker identity information and speaking time information;
  • the transceiver 1004 sends the identification result and the audio data to the conference record processing apparatus.
  • the processor 1002 may include a baseband circuit, for example, may modulate and process audio data, and generate an audio code stream.
  • the transceiver 1004 may include a radio frequency circuit, so as to modulate and amplify the audio code stream and send it to the corresponding device in the conference system.
• the processor 1002 may run an operating system to control the functions of the various devices and components.
  • the transceiver 1004 may include a baseband circuit and a radio frequency circuit.
  • the audio code stream or the identification result may be processed by the baseband circuit, and then sent to the corresponding device in the conference system by the radio frequency circuit.
  • the transceiver 1004 and the processor 1002 can implement the corresponding steps in any of the foregoing embodiments in FIG. 3 to FIG. 7 , and details are not repeated here.
  • FIG. 10 only shows the simplified design of the video conference terminal.
• the video conference terminal may include any number of transceivers, processors, memories, and the like, and all video conference terminals that can implement the present application fall within its protection scope.
• the processor 1002 involved in the above-mentioned apparatus 1000 may be a general-purpose processor, such as a CPU, a network processor (NP), or a microprocessor; it may also be an ASIC, or one or more integrated circuits for controlling the execution of programs of the solution of the present application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a controller/processor may also be a combination that implements computing functions, such as a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, and the like. Processors typically perform logical and arithmetic operations based on program instructions stored in memory.
  • the above-mentioned bus 1001 may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (EISA for short) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 10, but it does not mean that there is only one bus or one type of bus.
  • the above-mentioned computer-readable storage medium/memory 1003 may also store an operating system and other application programs.
  • the program may include program code, and the program code includes computer operation instructions.
  • the above-mentioned memory may be ROM, other types of static storage devices that can store static information and instructions, RAM, other types of dynamic storage devices that can store information and instructions, disk storage, and the like.
  • the memory 1003 may be a combination of the above storage types.
  • the above-mentioned computer-readable storage medium/memory may be in the processor, outside the processor, or distributed over multiple entities including the processor or processing circuit.
  • the computer-readable storage medium/memory described above may be embodied in a computer program product.
  • a computer program product may include a computer-readable medium in packaging materials.
• the embodiments of the present application also provide a general-purpose processing system, commonly referred to as a chip, which includes one or more microprocessors that provide the processor functions and an external memory that provides at least a part of the storage medium, all connected together with other supporting circuits through an external bus architecture.
• when the instructions stored in the memory are executed, the processor is caused to execute part or all of the steps of the methods in the embodiments of FIG. 3 to FIG. 7 above, and/or other processes of the technology described in the present application.
  • the steps of the methods or algorithms described in conjunction with the disclosure of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions.
• the software instructions can be composed of corresponding software modules, and the software modules can be stored in a RAM, a flash memory, a ROM, an EPROM, an EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage medium may reside in an ASIC.
  • the ASIC can be located in the videoconferencing terminal.
  • the processor and the storage medium may also exist in the video conference terminal as discrete components.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the unit is only a logical function division.
• in actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
• the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
  • the unit described as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
• the technical solutions of the present application, in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods in the embodiments of the present application.
• the aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

An audio data processing method, device and system for classifying conference audio data according to the identities of speakers. The method comprises: a conference record processing apparatus obtains audio data of a first conference room, sound direction information corresponding to the audio data, and an identity identification result, the identity identification result being used for indicating a correspondence between speaker identity information obtained by means of a portrait identification method and speaking time information of speakers (601); then the conference record processing apparatus performs speech segmentation on the audio data, to obtain first segmented audio data of the audio data (602); and finally, the conference record processing apparatus determines a speaker corresponding to the first segmented audio data according to voiceprint features of the first segmented audio data and the identity identification result (603). The audio data is comprised in an audio stream which further comprises additional domain information, and the additional domain information comprises the sound direction information corresponding to the audio data.

Description

A method, device and system for processing audio data
This application claims priority to the Chinese patent application No. 202011027427.2, entitled "A method, device and system for processing audio data", filed with the State Intellectual Property Office of China on September 25, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of communications, and in particular, to a method, device and system for processing audio data.
Background
With the rapid development of video conferencing technology, just as meeting minutes are produced manually in ordinary meetings, there is also a need for meeting minutes in multipoint video conferences. Existing products can already automatically record the audio, video, data, and other content of an entire conference during a video conference. If the audio data is simply recorded, however, then when the key or specific content of the conference is reviewed, the minutes cannot be organized and classified by speaker as they can be for an ordinary meeting.
During a video conference, if it can be determined that only one person speaks in an entire voice file, the audio data of the whole file can be sent directly to a voiceprint recognition system for identification. If the voice file contains the speech of multiple people, the file must first be segmented, and voiceprint recognition is then performed on each segment of audio data. Existing voiceprint recognition systems usually require more than 10 seconds of audio data, and the longer the data, the higher the accuracy; therefore, when segmenting audio data, the segments cannot be too short. Since free conversation is common in video conferences, a long segment of audio data may contain the speech of several people, and when such a segment is sent to the voiceprint recognition system, the recognition result will be unreliable.
Moreover, the premise of the above solution is that conference participants must register their voiceprints in the voiceprint recognition system. However, the channel used during sound collection has a large influence on voiceprint features: voiceprints are generally pre-registered over a single channel, while many different channels are used during recognition, so it is difficult to guarantee the accuracy of voiceprint recognition for sounds collected over different channels.
SUMMARY OF THE INVENTION
Embodiments of the present application provide an audio data processing method, device, and system for accurately classifying conference audio data.
In a first aspect, an embodiment of the present application provides a method for processing audio data, which specifically includes: a conference record processing device obtains audio data of a first conference site, sound source orientation information corresponding to the audio data, and an identity recognition result, where the identity recognition result is used to indicate the correspondence between speaker identity information obtained by a portrait recognition method and the speaker's speaking time information; the conference record processing device then performs voice segmentation on the audio data to obtain first segmented audio data of the audio data; finally, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
In this embodiment, the audio data and the sound source orientation information corresponding to the audio data can be packaged to generate an audio code stream, where the audio code stream contains additional field information of the audio data, and the additional field information includes the sound source orientation information corresponding to the audio data. The audio data processing method can be applied to local or remote conference scenarios, where at least one conference site participates in the conference. Based on the above solution, the additional field information may further include the time information of the audio data, the site identification information of the first conference site, and other information. Portrait recognition methods include face recognition and human body attribute recognition: face recognition yields the speaker corresponding to facial features, while human body attribute recognition identifies the user's overall clothing or physical features to obtain the corresponding speaker. The speaker identity information may be user identity information (such as the speaker's employee number in a company, or an ID card number or telephone number registered in the company's internal database) or user body attribute information (for example, in the current conference the user wears a white top and black trousers, or the user has a distinctive mark on one arm). The speaking time information may be a period of time or two time points; for example, it may be the 30-second period from 00:00:15 to 00:00:45 after the current conference starts, or it may include only the two time points "00:00:15" and "00:00:45". It can be understood that, in the embodiments of the present application, the timing rule indicated in the form "00:00:00" is "hours:minutes:seconds", that is, the time point "00:00:15" is the 15th second after the conference starts.
In the technical solution provided by this embodiment, the conference record processing device obtains an identity recognition result indicating the correspondence between speaker identity information and speaking time information, and then combines the identity recognition result with voiceprint features to further identify the audio data; in this way, accurate classification of voice data can be achieved without pre-registering users' voiceprint features.
Optionally, the operation of the conference record processing device determining the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result may be as follows:
In a possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, the conference record processing device determines the speaker corresponding to the first segmented audio data according to that speaker identity information. That is, if the identity recognition result obtained by the conference record processing device indicates that the only speaker corresponding to the first segmented audio data is user01 and the corresponding voiceprint feature is VP01, the conference record processing device determines the speaker of the first segmented audio data to be user01.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two speaker identity information, the conference record processing device compares the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to unique speaker identity information; if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data. For example, suppose the speaker identity of the second segmented audio data has been determined to be user02 with corresponding voiceprint feature VP02, and the voiceprint feature of the first segmented audio data is VP02 with corresponding speaker identity information including user03 and user02. The voiceprint features of the first and second segmented audio data are both VP02, and since the result for the second segmented audio data shows that the speaker corresponding to VP02 is user02, the speaker of the first segmented audio data can be determined to be user02 as well.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two speaker identity information, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to at least two speaker identity information. That is, the conference record processing device can jointly determine the speaker corresponding to each segment from the voiceprint features of multiple segments and the corresponding speaker identity information. For example, suppose the second segmented audio data has speaker identity information user02 and user03 with voiceprint feature VP02; the first segmented audio data has voiceprint feature VP03 with speaker identity information including user03 and user02; and third segmented audio data has voiceprint feature VP03 with speaker identity information user03 and user01. The voiceprint features of the first and third segmented audio data are both VP03, and the speaker identity sets of these two segments have a unique intersection, user03, so the speaker corresponding to voiceprint feature VP03 can be determined to be user03. It can then further be determined that the speaker of the second segmented audio data is user02, that is, the speaker corresponding to voiceprint feature VP02 is user02.
In another possible implementation, the conference record processing apparatus determines the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record includes a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used to indicate the correspondence among voiceprint features, speakers, and channel identifiers.
Optionally, when the conference record processing apparatus determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the conference record processing apparatus compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are consistent, the conference record processing apparatus determines that the long-term voiceprint feature record is usable; in this case, the conference record processing apparatus compares the voiceprint feature corresponding to the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
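The validate-then-look-up flow might look as follows; the record layout (each speaker mapped to a voiceprint and a channel identifier) is an illustrative assumption:

```python
def resolve_with_long_term_record(segment, confirmed, record, same_vp):
    """Validate the long-term record against an already-confirmed speaker,
    then use it to look up the speaker for the segment's voiceprint.
    record maps speaker -> (voiceprint, channel_id)."""
    name, current_vp = confirmed  # a speaker confirmed in the current meeting
    stored_vp, _channel = record.get(name, (None, None))
    if stored_vp is None or not same_vp(current_vp, stored_vp):
        return None  # record not validated; fall back to short-term logic
    for speaker, (vp, _ch) in record.items():
        # A hit requires both a voiceprint match and membership in the
        # segment's candidate set from the identity recognition result.
        if same_vp(segment["vp"], vp) and speaker in segment["speakers"]:
            return speaker
    return None

record = {"user02": ("VP02", "ch1"), "user03": ("VP03", "ch2")}
seg = {"vp": "VP03", "speakers": ["user03", "user01"]}
print(resolve_with_long_term_record(seg, ("user02", "VP02"), record,
                                    lambda a, b: a == b))  # user03
```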
In the audio data classification process, combining short-term processing with long-term processing helps maximize the accuracy of audio data classification.
Optionally, the conference record processing apparatus compares the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are inconsistent, the conference record processing apparatus registers the voiceprint feature, the channel identifier, and the speaker corresponding to that voiceprint feature in the current conference of the first conference site, and updates the long-term voiceprint feature record. In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, keeping the long-term voiceprint feature record usable. Moreover, the corresponding voiceprint features and speakers are registered after each conference, achieving dynamic registration of voiceprint features and speakers; registration is no longer limited to voiceprint features bound to fixed channel identifiers, which effectively improves the accuracy of audio data classification.
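A sketch of this dynamic re-registration, using the same assumed record layout as above:

```python
def update_long_term_record(record, speaker, meeting_vp, channel_id, same_vp):
    """If the speaker's voiceprint in the current meeting no longer matches
    the stored one, re-register (voiceprint, channel, speaker) so the
    long-term record stays usable for later meetings."""
    stored_vp, _ = record.get(speaker, (None, None))
    if stored_vp is None or not same_vp(meeting_vp, stored_vp):
        record[speaker] = (meeting_vp, channel_id)  # dynamic re-registration
    return record

record = {"user02": ("VP02_old", "ch1")}
update_long_term_record(record, "user02", "VP02_new", "ch3", lambda a, b: a == b)
print(record)  # {'user02': ('VP02_new', 'ch3')}
```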
Optionally, after the conference record processing apparatus acquires the voiceprint feature of the first segmented audio data and the corresponding speaker, the conference record processing apparatus may acquire voiceprint identification information of that voiceprint feature, and then establish a correspondence between the voiceprint identification information and the speaker corresponding to the first segmented audio data. This yields a one-to-one correspondence between voiceprint features and speakers, facilitating subsequent audio data classification.
Optionally, when the technical solutions provided in the embodiments of this application are applied to a remote multi-site conference scenario, the conference record processing apparatus may be a recording server, or may be a functional module integrated in a multipoint control unit. Accordingly, the audio code stream may be forwarded by the multipoint control unit to the conference record processing apparatus, and the identity recognition result is sent by the video conference terminal to the conference record processing apparatus.
Optionally, the audio code stream is forwarded to the conference record processing apparatus by the multipoint control unit after conference site gating. This reduces unnecessary data transmission and lightens the load on the network.
Optionally, the specific operation by which the conference record processing apparatus performs voice segmentation on the audio data may be as follows: the conference record processing apparatus performs voice segmentation on the audio data according to the sound source azimuth information and human voice detection. This allows more precise segmentation of the audio data.
According to a second aspect, an embodiment of this application provides an audio data processing method, including: a video conference terminal performs sound source localization on audio data of a first conference site to acquire sound source azimuth information corresponding to the audio data; the video conference terminal acquires an identity recognition result according to the sound source azimuth and a portrait recognition method, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information; and the video conference terminal sends the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to a conference record processing apparatus.
In this embodiment, the video conference terminal performs sound source localization on the audio data to capture image information of the speaker, and obtains, through portrait recognition on the image information, an identity recognition result indicating the correspondence between speaker identity information and speaking time information. The terminal then sends the identity recognition result to the conference record processing apparatus, so that the conference record processing apparatus combines the identity recognition result with voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without pre-registering users' voiceprint features.
Optionally, the specific process by which the video conference terminal performs portrait recognition may be as follows: the video conference terminal acquires portrait information corresponding to the sound source azimuth; the video conference terminal performs image recognition on the portrait information to obtain face information and/or body attribute information; the video conference terminal determines the speaker identity information according to the face information and/or the body attribute information; and the video conference terminal establishes a correspondence between the speaking time information and the speaker identity information to obtain the identity recognition result.
In this embodiment, the speaker identity information may be user identity identification information (for example, the speaker's employee number in the company, or an ID card number or telephone number registered in the company's internal database) or user body attribute identification information (for example, in the current conference the user wears a white top and black trousers, or the user has a distinctive mark on the arm). The speaking time information may be a period of time or two time points. For example, the speaking time information may be the 30-second period from 00:00:15 to 00:00:45 after the current conference starts, or it may include only the two time points "00:00:15" and "00:00:45". It can be understood that in the embodiments of this application the timing rule indicated by the "00:00:00" format is "hours:minutes:seconds", so the time point indicated by "00:00:15" is the 15th second after the conference starts.
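As a concrete illustration of the information content described here, a possible in-memory form of one identity recognition entry is sketched below; the field names are assumptions of the sketch, since the application fixes only the information carried, not its encoding:

```python
from dataclasses import dataclass

@dataclass
class IdentityResult:
    """One entry of the identity recognition result: speaker identity bound
    to speaking time. Field names here are illustrative only."""
    speaker_id: str   # e.g. an employee number, or a body-attribute tag
    start: str        # "HH:MM:SS" offset from the start of the meeting
    end: str

# The 30-second utterance from the example above.
entry = IdentityResult(speaker_id="user01", start="00:00:15", end="00:00:45")
print(entry)
```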
Optionally, when the technical solutions provided in the embodiments of this application are applied to a single-user scenario of a local conference or a remote conference, the video conference terminal may itself serve as the conference record processing apparatus and implement the method of the first aspect, specifically as follows:
The video conference terminal acquires audio data of the current conference site and obtains segmented audio data from the audio data according to the sound source azimuth and human voice detection; it then acquires the voiceprint feature of the segmented audio data and determines the speaker corresponding to the segmented audio data by combining the voiceprint feature with the identity recognition result.
In the technical solution provided in this embodiment, the video conference terminal acquires an identity recognition result indicating the correspondence between speaker identity information and speaking time information, and then combines the identity recognition result with voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without pre-registering users' voiceprint features.
Optionally, the operation by which the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result may be as follows:
In a possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, the video conference terminal determines the speaker corresponding to the first segmented audio data according to that speaker identity information. That is, if the identity recognition result acquired by the video conference terminal indicates that the only speaker corresponding to the first segmented audio data is user01 and the corresponding voiceprint feature is VP01, the video conference terminal determines the speaker of the first segmented audio data to be user01.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the video conference terminal compares the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the video conference terminal through voice segmentation of the audio data and corresponds to unique speaker identity information. If the voiceprint feature of the first segmented audio data is consistent with that of the second segmented audio data, the video conference terminal determines the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data. For example, suppose the speaker of the second segmented audio data has been determined to be user02 and its voiceprint feature is VP02, while the first segmented audio data has voiceprint feature VP02 and candidate speakers user03 and user02. Since both segments carry the same voiceprint feature VP02, and the result for the second segmented audio data shows that VP02 corresponds to user02, it can be determined that the speaker of the first segmented audio data is also user02.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the video conference terminal determines the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is obtained by the video conference terminal through voice segmentation of the audio data and corresponds to at least two pieces of speaker identity information. That is, the video conference terminal can jointly evaluate the voiceprint features and candidate speaker identity information of multiple segmented audio data to determine the speaker corresponding to each segment. For example, suppose the second segmented audio data has candidate speakers user02 and user03 and voiceprint feature VP02; the first segmented audio data has voiceprint feature VP03 and candidate speakers user03 and user02; and third segmented audio data has voiceprint feature VP03 and candidate speakers user03 and user01. The first and third segments share the voiceprint feature VP03, and their candidate speaker sets have exactly one element in common, user03, so it can be determined that the speaker corresponding to VP03 is user03. It then follows that the speaker of the second segmented audio data is user02, that is, the voiceprint feature VP02 corresponds to user02.
In another possible implementation, the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record includes a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used to indicate the correspondence among voiceprint features, speakers, and channel identifiers.
Optionally, when the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the video conference terminal compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are consistent, the video conference terminal determines that the long-term voiceprint feature record is usable; in this case, the video conference terminal compares the voiceprint feature corresponding to the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
In the audio data classification process, combining short-term processing with long-term processing helps maximize the accuracy of audio data classification.
Optionally, the video conference terminal compares the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are inconsistent, the video conference terminal registers the voiceprint feature, the channel identifier, and the speaker corresponding to that voiceprint feature in the current conference of the first conference site, and updates the long-term voiceprint feature record. In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, keeping the long-term voiceprint feature record usable. Moreover, the corresponding voiceprint features and speakers are registered after each conference, achieving dynamic registration of voiceprint features and speakers; registration is no longer limited to voiceprint features bound to fixed channel identifiers, which effectively improves the accuracy of audio data classification.
Optionally, after the video conference terminal acquires the voiceprint feature of the first segmented audio data and the corresponding speaker, the video conference terminal may acquire voiceprint identification information of that voiceprint feature, and then establish a correspondence between the voiceprint identification information and the speaker corresponding to the first segmented audio data. This yields a one-to-one correspondence between voiceprint features and speakers, facilitating subsequent audio data classification.
According to a third aspect, this application provides a conference record processing apparatus that has the function of implementing the behavior of the conference record processing apparatus in the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In a possible implementation, the apparatus includes units or modules for performing the steps of the first aspect. For example, the apparatus includes: an acquisition module configured to acquire audio data of a first conference site, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result is used to indicate the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information; and a processing module configured to perform voice segmentation on the audio data to acquire first segmented audio data of the audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
Optionally, the apparatus further includes a storage module configured to store program instructions and data necessary for the conference record processing apparatus.
In a possible implementation, the apparatus includes a processor and a transceiver. The processor is configured to support the conference record processing apparatus in performing the corresponding functions of the method provided in the first aspect. The transceiver is configured to handle communication between the conference record processing apparatus and other devices in the conference system, for example receiving the audio data and identity recognition result involved in the above method from the video conference terminal. Optionally, the apparatus may further include a memory coupled to the processor, which stores program instructions and data necessary for the conference record processing apparatus.
In a possible implementation, when the apparatus is a chip within the conference record processing apparatus, the chip includes a processing module and a transceiver module. The transceiver module may be, for example, an input/output interface, pin, or circuit on the chip, and transmits the received audio data and identity recognition result of the first conference site to other chips or modules coupled to this chip. The processing module may be, for example, a processor configured to perform voice segmentation on the audio data to acquire first segmented audio data of the audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result. The processing module can execute computer-executable instructions stored in a storage unit to support the conference record processing apparatus in performing the method provided in the first aspect. Optionally, the storage unit may be a storage unit within the chip, such as a register or a cache, or a storage unit located outside the chip, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
In a possible implementation, the apparatus includes a processor, a radio frequency circuit, and an antenna. The processor is configured to control the functions of the circuit parts and determine the speaker corresponding to the first segmented audio data; after processing such as analog conversion, filtering, amplification, and up-conversion via the radio frequency circuit, the result is sent via the antenna to an automatic speech recognition server. Optionally, the apparatus further includes a memory that stores program instructions and data necessary for the conference record processing apparatus.
In a possible implementation, the apparatus includes a communication interface and a logic circuit. The communication interface is configured to acquire an audio code stream and an identity recognition result of the first conference site, where the audio code stream includes audio data and additional field information, the additional field information includes sound source azimuth information corresponding to the audio data, and the identity recognition result is used to indicate the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information. The logic circuit is configured to perform voice segmentation on the audio data to acquire first segmented audio data of the audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
The processor mentioned anywhere above may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the program execution of the audio data processing methods of the above aspects.
According to a fourth aspect, an embodiment of this application provides a video conference apparatus that has the function of implementing the behavior of the video conference terminal in the second aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In a possible implementation, the apparatus includes units or modules for performing the steps of the second aspect. For example, the apparatus includes a processing module configured to perform sound source localization on audio data of a first conference site to acquire sound source azimuth information corresponding to the audio data, and to acquire an identity recognition result according to the sound source azimuth and a portrait recognition method, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information;
and a sending module configured to send the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to a conference record processing apparatus.
Optionally, the apparatus further includes a storage module configured to store program instructions and data necessary for the video conference apparatus.
In a possible implementation, the apparatus includes a processor and a transceiver. The processor is configured to support the video conference apparatus in performing the corresponding functions of the method provided in the second aspect. The transceiver is configured to handle communication between the video conference apparatus and the devices in the conference system, sending the audio code stream and the identity recognition result to the conference record processing apparatus. Optionally, the apparatus may further include a memory coupled to the processor, which stores program instructions and data necessary for the video conference apparatus.
In a possible implementation, when the apparatus is a chip within the video conference apparatus, the chip includes a processing module and a transceiver module. The processing module may be, for example, a processor configured to perform sound source localization on audio data of the first conference site to acquire sound source azimuth information corresponding to the audio data, and to acquire an identity recognition result according to the sound source azimuth and a portrait recognition method, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information. The transceiver module may be, for example, an input/output interface, pin, or circuit on the chip that transmits configuration information to other chips or modules coupled to this chip. The processing module can execute computer-executable instructions stored in a storage unit to support the video conference apparatus in performing the method provided in the second aspect. Optionally, the storage unit may be a storage unit within the chip, such as a register or a cache, or a storage unit located outside the chip, such as a ROM or another type of static storage device capable of storing static information and instructions, or a RAM.
In a possible implementation, the apparatus includes a processor, a baseband circuit, a radio frequency circuit, and an antenna. The processor is configured to control the functions of the circuit parts; the baseband circuit is configured to generate data packets containing the audio code stream and the identity recognition result, which, after processing such as analog conversion, filtering, amplification, and up-conversion via the radio frequency circuit, are sent via the antenna to the conference record processing apparatus. Optionally, the apparatus further includes a memory that stores program instructions and data necessary for the video conference apparatus.
In a possible implementation, the apparatus includes a communication interface and a logic circuit. The logic circuit is configured to perform sound source localization on audio data of the first conference site to acquire sound source azimuth information corresponding to the audio data, and to acquire an identity recognition result according to the sound source azimuth and a portrait recognition method, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information. The communication interface is configured to send the identity recognition result to the conference record processing apparatus and send the audio data to a multipoint control unit.
The processor mentioned anywhere above may be a CPU, a microprocessor, an ASIC, or one or more integrated circuits for controlling the program execution of the audio data processing methods of the above aspects.
According to a fifth aspect, an embodiment of this application provides a computer-readable storage medium storing computer instructions, where the computer instructions are used to perform the method of any possible implementation of any one of the above aspects.
According to a sixth aspect, an embodiment of this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of any one of the above aspects.
According to a seventh aspect, this application provides a chip system including a processor configured to support a conference record processing apparatus or a video conference apparatus in implementing the functions involved in the above aspects, for example generating or processing the data and/or information involved in the above methods. In a possible design, the chip system further includes a memory configured to store program instructions and data necessary for the conference record processing apparatus or the video conference apparatus, so as to implement the function of any one of the above aspects. The chip system may consist of chips, or may include chips and other discrete devices.
According to an eighth aspect, an embodiment of this application provides a conference system including the conference record processing apparatus and the video conference apparatus of the above aspects.
Description of drawings
FIG. 1A is a schematic diagram of an embodiment of a conference system architecture in an embodiment of this application;
FIG. 1B is a schematic diagram of another embodiment of a conference system architecture in an embodiment of this application;
FIG. 2 is a schematic diagram of an embodiment of an audio data processing method in an embodiment of this application;
FIG. 3 is a schematic diagram of a scenario in which a video conference terminal collects image information in an embodiment of this application;
FIG. 4 is a schematic diagram of another embodiment of an audio data processing method in an embodiment of this application;
FIG. 5 is a schematic diagram of another embodiment of an audio data processing method in an embodiment of this application;
FIG. 6 is a schematic diagram of another embodiment of an audio data processing method in an embodiment of this application;
FIG. 7 is a schematic diagram of an embodiment of a conference record processing apparatus in an embodiment of this application;
FIG. 8 is a schematic diagram of another embodiment of a conference record processing apparatus in an embodiment of this application;
FIG. 9 is a schematic diagram of an embodiment of a video conference terminal in an embodiment of this application;
FIG. 10 is a schematic diagram of another embodiment of a video conference terminal in an embodiment of this application.
Detailed description
To make the objectives, technical solutions, and advantages of this application clearer, the following describes embodiments of this application with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of this application. A person of ordinary skill in the art will appreciate that, as new application scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the specification, claims, and accompanying drawings of this application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable in appropriate circumstances, so that the embodiments described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to the expressly listed steps or modules, but may include other steps or modules that are not expressly listed or that are inherent to such a process, method, product, or device. The naming or numbering of steps in this application does not mean that the steps of a method flow must be performed in the temporal or logical order indicated by the naming or numbering; the execution order of named or numbered process steps may be changed according to the technical objective to be achieved, as long as the same or a similar technical effect can be achieved. The division of units in this application is a logical division; other division manners are possible in actual implementation. For example, multiple units may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and indirect couplings or communication connections between units may be electrical or in other similar forms, none of which is limited in this application. Moreover, units or sub-units described as separate components may or may not be physically separate, may or may not be physical units, and may be distributed among multiple circuit units; some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this application.
The technical solutions of the embodiments of the present invention can be applied to local conference or remote conference scenarios. The specific system architecture of an embodiment of the present invention may include multiple video conference terminals, a multipoint control unit, a recording server, and an automatic speech recognition (ASR) server. Taking the embodiment shown in FIG. 1A as an example, each of the multiple video conference terminals (video conference terminal 01 to video conference terminal 03 shown in FIG. 1A) collects conference audio data and image information of the participants, and identifies the speaker among the participants through the image information to obtain an identity recognition result. The video conference terminal then sends the audio data and the identity recognition result to the recording server. The recording server classifies the audio data according to the identity recognition result and the sound source azimuth and sends the classified data to the ASR server, which outputs the conference record through its speech transcription function. Compared with the embodiment of FIG. 1A, in the embodiment shown in FIG. 1B the functions of the recording server are integrated into the multipoint control unit (corresponding to the conference record processing module in FIG. 1B). Each of the multiple video conference terminals (video conference terminal 01 to video conference terminal 03 shown in FIG. 1B) collects conference audio data and image information of the participants, and identifies the speaker among the participants through the image information to obtain an identity recognition result. The video conference terminal then sends the audio data and the identity recognition result to the multipoint control unit, whose conference record processing module classifies the audio data and sends it to the ASR server. Finally, the ASR server outputs the conference record through its speech transcription function.
For details, refer to FIG. 2. An embodiment of the audio data processing method in the embodiments of this application includes the following steps:
201. The video conference terminal collects audio data.
In a remote conference scenario, a conference may include multiple conference sites, each conference site corresponds to at least one video conference terminal, and each conference site has at least one participant. This embodiment is described from the perspective of one video conference terminal in a conference site. During the conference, the video conference terminal uses a microphone to pick up the audio data of each speaker in real time.
202. The video conference terminal acquires the sound source azimuth of the audio data.
While collecting the audio data, the video conference terminal can acquire the sound source azimuth corresponding to the audio data and establish a correspondence between the audio data and the sound source azimuth. For example, the sound source azimuth of audio data collected by the video conference terminal between conference times 00:00:15 and 00:00:30 may be about 30 degrees east relative to the video conference terminal. It can be understood that sound source localization tolerates some error, so the sound source azimuth may be a range; for example, a localization of 30 degrees east may correspond to a range of 28 to 32 degrees east.
In this embodiment, the video conference terminal may acquire the sound source azimuth of the audio data in the following possible implementations:
In one possible implementation, an array microphone is deployed on the video conference terminal, and the sound source azimuth of the audio data is determined from the sound beam information picked up by the array microphone.
In another possible implementation, a device or system dedicated to sound source localization is additionally deployed in the conference site; the sound source azimuth of the audio data is determined with that device or system as the calibration reference point, and the azimuth is then sent to the video conference terminal.
It can be understood that sound source localization may use either of the above solutions or any other feasible implementation, as long as the sound source azimuth of the audio data can be acquired; the specific solution is not limited here.
203. The video conference terminal performs voice segmentation on the audio data through human voice detection to obtain segmented audio data.
The video conference terminal performs voice segmentation on the received audio data according to human voice detection to obtain different segmented audio data.
In this embodiment, the video conference terminal may separate a preceding speech segment from a following speech segment according to a silence interval, or may determine algorithmically whether a segment is human voice or non-human voice and split the surrounding human voice segments at the non-human voice. For example, suppose the video conference terminal collects audio data between conference times 00:00:15 and 00:00:30, detects silence between 00:00:30 and 00:00:32, collects audio data between 00:00:32 and 00:00:45, and detects silence between 00:00:45 and 00:00:50. The video conference terminal may then treat the audio data collected between 00:00:15 and 00:00:30 as one segment of audio data and the audio data collected between 00:00:32 and 00:00:45 as the next segment.
It can be understood that in the embodiments of this application the timing rule indicated by the "00:00:00" format is "hours:minutes:seconds", so the time point indicated by "00:00:15" is the 15th second after the conference starts.
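A minimal sketch of the silence-interval segmentation described in this step, with per-second voice-activity samples as an assumed input form:

```python
def segment_by_silence(frames, min_gap=2.0):
    """frames: ordered (time_sec, is_voice) samples.
    Returns (start, end) pairs of voiced segments, cut wherever a silence
    gap of at least min_gap seconds occurs."""
    segments, start, last_voice = [], None, None
    for t, is_voice in frames:
        if is_voice:
            if start is None:
                start = t
            elif t - last_voice >= min_gap:
                segments.append((start, last_voice))  # the gap closes a segment
                start = t
            last_voice = t
    if start is not None:
        segments.append((start, last_voice))
    return segments

# The example from the text: voice 15-30 s, silence 30-32 s, voice 32-45 s,
# then silence; this yields two segments.
frames = [(t, 15 <= t <= 30 or 32 <= t <= 45) for t in range(15, 51)]
print(segment_by_silence(frames))  # [(15, 30), (32, 45)]
```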
204. The video conference terminal collects image information within the sound source azimuth range according to the sound source azimuth.
The video conference terminal determines its image information collection area according to the sound source azimuth corresponding to the audio data acquired in step 202, and then collects image information in that area.
In this embodiment, the video conference terminal may collect the image information by capturing photos, or by grabbing the picture frames corresponding to the audio data from the video data; the specific form is not limited here. Likewise, the camera of the video conference terminal may be fixed or deployed as rotatable, which is not limited here. When the camera is fixed (that is, its shooting range is fixed), the video conference terminal acquires images within the fixed shooting range and then extracts the image information corresponding to the audio data by calculation based on the sound source azimuth. When the camera is movable, the video conference terminal can adjust the camera's shooting range according to the sound source azimuth to acquire the image information corresponding to the audio data. As shown in FIG. 3, the video conference terminal is located above the conference screen and the participants sit on both sides of the conference table; when a participant is speaking, the video conference terminal can acquire image information within a certain angular range according to the sound source azimuth. Due to the viewing angle, the image information may contain multiple participants or only one. For example, when the image information of speaker 1 is collected according to the sound source localization of speaker 1, the image region contains only speaker 1; when the image information of speaker 2 is collected according to the sound source localization of speaker 2, the image region contains speaker 1 and another participant.
205. The video conference terminal performs portrait recognition on the image information to obtain an identity recognition result.
The video conference terminal performs face recognition and body attribute recognition on the image information to obtain an identity recognition result, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information. For example, face recognition yields the speaker corresponding to the facial features, while body attribute recognition identifies the user's overall clothing or physical characteristics to yield the speaker corresponding to those attributes. The speaker identity information may be user identity identification information (for example, the speaker's employee number in the company, or an ID card number or telephone number registered in the company's internal database) or user body attribute identification information (for example, in the current conference the user wears a white top and black trousers, or the user has a distinctive mark on the arm). The speaking time information may be a period of time or two time points; for example, the 30-second period from 00:00:15 to 00:00:45 after the current conference starts, or only the two time points "00:00:15" and "00:00:45".
In this embodiment, the specific operation by which the video conference terminal acquires the speaker identity information may be as follows: if the image information contains clearly recognizable face information, the video conference terminal uses face recognition technology to identify the face in the image information and compares it with a stored face database to determine the user identity identification information corresponding to that face; if the face information in the image does not meet the recognition requirements (for example, the facial features are insufficient for face recognition, or no facial image is present), the video conference terminal may perform body attribute recognition to obtain body attribute information and determine the user body attribute identification information accordingly.
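This face-first, body-attribute-fallback decision might be sketched as follows; the pre-extracted inputs and the face database lookup are assumptions of the sketch:

```python
def identify_speaker(face_info, body_attributes, face_db):
    """face_info: (features, quality_ok) or None; body_attributes: a textual
    description. Face recognition is preferred; body attributes serve as the
    fallback identity when no usable face is available."""
    if face_info is not None and face_info[1]:
        user = face_db.get(face_info[0])  # compare against registered faces
        if user is not None:
            return {"type": "user_id", "value": user}
    return {"type": "body_attributes", "value": body_attributes}

face_db = {"feat_01": "user01"}
print(identify_speaker(("feat_01", True), "white top, black trousers", face_db))
print(identify_speaker(None, "white top, black trousers", face_db))
```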
206. The video conference terminal packages the audio data and the corresponding sound source azimuth into an audio code stream and sends it to the multipoint control unit, and sends the identity recognition result to the recording server.

The video conference terminal packages the audio data together with the sound source azimuth corresponding to that audio data into an audio code stream and sends it to the multipoint control unit. In one exemplary solution, the video conference terminal encodes the audio data into an audio code stream and then adds additional field information to the corresponding audio code stream, the additional field information indicating the sound source azimuth information corresponding to the audio data. The identity recognition result that the video conference terminal obtains through its own portrait recognition can be sent directly to the recording server.
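The publication does not specify a wire format for the additional field information, so the layout below is purely illustrative: a small header carrying the azimuth (and a site identifier that the multipoint control unit can fill in later) is prepended to each coded audio frame.

    import struct

    HEADER_FMT = "!4sfHI"                      # magic, azimuth, site id, payload length
    HEADER_LEN = struct.calcsize(HEADER_FMT)   # 14 bytes

    def pack_frame(coded_audio: bytes, azimuth_deg: float, site_id: int = 0) -> bytes:
        header = struct.pack(HEADER_FMT, b"AZMF", azimuth_deg, site_id, len(coded_audio))
        return header + coded_audio

    def unpack_frame(packet: bytes):
        magic, azimuth, site_id, n = struct.unpack(HEADER_FMT, packet[:HEADER_LEN])
        assert magic == b"AZMF"
        return packet[HEADER_LEN:HEADER_LEN + n], azimuth, site_id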
207. The multipoint control unit forwards the audio code stream sent by the video conference terminal to the recording server.

After receiving the audio code stream sent by the video conference terminal, the multipoint control unit determines, according to the conference identifier assigned to the video conference terminal, the conference site to which that terminal belongs, then adds the site identifier to the audio code stream and sends the audio code stream to the recording server.

In one possible implementation, the multipoint control unit may screen the audio data of the individual conference sites and forward the audio data of only one or several selected sites to the recording server. The multipoint control unit may compare the volume of the audio data from each site and select for forwarding the audio data whose volume exceeds a preset threshold; alternatively, the multipoint control unit may determine, by means of an algorithm, the audio data in which the detected human-voice duration exceeds a preset threshold and forward that audio data. The specific screening conditions are not limited here. Screening in this way reduces the amount of data to be processed and therefore speeds up processing.
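A minimal sketch of such a screening step follows; the RMS volume measure, both thresholds, and the function names are assumptions for illustration only, since the publication leaves the screening conditions open.

    import numpy as np

    VOLUME_THRESHOLD = 0.01     # assumed RMS threshold on full-scale-1.0 PCM
    MIN_VOICE_SECONDS = 1.0     # assumed minimum detected human-voice duration

    def should_forward(pcm: np.ndarray, voiced_seconds: float) -> bool:
        # Forward a site's audio when it is loud enough or contains enough voice.
        rms = float(np.sqrt(np.mean(np.square(pcm))))
        return rms > VOLUME_THRESHOLD or voiced_seconds > MIN_VOICE_SECONDS

    def select_sites(site_audio: dict) -> list:
        # site_audio maps site id -> (pcm samples, detected voice seconds).
        return [site for site, (pcm, voiced) in site_audio.items()
                if should_forward(pcm, voiced)]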
208. The recording server decodes the audio code stream to obtain audio data and performs voice segmentation on the audio data to obtain segmented audio data.

After obtaining the audio code stream, the recording server can decode it to obtain the audio data and the site identifier, and then store the audio data according to the site identifier. At the same time, the recording server performs voice segmentation on the audio data according to the sound source azimuth of the audio data and human-voice detection technology, thereby obtaining segmented audio data. It can be understood that, in this embodiment, by segmenting according to the sound source azimuth together with human-voice detection, the recording server can further subdivide the audio data reported by the video conference terminal. For example, suppose the video conference terminal detects by human-voice detection that a human voice is continuously present from 00:00:15 to 00:00:30 and therefore treats the audio collected in that interval as a single segment; in fact, one speaker at sound source azimuth 1 is speaking from 00:00:15 to 00:00:25, and another speaker at sound source azimuth 2 is speaking from 00:00:25 to 00:00:30. When the recording server re-segments the speech according to sound source azimuth and human-voice detection, the audio data is therefore divided into two segments.
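The sketch below illustrates this azimuth-aware segmentation under simple assumptions: each audio frame carries a timestamp, a voice-activity flag, and an azimuth, and a new segment is cut whenever the voice stops or the azimuth jumps by more than a threshold (the threshold value is an assumption).

    AZIMUTH_JUMP_DEG = 15.0     # assumed jump indicating a different sound source

    def segment(frames):
        # frames: iterable of (t_seconds, is_voiced, azimuth_deg) per audio frame.
        segments, cur = [], None
        for t, voiced, az in frames:
            if not voiced:
                if cur is not None:
                    segments.append(cur)
                    cur = None
            elif cur is None:
                cur = {"start": t, "end": t, "azimuth": az}
            elif abs(az - cur["azimuth"]) > AZIMUTH_JUMP_DEG:
                segments.append(cur)                 # speaker change mid-voice
                cur = {"start": t, "end": t, "azimuth": az}
            else:
                cur["end"] = t
        if cur is not None:
            segments.append(cur)
        return segments

Fed frames that are voiced from t = 15 s to t = 30 s with an azimuth jump at t = 25 s, this sketch returns two segments rather than one, matching the example above.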
209. When a segment of audio data meets the minimum length for voiceprint recognition, the recording server extracts the voiceprint feature of that segment.

When a segment of audio data meets the minimum length for voiceprint recognition, the recording server extracts a voiceprint feature from the segment using techniques such as voiceprint clustering and labels it with a voiceprint identifier. In one exemplary solution, suppose the recording server divides the audio data into 10 segments, of which 8 are long enough to satisfy the minimum length for voiceprint recognition; the recording server then extracts voiceprint features from these 8 segments and labels them with voiceprint identifiers (voiceprint 1 to voiceprint 8).
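A sketch of this labelling step, assuming a time-indexable audio source and a speaker-embedding function supplied by the caller; both are hypothetical stand-ins for a real audio store and a real voiceprint model, and the two-second minimum is likewise an assumption.

    MIN_VOICEPRINT_SECONDS = 2.0   # assumed minimum length for voiceprint recognition

    def label_voiceprints(segments, clip, embed):
        # clip(start_s, end_s) returns the audio samples for that interval;
        # embed(samples) returns a speaker-embedding vector.
        labelled, n = [], 0
        for seg in segments:
            if seg["end"] - seg["start"] < MIN_VOICEPRINT_SECONDS:
                continue                      # too short for a reliable voiceprint
            n += 1
            labelled.append({**seg,
                             "vp_id": f"VP{n:02d}",
                             "embedding": embed(clip(seg["start"], seg["end"]))})
        return labelled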
210. The recording server determines the speaker identity for each segment of audio data according to the identity recognition result and the voiceprint feature of the segment.

The recording server integrates and analyses the received identity recognition result together with the voiceprint feature of the segmented audio data to determine the speaker identity of the segment.

Specifically, any of the following approaches may be used:

In one possible implementation, if the identity recognition result indicates that a first segment of audio data corresponds to unique speaker information, the conference record processing apparatus determines the speaker corresponding to the first segment according to the unique speaker information indicated by the identity recognition result.

In another possible implementation, if the identity recognition result indicates that the first segment of audio data corresponds to at least two pieces of speaker identity information, the conference record processing apparatus compares the voiceprint feature of the first segment with the voiceprint feature of a second segment of audio data, where the second segment is obtained by the conference record processing apparatus performing voice segmentation on the audio data and corresponds to a unique speaker; if the voiceprint feature of the first segment is consistent with the voiceprint feature of the second segment, the conference record processing apparatus determines the speaker corresponding to the first segment according to the speaker identity information corresponding to the second segment.

In another possible implementation, if the identity recognition result indicates that the first segment of audio data corresponds to at least two pieces of speaker identity information, the conference record processing apparatus determines the speaker corresponding to the first segment according to the speaker identity information and voiceprint feature corresponding to the first segment together with the speaker identity information and voiceprint feature corresponding to the second segment, where the second segment is obtained by the conference record processing apparatus performing voice segmentation on the audio data and itself corresponds to at least two pieces of speaker identity information. That is, the conference record processing apparatus can jointly evaluate the voiceprint features and corresponding speaker identity information of several segments of audio data to determine the speaker of each segment. In this embodiment, both the first segment and the second segment of audio data are obtained by the conference record processing apparatus through voice segmentation. For details, refer to the record of a current conference shown in Table 1:
Table 1
(Table 1 is reproduced only as images in the original publication: PCTCN2021098297-appb-000001 and PCTCN2021098297-appb-000002. As described below, each row records a speaking time segment together with its voiceprint feature, e.g. VP03 or VP04, and the candidate speaker identity information from the identity recognition result, e.g. User03 or body04.)
From rows 1 to 3 of the table above, there is a one-to-one correspondence between voiceprint feature and speaker, so the speakers of the audio data shown in rows 1 to 3 can be determined directly. Where the identity recognition result yields several speakers, the recording server can integrate the voiceprint feature of the segment with the voiceprint features and identity recognition results of other segments whose speakers have already been determined, and so obtain the speaker of the segment. As shown in row 4, the identity recognition result contains the user identity ID User03 and the user body-attribute ID body04, with voiceprint feature VP04. In this case, the speaker indicated by body04 may have been reading a script with head bowed while User03 happened to face the camera of the video conference terminal, so that body04 and User03 could not be separated when the image information was collected. From row 3 it is known that the voiceprint feature corresponding to User03 is VP03; therefore, given that the voiceprint feature in row 4 is VP04, the speaker of row 4 can be determined to be not User03 but body04, and the voiceprint feature corresponding to body04 is VP04. Similarly, a unique speaker can be determined for rows 5 and 8. For rows 6, 7 and 9, User05 and User06 can never be told apart, and their voiceprint features cannot be distinguished either, so the speaker cannot be uniquely determined. For rows 10 and 11, the voiceprint feature is VP07 in both cases, but the corresponding candidate speaker identities have the unique intersection User07. In this case, the speaker indicated by User07 may have spoken in both the time period of row 10 and that of row 11, while User08 happened to face the camera during the period of row 10 but was separate from User07 when image information was collected during the period of row 11, and User06 happened to face the camera during the period of row 11 but was separate from User07 during the period of row 10. Combining rows 10 and 11, it can therefore be inferred that the speaker corresponding to voiceprint feature VP07 is User07.
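A hedged sketch of this cross-referencing follows. It covers two of the strategies described above: inheriting a speaker from an already-resolved segment with a matching voiceprint, and taking the unique intersection of candidate sets across segments that share a voiceprint (the rows 10 and 11 case). The segment dictionaries and the same_voice comparison function are illustrative stand-ins, not elements of the publication.

    def resolve_speakers(segments, same_voice):
        # segments: dicts with 'vp_id', 'embedding', and 'candidates' (the speaker
        # IDs from the image-based identity result); same_voice(e1, e2) -> bool
        # stands in for a real voiceprint comparison.
        resolved = {}
        # Pass 1: segments whose image identity is already unambiguous.
        for s in segments:
            if len(s["candidates"]) == 1:
                resolved[s["vp_id"]] = s["candidates"][0]
        # Pass 2: resolve ambiguous segments via voiceprint peers.
        for s in segments:
            if s["vp_id"] in resolved:
                continue
            peers = [p for p in segments
                     if p is not s and same_voice(s["embedding"], p["embedding"])]
            for p in peers:
                if p["vp_id"] in resolved:           # inherit a resolved speaker
                    resolved[s["vp_id"]] = resolved[p["vp_id"]]
                    break
            else:
                common = set(s["candidates"])        # unique-intersection strategy
                for p in peers:
                    common &= set(p["candidates"])
                if len(common) == 1:
                    resolved[s["vp_id"]] = common.pop()
        return resolved

For example, two segments sharing VP07 with candidate sets {User07, User08} and {User06, User07} intersect to {User07}, reproducing the conclusion drawn from rows 10 and 11.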
If a unique speaker still cannot be determined in the above manner, the recording server can compare the voiceprint features and identity recognition results of the current conference against the long-term voiceprint feature record of the conference site for a further judgment. That is, the recording server compares the voiceprint feature of a first speaker in the current conference of the first site with the voiceprint feature of that first speaker in the long-term voiceprint feature record, the first speaker being a speaker whose correspondence with segmented audio data has already been determined in the current conference of the first site. If the first speaker's voiceprint feature in the current conference of the first site is consistent with the first speaker's voiceprint feature in the long-term record, the recording server compares the voiceprint feature corresponding to the first segment of audio data with the voiceprint features in the long-term voiceprint feature record of the first site to determine the speaker corresponding to the first segment. For details, refer to the exemplary long-term voiceprint feature record shown in Table 2:
Table 2
(Table 2 is likewise reproduced only as images in the original publication: PCTCN2021098297-appb-000003 and PCTCN2021098297-appb-000004. As described below, each row records a conference identifier, e.g. Conf01 or Conf02, a conference room, e.g. Site01, a channel identifier, a voiceprint feature, and its candidate speakers.)
Suppose Conf02 is the analysis result shown in Table 1 above; the recording server can then compare User01's voiceprint feature against his most recent voiceprint features in conference room Site01. For example, the voiceprint features of User01 in conference room Site01 in Conf01 and in Conf02 are compared; if the comparison shows that the difference between User01's voiceprint features in the two conferences satisfies the threshold requirement, the recording server can conclude that the channels of the two conferences in conference room Site01 are consistent, and hence that the long-term voiceprint feature record can be used for reference. For example, row 7 of Table 2 shows that for voiceprint feature VP05 the candidate speakers are User05 and User08, while row 3 of Table 2 shows that for voiceprint feature VP05 the candidate speakers are User05, User06 and User07. The number of appearances of each candidate speaker under that channel identifier can therefore be counted, and the single speaker who appears most often, User05, is taken as the speaker corresponding to voiceprint feature VP05. Once the speaker of voiceprint feature VP05 has been determined, it can be determined that the speaker corresponding to voiceprint feature VP06 in Table 1 is User06.
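The counting step at the end of this procedure can be sketched as follows; the channel-consistency precheck (comparing a known speaker's voiceprint across conferences) is assumed to have passed, and the data shapes are illustrative assumptions.

    from collections import Counter

    def resolve_from_history(candidates_now, history_rows):
        # candidates_now: candidate speaker set for one unresolved voiceprint in
        # the current conference. history_rows: candidate sets recorded for the
        # matching voiceprint in earlier conferences on the same channel.
        counts = Counter()
        for row in [candidates_now, *history_rows]:
            counts.update(row)
        if not counts:
            return None
        (best, n), *rest = counts.most_common()
        if not rest or n > rest[0][1]:      # a unique most-frequent candidate
            return best
        return None                          # still ambiguous

    # e.g. resolve_from_history({"User05", "User08"},
    #                           [{"User05", "User06", "User07"}])  -> "User05"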
Suppose there is another conference, Conf03, in which User01 and User01's corresponding voiceprint feature are also present. The recording server then compares User01's voiceprint features in Conf01 and Conf03; if the comparison shows that the difference between User01's voiceprint features in the two conferences does not satisfy the threshold requirement, the recording server can register the voiceprint features and speaker information of Conf03 and update the long-term voiceprint feature record, for example as shown in rows 8 to 10 of Table 2. It can be understood that a change in the channel corresponding to a conference may mean that the conference room has changed or that the equipment involved in the conference has changed. As shown in Table 2, Conf03 and Conf02 took place in the same conference room (Site01) while the channel identifier changed, so it may be that the video conference terminal of Conf03 differs from that of Conf02, or that the multipoint control unit of Conf03 differs from that of Conf02.

It can be understood that the conference record processing apparatus may perform the long-term conference analysis (the analysis of Table 2) after the short-term conference analysis (the analysis of Table 1), or perform the short-term analysis after the long-term analysis; as long as the audio data can ultimately be distinguished, the specific order of operations is not limited here.
211. The recording server sends the audio data and the classification result of the audio data to the ASR server.

After the recording server has finished matching the audio data with speakers, it sends the classification result and the audio data to the ASR server.

212. The ASR server outputs the audio data as text.

In this embodiment, the video conference terminal collects the corresponding image information according to sound source localization and performs preliminary portrait recognition on that image information to obtain an identity recognition result; after obtaining the identity recognition result, the recording server combines it with the voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without requiring the users' voiceprint features to be registered in advance.
It can be understood that the functions of the recording server can also be integrated into the multipoint control unit. Referring to FIG. 4, an embodiment of the audio data processing method in the embodiments of this application includes:

401-405 are the same as 201-205 in the foregoing embodiment and are not repeated here.

406. The video conference terminal sends the audio code stream and the identity recognition result to the multipoint control unit.

For the manner of sending the audio code stream, refer to 206 above. The difference is that in this step the identity recognition result is also sent to the multipoint control unit.

407. The multipoint control unit decodes the audio code stream to obtain audio data and performs voice segmentation on the audio data to obtain the segmented audio data.

After obtaining the audio code stream, the multipoint control unit determines, according to the conference identifier assigned to the video conference terminal, the conference site to which the terminal belongs, decodes the audio code stream to obtain the audio data, and stores the audio data according to the site identifier. At the same time, the multipoint control unit performs voice segmentation on the audio data according to the sound source azimuth of the audio data and human-voice detection technology; the specific segmentation manner can be the same as in step 208 of the foregoing embodiment and is not repeated here. For the specific implementation of 408-410, refer to 209-211; the difference is that 408-410 are performed by the multipoint control unit, whereas 209-211 are performed by the recording server.
It can be understood that the functions of the recording server can also be implemented in the video conference terminal. Referring to FIG. 5, an embodiment of the audio data processing method in the embodiments of this application includes:

501-502 are the same as 201-202 in the foregoing embodiment and are not repeated here.

503. The video conference terminal performs voice segmentation on the audio data through human-voice detection and sound source localization to obtain segmented audio data.

For the manner in which the video conference terminal performs voice segmentation, refer to 208 above; details are not repeated here.

504-505 are the same as 204-205 in the foregoing embodiment and are not repeated here.

506-508 are implemented similarly to 209-211; the difference is that steps 506-508 are performed by the video conference terminal, whereas steps 209-211 are performed by the recording server.

509. The ASR server outputs the audio data as text.

In this embodiment, the video conference terminal collects the corresponding image information according to sound source localization, performs preliminary portrait recognition on that image information to obtain an identity recognition result, and then combines the identity recognition result with the voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without requiring the users' voiceprint features to be registered in advance.
Referring to FIG. 6, an embodiment of the audio data processing method in the embodiments of this application includes:

601. A conference record processing apparatus obtains audio data of a first conference site, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information.

The conference record processing apparatus may be the recording server in the method embodiment shown in FIG. 2, the multipoint control unit in the method embodiment shown in FIG. 4, or the video conference terminal in the method embodiment shown in FIG. 5.
In one application scenario, when the conference record processing apparatus is the recording server in the method embodiment shown in FIG. 2, it receives the audio data and the corresponding sound source azimuth information sent by the multipoint control unit. The audio data and its corresponding sound source azimuth information may be packaged to generate an audio code stream with additional field information, the additional field information including the sound source azimuth information corresponding to the audio data. In one exemplary solution, the video conference terminal encodes the audio data into an audio code stream and adds additional field information to the corresponding audio code stream, the additional field information indicating the sound source azimuth information corresponding to the audio data. The video conference terminal then sends the audio code stream to the multipoint control unit; after receiving it, the multipoint control unit determines, according to the conference identifier assigned to the video conference terminal, the conference site to which the terminal belongs, adds the site identifier to the audio code stream, and sends the audio code stream to the recording server. In one possible implementation, the multipoint control unit may screen the audio data of the individual conference sites and forward the audio data of only one or several selected sites to the recording server: it may compare the volume of the audio data from each site and select for forwarding the audio data whose volume exceeds a preset threshold, or it may determine by an algorithm the audio data whose detected human-voice duration exceeds a preset threshold and forward that. The specific screening conditions are not limited here; screening in this way reduces the amount of data to be processed and speeds up processing. The identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition, and is sent by the video conference terminal directly to the recording server.

In another application scenario, when the conference record processing apparatus is the multipoint control unit in the method embodiment shown in FIG. 4, it receives the audio data and the corresponding sound source azimuth information sent by the video conference terminal. The audio data and its corresponding sound source azimuth information may be packaged to generate an audio code stream with additional field information, the additional field information including the sound source azimuth information corresponding to the audio data. In one exemplary solution, the video conference terminal encodes the audio data into an audio code stream, adds additional field information indicating the corresponding sound source azimuth information, and then sends the audio code stream to the multipoint control unit. The identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition, and is sent by the video conference terminal to the multipoint control unit.

In another application scenario, when the conference record processing apparatus is the video conference terminal in the method embodiment shown in FIG. 5, it directly collects the audio data of the current conference through a microphone and obtains the sound source azimuth information corresponding to the audio data through sound source localization technology. The identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition.
602. The conference record processing apparatus performs voice segmentation on the audio data to obtain first segmented audio data of the audio data.

The conference record processing apparatus segments the audio data according to the sound source azimuth information and a human-voice detection method to obtain a plurality of segments of the audio data.

603. The conference record processing apparatus determines, according to the voiceprint feature of the first segmented audio data and the identity recognition result, the speaker corresponding to the first segmented audio data.

The conference record processing apparatus may perform the method shown in step 210 of FIG. 2, step 409 of FIG. 4, or step 507 of FIG. 5 above to obtain the speaker corresponding to the audio data; details are not repeated here.

In this embodiment, the conference record processing apparatus obtains an identity recognition result indicating the correspondence between speaker identity information and speaker time information, and then combines that identity recognition result with voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without requiring the users' voiceprint features to be registered in advance.
The audio data processing method in the embodiments of this application has been described above; the conference record processing apparatus and the video conference terminal in the embodiments of this application are described below.

Referring to FIG. 7, a conference record processing apparatus 700 in an embodiment of this application includes an acquisition module 701 and a processing module 702, where the acquisition module 701 and the processing module 702 are connected through a bus. The conference record processing apparatus 700 may be the recording server in the method embodiment shown in FIG. 2, the multipoint control unit in the method embodiment shown in FIG. 4, or the video conference terminal in the method embodiment shown in FIG. 5, and may also be configured as one or more chips within those devices. The conference record processing apparatus 700 may be used to perform some or all of the functions of those devices.

For example, the acquisition module 701 obtains audio data of a first conference site, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information; the processing module 702 performs voice segmentation on the audio data to obtain first segmented audio data of the audio data, and determines, according to the voiceprint feature of the first segmented audio data and the identity recognition result, the speaker corresponding to the first segmented audio data.
Optionally, the audio data is contained in an audio code stream, and the audio code stream further includes additional field information, the additional field information including the sound source azimuth information corresponding to the audio data.

Optionally, the processing module 702 is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, determine the speaker corresponding to the first segmented audio data according to that speaker identity information.

Optionally, the processing module 702 is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, compare the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, the second segmented audio data being obtained by the conference record processing apparatus through voice segmentation of the audio data and corresponding to unique speaker identity information; and, if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, determine the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.

Optionally, the processing module 702 is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, determine the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, the second segmented audio data being obtained by the conference record processing apparatus through voice segmentation of the audio data and corresponding to at least two pieces of speaker identity information.

Optionally, the processing module 702 is specifically configured to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record includes the historical voiceprint feature record of the first conference site, the historical voiceprint feature record of the first conference site indicating the correspondence among voiceprint features, speakers, and channel identifiers.

Optionally, the processing module 702 is specifically configured to: compare the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of that first speaker in the long-term voiceprint feature record to obtain a comparison result, the first speaker being a speaker already determined in the current conference of the first conference site; and, if the comparison result indicates that the first speaker's voiceprint feature in the current conference of the first conference site is consistent with the first speaker's voiceprint feature in the long-term voiceprint feature record, compare the voiceprint feature corresponding to the first segmented audio data and the identity recognition result against the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.

Optionally, the processing module 702 is further configured to: compare the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of that first speaker in the long-term voiceprint feature record to obtain a comparison result, the first speaker being a speaker already determined in the current conference of the first conference site; and, if the comparison result indicates that the first speaker's voiceprint feature in the current conference of the first conference site is inconsistent with the first speaker's voiceprint feature in the long-term voiceprint feature record, register the voiceprint features, the channel identifier, and the speakers corresponding to the voiceprint features of the current conference of the first conference site, and update the long-term voiceprint feature record.
Optionally, the conference record processing apparatus 700 further includes a storage module coupled to the processing module, so that the processing module can execute the computer-executable instructions stored in the storage module to implement the functions of the conference record processing apparatus in the foregoing method embodiments. In one example, the storage module optionally included in the conference record processing apparatus 700 may be an on-chip storage unit such as a register or a cache, or it may be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, or the like.

It should be understood that the flow performed among the modules of the conference record processing apparatus in the embodiment corresponding to FIG. 7 above is similar to the flow performed by the conference record processing apparatus in the method embodiments corresponding to FIG. 2 to FIG. 6, and is not repeated here.
FIG. 8 shows a possible schematic structural diagram of a conference record processing apparatus 800 in the foregoing embodiments. The conference record processing apparatus 800 may be configured as the recording server in the method embodiment shown in FIG. 2, the multipoint control unit in the method embodiment shown in FIG. 4, or the video conference terminal in the method embodiment shown in FIG. 5. The conference record processing apparatus 800 may include a processor 802, a computer-readable storage medium/memory 803, a transceiver 804, an input device 805, an output device 806, and a bus 801, where the processor, the transceiver, the computer-readable storage medium, and so on are connected through the bus. The embodiments of this application do not limit the specific connection medium between the above components.

In one example, the transceiver 804 obtains audio data of a first conference site, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information.

The processor 802 performs voice segmentation on the audio data to obtain first segmented audio data of the audio data, and determines, according to the voiceprint feature of the first segmented audio data and the identity recognition result, the speaker corresponding to the first segmented audio data.

In one example, the processor 802 may include a baseband circuit that can, for example, modulate the audio data and generate an audio code stream. The transceiver 804 may include a radio frequency circuit that modulates and amplifies the audio code stream before sending it to the corresponding device in the conference system.

In yet another example, the processor 802 may run an operating system and control the functions between the individual devices and components. The transceiver 804 may include a baseband circuit and a radio frequency circuit; for example, the audio code stream or the identity recognition result may be processed by the baseband circuit and the radio frequency circuit and then sent to the corresponding device in the conference system.

The transceiver 804 and the processor 802 can implement the corresponding steps in any of the embodiments in FIG. 2 to FIG. 6 above; details are not repeated here.

It can be understood that FIG. 8 shows only a simplified design of the conference record processing apparatus. In practical applications, the conference record processing apparatus may contain any number of transceivers, processors, memories, and so on, and all conference record processing apparatuses that can implement this application fall within the protection scope of this application.
The processor 802 involved in the above apparatus 800 may be a general-purpose processor, such as a CPU, a network processor (NP), or a microprocessor, or may be an ASIC or one or more integrated circuits for controlling the execution of the programs of the solutions of this application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The controller/processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The processor generally performs logical and arithmetic operations based on program instructions stored in the memory.

The bus 801 mentioned above may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or one type of bus.

The computer-readable storage medium/memory 803 mentioned above may also store an operating system and other application programs. Specifically, a program may include program code, and the program code includes computer operation instructions. More specifically, the above memory may be a ROM, another type of static storage device that can store static information and instructions, a RAM, another type of dynamic storage device that can store information and instructions, a disk storage, and so on. The memory 803 may be a combination of the above storage types. The above computer-readable storage medium/memory may be in the processor, may be outside the processor, or may be distributed over multiple entities including the processor or processing circuits. The above computer-readable storage medium/memory may be embodied in a computer program product; by way of example, a computer program product may include a computer-readable medium in packaging materials.

Alternatively, an embodiment of this application further provides a general-purpose processing system, commonly referred to as a chip, which includes one or more microprocessors providing processor functions and an external memory providing at least a part of a storage medium, all connected together with other supporting circuits through an external bus architecture. When the instructions stored in the memory are executed by the processor, the processor is caused to perform some or all of the steps of the data transmission method performed by the first communication apparatus in the embodiments of FIG. 2 to FIG. 6, and/or other processes of the techniques described in this application.

The steps of the methods or algorithms described in connection with the disclosure of this application may be implemented in hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a RAM memory, a flash memory, a ROM memory, an EPROM memory, an EEPROM memory, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium; of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in the conference record processing apparatus. Of course, the processor and the storage medium may also exist in the conference record processing apparatus as discrete components.
Referring to FIG. 9, a video conference terminal 900 in an embodiment of this application includes a processing module 901 and a sending module 902, where the processing module 901 and the sending module 902 are connected through a bus. The video conference terminal 900 may be the video conference terminal in the foregoing method embodiments, and may also be configured as one or more chips within that video conference terminal. The video conference terminal 900 may be used to perform some or all of the functions of the above video conference terminal.

For example, the processing module 901 performs sound source localization on audio data of a first conference site to obtain sound source azimuth information corresponding to the audio data, and obtains an identity recognition result according to the sound source azimuth and a portrait recognition method, the identity recognition result indicating the correspondence between speaker identity information and speaking time information; the sending module 902 sends the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to the conference record processing apparatus.

Optionally, the processing module 901 is specifically configured to: obtain the portrait information corresponding to the sound source azimuth; perform image recognition on the portrait information to obtain face information and/or body attribute information; determine the speaker identity information according to the face information and/or the body attribute information; and establish a correspondence between the speaker time information and the speaker identity information to obtain the identity recognition result.

Optionally, the video conference terminal 900 further includes a storage module coupled to the processing module, so that the processing module can execute the computer-executable instructions stored in the storage module to implement the functions of the video conference terminal in the foregoing method embodiments. In one example, the storage module optionally included in the video conference terminal 900 may be an on-chip storage unit such as a register or a cache, or it may be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, or the like.

It should be understood that the flow performed among the modules of the video conference terminal in the embodiment corresponding to FIG. 9 above is similar to the flow performed by the video conference terminal in the method embodiments corresponding to FIG. 2 to FIG. 6, and is not repeated here.
FIG. 10 shows a possible schematic structural diagram of a video conference terminal 1000 in the foregoing embodiments. The video conference terminal 1000 may be configured as the aforementioned video conference terminal and may include a processor 1002, a computer-readable storage medium/memory 1003, a transceiver 1004, an input device 1005, an output device 1006, and a bus 1001, where the processor, the transceiver, the computer-readable storage medium, and so on are connected through the bus. The embodiments of this application do not limit the specific connection medium between the above components.

In one example, the processor 1002 performs sound source localization on audio data of a first conference site to obtain sound source azimuth information corresponding to the audio data, and obtains an identity recognition result according to the sound source azimuth and a portrait recognition method, the identity recognition result indicating the correspondence between speaker identity information and speaking time information.

The transceiver 1004 sends the identity recognition result and the audio data to the conference record processing apparatus.

In one example, the processor 1002 may include a baseband circuit that can, for example, modulate the audio data and generate an audio code stream. The transceiver 1004 may include a radio frequency circuit that modulates and amplifies the audio code stream before sending it to the corresponding device in the conference system.

In yet another example, the processor 1002 may run an operating system and control the functions between the individual devices and components. The transceiver 1004 may include a baseband circuit and a radio frequency circuit; for example, the audio code stream or the identity recognition result may be processed by the baseband circuit and the radio frequency circuit and then sent to the corresponding device in the conference system.
The transceiver 1004 and the processor 1002 can implement the corresponding steps in any of the embodiments in FIG. 2 to FIG. 6 above; details are not repeated here.
It can be understood that FIG. 10 shows only a simplified design of the video conference terminal. In practical applications, the video conference terminal may contain any number of transceivers, processors, memories, and so on, and all video conference terminals that can implement this application fall within the protection scope of this application.
上述装置1000中涉及的处理器1002可以是通用处理器,例如CPU、网络处理器(network processor,NP)、微处理器等,也可以是ASIC,或一个或多个用于控制本申请方案程序执行的集成电路。还可以是数字信号处理器(digital signal processor,DSP)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。控制器/处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等等。处理器通常是基于存储器内存储的程序指令来执行逻辑和算术运算。The processor 1002 involved in the above-mentioned apparatus 1000 may be a general-purpose processor, such as a CPU, a network processor (NP), a microprocessor, etc., or an ASIC, or one or more programs used to control the solution of the present application implemented integrated circuits. It can also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components. A controller/processor may also be a combination that implements computing functions, such as a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, and the like. Processors typically perform logical and arithmetic operations based on program instructions stored in memory.
上述涉及的总线1001可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图10中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。The above-mentioned bus 1001 may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (EISA for short) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 10, but it does not mean that there is only one bus or one type of bus.
The computer-readable storage medium/memory 1003 may further store an operating system and other application programs. Specifically, the program may include program code, and the program code includes computer operation instructions. More specifically, the memory may be a ROM, another type of static storage device that can store static information and instructions, a RAM, another type of dynamic storage device that can store information and instructions, a disk storage, or the like. The memory 1003 may also be a combination of the above storage types. The computer-readable storage medium/memory may reside in the processor, outside the processor, or be distributed across multiple entities including the processor or a processing circuit. The computer-readable storage medium/memory may be embodied in a computer program product. By way of example, the computer program product may include a computer-readable medium in packaging materials.
Alternatively, an embodiment of this application further provides a general-purpose processing system, commonly referred to as a chip, including one or more microprocessors that provide the processor function and an external memory that provides at least a part of the storage medium, all of which are connected to other supporting circuits through an external bus architecture. When the instructions stored in the memory are executed by the processor, the processor is caused to perform some or all of the steps of the data transmission method performed by the first communication apparatus in the embodiments of FIG. 2 to FIG. 6, and/or other processes of the techniques described in this application.
The steps of the methods or algorithms described in connection with the disclosure of this application may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a RAM, a flash memory, a ROM, an EPROM, an EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Certainly, the storage medium may alternatively be a component of the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in the video conference terminal. Certainly, the processor and the storage medium may also exist in the video conference terminal as discrete components.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (25)

  1. A method for processing audio data, comprising:
    obtaining, by a conference record processing apparatus, audio data of a first conference site in a current conference, sound source direction information corresponding to the audio data, and an identity recognition result, wherein the identity recognition result is used to indicate a correspondence between speaker identity information obtained by a portrait recognition method and speaking time information of a speaker;
    performing, by the conference record processing apparatus, voice segmentation on the audio data to obtain first segmented audio data of the audio data; and
    determining, by the conference record processing apparatus, a speaker corresponding to the first segmented audio data according to a voiceprint feature of the first segmented audio data and the identity recognition result.
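Purely as an illustrative sketch of the attribution flow in claim 1 above (segment_speech is an assumed helper, and the tuple layout of the identity result is an assumption for illustration, not a format defined by this application):

    from dataclasses import dataclass
    from typing import Callable, List, Optional, Set, Tuple

    @dataclass
    class Segment:
        audio: bytes
        start_ms: int
        end_ms: int
        speaker: Optional[str] = None   # filled in during attribution

    # identity result entries: (speaker identity, speaking start ms, speaking end ms)
    IdentityResult = List[Tuple[str, int, int]]

    def overlapping_speakers(identity_result: IdentityResult, seg: Segment) -> Set[str]:
        """Speaker identities whose speaking time overlaps this segment."""
        return {sid for sid, start, end in identity_result
                if start < seg.end_ms and seg.start_ms < end}

    def attribute(audio: bytes, identity_result: IdentityResult,
                  segment_speech: Callable[[bytes], List[Segment]]) -> List[Segment]:
        segments = segment_speech(audio)    # voice segmentation (assumed helper)
        for seg in segments:
            ids = overlapping_speakers(identity_result, seg)
            if len(ids) == 1:               # the identity result pins a unique speaker
                seg.speaker = ids.pop()
            # segments with 0 or >= 2 candidates are resolved by voiceprint (claims 4 to 6)
        return segments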
  2. The method according to claim 1, wherein the audio data is contained in an audio code stream, the audio code stream further comprises additional field information, and the additional field information comprises the sound source direction information corresponding to the audio data.
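One way to picture the code stream of claim 2 is a frame whose additional field carries the sound source direction; the byte layout below (a 2-byte payload length, a 4-byte float azimuth, then the encoded payload) is only an assumption for illustration, not a format defined by this application:

    import struct

    _HEADER = "<Hf"   # assumed layout: u16 payload length + f32 azimuth in degrees

    def pack_frame(audio_payload: bytes, azimuth_deg: float) -> bytes:
        """Packs one encoded audio frame together with the additional field."""
        return struct.pack(_HEADER, len(audio_payload), azimuth_deg) + audio_payload

    def unpack_frame(frame: bytes) -> tuple:
        """Recovers the audio payload and the sound source direction."""
        length, azimuth = struct.unpack_from(_HEADER, frame)
        payload = frame[struct.calcsize(_HEADER):][:length]
        return payload, azimuth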
  3. The method according to claim 1 or 2, wherein the determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result comprises:
    if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the speaker identity information.
  4. The method according to claim 1 or 2, wherein the determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result comprises:
    if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, comparing, by the conference record processing apparatus, the voiceprint feature of the first segmented audio data with a voiceprint feature of second segmented audio data, wherein the second segmented audio data is obtained by the conference record processing apparatus by performing voice segmentation on the audio data, and the second segmented audio data corresponds to unique speaker identity information; and
    if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
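A sketch of this disambiguation step follows; the embedding function embed and the similarity threshold are assumptions standing in for whatever voiceprint model is used:

    import math
    from typing import Callable, Sequence

    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def resolve_by_voiceprint(ambiguous, resolved, embed: Callable, thr: float = 0.75) -> bool:
        """For a segment with two or more candidate speakers, reuses the speaker of
        a segment that already has a unique speaker and a matching voiceprint."""
        vp = embed(ambiguous.audio)
        for ref in resolved:                         # segments with a unique speaker
            if cosine(vp, embed(ref.audio)) >= thr:  # voiceprint features "consistent"
                ambiguous.speaker = ref.speaker
                return True
        return False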
  5. The method according to claim 1 or 2, wherein the determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result comprises:
    if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data and speaker identity information and a voiceprint feature corresponding to second segmented audio data, wherein the second segmented audio data is obtained by the conference record processing apparatus by performing voice segmentation on the audio data, and the second segmented audio data corresponds to at least two pieces of speaker identity information.
  6. The method according to any one of claims 1 to 4, wherein if the conference record processing apparatus does not determine a unique speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result, the method further comprises:
    determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, wherein the long-term voiceprint feature record comprises a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used to indicate a correspondence among voiceprint features, speakers, and channel identifiers.
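The long-term voiceprint feature record of claim 6 can be pictured as a table keyed by channel identifier; the structure and matching rule below are assumptions for illustration only:

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Optional, Sequence, Tuple

    @dataclass
    class LongTermRecord:
        # channel identifier -> list of (stored voiceprint embedding, speaker identity)
        entries: Dict[str, List[Tuple[Sequence[float], str]]] = field(default_factory=dict)

        def lookup(self, channel: str, vp: Sequence[float],
                   similarity: Callable[[Sequence[float], Sequence[float]], float],
                   thr: float = 0.75) -> Optional[str]:
            """Returns the historical speaker whose stored voiceprint matches vp."""
            for stored_vp, speaker in self.entries.get(channel, []):
                if similarity(vp, stored_vp) >= thr:
                    return speaker
            return None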
  7. The method according to claim 6, wherein the determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site comprises:
    comparing, by the conference record processing apparatus, a voiceprint feature of a first speaker in the current conference of the first conference site with a voiceprint feature of the first speaker in the long-term voiceprint feature record, wherein the first speaker is a speaker already determined in the current conference of the first conference site; and
    if the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data based on the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site.
  8. The method according to claim 6, wherein the method further comprises:
    comparing, by the conference record processing apparatus, a voiceprint feature of a first speaker in the current conference of the first conference site with a voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, wherein the first speaker is a speaker already determined in the current conference of the first conference site; and
    if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is inconsistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, registering, by the conference record processing apparatus, the voiceprint feature in the current conference of the first conference site, the channel identifier, and the speaker corresponding to the voiceprint feature, and updating the long-term voiceprint feature record.
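Continuing the same sketch, the register-and-update step of claim 8 might look like the following (the structure and threshold remain assumptions):

    def register_or_update(record: LongTermRecord, channel: str,
                           vp, speaker: str, similarity, thr: float = 0.75) -> None:
        """Registers the current voiceprint for this channel, replacing a stored
        voiceprint for the same speaker when the two are no longer consistent."""
        entries = record.entries.setdefault(channel, [])
        for i, (stored_vp, stored_speaker) in enumerate(entries):
            if stored_speaker == speaker:
                if similarity(vp, stored_vp) < thr:   # inconsistent with the history
                    entries[i] = (vp, speaker)        # update the long-term record
                return
        entries.append((vp, speaker))                 # first-time registration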
  9. A method for processing audio data, comprising:
    performing, by a video conference terminal, sound source localization on audio data of a first conference site to obtain sound source direction information corresponding to the audio data;
    obtaining, by the video conference terminal, an identity recognition result according to the sound source direction and a portrait recognition method, wherein the identity recognition result is used to indicate a correspondence between speaker identity information and speaking time information; and
    sending, by the video conference terminal, the identity recognition result, the audio data, and the sound source direction information corresponding to the audio data to a conference record processing apparatus.
  10. The method according to claim 9, wherein the obtaining, by the video conference terminal, an identity recognition result according to the sound source direction and a portrait recognition method comprises:
    obtaining, by the video conference terminal, portrait information corresponding to the sound source direction;
    performing, by the video conference terminal, image recognition on the portrait information to obtain face information and/or body attribute information;
    determining, by the video conference terminal, the speaker identity information according to the face information and/or the body attribute information; and
    establishing, by the video conference terminal, a correspondence between speaking time information and the speaker identity information to obtain the identity recognition result.
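An illustrative decomposition of the identification step of claim 10; recognize_face and recognize_body stand in for whatever face and body attribute recognizers are deployed (assumed helpers, not APIs defined here):

    from typing import Callable, Optional

    def identify_speaker(
        portrait: bytes,
        recognize_face: Callable[[bytes], Optional[str]],
        recognize_body: Callable[[bytes], Optional[str]],
    ) -> Optional[str]:
        """Determines speaker identity from the portrait cut out at the sound
        source direction: face information first, body attributes as a fallback."""
        face_id = recognize_face(portrait)   # face information
        if face_id is not None:
            return face_id
        return recognize_body(portrait)      # body attribute information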
  11. A conference record processing apparatus, comprising:
    an obtaining module, configured to obtain audio data of a first conference site, sound source direction information corresponding to the audio data, and an identity recognition result, wherein the identity recognition result is used to indicate a correspondence between speaker identity information obtained by a portrait recognition method and speaking time information of a speaker; and
    a processing module, configured to perform voice segmentation on the audio data to obtain first segmented audio data of the audio data, and determine a speaker corresponding to the first segmented audio data according to a voiceprint feature of the first segmented audio data and the identity recognition result.
  12. The apparatus according to claim 11, wherein the audio data is contained in an audio code stream, the audio code stream further comprises additional field information, and the additional field information comprises the sound source direction information corresponding to the audio data.
  13. The apparatus according to claim 11 or 12, wherein the processing module is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, determine the speaker corresponding to the first segmented audio data according to the speaker identity information.
  14. The apparatus according to claim 11 or 12, wherein the processing module is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, compare the voiceprint feature of the first segmented audio data with a voiceprint feature of second segmented audio data, wherein the second segmented audio data is obtained by the conference record processing apparatus by performing voice segmentation on the audio data, and the second segmented audio data corresponds to unique speaker identity information; and
    if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, determine the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
  15. The apparatus according to claim 11 or 12, wherein the processing module is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, determine the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data and speaker identity information and a voiceprint feature corresponding to second segmented audio data, wherein the second segmented audio data is obtained by the conference record processing apparatus by performing voice segmentation on the audio data, and the second segmented audio data corresponds to at least two pieces of speaker identity information.
  16. The apparatus according to any one of claims 11 to 15, wherein the processing module is further configured to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, wherein the long-term voiceprint feature record comprises a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used to indicate a correspondence among voiceprint features, speakers, and channel identifiers.
  17. The apparatus according to claim 16, wherein the processing module is specifically configured to: compare a voiceprint feature of a first speaker in the current conference of the first conference site with a voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, wherein the first speaker is a speaker already determined in the current conference of the first conference site; and
    if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, compare the voiceprint feature corresponding to the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
  18. The apparatus according to claim 16, wherein the processing module is further configured to: compare a voiceprint feature of a first speaker in the current conference of the first conference site with a voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, wherein the first speaker is a speaker already determined in the current conference of the first conference site; and
    if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is inconsistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, register the voiceprint feature in the current conference of the first conference site, the channel identifier, and the speaker corresponding to the voiceprint feature, and update the long-term voiceprint feature record.
  19. A video conference terminal, comprising:
    a processing module, configured to perform sound source localization on audio data of a first conference site to obtain sound source direction information corresponding to the audio data, and obtain an identity recognition result according to the sound source direction and a portrait recognition method, wherein the identity recognition result is used to indicate a correspondence between speaker identity information and speaking time information; and
    a sending module, configured to send the identity recognition result, the audio data, and the sound source direction information corresponding to the audio data to a conference record processing apparatus.
  20. The video conference terminal according to claim 19, wherein the processing module is specifically configured to: obtain portrait information corresponding to the sound source direction; perform image recognition on the portrait information to obtain face information and/or body attribute information; determine the speaker identity information according to the face information and/or the body attribute information; and establish a correspondence between speaking time information and the speaker identity information to obtain the identity recognition result.
  21. A conference record processing apparatus, comprising at least one processor and a memory, wherein the processor is coupled to the memory, and the processor invokes instructions stored in the memory to control the apparatus to perform the method according to any one of claims 1 to 8.
  22. A video conference terminal, comprising at least one processor and a memory, wherein the processor is coupled to the memory, and the processor invokes instructions stored in the memory to control the terminal to perform the method according to claim 9 or 10.
  23. A conference record processing system, comprising the conference record processing apparatus according to any one of claims 11 to 18, the video conference terminal according to claim 19 or 20, a multipoint control unit, and an automatic speech recognition (ASR) server.
  24. A computer storage medium, wherein the computer storage medium stores computer instructions, and the computer instructions are used to perform the method according to any one of claims 1 to 10.
  25. A computer program product comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 10.
PCT/CN2021/098297 2020-09-25 2021-06-04 Audio data processing method, device and system WO2022062471A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011027427.2 2020-09-25
CN202011027427.2A CN114333853A (en) 2020-09-25 2020-09-25 Audio data processing method, equipment and system

Publications (1)

Publication Number Publication Date
WO2022062471A1 true WO2022062471A1 (en) 2022-03-31

Family

ID=80844861

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/098297 WO2022062471A1 (en) 2020-09-25 2021-06-04 Audio data processing method, device and system

Country Status (2)

Country Link
CN (1) CN114333853A (en)
WO (1) WO2022062471A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117059092A (en) * 2023-10-11 2023-11-14 北京吉道尔科技有限公司 Intelligent medical interactive intelligent diagnosis method and system based on blockchain
CN117528335A (en) * 2023-12-05 2024-02-06 广东鼎诺科技音频有限公司 Audio equipment applying directional microphone and noise reduction method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023212879A1 (en) * 2022-05-05 2023-11-09 北京小米移动软件有限公司 Object audio data generation method and apparatus, electronic device, and storage medium
CN115019809B (en) * 2022-05-17 2024-04-02 中国南方电网有限责任公司超高压输电公司广州局 Method, apparatus, device, medium and program product for monitoring false entry prevention interval

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968991A (en) * 2012-11-29 2013-03-13 华为技术有限公司 Method, device and system for sorting voice conference minutes
US20150235654A1 (en) * 2011-06-17 2015-08-20 At&T Intellectual Property I, L.P. Speaker association with a visual representation of spoken content
CN106657865A (en) * 2016-12-16 2017-05-10 联想(北京)有限公司 Method and device for generating conference summary and video conference system
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN109560941A (en) * 2018-12-12 2019-04-02 深圳市沃特沃德股份有限公司 Minutes method, apparatus, intelligent terminal and storage medium
CN110022454A (en) * 2018-01-10 2019-07-16 华为技术有限公司 A kind of method and relevant device identifying identity in video conference
CN110232925A (en) * 2019-06-28 2019-09-13 百度在线网络技术(北京)有限公司 Generate the method, apparatus and conference terminal of minutes
US20190318743A1 (en) * 2018-04-17 2019-10-17 Gong I.O Ltd. Metadata-based diarization of teleconferences
WO2019217133A1 (en) * 2018-05-07 2019-11-14 Microsoft Technology Licensing, Llc Voice identification enrollment
EP3627505A1 (en) * 2018-09-21 2020-03-25 Televic Conference NV Real-time speaker identification with diarization
CN111402892A (en) * 2020-03-23 2020-07-10 郑州智利信信息技术有限公司 Conference recording template generation method based on voice recognition

Also Published As

Publication number Publication date
CN114333853A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2022062471A1 (en) Audio data processing method, device and system
US11343446B2 (en) Systems and methods for implementing personal camera that adapts to its surroundings, both co-located and remote
US20190215464A1 (en) Systems and methods for decomposing a video stream into face streams
EP2172016B1 (en) Techniques for detecting a display device
US20160359941A1 (en) Automated video editing based on activity in video conference
US10230922B2 (en) Combining installed audio-visual sensors with ad-hoc mobile audio-visual sensors for smart meeting rooms
US20140340467A1 (en) Method and System for Facial Recognition for a Videoconference
WO2019184650A1 (en) Subtitle generation method and terminal
US9165182B2 (en) Method and apparatus for using face detection information to improve speaker segmentation
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
US11405584B1 (en) Smart audio muting in a videoconferencing system
US7792326B2 (en) Method of tracking vocal target
US20180288373A1 (en) Treatment method for doorbell communication
WO2021134720A1 (en) Method for processing conference data and related device
CN112908336A (en) Role separation method for voice processing device and voice processing device thereof
TWI798867B (en) Video processing method and associated system on chip
TW202405796A (en) Video processing method for performing partial highlighting with aid of auxiliary information detection, and associated system on chip
TWI805233B (en) Method and system for controlling multi-party voice communication
TWI764020B (en) Video conference system and method thereof
US20220415003A1 (en) Video processing method and associated system on chip
CN111182256A (en) Information processing method and server
US20230237621A1 (en) Video processing method and associated system on chip
CN117544745A (en) Video processing method and system chip for local emphasis by aid of auxiliary information
TW201939483A (en) Voice system and voice detection method
TW202405795A (en) Video processing method for performing partial highlighting with aid of hand gesture detection, and associated system on chip

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21870855; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21870855; Country of ref document: EP; Kind code of ref document: A1)