WO2022062471A1 - Audio data processing method, device and system

Info

Publication number
WO2022062471A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
speaker
conference
voiceprint feature
information
Application number
PCT/CN2021/098297
Other languages
French (fr)
Chinese (zh)
Inventor
张鹏 (Zhang Peng)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022062471A1

Classifications

    • G06V 40/10: Recognition of biometric, human-related or animal-related patterns in image or video data; human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 15/25: Speech recognition using the position of the lips, movement of the lips or face analysis
    • G10L 15/26: Speech-to-text systems
    • G10L 17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/22: Interactive procedures; man-machine interfaces
    • H04N 7/15: Conference systems

Definitions

  • the present application relates to the field of communications, and in particular, to a method, device and system for processing audio data.
  • If a voice file contains only one person's voice, the audio data of the entire file can be sent directly to the voiceprint recognition system for identification. If the file contains multiple voices, it must first be segmented, and voiceprint recognition is then performed on each piece of audio data.
  • Existing voiceprint recognition systems usually require more than 10 seconds of audio data, and the longer the sample, the higher the accuracy, so segments cannot be made too short. However, since video conferences contain many free-conversation scenes, a long segment of audio data may contain the speech of several people, which makes the recognition result unreliable.
  • Moreover, the premise of the above solution is that conference participants must register their voiceprints in the voiceprint recognition system in advance. The sound pickup channel, however, strongly affects voiceprint features, and because there are many kinds of pickup channels, it is difficult to guarantee the accuracy of voiceprint recognition for sounds collected through different channels.
  • Embodiments of the present application provide an audio data processing method, device, and system, which are used to accurately classify conference audio data.
  • In a first aspect, an embodiment of the present application provides a method for processing audio data, which specifically includes: a conference record processing device obtains audio data of a first conference site, sound source location information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained by a portrait recognition method and the speaker's speaking time information; the conference record processing device then performs voice segmentation on the audio data to obtain first segmented audio data; finally, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
  • The audio data and its corresponding sound source location information can be packaged into an audio code stream, in which additional field information carries the sound source location information corresponding to the audio data. The audio data processing method may be applied to local or remote conference scenarios, where at least one conference site participates in the conference. Based on the above solution, the additional field information may further include time information of the audio data and site identification information of the first conference site, among other information.
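  • As an illustration of this packaging step, below is a minimal sketch that frames an encoded audio payload with additional field information; the JSON-header framing and the field names (site_id, azimuth_deg, timestamp) are our own assumptions, not the patent's format.

```python
# A minimal sketch of packaging audio data together with additional field
# information (sound source location, time, site identifier). The framing
# scheme and field names are illustrative assumptions.
import json
import struct

def pack_audio_frame(pcm: bytes, site_id: str, azimuth_deg: float,
                     timestamp: str) -> bytes:
    """Prepend a length-prefixed metadata header to an encoded audio payload."""
    extra_fields = json.dumps({
        "site_id": site_id,          # identifies the first conference site
        "azimuth_deg": azimuth_deg,  # sound source location for this payload
        "timestamp": timestamp,      # time information of the audio data
    }).encode("utf-8")
    return struct.pack(">I", len(extra_fields)) + extra_fields + pcm

def unpack_audio_frame(frame: bytes):
    """Recover the additional field information and the audio payload."""
    (hdr_len,) = struct.unpack(">I", frame[:4])
    meta = json.loads(frame[4:4 + hdr_len].decode("utf-8"))
    return meta, frame[4 + hdr_len:]

frame = pack_audio_frame(b"\x00\x01" * 160, "site-01", 30.0, "00:00:15")
meta, payload = unpack_audio_frame(frame)
print(meta["azimuth_deg"], len(payload))
```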
  • Portrait recognition methods include face recognition and human body attribute recognition.
  • The speaker corresponding to facial features is obtained through face recognition, while human body attribute recognition identifies the user's overall clothing or physical features to obtain the speaker corresponding to those physical features or to the appearance of the user's clothing.
  • The speaker identity information may be user identity information (such as the speaker's employee number in the company, or an ID number or phone number registered in the company's internal database) or user body attribute identification information (for example, that in the current meeting the user wears a white top and black trousers, or that the user has a visible mark on the arm).
  • the speaking time information may be a period of time or two time points.
  • For example, the speaking time information is the 30 seconds from 00:00:15 to 00:00:45 after the current conference starts; alternatively, the speaking time information only includes the two time points "00:00:15" and "00:00:45".
  • It can be understood that, in the embodiments of the present application, times in the form "00:00:00" follow the rule "hours:minutes:seconds", so the time point "00:00:15" denotes 15 seconds after the start of the meeting.
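  • As a small illustration, this timing rule can be parsed as follows (the helper name is our own):

```python
# A small helper for the "hours:minutes:seconds" timing rule described above;
# purely illustrative.
def offset_seconds(hms: str) -> int:
    """Convert an "HH:MM:SS" offset from the conference start into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

start, end = offset_seconds("00:00:15"), offset_seconds("00:00:45")
print(end - start)  # 30 seconds of speaking time
```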
  • In the embodiments of the present application, the conference record processing device obtains an identity recognition result indicating the correspondence between speaker identity information and speaking time information, and then combines that result with voiceprint features to further identify the audio data, so that voice data can be accurately classified without pre-registering users' voiceprint features.
  • The operation by which the conference record processing device determines the speaker corresponding to the first segmented audio data according to its voiceprint feature and the identity recognition result may be as follows:
  • If the identity recognition result indicates unique speaker identity information for the first segmented audio data, the conference record processing device determines the speaker directly from that information. That is, if the identity recognition result obtained for the first segmented audio data indicates that its only speaker is user01, with corresponding voiceprint feature VP01, the conference record processing device determines that the speaker of the first segmented audio data is user01.
  • Alternatively, the conference record processing device compares the voiceprint feature of the first segmented audio data with that of second segmented audio data, where the second segmented audio data is also obtained by the conference record processing device through voice segmentation of the audio data and corresponds to unique speaker identity information. If the voiceprint feature of the first segmented audio data is consistent with that of the second segmented audio data, the conference record processing device determines the speaker corresponding to the first segmented audio data from the speaker identity information corresponding to the second segmented audio data.
  • the speaker identity information of the second segment of audio data has been determined to be user02, the corresponding voiceprint feature is VP02, the voiceprint feature of the first segment of audio data is VP02, and the corresponding speaker identity information includes user03 and user02;
  • The above shows that the voiceprint features of the first and second segmented audio data are both VP02, and the second segmented audio data establishes that the speaker corresponding to VP02 is user02; it can therefore be determined that the speaker of the first segmented audio data is also user02.
  • In another case, the conference record processing device determines the speaker corresponding to the first segmented audio data from the speaker identity information and voiceprint feature of the first segmented audio data together with the speaker identity information and voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to at least two pieces of speaker identity information. That is, the conference record processing device can comprehensively determine the speaker of each segmented audio data from the voiceprint features of multiple segmented audio data and their corresponding speaker identity information.
  • the speaker identity information of the second segment of audio data has been determined as user02 and user03, the corresponding voiceprint feature is VP02, the voiceprint feature of the first segment of audio data is VP03, and the corresponding speaker identity information includes user03 and user02 , the voiceprint feature of the third segment of audio data is VP03, and the corresponding speaker identity information is user03 and user01.
  • The voiceprint features of the first and third segmented audio data are both VP03, and the speaker identity sets corresponding to these two segments have a unique intersection, namely user03, who is therefore determined as their speaker (see the sketch below).
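  • The following minimal sketch illustrates this intersection rule with the example's values; the helper name and data layout are ours, not the patent's.

```python
# Segments that share a voiceprint feature are attributed to the single
# identity common to their candidate speaker sets (unique intersection).
from collections import defaultdict

segments = [
    # (segment id, voiceprint feature, candidate speaker identities)
    ("seg1", "VP03", {"user03", "user02"}),
    ("seg2", "VP02", {"user02", "user03"}),
    ("seg3", "VP03", {"user03", "user01"}),
]

def resolve_by_intersection(segments):
    by_voiceprint = defaultdict(list)
    for seg_id, vp, candidates in segments:
        by_voiceprint[vp].append((seg_id, candidates))
    speakers = {}
    for vp, segs in by_voiceprint.items():
        common = set.intersection(*(cands for _, cands in segs))
        if len(common) == 1:          # unique intersection -> speaker resolved
            speaker = common.pop()
            for seg_id, _ in segs:
                speakers[seg_id] = speaker
    return speakers

print(resolve_by_intersection(segments))  # seg1 and seg3 resolve to user03
```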
  • In yet another case, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site. The long-term voiceprint feature record includes the historical voiceprint feature record of the first conference site, which indicates the correspondence between voiceprint features, speakers, and channel identifiers.
  • When the conference record processing device determines the speaker corresponding to the first segmented audio data according to its voiceprint feature, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the conference record processing device compares the voiceprint feature of a first speaker in the current conference of the first conference site with that speaker's voiceprint feature in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are consistent, the conference record processing device determines that the long-term voiceprint feature record is usable, and it then determines the speaker corresponding to the first segmented audio data by comparing the voiceprint feature of the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the first conference site.
  • the combination of short-term processing and long-term processing can improve the accuracy of audio data classification as much as possible.
  • Further, the conference record processing device compares the voiceprint feature of the first speaker in the current conference of the first conference site with that speaker's voiceprint feature in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference. If the two voiceprint features are inconsistent, the conference record processing device registers the voiceprint feature, the channel identifier, and the speaker corresponding to the voiceprint feature from the current conference of the first conference site, and updates the long-term voiceprint feature record.
  • In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, so that the long-term voiceprint feature record remains usable.
  • The corresponding voiceprint features and speakers are registered dynamically, so registration is no longer limited to voiceprint features tied to fixed channel identifiers, which effectively enables accurate classification of the audio data.
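  • A hedged sketch of this validate-then-update logic follows; the record layout and the cosine-similarity threshold are illustrative assumptions, not the patent's method.

```python
# Validate the long-term voiceprint feature record against a confirmed speaker
# of the current conference; re-register and update the record on mismatch.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# long-term record: channel id -> (speaker, voiceprint feature vector)
long_term = {"ch-01": ("user02", [0.9, 0.1, 0.2])}

def check_and_update(channel, speaker, current_vp, record, threshold=0.8):
    entry = record.get(channel)
    if entry and entry[0] == speaker and cosine(entry[1], current_vp) >= threshold:
        return True   # record is consistent and therefore usable
    # inconsistent: register voiceprint, channel id and speaker; update record
    record[channel] = (speaker, current_vp)
    return False

usable = check_and_update("ch-01", "user02", [0.88, 0.12, 0.21], long_term)
print(usable, long_term["ch-01"][0])
```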
  • Optionally, the conference record processing device can acquire voiceprint identification information for the voiceprint feature of the first segmented audio data, and then establish a correspondence between that voiceprint identification information and the speaker corresponding to the first segmented audio data. A one-to-one correspondence between voiceprint features and speakers facilitates subsequent classification and processing of audio data.
  • the conference recording processing device may be a recording and broadcasting server or a functional module integrated in the multi-point control unit. Therefore, the audio code stream can be forwarded by the multipoint control unit to the conference record processing device, and the identification result is sent to the conference record processing device by the video conference terminal.
  • Optionally, the audio code stream is forwarded to the conference record processing apparatus by the multipoint control unit after conference-site selection, which avoids unnecessary data transmission and reduces the burden on the network.
  • the specific operation of the conference recording processing apparatus for performing voice segmentation on the audio data may be as follows: the conference recording processing apparatus performs voice segmentation on the audio data according to the sound source location information and the human voice detection technology. This allows for more precise segmentation of audio data.
  • In a second aspect, an embodiment of the present application provides a method for processing audio data, which includes: a video conference terminal performs sound source localization on audio data of a first conference site to obtain sound source location information corresponding to the audio data; the video conference terminal obtains an identity recognition result according to the sound source location and a portrait recognition method, where the identity recognition result indicates the correspondence between speaker identity information and speaking time information; and the video conference terminal sends the identity recognition result, the audio data, and the sound source location information corresponding to the audio data to the conference record processing device.
  • In the embodiments of the present application, the video conference terminal acquires the speaker's image information by performing sound source localization on the audio data, obtains through portrait recognition of that image information an identity recognition result giving the correspondence between speaker identity information and speaking time information, and then sends the identity recognition result to the conference record processing device. The conference record processing device combines the identity recognition result with voiceprint features to further identify the audio data, so that accurate classification of speech data can be achieved without pre-registering users' voiceprint features.
  • The specific process by which the video conference terminal performs portrait recognition may be as follows: the video conference terminal obtains the portrait information corresponding to the sound source location; it performs image recognition on the portrait information to obtain face information and/or body attribute information; it determines the speaker identity information according to the face information and/or body attribute information; and it establishes a correspondence between the speaking time information and the speaker identity information to obtain the identity recognition result.
  • The speaker identity information may be user identity information (such as the speaker's employee number in the company, or an ID number or phone number registered in the company's internal database) or user body attribute identification information (for example, that in the current meeting the user wears a white top and black trousers, or that there is a visible mark on the user's arm).
  • The speaking time information may be a period of time or two time points. For example, the speaking time information is the 30 seconds from 00:00:15 to 00:00:45 after the current conference starts; alternatively, it only includes the two time points "00:00:15" and "00:00:45". As above, times in the form "00:00:00" follow the rule "hours:minutes:seconds", so "00:00:15" denotes 15 seconds after the start of the meeting.
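  • The identity recognition result can be pictured as a list of correspondences between speaker identity information and speaking time information, as in this minimal sketch; the field names are illustrative, not taken from the patent.

```python
# An illustrative data structure for the identity recognition result: each
# entry maps speaker identity information to a speaking time span.
from dataclasses import dataclass

@dataclass
class IdentityEntry:
    speaker: str      # user identity or body-attribute identification
    start: str        # "HH:MM:SS" offset from the conference start
    end: str

identification_result = [
    IdentityEntry("user01", "00:00:15", "00:00:45"),
    IdentityEntry("white top, black trousers", "00:00:50", "00:01:10"),
]
for entry in identification_result:
    print(entry.speaker, entry.start, entry.end)
```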
  • the video conference terminal may also be used as the conference record processing device to implement the method of the first aspect, as follows:
  • The video conference terminal acquires the audio data of the current conference site and detects segmented audio data in it according to the sound source location and human voice detection; it then acquires the voiceprint feature of the segmented audio data and determines the speaker corresponding to the segmented audio data from the voiceprint feature and the identity recognition result.
  • In the embodiments of the present application, the video conference terminal acquires an identity recognition result indicating the correspondence between speaker identity information and speaking time information, and then combines that result with voiceprint features to further identify the audio data. In this way, accurate classification of voice data can be achieved without pre-registering users' voiceprint features.
  • the operation of the video conference terminal to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result may be as follows:
  • If the identity recognition result indicates unique speaker identity information, the video conference terminal determines the speaker corresponding to the first segmented audio data directly from that information. That is, if the identity recognition result obtained for the first segmented audio data indicates that its only speaker is user01, with corresponding voiceprint feature VP01, the video conference terminal determines that the speaker of the first segmented audio data is user01.
  • Alternatively, the video conference terminal compares the voiceprint feature of the first segmented audio data with that of second segmented audio data, where the second segmented audio data is obtained by the video conference terminal through voice segmentation of the audio data and corresponds to unique speaker identity information. If the voiceprint feature of the first segmented audio data is consistent with that of the second segmented audio data, the video conference terminal determines the speaker corresponding to the first segmented audio data from the speaker identity information corresponding to the second segmented audio data.
  • the speaker identity information of the second segment of audio data has been determined to be user02, the corresponding voiceprint feature is VP02, the voiceprint feature of the first segment of audio data is VP02, and the corresponding speaker identity information includes user03 and user02;
  • The above shows that the voiceprint features of the first and second segmented audio data are both VP02, and the second segmented audio data establishes that the speaker corresponding to VP02 is user02; it can therefore be determined that the speaker of the first segmented audio data is also user02.
  • In another case, the video conference terminal determines the speaker corresponding to the first segmented audio data from the speaker identity information and voiceprint feature of the first segmented audio data together with those of second segmented audio data, where the second segmented audio data is obtained by the video conference terminal through voice segmentation of the audio data and corresponds to at least two pieces of speaker identity information. That is, the video conference terminal can comprehensively determine the speaker of each segmented audio data from the voiceprint features of multiple segmented audio data and their corresponding speaker identity information.
  • the speaker identity information of the second segment of audio data has been determined as user02 and user03, the corresponding voiceprint feature is VP02, the voiceprint feature of the first segment of audio data is VP03, and the corresponding speaker identity information includes user03 and user02 , the voiceprint feature of the third segment of audio data is VP03, and the corresponding speaker identity information is user03 and user01.
  • The voiceprint features of the first and third segmented audio data are both VP03, and the speaker identity sets corresponding to these two segments have a unique intersection, namely user03, who is therefore determined as their speaker.
  • In yet another case, the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site. The long-term voiceprint feature record includes the historical voiceprint feature record of the first conference site, which indicates the correspondence between voiceprint features, speakers, and channel identifiers.
  • When the video conference terminal determines the speaker corresponding to the first segmented audio data according to its voiceprint feature, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the video conference terminal compares the voiceprint feature of a first speaker in the current conference of the first conference site with that speaker's voiceprint feature in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference. If the two voiceprint features are consistent, the video conference terminal determines that the long-term voiceprint feature record is usable, and it then determines the speaker corresponding to the first segmented audio data by comparing the voiceprint feature of the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the conference site.
  • the combination of short-term processing and long-term processing can improve the accuracy of audio data classification as much as possible.
  • Further, the video conference terminal compares the voiceprint feature of the first speaker in the current conference of the first conference site with that speaker's voiceprint feature in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference. If the two voiceprint features are inconsistent, the video conference terminal registers the voiceprint feature, the channel identifier, and the speaker corresponding to the voiceprint feature from the current conference of the first conference site, and updates the long-term voiceprint feature record.
  • In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, so that the long-term voiceprint feature record remains usable.
  • The corresponding voiceprint features and speakers are registered dynamically, so registration is no longer limited to voiceprint features tied to fixed channel identifiers, which effectively enables accurate classification of the audio data.
  • Optionally, the video conference terminal can acquire voiceprint identification information for the voiceprint feature of the first segmented audio data, and then establish a correspondence between that voiceprint identification information and the speaker corresponding to the first segmented audio data. A one-to-one correspondence between voiceprint features and speakers facilitates subsequent classification and processing of audio data.
  • the present application provides a conference record processing device, which has a function of implementing the behavior of the conference record processing device in the first aspect.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the apparatus includes units or modules for performing the steps of the above first aspect.
  • the device includes: an acquisition module for acquiring audio data of the first venue, sound source location information corresponding to the audio data, and an identity recognition result, where the identity recognition result is used to indicate the speaker identity information obtained by the portrait recognition method The corresponding relationship with the speaking time information of the speaker; the processing module is used to perform voice segmentation on the audio data to obtain the first segmented audio data of the audio data; according to the voiceprint feature of the first segmented audio data and the identification result to determine the speaker corresponding to the first segment of audio data.
  • it also includes a storage module for storing necessary program instructions and data of the conference record processing device.
  • the apparatus includes: a processor and a transceiver, where the processor is configured to support the conference record processing apparatus to perform corresponding functions in the method provided in the first aspect.
  • The transceiver is used to support communication between the conference record processing apparatus and other devices in the conference system, for example receiving the audio data and the identity recognition result involved in the above method as sent by the video conference terminal.
  • the apparatus may further include a memory, which is used for coupling with the processor, and which stores necessary program instructions and data of the conference record processing apparatus.
  • When the device is a chip in a conference record processing device, the chip includes a processing module and a transceiver module.
  • the transceiver module may be, for example, an input/output interface, pin or circuit on the chip, and transmits the received audio data and identification result of the first conference venue to other chips or modules coupled to the chip.
  • The processing module can be, for example, a processor, where the processor is configured to perform voice segmentation on the audio data to obtain the first segmented audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
  • the processing module can execute the computer-executed instructions stored in the storage unit, so as to support the conference record processing apparatus to execute the method provided in the first aspect.
  • the storage unit can be a storage unit in the chip, such as a register, a cache, etc.
  • The storage unit can also be a storage unit located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
  • the apparatus includes: a processor, a radio frequency circuit and an antenna.
  • The processor is used to control the functions of each circuit part and to determine the speaker corresponding to the first segmented audio data; the result then undergoes analog conversion, filtering, amplification, and up-conversion in the radio frequency circuit before being sent through the antenna to the automatic speech recognition server. Optionally, the device further includes a memory that stores necessary program instructions and data of the conference record processing device.
  • The device includes a communication interface and a logic circuit, where the communication interface is used to acquire an audio code stream and an identity recognition result of the first conference site. The audio code stream includes audio data and additional field information, the additional field information includes the sound source location information corresponding to the audio data, and the identity recognition result indicates the correspondence between speaker identity information obtained by a portrait recognition method and the speaker's speaking time information. The logic circuit is used to perform voice segmentation on the audio data to obtain the first segmented audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
  • The processor mentioned in any of the above may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling program execution of the audio data processing methods of the above aspects.
  • an embodiment of the present application provides a video conference device, the device having a function of implementing the behavior of the video conference terminal in the second aspect.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the apparatus includes units or modules for performing the steps of the second aspect above.
  • the device includes: a processing module configured to perform sound source localization on the audio data of the first conference venue to obtain sound source orientation information corresponding to the audio data; obtain an identity recognition result according to the sound source orientation and the portrait recognition method, The identification result is used to indicate the correspondence between speaker identification information and speaking time information;
  • the sending module is used for sending the identification result, the audio data and the sound source position information corresponding to the audio data to the conference record processing device.
  • it also includes a storage module for storing necessary program instructions and data of the video conference device.
  • the apparatus includes: a processor and a transceiver, where the processor is configured to support the video conference apparatus to perform corresponding functions in the method provided in the second aspect.
  • The transceiver is used to support communication between the video conference device and the other devices in the conference system, and to send the audio code stream and the identity recognition result to the conference record processing device.
  • the apparatus may further include a memory, which is used for coupling with the processor, and which stores necessary program instructions and data of the video conference apparatus.
  • When the device is a chip in a video conference device, the chip includes a processing module and a transceiver module. The processing module is configured to perform sound source localization on the audio data to obtain the sound source location information corresponding to the audio data, and to obtain the identity recognition result according to the sound source location and the portrait recognition method, where the identity recognition result indicates the correspondence between speaker identity information and speaking time information.
  • the transceiver module may be, for example, an input/output interface, a pin or a circuit on the chip, and the configuration information is transmitted to other chips or modules coupled to the chip.
  • the processing module can execute the computer-executed instructions stored in the storage unit, so as to support the video conference device to perform the method provided in the second aspect.
  • the storage unit can be a storage unit in the chip, such as a register, a cache, etc.
  • The storage unit can also be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, and the like.
  • the apparatus includes: a processor, a baseband circuit, a radio frequency circuit and an antenna.
  • the processor is used to control the functions of each circuit, and the baseband circuit is used to generate data packets containing audio code streams and identification results.
  • the device further includes a memory, which stores necessary program instructions and data of the video conference device.
  • the apparatus includes: a communication interface and a logic circuit.
  • The logic circuit is used to perform sound source localization on the audio data of the first conference site to obtain the sound source location information corresponding to the audio data, and to obtain the identity recognition result according to the sound source location and the portrait recognition method, where the identity recognition result indicates the correspondence between speaker identity information and speaking time information. The communication interface is used to send the identity recognition result to the conference record processing device and to send the audio data to the multipoint control unit.
  • the processor mentioned in any of the above may be a CPU, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of programs of the audio data processing methods in the above aspects.
  • an embodiment of the present application provides a computer-readable storage medium, where computer instructions are stored in the computer storage medium, and the computer instructions are used to execute the method in any possible implementation manner of any one of the foregoing aspects.
  • the embodiments of the present application provide a computer program product including instructions, which, when executed on a computer, cause the computer to execute the method in any one of the foregoing aspects.
  • The present application provides a chip system. The chip system includes a processor for supporting a conference record processing device or a video conference device to implement the functions involved in the above aspects, for example generating or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory for storing necessary program instructions and data of the conference record processing device or the video conference device, so as to realize the function of any one of the above aspects.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • an embodiment of the present application provides a conference system, which includes the conference record processing device and the video conference device according to the above aspect.
  • FIG. 1A is a schematic diagram of an embodiment of a conference system architecture in an embodiment of the present application.
  • FIG. 1B is a schematic diagram of another embodiment of a conference system architecture in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an embodiment of a method for processing audio data in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a scene in which a video conference terminal collects image information in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another embodiment of a method for processing audio data in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another embodiment of a method for processing audio data in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another embodiment of a method for processing audio data in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an embodiment of a conference record processing apparatus in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of another embodiment of a conference record processing apparatus in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an embodiment of a video conference terminal in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another embodiment of a video conference terminal in an embodiment of the present application.
  • The naming or numbering of steps in this application does not mean that the steps must be executed in the temporal or logical order indicated by that naming or numbering; the execution order of named or numbered process steps may be changed according to the technical purpose, as long as the same or similar technical effects can be achieved.
  • The division of units in this application is a logical division, and other division methods may be used in practical applications; for example, multiple units may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, and indirect coupling or communication connection between units may be electrical or take other similar forms, which is not restricted in this application.
  • Units or sub-units described as separate components may or may not be physically separated, may or may not be physical units, and may be distributed over multiple circuit units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this application.
  • the technical solutions of the embodiments of the present invention can be applied to local conference or remote conference scenarios.
  • the specific system architecture of the embodiment of the present invention may include a plurality of video conference terminals, a multipoint control unit, a recording server, and an automatic speech recognition (Automatic Speech Recognition, ASR) server.
  • As shown in FIG. 1A, each of the multiple video conference terminals collects conference audio data and image information of the participants, and identifies the speaker among the participants through the image information to obtain an identity recognition result. The video conference terminal then sends the audio data and the identity recognition result to the recording and broadcasting server.
  • the recording and broadcasting server classifies the audio data according to the identification result and the direction of the sound source and sends it to the ASR server.
  • the ASR server outputs conference records through the voice transcription function.
  • In another system architecture, the function of the recording and broadcasting server is integrated into the multipoint control unit (equivalent to the conference record processing module in FIG. 1B).
  • Each of the multiple video conference terminals (video conference terminal 01 to video conference terminal 03 shown in FIG. 1B) collects conference audio data and image information of the participants, and identifies the speaker among the participants through the image information to obtain an identity recognition result.
  • the video conference terminal sends the audio data and the identification result to the multipoint control unit, and the conference record processing module in the multipoint control unit classifies the audio data and sends it to the ASR server. Finally, the ASR server outputs the meeting record through the voice transcription function.
  • An embodiment of the audio data processing method in the embodiment of the present application includes:
  • the video conference terminal collects audio data.
  • a conference may include multiple sites, each site corresponds to at least one video conference terminal, and each site has at least one participant.
  • For ease of description, one video conference terminal in a conference site is taken as an example.
  • the video conference terminal uses the microphone to pick up the audio data of each speaker in real time.
  • the video conference terminal acquires the sound source bearing of the audio data.
  • the video conference terminal can acquire the sound source azimuth corresponding to the audio data, and establish a corresponding relationship between the audio data and the sound source azimuth.
  • For example, the sound source of the audio data collected by the video conference terminal from 00:00:15 to 00:00:30 after the conference starts is located about 30 degrees east of the video conference terminal.
  • the sound source localization is allowed to have errors, so the sound source bearing can be a range value. For example, if the sound source is located at 30 degrees east, the specific range may be 28 degrees east to 32 degrees east.
  • the video conference terminal can obtain the sound source orientation of the audio data in the following possible implementation manners:
  • an array microphone is deployed on the video conference terminal, and the sound source azimuth of the audio data is determined through sound beam information picked up by the array microphone.
  • Alternatively, the venue additionally deploys a device or system dedicated to sound source localization, which serves as a calibration reference point to determine the sound source location of the audio data; the sound source location is then sent to the video conference terminal.
  • the sound source localization may adopt the above solution or any other possible implementation manner, as long as the sound source orientation of the audio data can be obtained, and the specific solution is not limited here.
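  • For the array-microphone implementation, one classical approach (not specified by the patent) estimates the direction of arrival from the inter-microphone time delay; the two-microphone sketch below only illustrates the principle, and real systems use more microphones and robust correlation.

```python
# Direction-of-arrival estimate from the time delay (TDOA) between two
# microphones, via cross-correlation. Purely illustrative.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def doa_two_mics(sig_left, sig_right, mic_distance, sample_rate):
    """Return the bearing (degrees from broadside) of the dominant source."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)     # delay in samples
    delay = lag / sample_rate                        # delay in seconds
    # clamp to the physically possible range before taking arcsin
    ratio = np.clip(delay * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(ratio))

# synthetic check: the same chirp arriving 5 samples later at the right mic
fs = 16000
t = np.arange(fs // 10) / fs
chirp = np.sin(2 * np.pi * 500 * t * (1 + t))
left = np.concatenate([chirp, np.zeros(5)])
right = np.concatenate([np.zeros(5), chirp])
print(round(doa_two_mics(left, right, 0.2, fs), 1))
```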
  • the video conference terminal performs voice segmentation on the audio data through human voice detection to obtain segmented audio data.
  • the video conference terminal performs voice segmentation on the received audio data according to human voice detection to obtain different segmented audio data.
  • Specifically, the video conference terminal can distinguish the previous voice segment from the next one by the interval of a silent segment, or determine through an algorithm whether a segment is human voice or non-human voice and split the surrounding human-voice segments at the non-human-voice portions. For example, the video conference terminal collects audio data from 00:00:15 to 00:00:30 after the conference starts, detects silence from 00:00:30 to 00:00:32, collects audio data from 00:00:32 to 00:00:45, and detects silence from 00:00:45 to 00:00:50.
  • Then the video conference terminal can treat the audio data collected from 00:00:15 to 00:00:30 as one segmented audio data, and the audio data collected from 00:00:32 to 00:00:45 as the next segmented audio data.
  • It can be understood that times in the form "00:00:00" follow the rule "hours:minutes:seconds", so the time point "00:00:15" denotes 15 seconds after the start of the meeting.
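  • A minimal energy-threshold sketch of this silence-based splitting follows; the frame size and threshold are illustrative, and a production system would use a real voice activity detector.

```python
# Split audio at silent intervals: frames whose energy stays below a threshold
# for long enough end the current segment.
import numpy as np

def split_on_silence(samples, sample_rate, frame_ms=20, energy_thresh=1e-3,
                     min_gap_frames=5):
    frame = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame
    voiced = [np.mean(samples[i*frame:(i+1)*frame] ** 2) > energy_thresh
              for i in range(n_frames)]
    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:      # long enough silence ends a segment
                segments.append((start * frame, (i - gap + 1) * frame))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame, n_frames * frame))
    return segments  # (start_sample, end_sample) pairs

fs = 16000
speech = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
audio = np.concatenate([speech, np.zeros(fs // 2), speech])
print(split_on_silence(audio, fs))  # two segments separated by the silence
```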
  • the video conference terminal collects image information within the azimuth range of the sound source according to the azimuth of the sound source.
  • the video conference terminal determines an image information collection area of the video conference terminal according to the sound source azimuth corresponding to the audio data obtained in step 202, and then collects image information in the image information collection area.
  • the video conference terminal may collect the image information in the form of capturing a photo, or may capture a picture frame corresponding to the audio data in the video data as the image information, and the specific form is not limited here.
  • the camera of the video conference terminal can be fixed or can be deployed to be rotatable, and the specific situation is not limited here.
  • the video conference terminal acquires images within the fixed shooting range, and then calculates and extracts image information corresponding to the audio data according to the sound source orientation.
  • the video conference terminal can adjust the shooting range of the camera according to the direction of the sound source, so as to obtain image information corresponding to the audio data.
  • the video conference terminal is located above the conference screen, and the participants are located on both sides of the conference table.
  • The video conference terminal can obtain image information within a certain angle range according to the sound source location. When the image information of speaker 1 is collected according to speaker 1's sound source localization, the image information area contains only speaker 1; when the image information of speaker 2 is collected according to speaker 2's sound source localization, the image information area includes speaker 1 and another participant.
  • the video conference terminal performs portrait recognition on the image information to obtain an identity recognition result.
  • the video conference terminal performs face recognition and human body attribute recognition on the image information to obtain an identity recognition result, and the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information.
  • The speaker corresponding to facial features is obtained through face recognition, while human body attribute recognition identifies the user's overall clothing or physical features to obtain the speaker corresponding to those physical features or to the appearance of the user's clothing.
  • The speaker identity information may be user identity information (such as the speaker's employee number in the company, or an ID number or phone number registered in the company's internal database) or user body attribute identification information (for example, that in the current meeting the user wears a white top and black trousers, or that the user has a visible mark on the arm).
  • the speaking time information may be a period of time or two time points.
  • For example, the speaking time information is the 30 seconds from 00:00:15 to 00:00:45 after the current conference starts; alternatively, it only includes the two time points "00:00:15" and "00:00:45".
  • The specific operation by which the video conference terminal acquires the speaker's identity information may be as follows: if the image information contains a clear, identifiable face, the video conference terminal uses face recognition technology to identify the face in the image information and compares it with a stored face database to determine the user identity information corresponding to that face; if the face information in the image fails to meet the recognition requirements (for example, the facial features are insufficient for face recognition, or no face appears in the image), the video conference terminal can perform human body attribute recognition to obtain body attribute information and determine the user's body attribute identification information from it.
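  • This fallback can be pictured as follows; every function here is a stand-in for a real face or body-attribute model, and the database layout is an assumption for illustration.

```python
# Try face recognition against a stored face database first; fall back to
# human body attribute recognition when no usable face is found.
from typing import Optional

FACE_DB = {"face-emb-001": "user01"}   # face embedding id -> user identity

def detect_face(image) -> Optional[str]:
    """Stand-in detector: return a face embedding id, or None if no clear face."""
    return image.get("face")

def match_face(face_emb: str) -> Optional[str]:
    return FACE_DB.get(face_emb)

def body_attributes(image) -> str:
    """Stand-in body-attribute recognizer (clothing, visible marks, ...)."""
    return image.get("clothes", "unknown attributes")

def identify_speaker(image) -> str:
    face = detect_face(image)
    if face is not None:
        user = match_face(face)
        if user is not None:
            return user                 # user identity information
    return body_attributes(image)       # user body attribute identification

print(identify_speaker({"face": "face-emb-001"}))                 # -> user01
print(identify_speaker({"clothes": "white top, black trousers"}))
```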
  • the video conference terminal packages the audio data and the corresponding sound source azimuth into an audio code stream and sends it to the multipoint control unit, and sends the identification result to the recording and broadcasting server.
  • the video conference terminal packages the audio data and the sound source azimuth corresponding to the audio data into an audio code stream and sends it to the multipoint control unit.
  • the video conference terminal encodes the audio data into an audio code stream, and then adds additional field information to the corresponding audio code stream, and uses the additional field information to indicate sound source location information corresponding to the audio data.
  • the identity recognition result obtained by the video conferencing terminal itself through the portrait recognition can be directly sent to the recording server.
  • the multipoint control unit sends the audio stream sent by the video conference terminal to the recording and broadcasting server.
  • After receiving the audio code stream sent by the video conference terminal, the multipoint control unit determines the conference site to which the video conference terminal belongs according to the identifier assigned to that terminal, adds the conference site identifier to the audio code stream, and sends the audio code stream to the recording and broadcasting server.
  • the multipoint control unit may filter the audio data of each conference site, and then select the audio data of one or more conference sites to send to the recording and broadcasting server.
  • For example, the multipoint control unit can compare the volume levels of the audio data of each conference site and select, for forwarding, audio data whose volume is greater than a preset threshold; alternatively, it can determine through an algorithm which audio data has a voice duration exceeding a preset threshold and forward that. The specific filtering conditions are not limited here. This reduces the amount of processing and thus speeds it up.
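  • A minimal sketch of the volume-based variant of this selection follows; the stream structure and threshold value are illustrative assumptions.

```python
# Forward only the conference sites whose audio volume (RMS) exceeds a
# preset threshold.
import numpy as np

def select_sites(site_streams, volume_thresh=0.01):
    """site_streams: dict of site id -> PCM samples; return sites to forward."""
    selected = []
    for site_id, samples in site_streams.items():
        rms = float(np.sqrt(np.mean(np.square(samples))))
        if rms > volume_thresh:        # loud enough to be worth forwarding
            selected.append(site_id)
    return selected

streams = {
    "site-01": 0.2 * np.random.randn(16000),    # active speech
    "site-02": 0.001 * np.random.randn(16000),  # near silence
}
print(select_sites(streams))  # likely ["site-01"]
```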
  • the recording and broadcasting server decodes the audio stream to obtain audio data, and performs voice segmentation on the audio data to obtain the segmented audio data.
  • Specifically, the recording and broadcasting server can decode the audio code stream to obtain the audio data and the conference site identifier, and then store the audio data according to the site identifier. At the same time, the recording and broadcasting server performs voice segmentation on the audio data according to the sound source location of the audio data and human voice detection technology, thereby obtaining segmented audio data. It can be understood that, in this embodiment, by segmenting according to both sound source location and human voice detection, the recording and broadcasting server can classify the audio data reported by the video conference terminal at a finer granularity.
  • For example, if the video conference terminal detects human voice from 00:00:15 to 00:00:30, it treats the audio data collected in that interval as a single segmented audio data. If, however, a speaker at sound source location 1 actually speaks from 00:00:15 to 00:00:25 and another speaker at sound source location 2 speaks from 00:00:25 to 00:00:30, then when the recording and broadcasting server segments again according to sound source location and human voice detection, the audio data can be divided into two segments.
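  • This direction-based re-segmentation could look like the sketch below, which cuts a voiced interval wherever the per-frame azimuth jumps; the azimuth track and jump threshold are assumed inputs, not the patent's algorithm.

```python
# Cut a voiced segment wherever the sound source azimuth jumps, so each
# resulting segment comes from a single direction.
def split_by_azimuth(frames, max_jump_deg=10.0):
    """frames: list of (time_s, azimuth_deg); return [(start_s, end_s), ...]."""
    segments, start = [], frames[0][0]
    for (t_prev, az_prev), (t, az) in zip(frames, frames[1:]):
        if abs(az - az_prev) > max_jump_deg:   # speaker change by direction
            segments.append((start, t))
            start = t
    segments.append((start, frames[-1][0]))
    return segments

# azimuth 1 (~30 deg) for 15..25 s, azimuth 2 (~-40 deg) for 25..30 s
track = [(s, 30.0) for s in range(15, 25)] + [(s, -40.0) for s in range(25, 31)]
print(split_by_azimuth(track))  # -> [(15, 25), (25, 30)]
```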
  • the recording and broadcasting server extracts the voiceprint feature of the segmented audio data.
• the recording and broadcasting server extracts voiceprint features from the segmented audio data using techniques such as voiceprint clustering, and marks each segment with a voiceprint identifier.
• in an exemplary solution, suppose the recording and broadcasting server divides the audio data into 10 segments and the duration of 8 of those segments satisfies the minimum length required for voiceprint recognition; the recording and broadcasting server then extracts voiceprint features from each of the eight segments and marks them with voiceprint identifiers (voiceprint 1 to voiceprint 8), as sketched below.
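A sketch of this labeling step, assuming a voiceprint embedding extractor `extract_embedding` supplied elsewhere and a 10-second minimum taken from the background discussion; the segment format and names are hypothetical:

```python
MIN_VOICEPRINT_SECONDS = 10.0  # assumed minimum length for reliable recognition

def label_voiceprints(segments, extract_embedding):
    """Assign voiceprint identifiers to segments long enough to be recognized.

    segments: list of dicts like {"start": 12.0, "end": 27.5, "samples": ...};
    extract_embedding: callable mapping raw samples to an embedding vector
    (e.g. the front end of a voiceprint clustering system).
    """
    voiceprints = {}
    next_id = 1
    for seg in segments:
        if seg["end"] - seg["start"] < MIN_VOICEPRINT_SECONDS:
            seg["voiceprint_id"] = None  # too short: no voiceprint extracted
            continue
        embedding = extract_embedding(seg["samples"])
        seg["voiceprint_id"] = f"voiceprint {next_id}"  # voiceprint 1 ... 8
        voiceprints[seg["voiceprint_id"]] = embedding
        next_id += 1
    return voiceprints
```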
  • the recording and broadcasting server determines the speaker identity of the segmented audio data according to the identification result and the voiceprint feature of the segmented audio data.
  • the recording and broadcasting server integrates and analyzes the received identification result and the voiceprint feature of the segmented audio data to determine the speaker identity of the segmented audio data.
• if the identification result indicates that the first segmented audio data corresponds to unique speaker identity information, the conference record processing device determines the speaker corresponding to the first segmented audio data according to that speaker identity information.
• if the identification result indicates that the first segmented audio data corresponds to at least two speaker identity information, the conference record processing device compares the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to unique speaker identity information; if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
• alternatively, if the identification result indicates that the first segmented audio data corresponds to at least two speaker identity information, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to at least two speaker identity information. That is, the conference record processing device can jointly determine the speaker corresponding to each segment from the voiceprint features of multiple segments and their corresponding speaker identity information, as sketched after this paragraph.
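The candidate-narrowing logic described above can be sketched as follows; the segment structure is an assumption, and the three example segments mirror the user02/user03 example discussed later in this document:

```python
def resolve_speakers(segments):
    """segments: list of dicts {"voiceprint": str, "candidates": set of IDs},
    where candidates come from the portrait-based identification result.
    Returns a mapping from segment index to the resolved speaker (or None)."""
    vp_to_speaker = {}
    by_vp = {}
    for seg in segments:
        by_vp.setdefault(seg["voiceprint"], []).append(seg["candidates"])
    # Pass 1: a segment whose identification result names a unique speaker
    # fixes the speaker of its voiceprint directly.
    for seg in segments:
        if len(seg["candidates"]) == 1:
            vp_to_speaker[seg["voiceprint"]] = next(iter(seg["candidates"]))
    # Pass 2: segments sharing a voiceprint must share a speaker, so the
    # intersection of their candidate sets may already be unique.
    for vp, cand_sets in by_vp.items():
        if vp not in vp_to_speaker:
            common = set.intersection(*cand_sets)
            if len(common) == 1:
                vp_to_speaker[vp] = next(iter(common))
    # Pass 3: eliminate candidates already bound to a different voiceprint,
    # repeating until no further voiceprint can be resolved.
    speaker_to_vp = {s: v for v, s in vp_to_speaker.items()}
    changed = True
    while changed:
        changed = False
        for vp, cand_sets in by_vp.items():
            if vp in vp_to_speaker:
                continue
            remaining = {s for s in set.intersection(*cand_sets)
                         if speaker_to_vp.get(s, vp) == vp}
            if len(remaining) == 1:
                speaker = next(iter(remaining))
                vp_to_speaker[vp] = speaker
                speaker_to_vp[speaker] = vp
                changed = True
    return {i: vp_to_speaker.get(seg["voiceprint"])
            for i, seg in enumerate(segments)}

# VP03 appears with candidate sets {user03, user02} and {user03, user01},
# whose intersection is user03; VP02 then loses user03 by elimination and
# resolves to user02.
segments = [
    {"voiceprint": "VP02", "candidates": {"user02", "user03"}},
    {"voiceprint": "VP03", "candidates": {"user03", "user02"}},
    {"voiceprint": "VP03", "candidates": {"user03", "user01"}},
]
print(resolve_speakers(segments))  # -> {0: 'user02', 1: 'user03', 2: 'user03'}
```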
  • both the first segmented audio data and the second segmented audio data are obtained by the conference recording processing apparatus through voice segmentation.
• for details, refer to the record of a current meeting shown in Table 1:
• the recording and broadcasting server can integrate the voiceprint feature of a segment with the voiceprint features of segments whose speakers have already been identified and with the identification results, and obtain the speaker corresponding to the segment by analysis.
• for example, the identification result shows that the user identity ID is User03, the user body attribute ID is body04, and the voiceprint feature is VP04.
• this can occur when the speaker indicated by body04 is looking down reading a manuscript while User03 is facing the camera of the video conference terminal, so body04 and User03 cannot be distinguished when the image information is collected.
• since the voiceprint feature corresponding to User03 is VP03, the speaker of the content shown in line 4 can be determined to be body04 rather than User03, and the voiceprint feature corresponding to body04 is VP04; the unique speaker can thus be determined.
• in other cases, the voiceprint features alone cannot distinguish the candidates, so the speaker cannot be uniquely determined from a single line.
• for example, the voiceprint features of lines 10 and 11 are both VP07, but the corresponding speaker identity sets have a unique intersection, User07.
• this can occur when the speaker indicated by User07 spoke during both time periods indicated by lines 10 and 11: during the period of line 10, User08 was also facing the video conference terminal and could not be distinguished from User07 when the image information was collected, and during the period of line 11, User06 was facing the camera of the video conference terminal and likewise could not be distinguished from User07. Therefore, combining the contents of lines 10 and 11, it can be inferred that the speaker corresponding to voiceprint feature VP07 is User07.
• the recording server can also compare the voiceprint features and identification results of the current conference with the long-term voiceprint feature record of the conference site for further judgment. That is, the recording server compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker whose correspondence with segmented audio data in the current conference of the first conference site has already been determined; if the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with that speaker's voiceprint feature in the long-term voiceprint feature record, the recording server compares the voiceprint feature corresponding to the first segmented audio data with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
• refer to the exemplary long-term voiceprint feature record shown in Table 2:
• the recording server can compare the most recent voiceprint features of User01 in conference room Site01, for example the voiceprint features of User01 in conferences Conf01 and Conf02 held in Site01. If the comparison shows that the difference between User01's voiceprint features in the two conferences meets the threshold requirement, the recording server can determine that the channels of the two conferences in conference room Site01 are consistent, and therefore that the long-term voiceprint feature record can be used for reference.
• suppose that for one segment the candidate speakers are User05 and User08, while for another the candidate speakers are User05, User06, and User07; once the speaker of voiceprint feature VP05 is determined, it can be determined that the speaker corresponding to voiceprint feature VP06 in Table 1 is User06.
• the recording server also compares the voiceprint features of User01 in Conf01 and Conf03. If the comparison shows that the difference between User01's voiceprint features in the two conferences does not meet the threshold requirement, the recording server can register the voiceprint features and speaker information of Conf03 and update the long-term voiceprint feature record; the specific form can be as shown in rows 8 to 10 of Table 2. It can be understood that a change of the channel corresponding to a conference may be a change of the conference room or a change of the devices involved in the conference. A compare-and-update sketch follows.
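A minimal compare-and-update sketch for the long-term record; the cosine similarity measure, the 0.75 threshold, and the record layout are assumptions, since the patent only requires that the difference "meet a threshold":

```python
import math

SIMILARITY_THRESHOLD = 0.75  # assumed; the patent does not fix a value

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def check_and_update_long_term_record(record, site_id, conf_id, speaker, embedding):
    """record: {(site_id, speaker): {"channel_id": ..., "embedding": [...]}}.

    If the speaker's current-conference voiceprint matches the stored one,
    the channel is consistent and the long-term record can be used for
    reference; otherwise the new voiceprint is registered under the current
    conference's channel identifier and the record is updated."""
    key = (site_id, speaker)
    entry = record.get(key)
    if entry is not None and cosine(entry["embedding"], embedding) >= SIMILARITY_THRESHOLD:
        return True   # channel consistent: long-term record usable
    # Channel changed (different room or devices): register and update.
    record[key] = {"channel_id": conf_id, "embedding": embedding}
    return False
```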
• the conference record processing device can perform the long-term analysis (that is, the analysis illustrated with Table 2) after the short-term analysis (that is, the analysis illustrated with Table 1), or perform the long-term analysis first and the short-term analysis afterwards; as long as the audio data can be distinguished in the end, the specific order of operations is not limited here.
  • the recording and broadcasting server sends the audio data and the classification result of the audio data to the ASR server.
• after the recording server completes the matching of the audio data with the speakers, it sends the classification result and the audio data to the ASR server.
  • the ASR server outputs the audio data as text.
  • the video conference terminal collects corresponding image information according to sound source localization, and performs preliminary portrait recognition on the image information to obtain an identification result.
  • the identification result is combined with the voiceprint feature to further identify the audio data, so that accurate classification of the voice data can be achieved without pre-registering the user's voiceprint feature.
  • An embodiment of the audio data processing method in the embodiment of the present application includes:
  • 401-405 are the same as 201-205 in the foregoing embodiment, and are not repeated here.
  • the video conference terminal sends the audio code stream and the identification result to the multipoint control unit.
• the difference is that in this step the video conference terminal also sends the identification result to the multipoint control unit.
  • the multipoint control unit decodes the audio code stream to obtain audio data, and performs voice segmentation on the audio data to obtain the segmented audio data.
• after acquiring the audio code stream, the multipoint control unit determines the conference site to which the video conference terminal belongs according to the conference identifier allocated to the video conference terminal, decodes the audio code stream to obtain the audio data, and then stores the audio data according to the site identifier. At the same time, the multipoint control unit performs voice segmentation on the audio data according to the sound source orientation of the audio data and human voice detection technology.
• for 408-410, refer to 209-211; the difference is that 408-410 are implemented by the multipoint control unit, while 209-211 are implemented by the recording and broadcasting server.
  • An embodiment of the audio data processing method in the embodiment of the present application includes:
• 501-502 are the same as 201-202 in the above-mentioned embodiment, and are not repeated here.
  • the video conference terminal performs voice segmentation on the audio data through human voice detection and sound source localization to obtain segmented audio data.
  • 504-505 are the same as 204-205 in the foregoing embodiment, and are not repeated here.
• 506-508 are similar to 209-211; the difference is that steps 506-508 are executed by the video conference terminal, while steps 209-211 are executed by the recording server.
  • the ASR server outputs the audio data as text.
• the video conference terminal collects the corresponding image information according to sound source localization and performs preliminary portrait recognition on the image information to obtain an identity recognition result; the video conference terminal then combines the identity recognition result with the voiceprint feature to further identify the audio data, so that accurate classification of the voice data can be achieved without pre-registering users' voiceprint features.
  • An embodiment of the audio data processing method in the embodiment of the present application includes:
• the conference record processing device obtains the audio data of the first conference site, the sound source position information corresponding to the audio data, and the identification result, where the identification result is used to indicate the correspondence between the speaker identity information obtained by the portrait recognition method and the speaker's speaking time information.
  • the conference record processing apparatus may be the recording server in the method embodiment shown in FIG. 2, the multipoint control unit in the method embodiment shown in FIG. 4, or the video conference terminal in the method embodiment shown in FIG. 5.
• when the conference record processing device is the recording server in the method embodiment shown in FIG. 2, the conference record processing device receives the audio data sent by the multipoint control unit and the sound source orientation information corresponding to the audio data.
  • the audio data and the sound source position information corresponding to the audio data may be packaged to generate an audio code stream and additional field information, wherein the additional field information includes the sound source position information corresponding to the audio data.
  • the video conference terminal encodes the audio data into an audio code stream, and then adds additional field information to the corresponding audio code stream, and uses the additional field information to indicate sound source location information corresponding to the audio data.
• the video conference terminal sends the audio code stream to the multipoint control unit; after receiving the audio code stream, the multipoint control unit determines the conference site to which the video conference terminal belongs according to the conference ID assigned to the video conference terminal, adds the site identifier to the audio code stream, and sends the audio code stream to the recording and broadcasting server.
  • the multipoint control unit may filter the audio data of each conference site, and then select the audio data of one or more conference sites to send to the recording and broadcasting server.
• the multipoint control unit can compare the volume levels of the audio data of each conference site and select, for forwarding, audio data whose volume is greater than a preset threshold; or it can use an algorithm to select, for forwarding, audio data whose speech duration exceeds a preset threshold. The specific filtering conditions are not limited here. Filtering reduces the amount of data to be processed and thus speeds up processing.
  • the identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition, and is directly sent by the video conference terminal to the recording server.
• when the conference record processing device is the multipoint control unit in the method embodiment shown in FIG. 4, the conference record processing device receives the audio data sent by the video conference terminal and the sound source orientation information corresponding to the audio data.
  • the audio data and the sound source position information corresponding to the audio data may be packaged to generate an audio code stream and additional field information, wherein the additional field information includes the sound source position information corresponding to the audio data.
  • the video conference terminal encodes the audio data into an audio code stream, and then adds additional field information to the corresponding audio code stream, and uses the additional field information to indicate sound source location information corresponding to the audio data. Then the video conference terminal sends the audio code stream to the multipoint control unit.
  • the identity recognition result is obtained by the video conference terminal according to sound source localization and human image recognition, and is sent by the video conference terminal to the multipoint control unit.
• when the conference record processing device is the video conference terminal in the method embodiment shown in FIG. 5, the conference record processing device directly collects the audio data of the current conference through the microphone and obtains the sound source position information corresponding to the audio data through sound source localization technology. The identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition.
  • the conference record processing apparatus performs voice segmentation on the audio data to obtain the first segmented audio data of the audio data.
  • the conference record processing device segments the audio data according to the sound source orientation information and the human voice detection method to obtain a plurality of segmented audio data of the audio data.
  • the conference recording processing apparatus determines a speaker corresponding to the first segment of audio data according to the voiceprint feature of the first segment of audio data and the identification result.
  • the conference record processing apparatus can execute the method shown in step 210 in FIG. 2 or step 409 in FIG. 4 or step 507 in FIG. 5 to obtain the speaker corresponding to the audio data, and details are not repeated here.
• the conference record processing device obtains an identification result used to indicate the correspondence between speaker identity information and speaking time information, and then combines the identification result with the voiceprint feature to further identify the audio data; in this way, accurate classification of voice data can be achieved without pre-registering users' voiceprint features.
  • the audio data processing method in the embodiment of the present application is described above, and the conference record processing apparatus and the video conference terminal in the embodiment of the present application are described below.
  • the apparatus 700 for processing conference records includes: an acquisition module 701 and a processing module 702 , wherein the acquisition module 701 and the processing module 702 are connected through a bus.
• the conference record processing apparatus 700 may be the recording server in the method embodiment shown in FIG. 2 above, the multipoint control unit in the method embodiment shown in FIG. 4 above, or the video conference terminal in the method embodiment shown in FIG. 5 above; it may also be configured as one or more chips within any of the above devices.
  • the meeting record processing apparatus 700 may be used to execute part or all of the functions of the above-mentioned devices.
• the acquisition module 701 acquires the audio data of the first conference site, the sound source location information corresponding to the audio data, and the identification result, where the identification result is used to indicate the correspondence between the speaker identity information obtained by the portrait recognition method and the speaker's speaking time information; the processing module 702 performs voice segmentation on the audio data to obtain the first segmented audio data of the audio data, and determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identification result.
  • the audio data is included in an audio code stream, and the audio code stream further includes additional field information, where the additional field information includes sound source position information corresponding to the audio data.
• the processing module 702 is specifically configured to determine the speaker corresponding to the first segmented audio data according to the speaker identity information if the identification result indicates that the first segmented audio data corresponds to unique speaker identity information.
• the processing module 702 is specifically configured to: if the identification result indicates that the first segmented audio data corresponds to at least two speaker identity information, compare the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to unique speaker identity information; and, if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, determine the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
• the processing module 702 is specifically configured to: if the identification result indicates that the first segmented audio data corresponds to at least two speaker identity information, determine the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to at least two speaker identity information.
• the processing module 702 is specifically configured to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record includes the historical voiceprint feature record of the first conference site, which is used to indicate the correspondence among voiceprint features, speakers, and channel identifiers.
• the processing module 702 is specifically configured to compare the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, where the first speaker is a speaker already determined in the current conference of the first conference site; if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the processing module compares the voiceprint feature corresponding to the first segmented audio data and the identification result with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
• the processing module 702 is further configured to compare the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, where the first speaker is a speaker already determined in the current conference of the first conference site; if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is inconsistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the processing module registers the voiceprint feature, the channel identifier, and the speaker corresponding to the voiceprint feature in the current conference of the first conference site, and updates the long-term voiceprint feature record.
  • the conference record processing apparatus 700 further includes a storage module, which is coupled with the processing module, so that the processing module can execute the computer execution instructions stored in the storage module to implement the functions of the conference record processing apparatus in the above method embodiments.
• the optional storage module included in the conference record processing apparatus 700 may be a storage unit within the chip, such as a register or a cache; the storage module may also be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, or the like.
  • FIG. 8 shows a schematic structural diagram of a conference record processing apparatus 800 in the above embodiment.
• the conference record processing apparatus 800 may be configured as the recording server in the method embodiment shown in FIG. 2 or the recording server shown in FIG. 4 above.
  • the conference record processing apparatus 800 may include: a processor 802 , a computer-readable storage medium/memory 803 , a transceiver 804 , an input device 805 and an output device 806 , and a bus 801 .
• the processor, the transceiver, the computer-readable storage medium, and the like are connected through the bus.
  • the embodiments of the present application do not limit the specific connection medium between the above components.
• the transceiver 804 obtains the audio data of the first conference site, the sound source location information corresponding to the audio data, and the identification result, where the identification result is used to indicate the correspondence between the speaker identity information obtained by the portrait recognition method and the speaker's speaking time information;
• the processor 802 performs voice segmentation on the audio data to obtain the first segmented audio data of the audio data, and determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identification result.
  • the processor 802 may include a baseband circuit, for example, may modulate and process audio data, and generate an audio code stream.
  • the transceiver 804 may include a radio frequency circuit, so as to modulate and amplify the audio code stream and send it to the corresponding device in the conference system.
• the processor 802 may run an operating system to control the functions of the various devices and components.
  • the transceiver 804 may include a baseband circuit and a radio frequency circuit.
  • the audio code stream or the identification result may be processed by the baseband circuit and the radio frequency circuit and then sent to the corresponding device in the conference system.
  • the transceiver 804 and the processor 802 can implement the corresponding steps in any of the foregoing embodiments in FIG. 2 to FIG. 6 , and details are not repeated here.
  • FIG. 8 only shows a simplified design of the conference record processing device.
• the conference record processing device may include any number of transceivers, processors, memories, and the like, and all conference record processing devices that can implement the present application fall within its protection scope.
• the processor 802 involved in the above-mentioned apparatus 800 may be a general-purpose processor, such as a CPU, a network processor (NP), or a microprocessor; it may also be an ASIC, or one or more integrated circuits for controlling the execution of programs of the solution of the present application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a controller/processor may also be a combination that implements computing functions, such as a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, and the like. Processors typically perform logical and arithmetic operations based on program instructions stored in memory.
  • the above-mentioned bus 801 may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (EISA for short) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 8, but it does not mean that there is only one bus or one type of bus.
  • the computer-readable storage medium/memory 803 mentioned above may also store an operating system and other application programs.
  • the program may include program code, and the program code includes computer operation instructions.
  • the above-mentioned memory may be ROM, other types of static storage devices that can store static information and instructions, RAM, other types of dynamic storage devices that can store information and instructions, disk storage, and the like.
  • Memory 803 may be a combination of the above-described storage types.
  • the above-mentioned computer-readable storage medium/memory may be in the processor, outside the processor, or distributed over multiple entities including the processor or processing circuit.
  • the computer-readable storage medium/memory described above may be embodied in a computer program product.
  • a computer program product may include a computer-readable medium in packaging materials.
• the embodiments of the present application also provide a general-purpose processing system, commonly referred to as a chip, which includes one or more microprocessors that provide the processor functions and an external memory that provides at least a part of the storage medium, all connected together with other supporting circuits through an external bus architecture.
• when the instructions stored in the memory are executed, the processor is caused to execute part or all of the steps of the methods in the embodiments of FIG. 2 to FIG. 6 above, and/or other processes of the technology described in the present application.
  • the steps of the methods or algorithms described in conjunction with the disclosure of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions.
• the software instructions can be composed of corresponding software modules, and the software modules can be stored in a RAM, a flash memory, a ROM, an EPROM, an EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
• the processor and storage medium may reside in an ASIC. Alternatively, the ASIC may be located in the conference record processing device.
  • the processor and the storage medium may also exist in the conference record processing apparatus as discrete components.
  • the video conference terminal 900 includes: a processing module 901 and a sending module 902, wherein the processing module 901 and the sending module 902 are connected through a bus.
  • the video conference terminal 900 may be the video conference terminal in the foregoing method embodiments, or may be configured as one or more chips in the foregoing video conference terminal.
  • the video conference terminal 900 may be used to perform part or all of the functions of the above-mentioned video conference terminal.
• the processing module 901 performs sound source localization on the audio data of the first conference site to obtain the sound source position information corresponding to the audio data, and obtains an identification result according to the sound source position and the portrait recognition method, where the identification result is used to indicate the correspondence between speaker identity information and speaking time information; the sending module 902 sends the identity recognition result, the audio data, and the sound source orientation information corresponding to the audio data to the conference record processing device.
• the processing module 901 is specifically used to obtain the portrait information corresponding to the sound source orientation; perform image recognition on the portrait information to obtain face information and/or body attribute information; determine the identity information of the speaker according to the face information and/or the body attribute information; and obtain the identity recognition result by establishing a correspondence between the time information of the speaker and the identity information of the speaker, as sketched below.
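A sketch of how the identification result might be assembled; `capture_portrait`, `recognize_faces`, and `recognize_bodies` stand in for the terminal's camera and recognition pipeline and are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class IdentificationResult:
    start_s: float                                   # speaking time: start
    end_s: float                                     # speaking time: end
    speakers: Set[str] = field(default_factory=set)  # user IDs and/or body-attribute IDs

def build_identification_result(azimuth_deg, start_s, end_s,
                                capture_portrait, recognize_faces, recognize_bodies):
    """Collect the portrait at the sound source azimuth, run face and body
    attribute recognition, and bind the identities to the speaking time."""
    portrait = capture_portrait(azimuth_deg)  # image region around the source
    ids = set(recognize_faces(portrait)) | set(recognize_bodies(portrait))
    return IdentificationResult(start_s=start_s, end_s=end_s, speakers=ids)
```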
  • the video conference terminal 900 further includes a storage module, which is coupled with the processing module, so that the processing module can execute the computer execution instructions stored in the storage module to implement the functions of the video conference terminal in the above method embodiments.
• the optional storage module included in the video conference terminal 900 may be an in-chip storage unit, such as a register or a cache; the storage module may also be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, or the like.
  • FIG. 10 shows a schematic structural diagram of a video conference terminal 1000 in the above-mentioned embodiment, and the video conference terminal 1000 may be configured as the aforementioned video conference terminal.
• the video conference terminal 1000 may include: a processor 1002, a computer-readable storage medium/memory 1003, a transceiver 1004, an input device 1005 and an output device 1006, and a bus 1001, where the processor, the transceiver, the computer-readable storage medium, and the like are connected through the bus.
  • the embodiments of the present application do not limit the specific connection medium between the above components.
  • the processor 1002 performs sound source localization on the audio data of the first venue to obtain sound source location information corresponding to the audio data; and obtains an identification result according to the sound source location and the portrait recognition method. The result is used to indicate the correspondence between speaker identity information and speaking time information;
  • the transceiver 1004 sends the identification result and the audio data to the conference record processing apparatus.
  • the processor 1002 may include a baseband circuit, for example, may modulate and process audio data, and generate an audio code stream.
  • the transceiver 1004 may include a radio frequency circuit, so as to modulate and amplify the audio code stream and send it to the corresponding device in the conference system.
• the processor 1002 may run an operating system to control the functions of the various devices and components.
  • the transceiver 1004 may include a baseband circuit and a radio frequency circuit.
  • the audio code stream or the identification result may be processed by the baseband circuit, and then sent to the corresponding device in the conference system by the radio frequency circuit.
  • the transceiver 1004 and the processor 1002 can implement the corresponding steps in any of the foregoing embodiments in FIG. 3 to FIG. 7 , and details are not repeated here.
  • FIG. 10 only shows the simplified design of the video conference terminal.
• the video conference terminal may include any number of transceivers, processors, memories, and the like, and all video conference terminals that can implement the present application fall within its protection scope.
• the processor 1002 involved in the above-mentioned apparatus 1000 may be a general-purpose processor, such as a CPU, a network processor (NP), or a microprocessor; it may also be an ASIC, or one or more integrated circuits for controlling the execution of programs of the solution of the present application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a controller/processor may also be a combination that implements computing functions, such as a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, and the like. Processors typically perform logical and arithmetic operations based on program instructions stored in memory.
  • the above-mentioned bus 1001 may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (EISA for short) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 10, but it does not mean that there is only one bus or one type of bus.
  • the above-mentioned computer-readable storage medium/memory 1003 may also store an operating system and other application programs.
  • the program may include program code, and the program code includes computer operation instructions.
  • the above-mentioned memory may be ROM, other types of static storage devices that can store static information and instructions, RAM, other types of dynamic storage devices that can store information and instructions, disk storage, and the like.
  • the memory 1003 may be a combination of the above storage types.
  • the above-mentioned computer-readable storage medium/memory may be in the processor, outside the processor, or distributed over multiple entities including the processor or processing circuit.
  • the computer-readable storage medium/memory described above may be embodied in a computer program product.
  • a computer program product may include a computer-readable medium in packaging materials.
• the embodiments of the present application also provide a general-purpose processing system, commonly referred to as a chip, which includes one or more microprocessors that provide the processor functions and an external memory that provides at least a part of the storage medium, all connected together with other supporting circuits through an external bus architecture.
• when the instructions stored in the memory are executed, the processor is caused to execute part or all of the steps of the methods in the embodiments of FIG. 3 to FIG. 7 above, and/or other processes of the technology described in the present application.
  • the steps of the methods or algorithms described in conjunction with the disclosure of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions.
• the software instructions can be composed of corresponding software modules, and the software modules can be stored in a RAM, a flash memory, a ROM, an EPROM, an EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage medium may reside in an ASIC.
  • the ASIC can be located in the videoconferencing terminal.
  • the processor and the storage medium may also exist in the video conference terminal as discrete components.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the unit is only a logical function division.
• in actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
• the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
  • the unit described as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
• the technical solutions of the present application, in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods in the embodiments of the present application.
• the aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

An audio data processing method, device and system for classifying conference audio data according to the identities of speakers. The method comprises: a conference record processing apparatus obtains audio data of a first conference room, sound direction information corresponding to the audio data, and an identity identification result, the identity identification result being used for indicating a correspondence between speaker identity information obtained by means of a portrait identification method and speaking time information of speakers (601); then the conference record processing apparatus performs speech segmentation on the audio data, to obtain first segmented audio data of the audio data (602); and finally, the conference record processing apparatus determines a speaker corresponding to the first segmented audio data according to voiceprint features of the first segmented audio data and the identity identification result (603). The audio data is comprised in an audio stream which further comprises additional domain information, and the additional domain information comprises the sound direction information corresponding to the audio data.

Description

A method, device and system for processing audio data
This application claims priority to the Chinese patent application No. 202011027427.2, entitled "A method, device and system for processing audio data", filed with the State Intellectual Property Office of China on September 25, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of communications, and in particular, to a method, device and system for processing audio data.
Background
With the rapid development of video conferencing technology, just as meeting minutes are produced manually in ordinary meetings, there is also a need for meeting minutes in multipoint video conferences. Existing products can already automatically record the audio, video, data, and other content of an entire conference during a video conference. If the audio data is simply recorded, however, then when the key or specific content of the conference is reviewed, the minutes cannot be organized and classified by speaker as they can be for an ordinary meeting.
During a video conference, if it can be determined that only one person speaks in an entire voice file, the audio data of the whole file can be sent directly to a voiceprint recognition system for identification. If the voice file contains the speech of multiple people, the file must first be segmented, and voiceprint recognition is then performed on each segment of audio data. Existing voiceprint recognition systems usually require more than 10 seconds of audio data, and the longer the data, the higher the accuracy; therefore, when segmenting audio data, the segments cannot be too short. Since free conversation is common in video conferences, a long segment of audio data may contain the speech of several people, and when such a segment is sent to the voiceprint recognition system, the recognition result will be unreliable.
Moreover, the premise of the above solution is that conference participants must register their voiceprints in the voiceprint recognition system. However, the channel used during sound collection has a large influence on voiceprint features: voiceprints are generally pre-registered over a single channel, while many different channels are used during recognition, so it is difficult to guarantee the accuracy of voiceprint recognition for sounds collected over different channels.
SUMMARY OF THE INVENTION
Embodiments of the present application provide an audio data processing method, device, and system for accurately classifying conference audio data.
In a first aspect, an embodiment of the present application provides a method for processing audio data, which specifically includes: a conference record processing device obtains audio data of a first conference site, sound source orientation information corresponding to the audio data, and an identity recognition result, where the identity recognition result is used to indicate the correspondence between speaker identity information obtained by a portrait recognition method and the speaker's speaking time information; the conference record processing device then performs voice segmentation on the audio data to obtain first segmented audio data of the audio data; finally, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
In this embodiment, the audio data and the sound source orientation information corresponding to the audio data can be packaged to generate an audio code stream, where the audio code stream contains additional field information of the audio data, and the additional field information includes the sound source orientation information corresponding to the audio data. The audio data processing method can be applied to local or remote conference scenarios, where at least one conference site participates in the conference. Based on the above solution, the additional field information may further include the time information of the audio data, the site identification information of the first conference site, and other information. Portrait recognition methods include face recognition and human body attribute recognition: face recognition yields the speaker corresponding to facial features, while human body attribute recognition identifies the user's overall clothing or physical features to obtain the corresponding speaker. The speaker identity information may be user identity information (such as the speaker's employee number in a company, or an ID card number or telephone number registered in the company's internal database) or user body attribute information (for example, in the current conference the user wears a white top and black trousers, or the user has a distinctive mark on one arm). The speaking time information may be a period of time or two time points; for example, it may be the 30-second period from 00:00:15 to 00:00:45 after the current conference starts, or it may include only the two time points "00:00:15" and "00:00:45". It can be understood that, in the embodiments of the present application, the timing rule indicated in the form "00:00:00" is "hours:minutes:seconds", that is, the time point "00:00:15" is the 15th second after the conference starts.
In the technical solution provided by this embodiment, the conference record processing device obtains an identity recognition result indicating the correspondence between speaker identity information and speaking time information, and then combines the identity recognition result with voiceprint features to further identify the audio data; in this way, accurate classification of voice data can be achieved without pre-registering users' voiceprint features.
Optionally, the operation of the conference record processing device determining the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result may be as follows:
In a possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, the conference record processing device determines the speaker corresponding to the first segmented audio data according to that speaker identity information. That is, if the identity recognition result obtained by the conference record processing device indicates that the only speaker corresponding to the first segmented audio data is user01 and the corresponding voiceprint feature is VP01, the conference record processing device determines the speaker of the first segmented audio data to be user01.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two speaker identity information, the conference record processing device compares the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to unique speaker identity information; if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data. For example, suppose the speaker identity of the second segmented audio data has been determined to be user02 with corresponding voiceprint feature VP02, and the voiceprint feature of the first segmented audio data is VP02 with corresponding speaker identity information including user03 and user02. The voiceprint features of the first and second segmented audio data are both VP02, and since the result for the second segmented audio data shows that the speaker corresponding to VP02 is user02, the speaker of the first segmented audio data can be determined to be user02 as well.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two speaker identity information, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is obtained by the conference record processing device through voice segmentation of the audio data and corresponds to at least two speaker identity information. That is, the conference record processing device can jointly determine the speaker corresponding to each segment from the voiceprint features of multiple segments and the corresponding speaker identity information. For example, suppose the second segmented audio data has speaker identity information user02 and user03 with voiceprint feature VP02; the first segmented audio data has voiceprint feature VP03 with speaker identity information including user03 and user02; and third segmented audio data has voiceprint feature VP03 with speaker identity information user03 and user01. The voiceprint features of the first and third segmented audio data are both VP03, and the speaker identity sets of these two segments have a unique intersection, user03, so the speaker corresponding to voiceprint feature VP03 can be determined to be user03. It can then further be determined that the speaker of the second segmented audio data is user02, that is, the speaker corresponding to voiceprint feature VP02 is user02.
In another possible implementation, the conference record processing apparatus determines the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record includes a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used to indicate the correspondence among voiceprint features, speakers, and channel identifiers.
Optionally, when the conference record processing apparatus determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the conference record processing apparatus compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are consistent, the conference record processing apparatus determines that the long-term voiceprint feature record is usable; in this case, the conference record processing apparatus compares the voiceprint feature corresponding to the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
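The validate-then-look-up flow might look as follows; the record layout (each speaker mapped to a voiceprint and a channel identifier) is an illustrative assumption:

```python
def resolve_with_long_term_record(segment, confirmed, record, same_vp):
    """Validate the long-term record against an already-confirmed speaker,
    then use it to look up the speaker for the segment's voiceprint.
    record maps speaker -> (voiceprint, channel_id)."""
    name, current_vp = confirmed  # a speaker confirmed in the current meeting
    stored_vp, _channel = record.get(name, (None, None))
    if stored_vp is None or not same_vp(current_vp, stored_vp):
        return None  # record not validated; fall back to short-term logic
    for speaker, (vp, _ch) in record.items():
        # A hit requires both a voiceprint match and membership in the
        # segment's candidate set from the identity recognition result.
        if same_vp(segment["vp"], vp) and speaker in segment["speakers"]:
            return speaker
    return None

record = {"user02": ("VP02", "ch1"), "user03": ("VP03", "ch2")}
seg = {"vp": "VP03", "speakers": ["user03", "user01"]}
print(resolve_with_long_term_record(seg, ("user02", "VP02"), record,
                                    lambda a, b: a == b))  # user03
```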
In the audio data classification process, combining short-term processing with long-term processing helps maximize the accuracy of audio data classification.
Optionally, the conference record processing apparatus compares the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are inconsistent, the conference record processing apparatus registers the voiceprint feature, the channel identifier, and the speaker corresponding to that voiceprint feature in the current conference of the first conference site, and updates the long-term voiceprint feature record. In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, keeping the long-term voiceprint feature record usable. Moreover, the corresponding voiceprint features and speakers are registered after each conference, achieving dynamic registration of voiceprint features and speakers; registration is no longer limited to voiceprint features bound to fixed channel identifiers, which effectively improves the accuracy of audio data classification.
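A sketch of this dynamic re-registration, using the same assumed record layout as above:

```python
def update_long_term_record(record, speaker, meeting_vp, channel_id, same_vp):
    """If the speaker's voiceprint in the current meeting no longer matches
    the stored one, re-register (voiceprint, channel, speaker) so the
    long-term record stays usable for later meetings."""
    stored_vp, _ = record.get(speaker, (None, None))
    if stored_vp is None or not same_vp(meeting_vp, stored_vp):
        record[speaker] = (meeting_vp, channel_id)  # dynamic re-registration
    return record

record = {"user02": ("VP02_old", "ch1")}
update_long_term_record(record, "user02", "VP02_new", "ch3", lambda a, b: a == b)
print(record)  # {'user02': ('VP02_new', 'ch3')}
```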
Optionally, after the conference record processing apparatus acquires the voiceprint feature of the first segmented audio data and the corresponding speaker, the conference record processing apparatus may acquire voiceprint identification information of that voiceprint feature, and then establish a correspondence between the voiceprint identification information and the speaker corresponding to the first segmented audio data. This yields a one-to-one correspondence between voiceprint features and speakers, facilitating subsequent audio data classification.
Optionally, when the technical solutions provided in the embodiments of this application are applied to a remote multi-site conference scenario, the conference record processing apparatus may be a recording server, or may be a functional module integrated in a multipoint control unit. Accordingly, the audio code stream may be forwarded by the multipoint control unit to the conference record processing apparatus, and the identity recognition result is sent by the video conference terminal to the conference record processing apparatus.
Optionally, the audio code stream is forwarded to the conference record processing apparatus by the multipoint control unit after conference site gating. This reduces unnecessary data transmission and lightens the load on the network.
Optionally, the specific operation by which the conference record processing apparatus performs voice segmentation on the audio data may be as follows: the conference record processing apparatus performs voice segmentation on the audio data according to the sound source azimuth information and human voice detection. This allows more precise segmentation of the audio data.
According to a second aspect, an embodiment of this application provides an audio data processing method, including: a video conference terminal performs sound source localization on audio data of a first conference site to acquire sound source azimuth information corresponding to the audio data; the video conference terminal acquires an identity recognition result according to the sound source azimuth and a portrait recognition method, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information; and the video conference terminal sends the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to a conference record processing apparatus.
In this embodiment, the video conference terminal performs sound source localization on the audio data to capture image information of the speaker, and obtains, through portrait recognition on the image information, an identity recognition result indicating the correspondence between speaker identity information and speaking time information. The terminal then sends the identity recognition result to the conference record processing apparatus, so that the conference record processing apparatus combines the identity recognition result with voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without pre-registering users' voiceprint features.
Optionally, the specific process by which the video conference terminal performs portrait recognition may be as follows: the video conference terminal acquires portrait information corresponding to the sound source azimuth; the video conference terminal performs image recognition on the portrait information to obtain face information and/or body attribute information; the video conference terminal determines the speaker identity information according to the face information and/or the body attribute information; and the video conference terminal establishes a correspondence between the speaking time information and the speaker identity information to obtain the identity recognition result.
In this embodiment, the speaker identity information may be user identity identification information (for example, the speaker's employee number in the company, or an ID card number or telephone number registered in the company's internal database) or user body attribute identification information (for example, in the current conference the user wears a white top and black trousers, or the user has a distinctive mark on the arm). The speaking time information may be a period of time or two time points. For example, the speaking time information may be the 30-second period from 00:00:15 to 00:00:45 after the current conference starts, or it may include only the two time points "00:00:15" and "00:00:45". It can be understood that in the embodiments of this application the timing rule indicated by the "00:00:00" format is "hours:minutes:seconds", so the time point indicated by "00:00:15" is the 15th second after the conference starts.
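As a concrete illustration of the information content described here, a possible in-memory form of one identity recognition entry is sketched below; the field names are assumptions of the sketch, since the application fixes only the information carried, not its encoding:

```python
from dataclasses import dataclass

@dataclass
class IdentityResult:
    """One entry of the identity recognition result: speaker identity bound
    to speaking time. Field names here are illustrative only."""
    speaker_id: str   # e.g. an employee number, or a body-attribute tag
    start: str        # "HH:MM:SS" offset from the start of the meeting
    end: str

# The 30-second utterance from the example above.
entry = IdentityResult(speaker_id="user01", start="00:00:15", end="00:00:45")
print(entry)
```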
Optionally, when the technical solutions provided in the embodiments of this application are applied to a single-user scenario of a local conference or a remote conference, the video conference terminal may itself serve as the conference record processing apparatus and implement the method of the first aspect, specifically as follows:
The video conference terminal acquires audio data of the current conference site and obtains segmented audio data from the audio data according to the sound source azimuth and human voice detection; it then acquires the voiceprint feature of the segmented audio data and determines the speaker corresponding to the segmented audio data by combining the voiceprint feature with the identity recognition result.
In the technical solution provided in this embodiment, the video conference terminal acquires an identity recognition result indicating the correspondence between speaker identity information and speaking time information, and then combines the identity recognition result with voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without pre-registering users' voiceprint features.
Optionally, the operation by which the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result may be as follows:
In a possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, the video conference terminal determines the speaker corresponding to the first segmented audio data according to that speaker identity information. That is, if the identity recognition result acquired by the video conference terminal indicates that the only speaker corresponding to the first segmented audio data is user01 and the corresponding voiceprint feature is VP01, the video conference terminal determines the speaker of the first segmented audio data to be user01.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the video conference terminal compares the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by the video conference terminal through voice segmentation of the audio data and corresponds to unique speaker identity information. If the voiceprint feature of the first segmented audio data is consistent with that of the second segmented audio data, the video conference terminal determines the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data. For example, suppose the speaker of the second segmented audio data has been determined to be user02 and its voiceprint feature is VP02, while the first segmented audio data has voiceprint feature VP02 and candidate speakers user03 and user02. Since both segments carry the same voiceprint feature VP02, and the result for the second segmented audio data shows that VP02 corresponds to user02, it can be determined that the speaker of the first segmented audio data is also user02.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the video conference terminal determines the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is obtained by the video conference terminal through voice segmentation of the audio data and corresponds to at least two pieces of speaker identity information. That is, the video conference terminal can jointly evaluate the voiceprint features and candidate speaker identity information of multiple segmented audio data to determine the speaker corresponding to each segment. For example, suppose the second segmented audio data has candidate speakers user02 and user03 and voiceprint feature VP02; the first segmented audio data has voiceprint feature VP03 and candidate speakers user03 and user02; and third segmented audio data has voiceprint feature VP03 and candidate speakers user03 and user01. The first and third segments share the voiceprint feature VP03, and their candidate speaker sets have exactly one element in common, user03, so it can be determined that the speaker corresponding to VP03 is user03. It then follows that the speaker of the second segmented audio data is user02, that is, the voiceprint feature VP02 corresponds to user02.
In another possible implementation, the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record includes a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used to indicate the correspondence among voiceprint features, speakers, and channel identifiers.
Optionally, when the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the video conference terminal compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are consistent, the video conference terminal determines that the long-term voiceprint feature record is usable; in this case, the video conference terminal compares the voiceprint feature corresponding to the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
In the audio data classification process, combining short-term processing with long-term processing helps maximize the accuracy of audio data classification.
Optionally, the video conference terminal compares the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already confirmed in the current conference of the first conference site. If the two voiceprint features are inconsistent, the video conference terminal registers the voiceprint feature, the channel identifier, and the speaker corresponding to that voiceprint feature in the current conference of the first conference site, and updates the long-term voiceprint feature record. In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, keeping the long-term voiceprint feature record usable. Moreover, the corresponding voiceprint features and speakers are registered after each conference, achieving dynamic registration of voiceprint features and speakers; registration is no longer limited to voiceprint features bound to fixed channel identifiers, which effectively improves the accuracy of audio data classification.
Optionally, after the video conference terminal acquires the voiceprint feature of the first segmented audio data and the corresponding speaker, the video conference terminal may acquire voiceprint identification information of that voiceprint feature, and then establish a correspondence between the voiceprint identification information and the speaker corresponding to the first segmented audio data. This yields a one-to-one correspondence between voiceprint features and speakers, facilitating subsequent audio data classification.
According to a third aspect, this application provides a conference record processing apparatus that has the function of implementing the behavior of the conference record processing apparatus in the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In a possible implementation, the apparatus includes units or modules for performing the steps of the first aspect. For example, the apparatus includes: an acquisition module configured to acquire audio data of a first conference site, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result is used to indicate the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information; and a processing module configured to perform voice segmentation on the audio data to acquire first segmented audio data of the audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
Optionally, the apparatus further includes a storage module configured to store program instructions and data necessary for the conference record processing apparatus.
In a possible implementation, the apparatus includes a processor and a transceiver. The processor is configured to support the conference record processing apparatus in performing the corresponding functions of the method provided in the first aspect. The transceiver is configured to handle communication between the conference record processing apparatus and other devices in the conference system, for example receiving the audio data and identity recognition result involved in the above method from the video conference terminal. Optionally, the apparatus may further include a memory coupled to the processor, which stores program instructions and data necessary for the conference record processing apparatus.
In a possible implementation, when the apparatus is a chip within the conference record processing apparatus, the chip includes a processing module and a transceiver module. The transceiver module may be, for example, an input/output interface, pin, or circuit on the chip, and transmits the received audio data and identity recognition result of the first conference site to other chips or modules coupled to this chip. The processing module may be, for example, a processor configured to perform voice segmentation on the audio data to acquire first segmented audio data of the audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result. The processing module can execute computer-executable instructions stored in a storage unit to support the conference record processing apparatus in performing the method provided in the first aspect. Optionally, the storage unit may be a storage unit within the chip, such as a register or a cache, or a storage unit located outside the chip, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
In a possible implementation, the apparatus includes a processor, a radio frequency circuit, and an antenna. The processor is configured to control the functions of the circuit parts and determine the speaker corresponding to the first segmented audio data; after processing such as analog conversion, filtering, amplification, and up-conversion via the radio frequency circuit, the result is sent via the antenna to an automatic speech recognition server. Optionally, the apparatus further includes a memory that stores program instructions and data necessary for the conference record processing apparatus.
In a possible implementation, the apparatus includes a communication interface and a logic circuit. The communication interface is configured to acquire an audio code stream and an identity recognition result of the first conference site, where the audio code stream includes audio data and additional field information, the additional field information includes sound source azimuth information corresponding to the audio data, and the identity recognition result is used to indicate the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information. The logic circuit is configured to perform voice segmentation on the audio data to acquire first segmented audio data of the audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
The processor mentioned anywhere above may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the program execution of the audio data processing methods of the above aspects.
According to a fourth aspect, an embodiment of this application provides a video conference apparatus that has the function of implementing the behavior of the video conference terminal in the second aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In a possible implementation, the apparatus includes units or modules for performing the steps of the second aspect. For example, the apparatus includes a processing module configured to perform sound source localization on audio data of a first conference site to acquire sound source azimuth information corresponding to the audio data, and to acquire an identity recognition result according to the sound source azimuth and a portrait recognition method, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information;
and a sending module configured to send the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to a conference record processing apparatus.
Optionally, the apparatus further includes a storage module configured to store program instructions and data necessary for the video conference apparatus.
In a possible implementation, the apparatus includes a processor and a transceiver. The processor is configured to support the video conference apparatus in performing the corresponding functions of the method provided in the second aspect. The transceiver is configured to handle communication between the video conference apparatus and the devices in the conference system, sending the audio code stream and the identity recognition result to the conference record processing apparatus. Optionally, the apparatus may further include a memory coupled to the processor, which stores program instructions and data necessary for the video conference apparatus.
In a possible implementation, when the apparatus is a chip within the video conference apparatus, the chip includes a processing module and a transceiver module. The processing module may be, for example, a processor configured to perform sound source localization on audio data of the first conference site to acquire sound source azimuth information corresponding to the audio data, and to acquire an identity recognition result according to the sound source azimuth and a portrait recognition method, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information. The transceiver module may be, for example, an input/output interface, pin, or circuit on the chip that transmits configuration information to other chips or modules coupled to this chip. The processing module can execute computer-executable instructions stored in a storage unit to support the video conference apparatus in performing the method provided in the second aspect. Optionally, the storage unit may be a storage unit within the chip, such as a register or a cache, or a storage unit located outside the chip, such as a ROM or another type of static storage device capable of storing static information and instructions, or a RAM.
In a possible implementation, the apparatus includes a processor, a baseband circuit, a radio frequency circuit, and an antenna. The processor is configured to control the functions of the circuit parts; the baseband circuit is configured to generate data packets containing the audio code stream and the identity recognition result, which, after processing such as analog conversion, filtering, amplification, and up-conversion via the radio frequency circuit, are sent via the antenna to the conference record processing apparatus. Optionally, the apparatus further includes a memory that stores program instructions and data necessary for the video conference apparatus.
In a possible implementation, the apparatus includes a communication interface and a logic circuit. The logic circuit is configured to perform sound source localization on audio data of the first conference site to acquire sound source azimuth information corresponding to the audio data, and to acquire an identity recognition result according to the sound source azimuth and a portrait recognition method, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information. The communication interface is configured to send the identity recognition result to the conference record processing apparatus and send the audio data to a multipoint control unit.
The processor mentioned anywhere above may be a CPU, a microprocessor, an ASIC, or one or more integrated circuits for controlling the program execution of the audio data processing methods of the above aspects.
According to a fifth aspect, an embodiment of this application provides a computer-readable storage medium storing computer instructions, where the computer instructions are used to perform the method of any possible implementation of any one of the above aspects.
According to a sixth aspect, an embodiment of this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of any one of the above aspects.
According to a seventh aspect, this application provides a chip system including a processor configured to support a conference record processing apparatus or a video conference apparatus in implementing the functions involved in the above aspects, for example generating or processing the data and/or information involved in the above methods. In a possible design, the chip system further includes a memory configured to store program instructions and data necessary for the conference record processing apparatus or the video conference apparatus, so as to implement the function of any one of the above aspects. The chip system may consist of chips, or may include chips and other discrete devices.
According to an eighth aspect, an embodiment of this application provides a conference system including the conference record processing apparatus and the video conference apparatus of the above aspects.
Description of drawings
FIG. 1A is a schematic diagram of an embodiment of a conference system architecture in an embodiment of this application;
FIG. 1B is a schematic diagram of another embodiment of a conference system architecture in an embodiment of this application;
FIG. 2 is a schematic diagram of an embodiment of an audio data processing method in an embodiment of this application;
FIG. 3 is a schematic diagram of a scenario in which a video conference terminal collects image information in an embodiment of this application;
FIG. 4 is a schematic diagram of another embodiment of an audio data processing method in an embodiment of this application;
FIG. 5 is a schematic diagram of another embodiment of an audio data processing method in an embodiment of this application;
FIG. 6 is a schematic diagram of another embodiment of an audio data processing method in an embodiment of this application;
FIG. 7 is a schematic diagram of an embodiment of a conference record processing apparatus in an embodiment of this application;
FIG. 8 is a schematic diagram of another embodiment of a conference record processing apparatus in an embodiment of this application;
FIG. 9 is a schematic diagram of an embodiment of a video conference terminal in an embodiment of this application;
FIG. 10 is a schematic diagram of another embodiment of a video conference terminal in an embodiment of this application.
Detailed description
To make the objectives, technical solutions, and advantages of this application clearer, the following describes embodiments of this application with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of this application. A person of ordinary skill in the art will appreciate that, as new application scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the specification, claims, and accompanying drawings of this application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable in appropriate circumstances, so that the embodiments described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to the expressly listed steps or modules, but may include other steps or modules that are not expressly listed or that are inherent to such a process, method, product, or device. The naming or numbering of steps in this application does not mean that the steps of a method flow must be performed in the temporal or logical order indicated by the naming or numbering; the execution order of named or numbered process steps may be changed according to the technical objective to be achieved, as long as the same or a similar technical effect can be achieved. The division of units in this application is a logical division; other division manners are possible in actual implementation. For example, multiple units may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and indirect couplings or communication connections between units may be electrical or in other similar forms, none of which is limited in this application. Moreover, units or sub-units described as separate components may or may not be physically separate, may or may not be physical units, and may be distributed among multiple circuit units; some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this application.
The technical solutions of the embodiments of the present invention can be applied to local conference or remote conference scenarios. The specific system architecture of an embodiment of the present invention may include multiple video conference terminals, a multipoint control unit, a recording server, and an automatic speech recognition (ASR) server. Taking the embodiment shown in FIG. 1A as an example, each of the multiple video conference terminals (video conference terminal 01 to video conference terminal 03 shown in FIG. 1A) collects conference audio data and image information of the participants, and identifies the speaker among the participants through the image information to obtain an identity recognition result. The video conference terminal then sends the audio data and the identity recognition result to the recording server. The recording server classifies the audio data according to the identity recognition result and the sound source azimuth and sends the classified data to the ASR server, which outputs the conference record through its speech transcription function. Compared with the embodiment of FIG. 1A, in the embodiment shown in FIG. 1B the functions of the recording server are integrated into the multipoint control unit (corresponding to the conference record processing module in FIG. 1B). Each of the multiple video conference terminals (video conference terminal 01 to video conference terminal 03 shown in FIG. 1B) collects conference audio data and image information of the participants, and identifies the speaker among the participants through the image information to obtain an identity recognition result. The video conference terminal then sends the audio data and the identity recognition result to the multipoint control unit, whose conference record processing module classifies the audio data and sends it to the ASR server. Finally, the ASR server outputs the conference record through its speech transcription function.
For details, refer to FIG. 2. An embodiment of the audio data processing method in the embodiments of this application includes the following steps:
201. The video conference terminal collects audio data.
In a remote conference scenario, a conference may include multiple conference sites, each conference site corresponds to at least one video conference terminal, and each conference site has at least one participant. This embodiment is described from the perspective of one video conference terminal in a conference site. During the conference, the video conference terminal uses a microphone to pick up the audio data of each speaker in real time.
202. The video conference terminal acquires the sound source azimuth of the audio data.
While collecting the audio data, the video conference terminal can acquire the sound source azimuth corresponding to the audio data and establish a correspondence between the audio data and the sound source azimuth. For example, the sound source azimuth of audio data collected by the video conference terminal between conference times 00:00:15 and 00:00:30 may be about 30 degrees east relative to the video conference terminal. It can be understood that sound source localization tolerates some error, so the sound source azimuth may be a range; for example, a localization of 30 degrees east may correspond to a range of 28 to 32 degrees east.
In this embodiment, the video conference terminal may acquire the sound source azimuth of the audio data in the following possible implementations:
In one possible implementation, an array microphone is deployed on the video conference terminal, and the sound source azimuth of the audio data is determined from the sound beam information picked up by the array microphone.
In another possible implementation, a device or system dedicated to sound source localization is additionally deployed in the conference site; the sound source azimuth of the audio data is determined with that device or system as the calibration reference point, and the azimuth is then sent to the video conference terminal.
It can be understood that sound source localization may use either of the above solutions or any other feasible implementation, as long as the sound source azimuth of the audio data can be acquired; the specific solution is not limited here.
203. The video conference terminal performs voice segmentation on the audio data through human voice detection to obtain segmented audio data.
The video conference terminal performs voice segmentation on the received audio data according to human voice detection to obtain different segmented audio data.
In this embodiment, the video conference terminal may separate a preceding speech segment from a following speech segment according to a silence interval, or may determine algorithmically whether a segment is human voice or non-human voice and split the surrounding human voice segments at the non-human voice. For example, suppose the video conference terminal collects audio data between conference times 00:00:15 and 00:00:30, detects silence between 00:00:30 and 00:00:32, collects audio data between 00:00:32 and 00:00:45, and detects silence between 00:00:45 and 00:00:50. The video conference terminal may then treat the audio data collected between 00:00:15 and 00:00:30 as one segment of audio data and the audio data collected between 00:00:32 and 00:00:45 as the next segment.
It can be understood that in the embodiments of this application the timing rule indicated by the "00:00:00" format is "hours:minutes:seconds", so the time point indicated by "00:00:15" is the 15th second after the conference starts.
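A minimal sketch of the silence-interval segmentation described in this step, with per-second voice-activity samples as an assumed input form:

```python
def segment_by_silence(frames, min_gap=2.0):
    """frames: ordered (time_sec, is_voice) samples.
    Returns (start, end) pairs of voiced segments, cut wherever a silence
    gap of at least min_gap seconds occurs."""
    segments, start, last_voice = [], None, None
    for t, is_voice in frames:
        if is_voice:
            if start is None:
                start = t
            elif t - last_voice >= min_gap:
                segments.append((start, last_voice))  # the gap closes a segment
                start = t
            last_voice = t
    if start is not None:
        segments.append((start, last_voice))
    return segments

# The example from the text: voice 15-30 s, silence 30-32 s, voice 32-45 s,
# then silence; this yields two segments.
frames = [(t, 15 <= t <= 30 or 32 <= t <= 45) for t in range(15, 51)]
print(segment_by_silence(frames))  # [(15, 30), (32, 45)]
```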
204. The video conference terminal collects image information within the sound source azimuth range according to the sound source azimuth.
The video conference terminal determines its image information collection area according to the sound source azimuth corresponding to the audio data acquired in step 202, and then collects image information in that area.
In this embodiment, the video conference terminal may collect the image information by capturing photos, or by grabbing the picture frames corresponding to the audio data from the video data; the specific form is not limited here. Likewise, the camera of the video conference terminal may be fixed or deployed as rotatable, which is not limited here. When the camera is fixed (that is, its shooting range is fixed), the video conference terminal acquires images within the fixed shooting range and then extracts the image information corresponding to the audio data by calculation based on the sound source azimuth. When the camera is movable, the video conference terminal can adjust the camera's shooting range according to the sound source azimuth to acquire the image information corresponding to the audio data. As shown in FIG. 3, the video conference terminal is located above the conference screen and the participants sit on both sides of the conference table; when a participant is speaking, the video conference terminal can acquire image information within a certain angular range according to the sound source azimuth. Due to the viewing angle, the image information may contain multiple participants or only one. For example, when the image information of speaker 1 is collected according to the sound source localization of speaker 1, the image region contains only speaker 1; when the image information of speaker 2 is collected according to the sound source localization of speaker 2, the image region contains speaker 1 and another participant.
205. The video conference terminal performs portrait recognition on the image information to obtain an identity recognition result.
The video conference terminal performs face recognition and body attribute recognition on the image information to obtain an identity recognition result, where the identity recognition result is used to indicate the correspondence between speaker identity information and speaking time information. For example, face recognition yields the speaker corresponding to the facial features, while body attribute recognition identifies the user's overall clothing or physical characteristics to yield the speaker corresponding to those attributes. The speaker identity information may be user identity identification information (for example, the speaker's employee number in the company, or an ID card number or telephone number registered in the company's internal database) or user body attribute identification information (for example, in the current conference the user wears a white top and black trousers, or the user has a distinctive mark on the arm). The speaking time information may be a period of time or two time points; for example, the 30-second period from 00:00:15 to 00:00:45 after the current conference starts, or only the two time points "00:00:15" and "00:00:45".
In this embodiment, the specific operation by which the video conference terminal acquires the speaker identity information may be as follows: if the image information contains clearly recognizable face information, the video conference terminal uses face recognition technology to identify the face in the image information and compares it with a stored face database to determine the user identity identification information corresponding to that face; if the face information in the image does not meet the recognition requirements (for example, the facial features are insufficient for face recognition, or no facial image is present), the video conference terminal may perform body attribute recognition to obtain body attribute information and determine the user body attribute identification information accordingly.
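This face-first, body-attribute-fallback decision might be sketched as follows; the pre-extracted inputs and the face database lookup are assumptions of the sketch:

```python
def identify_speaker(face_info, body_attributes, face_db):
    """face_info: (features, quality_ok) or None; body_attributes: a textual
    description. Face recognition is preferred; body attributes serve as the
    fallback identity when no usable face is available."""
    if face_info is not None and face_info[1]:
        user = face_db.get(face_info[0])  # compare against registered faces
        if user is not None:
            return {"type": "user_id", "value": user}
    return {"type": "body_attributes", "value": body_attributes}

face_db = {"feat_01": "user01"}
print(identify_speaker(("feat_01", True), "white top, black trousers", face_db))
print(identify_speaker(None, "white top, black trousers", face_db))
```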
206. The video conference terminal packages the audio data and the corresponding sound source azimuth into an audio code stream and sends it to the multipoint control unit, and sends the identity recognition result to the recording server.

The video conference terminal packages the audio data together with the sound source azimuth corresponding to that audio data into an audio code stream and sends it to the multipoint control unit. In one exemplary solution, the video conference terminal encodes the audio data into an audio code stream and then adds additional field information to the corresponding audio code stream, the additional field information indicating the sound source azimuth information corresponding to the audio data. The identity recognition result that the video conference terminal obtains through its own portrait recognition can be sent directly to the recording server.
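The publication does not specify a wire format for the additional field information, so the layout below is purely illustrative: a small header carrying the azimuth (and a site identifier that the multipoint control unit can fill in later) is prepended to each coded audio frame.

    import struct

    HEADER_FMT = "!4sfHI"                      # magic, azimuth, site id, payload length
    HEADER_LEN = struct.calcsize(HEADER_FMT)   # 14 bytes

    def pack_frame(coded_audio: bytes, azimuth_deg: float, site_id: int = 0) -> bytes:
        header = struct.pack(HEADER_FMT, b"AZMF", azimuth_deg, site_id, len(coded_audio))
        return header + coded_audio

    def unpack_frame(packet: bytes):
        magic, azimuth, site_id, n = struct.unpack(HEADER_FMT, packet[:HEADER_LEN])
        assert magic == b"AZMF"
        return packet[HEADER_LEN:HEADER_LEN + n], azimuth, site_id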
207. The multipoint control unit forwards the audio code stream sent by the video conference terminal to the recording server.

After receiving the audio code stream sent by the video conference terminal, the multipoint control unit determines, according to the conference identifier assigned to the video conference terminal, the conference site to which that terminal belongs, then adds the site identifier to the audio code stream and sends the audio code stream to the recording server.

In one possible implementation, the multipoint control unit may screen the audio data of the individual conference sites and forward the audio data of only one or several selected sites to the recording server. The multipoint control unit may compare the volume of the audio data from each site and select for forwarding the audio data whose volume exceeds a preset threshold; alternatively, the multipoint control unit may determine, by means of an algorithm, the audio data in which the detected human-voice duration exceeds a preset threshold and forward that audio data. The specific screening conditions are not limited here. Screening in this way reduces the amount of data to be processed and therefore speeds up processing.
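A minimal sketch of such a screening step follows; the RMS volume measure, both thresholds, and the function names are assumptions for illustration only, since the publication leaves the screening conditions open.

    import numpy as np

    VOLUME_THRESHOLD = 0.01     # assumed RMS threshold on full-scale-1.0 PCM
    MIN_VOICE_SECONDS = 1.0     # assumed minimum detected human-voice duration

    def should_forward(pcm: np.ndarray, voiced_seconds: float) -> bool:
        # Forward a site's audio when it is loud enough or contains enough voice.
        rms = float(np.sqrt(np.mean(np.square(pcm))))
        return rms > VOLUME_THRESHOLD or voiced_seconds > MIN_VOICE_SECONDS

    def select_sites(site_audio: dict) -> list:
        # site_audio maps site id -> (pcm samples, detected voice seconds).
        return [site for site, (pcm, voiced) in site_audio.items()
                if should_forward(pcm, voiced)]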
208. The recording server decodes the audio code stream to obtain audio data and performs voice segmentation on the audio data to obtain segmented audio data.

After obtaining the audio code stream, the recording server can decode it to obtain the audio data and the site identifier, and then store the audio data according to the site identifier. At the same time, the recording server performs voice segmentation on the audio data according to the sound source azimuth of the audio data and human-voice detection technology, thereby obtaining segmented audio data. It can be understood that, in this embodiment, by segmenting according to the sound source azimuth together with human-voice detection, the recording server can further subdivide the audio data reported by the video conference terminal. For example, suppose the video conference terminal detects by human-voice detection that a human voice is continuously present from 00:00:15 to 00:00:30 and therefore treats the audio collected in that interval as a single segment; in fact, one speaker at sound source azimuth 1 is speaking from 00:00:15 to 00:00:25, and another speaker at sound source azimuth 2 is speaking from 00:00:25 to 00:00:30. When the recording server re-segments the speech according to sound source azimuth and human-voice detection, the audio data is therefore divided into two segments.
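The sketch below illustrates this azimuth-aware segmentation under simple assumptions: each audio frame carries a timestamp, a voice-activity flag, and an azimuth, and a new segment is cut whenever the voice stops or the azimuth jumps by more than a threshold (the threshold value is an assumption).

    AZIMUTH_JUMP_DEG = 15.0     # assumed jump indicating a different sound source

    def segment(frames):
        # frames: iterable of (t_seconds, is_voiced, azimuth_deg) per audio frame.
        segments, cur = [], None
        for t, voiced, az in frames:
            if not voiced:
                if cur is not None:
                    segments.append(cur)
                    cur = None
            elif cur is None:
                cur = {"start": t, "end": t, "azimuth": az}
            elif abs(az - cur["azimuth"]) > AZIMUTH_JUMP_DEG:
                segments.append(cur)                 # speaker change mid-voice
                cur = {"start": t, "end": t, "azimuth": az}
            else:
                cur["end"] = t
        if cur is not None:
            segments.append(cur)
        return segments

Fed frames that are voiced from t = 15 s to t = 30 s with an azimuth jump at t = 25 s, this sketch returns two segments rather than one, matching the example above.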
209. When a segment of audio data meets the minimum length for voiceprint recognition, the recording server extracts the voiceprint feature of that segment.

When a segment of audio data meets the minimum length for voiceprint recognition, the recording server extracts a voiceprint feature from the segment using techniques such as voiceprint clustering and labels it with a voiceprint identifier. In one exemplary solution, suppose the recording server divides the audio data into 10 segments, of which 8 are long enough to satisfy the minimum length for voiceprint recognition; the recording server then extracts voiceprint features from these 8 segments and labels them with voiceprint identifiers (voiceprint 1 to voiceprint 8).
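A sketch of this labelling step, assuming a time-indexable audio source and a speaker-embedding function supplied by the caller; both are hypothetical stand-ins for a real audio store and a real voiceprint model, and the two-second minimum is likewise an assumption.

    MIN_VOICEPRINT_SECONDS = 2.0   # assumed minimum length for voiceprint recognition

    def label_voiceprints(segments, clip, embed):
        # clip(start_s, end_s) returns the audio samples for that interval;
        # embed(samples) returns a speaker-embedding vector.
        labelled, n = [], 0
        for seg in segments:
            if seg["end"] - seg["start"] < MIN_VOICEPRINT_SECONDS:
                continue                      # too short for a reliable voiceprint
            n += 1
            labelled.append({**seg,
                             "vp_id": f"VP{n:02d}",
                             "embedding": embed(clip(seg["start"], seg["end"]))})
        return labelled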
210. The recording server determines the speaker identity for each segment of audio data according to the identity recognition result and the voiceprint feature of the segment.

The recording server integrates and analyses the received identity recognition result together with the voiceprint feature of the segmented audio data to determine the speaker identity of the segment.

Specifically, any of the following approaches may be used:

In one possible implementation, if the identity recognition result indicates that a first segment of audio data corresponds to unique speaker information, the conference record processing apparatus determines the speaker corresponding to the first segment according to the unique speaker information indicated by the identity recognition result.

In another possible implementation, if the identity recognition result indicates that the first segment of audio data corresponds to at least two pieces of speaker identity information, the conference record processing apparatus compares the voiceprint feature of the first segment with the voiceprint feature of a second segment of audio data, where the second segment is obtained by the conference record processing apparatus performing voice segmentation on the audio data and corresponds to a unique speaker; if the voiceprint feature of the first segment is consistent with the voiceprint feature of the second segment, the conference record processing apparatus determines the speaker corresponding to the first segment according to the speaker identity information corresponding to the second segment.

In another possible implementation, if the identity recognition result indicates that the first segment of audio data corresponds to at least two pieces of speaker identity information, the conference record processing apparatus determines the speaker corresponding to the first segment according to the speaker identity information and voiceprint feature corresponding to the first segment together with the speaker identity information and voiceprint feature corresponding to the second segment, where the second segment is obtained by the conference record processing apparatus performing voice segmentation on the audio data and itself corresponds to at least two pieces of speaker identity information. That is, the conference record processing apparatus can jointly evaluate the voiceprint features and corresponding speaker identity information of several segments of audio data to determine the speaker of each segment. In this embodiment, both the first segment and the second segment of audio data are obtained by the conference record processing apparatus through voice segmentation. For details, refer to the record of a current conference shown in Table 1:
Table 1
(Table 1 is reproduced only as images in the original publication: PCTCN2021098297-appb-000001 and PCTCN2021098297-appb-000002. As described below, each row records a speaking time segment together with its voiceprint feature, e.g. VP03 or VP04, and the candidate speaker identity information from the identity recognition result, e.g. User03 or body04.)
From rows 1 to 3 of the table above, there is a one-to-one correspondence between voiceprint feature and speaker, so the speakers of the audio data shown in rows 1 to 3 can be determined directly. Where the identity recognition result yields several speakers, the recording server can integrate the voiceprint feature of the segment with the voiceprint features and identity recognition results of other segments whose speakers have already been determined, and so obtain the speaker of the segment. As shown in row 4, the identity recognition result contains the user identity ID User03 and the user body-attribute ID body04, with voiceprint feature VP04. In this case, the speaker indicated by body04 may have been reading a script with head bowed while User03 happened to face the camera of the video conference terminal, so that body04 and User03 could not be separated when the image information was collected. From row 3 it is known that the voiceprint feature corresponding to User03 is VP03; therefore, given that the voiceprint feature in row 4 is VP04, the speaker of row 4 can be determined to be not User03 but body04, and the voiceprint feature corresponding to body04 is VP04. Similarly, a unique speaker can be determined for rows 5 and 8. For rows 6, 7 and 9, User05 and User06 can never be told apart, and their voiceprint features cannot be distinguished either, so the speaker cannot be uniquely determined. For rows 10 and 11, the voiceprint feature is VP07 in both cases, but the corresponding candidate speaker identities have the unique intersection User07. In this case, the speaker indicated by User07 may have spoken in both the time period of row 10 and that of row 11, while User08 happened to face the camera during the period of row 10 but was separate from User07 when image information was collected during the period of row 11, and User06 happened to face the camera during the period of row 11 but was separate from User07 during the period of row 10. Combining rows 10 and 11, it can therefore be inferred that the speaker corresponding to voiceprint feature VP07 is User07.
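A hedged sketch of this cross-referencing follows. It covers two of the strategies described above: inheriting a speaker from an already-resolved segment with a matching voiceprint, and taking the unique intersection of candidate sets across segments that share a voiceprint (the rows 10 and 11 case). The segment dictionaries and the same_voice comparison function are illustrative stand-ins, not elements of the publication.

    def resolve_speakers(segments, same_voice):
        # segments: dicts with 'vp_id', 'embedding', and 'candidates' (the speaker
        # IDs from the image-based identity result); same_voice(e1, e2) -> bool
        # stands in for a real voiceprint comparison.
        resolved = {}
        # Pass 1: segments whose image identity is already unambiguous.
        for s in segments:
            if len(s["candidates"]) == 1:
                resolved[s["vp_id"]] = s["candidates"][0]
        # Pass 2: resolve ambiguous segments via voiceprint peers.
        for s in segments:
            if s["vp_id"] in resolved:
                continue
            peers = [p for p in segments
                     if p is not s and same_voice(s["embedding"], p["embedding"])]
            for p in peers:
                if p["vp_id"] in resolved:           # inherit a resolved speaker
                    resolved[s["vp_id"]] = resolved[p["vp_id"]]
                    break
            else:
                common = set(s["candidates"])        # unique-intersection strategy
                for p in peers:
                    common &= set(p["candidates"])
                if len(common) == 1:
                    resolved[s["vp_id"]] = common.pop()
        return resolved

For example, two segments sharing VP07 with candidate sets {User07, User08} and {User06, User07} intersect to {User07}, reproducing the conclusion drawn from rows 10 and 11.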
If a unique speaker still cannot be determined in the above manner, the recording server can compare the voiceprint features and identity recognition results of the current conference against the long-term voiceprint feature record of the conference site for a further judgment. That is, the recording server compares the voiceprint feature of a first speaker in the current conference of the first site with the voiceprint feature of that first speaker in the long-term voiceprint feature record, the first speaker being a speaker whose correspondence with segmented audio data has already been determined in the current conference of the first site. If the first speaker's voiceprint feature in the current conference of the first site is consistent with the first speaker's voiceprint feature in the long-term record, the recording server compares the voiceprint feature corresponding to the first segment of audio data with the voiceprint features in the long-term voiceprint feature record of the first site to determine the speaker corresponding to the first segment. For details, refer to the exemplary long-term voiceprint feature record shown in Table 2:
Table 2
(Table 2 is likewise reproduced only as images in the original publication: PCTCN2021098297-appb-000003 and PCTCN2021098297-appb-000004. As described below, each row records a conference identifier, e.g. Conf01 or Conf02, a conference room, e.g. Site01, a channel identifier, a voiceprint feature, and its candidate speakers.)
Suppose Conf02 is the analysis result shown in Table 1 above; the recording server can then compare User01's voiceprint feature against his most recent voiceprint features in conference room Site01. For example, the voiceprint features of User01 in conference room Site01 in Conf01 and in Conf02 are compared; if the comparison shows that the difference between User01's voiceprint features in the two conferences satisfies the threshold requirement, the recording server can conclude that the channels of the two conferences in conference room Site01 are consistent, and hence that the long-term voiceprint feature record can be used for reference. For example, row 7 of Table 2 shows that for voiceprint feature VP05 the candidate speakers are User05 and User08, while row 3 of Table 2 shows that for voiceprint feature VP05 the candidate speakers are User05, User06 and User07. The number of appearances of each candidate speaker under that channel identifier can therefore be counted, and the single speaker who appears most often, User05, is taken as the speaker corresponding to voiceprint feature VP05. Once the speaker of voiceprint feature VP05 has been determined, it can be determined that the speaker corresponding to voiceprint feature VP06 in Table 1 is User06.
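The counting step at the end of this procedure can be sketched as follows; the channel-consistency precheck (comparing a known speaker's voiceprint across conferences) is assumed to have passed, and the data shapes are illustrative assumptions.

    from collections import Counter

    def resolve_from_history(candidates_now, history_rows):
        # candidates_now: candidate speaker set for one unresolved voiceprint in
        # the current conference. history_rows: candidate sets recorded for the
        # matching voiceprint in earlier conferences on the same channel.
        counts = Counter()
        for row in [candidates_now, *history_rows]:
            counts.update(row)
        if not counts:
            return None
        (best, n), *rest = counts.most_common()
        if not rest or n > rest[0][1]:      # a unique most-frequent candidate
            return best
        return None                          # still ambiguous

    # e.g. resolve_from_history({"User05", "User08"},
    #                           [{"User05", "User06", "User07"}])  -> "User05"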
Suppose there is another conference, Conf03, in which User01 and User01's corresponding voiceprint feature are also present. The recording server then compares User01's voiceprint features in Conf01 and Conf03; if the comparison shows that the difference between User01's voiceprint features in the two conferences does not satisfy the threshold requirement, the recording server can register the voiceprint features and speaker information of Conf03 and update the long-term voiceprint feature record, for example as shown in rows 8 to 10 of Table 2. It can be understood that a change in the channel corresponding to a conference may mean that the conference room has changed or that the equipment involved in the conference has changed. As shown in Table 2, Conf03 and Conf02 took place in the same conference room (Site01) while the channel identifier changed, so it may be that the video conference terminal of Conf03 differs from that of Conf02, or that the multipoint control unit of Conf03 differs from that of Conf02.

It can be understood that the conference record processing apparatus may perform the long-term conference analysis (the analysis of Table 2) after the short-term conference analysis (the analysis of Table 1), or perform the short-term analysis after the long-term analysis; as long as the audio data can ultimately be distinguished, the specific order of operations is not limited here.
211. The recording server sends the audio data and the classification result of the audio data to the ASR server.

After the recording server has finished matching the audio data with speakers, it sends the classification result and the audio data to the ASR server.

212. The ASR server outputs the audio data as text.

In this embodiment, the video conference terminal collects the corresponding image information according to sound source localization and performs preliminary portrait recognition on that image information to obtain an identity recognition result; after obtaining the identity recognition result, the recording server combines it with the voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without requiring the users' voiceprint features to be registered in advance.
It can be understood that the functions of the recording server can also be integrated into the multipoint control unit. Referring to FIG. 4, an embodiment of the audio data processing method in the embodiments of this application includes:

401-405 are the same as 201-205 in the foregoing embodiment and are not repeated here.

406. The video conference terminal sends the audio code stream and the identity recognition result to the multipoint control unit.

For the manner of sending the audio code stream, refer to 206 above. The difference is that in this step the identity recognition result is also sent to the multipoint control unit.

407. The multipoint control unit decodes the audio code stream to obtain audio data and performs voice segmentation on the audio data to obtain the segmented audio data.

After obtaining the audio code stream, the multipoint control unit determines, according to the conference identifier assigned to the video conference terminal, the conference site to which the terminal belongs, decodes the audio code stream to obtain the audio data, and stores the audio data according to the site identifier. At the same time, the multipoint control unit performs voice segmentation on the audio data according to the sound source azimuth of the audio data and human-voice detection technology; the specific segmentation manner can be the same as in step 208 of the foregoing embodiment and is not repeated here. For the specific implementation of 408-410, refer to 209-211; the difference is that 408-410 are performed by the multipoint control unit, whereas 209-211 are performed by the recording server.
It can be understood that the functions of the recording server can also be implemented in the video conference terminal. Referring to FIG. 5, an embodiment of the audio data processing method in the embodiments of this application includes:

501-502 are the same as 201-202 in the foregoing embodiment and are not repeated here.

503. The video conference terminal performs voice segmentation on the audio data through human-voice detection and sound source localization to obtain segmented audio data.

For the manner in which the video conference terminal performs voice segmentation, refer to 208 above; details are not repeated here.

504-505 are the same as 204-205 in the foregoing embodiment and are not repeated here.

506-508 are implemented similarly to 209-211; the difference is that steps 506-508 are performed by the video conference terminal, whereas steps 209-211 are performed by the recording server.

509. The ASR server outputs the audio data as text.

In this embodiment, the video conference terminal collects the corresponding image information according to sound source localization, performs preliminary portrait recognition on that image information to obtain an identity recognition result, and then combines the identity recognition result with the voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without requiring the users' voiceprint features to be registered in advance.
Referring to FIG. 6, an embodiment of the audio data processing method in the embodiments of this application includes:

601. A conference record processing apparatus obtains audio data of a first conference site, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information.

The conference record processing apparatus may be the recording server in the method embodiment shown in FIG. 2, the multipoint control unit in the method embodiment shown in FIG. 4, or the video conference terminal in the method embodiment shown in FIG. 5.
In one application scenario, when the conference record processing apparatus is the recording server in the method embodiment shown in FIG. 2, it receives the audio data and the corresponding sound source azimuth information sent by the multipoint control unit. The audio data and its corresponding sound source azimuth information may be packaged to generate an audio code stream with additional field information, the additional field information including the sound source azimuth information corresponding to the audio data. In one exemplary solution, the video conference terminal encodes the audio data into an audio code stream and adds additional field information to the corresponding audio code stream, the additional field information indicating the sound source azimuth information corresponding to the audio data. The video conference terminal then sends the audio code stream to the multipoint control unit; after receiving it, the multipoint control unit determines, according to the conference identifier assigned to the video conference terminal, the conference site to which the terminal belongs, adds the site identifier to the audio code stream, and sends the audio code stream to the recording server. In one possible implementation, the multipoint control unit may screen the audio data of the individual conference sites and forward the audio data of only one or several selected sites to the recording server: it may compare the volume of the audio data from each site and select for forwarding the audio data whose volume exceeds a preset threshold, or it may determine by an algorithm the audio data whose detected human-voice duration exceeds a preset threshold and forward that. The specific screening conditions are not limited here; screening in this way reduces the amount of data to be processed and speeds up processing. The identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition, and is sent by the video conference terminal directly to the recording server.

In another application scenario, when the conference record processing apparatus is the multipoint control unit in the method embodiment shown in FIG. 4, it receives the audio data and the corresponding sound source azimuth information sent by the video conference terminal. The audio data and its corresponding sound source azimuth information may be packaged to generate an audio code stream with additional field information, the additional field information including the sound source azimuth information corresponding to the audio data. In one exemplary solution, the video conference terminal encodes the audio data into an audio code stream, adds additional field information indicating the corresponding sound source azimuth information, and then sends the audio code stream to the multipoint control unit. The identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition, and is sent by the video conference terminal to the multipoint control unit.

In another application scenario, when the conference record processing apparatus is the video conference terminal in the method embodiment shown in FIG. 5, it directly collects the audio data of the current conference through a microphone and obtains the sound source azimuth information corresponding to the audio data through sound source localization technology. The identity recognition result is obtained by the video conference terminal according to sound source localization and portrait recognition.
602. The conference record processing apparatus performs voice segmentation on the audio data to obtain first segmented audio data of the audio data.

The conference record processing apparatus segments the audio data according to the sound source azimuth information and a human-voice detection method to obtain a plurality of segments of the audio data.

603. The conference record processing apparatus determines, according to the voiceprint feature of the first segmented audio data and the identity recognition result, the speaker corresponding to the first segmented audio data.

The conference record processing apparatus may perform the method shown in step 210 of FIG. 2, step 409 of FIG. 4, or step 507 of FIG. 5 above to obtain the speaker corresponding to the audio data; details are not repeated here.

In this embodiment, the conference record processing apparatus obtains an identity recognition result indicating the correspondence between speaker identity information and speaker time information, and then combines that identity recognition result with voiceprint features to further identify the audio data. In this way, accurate classification of the voice data can be achieved without requiring the users' voiceprint features to be registered in advance.
The audio data processing method in the embodiments of this application has been described above; the conference record processing apparatus and the video conference terminal in the embodiments of this application are described below.

Referring to FIG. 7, a conference record processing apparatus 700 in an embodiment of this application includes an acquisition module 701 and a processing module 702, where the acquisition module 701 and the processing module 702 are connected through a bus. The conference record processing apparatus 700 may be the recording server in the method embodiment shown in FIG. 2, the multipoint control unit in the method embodiment shown in FIG. 4, or the video conference terminal in the method embodiment shown in FIG. 5, and may also be configured as one or more chips within those devices. The conference record processing apparatus 700 may be used to perform some or all of the functions of those devices.

For example, the acquisition module 701 obtains audio data of a first conference site, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information; the processing module 702 performs voice segmentation on the audio data to obtain first segmented audio data of the audio data, and determines, according to the voiceprint feature of the first segmented audio data and the identity recognition result, the speaker corresponding to the first segmented audio data.
Optionally, the audio data is contained in an audio code stream, and the audio code stream further includes additional field information, the additional field information including the sound source azimuth information corresponding to the audio data.

Optionally, the processing module 702 is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, determine the speaker corresponding to the first segmented audio data according to that speaker identity information.

Optionally, the processing module 702 is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, compare the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, the second segmented audio data being obtained by the conference record processing apparatus through voice segmentation of the audio data and corresponding to unique speaker identity information; and, if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, determine the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.

Optionally, the processing module 702 is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, determine the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data together with the speaker identity information and voiceprint feature corresponding to second segmented audio data, the second segmented audio data being obtained by the conference record processing apparatus through voice segmentation of the audio data and corresponding to at least two pieces of speaker identity information.

Optionally, the processing module 702 is specifically configured to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record includes the historical voiceprint feature record of the first conference site, the historical voiceprint feature record of the first conference site indicating the correspondence among voiceprint features, speakers, and channel identifiers.

Optionally, the processing module 702 is specifically configured to: compare the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of that first speaker in the long-term voiceprint feature record to obtain a comparison result, the first speaker being a speaker already determined in the current conference of the first conference site; and, if the comparison result indicates that the first speaker's voiceprint feature in the current conference of the first conference site is consistent with the first speaker's voiceprint feature in the long-term voiceprint feature record, compare the voiceprint feature corresponding to the first segmented audio data and the identity recognition result against the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.

Optionally, the processing module 702 is further configured to: compare the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of that first speaker in the long-term voiceprint feature record to obtain a comparison result, the first speaker being a speaker already determined in the current conference of the first conference site; and, if the comparison result indicates that the first speaker's voiceprint feature in the current conference of the first conference site is inconsistent with the first speaker's voiceprint feature in the long-term voiceprint feature record, register the voiceprint features, the channel identifier, and the speakers corresponding to the voiceprint features of the current conference of the first conference site, and update the long-term voiceprint feature record.
Optionally, the conference record processing apparatus 700 further includes a storage module coupled to the processing module, so that the processing module can execute the computer-executable instructions stored in the storage module to implement the functions of the conference record processing apparatus in the foregoing method embodiments. In one example, the storage module optionally included in the conference record processing apparatus 700 may be an on-chip storage unit such as a register or a cache, or it may be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, or the like.

It should be understood that the flow performed among the modules of the conference record processing apparatus in the embodiment corresponding to FIG. 7 above is similar to the flow performed by the conference record processing apparatus in the method embodiments corresponding to FIG. 2 to FIG. 6, and is not repeated here.
FIG. 8 shows a possible schematic structural diagram of a conference record processing apparatus 800 in the foregoing embodiments. The conference record processing apparatus 800 may be configured as the recording server in the method embodiment shown in FIG. 2, the multipoint control unit in the method embodiment shown in FIG. 4, or the video conference terminal in the method embodiment shown in FIG. 5. The conference record processing apparatus 800 may include a processor 802, a computer-readable storage medium/memory 803, a transceiver 804, an input device 805, an output device 806, and a bus 801, where the processor, the transceiver, the computer-readable storage medium, and so on are connected through the bus. The embodiments of this application do not limit the specific connection medium between the above components.

In one example, the transceiver 804 obtains audio data of a first conference site, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained through a portrait recognition method and the speaker's speaking time information.

The processor 802 performs voice segmentation on the audio data to obtain first segmented audio data of the audio data, and determines, according to the voiceprint feature of the first segmented audio data and the identity recognition result, the speaker corresponding to the first segmented audio data.

In one example, the processor 802 may include a baseband circuit that can, for example, modulate the audio data and generate an audio code stream. The transceiver 804 may include a radio frequency circuit that modulates and amplifies the audio code stream before sending it to the corresponding device in the conference system.

In yet another example, the processor 802 may run an operating system and control the functions between the individual devices and components. The transceiver 804 may include a baseband circuit and a radio frequency circuit; for example, the audio code stream or the identity recognition result may be processed by the baseband circuit and the radio frequency circuit and then sent to the corresponding device in the conference system.

The transceiver 804 and the processor 802 can implement the corresponding steps in any of the embodiments in FIG. 2 to FIG. 6 above; details are not repeated here.

It can be understood that FIG. 8 shows only a simplified design of the conference record processing apparatus. In practical applications, the conference record processing apparatus may contain any number of transceivers, processors, memories, and so on, and all conference record processing apparatuses that can implement this application fall within the protection scope of this application.
The processor 802 involved in the above apparatus 800 may be a general-purpose processor, such as a CPU, a network processor (NP), or a microprocessor, or may be an ASIC or one or more integrated circuits for controlling the execution of the programs of the solutions of this application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The controller/processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The processor generally performs logical and arithmetic operations based on program instructions stored in the memory.

The bus 801 mentioned above may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or one type of bus.

The computer-readable storage medium/memory 803 mentioned above may also store an operating system and other application programs. Specifically, a program may include program code, and the program code includes computer operation instructions. More specifically, the above memory may be a ROM, another type of static storage device that can store static information and instructions, a RAM, another type of dynamic storage device that can store information and instructions, a disk storage, and so on. The memory 803 may be a combination of the above storage types. The above computer-readable storage medium/memory may be in the processor, may be outside the processor, or may be distributed over multiple entities including the processor or processing circuits. The above computer-readable storage medium/memory may be embodied in a computer program product; by way of example, a computer program product may include a computer-readable medium in packaging materials.

Alternatively, an embodiment of this application further provides a general-purpose processing system, commonly referred to as a chip, which includes one or more microprocessors providing processor functions and an external memory providing at least a part of a storage medium, all connected together with other supporting circuits through an external bus architecture. When the instructions stored in the memory are executed by the processor, the processor is caused to perform some or all of the steps of the data transmission method performed by the first communication apparatus in the embodiments of FIG. 2 to FIG. 6, and/or other processes of the techniques described in this application.

The steps of the methods or algorithms described in connection with the disclosure of this application may be implemented in hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a RAM memory, a flash memory, a ROM memory, an EPROM memory, an EEPROM memory, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium; of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in the conference record processing apparatus. Of course, the processor and the storage medium may also exist in the conference record processing apparatus as discrete components.
Referring to FIG. 9, a video conference terminal 900 in an embodiment of this application includes a processing module 901 and a sending module 902, where the processing module 901 and the sending module 902 are connected through a bus. The video conference terminal 900 may be the video conference terminal in the foregoing method embodiments, and may also be configured as one or more chips within that video conference terminal. The video conference terminal 900 may be used to perform some or all of the functions of the above video conference terminal.

For example, the processing module 901 performs sound source localization on audio data of a first conference site to obtain sound source azimuth information corresponding to the audio data, and obtains an identity recognition result according to the sound source azimuth and a portrait recognition method, the identity recognition result indicating the correspondence between speaker identity information and speaking time information; the sending module 902 sends the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to the conference record processing apparatus.

Optionally, the processing module 901 is specifically configured to: obtain the portrait information corresponding to the sound source azimuth; perform image recognition on the portrait information to obtain face information and/or body attribute information; determine the speaker identity information according to the face information and/or the body attribute information; and establish a correspondence between the speaker time information and the speaker identity information to obtain the identity recognition result.

Optionally, the video conference terminal 900 further includes a storage module coupled to the processing module, so that the processing module can execute the computer-executable instructions stored in the storage module to implement the functions of the video conference terminal in the foregoing method embodiments. In one example, the storage module optionally included in the video conference terminal 900 may be an on-chip storage unit such as a register or a cache, or it may be a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, or the like.

It should be understood that the flow performed among the modules of the video conference terminal in the embodiment corresponding to FIG. 9 above is similar to the flow performed by the video conference terminal in the method embodiments corresponding to FIG. 2 to FIG. 6, and is not repeated here.
FIG. 10 shows a possible schematic structural diagram of a video conference terminal 1000 in the foregoing embodiments. The video conference terminal 1000 may be configured as the aforementioned video conference terminal and may include a processor 1002, a computer-readable storage medium/memory 1003, a transceiver 1004, an input device 1005, an output device 1006, and a bus 1001, where the processor, the transceiver, the computer-readable storage medium, and so on are connected through the bus. The embodiments of this application do not limit the specific connection medium between the above components.

In one example, the processor 1002 performs sound source localization on audio data of a first conference site to obtain sound source azimuth information corresponding to the audio data, and obtains an identity recognition result according to the sound source azimuth and a portrait recognition method, the identity recognition result indicating the correspondence between speaker identity information and speaking time information.

The transceiver 1004 sends the identity recognition result and the audio data to the conference record processing apparatus.

In one example, the processor 1002 may include a baseband circuit that can, for example, modulate the audio data and generate an audio code stream. The transceiver 1004 may include a radio frequency circuit that modulates and amplifies the audio code stream before sending it to the corresponding device in the conference system.

In yet another example, the processor 1002 may run an operating system and control the functions between the individual devices and components. The transceiver 1004 may include a baseband circuit and a radio frequency circuit; for example, the audio code stream or the identity recognition result may be processed by the baseband circuit and the radio frequency circuit and then sent to the corresponding device in the conference system.
The transceiver 1004 and the processor 1002 can implement the corresponding steps in any of the embodiments in FIG. 2 to FIG. 6 above; details are not repeated here.
It can be understood that FIG. 10 shows only a simplified design of the video conference terminal. In practical applications, the video conference terminal may contain any number of transceivers, processors, memories, and so on, and all video conference terminals that can implement this application fall within the protection scope of this application.
上述装置1000中涉及的处理器1002可以是通用处理器,例如CPU、网络处理器(network processor,NP)、微处理器等,也可以是ASIC,或一个或多个用于控制本申请方案程序执行的集成电路。还可以是数字信号处理器(digital signal processor,DSP)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。控制器/处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等等。处理器通常是基于存储器内存储的程序指令来执行逻辑和算术运算。The processor 1002 involved in the above-mentioned apparatus 1000 may be a general-purpose processor, such as a CPU, a network processor (NP), a microprocessor, etc., or an ASIC, or one or more programs used to control the solution of the present application implemented integrated circuits. It can also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components. A controller/processor may also be a combination that implements computing functions, such as a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, and the like. Processors typically perform logical and arithmetic operations based on program instructions stored in memory.
上述涉及的总线1001可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图10中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。The above-mentioned bus 1001 may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (EISA for short) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 10, but it does not mean that there is only one bus or one type of bus.
The computer-readable storage medium/memory 1003 may further store an operating system and other application programs. Specifically, the program may include program code, and the program code includes computer operation instructions. More specifically, the memory may be a ROM, another type of static storage device that can store static information and instructions, a RAM, another type of dynamic storage device that can store information and instructions, a disk storage, or the like. The memory 1003 may also be a combination of the above storage types. The computer-readable storage medium/memory may reside in the processor, outside the processor, or be distributed across multiple entities including the processor or a processing circuit. The computer-readable storage medium/memory may be embodied in a computer program product. By way of example, the computer program product may include a computer-readable medium in packaging materials.
Alternatively, an embodiment of this application further provides a general-purpose processing system, commonly referred to as a chip, including one or more microprocessors that provide the processor function and an external memory that provides at least a part of the storage medium, all of which are connected to other supporting circuits through an external bus architecture. When the instructions stored in the memory are executed by the processor, the processor is caused to perform some or all of the steps of the data transmission method performed by the first communication apparatus in the embodiments of FIG. 2 to FIG. 6, and/or other processes of the techniques described in this application.
The steps of the methods or algorithms described in connection with the disclosure of this application may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a RAM, a flash memory, a ROM, an EPROM, an EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Certainly, the storage medium may alternatively be a component of the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in the video conference terminal. Certainly, the processor and the storage medium may also exist in the video conference terminal as discrete components.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (25)

  1. A method for processing audio data, comprising:
    obtaining, by a conference record processing apparatus, audio data of a first conference site in a current conference, sound source direction information corresponding to the audio data, and an identity recognition result, wherein the identity recognition result is used to indicate a correspondence between speaker identity information obtained by a portrait recognition method and speaking time information of a speaker;
    performing, by the conference record processing apparatus, voice segmentation on the audio data to obtain first segmented audio data of the audio data; and
    determining, by the conference record processing apparatus, a speaker corresponding to the first segmented audio data according to a voiceprint feature of the first segmented audio data and the identity recognition result.
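Purely as an illustrative sketch of the attribution flow in claim 1 above (segment_speech is an assumed helper, and the tuple layout of the identity result is an assumption for illustration, not a format defined by this application):

    from dataclasses import dataclass
    from typing import Callable, List, Optional, Set, Tuple

    @dataclass
    class Segment:
        audio: bytes
        start_ms: int
        end_ms: int
        speaker: Optional[str] = None   # filled in during attribution

    # identity result entries: (speaker identity, speaking start ms, speaking end ms)
    IdentityResult = List[Tuple[str, int, int]]

    def overlapping_speakers(identity_result: IdentityResult, seg: Segment) -> Set[str]:
        """Speaker identities whose speaking time overlaps this segment."""
        return {sid for sid, start, end in identity_result
                if start < seg.end_ms and seg.start_ms < end}

    def attribute(audio: bytes, identity_result: IdentityResult,
                  segment_speech: Callable[[bytes], List[Segment]]) -> List[Segment]:
        segments = segment_speech(audio)    # voice segmentation (assumed helper)
        for seg in segments:
            ids = overlapping_speakers(identity_result, seg)
            if len(ids) == 1:               # the identity result pins a unique speaker
                seg.speaker = ids.pop()
            # segments with 0 or >= 2 candidates are resolved by voiceprint (claims 4 to 6)
        return segments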
  2. The method according to claim 1, wherein the audio data is contained in an audio code stream, the audio code stream further comprises additional field information, and the additional field information comprises the sound source direction information corresponding to the audio data.
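One way to picture the code stream of claim 2 is a frame whose additional field carries the sound source direction; the byte layout below (a 2-byte payload length, a 4-byte float azimuth, then the encoded payload) is only an assumption for illustration, not a format defined by this application:

    import struct

    _HEADER = "<Hf"   # assumed layout: u16 payload length + f32 azimuth in degrees

    def pack_frame(audio_payload: bytes, azimuth_deg: float) -> bytes:
        """Packs one encoded audio frame together with the additional field."""
        return struct.pack(_HEADER, len(audio_payload), azimuth_deg) + audio_payload

    def unpack_frame(frame: bytes) -> tuple:
        """Recovers the audio payload and the sound source direction."""
        length, azimuth = struct.unpack_from(_HEADER, frame)
        payload = frame[struct.calcsize(_HEADER):][:length]
        return payload, azimuth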
  3. The method according to claim 1 or 2, wherein the determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result comprises:
    if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the speaker identity information.
  4. The method according to claim 1 or 2, wherein the determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result comprises:
    if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, comparing, by the conference record processing apparatus, the voiceprint feature of the first segmented audio data with a voiceprint feature of second segmented audio data, wherein the second segmented audio data is obtained by the conference record processing apparatus by performing voice segmentation on the audio data, and the second segmented audio data corresponds to unique speaker identity information; and
    if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
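A sketch of this disambiguation step follows; the embedding function embed and the similarity threshold are assumptions standing in for whatever voiceprint model is used:

    import math
    from typing import Callable, Sequence

    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def resolve_by_voiceprint(ambiguous, resolved, embed: Callable, thr: float = 0.75) -> bool:
        """For a segment with two or more candidate speakers, reuses the speaker of
        a segment that already has a unique speaker and a matching voiceprint."""
        vp = embed(ambiguous.audio)
        for ref in resolved:                         # segments with a unique speaker
            if cosine(vp, embed(ref.audio)) >= thr:  # voiceprint features "consistent"
                ambiguous.speaker = ref.speaker
                return True
        return False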
  5. The method according to claim 1 or 2, wherein the determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result comprises:
    if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data and speaker identity information and a voiceprint feature corresponding to second segmented audio data, wherein the second segmented audio data is obtained by the conference record processing apparatus by performing voice segmentation on the audio data, and the second segmented audio data corresponds to at least two pieces of speaker identity information.
  6. The method according to any one of claims 1 to 4, wherein if the conference record processing apparatus does not determine a unique speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result, the method further comprises:
    determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, wherein the long-term voiceprint feature record comprises a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used to indicate a correspondence among voiceprint features, speakers, and channel identifiers.
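The long-term voiceprint feature record of claim 6 can be pictured as a table keyed by channel identifier; the structure and matching rule below are assumptions for illustration only:

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Optional, Sequence, Tuple

    @dataclass
    class LongTermRecord:
        # channel identifier -> list of (stored voiceprint embedding, speaker identity)
        entries: Dict[str, List[Tuple[Sequence[float], str]]] = field(default_factory=dict)

        def lookup(self, channel: str, vp: Sequence[float],
                   similarity: Callable[[Sequence[float], Sequence[float]], float],
                   thr: float = 0.75) -> Optional[str]:
            """Returns the historical speaker whose stored voiceprint matches vp."""
            for stored_vp, speaker in self.entries.get(channel, []):
                if similarity(vp, stored_vp) >= thr:
                    return speaker
            return None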
  7. The method according to claim 6, wherein the determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site comprises:
    comparing, by the conference record processing apparatus, a voiceprint feature of a first speaker in the current conference of the first conference site with a voiceprint feature of the first speaker in the long-term voiceprint feature record, wherein the first speaker is a speaker already determined in the current conference of the first conference site; and
    if the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, determining, by the conference record processing apparatus, the speaker corresponding to the first segmented audio data based on the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site.
  8. The method according to claim 6, wherein the method further comprises:
    comparing, by the conference record processing apparatus, a voiceprint feature of a first speaker in the current conference of the first conference site with a voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, wherein the first speaker is a speaker already determined in the current conference of the first conference site; and
    if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is inconsistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, registering, by the conference record processing apparatus, the voiceprint feature in the current conference of the first conference site, the channel identifier, and the speaker corresponding to the voiceprint feature, and updating the long-term voiceprint feature record.
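Continuing the same sketch, the register-and-update step of claim 8 might look like the following (the structure and threshold remain assumptions):

    def register_or_update(record: LongTermRecord, channel: str,
                           vp, speaker: str, similarity, thr: float = 0.75) -> None:
        """Registers the current voiceprint for this channel, replacing a stored
        voiceprint for the same speaker when the two are no longer consistent."""
        entries = record.entries.setdefault(channel, [])
        for i, (stored_vp, stored_speaker) in enumerate(entries):
            if stored_speaker == speaker:
                if similarity(vp, stored_vp) < thr:   # inconsistent with the history
                    entries[i] = (vp, speaker)        # update the long-term record
                return
        entries.append((vp, speaker))                 # first-time registration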
  9. A method for processing audio data, comprising:
    performing, by a video conference terminal, sound source localization on audio data of a first conference site to obtain sound source direction information corresponding to the audio data;
    obtaining, by the video conference terminal, an identity recognition result according to the sound source direction and a portrait recognition method, wherein the identity recognition result is used to indicate a correspondence between speaker identity information and speaking time information; and
    sending, by the video conference terminal, the identity recognition result, the audio data, and the sound source direction information corresponding to the audio data to a conference record processing apparatus.
  10. The method according to claim 9, wherein the obtaining, by the video conference terminal, an identity recognition result according to the sound source direction and a portrait recognition method comprises:
    obtaining, by the video conference terminal, portrait information corresponding to the sound source direction;
    performing, by the video conference terminal, image recognition on the portrait information to obtain face information and/or body attribute information;
    determining, by the video conference terminal, the speaker identity information according to the face information and/or the body attribute information; and
    establishing, by the video conference terminal, a correspondence between speaking time information and the speaker identity information to obtain the identity recognition result.
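An illustrative decomposition of the identification step of claim 10; recognize_face and recognize_body stand in for whatever face and body attribute recognizers are deployed (assumed helpers, not APIs defined here):

    from typing import Callable, Optional

    def identify_speaker(
        portrait: bytes,
        recognize_face: Callable[[bytes], Optional[str]],
        recognize_body: Callable[[bytes], Optional[str]],
    ) -> Optional[str]:
        """Determines speaker identity from the portrait cut out at the sound
        source direction: face information first, body attributes as a fallback."""
        face_id = recognize_face(portrait)   # face information
        if face_id is not None:
            return face_id
        return recognize_body(portrait)      # body attribute information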
  11. A conference record processing apparatus, comprising:
    an obtaining module, configured to obtain audio data of a first conference site, sound source direction information corresponding to the audio data, and an identity recognition result, wherein the identity recognition result is used to indicate a correspondence between speaker identity information obtained by a portrait recognition method and speaking time information of a speaker; and
    a processing module, configured to perform voice segmentation on the audio data to obtain first segmented audio data of the audio data, and determine a speaker corresponding to the first segmented audio data according to a voiceprint feature of the first segmented audio data and the identity recognition result.
  12. The apparatus according to claim 11, wherein the audio data is contained in an audio code stream, the audio code stream further comprises additional field information, and the additional field information comprises the sound source direction information corresponding to the audio data.
  13. The apparatus according to claim 11 or 12, wherein the processing module is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, determine the speaker corresponding to the first segmented audio data according to the speaker identity information.
  14. The apparatus according to claim 11 or 12, wherein the processing module is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, compare the voiceprint feature of the first segmented audio data with a voiceprint feature of second segmented audio data, wherein the second segmented audio data is obtained by the conference record processing apparatus by performing voice segmentation on the audio data, and the second segmented audio data corresponds to unique speaker identity information; and
    if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, determine the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
  15. The apparatus according to claim 11 or 12, wherein the processing module is specifically configured to: if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, determine the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data and speaker identity information and a voiceprint feature corresponding to second segmented audio data, wherein the second segmented audio data is obtained by the conference record processing apparatus by performing voice segmentation on the audio data, and the second segmented audio data corresponds to at least two pieces of speaker identity information.
  16. The apparatus according to any one of claims 11 to 15, wherein the processing module is further configured to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, wherein the long-term voiceprint feature record comprises a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used to indicate a correspondence among voiceprint features, speakers, and channel identifiers.
  17. The apparatus according to claim 16, wherein the processing module is specifically configured to: compare a voiceprint feature of a first speaker in the current conference of the first conference site with a voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, wherein the first speaker is a speaker already determined in the current conference of the first conference site; and
    if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, compare the voiceprint feature corresponding to the first segmented audio data and the identity recognition result with the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
  18. The apparatus according to claim 16, wherein the processing module is further configured to: compare a voiceprint feature of a first speaker in the current conference of the first conference site with a voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, wherein the first speaker is a speaker already determined in the current conference of the first conference site; and
    if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is inconsistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, register the voiceprint feature in the current conference of the first conference site, the channel identifier, and the speaker corresponding to the voiceprint feature, and update the long-term voiceprint feature record.
  19. A video conference terminal, comprising:
    a processing module, configured to perform sound source localization on audio data of a first conference site to obtain sound source direction information corresponding to the audio data, and obtain an identity recognition result according to the sound source direction and a portrait recognition method, wherein the identity recognition result is used to indicate a correspondence between speaker identity information and speaking time information; and
    a sending module, configured to send the identity recognition result, the audio data, and the sound source direction information corresponding to the audio data to a conference record processing apparatus.
  20. The video conference terminal according to claim 19, wherein the processing module is specifically configured to: obtain portrait information corresponding to the sound source direction; perform image recognition on the portrait information to obtain face information and/or body attribute information; determine the speaker identity information according to the face information and/or the body attribute information; and establish a correspondence between speaking time information and the speaker identity information to obtain the identity recognition result.
  21. A conference record processing apparatus, comprising at least one processor and a memory, wherein the processor is coupled to the memory, and the processor invokes instructions stored in the memory to control the apparatus to perform the method according to any one of claims 1 to 8.
  22. A video conference terminal, comprising at least one processor and a memory, wherein the processor is coupled to the memory, and the processor invokes instructions stored in the memory to control the terminal to perform the method according to claim 9 or 10.
  23. A conference record processing system, comprising the conference record processing apparatus according to any one of claims 11 to 18, the video conference terminal according to claim 19 or 20, a multipoint control unit, and an automatic speech recognition (ASR) server.
  24. A computer storage medium, wherein the computer storage medium stores computer instructions, and the computer instructions are used to perform the method according to any one of claims 1 to 10.
  25. A computer program product comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 10.
PCT/CN2021/098297 2020-09-25 2021-06-04 Audio data processing method, device and system WO2022062471A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011027427.2 2020-09-25
CN202011027427.2A CN114333853A (en) 2020-09-25 2020-09-25 Audio data processing method, equipment and system

Publications (1)

Publication Number Publication Date
WO2022062471A1 true WO2022062471A1 (en) 2022-03-31

Family

ID=80844861

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/098297 WO2022062471A1 (en) 2020-09-25 2021-06-04 Audio data processing method, device and system

Country Status (2)

Country Link
CN (1) CN114333853A (en)
WO (1) WO2022062471A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117059092A (en) * 2023-10-11 2023-11-14 北京吉道尔科技有限公司 Intelligent medical interactive intelligent diagnosis method and system based on blockchain
CN117528335A (en) * 2023-12-05 2024-02-06 广东鼎诺科技音频有限公司 Audio equipment applying directional microphone and noise reduction method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023212879A1 (en) * 2022-05-05 2023-11-09 北京小米移动软件有限公司 Object audio data generation method and apparatus, electronic device, and storage medium
CN115019809B (en) * 2022-05-17 2024-04-02 中国南方电网有限责任公司超高压输电公司广州局 Method, apparatus, device, medium and program product for monitoring false entry prevention interval

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968991A (en) * 2012-11-29 2013-03-13 华为技术有限公司 Method, device and system for sorting voice conference minutes
US20150235654A1 (en) * 2011-06-17 2015-08-20 At&T Intellectual Property I, L.P. Speaker association with a visual representation of spoken content
CN106657865A (en) * 2016-12-16 2017-05-10 联想(北京)有限公司 Method and device for generating conference summary and video conference system
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN109560941A (en) * 2018-12-12 2019-04-02 深圳市沃特沃德股份有限公司 Minutes method, apparatus, intelligent terminal and storage medium
CN110022454A (en) * 2018-01-10 2019-07-16 华为技术有限公司 A kind of method and relevant device identifying identity in video conference
CN110232925A (en) * 2019-06-28 2019-09-13 百度在线网络技术(北京)有限公司 Generate the method, apparatus and conference terminal of minutes
US20190318743A1 (en) * 2018-04-17 2019-10-17 Gong I.O Ltd. Metadata-based diarization of teleconferences
WO2019217133A1 (en) * 2018-05-07 2019-11-14 Microsoft Technology Licensing, Llc Voice identification enrollment
EP3627505A1 (en) * 2018-09-21 2020-03-25 Televic Conference NV Real-time speaker identification with diarization
CN111402892A (en) * 2020-03-23 2020-07-10 郑州智利信信息技术有限公司 Conference recording template generation method based on voice recognition

Also Published As

Publication number Publication date
CN114333853A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2022062471A1 (en) Audio data processing method, device and system
US11343446B2 (en) Systems and methods for implementing personal camera that adapts to its surroundings, both co-located and remote
US20190215464A1 (en) Systems and methods for decomposing a video stream into face streams
EP2172016B1 (en) Techniques for detecting a display device
US20160359941A1 (en) Automated video editing based on activity in video conference
US10230922B2 (en) Combining installed audio-visual sensors with ad-hoc mobile audio-visual sensors for smart meeting rooms
US20140340467A1 (en) Method and System for Facial Recognition for a Videoconference
WO2019184650A1 (en) Subtitle generation method and terminal
US9165182B2 (en) Method and apparatus for using face detection information to improve speaker segmentation
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
US11405584B1 (en) Smart audio muting in a videoconferencing system
US7792326B2 (en) Method of tracking vocal target
US20180288373A1 (en) Treatment method for doorbell communication
WO2021134720A1 (en) Method for processing conference data and related device
CN112908336A (en) Role separation method for voice processing device and voice processing device thereof
TWI798867B (en) Video processing method and associated system on chip
TW202405796A (en) Video processing method for performing partial highlighting with aid of auxiliary information detection, and associated system on chip
TWI805233B (en) Method and system for controlling multi-party voice communication
TWI764020B (en) Video conference system and method thereof
US20220415003A1 (en) Video processing method and associated system on chip
CN111182256A (en) Information processing method and server
US20230237621A1 (en) Video processing method and associated system on chip
CN117544745A (en) Video processing method and system chip for local emphasis by aid of auxiliary information
TW201939483A (en) Voice system and voice detection method
TW202405795A (en) Video processing method for performing partial highlighting with aid of hand gesture detection, and associated system on chip

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21870855; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21870855; Country of ref document: EP; Kind code of ref document: A1)