CN114333853A - Audio data processing method, device, and system

Info

Publication number
CN114333853A
CN114333853A
Authority
CN
China
Prior art keywords
audio data
speaker
conference
voiceprint feature
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011027427.2A
Other languages
Chinese (zh)
Inventor
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202011027427.2A
Priority to PCT/CN2021/098297 (published as WO2022062471A1)
Publication of CN114333853A

Classifications

    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 15/25: Speech recognition using position of the lips, movement of the lips or face analysis
    • G10L 15/26: Speech to text systems
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/06: Decision making techniques; pattern matching strategies
    • G10L 17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/22: Interactive procedures; man-machine interfaces
    • H04N 7/15: Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiments of this application provide an audio data processing method, device, and system for classifying conference audio data by speaker identity. The embodiments specifically include the following steps: the conference record processing apparatus obtains audio data of a first conference site, the sound source azimuth information corresponding to the audio data, and an identity recognition result, where the sound source azimuth information may be carried in additional field information of an audio code stream, and the identity recognition result indicates the correspondence between speaker identity information obtained through portrait recognition and speaker talk-time information; the conference record processing apparatus then performs voice segmentation on the audio data to obtain first segmented audio data of the audio data; and finally, the conference record processing apparatus determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.

Description

Audio data processing method, device, and system
Technical Field
The present application relates to the field of communications, and in particular, to a method, device, and system for processing audio data.
Background
With the rapid development of video conferencing technology, the need for conference minutes also exists in multipoint video conferences, much as minutes are produced manually during an ordinary meeting. Existing products can automatically record the audio, video, data, and other content of an entire conference. If the audio data is simply recorded as a whole, however, it cannot meet the requirement, familiar from ordinary meetings, of organizing the minutes by speaker when the key or specific content of the conference is reviewed later.
During a video conference, if it can be determined that only one person speaks in an entire voice file, the audio data of the whole file can be sent directly to a voiceprint recognition system for recognition. If the voice file contains the voices of several people, the file must first be segmented, and voiceprint recognition is then performed on each segment of audio data. Existing voiceprint recognition systems typically require more than 10 seconds of audio data, and the longer the data, the higher the accuracy. Therefore, when the audio data is segmented, the segments cannot be too short. Because free discussion is common in video conferences, a long segment may contain the voices of several people, and when such a multi-speaker segment is sent to the voiceprint recognition system, the recognition result is unreliable.
A further premise of such a scheme is that conference participants must register their voiceprints in the voiceprint recognition system in advance. However, the channel used during voice collection strongly affects the voiceprint features: a single channel is generally used during voiceprint pre-registration, whereas many different channels are used during recognition, so the accuracy of voiceprint recognition on voices collected over different channels is difficult to guarantee.
Disclosure of Invention
The embodiments of this application provide an audio data processing method, device, and system for accurately classifying conference audio data.
In a first aspect, an embodiment of this application provides an audio data processing method, which specifically includes: the conference record processing apparatus obtains audio data of a first conference site, the sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained through portrait recognition and speaker talk-time information; the conference record processing apparatus then performs voice segmentation on the audio data to obtain first segmented audio data of the audio data; and finally, the conference record processing apparatus determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
In this embodiment, the audio data and the corresponding sound source azimuth information may be packaged into an audio code stream, where the audio code stream contains additional field information of the audio data, and the additional field information includes the sound source azimuth information corresponding to the audio data. The audio data processing method can be applied to a local conference or a teleconference scenario, where at least one conference site participates in the conference. Based on the above scheme, the additional field information may further include time information of the audio data, conference site identification information of the first conference site, and other information. Portrait recognition includes face recognition and body attribute recognition. For example, face recognition identifies the speaker corresponding to facial features, while body attribute recognition identifies the speaker through the user's overall clothing or body features, such as the appearance of the user's clothing. The speaker identity information may be user identity information (such as the speaker's employee number in the company, or an identity number or telephone number registered by the speaker in a company-internal database) or user body attribute information (such as a white top and black trousers worn by the user, or a distinctive mark on the user's arm in the current conference). The talk-time information may be a time period or two time points. For example, the talk-time information may be the 30-second period from 00:00:15 to 00:00:45 after the current conference starts, or it may include only the two time points "00:00:15" and "00:00:45". It should be understood that in the embodiments of this application, times written in the form "00:00:00" follow the rule "hours:minutes:seconds"; that is, the time point "00:00:15" is the 15th second after the conference starts.
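For illustration only, the following Python sketch shows one plausible shape for the additional field information and the identity recognition result described above; the class and field names are assumptions, not structures defined by the patent.
```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AdditionalFieldInfo:
    """Hypothetical additional field information carried in the audio code stream."""
    sound_source_azimuth_deg: Tuple[float, float]  # azimuth range, e.g. (28.0, 32.0)
    capture_time: Tuple[str, str]                  # ("00:00:15", "00:00:45"), hours:minutes:seconds
    site_id: str                                   # conference site identification, e.g. "site-01"

@dataclass
class IdentityRecognitionResult:
    """Maps speaker identity info (from portrait recognition) to talk-time info."""
    # each entry: (speaker identity info, talk-time period)
    entries: List[Tuple[str, Tuple[str, str]]] = field(default_factory=list)

# Example: user01 spoke from 00:00:15 to 00:00:45 at site-01
extra = AdditionalFieldInfo((28.0, 32.0), ("00:00:15", "00:00:45"), "site-01")
id_result = IdentityRecognitionResult([("user01", ("00:00:15", "00:00:45"))])
```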
In the technical solution provided in this embodiment, the conference record processing apparatus obtains an identity recognition result indicating the correspondence between speaker identity information and talk-time information, and then combines the identity recognition result with voiceprint features to further identify the audio data, so that voice data can be classified accurately without pre-registering users' voiceprint features.
Optionally, the conference record processing apparatus may determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result in the following ways:
In one possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, the conference record processing apparatus determines the speaker of the first segmented audio data directly from that identity information. That is, if the identity recognition result for the first segmented audio data indicates that its only speaker is user01 and its voiceprint feature is VP01, the conference record processing apparatus determines the speaker of the first segmented audio data to be user01.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the conference record processing apparatus compares the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is also obtained by voice-segmenting the audio data and corresponds to unique speaker identity information. If the voiceprint feature of the first segmented audio data is consistent with that of the second segmented audio data, the conference record processing apparatus determines the speaker of the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data. For example, suppose the speaker identity information of the second segmented audio data has been determined as user02 with voiceprint feature VP02, and the first segmented audio data also has voiceprint feature VP02 with candidate speaker identity information user03 and user02. Since both segments share the voiceprint feature VP02, and the second segment establishes that VP02 corresponds to user02, the speaker of the first segmented audio data must also be user02.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the conference record processing apparatus determines the speaker of the first segmented audio data jointly from the speaker identity information and voiceprint feature corresponding to the first segmented audio data and the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is also obtained by voice-segmenting the audio data and itself corresponds to at least two pieces of speaker identity information. That is, the conference record processing apparatus can jointly infer the speaker of each piece of segmented audio data from the segments' voiceprint features and their candidate speaker identity information. For example, suppose the second segmented audio data has candidate speakers user02 and user03 with voiceprint feature VP02; the first segmented audio data has voiceprint feature VP03 with candidate speakers user03 and user02; and third segmented audio data has voiceprint feature VP03 with candidate speakers user03 and user01. The first and third segments share voiceprint feature VP03, and the intersection of their candidate speaker sets contains only user03, so VP03 can be determined to correspond to user03. By elimination, the speaker of the second segmented audio data, and hence of voiceprint feature VP02, can then be determined as user02.
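The three implementations above amount to a small constraint-propagation procedure. The sketch below is an illustrative reconstruction (all names are hypothetical; the patent does not specify an algorithm in code):
```python
from dataclasses import dataclass
from typing import List, Set, Dict, Optional

@dataclass
class Segment:
    voiceprint: str          # voiceprint feature label, e.g. "VP02"
    candidates: Set[str]     # speaker identity info from the identity recognition result

def resolve_speakers(segments: List[Segment]) -> Dict[str, Optional[str]]:
    """Resolve each voiceprint feature to a speaker, per the three cases above."""
    vp_to_speaker: Dict[str, Optional[str]] = {}
    # Case 1: a segment with a unique candidate fixes its voiceprint's speaker.
    for seg in segments:
        if len(seg.candidates) == 1:
            vp_to_speaker[seg.voiceprint] = next(iter(seg.candidates))
    # Cases 2 and 3: intersect candidate sets of segments sharing a voiceprint.
    changed = True
    while changed:
        changed = False
        for seg in segments:
            if vp_to_speaker.get(seg.voiceprint):
                continue
            common = set(seg.candidates)
            for other in segments:
                if other is not seg and other.voiceprint == seg.voiceprint:
                    common &= other.candidates
            # Remove candidates already bound to a different voiceprint (elimination).
            taken = {s for vp, s in vp_to_speaker.items() if vp != seg.voiceprint}
            common -= taken
            if len(common) == 1:
                vp_to_speaker[seg.voiceprint] = common.pop()
                changed = True
    return vp_to_speaker

# The example from the text: VP03 -> user03 by intersection, then VP02 -> user02 by elimination.
segs = [Segment("VP03", {"user03", "user02"}),
        Segment("VP02", {"user02", "user03"}),
        Segment("VP03", {"user03", "user01"})]
print(resolve_speakers(segs))  # {'VP03': 'user03', 'VP02': 'user02'}
```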
In another possible implementation, the conference record processing apparatus determines the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record contains the historical voiceprint feature records of the first conference site, and the historical voiceprint feature records indicate the correspondence among voiceprint features, speakers, and channel identifiers.
Optionally, when the conference record processing apparatus determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the conference record processing apparatus compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already determined in the current conference of the first conference site. If the voiceprint feature of the first speaker in the current conference is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the conference record processing apparatus determines that the long-term voiceprint feature record is usable; it then compares the voiceprint feature and identity recognition result corresponding to the first segmented audio data against the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
In the audio data classification process, combining short-term processing with long-term processing in this way improves the accuracy of audio data classification as much as possible.
Optionally, the conference record processing apparatus compares the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already determined in the current conference of the first conference site. If the voiceprint feature of the first speaker in the current conference is not consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the conference record processing apparatus registers the voiceprint feature from the current conference of the first conference site, together with the channel identifier and the speaker corresponding to that voiceprint feature, and updates the long-term voiceprint feature record. In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, keeping the long-term voiceprint feature record usable. Moreover, because the corresponding voiceprint features and speakers are registered after each conference, voiceprint registration becomes dynamic: registration need not be restricted to a fixed channel identifier, and audio data can still be classified accurately and effectively.
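A minimal sketch of this check-then-register flow, assuming the long-term record is keyed by conference site, speaker, and channel identifier (a representation chosen here for illustration):
```python
from typing import Dict, Tuple

# long-term record: (site_id, speaker, channel_id) -> voiceprint feature
LongTermRecord = Dict[Tuple[str, str, str], str]

def check_and_update(record: LongTermRecord, site: str, channel: str,
                     speaker: str, current_vp: str) -> bool:
    """Return True if the long-term record is usable for this site/speaker;
    otherwise register the current voiceprint and report the record as stale."""
    key = (site, speaker, channel)
    stored_vp = record.get(key)
    if stored_vp == current_vp:
        return True           # record is usable: match against it for attribution
    record[key] = current_vp  # dynamic registration / update after the conference
    return False

record: LongTermRecord = {("site-01", "user01", "ch-0"): "VP01"}
usable = check_and_update(record, "site-01", "ch-0", "user01", "VP01")  # True
```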
Optionally, after obtaining the voiceprint feature of the first segmented audio data and the corresponding speaker, the conference record processing apparatus may obtain voiceprint identification information for the voiceprint feature of the first segmented audio data, and then establish the correspondence between the voiceprint identification information and the speaker corresponding to the first segmented audio data. In this way, voiceprint features correspond one-to-one with speakers, which facilitates subsequent audio data classification.
Optionally, when the technical solution provided in the embodiments of this application is applied to a remote multi-site conference scenario, the conference record processing apparatus may be a recording server or a functional module integrated in a multipoint control unit. In that case, the audio code stream may be forwarded to the conference record processing apparatus by the multipoint control unit, and the identity recognition result is sent to the conference record processing apparatus by the video conference terminal.
Optionally, the audio code stream may pass through conference-site gating at the multipoint control unit before being forwarded to the conference record processing apparatus. This reduces unnecessary data transmission and lightens the network load.
Optionally, the conference record processing apparatus may perform voice segmentation on the audio data as follows: the conference record processing apparatus segments the audio data according to the sound source azimuth information combined with human-voice detection. This allows the audio data to be segmented more accurately, as illustrated by the sketch below.
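As an illustration of segmentation that combines sound source azimuth with human-voice detection, here is a frame-level sketch (the 15-degree jump threshold and the frame-level inputs are assumptions):
```python
from typing import List, Tuple

def segment_audio(is_voice: List[bool], azimuth: List[float],
                  azimuth_jump_deg: float = 15.0) -> List[Tuple[int, int]]:
    """Split frames into segments; a new segment starts on silence or a large
    azimuth change (suggesting a different speaker position)."""
    segments, start = [], None
    for i, voiced in enumerate(is_voice):
        if not voiced:
            if start is not None:
                segments.append((start, i))
                start = None
            continue
        if start is None:
            start = i
        elif abs(azimuth[i] - azimuth[i - 1]) > azimuth_jump_deg:
            segments.append((start, i))   # azimuth jumped: close the old segment
            start = i
    if start is not None:
        segments.append((start, len(is_voice)))
    return segments

# Speaker at ~30 degrees, a short silence, then a speaker at ~90 degrees
voice = [True, True, True, False, True, True, True]
azim  = [30.0, 30.5, 29.8, 0.0, 90.0, 90.2, 89.9]
print(segment_audio(voice, azim))  # [(0, 3), (4, 7)]
```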
In a second aspect, an embodiment of this application provides an audio data processing method, including: the video conference terminal performs sound source localization on the audio data of a first conference site to obtain the sound source azimuth information corresponding to the audio data; the video conference terminal obtains an identity recognition result according to the sound source azimuth and portrait recognition, where the identity recognition result indicates the correspondence between speaker identity information and talk-time information; and the video conference terminal sends the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to the conference record processing apparatus.
In this embodiment, the video conference terminal performs sound source localization on the audio data so that it can capture image information of the speaker, performs portrait recognition on that image information to obtain an identity recognition result indicating the correspondence between speaker identity information and talk-time information, and then sends the identity recognition result to the conference record processing apparatus. The conference record processing apparatus combines the identity recognition result with voiceprint features to further identify the audio data, so the audio data can be classified accurately without pre-registering users' voiceprint features.
Optionally, the video conference terminal may perform portrait recognition as follows: the video conference terminal captures the portrait information corresponding to the sound source azimuth; it performs image recognition on the portrait information to obtain face information and/or body attribute information; it determines the speaker identity information from the face information and/or body attribute information; and it establishes the correspondence between the talk-time information and the speaker identity information to obtain the identity recognition result, as sketched below.
In this embodiment, the speaker identity information may be user identity information (such as the speaker's employee number in the company, or an identity number or telephone number registered by the speaker in a company-internal database) or user body attribute information (such as a white top and black trousers worn by the user, or a distinctive mark on the user's arm in the current conference). The talk-time information may be a time period or two time points. For example, the talk-time information may be the 30-second period from 00:00:15 to 00:00:45 after the current conference starts, or it may include only the two time points "00:00:15" and "00:00:45". As noted above, times written in the form "00:00:00" follow the rule "hours:minutes:seconds"; that is, "00:00:15" is the 15th second after the conference starts.
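The terminal-side flow in the two paragraphs above could look roughly like the following; the two recognizer functions are stubs standing in for whatever face and body-attribute recognizers the terminal actually uses, and all names are hypothetical:
```python
from typing import Optional, Tuple

def recognize_face(image: bytes) -> Optional[str]:
    """Stub: would return e.g. an employee number from a face database."""
    return "user01"

def recognize_body_attributes(image: bytes) -> Optional[str]:
    """Stub: would return e.g. 'white top, black trousers' as identity info."""
    return None

def build_identity_result(image: bytes,
                          talk_time: Tuple[str, str]) -> Tuple[str, Tuple[str, str]]:
    """Map speaker identity info (face first, body attributes as fallback)
    to the talk-time information."""
    identity = recognize_face(image) or recognize_body_attributes(image) or "unknown"
    return (identity, talk_time)

entry = build_identity_result(b"...jpeg bytes...", ("00:00:15", "00:00:45"))
# ('user01', ('00:00:15', '00:00:45'))
```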
Optionally, when the technical solution provided in the embodiments of this application is applied to a single-user scenario of a local conference or a teleconference, the video conference terminal may itself act as the conference record processing apparatus and carry out the method of the first aspect, specifically as follows:
The video conference terminal acquires the audio data of the current conference site and segments it according to the sound source azimuth and human-voice detection to obtain segmented audio data; it then obtains the voiceprint features of the segmented audio data and determines the speakers corresponding to the segmented audio data according to the voiceprint features and the identity recognition result.
In the technical solution provided in this embodiment, the video conference terminal obtains an identity recognition result indicating the correspondence between speaker identity information and talk-time information, and then combines the identity recognition result with voiceprint features to further identify the audio data, so that voice data can be classified accurately without pre-registering users' voiceprint features.
Optionally, the video conference terminal may determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result in the following ways:
In one possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, the video conference terminal determines the speaker of the first segmented audio data directly from that identity information. That is, if the identity recognition result for the first segmented audio data indicates that its only speaker is user01 and its voiceprint feature is VP01, the video conference terminal determines the speaker of the first segmented audio data to be user01.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the video conference terminal compares the voiceprint feature of the first segmented audio data with the voiceprint feature of second segmented audio data, where the second segmented audio data is also obtained by voice-segmenting the audio data and corresponds to unique speaker identity information. If the voiceprint feature of the first segmented audio data is consistent with that of the second segmented audio data, the video conference terminal determines the speaker of the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data. For example, suppose the speaker identity information of the second segmented audio data has been determined as user02 with voiceprint feature VP02, and the first segmented audio data also has voiceprint feature VP02 with candidate speaker identity information user03 and user02. Since both segments share the voiceprint feature VP02, and the second segment establishes that VP02 corresponds to user02, the speaker of the first segmented audio data must also be user02.
In another possible implementation, if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the video conference terminal determines the speaker of the first segmented audio data jointly from the speaker identity information and voiceprint feature corresponding to the first segmented audio data and the speaker identity information and voiceprint feature corresponding to second segmented audio data, where the second segmented audio data is also obtained by voice-segmenting the audio data and itself corresponds to at least two pieces of speaker identity information. That is, the video conference terminal can jointly infer the speaker of each piece of segmented audio data from the segments' voiceprint features and their candidate speaker identity information. For example, suppose the second segmented audio data has candidate speakers user02 and user03 with voiceprint feature VP02; the first segmented audio data has voiceprint feature VP03 with candidate speakers user03 and user02; and third segmented audio data has voiceprint feature VP03 with candidate speakers user03 and user01. The first and third segments share voiceprint feature VP03, and the intersection of their candidate speaker sets contains only user03, so VP03 can be determined to correspond to user03. By elimination, the speaker of the second segmented audio data, and hence of voiceprint feature VP02, can then be determined as user02.
In another possible implementation, the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site, where the long-term voiceprint feature record contains the historical voiceprint feature records of the first conference site, and the historical voiceprint feature records indicate the correspondence among voiceprint features, speakers, and channel identifiers.
Optionally, when the video conference terminal determines the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record, the specific operation may be as follows: the video conference terminal compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already determined in the current conference of the first conference site. If the voiceprint feature of the first speaker in the current conference is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the video conference terminal determines that the long-term voiceprint feature record is usable; it then compares the voiceprint feature and identity recognition result corresponding to the first segmented audio data against the long-term voiceprint feature record of the first conference site to determine the speaker corresponding to the first segmented audio data.
In the audio data classification process, combining short-term processing with long-term processing in this way improves the accuracy of audio data classification as much as possible.
Optionally, the video conference terminal compares the voiceprint feature of the first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, where the first speaker is a speaker already determined in the current conference of the first conference site. If the voiceprint feature of the first speaker in the current conference is not consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the video conference terminal registers the voiceprint feature from the current conference of the first conference site, together with the channel identifier and the speaker corresponding to that voiceprint feature, and updates the long-term voiceprint feature record. In this way, the voiceprint features, speakers, and channel identifiers of the conference site can be updated according to the actual situation, keeping the long-term voiceprint feature record usable. Moreover, because the corresponding voiceprint features and speakers are registered after each conference, voiceprint registration becomes dynamic: registration need not be restricted to a fixed channel identifier, and audio data can still be classified accurately and effectively.
Optionally, after obtaining the voiceprint feature of the first segmented audio data and the corresponding speaker, the video conference terminal may obtain voiceprint identification information for the voiceprint feature of the first segmented audio data, and then establish the correspondence between the voiceprint identification information and the speaker corresponding to the first segmented audio data. In this way, voiceprint features correspond one-to-one with speakers, which facilitates subsequent audio data classification.
In a third aspect, this application provides a conference record processing apparatus that has the functions of the conference record processing apparatus's behavior in the first aspect. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
In one possible implementation, the apparatus includes means or modules for performing the steps of the first aspect above. For example, the apparatus includes: an acquisition module, configured to acquire the audio data of a first conference site, the sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result indicates the correspondence between speaker identity information obtained through portrait recognition and speaker talk-time information; and a processing module, configured to perform voice segmentation on the audio data to obtain first segmented audio data of the audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
Optionally, the apparatus further includes a storage module, configured to store the necessary program instructions and data of the conference record processing apparatus.
In one possible implementation, the apparatus includes: a processor and a transceiver. The processor is configured to support the conference record processing apparatus in performing the corresponding functions of the method provided in the first aspect. The transceiver supports communication between the conference record processing apparatus and the other devices in the conference system, for example receiving the audio data and identity recognition result related to the method sent by the video conference terminal. Optionally, the apparatus may further include a memory coupled to the processor, which stores the program instructions and data necessary for the conference record processing apparatus.
In one possible implementation, when the device is a chip within a conference record processing apparatus, the chip includes: a processing module and a transceiver module. The transceiver module may be, for example, an input/output interface, pin, or circuit on the chip, and transmits the received audio data and identity recognition result of the first conference site to other chips or modules coupled to the chip. The processing module may be, for example, a processor configured to perform voice segmentation on the audio data to obtain first segmented audio data of the audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result. The processing module can execute computer-executable instructions stored in a storage unit to support the conference record processing apparatus in performing the method provided in the first aspect. Optionally, the storage unit may be a storage unit in the chip, such as a register or a cache, or a storage unit located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), and the like.
In one possible implementation, the apparatus includes: a processor, a radio frequency circuit, and an antenna. The processor controls the functions of each circuit part and determines the speaker corresponding to the first segmented audio data; the radio frequency circuit performs processing such as analog conversion, filtering, amplification, and up-conversion; and the processed data is sent to the automatic speech recognition server through the antenna. Optionally, the apparatus further includes a memory that stores the necessary program instructions and data of the conference record processing apparatus.
In one possible implementation, the apparatus includes a communication interface and a logic circuit. The communication interface is configured to obtain the audio code stream of a first conference site and an identity recognition result, where the audio code stream includes audio data and additional field information, the additional field information includes the sound source azimuth information corresponding to the audio data, and the identity recognition result indicates the correspondence between speaker identity information obtained through portrait recognition and speaker talk-time information. The logic circuit is configured to perform voice segmentation on the audio data to obtain first segmented audio data of the audio data, and to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result.
Any of the processors mentioned above may be a central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits that control the execution of programs implementing the audio data processing methods of the above aspects.
In a fourth aspect, an embodiment of this application provides a video conference device that has the functions of the video conference terminal's behavior in the second aspect. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
In one possible implementation, the apparatus includes means or modules for performing the steps of the second aspect above. For example, the apparatus includes: a processing module, configured to perform sound source localization on the audio data of a first conference site to acquire the sound source azimuth information corresponding to the audio data, and to obtain an identity recognition result according to the sound source azimuth and portrait recognition, where the identity recognition result indicates the correspondence between speaker identity information and talk-time information;
and a sending module, configured to send the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to the conference record processing apparatus.
Optionally, the apparatus further includes a storage module, configured to store the necessary program instructions and data of the video conference device.
In one possible implementation, the apparatus includes: a processor and a transceiver. The processor is configured to support the video conference device in performing the corresponding functions of the method provided in the second aspect. The transceiver supports communication between the video conference device and the other devices in the conference system, and sends the audio code stream and the identity recognition result to the conference record processing apparatus. Optionally, the apparatus may further include a memory coupled to the processor, which stores the program instructions and data necessary for the video conference device.
In one possible implementation, when the apparatus is a chip within a video conference device, the chip includes: a processing module and a transceiver module. The processing module may be a processor configured to perform sound source localization on the audio data of a first conference site to acquire the sound source azimuth information corresponding to the audio data, and to obtain an identity recognition result according to the sound source azimuth and portrait recognition, where the identity recognition result indicates the correspondence between speaker identity information and talk-time information. The transceiver module may be, for example, an input/output interface, pin, or circuit on the chip, and transmits the identity recognition result and audio data to other chips or modules coupled to the chip. The processing module can execute computer-executable instructions stored in a storage unit to support the video conference device in performing the method provided in the second aspect. Optionally, the storage unit may be a storage unit in the chip, such as a register or a cache, or a storage unit located outside the chip, such as a ROM or another type of static storage device that can store static information and instructions, a RAM, and the like.
In one possible implementation, the apparatus includes: a processor, a baseband circuit, a radio frequency circuit, and an antenna. The processor controls the functions of each circuit part; the baseband circuit generates a data packet containing the audio code stream and the identity recognition result; and the data packet, after processing such as analog conversion, filtering, amplification, and up-conversion by the radio frequency circuit, is sent to the conference record processing apparatus through the antenna. Optionally, the apparatus further includes a memory that stores the necessary program instructions and data of the video conference device.
In one possible implementation, the apparatus includes: a communication interface and a logic circuit. The logic circuit is configured to perform sound source localization on the audio data of a first conference site to acquire the sound source azimuth information corresponding to the audio data, and to obtain an identity recognition result according to the sound source azimuth and portrait recognition, where the identity recognition result indicates the correspondence between speaker identity information and talk-time information. The communication interface is configured to send the identity recognition result to the conference record processing apparatus and to send the audio data to the multipoint control unit.
Any of the processors mentioned above may be a CPU, a microprocessor, an ASIC, or one or more integrated circuits that control the execution of programs implementing the audio data processing methods of the above aspects.
In a fifth aspect, the present application provides a computer-readable storage medium storing computer instructions for executing the method according to any possible implementation manner of any one of the above aspects.
In a sixth aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of the above aspects.
In a seventh aspect, the present application provides a chip system, which includes a processor, and is configured to support a conference recording processing apparatus or a video conference apparatus to implement the functions recited in the above aspects, such as generating or processing data and/or information recited in the above methods. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the conference recording processing apparatus or the video conference apparatus to implement the functions of any one of the above aspects. The chip system may be formed by a chip, and may also include a chip and other discrete devices.
In an eighth aspect, an embodiment of this application provides a conference system, which includes the conference record processing apparatus and the video conference device of the above aspects.
Drawings
FIG. 1A is a schematic diagram of an embodiment of a conference system architecture in an embodiment of the present application;
fig. 1B is a schematic diagram of another embodiment of a conference system architecture in the embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of a method for processing audio data according to an embodiment of the present application;
fig. 3 is a schematic view of a scene in which a video conference terminal acquires image information according to an embodiment of the present application;
fig. 4 is a schematic diagram of another embodiment of a processing method of audio data in the embodiment of the present application;
fig. 5 is a schematic diagram of another embodiment of a processing method of audio data in the embodiment of the present application;
fig. 6 is a schematic diagram of another embodiment of a processing method of audio data in the embodiment of the present application;
fig. 7 is a schematic diagram of an embodiment of a conference record processing apparatus in an embodiment of the present application;
fig. 8 is a schematic diagram of another embodiment of a conference record processing apparatus in the embodiment of the present application;
fig. 9 is a schematic diagram of an embodiment of a video conference terminal in the embodiment of the present application;
fig. 10 is a schematic diagram of another embodiment of a video conference terminal in the embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the following describes the embodiments of this application with reference to the accompanying drawings. Clearly, the described embodiments are only some rather than all of the embodiments of this application. A person skilled in the art will appreciate that, as new application scenarios emerge, the technical solutions provided in the embodiments of this application are likewise applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The naming or numbering of the steps appearing in the present application does not mean that the steps in the method flow have to be executed in the chronological/logical order indicated by the naming or numbering, and the named or numbered process steps may be executed in a modified order depending on the technical purpose to be achieved, as long as the same or similar technical effects are achieved. The division of the units presented in this application is a logical division, and in practical applications, there may be another division, for example, multiple units may be combined or integrated into another system, or some features may be omitted, or not executed, and in addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some interfaces, and the indirect coupling or communication connection between the units may be in an electrical or other similar form, which is not limited in this application. Furthermore, the units or sub-units described as the separate parts may or may not be physically separate, may or may not be physical units, or may be distributed in a plurality of circuit units, and some or all of the units may be selected according to actual needs to achieve the purpose of the present disclosure.
The technical solutions of the embodiments of the present invention can be applied to local conference or teleconference scenarios. The specific system architecture of the embodiments may include multiple video conference terminals, a multipoint control unit, a recording server, and an automatic speech recognition (ASR) server. Taking the embodiment shown in fig. 1A as an example, each of the video conference terminals (e.g., video conference terminals 01 to 03 in fig. 1A) collects conference audio data and image information of the conference participants, and performs identity recognition on the speakers among the participants through the image information to obtain an identity recognition result. The video conference terminal then sends the audio data and the identity recognition result to the recording server. The recording server classifies the audio data according to the identity recognition result and the sound source azimuth and sends the classified audio data to the ASR server. The ASR server outputs the conference record through its speech transcription function. The embodiment of fig. 1B differs from that of fig. 1A in that the functions of the recording server are integrated in the multipoint control unit (corresponding to the conference record processing module in fig. 1B). Each of the video conference terminals (e.g., video conference terminals 01 to 03 in fig. 1B) collects conference audio data and image information of the conference participants, and performs identity recognition on the speakers among the participants through the image information to obtain an identity recognition result. The video conference terminal then sends the audio data and the identity recognition result to the multipoint control unit; the conference record processing module in the multipoint control unit classifies the audio data and sends the classified audio data to the ASR server. Finally, the ASR server outputs the conference record through its speech transcription function.
Specifically, referring to fig. 2, an embodiment of a method for processing audio data in the embodiment of the present application includes:
201. The video conference terminal acquires audio data.
In a teleconference scenario, a conference may include multiple conference sites, each conference site corresponds to at least one video conference terminal, and each conference site includes at least one participant. This embodiment is described from the perspective of one video conference terminal in one of the conference sites. During the conference, the video conference terminal picks up the audio data of each speaker in real time using a microphone.
202. The video conference terminal acquires the sound source azimuth of the audio data.
The video conference terminal can acquire the sound source azimuth corresponding to the audio data while collecting the audio data, and establishes the correspondence between the audio data and the sound source azimuth. For example, the sound source azimuth of the audio data collected by the video conference terminal from conference time 00:00:15 to 00:00:30 is about 30 degrees east relative to the video conference terminal. It should be understood that sound source localization allows for error, so the sound source azimuth may be a range of values: "30 degrees east", for example, may specifically cover the range from 28 degrees to 32 degrees east.
In this embodiment, the following possible implementation manners may be adopted for the video conference terminal to obtain the sound source position of the audio data:
in one possible implementation, an array microphone is deployed on the video conference terminal, and the sound source position of the audio data is determined through sound beam information picked up by the array microphone.
In another possible implementation, the conference hall additionally deploys a device or system dedicated to sound source localization, then determines the sound source location of the audio data with the sound source localization device or system as a calibration reference point, and then transmits the sound source location to the video conference terminal.
It is to be understood that the sound source localization may adopt the above-mentioned scheme or any other possible implementation manner as long as the sound source location of the audio data can be obtained, and the specific scheme is not limited herein.
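As an illustration of the array-microphone implementation above, the following is a minimal sketch of estimating a sound source direction from a two-element microphone array using GCC-PHAT time-delay estimation. The patent does not prescribe any particular localization algorithm, so the function names, the PHAT weighting, and the 343 m/s speed of sound are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed

def gcc_phat_delay(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Time delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = 2 * max(len(sig), len(ref))
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12            # PHAT weighting
    corr = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    corr = np.concatenate((corr[-max_shift:], corr[:max_shift + 1]))
    return (int(np.argmax(np.abs(corr))) - max_shift) / fs

def azimuth_from_delay(delay_s: float, mic_spacing_m: float) -> float:
    """Map an inter-microphone delay to an azimuth in degrees (0 = broadside)."""
    sin_theta = np.clip(delay_s * SPEED_OF_SOUND / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

A real deployment would smooth the estimate over successive frames and use more than two microphones to resolve front-back ambiguity; this sketch shows only the core delay-to-azimuth computation.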
203. And the video conference terminal performs voice segmentation on the audio data through human voice detection to obtain segmented audio data.
The video conference terminal performs voice segmentation on the received audio data according to human voice detection to obtain different pieces of segmented audio data.
In this embodiment, the video conference terminal can distinguish a preceding voice segment from a following voice segment according to the mute interval between them; or it can judge through an algorithm whether each portion is voice or non-voice, and split the preceding and following voice segments at the non-voice portion. For example, suppose the video conference terminal collects audio data from conference start time 00:00:15 to 00:00:30, detects silence from 00:00:30 to 00:00:32, collects audio data from 00:00:32 to 00:00:45, and detects silence from 00:00:45 to 00:00:50. The video conference terminal may then treat the audio data collected from 00:00:15 to 00:00:30 as one piece of segmented audio data, and the audio data collected from 00:00:32 to 00:00:45 as the next piece of segmented audio data.
It is understood that, in the embodiments of this application, timestamps of the form "00:00:00" follow the rule "hours:minutes:seconds", counted from the start of the conference; that is, "00:00:15" denotes the 15th second after the conference starts.
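The following is a minimal sketch of the silence-interval segmentation rule described above: frames whose RMS energy stays below a threshold for a sufficiently long run split the stream into segmented audio data. The frame length, energy threshold, and minimum silence gap are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def segment_by_silence(samples: np.ndarray, fs: int, frame_ms: int = 20,
                       energy_thresh: float = 0.01, min_silence_ms: int = 1000):
    """Split `samples` into (start, end) sample ranges separated by long silence."""
    frame = int(fs * frame_ms / 1000)
    min_gap = min_silence_ms // frame_ms       # silent frames that end a segment
    voiced = [float(np.sqrt(np.mean(samples[i:i + frame] ** 2))) >= energy_thresh
              for i in range(0, len(samples) - frame, frame)]
    segments, start, gap = [], None, 0
    for idx, v in enumerate(voiced):
        if v:
            start = idx if start is None else start
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:                 # long mute interval: close segment
                segments.append((start * frame, (idx - gap + 1) * frame))
                start, gap = None, 0
    if start is not None:                      # audio ended while still voiced
        segments.append((start * frame, len(samples)))
    return segments
```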
204. And the video conference terminal collects the image information in the sound source azimuth range according to the sound source azimuth.
The video conference terminal determines an image information acquisition area of the video conference terminal according to the sound source position corresponding to the audio data acquired in step 202, and then acquires image information in the image information acquisition area.
In this embodiment, the video conference terminal may capture the image information in the form of a photograph, or take the picture frame corresponding to the audio data in the video stream as the image information; the specific form is not limited here. The camera of the video conference terminal may be fixed or rotatable; the specific arrangement is likewise not limited. When the camera is fixed (i.e., its shooting range is fixed), the video conference terminal captures images within that fixed range and then computes and extracts the image information corresponding to the audio data according to the sound source direction. When the camera is movable, the video conference terminal can adjust the shooting range of the camera according to the sound source direction to obtain the image information corresponding to the audio data. As shown in fig. 3, the video conference terminal is located above the conference screen and the conference participants sit on the two sides of the conference table; when one of the participants speaks, the video conference terminal can capture image information within a certain angle range according to the sound source direction. Because of the viewing angle, the image information may contain several participants or only one. When the image information of speaker 1 is captured according to the sound source localization of speaker 1, the image area contains only speaker 1; when the image information of speaker 2 is captured according to the sound source localization of speaker 2, the image area includes speaker 1 and another participant.
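For the fixed-camera case described above, the following is a minimal sketch of computing and extracting the image region that corresponds to a sound source azimuth: the azimuth is mapped linearly onto the camera's horizontal field of view and a window around the resulting column is cropped. The field-of-view and window widths are illustrative assumptions.

```python
import numpy as np

def crop_for_azimuth(frame: np.ndarray, azimuth_deg: float,
                     fov_deg: float = 90.0, window_deg: float = 20.0) -> np.ndarray:
    """frame: HxWx3 image; azimuth 0 = optical axis, positive = to the right."""
    w = frame.shape[1]
    px_per_deg = w / fov_deg                  # linear pinhole approximation
    center = w / 2 + azimuth_deg * px_per_deg
    half = window_deg * px_per_deg / 2
    left, right = int(max(0, center - half)), int(min(w, center + half))
    return frame[:, left:right]               # region handed to portrait recognition
```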
205. And the video conference terminal carries out portrait recognition on the image information to obtain an identity recognition result.
The video conference terminal performs face recognition and human body attribute recognition on the image information to obtain an identity recognition result, and the identity recognition result is used for indicating the correspondence between the identity information of the speaker and the speaking time information. For example, face recognition obtains the speaker corresponding to the facial features, while human body attribute recognition obtains the speaker corresponding to body features or the appearance of the user's clothing by recognizing the user's overall dress or physique. The speaker identity information may be user identity identification information (such as the speaker's work number in the company, or an identification number or telephone number registered by the speaker in a database inside the company) or user body attribute identification information (such as a white top, black trousers, or a distinctive mark on the user's arm in the current conference). The speaking time information may be a time period or two time points. For example, the speaking time information is the 30-second period from 00:00:15 to 00:00:45 after the current conference starts; or the speaking time information includes only the two time points "00:00:15" and "00:00:45".
In this embodiment, the specific operation of the video conference terminal for acquiring the identity information of the speaker may be as follows: if the image information contains clearly distinguishable face information, the video conference terminal identifies the face in the image information by using a face identification technology, and compares the face with a stored face database to determine user identity identification information corresponding to the face; if the face information in the image information does not meet the identification requirement (for example, the facial features cannot meet the face identification requirement or no face image), the video conference terminal may perform body attribute identification to obtain body attribute information, and determine the user body attribute identification information according to the body attribute information.
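The following is a minimal sketch of the face-first, body-attribute-fallback decision just described. The recognizer callables are placeholders for whatever face and body recognition engine is actually deployed; only the fallback control flow is taken from the text.

```python
from typing import Callable, Optional

def identify_speaker(image,
                     recognize_face: Callable[[object], Optional[str]],
                     extract_body_attrs: Callable[[object], dict]) -> dict:
    """Return identity info for one captured image region."""
    user_id = recognize_face(image)    # None if no face, or face below requirement
    if user_id is not None:
        # Clearly distinguishable face: use the registered user identity.
        return {"kind": "user_id", "value": user_id}
    # Fallback: body attributes such as clothing colour, build, visible marks.
    return {"kind": "body_attrs", "value": extract_body_attrs(image)}
```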
206. The video conference terminal packs the audio data and the corresponding sound source direction into an audio code stream and sends the audio code stream to the multipoint control unit, and sends the identity recognition result to the recording and broadcasting server.
And the video conference terminal packs the audio data and the sound source direction corresponding to the audio data into an audio code stream and sends the audio code stream to the multipoint control unit. In an exemplary scheme, the video conference terminal encodes the audio data into an audio code stream, and then adds additional domain information to the corresponding audio code stream, and indicates the sound source orientation information corresponding to the audio data by using the additional domain information. And the video conference terminal can directly send the identity recognition result obtained by the portrait recognition to the recording and broadcasting server.
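As one way to realize the additional domain information described above, the following is a minimal sketch in which a small header carrying the sound source azimuth is prepended to each encoded audio packet. The header layout (magic bytes, azimuth, start/end timestamps) is an illustrative assumption; the patent only requires that the additional field convey the sound source orientation.

```python
import struct

HDR = struct.Struct(">4sfII")   # magic, azimuth_deg, start_ms, end_ms

def pack_packet(encoded_audio: bytes, azimuth_deg: float,
                start_ms: int, end_ms: int) -> bytes:
    """Prepend the additional-field header to one encoded audio packet."""
    return HDR.pack(b"ADDF", azimuth_deg, start_ms, end_ms) + encoded_audio

def unpack_packet(packet: bytes):
    """Recover (azimuth_deg, start_ms, end_ms, encoded_audio)."""
    magic, azimuth_deg, start_ms, end_ms = HDR.unpack_from(packet)
    assert magic == b"ADDF", "not an additional-field packet"
    return azimuth_deg, start_ms, end_ms, packet[HDR.size:]
```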
207. And the multipoint control unit sends the audio code stream sent by the video conference terminal to the recording and broadcasting server.
After receiving the audio code stream sent by the video conference terminal, the multipoint control unit determines the conference place to which the video conference terminal belongs according to the conference identifier allocated to the video conference terminal, then adds the conference place identifier in the audio code stream, and sends the audio code stream to the recording and playing server.
In one possible implementation manner, the multipoint control unit may screen the audio data of each conference site and then select the audio data of one or more conference sites to send to the recording and playing server. The multipoint control unit can compare the volume of the audio data of each conference site and select for forwarding the audio data whose volume is greater than a preset threshold; or the multipoint control unit may determine through an algorithm the audio data whose voice duration exceeds a preset threshold and forward it. The specific screening conditions are not limited here. This reduces the amount of processing and thus increases the processing speed.
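The following is a minimal sketch of the volume-based screening described above: the multipoint control unit forwards only the conference-site streams whose RMS level exceeds a preset threshold. The threshold value is an illustrative assumption.

```python
import numpy as np

def select_sites(site_audio: dict, rms_thresh: float = 0.02) -> dict:
    """site_audio: site_id -> float sample array; keep sites above the threshold."""
    return {site: samples for site, samples in site_audio.items()
            if float(np.sqrt(np.mean(np.square(samples)))) > rms_thresh}
```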
208. The recording and broadcasting server decodes the audio code stream to obtain audio data, and performs voice segmentation on the audio data to obtain the segmented audio data.
After acquiring the audio code stream, the recording and broadcasting server can decode it to obtain the audio data and the conference site identifier, and then stores the audio data according to the conference site identifier. At the same time, the recording and broadcasting server performs voice segmentation on the audio data according to the sound source direction of the audio data and a human voice detection technique, thereby obtaining segmented audio data. It can be understood that, in this embodiment, the recording and broadcasting server segments the audio data according to both the sound source direction and human voice detection in order to further classify the audio data reported by the video conference terminal. For example, if a video conference terminal detects through human voice detection that there is voice from 00:00:15 to 00:00:30, it treats the audio data collected from 00:00:15 to 00:00:30 as one piece of segmented audio data. That piece, however, may actually contain two speakers: from 00:00:15 to 00:00:25 a speaker speaks from sound source position 1, and from 00:00:25 to 00:00:30 another speaker speaks from sound source position 2. Therefore, when the recording and broadcasting server performs voice segmentation again according to the sound source direction and human voice detection, this audio is divided into two pieces of segmented audio data.
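The following is a minimal sketch of this second-pass segmentation: a voiced segment is split wherever the per-frame sound source azimuth jumps by more than a tolerance, so that one human-voice segment containing two speakers becomes two pieces of segmented audio data. The frame representation and the 10-degree tolerance are illustrative assumptions.

```python
def split_by_azimuth(frames: list, azimuths: list, tol_deg: float = 10.0) -> list:
    """frames: per-frame audio chunks of one voiced segment;
    azimuths: the sound source azimuth estimated for each frame."""
    pieces, start = [], 0
    for i in range(1, len(frames)):
        if abs(azimuths[i] - azimuths[i - 1]) > tol_deg:   # source moved: new speaker
            pieces.append(frames[start:i])
            start = i
    pieces.append(frames[start:])
    return pieces
```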
209. And when the segmented audio data conforms to the minimum voiceprint recognition length, the recording and playing server extracts the voiceprint characteristics of the segmented audio data.
When the segmented audio data meets the minimum voiceprint recognition length, the recording and broadcasting server extracts voiceprint features from the segmented audio data using voiceprint clustering or similar techniques and labels them with voiceprint identifiers. In an exemplary scheme, assuming that the recording and broadcasting server divides the audio data into 10 pieces of segmented audio data, of which 8 pieces have a duration that satisfies the minimum voiceprint recognition length, the recording and broadcasting server extracts voiceprint features from those 8 pieces and labels the voiceprint identifiers (voiceprint 1 to voiceprint 8).
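The following is a minimal sketch of the minimum-length check and voiceprint labelling. The `embed` callable stands in for any speaker-embedding model; the 1.5-second minimum and the cosine-similarity threshold are illustrative assumptions, since the patent specifies only that segments must meet a minimum voiceprint recognition length.

```python
import numpy as np

MIN_VOICEPRINT_SEC = 1.5   # assumed minimum voiceprint recognition length

def label_voiceprints(segments, fs, embed):
    """segments: float sample arrays; embed: samples -> embedding vector."""
    voiceprints, labelled = {}, []
    for samples in segments:
        if len(samples) / fs < MIN_VOICEPRINT_SEC:
            continue                                   # too short: skip extraction
        vec = embed(samples)
        for vp_id, ref in voiceprints.items():         # reuse a close existing label
            cos = float(np.dot(vec, ref) /
                        (np.linalg.norm(vec) * np.linalg.norm(ref)))
            if cos > 0.75:                             # assumed similarity threshold
                break
        else:
            vp_id = f"voiceprint {len(voiceprints) + 1}"
            voiceprints[vp_id] = vec
        labelled.append((vp_id, samples))
    return labelled
```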
210. And the recording and broadcasting server determines the identity of the speaker of the segmented audio data according to the identity recognition result and the voiceprint characteristics of the segmented audio data.
And the recording and broadcasting server integrates and analyzes the received identity recognition result and the voiceprint characteristics of the segmented audio data to determine the identity of the speaker of the segmented audio data.
Specifically, the following method can be adopted:
in a possible implementation manner, if the identification result indicates that the first segment of audio data corresponds to the unique speaker information, the conference record processing apparatus determines the speaker corresponding to the first segment of audio data according to the unique speaker information indicated by the identification result.
In another possible implementation manner, if the identity recognition result indicates that the first segmented audio data corresponds to at least two speaker identity information, the conference recording processing device compares a voiceprint feature of the first segmented audio data with a voiceprint feature of second segmented audio data, where the second segmented audio data is obtained by performing voice segmentation on the audio data by the conference recording processing device, and the second segmented audio data corresponds to a unique speaker; and if the voiceprint characteristics of the first segmented audio data are consistent with the voiceprint characteristics of the second segmented audio data, the conference record processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
In another possible implementation manner, if the identification result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the conference recording processing device determines a speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint characteristics corresponding to the first segmented audio data, and the speaker identity information and voiceprint characteristics corresponding to the second segmented audio data, where the second segmented audio data is obtained by performing voice segmentation on the audio data by the conference recording processing device, and the second segmented audio data corresponds to at least two pieces of speaker identity information. That is, the conference recording processing device can comprehensively judge the speakers corresponding to the segmented audio data according to the voiceprint characteristics of the segmented audio data and the corresponding speaker identity information. In this embodiment, the first segment of audio data and the second segment of audio data are both obtained by the conference recording processing apparatus through voice segmentation. Specifically, refer to a meeting record of a current meeting as shown in table 1:
TABLE 1
(Table 1 is provided as an image in the original publication; its rows are described in the paragraph below.)
According to the contents shown in lines 1 to 3 of the table above, the voiceprint feature and the speaker have a unique correspondence, so the speakers corresponding to the audio data shown in lines 1 to 3 can be determined directly. When the identity recognition result yields a plurality of speakers, the recording and broadcasting server may jointly analyze the voiceprint features of the segmented audio data, the voiceprint features of segmented audio data whose speakers have already been determined, and the identity recognition result, to obtain the speaker corresponding to the segmented audio data. As shown in line 4, the identifiers include the user ID User03, the body attribute ID body04, and the voiceprint feature VP04. In this case it may be that the speaker indicated by body04 was looking down at a script while User03 happened to face the camera of the video conference terminal, so body04 and User03 could not be separated when the image information was captured. According to line 3, the voiceprint feature corresponding to User03 is VP03; therefore, since the voiceprint feature in line 4 is VP04, the speaker in line 4 can be determined to be body04 rather than User03, and the voiceprint feature corresponding to body04 is VP04. Similarly, a unique speaker can be determined for the contents shown in lines 5 and 8. For the contents of lines 6, 7 and 9, User05 and User06 cannot be distinguished and the voiceprint features cannot be distinguished either, so the speaker cannot be uniquely identified. For the contents of lines 10 and 11, the voiceprint feature is VP07 in both, and the corresponding candidate speaker identities have the unique intersection User07. In this case it may be that the speaker indicated by User07 spoke in both the time periods shown in lines 10 and 11; User08 happened to face the camera of the video conference terminal in the period shown in line 10 but was excluded from the image captured in the period shown in line 11, while User06 happened to face the camera in the period shown in line 11 but was excluded from the image captured in the period shown in line 10. Therefore, it can be inferred from lines 10 and 11 that the speaker corresponding to voiceprint feature VP07 is User07.
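The reasoning applied to Table 1 above can be sketched as a simple fixed-point computation: a voiceprint binds to a speaker once its candidate pool, after intersecting the candidate sets of all segments sharing that voiceprint and removing speakers already bound elsewhere, shrinks to one, and the process repeats until nothing changes. The data shapes below are illustrative assumptions.

```python
def resolve_speakers(segments: list) -> dict:
    """segments: dicts {"vp": voiceprint_id, "candidates": set of speaker IDs}."""
    vp_to_speaker, changed = {}, True
    while changed:
        changed = False
        for seg in segments:
            if seg["vp"] in vp_to_speaker:
                continue
            # Candidates common to every segment sharing this voiceprint...
            pool = set.intersection(*[s["candidates"] for s in segments
                                      if s["vp"] == seg["vp"]])
            # ...minus speakers already bound to a different voiceprint.
            pool -= set(vp_to_speaker.values())
            if len(pool) == 1:
                vp_to_speaker[seg["vp"]] = pool.pop()
                changed = True
    return vp_to_speaker
```

With the Table 1 data, VP03 binds to User03 from its unique candidate in line 3, which removes User03 from the line-4 pool and leaves body04; lines 10 and 11 intersect to User07; lines 6, 7 and 9 remain unresolved, matching the analysis above.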
If the only speaker cannot be determined by the above method, the recording and broadcasting server can compare the voiceprint feature and the identification result of the current conference with the long-time voiceprint feature record of the conference place for further judgment. The recording and broadcasting server compares the voiceprint feature of the first speaker in the current conference of the first meeting place with the voiceprint feature of the first speaker in the long-time voiceprint feature record, and the first speaker is the speaker who has determined the corresponding relation with the segmented audio data in the current conference of the first meeting place; if the voiceprint feature of the first speaker in the current conference of the first meeting place is consistent with the voiceprint feature of the first speaker in the long-time voiceprint feature record, the recording and playing server compares the voiceprint feature corresponding to the first segmented audio data with the voiceprint feature in the long-time voiceprint feature record of the first meeting place to determine the speaker corresponding to the first segmented audio data. Specifically, refer to an exemplary long-term voiceprint profile as shown in table 2:
TABLE 2
(Table 2 is provided as images in the original publication; its rows are referenced in the paragraphs below.)
Assuming conference Conf02 corresponds to the analysis result shown in Table 1 above, the recording and broadcasting server can compare against the most recent voiceprint feature of User01 in conference room Site01. If the comparison shows that the difference between the voiceprint features of User01 in the two conferences meets the threshold requirement, the recording and broadcasting server can determine that the channels of the two conferences held in conference room Site01 are consistent, and therefore that the long-time voiceprint feature record can be used as a reference. For example, when line 7 of Table 2 shows the voiceprint feature VP05 with candidate speakers User05 and User08, and line 3 of Table 2 shows the voiceprint feature VP05 with candidate speakers User05, User06 and User07, the occurrences of each candidate speaker under the same channel identifier can be counted, and the single speaker with the most occurrences, User05, can be taken as the speaker corresponding to voiceprint feature VP05. After the speaker of voiceprint feature VP05 is determined, it can further be determined that the speaker corresponding to voiceprint feature VP06 in Table 1 is User06.
If there is another conference Conf03 in which User01 and a voiceprint feature corresponding to User01 also appear, the recording and broadcasting server compares the voiceprint features of User01 in Conf01 and Conf03. If the comparison shows that the difference between the voiceprint features of User01 in the two conferences does not meet the threshold requirement, the recording and broadcasting server can register the voiceprint features and speaker information from Conf03 and update the long-time voiceprint feature record; the specific form may be as shown in rows 8 to 10 of Table 2. It can be understood that the channel corresponding to a conference changes when, for example, the conference room changes or the devices involved in the conference change. As shown in Table 2, since the channel identifiers of Conf03 and Conf02 differ although the conference room is the same (conference room Site01), it may be that the video conference terminal changed between Conf02 and Conf03, or that the multipoint control unit changed.
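The following is a minimal sketch of this long-time record logic: an anchor speaker already resolved in the current conference is compared against the stored record for the same conference site; a match means the channel is unchanged and the history may be consulted, while a mismatch means the channel changed and the current voiceprints are registered as new entries. The record layout, similarity function, and threshold are illustrative assumptions.

```python
import numpy as np

def vp_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def consult_or_update(record: list, site: str, conf_id: str,
                      anchor_user: str, anchor_vp: np.ndarray,
                      new_entries: list, thresh: float = 0.75) -> list:
    """record: dicts {site, conf, user, vp}. Returns usable history, or []
    after registering new entries when the channel is judged to have changed."""
    history = [e for e in record
               if e["site"] == site and e["user"] == anchor_user]
    if history and vp_similarity(history[-1]["vp"], anchor_vp) >= thresh:
        return [e for e in record if e["site"] == site]   # channel consistent
    for user, vp in new_entries:                          # channel changed: register
        record.append({"site": site, "conf": conf_id, "user": user, "vp": vp})
    return []
```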
It is understood that the conference record processing apparatus may perform the long-term conference analysis (i.e., the analysis shown in table 2) after the short-term conference analysis (i.e., the analysis shown in table 1), or may perform the short-term conference analysis after the long-term conference analysis, as long as the audio data can be finally distinguished, and the specific operation manner is not limited herein.
211. And the recording and broadcasting server sends the audio data and the classification result of the audio data to the ASR server.
After the recording and broadcasting server completes the matching of the audio data and the speaker, the classification result and the audio data are sent to the ASR server.
212. The ASR server outputs the audio data as text.
In this embodiment, the video conference terminal collects corresponding image information according to sound source positioning, performs preliminary portrait recognition on the image information to obtain an identity recognition result, and then the recording and broadcasting server combines the identity recognition result with voiceprint features to further recognize audio data after obtaining the identity recognition result, so that accurate classification of voice data can be realized without pre-registering the voiceprint features of users.
It can be understood that the functions of the recording and playing server can also be integrated in the multipoint control unit, and specifically referring to fig. 4, an embodiment of the method for processing audio data in the embodiment of the present application includes:
Steps 401 to 405 are the same as steps 201 to 205 in the above embodiment, and are not described again.
406. The video conference terminal sends the audio code stream and the identity recognition result to the multipoint control unit.
The manner of sending the audio code stream may refer to step 206. The difference is that in this step the identity recognition result is also sent to the multipoint control unit.
407. The multipoint control unit decodes the audio code stream to obtain audio data, and performs voice segmentation on the audio data to obtain segmented audio data.
After acquiring the audio code stream, the multipoint control unit determines the conference site to which the video conference terminal belongs according to the conference identifier allocated to the video conference terminal, decodes the audio code stream to obtain the audio data, and then stores the audio data according to the conference site identifier. Meanwhile, the multipoint control unit performs voice segmentation on the audio data according to the sound source position of the audio data and the human voice detection technique; the specific segmentation manner may be the same as step 208 in the above embodiment and is not described here again. The specific implementation of steps 408 to 410 refers to steps 209 to 211, except that steps 408 to 410 are performed by the multipoint control unit, whereas steps 209 to 211 are performed by the recording and broadcasting server.
It can be understood that the function of the recording and playing server can also be implemented in a video conference terminal, and specifically, referring to fig. 5, an embodiment of the method for processing audio data in the embodiment of the present application includes:
Steps 501 and 502 are the same as steps 201 and 202 in the above embodiment, and are not described again.

503. The video conference terminal performs voice segmentation on the audio data through human voice detection and sound source localization to obtain segmented audio data.
The manner in which the video conference terminal performs voice segmentation may refer to step 208 above, and details are not described here.
504 and 505 are the same as 204 and 205 in the above embodiment, and will not be described again.
The implementation of steps 506 to 508 is similar to that of steps 209 to 211, except that steps 506 to 508 are executed by the video conference terminal, whereas steps 209 to 211 are executed by the recording and broadcasting server.
509. The ASR server outputs the audio data as text.
In this embodiment, the video conference terminal collects corresponding image information according to sound source positioning, performs preliminary portrait recognition on the image information to obtain an identity recognition result, and then the video conference terminal combines the identity recognition result with voiceprint features to further recognize audio data, so that accurate classification of voice data can be realized without pre-registering the voiceprint features of users.
Specifically, referring to fig. 6, an embodiment of a method for processing audio data in the embodiment of the present application includes:
601. the conference recording processing device acquires audio data of a first conference place, sound source azimuth information corresponding to the audio data and an identity recognition result, wherein the identity recognition result is used for indicating the corresponding relation between the speaker identity information obtained by the human image recognition method and the speaker speaking time information.
The conference record processing apparatus may be the recording and broadcasting server in the embodiment of the method shown in fig. 2, the multipoint control unit in the embodiment of the method shown in fig. 4, or the video conference terminal in the embodiment of the method shown in fig. 5.
In an application scenario, when the conference recording processing apparatus is the recording and playing server in the embodiment of the method shown in fig. 2, the conference recording processing apparatus receives the audio data sent by the multipoint control unit and the sound source direction information corresponding to the audio data. The audio data and the sound source azimuth information corresponding to the audio data may be packaged to generate an audio code stream and additional domain information, where the additional domain information includes the sound source azimuth information corresponding to the audio data. In an exemplary scheme, the video conference terminal encodes the audio data into an audio code stream, and then adds additional domain information to the corresponding audio code stream, and indicates the sound source orientation information corresponding to the audio data by using the additional domain information. And then the video conference terminal sends the audio code stream to the multipoint control unit, and the multipoint control unit determines a conference place to which the video conference terminal belongs according to the conference identifier allocated to the video conference terminal after receiving the audio code stream, then adds the conference place identifier in the audio code stream, and sends the audio code stream to the recording and playing server. In one possible implementation manner, the multipoint control unit may filter audio data of each conference site, and then select one or more audio data of the conference sites to send to the recording and playing server. The multipoint control unit can compare the volume of the audio data of each meeting place and select the audio data with the volume larger than a preset threshold value to forward; or, the multipoint control unit may determine, through an algorithm, that the voice data whose voice duration exceeds a preset threshold is forwarded. The specific screening conditions are not limited herein. This can reduce the amount of processing and thus increase the processing speed. And the identity recognition result is obtained by the video conference terminal according to sound source positioning and portrait recognition and is directly sent to the recording and broadcasting server by the video conference terminal.
In another application scenario, when the conference recording processing apparatus is the multipoint control unit in the embodiment of the method shown in fig. 4, the conference recording processing apparatus receives the audio data sent by the video conference terminal and the sound source direction information corresponding to the audio data. The audio data and the sound source azimuth information corresponding to the audio data may be packaged to generate an audio code stream and additional domain information, where the additional domain information includes the sound source azimuth information corresponding to the audio data. In an exemplary scheme, the video conference terminal encodes the audio data into an audio code stream, and then adds additional domain information to the corresponding audio code stream, and indicates the sound source orientation information corresponding to the audio data by using the additional domain information. And then the video conference terminal sends the audio code stream to the multipoint control unit. And the identity recognition result is obtained by the video conference terminal according to sound source positioning and portrait recognition and is sent to the multipoint control unit by the video conference terminal.
In another application scenario, when the conference recording processing apparatus is the video conference terminal in the embodiment of the method shown in fig. 5, the conference recording processing apparatus directly collects audio data in a current conference through a microphone, and acquires sound source azimuth information corresponding to the audio data according to a sound source positioning technology. And the identification result is obtained by the video conference terminal according to sound source positioning and portrait identification.
602. The conference recording processing device performs voice segmentation on the audio data to acquire first segmented audio data of the audio data.
The conference recording processing device segments the audio data according to the sound source azimuth information and the human voice detection method to obtain a plurality of segmented audio data of the audio data.
603. And the conference record processing device determines the speaker corresponding to the first segmented audio data according to the voiceprint characteristics of the first segmented audio data and the identification result.
The conference record processing apparatus may execute the method shown in step 210 in fig. 2, step 409 in fig. 4, or step 507 in fig. 5 to obtain the speaker corresponding to the audio data, which is not described herein again in detail.
In this embodiment, the conference recording processing apparatus obtains an identity recognition result indicating a correspondence between speaker identity information and speaker time information, and then the conference recording processing apparatus combines the identity recognition result with voiceprint features to further recognize audio data, so that accurate classification of voice data can be realized without pre-registering voiceprint features of users.
The above describes a processing method of audio data in the embodiment of the present application, and a conference recording processing apparatus and a video conference terminal in the embodiment of the present application are described below.
Specifically, referring to fig. 7, a conference record processing apparatus 700 according to an embodiment of the present application includes: an acquisition module 701 and a processing module 702, wherein the acquisition module 701 and the processing module 702 are connected by a bus. The conference record processing apparatus 700 may be a recording and playing server in the method embodiment shown in fig. 2, a multipoint control unit in the method embodiment shown in fig. 4, or a video conference terminal in the method embodiment shown in fig. 5, and may also be configured as one or more chips in the above-mentioned device. The conference recording handling apparatus 700 may be used to perform some or all of the functions of the devices described above.
For example, the obtaining module 701 obtains audio data of a first meeting place, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result is used to indicate a correspondence between speaker identity information obtained by a portrait recognition method and speaker speaking time information; the processing module 702 performs voice segmentation on the audio data to obtain first segmented audio data of the audio data; and determining a speaker corresponding to the first segmented audio data according to the voiceprint characteristics of the first segmented audio data and the identification result.
Optionally, the audio data is contained in an audio code stream, and the audio code stream further includes additional domain information, where the additional domain information includes sound source azimuth information corresponding to the audio data.
Optionally, the processing module 702 is specifically configured to determine, if the identification result indicates that the first segmented audio data corresponds to unique speaker identity information, a speaker corresponding to the first segmented audio data according to the speaker identity information.
Optionally, the processing module 702 is specifically configured to compare a voiceprint feature of the first segment of audio data with a voiceprint feature of second segment of audio data if the identity recognition result indicates that the first segment of audio data corresponds to at least two pieces of speaker identity information, where the second segment of audio data is obtained by performing voice segmentation on the audio data by the conference recording processing apparatus, and the second segment of audio data corresponds to only one piece of speaker identity information; and if the voiceprint characteristics of the first segmented audio data are consistent with the voiceprint characteristics of the second segmented audio data, determining the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
Optionally, the processing module 702 is specifically configured to, if the identification result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, determine, according to the speaker identity information and the voiceprint feature corresponding to the first segmented audio data, and the speaker identity information and the voiceprint feature corresponding to the second segmented audio data, a speaker corresponding to the first segmented audio data, where the second segmented audio data is obtained by performing voice segmentation on the audio data by the conference recording processing apparatus, and the second segmented audio data corresponds to at least two pieces of speaker identity information.
Optionally, the processing module 702 is specifically configured to determine, according to the voiceprint feature corresponding to the first segment of audio data, the identification result, and the long-term voiceprint feature record of the first meeting place, the speaker corresponding to the first segment of audio data, where the long-term voiceprint feature record includes the historical voiceprint feature record of the first meeting place, and the historical voiceprint feature record of the first meeting place is used to indicate a correspondence relationship between the voiceprint feature, the speaker, and the channel identifier.
Optionally, the processing module 702 is specifically configured to compare a voiceprint feature of a first speaker in the current conference of the first conference room with a voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, where the first speaker is a determined speaker in the current conference of the first conference room; if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first meeting place is consistent with the voiceprint feature of the first speaker in the long-time voiceprint feature record, comparing the voiceprint feature corresponding to the first segmented audio data, the identity recognition result and the long-time voiceprint feature record of the first meeting place to determine the speaker corresponding to the first segmented audio data.
Optionally, the processing module 702 is further configured to compare the voiceprint feature of the first speaker in the current conference of the first conference room with the voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, where the first speaker is the determined speaker in the current conference of the first conference room; if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first meeting place is inconsistent with the voiceprint feature of the first speaker in the long-time voiceprint feature record, registering the voiceprint feature, the channel identifier and the speaker corresponding to the voiceprint feature in the current conference of the first meeting place, and updating the long-time voiceprint feature record.
Optionally, the conference record processing apparatus 700 further includes a storage module, which is coupled to the processing module, so that the processing module can execute the computer execution instructions stored in the storage module to implement the functions of the conference record processing apparatus in the above-described method embodiment. In one example, the memory module optionally included in the conference recording processing apparatus 700 may be a memory unit within a chip, such as a register, a cache, and the like, and the memory module may also be a memory unit located outside the chip, such as a ROM or other types of static memory devices that can store static information and instructions, a RAM, and the like.
It should be understood that the flow executed between the modules of the conference record processing apparatus in the embodiment corresponding to fig. 7 is similar to the flow executed by the conference record processing apparatus in the corresponding method embodiment in fig. 2 to fig. 6, and details thereof are not repeated here.
Fig. 8 shows a schematic possible structure diagram of a conference record processing apparatus 800 in the above embodiment, where the conference record processing apparatus 800 may be configured as the aforementioned recording server in the method embodiment shown in fig. 2, the aforementioned multipoint control unit in the method embodiment shown in fig. 4, or the aforementioned video conference terminal in the method embodiment shown in fig. 5. The conference record processing apparatus 800 may include: a processor 802, a computer-readable storage medium/memory 803, a transceiver 804, an input device 805, and an output device 806, and a bus 801. Wherein the processor, transceiver, computer readable storage medium, etc. are connected by a bus. The embodiments of the present application do not limit the specific connection medium between the above components.
In one example, the transceiver 804 obtains audio data of a first meeting place, sound source azimuth information corresponding to the audio data, and an identity recognition result, where the identity recognition result is used to indicate a correspondence between speaker identity information obtained by a portrait recognition method and speaker speaking time information;
the processor 802 performs voice segmentation on the audio data to obtain first segmented audio data of the audio data; and determining a speaker corresponding to the first segmented audio data according to the voiceprint characteristics of the first segmented audio data and the identification result.
In one example, the processor 802 may include baseband circuitry, e.g., may modulate audio data and generate an audio bitstream. The transceiver 804 may include a radio frequency circuit, so as to perform modulation and amplification on the audio code stream, and then send the audio code stream to a corresponding device in the conference system.
In yet another example, the processor 802 may run an operating system that controls functions between various devices and appliances. The transceiver 804 may include a baseband circuit and a radio frequency circuit, for example, the audio code stream or the identification result may be processed by the baseband circuit and the radio frequency circuit and then sent to a corresponding device in the conference system.
The transceiver 804 and the processor 802 may implement corresponding steps in any of the embodiments of fig. 2 to fig. 6, which are not described herein in detail.
It is understood that fig. 8 only shows a simplified design of the conference record processing apparatus, and in practical applications, the conference record processing apparatus may include any number of transceivers, processors, memories, etc., and all conference record processing apparatuses that can implement the present application are within the scope of the present application.
The processor 802 involved in the apparatus 800 may be a general-purpose processor, such as a CPU, a network processor (NP) or a microprocessor, or may be an ASIC or one or more integrated circuits for controlling the execution of the program of the present solution. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The controller/processor can also be a combination of devices implementing computing functions, for example a combination of a DSP and one or more microprocessors. The processor typically performs logical and arithmetic operations based on program instructions stored in the memory.
The bus 801 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
The computer-readable storage medium/memory 803 referred to above may also hold an operating system and other application programs. In particular, the program may include program code including computer operating instructions. More specifically, the memory may be ROM, other types of static storage devices that may store static information and instructions, RAM, other types of dynamic storage devices that may store information and instructions, disk storage, and so forth. The memory 803 may be a combination of the above memory types. And the computer-readable storage medium/memory described above may be in the processor, may be external to the processor, or distributed across multiple entities including the processor or processing circuitry. The computer-readable storage medium/memory described above may be embodied in a computer program product. By way of example, a computer program product may include a computer-readable medium in packaging material.
Alternatively, embodiments of the present application also provide a general-purpose processing system, such as that commonly referred to as a chip, including one or more microprocessors that provide processor functionality; and an external memory providing at least a portion of the storage medium, all connected together with other supporting circuitry through an external bus architecture. The memory stores instructions that, when executed by the processor, cause the processor to perform some or all of the steps of the data transmission method of the first communication device in the embodiment of fig. 2-6, and/or other processes for the techniques described herein.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may be located in the conference recording processing apparatus. Of course, the processor and the storage medium may also reside as discrete components in the conference recording processing apparatus.
Specifically, referring to fig. 9, in the embodiment of the present application, the video conference terminal 900 includes: a processing module 901 and a sending module 902, wherein the processing module 901 and the sending module 902 are connected by a bus. The video conference terminal 900 may be the video conference terminal in the above method embodiment, and may also be configured as one or more chips in the video conference terminal. The video conference terminal 900 may be configured to perform some or all of the functionality of the video conference terminal described above.
For example, the processing module 901 performs sound source localization on the audio data of the first meeting place to obtain sound source azimuth information corresponding to the audio data, and obtains an identity recognition result according to the sound source direction and the portrait recognition method, where the identity recognition result is used for indicating the correspondence between the identity information of the speaker and the speaking time information; the sending module 902 sends the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to the conference record processing apparatus.
Optionally, the processing module 901 is specifically configured to obtain portrait information corresponding to the sound source position; perform image recognition on the portrait information to obtain face information and/or body attribute information; determine the identity information of the speaker according to the face information and/or the body attribute information; and establish a correspondence between the speaking time information and the speaker identity information to obtain the identity recognition result.
Optionally, the video conference terminal 900 further includes a storage module, which is coupled to the processing module, so that the processing module can execute computer execution instructions stored in the storage module to implement the functions of the video conference terminal in the above-described method embodiments. In one example, the memory module optionally included in the video conference terminal 900 may be a memory unit on a chip, such as a register, a cache, and the like, and the memory module may also be a memory unit located outside the chip, such as a ROM or other types of static memory devices that can store static information and instructions, a RAM, and the like.
It should be understood that the flow executed between the modules of the video conference terminal in the embodiment corresponding to fig. 9 is similar to the flow executed by the video conference terminal in the corresponding method embodiment in fig. 2 to fig. 6, and details thereof are not repeated here.
Fig. 10 shows a possible structure diagram of a video conference terminal 1000 in the above embodiment, and the video conference terminal 1000 may be configured as the video conference terminal. The video conference terminal 1000 may include: a processor 1002, a computer-readable storage medium/memory 1003, a transceiver 1004, an input device 1005, and an output device 1006, and a bus 1001. Wherein the processor, transceiver, computer readable storage medium, etc. are connected by a bus. The embodiments of the present application do not limit the specific connection medium between the above components.
In an example, the processor 1002 performs sound source localization on audio data of a first meeting place to obtain sound source azimuth information corresponding to the audio data; obtaining an identity recognition result according to the sound source direction and the portrait recognition method, wherein the identity recognition result is used for indicating the corresponding relation between the identity information of the speaker and the speaking time information;
the transceiver 1004 transmits the identification result and the audio data to the conference record processing apparatus.
In one example, the processor 1002 may include baseband circuitry, e.g., may modulate audio data and generate an audio bitstream. The transceiver 1004 may include a radio frequency circuit, so as to perform modulation and amplification on the audio code stream, and then transmit the audio code stream to a corresponding device in the conference system.
In yet another example, the processor 1002 may run an operating system that controls functions between various devices and appliances. The transceiver 1004 may include a baseband circuit and a radio frequency circuit, for example, the audio code stream or the identification result may be processed by the baseband circuit and the radio frequency circuit and then sent to a corresponding device in the conference system.
The transceiver 1004 and the processor 1002 may implement corresponding steps in any one of the embodiments of fig. 2 to fig. 6, which are not described herein in detail.
It is understood that fig. 10 only shows a simplified design of the video conference terminal, and in practical applications, the video conference terminal may include any number of transceivers, processors, memories, etc., and all video conference terminals that can implement the present application are within the scope of the present application.
The processor 1002 involved in the apparatus 1000 may be a general-purpose processor, such as a CPU, a network processor (NP) or a microprocessor, or may be an ASIC or one or more integrated circuits for controlling the execution of the program of the present solution. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The controller/processor can also be a combination of devices implementing computing functions, for example a combination of a DSP and one or more microprocessors. The processor typically performs logical and arithmetic operations based on program instructions stored in the memory.
The bus 1001 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
The computer-readable storage medium/memory 1003 referred to above may also hold an operating system and other application programs. In particular, the program may include program code including computer operating instructions. More specifically, the memory may be ROM, other types of static storage devices that may store static information and instructions, RAM, other types of dynamic storage devices that may store information and instructions, disk storage, and so forth. The memory 1003 may be a combination of the above memory types. And the computer-readable storage medium/memory described above may be in the processor, may be external to the processor, or distributed across multiple entities including the processor or processing circuitry. The computer-readable storage medium/memory described above may be embodied in a computer program product. By way of example, a computer program product may include a computer-readable medium in packaging material.
Alternatively, embodiments of the present application also provide a general-purpose processing system, such as that commonly referred to as a chip, including one or more microprocessors that provide processor functionality; and an external memory providing at least a portion of the storage medium, all connected together with other supporting circuitry through an external bus architecture. The memory stores instructions that, when executed by the processor, cause the processor to perform some or all of the steps of the data transmission method of the first communication device in the embodiment of fig. 2-6, and/or other processes for the techniques described herein.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a video conferencing terminal. Of course, the processor and the storage medium may also reside as discrete components in a video conferencing terminal.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the unit is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (25)

1. A method of processing audio data, comprising:
the conference recording processing device acquires audio data of a first conference site in a current conference, sound source azimuth information corresponding to the audio data, and an identity recognition result, wherein the identity recognition result is used for indicating a correspondence between speaker identity information obtained by a portrait recognition method and speaker speaking time information;
the conference recording processing device performs voice segmentation on the audio data to obtain first segmented audio data of the audio data;
and the conference recording processing device determines a speaker corresponding to the first segmented audio data according to a voiceprint feature of the first segmented audio data and the identity recognition result.
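For orientation only (this sketch is not part of the claims): the flow of claim 1 can be pictured as below. The data structures, the helper names, and the time-overlap test are illustrative assumptions; a real implementation would obtain voiceprint features from a trained speaker-embedding model and the actual time-stamped identity results.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Segment:
        """One piece of segmented audio data produced by voice segmentation."""
        start: float               # seconds from conference start
        end: float
        voiceprint: List[float]    # voiceprint feature (embedding), model-dependent

    @dataclass
    class IdentityRecord:
        """One entry of the identity recognition result: who spoke, and when."""
        speaker_id: str            # identity obtained by portrait recognition
        start: float
        end: float

    def overlapping_identities(seg: Segment, result: List[IdentityRecord]) -> List[str]:
        """Identities whose speaking time overlaps the segment's time span."""
        return [r.speaker_id for r in result if r.start < seg.end and r.end > seg.start]

    def assign_speaker(seg: Segment, result: List[IdentityRecord]) -> Optional[str]:
        """Claims 1 and 3: if the identity recognition result ties the segment
        to exactly one identity, use it; otherwise fall back on voiceprint
        comparison (claims 4 and 5)."""
        candidates = overlapping_identities(seg, result)
        if len(candidates) == 1:
            return candidates[0]
        return None  # ambiguous or empty: resolve via voiceprint features

    seg = Segment(0.0, 5.0, [0.9, 0.1])
    ids = [IdentityRecord("alice", 0.0, 6.0)]
    print(assign_speaker(seg, ids))  # alice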
2. The method of claim 1, wherein the audio data is contained in an audio bitstream, the audio bitstream further comprises additional field information, and the additional field information comprises the sound source azimuth information corresponding to the audio data.
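The claims do not specify the bitstream layout; as an illustration only, the following sketch assumes a hypothetical frame format in which the additional field carries a 16-bit sound source azimuth (in degrees) ahead of the encoded audio payload.

    import struct

    # Hypothetical layout (an assumption, not defined by the claims):
    # 2-byte payload length | 2-byte sound source azimuth in degrees | audio payload
    FRAME_HEADER = struct.Struct(">HH")

    def parse_frame(frame: bytes):
        """Split one frame into its sound source azimuth and audio payload."""
        payload_len, azimuth_deg = FRAME_HEADER.unpack_from(frame, 0)
        payload = frame[FRAME_HEADER.size:FRAME_HEADER.size + payload_len]
        return azimuth_deg, payload

    # Example: four bytes of dummy audio tagged with a 270-degree azimuth.
    frame = FRAME_HEADER.pack(4, 270) + b"\x01\x02\x03\x04"
    print(parse_frame(frame))  # -> (270, b'\x01\x02\x03\x04')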
3. The method of claim 1 or 2, wherein the conference recording processing device determining the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result comprises:
if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information, the conference recording processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information.
4. The method of claim 1 or 2, wherein the conference recording processing device determining the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result comprises:
if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the conference recording processing device compares the voiceprint feature of the first segmented audio data with a voiceprint feature of second segmented audio data, wherein the second segmented audio data is obtained by the conference recording processing device performing voice segmentation on the audio data, and the second segmented audio data corresponds to unique speaker identity information;
and if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, the conference recording processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
5. The method of claim 1 or 2, wherein the conference recording processing device determining the speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result comprises:
if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, the conference recording processing device determines the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data and the speaker identity information and voiceprint feature corresponding to second segmented audio data, wherein the second segmented audio data is obtained by the conference recording processing device performing voice segmentation on the audio data, and the second segmented audio data corresponds to at least two pieces of speaker identity information.
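Claims 4 and 5 compare voiceprint features but leave the "consistent" criterion to the implementation. A minimal sketch, assuming cosine similarity and an arbitrary threshold, of how a segment tied to several identities can inherit the speaker of a uniquely attributed segment:

    import math

    def cosine_similarity(a, b):
        """Cosine similarity between two voiceprint feature vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm if norm else 0.0

    MATCH_THRESHOLD = 0.8  # assumed; the claims do not fix a criterion

    def resolve_ambiguous(ambiguous_vp, unique_segments):
        """unique_segments: (speaker_id, voiceprint) pairs for second segmented
        audio data that corresponds to unique speaker identity information."""
        best_speaker, best_score = None, MATCH_THRESHOLD
        for speaker_id, vp in unique_segments:
            score = cosine_similarity(ambiguous_vp, vp)
            if score > best_score:
                best_speaker, best_score = speaker_id, score
        return best_speaker  # None if no uniquely attributed segment matches

    print(resolve_ambiguous([0.9, 0.1],
                            [("alice", [0.88, 0.12]), ("bob", [0.1, 0.95])]))  # alice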
6. The method according to any one of claims 1 to 5, wherein if the conference recording processing device does not determine a unique speaker corresponding to the first segmented audio data according to the voiceprint feature of the first segmented audio data and the identity recognition result, the method further comprises:
the conference recording processing device determines the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, wherein the long-term voiceprint feature record comprises a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used for indicating a correspondence among voiceprint features, speakers, and channel identifiers.
7. The method of claim 6, wherein the conference recording processing device determining the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site comprises:
the conference recording processing device compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record, wherein the first speaker is a speaker already determined in the current conference of the first conference site;
and if the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the conference recording processing device determines the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site.
8. The method of claim 6, further comprising:
the conference recording processing device compares the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, wherein the first speaker is a speaker already determined in the current conference of the first conference site;
and if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is inconsistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, the conference recording processing device registers the voiceprint feature in the current conference of the first conference site, the channel identifier, and the speaker corresponding to the voiceprint feature, and updates the long-term voiceprint feature record.
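One possible shape for the long-term voiceprint feature record of claims 6 to 8. The in-memory mapping keyed by channel identifier is an assumption; the claims leave persistence and the consistency criterion open.

    # channel_id -> {"speaker": ..., "voiceprint": ...}; in-memory for illustration.
    long_term_record = {}

    def register_or_update(channel_id, speaker, current_vp, is_consistent):
        """is_consistent is the outcome of a voiceprint comparison (for
        instance the cosine-similarity test sketched above)."""
        if channel_id not in long_term_record or not is_consistent:
            # Claim 8: the current voiceprint differs from the history (or none
            # exists), so register the voiceprint, channel identifier and
            # speaker, replacing any stale entry, thereby updating the record.
            long_term_record[channel_id] = {"speaker": speaker,
                                            "voiceprint": list(current_vp)}
        return long_term_record[channel_id]

    # Claim 7 would, conversely, rely on a *consistent* historical entry to
    # attribute segments that the current conference could not resolve alone.
    register_or_update("site1-ch0", "alice", [0.9, 0.1], is_consistent=False)
    print(long_term_record)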
9. A method of processing audio data, comprising:
the video conference terminal performs sound source localization on audio data of a first conference site to obtain sound source azimuth information corresponding to the audio data;
the video conference terminal acquires an identity recognition result according to the sound source azimuth information and a portrait recognition method, wherein the identity recognition result is used for indicating a correspondence between speaker identity information and speaker speaking time information;
and the video conference terminal sends the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to a conference recording processing device.
10. The method of claim 9, wherein the video conference terminal acquiring the identity recognition result according to the sound source azimuth information and the portrait recognition method comprises:
the video conference terminal acquires portrait information corresponding to the sound source azimuth;
the video conference terminal performs image recognition on the portrait information to obtain face information and/or body attribute information;
the video conference terminal determines the speaker identity information according to the face information and/or the body attribute information;
and the video conference terminal establishes a correspondence between the speaker speaking time information and the speaker identity information to obtain the identity recognition result.
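An illustrative end-to-end sketch of claim 10 on the terminal side; the recognizer and the returned identity string are placeholders, since the claims do not prescribe a concrete portrait recognition model.

    from dataclasses import dataclass

    @dataclass
    class IdentityResult:
        """Correspondence between a recognized identity and its speaking time."""
        speaker_id: str
        start: float
        end: float

    def recognize_speaker_at(azimuth_deg: float) -> str:
        """Placeholder for claim 10's recognition steps: crop the camera view
        around the sound source azimuth, then run face and/or body-attribute
        recognition on the portrait information."""
        return f"participant@{int(azimuth_deg)}deg"  # hypothetical identity

    def build_identity_result(azimuth_deg: float, start: float, end: float):
        # Claim 10's final step: bind the recognized identity to the speaking
        # time window; claim 9 then sends this result to the conference
        # recording processing device together with the audio data and azimuth.
        return IdentityResult(recognize_speaker_at(azimuth_deg), start, end)

    print(build_identity_result(45.0, 12.0, 19.5))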
11. A conference recording processing apparatus, comprising:
an acquisition module, configured to acquire audio data of a first conference site, sound source azimuth information corresponding to the audio data, and an identity recognition result, wherein the identity recognition result is used for indicating a correspondence between speaker identity information obtained by a portrait recognition method and speaker speaking time information;
and a processing module, configured to perform voice segmentation on the audio data to obtain first segmented audio data of the audio data, and to determine a speaker corresponding to the first segmented audio data according to a voiceprint feature of the first segmented audio data and the identity recognition result.
12. The apparatus of claim 11, wherein the audio data is contained in an audio bitstream, the audio bitstream further comprises additional field information, and the additional field information comprises the sound source azimuth information corresponding to the audio data.
13. The apparatus according to claim 11 or 12, wherein the processing module is specifically configured to determine the speaker corresponding to the first segmented audio data according to the speaker identity information if the identity recognition result indicates that the first segmented audio data corresponds to unique speaker identity information.
14. The apparatus according to claim 11 or 12, wherein the processing module is specifically configured to compare the voiceprint feature of the first segmented audio data with a voiceprint feature of second segmented audio data if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, wherein the second segmented audio data is obtained by the conference recording processing apparatus performing voice segmentation on the audio data, and the second segmented audio data corresponds to unique speaker identity information;
and, if the voiceprint feature of the first segmented audio data is consistent with the voiceprint feature of the second segmented audio data, to determine the speaker corresponding to the first segmented audio data according to the speaker identity information corresponding to the second segmented audio data.
15. The apparatus according to claim 11 or 12, wherein the processing module is specifically configured to, if the identity recognition result indicates that the first segmented audio data corresponds to at least two pieces of speaker identity information, determine the speaker corresponding to the first segmented audio data according to the speaker identity information and voiceprint feature corresponding to the first segmented audio data and the speaker identity information and voiceprint feature corresponding to second segmented audio data, wherein the second segmented audio data is obtained by the conference recording processing apparatus performing voice segmentation on the audio data, and the second segmented audio data corresponds to at least two pieces of speaker identity information.
16. The apparatus according to any one of claims 11 to 15, wherein the processing module is further configured to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and a long-term voiceprint feature record of the first conference site, wherein the long-term voiceprint feature record comprises a historical voiceprint feature record of the first conference site, and the historical voiceprint feature record of the first conference site is used for indicating a correspondence among voiceprint features, speakers, and channel identifiers.
17. The apparatus according to claim 16, wherein the processing module is specifically configured to compare the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, wherein the first speaker is a speaker already determined in the current conference of the first conference site;
and, if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is consistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, to determine the speaker corresponding to the first segmented audio data according to the voiceprint feature corresponding to the first segmented audio data, the identity recognition result, and the long-term voiceprint feature record of the first conference site.
18. The apparatus of claim 16, wherein the processing module is further configured to compare the voiceprint feature of a first speaker in the current conference of the first conference site with the voiceprint feature of the first speaker in the long-term voiceprint feature record to obtain a comparison result, wherein the first speaker is a speaker already determined in the current conference of the first conference site;
and, if the comparison result indicates that the voiceprint feature of the first speaker in the current conference of the first conference site is inconsistent with the voiceprint feature of the first speaker in the long-term voiceprint feature record, to register the voiceprint feature in the current conference of the first conference site, the channel identifier, and the speaker corresponding to the voiceprint feature, and to update the long-term voiceprint feature record.
19. A video conference terminal, comprising:
a processing module, configured to perform sound source localization on audio data of a first conference site to acquire sound source azimuth information corresponding to the audio data, and to acquire an identity recognition result according to the sound source azimuth information and a portrait recognition method, wherein the identity recognition result is used for indicating a correspondence between speaker identity information and speaker speaking time information;
and a sending module, configured to send the identity recognition result, the audio data, and the sound source azimuth information corresponding to the audio data to a conference recording processing device.
20. The video conference terminal according to claim 19, wherein the processing module is specifically configured to: acquire portrait information corresponding to the sound source azimuth; perform image recognition on the portrait information to obtain face information and/or body attribute information; determine the speaker identity information according to the face information and/or the body attribute information; and establish a correspondence between the speaker speaking time information and the speaker identity information to obtain the identity recognition result.
21. A conference recording processing apparatus, comprising at least one processor and a memory, wherein the processor is coupled to the memory and invokes instructions stored in the memory to control the apparatus to perform the method of any one of claims 1 to 8.
22. A video conference terminal, comprising at least one processor and a memory, wherein the processor is coupled to the memory and invokes instructions stored in the memory to control the terminal to perform the method of any one of claims 9 to 10.
23. A conference recording processing system, comprising the conference recording processing apparatus according to any one of claims 11 to 18, the video conference terminal according to claim 19 or 20, a multipoint control unit, and an automatic speech recognition (ASR) server.
24. A computer storage medium, storing computer instructions for performing the method of any one of claims 1 to 10.
25. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 10.
CN202011027427.2A 2020-09-25 2020-09-25 Audio data processing method, equipment and system Pending CN114333853A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011027427.2A CN114333853A (en) 2020-09-25 2020-09-25 Audio data processing method, equipment and system
PCT/CN2021/098297 WO2022062471A1 (en) 2020-09-25 2021-06-04 Audio data processing method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011027427.2A CN114333853A (en) 2020-09-25 2020-09-25 Audio data processing method, equipment and system

Publications (1)

Publication Number Publication Date
CN114333853A (en) 2022-04-12

Family

ID=80844861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011027427.2A Pending CN114333853A (en) 2020-09-25 2020-09-25 Audio data processing method, equipment and system

Country Status (2)

Country Link
CN (1) CN114333853A (en)
WO (1) WO2022062471A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117528335B * 2023-12-05 2024-05-28 Huizhou Hongxuanhe Technology Co., Ltd. Audio equipment applying directional microphone and noise reduction method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9053750B2 (en) * 2011-06-17 2015-06-09 At&T Intellectual Property I, L.P. Speaker association with a visual representation of spoken content
CN102968991B * 2012-11-29 2015-01-21 Huawei Technologies Co., Ltd. Method, device and system for sorting voice conference minutes
CN106782545B * 2016-12-16 2019-07-16 Guangzhou Shiyuan Electronics Technology Co., Ltd. System and method for converting audio and video data into a written record
CN106657865B * 2016-12-16 2020-08-25 Lenovo (Beijing) Co., Ltd. Conference summary generation method and device and video conference system
CN110022454B * 2018-01-10 2021-02-23 Huawei Technologies Co., Ltd. Method for identifying identity in video conference and related equipment
US11276407B2 (en) * 2018-04-17 2022-03-15 Gong.Io Ltd. Metadata-based diarization of teleconferences
US11152006B2 (en) * 2018-05-07 2021-10-19 Microsoft Technology Licensing, Llc Voice identification enrollment
EP3627505B1 (en) * 2018-09-21 2023-11-15 Televic Conference NV Real-time speaker identification with diarization
CN109560941A * 2018-12-12 2019-04-02 Shenzhen Water World Co., Ltd. Conference minutes method, apparatus, intelligent terminal and storage medium
CN110232925A * 2019-06-28 2019-09-13 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and conference terminal for generating conference minutes
CN111402892A * 2020-03-23 2020-07-10 Zhengzhou Zhilixin Information Technology Co., Ltd. Conference recording template generation method based on voice recognition

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023212879A1 (en) * 2022-05-05 2023-11-09 北京小米移动软件有限公司 Object audio data generation method and apparatus, electronic device, and storage medium
CN115019809A (en) * 2022-05-17 2022-09-06 中国南方电网有限责任公司超高压输电公司广州局 Method, apparatus, device, medium, and program product for preventing false entry into an interval
CN115019809B (en) * 2022-05-17 2024-04-02 中国南方电网有限责任公司超高压输电公司广州局 Method, apparatus, device, medium and program product for monitoring false entry prevention interval

Also Published As

Publication number Publication date
WO2022062471A1 (en) 2022-03-31

Similar Documents

Publication Publication Date Title
CN114333853A (en) Audio data processing method, equipment and system
US10917577B2 (en) Method and device for controlling camera shooting, smart device, and storage medium
US9595259B2 (en) Sound source-separating device and sound source-separating method
US11343446B2 (en) Systems and methods for implementing personal camera that adapts to its surroundings, both co-located and remote
WO2019140161A1 (en) Systems and methods for decomposing a video stream into face streams
US9165182B2 (en) Method and apparatus for using face detection information to improve speaker segmentation
CN110691204B (en) Audio and video processing method and device, electronic equipment and storage medium
US11405584B1 (en) Smart audio muting in a videoconferencing system
US20130107028A1 (en) Microphone Device, Microphone System and Method for Controlling a Microphone Device
CN109560941A (en) Minutes method, apparatus, intelligent terminal and storage medium
US20210124912A1 (en) Face recognition method and apparatus
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
US6959095B2 (en) Method and apparatus for providing multiple output channels in a microphone
CN111883168A (en) Voice processing method and device
TW200804852A (en) Method for tracking vocal target
CN114762039A (en) Conference data processing method and related equipment
CN112908336A (en) Role separation method for voice processing device and voice processing device thereof
CN113301291B (en) Anti-interference method, system, equipment and storage medium in network video conference
US20220215852A1 (en) Sound pickup device and sound pickup method
CN112543302B (en) Intelligent noise reduction method and equipment in multi-person teleconference
CN113259734B (en) Intelligent broadcasting guide method, device, terminal and storage medium for interactive scene
US11783837B2 (en) Transcription generation technique selection
CN111182256A (en) Information processing method and server
TWI798867B (en) Video processing method and associated system on chip
TW202405796A (en) Video processing method for performing partial highlighting with aid of auxiliary information detection, and associated system on chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination