WO2022001801A1 - Role separation method, meeting minutes recording method, role display method, apparatus, electronic device and computer storage medium - Google Patents

Role separation method, meeting minutes recording method, role display method, apparatus, electronic device and computer storage medium

Info

Publication number
WO2022001801A1
WO2022001801A1 PCT/CN2021/101956 CN2021101956W WO2022001801A1 WO 2022001801 A1 WO2022001801 A1 WO 2022001801A1 CN 2021101956 W CN2021101956 W CN 2021101956W WO 2022001801 A1 WO2022001801 A1 WO 2022001801A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
separated
sound source
role
voice
Prior art date
Application number
PCT/CN2021/101956
Other languages
English (en)
French (fr)
Inventor
郑斯奇
王宪亮
索宏彬
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Publication of WO2022001801A1
Priority to US18/090,296 (published as US20230162757A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/406 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers, microphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00 Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/40 Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups

Definitions

  • Embodiments of the present invention relate to the field of speech processing, and in particular, to a method for separating roles, a method for recording meeting minutes, a method for displaying roles, an apparatus, an electronic device, and a computer storage medium.
  • Speaker role separation is an important step in conference content analysis, and the real-time nature of its separation directly affects user experience.
  • At present, speaker role separation is mostly based on voiceprint recognition. Because voiceprint recognition requires the accumulation of voice data over a certain period of time before a high recognition accuracy can be guaranteed, most role separation systems on the market that are based on voiceprint recognition complete role separation on offline voice data, and it is difficult to realize role separation in real time. It can be seen that how to separate roles in real time so as to improve user experience has become a technical problem that urgently needs to be solved.
  • embodiments of the present invention provide a role separation solution to at least partially solve the above technical problems.
  • a role separation method includes: acquiring sound source angle data corresponding to a voice data frame of a character to be separated collected by a voice collection device; identifying, based on the sound source angle data, the character to be separated to obtain a first identification result of the character to be separated; and separating the character based on the first identification result of the character to be separated.
  • a role separation method includes: sending a role separation request carrying a voice data frame of a character to be separated to the cloud, so that the cloud obtains sound source angle data corresponding to the voice data frame based on the role separation request, identifies the character to be separated based on the sound source angle data, and then separates the character based on the identification result of the character to be separated; and receiving the role separation result sent by the cloud based on the role separation request.
  • a role separation method includes: receiving a role separation request that carries a voice data frame of a character to be separated from a voice collection device; acquiring sound source angle data corresponding to the voice data frame based on the role separation request; identifying, based on the sound source angle data, the character to be separated to obtain an identification result of the character to be separated; and separating the character based on the identification result of the character to be separated and sending a role separation result for the role separation request to the voice collection device.
  • a method for recording meeting minutes includes: acquiring sound source angle data corresponding to a voice data frame of a conference role collected by a voice collection device located in a conference room; identifying the conference role based on the sound source angle data to obtain an identification result of the conference role; and recording the meeting minutes of the conference role based on the identification result of the conference role.
  • a character display method includes: acquiring sound source angle data corresponding to a voice data frame of a character collected by a voice collection device; based on the sound source angle data, identifying the character to obtain an identification result of the character; Based on the identity recognition result of the character, the identity data of the character is displayed on the interactive interface of the voice collection device.
  • a role separation device includes: a first acquisition module for acquiring sound source angle data corresponding to a voice data frame of a character to be separated collected by a voice collection device; a first identity recognition module for identifying, based on the sound source angle data, the character to be separated to obtain a first identification result of the character to be separated; and a separation module for separating the character based on the first identification result of the character to be separated.
  • a role separation device includes: a first sending module configured to send a role separation request carrying a voice data frame of a character to be separated to the cloud, so that the cloud obtains the sound source angle data corresponding to the voice data frame based on the role separation request, identifies the character to be separated based on the sound source angle data, and then separates the character based on the identification result of the character to be separated; and a first receiving module configured to receive the role separation result sent by the cloud based on the role separation request.
  • a role separation device includes: a second receiving module configured to receive a role separation request, sent by a voice collection device, that carries a voice data frame of a character to be separated; a third obtaining module configured to obtain, based on the role separation request, the sound source angle data corresponding to the voice data frame; a second identification module configured to identify the character to be separated based on the sound source angle data to obtain an identification result of the character to be separated; and a second sending module configured to separate the character based on the identification result of the character to be separated and send a character separation result for the character separation request to the voice collection device.
  • a recording apparatus for meeting minutes includes: a fourth acquisition module for acquiring sound source angle data corresponding to a voice data frame of a conference role collected by a voice collection device located in a conference room; a third identity recognition module for identifying the conference role based on the sound source angle data to obtain an identification result of the conference role; and a recording module configured to record the meeting minutes of the conference role based on the identification result of the conference role.
  • a character display device includes: a fifth acquisition module for acquiring sound source angle data corresponding to a voice data frame of a character collected by a voice collection device; a fourth identity recognition module for identifying the character based on the sound source angle data to obtain an identification result of the character; and a first display module configured to display the identity data of the character on the interactive interface of the voice collection device based on the identification result of the character.
  • an electronic device includes: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the role separation method described in the first aspect, the second aspect or the third aspect, or to perform operations corresponding to the method for recording meeting minutes described in the fourth aspect, or to perform operations corresponding to the character display method described in the fifth aspect.
  • a computer storage medium has a computer program stored thereon, and when the program is executed by a processor, it implements the role separation method described in the first aspect, the second aspect or the third aspect, or the method for recording meeting minutes described in the fourth aspect, or the character display method described in the fifth aspect.
  • In the embodiments of the present application, the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collection device is obtained; based on the sound source angle data, the character to be separated is identified to obtain the first identification result of the character to be separated; and the character is then separated based on the first identification result of the character to be separated.
  • Compared with other existing methods, identifying the character to be separated based on the sound source angle data corresponding to the voice data frame collected by the voice collection device, and then separating the character based on the identification result of the character to be separated, allows the character to be separated in real time, thereby making the user experience smoother.
  • FIG. 1A is a flow chart of the steps of the method for role separation in Embodiment 1 of the present application;
  • FIG. 1B is a schematic diagram of sound propagation under a near-field model provided according to Embodiment 1 of the present application;
  • FIG. 1C is a schematic diagram of a scenario of a speaker separation method provided according to Embodiment 1 of the present application.
  • FIG. 2A is a flow chart of the steps of the method for role separation in Embodiment 2 of the present application.
  • FIG. 2B is a schematic diagram of a scenario of a role separation method provided according to Embodiment 2 of the present application.
  • FIG. 3A is a flow chart of the steps of the method for role separation in Embodiment 3 of the present application.
  • FIG. 3B is a schematic diagram of a scenario of a method for role separation provided according to Embodiment 3 of the present application.
  • FIG. 4A is a flowchart of the steps of the method for role separation in Embodiment 4 of the present application.
  • FIG. 4B is a schematic diagram of a scenario of a method for role separation provided according to Embodiment 4 of the present application.
  • FIG. 5 is a flowchart of steps of a method for recording meeting minutes in Embodiment 5 of the present application;
  • FIG. 6 is a flow chart of the steps of the character display method in Embodiment 6 of the present application.
  • FIG. 7 is a schematic structural diagram of a role separation device in Embodiment 7 of the present application.
  • FIG. 8 is a schematic structural diagram of a role separation device in Embodiment 8 of the present application.
  • FIG. 9 is a schematic structural diagram of a role separation device in Embodiment 9 of the present application.
  • FIG. 10 is a schematic structural diagram of a role separation device in Embodiment 10 of the present application.
  • FIG. 11 is a schematic structural diagram of a recording device for meeting minutes in Embodiment 11 of the present application.
  • FIG. 12 is a schematic structural diagram of a character display device in Embodiment 12 of the present application.
  • FIG. 13 is a schematic structural diagram of an electronic device in Embodiment 13 of the present application.
  • Referring to FIG. 1A, a flowchart of steps of the method for role separation in Embodiment 1 of the present application is shown.
  • the role separation method provided by this embodiment includes the following steps:
  • step S101 the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collecting device is acquired.
  • For example, the voice collection device may include a sound pickup device.
  • the to-be-separated role may be a to-be-separated conference speaker, a to-be-separated caller, and the like.
  • the voice data frame can be understood as a voice segment with a duration of 20 milliseconds to 30 milliseconds.
  • the sound source angle data can be understood as the angle formed by the character to be separated and the voice acquisition device when speaking. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • the speech collection device includes a microphone array.
  • When acquiring the sound source angle data corresponding to the speech data frame of the character to be separated collected by the speech collection device, the covariance matrix of the speech data frame received by at least some microphones in the microphone array is acquired; eigenvalue decomposition is performed on the covariance matrix to obtain a plurality of eigenvalues; a first number of the largest eigenvalues is selected from the plurality of eigenvalues, and a speech signal subspace is formed based on the eigenvectors corresponding to the selected eigenvalues, where the first number is equal to the estimated number of sound sources; the sound source angle data is then determined based on the speech signal subspace.
  • In some embodiments, a microphone array may be provided on a device supporting a voice interaction function (e.g., a smart speaker), and the microphone array is used to receive nearby sound input.
  • a microphone array is an array formed by a group of omnidirectional microphones located at different positions in space according to a certain shape and regular arrangement. It is a device for spatial sampling of spatially propagated sound input. According to the topology of the microphone array, it can be divided into linear array, planar array, volume array and so on. According to the distance between the sound source and the microphone array, the array can be divided into a near-field model and a far-field model.
  • The near-field model regards the sound wave as a spherical wave and considers the amplitude difference between the signals received by the microphone elements; the far-field model regards the sound wave as a plane wave, ignores the amplitude difference between the signals received by the array elements, and approximately treats the received signals as differing only by a time delay.
  • Sound source localization may be performed according to signals received by at least some microphones in the microphone array to determine the position information of the character.
  • the determined position information may be the two-dimensional position coordinates of the character, or the azimuth and distance of the character relative to the at least part of the microphones.
  • the azimuth is the azimuth of the character in the coordinate system where the at least part of the microphones are located, that is, the sound source angle data, and the distance is the distance between the character and the center position of the at least part of the microphones.
  • For example, a MUSIC (Multiple Signal Classification) algorithm can be used for sound source localization.
  • The basic idea of the MUSIC algorithm is to perform eigenvalue decomposition on the covariance matrix of the output data of an arbitrary array, so as to obtain the signal subspace corresponding to the signal components and the noise subspace orthogonal to the signal components, and then use the orthogonality of these two subspaces to form a spatial scanning spectrum and perform a global search for spectral peaks, thereby realizing parameter estimation of the signal.
  • the microphone array can be a linear array, and the sound field model can be regarded as a near-field model.
  • In the near-field model, the time difference between the sound source signal reaching the different array microphones is denoted τ. Let the distances from the character to be separated to the respective microphones of the microphone array be R_1, R_2, ..., R_(N-1), R_N, and let the propagation velocity of sound in air be c. The time difference between the sound wave arriving at the i-th microphone and arriving at the 1st microphone is τ_i, where τ_i = (R_i - R_1) / c.
  • the sound source localization process under the near-field model is described as follows.
  • a covariance matrix of signals received by at least some of the microphones in the microphone array can be obtained.
  • X(f) can be regarded as a vector, and each element in the vector represents the data of the signal received by a microphone at different frequency points f after Fourier transform.
  • X(f) can be expressed as X(f) = [X_1(f), X_2(f), ..., X_M(f)]^T, where X_1(f), X_2(f), ..., X_M(f) represent the data at different frequency points f after Fourier transform (such as short-time Fourier transform) of the signals received by the different microphones, and M is the number of microphones.
  • A time variable t is actually implied in the expression of X(f); the complete representation should be X(f, t), which represents the data contained in a time period t.
  • E represents the mathematical expectation (mean), taken over time t. The covariance matrix can therefore be understood as R(f) = E[X(f, t) X(f, t)^H], or approximately R(f) = (1 / (N2 - N1)) * Σ_{t=N1..N2} X(f, t) X(f, t)^H, where N2 - N1 represents the time period corresponding to X(f, t), N1 represents the start time, and N2 represents the end time.
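  • As an illustration (not part of the original disclosure), the covariance estimate above could be computed with a few lines of NumPy; the array shapes and the function name are assumptions made for the example.

```python
import numpy as np

def estimate_covariance(stft_frames):
    """Estimate R(f) = E[X(f, t) X(f, t)^H] by averaging over time frames.

    stft_frames: complex array of shape (num_mics, num_bins, num_frames),
    i.e. the short-time Fourier transform of each microphone signal.
    Returns an array of shape (num_bins, num_mics, num_mics).
    """
    num_mics, num_bins, num_frames = stft_frames.shape
    R = np.zeros((num_bins, num_mics, num_mics), dtype=complex)
    for f in range(num_bins):
        X = stft_frames[:, f, :]              # (num_mics, num_frames)
        R[f] = (X @ X.conj().T) / num_frames  # time average of X X^H
    return R
```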
  • the covariance matrix is decomposed into eigenvalues, and multiple eigenvalues can be obtained.
  • the first number of the largest eigenvalues can be selected from the plurality of eigenvalues, and the eigenvectors corresponding to the selected eigenvalues can constitute a signal subspace.
  • The eigenvectors corresponding to the remaining eigenvalues can form a noise subspace, where the first number is equal to the estimated number of sound sources. For example, when it is considered that there are 3 sound source signals, the eigenvectors corresponding to the three largest eigenvalues can be taken to form the signal subspace.
  • the estimated number of sound sources can be calculated through experience or other estimation methods, which will not be repeated here.
  • The eigenvalue decomposition of the covariance matrix can be written as R(f) = U_s(f) Λ_s U_s(f)^H + U_N(f) Λ_N U_N(f)^H, where U_s(f) is the signal subspace composed of the eigenvectors corresponding to the large eigenvalues, and U_N(f) is the noise subspace composed of the eigenvectors corresponding to the small eigenvalues.
  • The subscripts S and N represent different partitions of the matrix U: S represents the signal and N represents the noise, so the partition U_s represents the signal subspace and U_N represents the noise subspace. Λ stands for a diagonal matrix consisting of the corresponding eigenvalues (Λ_s for the signal eigenvalues and Λ_N for the noise eigenvalues).
  • the sound source location can be determined. For example, the maximum response of the signal in the two-dimensional space can be determined based on the signal subspace, and based on the direction of arrival (DOA) corresponding to the maximum response, the position of the sound source, that is, the position information of the character can be determined.
  • The response of the target signal in the two-dimensional space is computed over a range of frequencies f, based on the signal subspace and the steering vector a(R, θ, f) of the microphone array, which can be obtained from the relative time differences τ. Here R is the distance between the sound source and the center of the microphone array, and θ is the azimuth of the sound source in the array coordinate system.
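  • For illustration only, the following sketch shows one common way to turn the signal subspace U_s(f) and a steering vector a(R, θ, f) into an angle estimate, assuming for simplicity a uniform linear array with far-field steering vectors (the embodiment itself describes a near-field model); all function and parameter names are illustrative assumptions.

```python
import numpy as np

def doa_spectrum(R, mic_positions, freqs, angles_deg, num_sources, c=343.0):
    """Scan candidate azimuths and accumulate the signal-subspace response.

    R: covariance matrices, shape (num_bins, num_mics, num_mics)
    mic_positions: microphone x-coordinates in metres (linear array assumed)
    freqs: centre frequency of each bin in Hz
    angles_deg: candidate azimuths to scan
    num_sources: estimated number of sound sources (the "first number")
    """
    num_mics = len(mic_positions)
    spectrum = np.zeros(len(angles_deg))
    for f_idx, f in enumerate(freqs):
        # Eigen-decompose R(f); the eigenvectors of the largest eigenvalues
        # form the speech signal subspace U_s (eigh sorts eigenvalues ascending).
        _, eigvecs = np.linalg.eigh(R[f_idx])
        U_s = eigvecs[:, -num_sources:]
        for a_idx, theta in enumerate(np.deg2rad(np.asarray(angles_deg))):
            # Far-field steering vector: relative delays tau_i = x_i * cos(theta) / c
            tau = np.asarray(mic_positions) * np.cos(theta) / c
            a = np.exp(-2j * np.pi * f * tau) / np.sqrt(num_mics)
            # Projection of the steering vector onto the signal subspace.
            spectrum[a_idx] += np.linalg.norm(U_s.conj().T @ a) ** 2
    return spectrum

# The angle with the maximum response is taken as the sound source angle data, e.g.:
# angle_estimate = angles_deg[int(np.argmax(doa_spectrum(R, mics, freqs, angles_deg, 1)))]
```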
  • In some embodiments, the method further includes: performing voice endpoint detection on the voice data frame of the character to be separated to obtain a voice data frame with voice endpoints; based on the energy spectrum of the voice data frame of the character to be separated, filtering and smoothing the voice data frame with voice endpoints to obtain a filtered and smoothed voice data frame; and, based on the filtered and smoothed voice data frame, updating the sound source angle data to obtain updated sound source angle data.
  • By performing voice endpoint detection and filter smoothing on the voice data frames of the character to be separated, more stable sound source angle data can be obtained.
  • Voice endpoint detection (Voice Activity Detection, VAD), also known as voice activity detection, is used to determine the starting point and end point of the voice from a signal containing voice, and then extract the corresponding non-silent voice signal, thereby eliminating the interference of silent segments and non-voice signals, so that the processing quality is guaranteed.
  • efficient endpoint detection minimizes processing time.
  • the voice endpoint detection may be performed based on spatial entropy, and may also be performed based on a neural network model.
  • the process of detecting voice endpoints based on spatial entropy is as follows: the sound signal received by the microphone array may contain the voice of the character to be separated and the surrounding environmental noise. Therefore, the voice endpoint detection can be performed according to the degree of confusion in the signal space of the sound signal received by at least some of the microphones in the microphone array.
  • spatial entropy can be used to represent the degree of confusion in the signal space. When the spatial entropy is small, it can be considered that there is speech activity, and when the spatial entropy is large, it can be considered that there is no speech activity.
  • a covariance matrix of signals received by at least some microphones in the microphone array may be obtained first, and eigenvalue decomposition is performed on the covariance matrix to obtain multiple eigenvalues.
  • The signal subspace composed of the large eigenvalues can be regarded as the speech subspace, and the signal subspace composed of the small eigenvalues can be regarded as the noise subspace, so voice activity can be determined by analyzing the multiple eigenvalues.
  • Each eigenvalue can be regarded as a signal subspace (i.e., a signal source), and the entropy (i.e., the spatial entropy) of these multiple eigenvalues can be calculated.
  • Based on the size of the calculated spatial entropy, it can be judged whether there is speech activity. For example, the multiple eigenvalues can be normalized, the spatial entropy of the multiple values obtained after the normalization can be calculated, the spatial entropy can be compared with a predetermined threshold, and whether there is voice activity can be determined based on the comparison result. For example, when the spatial entropy is less than the predetermined threshold, it can be determined that there is voice activity, and when the spatial entropy is greater than or equal to the predetermined threshold, it can be determined that there is no voice activity.
  • The value of the predetermined threshold can be set according to the actual situation; for example, it can be related to the selected positioning frequency band. When the positioning frequency band is 500-5000 Hz, the predetermined threshold can be 1: when the spatial entropy is less than 1, it can be determined that there is voice activity, otherwise the frame can be judged as noise with no voice activity. Here, if the normalized eigenvalues are denoted p_i, the spatial entropy is ES = -Σ_i p_i * log(p_i).
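  • A minimal sketch of the spatial-entropy decision described above, assuming the eigenvalues are taken from one covariance matrix R(f) in the positioning frequency band; the threshold value and function name are illustrative.

```python
import numpy as np

def spatial_entropy_vad(R_f, threshold=1.0):
    """Decide speech / no-speech from the spatial entropy of one covariance matrix R(f).

    The eigenvalues are normalised into a distribution p_i and the entropy
    ES = -sum(p_i * log(p_i)) is compared with the threshold: a small entropy
    means the energy is concentrated in a few directions (voice activity),
    a large entropy means a diffuse field (no voice activity).
    """
    eigvals = np.clip(np.linalg.eigvalsh(R_f).real, 1e-12, None)  # avoid log(0)
    p = eigvals / eigvals.sum()                                   # normalise eigenvalues
    es = -np.sum(p * np.log(p))                                   # spatial entropy ES
    return es < threshold                                         # True -> voice activity
```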
  • the voice activity detection model can be used to predict the voice data frames obtained based on at least part of the microphones in the microphone array to determine whether there is voice activity.
  • the voice activity detection model is used to predict the voice activity state of the input voice data frame, and the voice activity detection model may be a neural network-based model, and the prediction model may be obtained by means of supervised machine learning. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • When filtering and smoothing the voice data frame with voice endpoints based on the energy spectrum of the voice data frame of the character to be separated to obtain the filtered and smoothed voice data frame, a median filter can be used, based on the spectral flatness of the energy spectrum of the voice data frame of the character to be separated, to filter and smooth the voice data frame with voice endpoints and obtain the filtered and smoothed voice data frame.
  • In this way, the voice data frame with voice endpoints is filtered and smoothed through the median filter based on the spectral flatness, which can effectively improve the filter smoothing effect on the voice data frame with voice endpoints.
  • median filtering is a nonlinear digital filter technique often used to remove noise from images or other signals.
  • the design idea of the median filter is to examine the samples in the input signal and determine whether it represents the signal, and use an observation window composed of an odd number of samples to achieve this function. The values in the observation window are sorted, and the median value in the middle of the observation window is output. Then, discard the oldest value, obtain a new sample, and repeat the above calculation process.
  • The spectral flatness of the energy spectrum of the voice data frame of the character to be separated can be understood as the flatness of the energy spectrum of the voice data frame of the character to be separated. It is a characteristic parameter of the energy spectrum, which can be obtained by calculating the energy spectrum of the voice data frame of the character to be separated.
  • Specifically, through the median filter and based on the spectral flatness of the energy spectrum of the voice data frame of the character to be separated, the energy spectrum of the voice data frame with voice endpoints is filtered and smoothed to obtain a filtered and smoothed energy spectrum; based on the filtered and smoothed energy spectrum, the filtered and smoothed speech data frame is determined. A minimal sketch of such smoothing is given below.
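  • The following is a minimal sketch of spectral-flatness-gated median smoothing as described above; the flatness threshold and kernel size are assumed example values, not values from the disclosure.

```python
import numpy as np
from scipy.signal import medfilt

def spectral_flatness(power_spectrum):
    """Flatness of an energy spectrum: geometric mean over arithmetic mean (0..1)."""
    ps = np.clip(np.asarray(power_spectrum, dtype=float), 1e-12, None)
    return float(np.exp(np.mean(np.log(ps))) / np.mean(ps))

def smooth_frame_spectrum(power_spectrum, kernel=5, flatness_max=0.5):
    """Median-filter the energy spectrum of a frame carrying voice endpoints.

    A very flat spectrum looks noise-like, so smoothing is only applied (and
    the frame only kept) when the spectral flatness is below the threshold.
    """
    spec = np.asarray(power_spectrum, dtype=float)
    if spectral_flatness(spec) >= flatness_max:
        return None                              # treat as a noise-like frame
    return medfilt(spec, kernel_size=kernel)     # filtered and smoothed energy spectrum
```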
  • When updating the sound source angle data based on the filtered and smoothed voice data frame, the sound source angle data corresponding to the filtered and smoothed voice data frame is acquired, and the sound source angle data corresponding to the filtered and smoothed voice data frame is used to update the sound source angle data corresponding to the voice data frame of the character to be separated.
  • the specific implementation of acquiring the sound source angle data corresponding to the filtered and smoothed voice data frame is the same as the above-mentioned specific implementation of acquiring the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice acquisition device. The implementation manner is similar and will not be repeated here. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • step S102 the character to be separated is identified based on the sound source angle data, so as to obtain a first identification result of the character to be separated.
  • In some embodiments, sequential clustering is performed on the sound source angle data to obtain the sequential clustering result of the sound source angle data, and the character identity corresponding to the sequential clustering result of the sound source angle data is determined as the first identification result of the character to be separated.
  • the character identity identifier may be the character's name, nickname, or identity code or the like.
  • sequential clustering is typically an unsupervised method for classifying data in a certain number of homogeneous data sets (or clusters).
  • the number of clusters is not predetermined, but is gradually increased sequentially (one after the other) according to given criteria until an appropriate stopping condition is met.
  • the advantages of sequential clustering algorithms are twofold. First, redundant computation of an unnecessarily large number of clusters is avoided. Second, clusters are usually extracted in ordered order, from the most important (the one with the largest capacity) to the least important (the one with the least capacity).
  • the order clustering center of the sound source angle can be understood as the central angle of each order clustering of the sound source angle.
  • the center angle can be 30 degrees, 60 degrees, 90 degrees.
  • Specifically, the distance between the sound source angle data and the sequential clustering center of the sound source angle is compared with a preset distance threshold. If the distance between the sound source angle data and the sequential clustering center of the sound source angle is less than the preset distance threshold, the sequential clustering result of the sound source angle data is determined to be the sequential cluster in which that clustering center is located; if the distance is equal to or greater than the preset distance threshold, the sequential clustering result of the sound source angle data is not the sequential cluster in which that clustering center is located.
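  • A minimal sketch of such sequential (online) clustering of sound source angles is shown below; the distance threshold, the running-mean update of the cluster center, and the treatment of angles as plain scalars (no wrap-around at 360 degrees) are simplifying assumptions.

```python
import numpy as np

class SequentialAngleClusterer:
    """Online (sequential) clustering of sound source angles.

    Each cluster centre is the running mean angle of one character: a new angle
    joins the nearest existing cluster when its distance to that cluster centre
    is below the preset threshold, otherwise a new cluster (a new character)
    is created.
    """

    def __init__(self, distance_threshold_deg=15.0):
        self.threshold = distance_threshold_deg
        self.centers = []   # one centre angle per character
        self.counts = []

    def assign(self, angle_deg):
        if self.centers:
            dists = [abs(angle_deg - c) for c in self.centers]
            best = int(np.argmin(dists))
            if dists[best] < self.threshold:
                # Update the running mean of the matched cluster centre.
                self.counts[best] += 1
                self.centers[best] += (angle_deg - self.centers[best]) / self.counts[best]
                return best                     # character identity index
        self.centers.append(float(angle_deg))
        self.counts.append(1)
        return len(self.centers) - 1
```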
  • step S103 the characters to be separated are separated based on the first identification result of the characters to be separated.
  • In this step, the first identification result of the character to be separated can be used to distinguish the character to be separated, thereby realizing the separation of the characters to be separated. It can be understood that the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • In this embodiment, the voice data frame of the character to be separated is collected by a voice collection device. Voice endpoint detection is performed on the voice data frame of the character to be separated to obtain a voice data frame with voice endpoints; based on the energy spectrum of the voice data frame of the character to be separated, the voice data frame with voice endpoints is filtered and smoothed to obtain a filtered and smoothed voice data frame; sound source localization is then performed on the filtered and smoothed voice data frame to determine the sound source angle data.
  • In this way, the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collection device is obtained; based on the sound source angle data, the character to be separated is identified to obtain the first identification result of the character to be separated; and the character is then separated based on the first identification result. Compared with other existing methods, identifying the character to be separated based on the sound source angle data corresponding to the collected voice data frame and then separating the character based on the identification result allows the character to be separated in real time, making the user experience smoother. A compact sketch tying these steps together is given below.
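  • Purely as an illustration, the per-frame pipeline of Embodiment 1 could be wired together as follows, reusing the hypothetical helper functions sketched earlier in this document; the shapes, the choice of a single representative frequency bin for the VAD decision, and the 0-180 degree scan range are assumptions.

```python
import numpy as np

# Reuses the hypothetical helpers sketched above: estimate_covariance,
# spatial_entropy_vad, doa_spectrum and SequentialAngleClusterer.
clusterer = SequentialAngleClusterer()

def separate_frame(stft_frame, freqs, mic_positions, num_sources=1):
    """Return the character index for one voice data frame, or None if silent."""
    R = estimate_covariance(stft_frame)              # covariance per frequency bin
    if not spatial_entropy_vad(R[len(freqs) // 2]):  # endpoint detection on one bin
        return None                                  # no voice activity in this frame
    spectrum = doa_spectrum(R, mic_positions, freqs,
                            angles_deg=np.arange(0, 181),
                            num_sources=num_sources)
    angle = float(np.argmax(spectrum))               # sound source angle data (degrees)
    return clusterer.assign(angle)                   # first identification result
```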
  • the role separation method provided in this embodiment can be executed by any appropriate device with data processing capabilities, including but not limited to: cameras, terminals, mobile terminals, PCs, servers, in-vehicle devices, entertainment devices, advertising devices, and personal digital assistants (PDA), tablet computer, notebook computer, handheld game console, smart glasses, smart watch, wearable device, virtual display device or display enhancement device, etc.
  • Referring to FIG. 2A, a flowchart of steps of the method for role separation according to Embodiment 2 of the present application is shown.
  • the role separation method provided by this embodiment includes the following steps:
  • step S201 the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collecting device is acquired.
  • step S201 Since the specific implementation of this step S201 is similar to the specific implementation of the above-mentioned step S101, it will not be repeated here.
  • step S202 the character to be separated is identified based on the sound source angle data, so as to obtain a first identification result of the character to be separated.
  • step S202 Since the specific implementation of this step S202 is similar to the specific implementation of the above-mentioned step S102, it will not be repeated here.
  • step S203 voiceprint recognition is performed on the voice data frames of the character to be separated within a preset time period, so as to obtain a second identity recognition result of the character to be separated.
  • voiceprint refers to the sound wave spectrum that carries speech information in human speech, has unique biological characteristics, and has the function of identity recognition.
  • Voiceprint recognition (Voiceprint Identification) is also known as speaker identification (Speaker Identification).
  • the process of voiceprint recognition is usually to store the voiceprint information of one or some users in advance (the user who stores the voiceprint information is a registered user), and the voice features extracted from the character voice signal and the pre-stored voiceprint Make a comparison to get a similarity score, and then compare the score with the threshold.
  • If the score is greater than the threshold, the character is considered to be the registered user corresponding to the voiceprint; if the score is less than or equal to the threshold, the character is considered not to be the registered user corresponding to the voiceprint.
  • the preset time period can be set by those skilled in the art according to actual needs, which is not limited in this embodiment of the present application. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • the voice data frames of the characters to be separated within a preset time period may undergo different levels of preprocessing before voiceprint recognition.
  • This preprocessing can facilitate more efficient voiceprint recognition.
  • preprocessing may include: sampling; quantizing; removing non-speech audio data and silenced audio data; framing, windowing audio data including speech for subsequent processing, and the like.
  • the voice features of the voice data frames of the characters to be separated within a preset time period can be extracted, and based on the voice features of the voice data frames, the voice data frames and the user's voiceprint are matched.
  • The speech feature can be one of, or a combination of, the filter bank (FBank) features, Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP) coefficients, deep features (Deep Feature), and Power-Normalized Cepstral Coefficients (PNCC).
  • the extracted speech features may also be normalized. Then, based on the voice features of the voice data frame, the voice data frame is matched with the user's voiceprint to obtain a similarity score between the voice data frame and the user's voiceprint, and the similarity score between the voice data frame and the user's voiceprint is determined according to the similarity score. matching users.
  • the user's voiceprint is described by a voiceprint model, such as a hidden Markov model (HMM model), a Gaussian mixture model (GMM model), and the like.
  • the user's voiceprint model is characterized by voice features, and is obtained by training using audio data including the user's voice (hereinafter referred to as the user's audio data).
  • a matching operation function can be used to calculate the similarity between the voice data frame and the user's voiceprint. For example, the posterior probability of matching the voice features of the voice data frame with the user's voiceprint model can be calculated as the similarity score, and the likelihood between the voice features of the voice data frame and the user's voiceprint model can also be calculated as the similarity score.
  • The user's voiceprint model can be trained with a small amount of the user's audio data based on a general background model unrelated to the user (which is likewise characterized by voice features).
  • a Universal Background Model (UBM) can be obtained through the expectation maximization algorithm EM training to represent the user-independent feature distribution.
  • A small amount of the user's audio data is then used to train a GMM model through adaptive algorithms (such as maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR), etc.); the GMM model obtained in this way is called the GMM-UBM model.
  • the GMM-UBM model is the user's voiceprint model.
  • In this case, the voice data frame can be matched with the user's voiceprint model and the general background model respectively to obtain a similarity score between the voice data frame and the user's voiceprint. For example, the likelihoods between the speech features of the speech data frame and the above UBM model and GMM-UBM model are calculated respectively, the two likelihoods are divided and the logarithm is taken, and the obtained value is used as the similarity score between the speech data frame and the user's voiceprint. A minimal sketch of this scoring is given below.
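  • A minimal sketch of GMM-UBM scoring in the spirit of the description above, using scikit-learn's GaussianMixture; note that warm-starting a new GMM from the UBM parameters is only a crude stand-in for true MAP adaptation, and all names and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=64):
    """Train a user-independent Universal Background Model on pooled features."""
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(background_features)

def train_user_model(ubm, user_features):
    """Crude stand-in for MAP adaptation: warm-start a GMM from the UBM
    parameters and run a few EM iterations on the user's audio features."""
    gmm = GaussianMixture(n_components=ubm.n_components, covariance_type="diag",
                          weights_init=ubm.weights_, means_init=ubm.means_,
                          precisions_init=ubm.precisions_, max_iter=5)
    return gmm.fit(user_features)

def similarity_score(frame_features, user_gmm, ubm):
    """Average log-likelihood ratio between the user model and the UBM."""
    return float(np.mean(user_gmm.score_samples(frame_features)
                         - ubm.score_samples(frame_features)))
```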
  • the user's voiceprint is described by a voiceprint vector, such as i-vector, d-vector, x-vector, j-vector, and so on.
  • the voiceprint vector of the speech data frame may be extracted based on at least the speech features of the speech data frame.
  • the voiceprint model of the character to be separated can be trained first by using the voice features of the voice data frame. Similar to the above, the voiceprint model of the character to be separated can be obtained by training the voice feature of the voice data frame based on the above-mentioned general background model that is pre-trained and irrelevant to the user.
  • the mean value supervector of the speech data frame can be extracted according to the voiceprint model.
  • the mean value of each GMM component of the GMM-UBM model of the character to be separated can be spliced to obtain the mean value supervector of the GMM-UBM model of the character to be separated, that is, the mean value supervector of the speech data frame.
  • the joint factor analysis (JFA) method or the simplified joint factor analysis method can be used to extract the low-dimensional voiceprint vector from the mean value supervector of the speech data frame.
  • Taking the i-vector as an example, after training the user-independent universal background model (UBM model), the mean supervector of the universal background model can be extracted, and the total variability space (global difference space, T) matrix can be estimated. Then, the i-vector of the speech data frame is calculated based on the mean supervector of the speech data frame, the T matrix, and the mean supervector of the universal background model. Specifically, the i-vector can be calculated according to the formula M_(s,h) = m + T * w_(s,h), where M_(s,h) is the mean supervector obtained from the voice h of character s, m is the mean supervector of the universal background model, T is the total variability space matrix, and w_(s,h) is the total variability factor, i.e., the i-vector.
  • a trained deep neural network can also be used to obtain the voiceprint vector of the speech data frame.
  • DNN can include input layer, hidden layer and output layer.
  • the FBank features of the speech data frame can be input to the DNN input layer first, and the output of the last hidden layer of the DNN is the d-vector.
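  • As an illustration, extracting a d-vector from an already-trained feed-forward DNN could look like the sketch below; the network parameters are assumed to be available as NumPy arrays, and the ReLU activation and length normalization are assumptions rather than details from the disclosure.

```python
import numpy as np

def d_vector(fbank_frames, weights, biases):
    """Return the averaged last-hidden-layer activation as the d-vector.

    fbank_frames: (num_frames, fbank_dim) FBank features of the speech data frame.
    weights, biases: parameters of an already-trained feed-forward DNN; the
    output layer is deliberately not applied, because the d-vector is taken
    from the last hidden layer.
    """
    h = np.asarray(fbank_frames, dtype=float)
    for W, b in zip(weights, biases):        # hidden layers only
        h = np.maximum(h @ W + b, 0.0)       # ReLU activations
    v = h.mean(axis=0)                       # average over frames
    return v / (np.linalg.norm(v) + 1e-12)   # length-normalised d-vector
```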
  • a similarity score between the voice data frame and the user's voiceprint can be calculated based on the voiceprint vector of the voice data frame and the user's voiceprint vector.
  • Algorithms such as Support Vector Machine (SVM), LDA (Linear Discriminant Analysis), PLDA (Probabilistic Linear Discriminant Analysis), likelihood, and cosine distance (Cosine Distance) can be used to calculate the similarity score between the speech data frame and the user's voiceprint.
  • Taking PLDA as an example, suppose the voice data consists of the voices of I characters, and each character has J different voices; the j-th voice of the i-th character is defined as Y_ij.
  • The generative model that defines Y_ij is Y_ij = μ + F * h_i + G * w_ij + ε_ij, where μ is the mean value of the voiceprint vectors, and F and G are spatial feature matrices that respectively represent the inter-class feature space and the intra-class feature space of the characters.
  • Each column of F is equivalent to a feature vector of the inter-class feature space, and each column of G is equivalent to a feature vector of the intra-class feature space. The vectors h_i and w_ij can be viewed as the representations of the voice features in the respective spaces, and ε_ij is the noise covariance. If the likelihood that the h_i features of two voices are the same is higher, the similarity score is higher, and the probability that they come from the same character is greater.
  • The model parameters of PLDA are four in number, namely μ, F, G and ε_ij, and they are trained iteratively by the EM algorithm.
  • In practice, a simplified version of the PLDA model can be used, ignoring the training of the intra-class feature space matrix G and only training the inter-class feature space matrix F, that is, Y_ij = μ + F * h_i + ε_ij.
  • Based on the voiceprint vector of the speech data frame, the h_i feature of the speech data frame can be obtained with reference to the above formula, and the h_i feature of the user's voice can be obtained in the same way. The log-likelihood ratio or the cosine distance between the two h_i features can then be calculated as the similarity score between the speech data frame and the user's voiceprint. A minimal cosine-distance scoring sketch is given below.
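  • A minimal sketch of cosine-distance scoring between a frame's voiceprint vector and a user's enrolled voiceprint vector, as mentioned above; the decision threshold is an arbitrary example value.

```python
import numpy as np

def cosine_score(frame_vector, user_vector):
    """Cosine similarity between the voiceprint vector of the speech data frame
    and the user's enrolled voiceprint vector; higher means more similar."""
    num = float(np.dot(frame_vector, user_vector))
    den = float(np.linalg.norm(frame_vector) * np.linalg.norm(user_vector)) + 1e-12
    return num / den

def is_same_character(frame_vector, user_vector, threshold=0.7):
    """Compare the score with a threshold (0.7 is only an example value)."""
    return cosine_score(frame_vector, user_vector) > threshold
```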
  • The voiceprint is not limited to the above-mentioned voiceprint vectors (i-vector, d-vector, x-vector, etc.) and the above-mentioned voiceprint models (HMM model, GMM model, etc.), and the corresponding similarity scoring algorithm can likewise be selected according to the chosen voiceprint, which is not limited in the present invention.
  • If the similarity score satisfies the threshold condition, the voice data frame matches the voiceprint of the user, that is, it is determined that the voice data frame matches the user corresponding to the voiceprint; otherwise, it is determined that the voice data frame does not match the voiceprint of the user.
  • step S204, if the first identification result is different from the second identification result, use the second identification result to correct the first identification result to obtain the final identification result of the character to be separated.
  • If the first identification result is the same as the second identification result, it is not necessary to use the second identification result to correct the first identification result, and the first identification result is determined as the final identification result of the character to be separated.
  • step S205 the characters to be separated are separated based on the final identification result of the characters to be separated.
  • In this step, the final identification result of the character to be separated can be used to distinguish the characters, thereby realizing the separation of the characters to be separated. It can be understood that the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • In some embodiments, the method further includes: obtaining face image data of the character to be separated collected by an image acquisition device; performing face recognition on the face image data to obtain a third identification result of the character to be separated; and, if the third identification result is not the same as the second identification result, using the third identification result to correct the second identification result to obtain the final identification result of the character to be separated.
  • Therefore, when the result of voiceprint recognition on the voice data frames of the character to be separated within the preset time period is different from the result of face recognition on the face image data, the result of face recognition on the face image data is used to correct the result of voiceprint recognition, so that the identification result of the character can be obtained accurately, and the character can then be accurately separated according to the identification result of the character. A short sketch of this correction order is shown after the details below.
  • the image capturing device may be a camera.
  • the voice collection device may acquire the face image data of the character to be separated collected by the camera from the camera.
  • the face recognition model can be used to perform face recognition on the face image data to obtain the third identity recognition result of the character to be separated.
  • the face recognition model may be a neural network model for face recognition. If the third identification result is the same as the second identification result, it is not necessary to use the third identification result to correct the second identification result, and the second identification result is determined as the The final identification result of the said role to be separated.
  • the characters can be distinguished by using the final identification result of the characters to be separated, thereby realizing the separation of the characters to be separated. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
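  • The correction order described above (sound source angle result corrected by the voiceprint result, which in turn may be corrected by the face recognition result) can be summarized in a few lines; the function name and the use of None for a missing result are illustrative assumptions.

```python
def final_identity(doa_identity, voiceprint_identity=None, face_identity=None):
    """Resolve the final identification result as described above.

    The sound-source-angle result is corrected by the voiceprint result when
    the two differ, and the voiceprint result is in turn corrected by the face
    recognition result when those differ.
    """
    identity = doa_identity                      # first identification result
    if voiceprint_identity is not None and voiceprint_identity != identity:
        identity = voiceprint_identity           # second result corrects the first
    if face_identity is not None and face_identity != identity:
        identity = face_identity                 # third result corrects the second
    return identity
```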
  • In this embodiment, the voice data frame of the character to be separated is collected by a voice collection device. Voice endpoint detection is performed on the voice data frame of the character to be separated to obtain a voice data frame with voice endpoints; based on the energy spectrum of the voice data frame of the character to be separated, the voice data frame with voice endpoints is filtered and smoothed to obtain a filtered and smoothed voice data frame; sound source localization is then performed on the filtered and smoothed voice data frame to determine the sound source angle data, from which the first identification result of the character to be separated is obtained.
  • Voiceprint recognition is then performed on the voice data frames of the character to be separated within a preset time period to obtain the second identification result of the character to be separated. If the first identification result and the second identification result are not the same, the second identification result is used to correct the first identification result to obtain the final identification result of the character to be separated, and the character to be separated is separated based on the final identification result.
  • In this way, when the result of identification based on the sound source angle data is different from the result of voiceprint recognition of the character's voice data frames within the preset time period, the result of voiceprint recognition performed on the voice data frames within the preset time period is used to correct the result of identification based on the sound source angle data, so that the identification result of the character can be obtained accurately, and the character can then be accurately separated according to the identification result of the character.
  • the role separation method provided in this embodiment can be executed by any appropriate device with data processing capabilities, including but not limited to: cameras, terminals, mobile terminals, PCs, servers, in-vehicle devices, entertainment devices, advertising devices, and personal digital assistants (PDA), tablet computer, notebook computer, handheld game console, smart glasses, smart watch, wearable device, virtual display device or display enhancement device, etc.
  • Referring to FIG. 3A, a flowchart of steps of the method for role separation according to Embodiment 3 of the present application is shown.
  • the role separation method provided by this embodiment includes the following steps:
  • step S301, a role separation request carrying the voice data frame of the character to be separated is sent to the cloud, so that the cloud obtains the sound source angle data corresponding to the voice data frame based on the role separation request, identifies the character to be separated based on the sound source angle data, and then separates the character based on the identification result of the character to be separated.
  • In this embodiment, the voice collection device sends a role separation request carrying the voice data frame of the character to be separated to the cloud; the cloud acquires the sound source angle data corresponding to the voice data frame based on the role separation request, identifies the character to be separated based on the sound source angle data, and then separates the character based on the identification result of the character to be separated.
  • The specific implementation by which the cloud obtains the sound source angle data corresponding to the voice data frame based on the role separation request is similar to the specific implementation, described in Embodiment 1, of acquiring the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collection device, and is not repeated here.
  • The specific implementation by which the cloud identifies the character to be separated based on the sound source angle data is similar to the specific implementation of identifying the character to be separated based on the sound source angle data in Embodiment 1, and is not repeated here.
  • the specific implementation of the cloud separating the roles based on the identification results of the roles to be separated is similar to the specific implementation of separating the roles based on the identification results of the roles to be separated in the above-mentioned first embodiment. This will not be repeated here. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • step S302 the separation result of the roles sent by the cloud based on the role separation request is received.
  • the voice collection device receives the role separation result sent by the cloud based on the role separation request. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • the voice data frame of the character to be separated is collected by a voice collecting device. After the voice data frame of the character to be separated is collected by the voice collection device, the voice data frame of the character to be separated is sent to the cloud for character separation.
  • The cloud performs voice endpoint detection on the voice data frame of the character to be separated to obtain a voice data frame with voice endpoints; based on the energy spectrum of the voice data frame of the character to be separated, the voice data frame with voice endpoints is filtered and smoothed to obtain a filtered and smoothed voice data frame; sound source localization is then performed on the filtered and smoothed voice data frame to determine the sound source angle data; sequential clustering is performed on the sound source angle data to obtain the sequential clustering result of the sound source angle data; the character identity corresponding to the sequential clustering result of the sound source angle data is then determined as the identification result of the character to be separated; finally, the character to be separated is separated based on the identification result of the character to be separated, and the separation result of the character is sent to the voice collection device.
  • the voice collection device sends a role separation request carrying a voice data frame of a character to be separated to the cloud, and the cloud obtains the sound source angle data corresponding to the voice data frame based on the role separation request, and Based on the sound source angle data, the characters to be separated are identified, and then the characters are separated based on the identification results of the characters to be separated.
  • the voice acquisition device receives the separation results of the characters sent by the cloud based on the role separation request, which is similar to other existing methods.
  • the role separation method provided in this embodiment can be executed by any appropriate device with data processing capabilities, including but not limited to: cameras, terminals, mobile terminals, PCs, servers, in-vehicle devices, entertainment devices, advertising devices, and personal digital assistants (PDA), tablet computer, notebook computer, handheld game console, smart glasses, smart watch, wearable device, virtual display device or display enhancement device, etc.
  • FIG. 4A a flow chart of steps of the method for role separation according to Embodiment 4 of the present application is shown.
  • the role separation method provided by this embodiment includes the following steps:
  • step S401 a role separation request is received, which is sent by the voice collection device and carries the voice data frame of the role to be separated.
  • the cloud receives a role separation request sent by the voice collection device that carries the voice data frame of the role to be separated. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • step S402 based on the role separation request, acquire sound source angle data corresponding to the voice data frame.
  • the cloud acquires the sound source angle data corresponding to the voice data frame based on the role separation request.
  • the specific implementation of the sound source angle data corresponding to the voice data frame is similar, and will not be repeated here. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • step S403 the character to be separated is identified based on the sound source angle data, so as to obtain an identification result of the character to be separated.
  • the cloud identifies the character to be separated based on the sound source angle data, so as to obtain the identification result of the character to be separated.
  • the specific implementation of identifying the character to be separated based on the sound source angle data in order to obtain the first identification result of the character to be separated is similar to that described above, and will not be repeated here. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • step S404 the characters are separated based on the identification result of the characters to be separated, and the character separation result for the character separation request is sent to the voice collection device.
  • the cloud separates the roles based on the identification result of the roles to be separated, and sends the role separation result for the role separation request to the voice collection device. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • the voice data frame of the character to be separated is collected by a voice collecting device. After the voice data frame of the character to be separated is collected by the voice collection device, the voice data frame of the character to be separated is sent to the cloud for character separation.
  • the cloud performs voice endpoint detection on the voice data frames of the characters to be separated to obtain voice data frames with voice endpoints, and filters and smooths those voice data frames based on the energy spectrum of the voice data frames of the characters to be separated, to obtain filtered and smoothed voice data frames; sound source localization is then performed on the filtered and smoothed voice data frames to determine the sound source angle data; sequential clustering is performed on the sound source angle data to obtain the sequential clustering result of the sound source angle data; the character identity corresponding to the sequential clustering result of the sound source angle data is then determined to be the first identification result of the character to be separated; voiceprint recognition is then performed on the voice data frames of the character to be separated within a preset time period to obtain a second identification result of the character to be separated, and if the first identification result and the second identification result are not the same, the second identification result is used to correct the first identification result to obtain the final identification result of the character to be separated; finally, the character is separated based on the final identification result of the character to be separated, and the separation result is sent to the voice collection device.
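  • as an illustration of the correction step just described (the voiceprint-based result overriding the angle-based result when the two disagree), the following Python sketch shows one minimal way that decision could be expressed; the function and variable names are hypothetical and are not taken from the patent.

```python
from typing import Optional

def final_identity(angle_id: str, voiceprint_id: Optional[str]) -> str:
    """Combine the two identification results for a speech segment.

    angle_id: first identification result, from sequential clustering of
              sound source angles (available frame by frame, in real time).
    voiceprint_id: second identification result, from voiceprint recognition
                   over a preset time window; None if not yet available.
    """
    # When the voiceprint result exists and disagrees with the angle-based
    # result, the voiceprint result corrects it; otherwise keep the angle result.
    if voiceprint_id is not None and voiceprint_id != angle_id:
        return voiceprint_id
    return angle_id

print(final_identity("role_1", None))       # role_1 (angle-based only)
print(final_identity("role_1", "role_2"))   # role_2 (corrected)
```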
  • in this embodiment, the cloud receives the role separation request sent by the voice collection device that carries the voice data frame of the character to be separated, obtains the sound source angle data corresponding to the voice data frame based on the role separation request, identifies the character to be separated based on the sound source angle data to obtain the identification result of the character to be separated, then separates the character based on the identification result of the character to be separated, and sends the role separation result for the role separation request to the voice collection device. Compared with other existing methods, identifying the character to be separated based on the sound source angle data corresponding to the voice data frame carried in the role separation request, and then separating the character based on the identification result, makes it possible to separate characters in real time, thereby making the user experience smoother.
  • the role separation method provided in this embodiment can be executed by any appropriate device with data processing capabilities, including but not limited to: cameras, terminals, mobile terminals, PCs, servers, in-vehicle devices, entertainment devices, advertising devices, and personal digital assistants (PDA), tablet computer, notebook computer, handheld game console, smart glasses, smart watch, wearable device, virtual display device or display enhancement device, etc.
  • FIG. 5 a flowchart of steps of a method for recording meeting minutes according to Embodiment 5 of the present application is shown.
  • the method for recording meeting minutes includes the following steps:
  • step S501 the sound source angle data corresponding to the voice data frame of the conference role collected by the voice collecting device located in the conference room is acquired.
  • the voice collection device located in the conference room may be a microphone located in the conference room.
  • the conference role can be understood as a person participating in the conference.
  • the specific implementation of acquiring the sound source angle data corresponding to the voice data frame of the conference role collected by the voice collecting device located in the conference room is similar to the implementation, in the above-mentioned first embodiment, of acquiring the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collection device, and details are not repeated here. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • step S502 the conference role is identified based on the sound source angle data to obtain an identification result of the conference role.
  • the specific implementation of identifying the conference role based on the sound source angle data to obtain the identification result of the conference role is similar to the implementation, in the above-mentioned first embodiment, of identifying the character to be separated based on the sound source angle data to obtain the first identification result of the character to be separated, and is not repeated here. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • step S503 the meeting minutes of the meeting role are recorded based on the identification result of the meeting role.
  • in this embodiment, after the identification result of the conference role is obtained, the identification result can be used to distinguish the conference role, and the meeting minutes of the conference role can then be recorded in real time. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • in this embodiment, the sound source angle data corresponding to the voice data frame of the conference role collected by the voice collection device located in the conference room is obtained, the conference role is identified based on the sound source angle data to obtain the identification result of the conference role, and the meeting minutes of the conference role are then recorded based on the identification result of the conference role.
  • compared with other existing methods, identifying the conference role based on the corresponding sound source angle data and then recording the meeting minutes of the conference role based on the identification result makes it possible to record the meeting minutes of the conference role in real time, thereby effectively improving the recording efficiency of the meeting minutes of the conference role.
  • the method for recording meeting minutes provided in this embodiment can be executed by any appropriate device with data processing capabilities, including but not limited to: cameras, terminals, mobile terminals, PCs, servers, vehicle-mounted devices, entertainment devices, advertising devices, personal Digital Assistant (PDA), Tablet PC, Notebook PC, Handheld Game Console, Smart Glasses, Smart Watch, Wearable Device, Virtual Display Device or Display Enhancement Device, etc.
  • FIG. 6 it shows a flowchart of steps of a method for displaying a character according to Embodiment 6 of the present application.
  • the character display method provided by this embodiment includes the following steps:
  • step S601 the sound source angle data corresponding to the voice data frame of the character collected by the voice collecting device is acquired.
  • the specific implementation of acquiring the sound source angle data corresponding to the voice data frame of the character collected by the voice collection device is similar to the implementation, in the above-mentioned first embodiment, of acquiring the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collection device, and details are not repeated here. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • step S602 the character is identified based on the sound source angle data to obtain an identification result of the character.
  • the specific implementation of identifying the character based on the sound source angle data to obtain the identification result of the character is similar to the implementation, in the above-mentioned first embodiment, of identifying the character to be separated based on the sound source angle data to obtain the first identification result of the character to be separated, and is not repeated here. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • the method further includes: turning on the lamps of the voice collection device that are in the sound source direction indicated by the sound source angle data. By turning on the lamps of the voice collection device in the sound source direction indicated by the sound source angle data, the direction of the sound source can be effectively indicated. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • the lamps of the voice collecting device are arranged in various directions of the voice collecting device in an array manner, so that the direction of the sound source can be effectively indicated. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • step S603 based on the identity recognition result of the character, the identity data of the character is displayed on the interactive interface of the voice collection device.
  • the interactive interface of the voice collecting device may be a touch screen of the voice collecting device.
  • the identity data of the character may be face image data, identity identification data, etc. of the character. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • the method further includes: displaying a speech action image or a voice waveform image of the character on the interactive interface of the voice collection device.
  • an image sequence of speech actions or an image sequence of speech waveforms of the character may be dynamically displayed on the interactive interface of the voice collection device. It can be understood that, the above description is only exemplary, and the embodiments of the present application do not make any limitation on this.
  • in this embodiment, the sound source angle data corresponding to the voice data frame of the character collected by the voice collection device is obtained, the character is identified based on the sound source angle data to obtain the character's identification result, and the character's identity data is then displayed on the interactive interface of the voice collection device based on the character's identification result.
  • compared with other existing methods, identifying the character based on the sound source angle data corresponding to the character's voice data frame collected by the voice collection device, and then displaying the character's identity data on the interactive interface of the voice collection device based on the identification result, makes it possible to display the character's identity data in real time, thus making the user experience smoother.
  • the character display method provided in this embodiment can be executed by any appropriate device with data processing capabilities, including but not limited to: cameras, terminals, mobile terminals, PCs, servers, in-vehicle devices, entertainment devices, advertising devices, and personal digital assistants (PDA), tablet computer, notebook computer, handheld game console, smart glasses, smart watch, wearable device, virtual display device or display enhancement device, etc.
  • FIG. 7 a schematic structural diagram of a role separation device in Embodiment 7 of the present application is shown.
  • the character separation device includes: a first acquisition module 701, configured to acquire sound source angle data corresponding to a voice data frame of a character to be separated collected by a voice collection device; a first identity recognition module 702, configured to identify the character to be separated based on the sound source angle data to obtain the first identification result of the character to be separated; and a separation module 703, configured to separate the character based on the first identification result of the character to be separated.
  • the role separation apparatus provided in this embodiment is used to implement the corresponding role separation methods in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • FIG. 8 a schematic structural diagram of a role separation device in Embodiment 8 of the present application is shown.
  • the character separation device includes: a first acquisition module 801, configured to acquire sound source angle data corresponding to a voice data frame of a character to be separated collected by a voice collection device; a first identity recognition module 805, configured to identify the character to be separated based on the sound source angle data to obtain a first identification result of the character to be separated; and a separation module 808, configured to separate the character based on the first identification result of the character to be separated.
  • the apparatus further includes: a detection module 802, configured to perform voice endpoint detection on the voice data frames of the characters to be separated to obtain voice data frames with voice endpoints; a filtering and smoothing module 803, configured to filter and smooth the voice data frames with voice endpoints based on the energy spectrum of the voice data frames of the characters to be separated, to obtain filtered and smoothed voice data frames; and an update module 804, configured to update the sound source angle data based on the filtered and smoothed voice data frames to obtain updated sound source angle data.
  • the filtering and smoothing module 803 is specifically configured to filter and smooth, through a median filter and based on the spectral flatness of the energy spectrum of the voice data frame of the character to be separated, the voice data frames with voice endpoints, so as to obtain the filtered and smoothed voice data frames.
  • the first identity recognition module 805 includes: a clustering sub-module 8051, configured to perform sequential clustering on the sound source angle data to obtain a sequential clustering result of the sound source angle data; and a determination sub-module 8052, configured to determine that the character identification corresponding to the sequential clustering result of the sound source angle data is the first identification result of the character to be separated.
  • the clustering sub-module 8051 is specifically configured to: determine the distance between the sound source angle data and a sound source angle sequential clustering center; and determine the sequential clustering result of the sound source angle data based on the distance between the sound source angle data and the sound source angle sequential clustering center.
  • the device further includes: a voiceprint recognition module 806, configured to perform voiceprint recognition on the voice data frames of the character to be separated within a preset time period to obtain the second identification result of the character to be separated; and a first correction module 807, configured to, if the first identification result is different from the second identification result, correct the first identification result using the second identification result to obtain the final identification result of the character to be separated.
  • the voice collection device includes a microphone array
  • the first obtaining module 801 is specifically configured to: obtain a covariance matrix of the voice data frame received by at least some microphones in the microphone array;
  • eigenvalue decomposition is performed on the covariance matrix to obtain a plurality of eigenvalues; a first number of largest eigenvalues are selected from the plurality of eigenvalues, and a speech signal subspace is formed based on the eigenvectors corresponding to the selected eigenvalues, wherein the first number is equivalent to the estimated number of sound sources; and the sound source angle data is determined based on the speech signal subspace.
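  • as an illustrative aside, the subspace construction described above can be sketched in a few lines of Python. This is a minimal sketch under assumed inputs (a matrix of complex frequency-domain snapshots per microphone and an externally estimated number of sources); the function and variable names are hypothetical and not part of the patent.

```python
import numpy as np

def signal_subspace(snapshots: np.ndarray, num_sources: int) -> np.ndarray:
    """Estimate the speech-signal subspace from multi-microphone data.

    snapshots: complex array of shape (num_mics, num_snapshots), e.g. one
               STFT bin observed over a short time window (assumed layout).
    num_sources: the estimated number of sound sources (the "first number").
    Returns (num_mics, num_sources): eigenvectors of the largest eigenvalues.
    """
    # Covariance matrix of the received data, R = E[X X^H]
    cov = snapshots @ snapshots.conj().T / snapshots.shape[1]
    # Eigenvalue decomposition (eigh: ascending eigenvalues for Hermitian R)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:num_sources]
    return eigvecs[:, order]

# Toy usage: 6 microphones, 200 snapshots, 2 assumed sources
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 200)) + 1j * rng.standard_normal((6, 200))
print(signal_subspace(x, num_sources=2).shape)  # (6, 2)
```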
  • the device further includes: a second acquisition module 809, configured to acquire the face image data of the character to be separated collected by the image acquisition device; a face recognition module 810, configured to perform face recognition on the face image data to obtain the third identification result of the character to be separated; and a second correction module 811, configured to, if the third identification result is different from the second identification result, correct the second identification result using the third identification result to obtain the final identification result of the character to be separated.
  • the role separation apparatus provided in this embodiment is used to implement the corresponding role separation methods in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • FIG. 9 a schematic structural diagram of a role separation device in Embodiment 9 of the present application is shown.
  • the role separation device includes: a first sending module 901, configured to send a role separation request carrying a voice data frame of a role to be separated to the cloud, so that the cloud obtains, based on the role separation request, the sound source angle data corresponding to the voice data frame, identifies the role to be separated based on the sound source angle data, and then separates the role based on the identification result of the role to be separated; and a first receiving module 902, configured to receive the separation result of the role sent by the cloud based on the role separation request.
  • the role separation apparatus provided in this embodiment is used to implement the corresponding role separation methods in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • FIG. 10 a schematic structural diagram of a role separation device in Embodiment 10 of the present application is shown.
  • the role separation device includes: a second receiving module 1001, configured to receive a role separation request sent by a voice collection device that carries a voice data frame of a role to be separated; a third acquisition module 1002, configured to obtain, based on the role separation request, the sound source angle data corresponding to the voice data frame; a second identification module 1003, configured to identify the role to be separated based on the sound source angle data to obtain the identification result of the role to be separated; and a second sending module 1004, configured to separate the role based on the identification result of the role to be separated, and send the role separation result for the role separation request to the voice collection device.
  • the role separation apparatus provided in this embodiment is used to implement the corresponding role separation methods in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • FIG. 11 a schematic structural diagram of a recording apparatus for meeting minutes in Embodiment 11 of the present application is shown.
  • the apparatus for recording meeting minutes includes: a fourth acquisition module 1101, configured to acquire sound source angle data corresponding to the voice data frames of the conference roles collected by the voice acquisition device located in the conference room; a third identity recognition module 1102, configured to identify the conference role based on the sound source angle data to obtain the identification result of the conference role; and a recording module 1103, configured to record the meeting minutes of the conference role based on the identification result of the conference role.
  • the apparatus for recording meeting minutes provided in this embodiment is used to implement the corresponding recording methods for meeting minutes in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • FIG. 12 a schematic structural diagram of a character display device in Embodiment 12 of the present application is shown.
  • the character display device includes: a fifth acquisition module 1201, configured to acquire sound source angle data corresponding to a voice data frame of a character collected by a voice collection device; a fourth identity recognition module 1203, configured to identify the character based on the sound source angle data to obtain the character's identification result; and a first display module 1204, configured to display the identity data of the character on the interactive interface of the voice collection device based on the character's identification result.
  • the apparatus further includes: a turning-on module 1202, configured to turn on the lamps of the voice collection device in the direction of the sound source indicated by the sound source angle data.
  • the apparatus further includes: a second display module 1205, configured to display a speech action image or a voice waveform image of the character on the interactive interface of the voice collection device.
  • the character display apparatus provided in this embodiment is used to implement the corresponding character display methods in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
  • FIG. 13 a schematic structural diagram of an electronic device according to Embodiment 13 of the present invention is shown.
  • the specific embodiment of the present invention does not limit the specific implementation of the electronic device.
  • the electronic device may include: a processor (processor) 1302 , a communication interface (Communications Interface) 1304 , a memory (memory) 1306 , and a communication bus 1308 .
  • the processor 1302 , the communication interface 1304 , and the memory 1306 communicate with each other through the communication bus 1308 .
  • the communication interface 1304 is used to communicate with other electronic devices or servers.
  • the processor 1302 is configured to execute the program 1310, and specifically may execute the relevant steps in the above-mentioned embodiments of the role separation method.
  • the program 1310 may include program code including computer operation instructions.
  • the processor 1302 may be a central processing unit (CPU), or an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
  • processors included in the smart device may be the same type of processors, such as one or more CPUs; or may be different types of processors, such as one or more CPUs and one or more ASICs.
  • the memory 1306 is used to store the program 1310 .
  • Memory 1306 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
  • the program 1310 can specifically be used to cause the processor 1302 to perform the following operations: acquire the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collection device; identify the character to be separated based on the sound source angle data to obtain a first identification result of the character to be separated; and separate the character based on the first identification result of the character to be separated.
  • the program 1310 is further configured to cause the processor 1302, after acquiring the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collection device, to: perform voice endpoint detection on the voice data frame of the character to be separated to obtain a voice data frame with a voice endpoint; filter and smooth the voice data frame with the voice endpoint based on the energy spectrum of the voice data frame of the character to be separated, to obtain a filtered and smoothed voice data frame; and update the sound source angle data based on the filtered and smoothed voice data frame to obtain updated sound source angle data.
  • the program 1310 is further configured to cause the processor 1302, when filtering and smoothing the voice data frame with the voice endpoint based on the energy spectrum of the voice data frame of the character to be separated to obtain the filtered and smoothed voice data frame, to filter and smooth the voice data frame with the voice endpoint through a median filter, based on the spectral flatness of the energy spectrum of the voice data frame of the character to be separated, so as to obtain the filtered and smoothed voice data frame.
  • the program 1310 is further configured to cause the processor 1302, when identifying the character to be separated based on the sound source angle data to obtain the first identification result of the character to be separated, to: perform sequential clustering on the sound source angle data to obtain the sequential clustering result of the sound source angle data; and determine that the character identity corresponding to the sequential clustering result of the sound source angle data is the first identification result of the character to be separated.
  • the program 1310 is further configured to cause the processor 1302 to: determine the distance between the sound source angle data and the sound source angle sequential clustering center; and determine the sequential clustering result of the sound source angle data based on the distance between the sound source angle data and the sound source angle sequential clustering center.
  • the program 1310 is further configured to cause the processor 1302, after obtaining the first identification result of the character to be separated, to perform voiceprint recognition on the voice data frames of the character to be separated within a preset time period to obtain the second identification result of the character to be separated, and, if the first identification result is different from the second identification result, to correct the first identification result using the second identification result to obtain the final identification result of the character to be separated.
  • the voice collection device includes a microphone array
  • the program 1310 is further configured to cause the processor 1302, when acquiring the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collection device, to: obtain the covariance matrix of the voice data frame received by at least some microphones in the microphone array; perform eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues; select the first number of largest eigenvalues from the plurality of eigenvalues, and form a speech signal subspace based on the eigenvectors corresponding to the selected eigenvalues, wherein the first number is equivalent to the estimated number of sound sources; and determine the sound source angle data based on the speech signal subspace.
  • the program 1310 is further configured to cause the processor 1302, after obtaining the final identification result of the character to be separated, to: acquire the face image data of the character to be separated collected by the image acquisition device; perform face recognition on the face image data to obtain the third identification result of the character to be separated; and, if the third identification result is different from the second identification result, correct the second identification result using the third identification result to obtain the final identification result of the character to be separated.
  • in this embodiment, the sound source angle data corresponding to the voice data frame of the character to be separated collected by the voice collection device is obtained, the character to be separated is identified based on the sound source angle data to obtain the first identification result of the character to be separated, and the character is then separated based on the first identification result of the character to be separated. Compared with other existing methods, identifying the character to be separated based on the sound source angle data corresponding to the voice data frame collected by the voice collection device, and then separating the character based on the identification result, makes it possible to separate the character in real time, thereby making the user experience smoother.
  • the program 1310 can specifically be used to cause the processor 1302 to perform the following operations: send a role separation request carrying the voice data frame of the role to be separated to the cloud, so that the cloud obtains the sound source angle data corresponding to the voice data frame based on the role separation request, identifies the role to be separated based on the sound source angle data, and then separates the role based on the identification result of the role to be separated; and receive the separation result of the role sent by the cloud based on the role separation request.
  • the voice collection device sends a role separation request carrying the voice data frame of the character to be separated to the cloud, and the cloud obtains the sound source angle data corresponding to the voice data frame based on the role separation request, and based on the sound source Angle data, identify the roles to be separated, and then separate the roles based on the identification results of the roles to be separated.
  • the voice acquisition device receives the separation results of the roles sent by the cloud based on the role separation request.
  • the program 1310 can specifically be used to cause the processor 1302 to perform the following operations: receive a role separation request, sent by the voice acquisition device, that carries a voice data frame of a role to be separated; obtain the sound source angle data corresponding to the voice data frame based on the role separation request; identify the role to be separated based on the sound source angle data to obtain the identification result of the role to be separated; and separate the role based on the identification result of the role to be separated, and send the role separation result for the role separation request to the voice collection device.
  • in this embodiment, the cloud receives the role separation request sent by the voice acquisition device that carries the voice data frame of the role to be separated, obtains the sound source angle data corresponding to the voice data frame based on the role separation request, identifies the role to be separated based on the sound source angle data to obtain the identification result of the role to be separated, then separates the role based on the identification result of the role to be separated, and sends the role separation result for the role separation request to the voice acquisition device. Compared with other existing methods, identifying the role to be separated based on the sound source angle data corresponding to the voice data frame carried by the role separation request, and then separating the role based on the identification result, makes it possible to separate roles in real time, thereby making the user experience smoother.
  • the program 1310 can specifically be used to cause the processor 1302 to perform the following operations: acquire sound source angle data corresponding to the voice data frame of the conference role collected by the voice collection device located in the conference room; The role is identified to obtain the identification result of the conference role; and the meeting minutes of the conference role are recorded based on the identification result of the conference role.
  • in this embodiment, the sound source angle data corresponding to the voice data frame of the conference role collected by the voice acquisition device located in the conference room is acquired, the conference role is identified based on the sound source angle data to obtain the identification result of the conference role, and the meeting minutes of the conference role are then recorded based on the identification result of the conference role. Compared with other existing methods, identifying the conference role based on the sound source angle data corresponding to the voice data frame of the conference role collected by the voice acquisition device located in the conference room, and then recording the meeting minutes of the conference role based on the identification result, makes it possible to record the meeting minutes of the conference role in real time, thereby effectively improving the recording efficiency of the meeting minutes of the conference role.
  • the program 1310 can specifically be used to cause the processor 1302 to perform the following operations: acquire the sound source angle data corresponding to the voice data frame of the character collected by the voice collection device; identify the character based on the sound source angle data to obtain the identification result of the character; and display the identity data of the character on the interactive interface of the voice collection device based on the identification result of the character.
  • the program 1310 is further configured to cause the processor 1302 to, after acquiring the sound source angle data corresponding to the voice data frame of the character collected by the voice collecting device, turn on the lamps of the voice collection device that are in the sound source direction indicated by the sound source angle data.
  • the program 1310 is further configured to cause the processor 1302 to display the speech action image or the voice waveform image of the character on the interactive interface of the voice collection device.
  • in this embodiment, the sound source angle data corresponding to the voice data frame of the character collected by the voice collection device is obtained, the character is identified based on the sound source angle data to obtain the character identification result, and the character's identity data is then displayed on the interactive interface of the voice collection device based on the identification result of the character.
  • compared with other existing methods, identifying the character based on the sound source angle data corresponding to the character's voice data frame collected by the voice collection device, and then displaying the identity data of the character on the interactive interface of the voice collection device based on the identification result of the character, makes it possible to display the identity data of the character in real time, thereby making the user experience smoother.
  • each component/step described in the embodiments of the present invention may be split into more components/steps, or two or more components/steps or some operations of components/steps may be combined into new components/steps, to achieve the purpose of the embodiments of the present invention.
  • the above-described methods according to the embodiments of the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded over a network to be stored in a local recording medium, so that the methods described herein can be processed by such software, stored on a recording medium, using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or FPGA.
  • it can be understood that a computer, processor, microprocessor controller or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor or hardware, it implements the role separation method, the recording method of meeting minutes, or the role presentation method described herein. Furthermore, when a general-purpose computer accesses code for implementing the role separation method, the recording method of meeting minutes, or the role presentation method shown herein, the execution of the code converts the general-purpose computer into a dedicated computer for performing the role separation method, the recording method of meeting minutes, or the role presentation method shown here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Otolaryngology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A role separation method, a meeting minutes recording method, a role display method, apparatuses, an electronic device and a computer storage medium, relating to the field of speech processing. The method comprises: acquiring sound source angle data corresponding to voice data frames, collected by a voice collection device, of a role to be separated (S101); performing identity recognition on the role to be separated based on the sound source angle data to obtain a first identity recognition result of the role to be separated (S102); and separating the role based on the first identity recognition result of the role to be separated (S103). With this method, roles can be separated in real time, which makes the user experience smoother.

Description

Role separation method, meeting minutes recording method, role display method, apparatus, electronic device and computer storage medium
This application claims priority to Chinese patent application No. 202010596049.3, filed on June 28, 2020 and entitled "Role separation method, meeting minutes recording method, role display method, apparatus, electronic device and computer storage medium", the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present invention relate to the field of speech processing, and in particular to a role separation method, a meeting minutes recording method, a role display method, an apparatus, an electronic device and a computer storage medium.
Background
With the continuous development of information technology, the demand for high-precision informatized analysis keeps growing. Calls and conferences carried out on electronic devices are an indispensable part of people's lives; accordingly, recording and analysing call content or conference content has become a research focus in the related technical fields. For example, in fields such as public emergency hotlines, various service hotlines and company meetings, call or conference content can be recorded and analysed for later information summarisation, retrieval and similar work.
Speaker role separation is an important step in conference content analysis, and how promptly the separation is done directly affects the user experience. At present, speaker role separation is mostly implemented on the basis of voiceprint recognition. Since voiceprint recognition needs to accumulate speech data of a certain duration in order to guarantee a relatively high recognition accuracy, most role separation systems on the market that are based on voiceprint recognition complete the separation of roles on offline speech data, and it is difficult for them to separate roles in real time. It can thus be seen that how to separate roles in real time and thereby improve the user experience has become a technical problem to be solved urgently.
Summary of the invention
In view of this, embodiments of the present invention provide a role separation solution to at least partially solve the above technical problem.
According to a first aspect of the embodiments of the present invention, a role separation method is provided. The method comprises: acquiring sound source angle data corresponding to a voice data frame, collected by a voice collection device, of a role to be separated; performing identity recognition on the role to be separated based on the sound source angle data to obtain a first identity recognition result of the role to be separated; and separating the role based on the first identity recognition result of the role to be separated.
According to a second aspect of the embodiments of the present invention, a role separation method is provided. The method comprises: sending to the cloud a role separation request carrying a voice data frame of a role to be separated, so that the cloud acquires, based on the role separation request, sound source angle data corresponding to the voice data frame, performs identity recognition on the role to be separated based on the sound source angle data, and then separates the role based on the identity recognition result of the role to be separated; and receiving the separation result of the role sent by the cloud based on the role separation request.
According to a third aspect of the embodiments of the present invention, a role separation method is provided. The method comprises: receiving a role separation request, sent by a voice collection device, carrying a voice data frame of a role to be separated; acquiring, based on the role separation request, sound source angle data corresponding to the voice data frame; performing identity recognition on the role to be separated based on the sound source angle data to obtain an identity recognition result of the role to be separated; and separating the role based on the identity recognition result of the role to be separated, and sending a role separation result for the role separation request to the voice collection device.
According to a fourth aspect of the embodiments of the present invention, a meeting minutes recording method is provided. The method comprises: acquiring sound source angle data corresponding to a voice data frame, collected by a voice collection device located in a conference room, of a conference role; performing identity recognition on the conference role based on the sound source angle data to obtain an identity recognition result of the conference role; and recording the meeting minutes of the conference role based on the identity recognition result of the conference role.
According to a fifth aspect of the embodiments of the present invention, a role display method is provided. The method comprises: acquiring sound source angle data corresponding to a voice data frame of a role collected by a voice collection device; performing identity recognition on the role based on the sound source angle data to obtain an identity recognition result of the role; and displaying identity data of the role on an interactive interface of the voice collection device based on the identity recognition result of the role.
According to a sixth aspect of the embodiments of the present invention, a role separation apparatus is provided. The apparatus comprises: a first acquisition module, configured to acquire sound source angle data corresponding to a voice data frame, collected by a voice collection device, of a role to be separated; a first identity recognition module, configured to perform identity recognition on the role to be separated based on the sound source angle data to obtain a first identity recognition result of the role to be separated; and a separation module, configured to separate the role based on the first identity recognition result of the role to be separated.
According to a seventh aspect of the embodiments of the present invention, a role separation apparatus is provided. The apparatus comprises: a first sending module, configured to send to the cloud a role separation request carrying a voice data frame of a role to be separated, so that the cloud acquires, based on the role separation request, sound source angle data corresponding to the voice data frame, performs identity recognition on the role to be separated based on the sound source angle data, and then separates the role based on the identity recognition result of the role to be separated; and a first receiving module, configured to receive the separation result of the role sent by the cloud based on the role separation request.
According to an eighth aspect of the embodiments of the present invention, a role separation apparatus is provided. The apparatus comprises: a second receiving module, configured to receive a role separation request, sent by a voice collection device, carrying a voice data frame of a role to be separated; a third acquisition module, configured to acquire, based on the role separation request, sound source angle data corresponding to the voice data frame; a second identity recognition module, configured to perform identity recognition on the role to be separated based on the sound source angle data to obtain an identity recognition result of the role to be separated; and a second sending module, configured to separate the role based on the identity recognition result of the role to be separated and send a role separation result for the role separation request to the voice collection device.
According to a ninth aspect of the embodiments of the present invention, a meeting minutes recording apparatus is provided. The apparatus comprises: a fourth acquisition module, configured to acquire sound source angle data corresponding to a voice data frame, collected by a voice collection device located in a conference room, of a conference role; a third identity recognition module, configured to perform identity recognition on the conference role based on the sound source angle data to obtain an identity recognition result of the conference role; and a recording module, configured to record the meeting minutes of the conference role based on the identity recognition result of the conference role.
According to a tenth aspect of the embodiments of the present invention, a role display apparatus is provided. The apparatus comprises: a fifth acquisition module, configured to acquire sound source angle data corresponding to a voice data frame of a role collected by a voice collection device; a fourth identity recognition module, configured to perform identity recognition on the role based on the sound source angle data to obtain an identity recognition result of the role; and a first display module, configured to display identity data of the role on an interactive interface of the voice collection device based on the identity recognition result of the role.
According to an eleventh aspect of the embodiments of the present invention, an electronic device is provided, comprising a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the role separation method described in the first, second or third aspect, or the operations corresponding to the meeting minutes recording method described in the fourth aspect, or the operations corresponding to the role display method described in the fifth aspect.
According to a twelfth aspect of the embodiments of the present invention, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, it implements the role separation method described in the first, second or third aspect, or the meeting minutes recording method described in the fourth aspect, or the role display method described in the fifth aspect.
According to the role separation solution provided by the embodiments of the present invention, sound source angle data corresponding to a voice data frame, collected by a voice collection device, of a role to be separated is acquired, identity recognition is performed on the role to be separated based on the sound source angle data to obtain a first identity recognition result of the role to be separated, and the role is then separated based on the first identity recognition result of the role to be separated. Compared with other existing approaches, performing identity recognition on the role to be separated based on the sound source angle data corresponding to the voice data frame collected by the voice collection device, and then separating the role based on the identity recognition result, makes it possible to separate roles in real time, which makes the user experience smoother.
Brief description of the drawings
In order to describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below are only some of the embodiments recorded in the embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from these drawings.
FIG. 1A is a flowchart of the steps of the role separation method in Embodiment 1 of the present application;
FIG. 1B is a schematic diagram of sound propagation under the near-field model provided according to Embodiment 1 of the present application;
FIG. 1C is a schematic diagram of a scenario of the speaker separation method provided according to Embodiment 1 of the present application;
FIG. 2A is a flowchart of the steps of the role separation method in Embodiment 2 of the present application;
FIG. 2B is a schematic diagram of a scenario of the role separation method provided according to Embodiment 2 of the present application;
FIG. 3A is a flowchart of the steps of the role separation method in Embodiment 3 of the present application;
FIG. 3B is a schematic diagram of a scenario of the role separation method provided according to Embodiment 3 of the present application;
FIG. 4A is a flowchart of the steps of the role separation method in Embodiment 4 of the present application;
FIG. 4B is a schematic diagram of a scenario of the role separation method provided according to Embodiment 4 of the present application;
FIG. 5 is a flowchart of the steps of the meeting minutes recording method in Embodiment 5 of the present application;
FIG. 6 is a flowchart of the steps of the role display method in Embodiment 6 of the present application;
FIG. 7 is a schematic structural diagram of the role separation apparatus in Embodiment 7 of the present application;
FIG. 8 is a schematic structural diagram of the role separation apparatus in Embodiment 8 of the present application;
FIG. 9 is a schematic structural diagram of the role separation apparatus in Embodiment 9 of the present application;
FIG. 10 is a schematic structural diagram of the role separation apparatus in Embodiment 10 of the present application;
FIG. 11 is a schematic structural diagram of the meeting minutes recording apparatus in Embodiment 11 of the present application;
FIG. 12 is a schematic structural diagram of the role display apparatus in Embodiment 12 of the present application;
FIG. 13 is a schematic structural diagram of the electronic device in Embodiment 13 of the present application.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention shall fall within the scope of protection of the embodiments of the present invention.
Specific implementations of the embodiments of the present invention are further described below with reference to the drawings of the embodiments of the present invention.
Referring to FIG. 1A, a flowchart of the steps of the role separation method in Embodiment 1 of the present application is shown.
Specifically, the role separation method provided by this embodiment comprises the following steps.
In step S101, sound source angle data corresponding to a voice data frame, collected by a voice collection device, of a role to be separated is acquired.
In the embodiments of the present application, the voice collection device may comprise a sound pick-up. The role to be separated may be a conference speaker to be separated, a call participant to be separated, or the like. A voice data frame may be understood as a speech segment with a duration of 20 to 30 milliseconds. The sound source angle data may be understood as the angle formed between the role to be separated and the voice collection device when the role is speaking. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
In some optional embodiments, the voice collection device comprises a microphone array. When acquiring the sound source angle data corresponding to the voice data frame, collected by the voice collection device, of the role to be separated: a covariance matrix of the voice data frame received by at least some of the microphones in the microphone array is obtained; eigenvalue decomposition is performed on the covariance matrix to obtain a plurality of eigenvalues; a first number of largest eigenvalues are selected from the plurality of eigenvalues, and a speech signal subspace is formed based on the eigenvectors corresponding to the selected eigenvalues, wherein the first number corresponds to the estimated number of sound sources; and the sound source angle data is determined based on the speech signal subspace. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
In a specific example, a microphone array may be provided on a device supporting voice interaction (such as a sound pick-up), and the microphone array is used to receive nearby sound input. A microphone array is an array formed by a group of omnidirectional microphones located at different positions in space and arranged according to a certain geometric rule; it is a device for spatially sampling sound propagating in space, and the collected signal contains spatial position information. According to the topology of the microphone array, arrays can be divided into linear arrays, planar arrays, volumetric arrays, and so on. According to the distance between the sound source and the microphone array, arrays can be modelled with a near-field model or a far-field model. The near-field model regards the sound wave as a spherical wave and takes into account the amplitude differences between the signals received by the microphone elements; the far-field model regards the sound wave as a plane wave, ignores the amplitude differences between the signals received by the elements, and approximately treats the received signals as related by simple time delays. Sound source localisation can be performed according to the signals received by at least some of the microphones in the microphone array to determine the position information of the role. The determined position information may be two-dimensional position coordinates of the role, or the azimuth and distance of the role relative to the at least some microphones, where the azimuth is the azimuth of the role in the coordinate system of the at least some microphones, i.e. the sound source angle data, and the distance is the distance between the role and the centre position of the at least some microphones. As an example, sound source localisation can be performed using the MUSIC algorithm (Multiple Signal Classification) based on the signals received by some or all of the microphones in the microphone array. The basic idea of the MUSIC algorithm is to perform eigenvalue decomposition on the covariance matrix of the output data of an arbitrary array, thereby obtaining a signal subspace corresponding to the signal components and a noise subspace orthogonal to the signal components, and then to use the orthogonality of these two subspaces to estimate the parameters of the signal (direction of arrival, polarisation information and signal strength). For example, the orthogonality of the two subspaces can be used to construct a spatial scanning spectrum and perform a global search for spectral peaks, thereby achieving parameter estimation of the signal.
Taking a microphone array applied to a sound pick-up as an example, the microphone array may be a linear array, and the sound field model may be regarded as a near-field model. In the near-field case, the time difference τ with which the sound source signal arrives at the microphones of the array varies, compared with the far field, not only with the angle but also with the distance. As shown in FIG. 1B, let the distances from the role to be separated to the individual microphones in the microphone array be R_1, R_2, ..., R_(N-1), R_N, and let the propagation speed of sound in air be C; then the time difference between the sound wave arriving at the i-th microphone and arriving at the 1st microphone is τ_i, where
τ_i = (R_i − R_1) / C.
The sound source localisation process under the near-field model is described as follows.
First, the covariance matrix of the signals received by at least some of the microphones in the microphone array can be obtained. For example, the covariance matrix can be expressed as R(f), R(f) = E[X(f)X(f)^H], where X(f) is the data, at different frequency points f, of the signals received by at least some of the microphones in the microphone array after a Fourier transform (such as a short-time Fourier transform); it is frequency-domain data. X(f) can be regarded as a vector, in which each element represents the data, at different frequency points f, of the signal received by one microphone after the Fourier transform. For example, X(f) can be expressed as
X(f) = {X_1(f), X_2(f), ..., X_M(f)}
where X_1(f), X_2(f), X_M(f) represent the data, at different frequency points f, of the signals received by different microphones after the Fourier transform (such as a short-time Fourier transform), and M is the number of microphones. The expression of X(f) actually implies a time variable t; the complete expression should be X(f, t), representing the data contained in a time period t. E denotes the mathematical expectation; taking the expectation or mean is actually taken with respect to time t and can be understood as E[X(f, t)X(f, t)^H], or
R(f) = (1 / (N2 − N1)) · Σ_{t=N1..N2} X(f, t) X(f, t)^H
where N2 − N1 represents the time period corresponding to X(f, t), N1 represents the start time, and N2 represents the end time.
Then eigenvalue decomposition is performed on the covariance matrix to obtain a plurality of eigenvalues. A first number of largest eigenvalues can be selected from these eigenvalues, and the eigenvectors corresponding to the selected eigenvalues form the signal subspace, while the eigenvectors corresponding to the remaining eigenvalues can form the noise subspace, where the first number corresponds to the estimated number of sound sources; for example, if it is considered that there are 3 sound source signals, the eigenvectors corresponding to the three largest eigenvalues can be taken to form the signal subspace. The estimated number of sound sources can be obtained through experience or other estimation methods, which will not be repeated here. For example, after eigenvalue decomposition of R(f), R(f) = U_s(f)Σ_sU_s(f)^H + U_N(f)Σ_NU_N(f)^H, where U_s(f) is the signal subspace formed by the eigenvectors corresponding to the large eigenvalues, U_N(f) is the noise subspace formed by the eigenvectors corresponding to the small eigenvalues, S and N denote different partitions of U, S denoting signal and N denoting noise, so that the partitioned U_s denotes the signal subspace and U_N denotes the noise subspace. Σ denotes a diagonal matrix, i.e. a matrix composed of the eigenvalues. In fact, performing eigenvalue decomposition on R(f) gives R(f) = U(f)ΣU(f)^H, where Σ is a matrix with only main-diagonal elements, the main-diagonal elements of Σ being the eigenvalues obtained from the decomposition; U and Σ are classified according to the size of the main-diagonal elements (eigenvalues) of Σ into a larger class S (i.e. the signal subspace formed by the eigenvectors corresponding to the large eigenvalues) and a smaller class N (the noise subspace formed by the eigenvectors corresponding to the remaining small eigenvalues), giving R(f) = U_s(f)Σ_sU_s(f)^H + U_N(f)Σ_NU_N(f)^H.
Based on the signal subspace, the sound source position can be determined. For example, based on the signal subspace, the maximum response of the signal in two-dimensional space can be determined, and based on the direction of arrival (DOA) corresponding to the maximum response, the sound source position, i.e. the position information of the role, can be determined.
As an example, the response S_(R,θ) of the target signal in two-dimensional space is computed over a frequency range f (the formula image in the source is not reproduced in this extraction), where a(R, θ, f) can be obtained from the relative time differences τ. Here a(R, θ, f) denotes the steering vector of the microphone array, R is the distance between the sound source and the centre of the microphone array, and θ is the azimuth of the sound source in the array coordinate system. Assuming the sound source is at position (R, θ), the relative time difference τ is defined as the difference between the time required for the sound to reach each microphone and the time required to reach the first microphone, τ = (τ_1, τ_2, ..., τ_M), τ_1 = 0; the steering vector of the corresponding position (R, θ) at frequency f can then be obtained as a(R, θ, f) = (a_1, a_2, ..., a_M), where each element a_i is determined by the relative time difference τ_i at frequency f (the formula image in the source is not reproduced in this extraction).
The two-dimensional coordinates of the role to be separated are (R_target, θ_target) = argmax_(R,θ) S_(R,θ); that is, the (R, θ) at which the response S_(R,θ) is maximal is the position of the role to be separated. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
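The near-field search described above can be sketched as follows. This is a minimal illustration only, assuming a known microphone geometry, a per-frequency signal subspace such as the one computed in the earlier sketch, and a grid of candidate (R, θ) positions; the response used here (projection of the steering vector onto the signal subspace) is one common choice and is not necessarily the exact formula of the patent, and all names are hypothetical.

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

def steering_vector(mic_xy: np.ndarray, r: float, theta: float, f: float) -> np.ndarray:
    """Near-field steering vector a(R, theta, f) for a microphone array.

    mic_xy: (num_mics, 2) microphone coordinates in metres (assumed layout).
    r, theta: candidate source range (m) and azimuth (rad) in the array frame.
    """
    src = np.array([r * np.cos(theta), r * np.sin(theta)])
    dists = np.linalg.norm(mic_xy - src, axis=1)
    tau = (dists - dists[0]) / C          # delay relative to the first microphone
    return np.exp(-2j * np.pi * f * tau)

def locate_source(subspaces: dict, mic_xy: np.ndarray, ranges, thetas):
    """Scan candidate (R, theta) positions and return the one with the largest response.

    subspaces: {frequency_hz: U_s} with U_s the signal subspace at that frequency,
               e.g. produced per frequency bin as in the earlier sketch (assumption).
    """
    best, best_pos = -np.inf, None
    for r in ranges:
        for th in thetas:
            s = 0.0
            for f, us in subspaces.items():
                a = steering_vector(mic_xy, r, th, f)
                # projection of the steering vector onto the signal subspace
                s += np.linalg.norm(us.conj().T @ a) ** 2
            if s > best:
                best, best_pos = s, (r, th)
    return best_pos, best
```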
In some optional embodiments, after acquiring the sound source angle data corresponding to the voice data frame, collected by the voice collection device, of the role to be separated, the method further comprises: performing voice endpoint detection on the voice data frame of the role to be separated to obtain a voice data frame with a voice endpoint; filtering and smoothing the voice data frame with the voice endpoint based on the energy spectrum of the voice data frame of the role to be separated, to obtain a filtered and smoothed voice data frame; and updating the sound source angle data based on the filtered and smoothed voice data frame to obtain updated sound source angle data. In this way, by performing voice endpoint detection and filtering and smoothing on the voice data frame of the role to be separated, more stable sound source angle data can be obtained. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
In a specific example, voice activity detection (VAD), also called voice endpoint detection, determines the start and end points of speech within a signal segment containing speech, and then extracts the corresponding non-silent speech signal, thereby eliminating the interference of silent segments and non-speech signals and guaranteeing processing quality. In addition, effective endpoint detection can minimise processing time. When performing voice endpoint detection on the voice data frame of the role to be separated, the detection may be based on spatial entropy or on a neural network model. The spatial-entropy-based voice endpoint detection process is as follows. The sound signals received by the microphone array may contain the voice of the role to be separated as well as surrounding environmental noise, so voice endpoint detection can be performed according to the degree of disorder of the signal space of the sound signals received by at least some of the microphones in the microphone array. In this embodiment, spatial entropy can be used to characterise the degree of disorder of the signal space: when the spatial entropy is relatively small, voice activity can be considered to exist, and when the spatial entropy is relatively large, it can be considered that no voice activity exists. As an example, the covariance matrix of the signals received by at least some of the microphones in the microphone array can first be obtained, and eigenvalue decomposition performed on it to obtain a plurality of eigenvalues. As described above, the signal subspace formed by the large eigenvalues can be regarded as the speech subspace and the subspace formed by the small eigenvalues as the noise subspace, so whether voice activity exists can be determined by analysing the eigenvalues. For example, each eigenvalue can be regarded as a signal subspace (i.e. a signal source), the entropy of these eigenvalues (i.e. the spatial entropy) computed, and whether voice activity exists judged from the size of the computed spatial entropy. For example, the eigenvalues can be normalised, the spatial entropy of the normalised values computed and compared with a predetermined threshold, and whether voice activity exists judged from the comparison result: voice activity is judged to exist when the spatial entropy is smaller than the predetermined threshold, and judged not to exist when it is greater than or equal to the predetermined threshold. The value of the predetermined threshold can be set according to the actual situation and may, for example, be related to the selected localisation frequency band; for example, when the localisation band is 500-5000 Hz, the predetermined threshold may be taken as 1, and when the spatial entropy is smaller than 1 voice activity can be judged to exist, while otherwise it can be judged to be noise with no voice activity. The spatial entropy ES is
ES = −Σ_{i=1..N} p_i · log(p_i)
where p_i is the value obtained after normalising the eigenvalues, N is the number of eigenvalues obtained by eigenvalue decomposition of the covariance matrix, and the base of the logarithm is a number greater than 1, such as 2, 10 or e, which is not limited in the present disclosure. The neural-network-based voice endpoint detection process is as follows: a voice activity detection model can be used to make predictions on the voice data frames acquired by at least some of the microphones in the microphone array, so as to judge whether voice activity exists. The voice activity detection model predicts the voice activity state of the input voice data frame; it may be a neural-network-based model, and the prediction model can be obtained through supervised machine learning. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
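A compact sketch of the spatial-entropy decision described above is given below; the threshold default follows the example value in the text, while the logarithm base and the input layout are assumptions, and the function names are hypothetical.

```python
import numpy as np

def spatial_entropy(cov: np.ndarray, base: float = 10.0) -> float:
    """Spatial entropy of a spatial covariance matrix.

    cov: Hermitian covariance matrix of the microphone signals in the
         localisation band (assumed input).
    """
    eigvals = np.linalg.eigvalsh(cov)
    eigvals = np.clip(eigvals, 1e-12, None)      # guard against tiny negative values
    p = eigvals / eigvals.sum()                  # normalised eigenvalues p_i
    return float(-(p * (np.log(p) / np.log(base))).sum())

def has_speech(cov: np.ndarray, threshold: float = 1.0) -> bool:
    """Declare voice activity when the spatial entropy is below the threshold."""
    return spatial_entropy(cov) < threshold
```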
In some optional embodiments, when filtering and smoothing the voice data frame with the voice endpoint based on the energy spectrum of the voice data frame of the role to be separated to obtain the filtered and smoothed voice data frame, the voice data frame with the voice endpoint is filtered and smoothed through a median filter based on the spectral flatness of the energy spectrum of the voice data frame of the role to be separated, so as to obtain the filtered and smoothed voice data frame. In this way, filtering and smoothing the voice data frame with the voice endpoint through a median filter, based on the spectral flatness of the energy spectrum of the voice data frame of the role to be separated, can effectively improve the filtering and smoothing effect on voice data frames with voice endpoints. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
In a specific example, median filtering is a non-linear digital filtering technique that is often used to remove noise from images or other signals. The design idea of a median filter is to examine the samples of the input signal and judge whether they are representative of the signal, using an observation window composed of an odd number of samples: the values in the observation window are sorted, the median located in the middle of the window is taken as the output, then the oldest value is discarded, a new sample is obtained, and the computation is repeated. The spectral flatness of the energy spectrum of the voice data frame of the role to be separated can be understood as the flatness of that energy spectrum; it is a characteristic parameter of the energy spectrum, and can be obtained by computing on the energy spectrum of the voice data frame of the role to be separated. When filtering and smoothing the voice data frame with the voice endpoint through the median filter based on the spectral flatness of the energy spectrum of the voice data frame of the role to be separated, the energy spectrum of the voice data frame with the voice endpoint is filtered and smoothed through the median filter based on the spectral flatness, so as to obtain a filtered and smoothed energy spectrum, and the filtered and smoothed voice data frame is determined based on the filtered and smoothed energy spectrum. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
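One plausible reading of the median-filter smoothing described above is sketched below: the spectral flatness of each frame's energy spectrum gates which frames are kept, and a median filter then smooths each retained spectrum. The flatness threshold, kernel size and input layout are assumptions, not values from the patent.

```python
import numpy as np
from scipy.signal import medfilt

def spectral_flatness(power_spectrum: np.ndarray) -> float:
    """Spectral flatness: geometric mean / arithmetic mean of the power spectrum."""
    ps = np.clip(power_spectrum, 1e-12, None)
    return float(np.exp(np.mean(np.log(ps))) / np.mean(ps))

def smooth_voiced_frames(frame_spectra: np.ndarray, flatness_max: float = 0.5,
                         kernel: int = 5) -> np.ndarray:
    """Keep frames whose spectrum is peaky enough (low flatness) and
    median-smooth each retained energy spectrum across frequency.

    frame_spectra: (num_frames, num_bins) energy spectra of frames that already
                   passed voice endpoint detection (assumed input layout).
    """
    keep = np.array([spectral_flatness(s) < flatness_max for s in frame_spectra])
    return np.array([medfilt(s, kernel_size=kernel) for s in frame_spectra[keep]])
```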
In some optional embodiments, when updating the sound source angle data based on the filtered and smoothed voice data frame, the sound source angle data corresponding to the filtered and smoothed voice data frame is acquired, and the sound source angle data corresponding to the voice data frame of the role to be separated is updated using the sound source angle data corresponding to the filtered and smoothed voice data frame. The specific implementation of acquiring the sound source angle data corresponding to the filtered and smoothed voice data frame is similar to the above specific implementation of acquiring the sound source angle data corresponding to the voice data frame, collected by the voice collection device, of the role to be separated, and is not repeated here. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
In step S102, identity recognition is performed on the role to be separated based on the sound source angle data, so as to obtain a first identity recognition result of the role to be separated.
In some optional embodiments, when performing identity recognition on the role to be separated based on the sound source angle data to obtain the first identity recognition result of the role to be separated, sequential clustering is performed on the sound source angle data to obtain a sequential clustering result of the sound source angle data, and the role identity corresponding to the sequential clustering result of the sound source angle data is determined to be the first identity recognition result of the role to be separated. The role identity may be the role's name, nickname, identity code, or the like. In this way, by performing sequential clustering on the sound source angle data, the first identity recognition result of the role to be separated can be obtained accurately. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
In a specific example, sequential clustering is generally an unsupervised method for classifying data in a certain number of homogeneous data sets (or clusters). In this case, the number of clusters is not determined in advance but is increased gradually and in sequence (one by one) according to a given criterion until an appropriate stopping condition is met. The advantages of a sequential clustering algorithm are twofold: first, redundant computation over an unnecessarily large number of clusters is avoided; second, clusters are usually extracted in an ordered sequence, from the most important cluster (the one with the largest capacity) to the least important (smallest capacity). When performing sequential clustering on the sound source angle data to obtain the sequential clustering result of the sound source angle data, the distance between the sound source angle data and a sound source angle sequential clustering centre is determined, and the sequential clustering result of the sound source angle data is determined based on the distance between the sound source angle data and the sound source angle sequential clustering centre. The sound source angle sequential clustering centre can be understood as the central angle of each sequential cluster of sound source angles; for example, supposing there are three sequential clusters of sound source angles, the central angles of the three clusters may be 30 degrees, 60 degrees and 90 degrees. When determining the distance between the sound source angle data and a sound source angle sequential clustering centre, the absolute value of the difference between the sound source angle data and the sound source angle sequential clustering centre is determined to be the distance between them. When determining the sequential clustering result of the sound source angle data based on this distance, the distance between the sound source angle data and the sound source angle sequential clustering centre is compared with a preset distance threshold; if the distance is smaller than the preset distance threshold, the sequential clustering result of the sound source angle data is determined to be the sequential cluster in which that sound source angle sequential clustering centre is located, and if the distance is equal to or greater than the preset distance threshold, the sequential clustering result of the sound source angle data is determined not to be the sequential cluster in which that clustering centre is located. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
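The threshold rule described above (assign an angle to an existing sequential cluster when its distance to the cluster centre is below a preset threshold, otherwise open a new cluster, i.e. a new role) can be illustrated with the following sketch; the threshold value and the running-mean update of the centre are illustrative assumptions.

```python
def sequential_cluster(angles, threshold=15.0):
    """Sequentially cluster sound source angles (in degrees).

    Each incoming angle is compared with the existing cluster centres; if the
    closest centre is within `threshold` degrees the angle joins that cluster
    (and the centre is updated as a running mean), otherwise a new cluster -
    i.e. a new role identity - is opened.
    """
    centres, counts, labels = [], [], []
    for a in angles:
        if centres:
            dists = [abs(a - c) for c in centres]
            k = min(range(len(centres)), key=dists.__getitem__)
        if not centres or dists[k] >= threshold:
            centres.append(float(a)); counts.append(1); labels.append(len(centres) - 1)
        else:
            counts[k] += 1
            centres[k] += (a - centres[k]) / counts[k]   # running mean update
            labels.append(k)
    return labels, centres

labels, centres = sequential_cluster([30, 31, 62, 29, 60, 91])
print(labels)   # [0, 0, 1, 0, 1, 2]
print(centres)  # approximately [30.0, 61.0, 91.0]
```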
In step S103, the role is separated based on the first identity recognition result of the role to be separated.
In the embodiments of the present application, after the first identity recognition result of the role to be separated is obtained, the first identity recognition result can be used to distinguish the role to be separated, thereby achieving the separation of the role to be separated. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
In a specific example, as shown in FIG. 1C, the voice data frame of the role to be separated is collected by the voice collection device. After the voice collection device has collected the voice data frame of the role to be separated, voice endpoint detection is performed on the voice data frame of the role to be separated to obtain a voice data frame with a voice endpoint, and the voice data frame with the voice endpoint is filtered and smoothed based on the energy spectrum of the voice data frame of the role to be separated to obtain a filtered and smoothed voice data frame; sound source localisation is then performed on the filtered and smoothed voice data frame to determine the sound source angle data; sequential clustering is then performed on the sound source angle data to obtain the sequential clustering result of the sound source angle data; the role identity corresponding to the sequential clustering result of the sound source angle data is then determined to be the identity recognition result of the role to be separated; and finally the role is separated based on the identity recognition result of the role to be separated. It can be understood that the above description is merely exemplary, and the embodiments of the present application do not impose any limitation on this.
With the role separation method provided by the embodiments of the present application, the sound source angle data corresponding to the voice data frame, collected by the voice collection device, of the role to be separated is acquired, identity recognition is performed on the role to be separated based on the sound source angle data to obtain the first identity recognition result of the role to be separated, and the role is then separated based on the first identity recognition result of the role to be separated. Compared with other existing approaches, performing identity recognition on the role to be separated based on the sound source angle data corresponding to the voice data frame collected by the voice collection device, and then separating the role based on the identity recognition result, makes it possible to separate roles in real time, which makes the user experience smoother.
The role separation method provided by this embodiment can be executed by any appropriate device with data processing capability, including but not limited to: a camera, a terminal, a mobile terminal, a PC, a server, a vehicle-mounted device, an entertainment device, an advertising device, a personal digital assistant (PDA), a tablet computer, a notebook computer, a handheld game console, smart glasses, a smart watch, a wearable device, a virtual display device or a display enhancement device, etc.
Referring to FIG. 2A, a flowchart of the steps of the role separation method of Embodiment 2 of the present application is shown.
Specifically, the role separation method provided by this embodiment comprises the following steps.
In step S201, sound source angle data corresponding to a voice data frame, collected by a voice collection device, of a role to be separated is acquired.
Since the specific implementation of step S201 is similar to that of step S101 above, it is not repeated here.
In step S202, identity recognition is performed on the role to be separated based on the sound source angle data, so as to obtain a first identity recognition result of the role to be separated.
Since the specific implementation of step S202 is similar to that of step S102 above, it is not repeated here.
在步骤S203中,对所述待分离的角色在预设时间段内的语音数据帧进行声纹识别,以获得所述待分离的角色的第二身份识别结果。
在本申请实施例中,声纹(Voiceprint)指的是人类语音中携带言语信息的声波频谱,具备独特的生物学特征,具有身份识别的作用。声纹识别(Voiceprint Identification),又称角色识别(Speaker Identification),该技术是从角色发出的语音信号中提取语音特征,并据此对角色进行身份验证的生物识别技术。声纹识别的过程通常是,预先存储某个或某些用户的声纹信息(存储了声纹信息的用户为注册用户),将从角色语音信号中提取出来的语音特征与预先存储的声纹进行比对,得到一个相似度分值,然后将该分值与阈值进行 比较,若分值大于阈值,则认为角色就是该声纹所对应的注册用户;若分值小于等于阈值,则认为角色不是该声纹所对应的注册用户。所述预设时间段可由本领域技术人员根据实际需求进行设定,本申请实施例对此不做任何限定。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一个具体的例子中,所述待分离的角色在预设时间段内的语音数据帧可以在声纹识别之前经受不同层次的预处理。这种预处理可以促进更加高效的声纹识别。在各种实施方式中,预处理可以包括:采样;量化;去除非语音的音频数据和静默的音频数据;对包括语音的音频数据进行分帧、加窗,以供后续处理等等。经过预处理之后,可以提取待分离的角色在预设时间段内的语音数据帧的语音特征,并基于语音数据帧的语音特征将语音数据帧与用户的声纹进行匹配。语音特征可以是滤波器组FBank(Filter Bank)、梅尔频率倒谱系数MFCC(Mel Frequency Cepstral Coefficients)、感知线性预测系数PLP、深度特征Deep Feature、以及能量规整谱系数PNCC等特征中的一种或者多种的组合。在一种实施例中,还可以对提取得到的语音特征进行归一化处理。而后,基于语音数据帧的语音特征,将语音数据帧与用户的声纹进行匹配,以得到语音数据帧与用户的声纹之间的相似评分,并根据该相似评分来确定与语音数据帧相匹配的用户。具体地,在一些实施方式中,用户的声纹以声纹模型来描述,例如隐马尔可夫模型(HMM模型)、高斯混合模型(GMM模型)等等。用户的声纹模型以语音特征为特征,利用包括用户语音的音频数据(后文简称为用户的音频数据)训练得到。可以采用匹配运算函数来计算语音数据帧与用户的声纹之间的相似度。例如可以计算语音数据帧的语音特征与用户的声纹模型相匹配的后验概率来作为相似评分,也可以计算语音数据帧的语音特征与用户的声纹模型之间的似然度来作为相似评分。但由于训练好用户的声纹模型需要大量该用户的音频数据,因此在一些实施方式中,用户的声纹模型可以基于与用户无关的通用背景模型,利用少量用户的音频数据训练得到(同样以语音特征为特征)。例如,可以先使用与用户无关的、多个角色的音频数据,通过期望最大化算法EM训练得到通用背景模型(Universal Background Model,UBM),以表征用户无关的特征分布。再基于该UBM模型,利用少量的用户的音频数据通过自适应算法(如最大后验概率MAP,最大似然线性回归MLLR等)训练得到GMM模型(这样得到的GMM模型称之为GMM-UBM模型),以表征用户的特征分布。该GMM-UBM模型即为用户的声纹模型。此时,可以基于语音数据帧的语音特征,分别将语音数据帧与用户的声纹模型和通用背景模型进行匹配,以得到语音数据帧与用户的声纹之间的相似评分。例如,分别计算语音数据帧的语音特征与上述UBM模型和GMM-UBM模型之间的似然度,然后将这两个似然度相除后取对数,将得到的值作为语音数据帧与用户的声纹之间的相似评分。
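为便于理解上述GMM-UBM打分方式,下面给出一个基于scikit-learn的示意性Python草图:先在多个角色的背景数据上训练UBM,再在少量用户数据上训练用户GMM(此处以"从UBM参数出发继续EM训练"近似代替MAP自适应),最后以两者对测试帧的平均对数似然之差作为相似评分。混合数、特征维度等均为示例性假设,本申请实施例对此不做任何限定。
    import numpy as np
    from sklearn.mixture import GaussianMixture

    # X_background: (N, D) 的多个角色的语音特征帧(如MFCC/FBank); X_user: 少量用户特征帧; X_test: 待打分的语音数据帧特征
    def train_and_score(X_background, X_user, X_test, n_components=64):
        ubm = GaussianMixture(n_components=n_components, covariance_type="diag", max_iter=100)
        ubm.fit(X_background)                                   # EM训练通用背景模型UBM
        spk = GaussianMixture(n_components=n_components, covariance_type="diag", max_iter=20,
                              weights_init=ubm.weights_, means_init=ubm.means_,
                              precisions_init=ubm.precisions_)
        spk.fit(X_user)                                         # 从UBM参数出发继续训练, 近似得到"GMM-UBM"用户模型
        # score() 返回每帧的平均对数似然, 两者之差即对数似然比形式的相似评分
        return spk.score(X_test) - ubm.score(X_test)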
在另一些实施方式中,用户的声纹以声纹向量来描述,例如i-vector、d-vector、x-vector和j-vector等等。可以至少基于语音数据帧的语音特征,提取语音数据帧的声纹向量。根 据一种实施例,可以先利用语音数据帧的语音特征训练待分离的角色的声纹模型。如前文类似地,可以基于预先训练好的与用户无关的上述通用背景模型,利用语音数据帧的语音特征训练得到待分离的角色的声纹模型。在得到待分离的角色的声纹模型之后,可以根据该声纹模型提取语音数据帧的均值超矢量。例如,可以将待分离的角色的GMM-UBM模型的各个GMM分量的均值进行拼接,得到待分离的角色的GMM-UBM模型的均值超矢量,即语音数据帧的均值超矢量。之后,可以采用联合因子分析法(JFA)或者简化的联合因子分析法,从语音数据帧的均值超矢量中提取得到低维的声纹向量。以i-vector为例,在训练得到与用户无关的上述通用背景模型(UBM模型)之后,可以提取该通用背景模型的均值超矢量,并估计全局差异空间(Total Variability Space,T)矩阵。而后基于语音数据帧的均值超矢量、T矩阵、通用背景模型的均值超矢量来计算语音数据帧的i-vector。具体地,i-vector可以根据以下公式计算得到:
M_{s,h} = m_u + T·ω_{s,h}
其中,M_{s,h}是从角色s的语音h中得到的均值超矢量,m_u是通用背景模型的均值超矢量,T是全局差异空间矩阵,ω_{s,h}是全局差异因子,也就是i-vector。
根据另一种实施例,还可以利用训练好的深度神经网络(Deep Neural Network,DNN)来得到语音数据帧的声纹向量。以d-vector为例,DNN可以包括输入层、隐层和输出层。可以先将语音数据帧的FBank特征输入到DNN输入层,DNN最后一个隐层的输出即为d-vector。
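以下是提取d-vector的一个示意性PyTorch草图:将语音数据帧的FBank特征送入一个小型DNN(训练目标为按角色分类),取最后一个隐层输出在时间上的均值作为段级d-vector。网络层数、维度与角色数均为示例性假设:
    import torch
    import torch.nn as nn

    class DVectorNet(nn.Module):
        def __init__(self, feat_dim=40, hidden_dim=256, num_speakers=1000):
            super().__init__()
            self.hidden = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            self.classifier = nn.Linear(hidden_dim, num_speakers)   # 训练时按角色分类

        def forward(self, feats):
            h = self.hidden(feats)                                  # 最后一个隐层的帧级输出
            return self.classifier(h), h

    # 推理时: feats 为 (帧数, feat_dim) 的FBank特征, 帧级隐层输出取均值即为 d-vector
    net = DVectorNet()
    feats = torch.randn(200, 40)
    _, frame_outputs = net(feats)
    d_vector = frame_outputs.mean(dim=0)
可以理解的是,以上代码仅为示例性的,本申请实施例对此不做任何限定。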
在得到语音数据帧的声纹向量之后,可以基于语音数据帧的声纹向量和用户的声纹向量,来计算语音数据帧与用户的声纹之间的相似评分。其中,可以采用支持向量机(SVM)、LDA(Linear Discriminant Analysis,线性判别分析)、PLDA(Probabilistic Linear Discriminant Analysis,概率线性判别分析)、似然度和余弦距离(Cosine Distance)等算法来计算语音数据帧与用户的声纹之间的相似评分。
以PLDA算法为例,假设训练数据由I个角色的语音组成,其中每个角色有J段不一样的语音,并且定义第i个角色的第j段语音为Y_{ij}。那么,定义Y_{ij}的生成模型为:
Y_{ij} = μ + F·h_i + G·w_{ij} + ε_{ij}
其中,μ是声纹向量的均值,F、G是空间特征矩阵,各自代表角色类间特征空间和类内特征空间。F的每一列,相当于类间特征空间的特征向量,G的每一列,相当于类内特征空间的特征向量。向量h_i和w_{ij}可以看作是该语音分别在各自空间的特征表示,ε_{ij}则是残差噪声项。两条语音的h_i特征相同的似然度越大,即相似评分越高,它们来自同一个角色的可能性就越大。
PLDA的模型参数包括4个,即μ、F、G以及噪声项ε_{ij}的协方差,是采用EM算法迭代训练而成。通常地,可以采用简化版的PLDA模型,忽略类内特征空间矩阵G的训练,只训练类间特征空间矩阵F,即:
Y_{ij} = μ + F·h_i + ε_{ij}
可以基于语音数据帧的声纹向量,参照上述公式得到语音数据帧的h_i特征。同样地,基于用户的声纹向量,参照上述公式得到用户语音的h_i特征。而后,可以计算两个h_i特征的对数似然比或余弦距离来作为语音数据帧与用户的声纹之间的相似评分。
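在得到两个声纹向量(或两个h_i特征)之后,余弦距离打分可以用如下Python片段示意;其中向量维度与判决阈值均为示例性假设:
    import numpy as np

    def cosine_score(emb_a, emb_b):
        a, b = np.asarray(emb_a, dtype=float), np.asarray(emb_b, dtype=float)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

    emb_enrolled = np.random.randn(256)    # 注册用户的声纹向量(示例数据)
    emb_test = np.random.randn(256)        # 待分离角色的声纹向量(示例数据)
    same_speaker = cosine_score(emb_test, emb_enrolled) > 0.7   # 0.7 为示例阈值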
应当注意的是,声纹并不限于上述声纹向量(i-vector、d-vector和x-vector等等)和上述声纹模型(HMM模型和GMM模型等等),相应的相似评分算法也可依据所选定的声纹来任意选取,本发明对此不做限制。
在各种实施方式中,如果得到的相似评分超过相似阈值,则确定语音数据帧与该用户的声纹相匹配,也就是确定语音数据帧与该声纹对应的用户相匹配。否则确定语音数据帧不与该用户的声纹相匹配。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在步骤S204中,若所述第一身份识别结果与所述第二身份识别结果不相同,则使用所述第二身份识别结果更正所述第一身份识别结果,以获得所述待分离的角色的最终身份识别结果。
在本申请实施例中,若所述第一身份识别结果与所述第二身份识别结果相同,则无需使用所述第二身份识别结果更正所述第一身份识别结果,并将所述第一身份识别结果确定为所述待分离的角色的最终身份识别结果。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在步骤S205中,基于所述待分离的角色的最终身份识别结果分离所述角色。
在本申请实施例中,在获得所述待分离的角色的最终身份识别结果之后,便可使用所述待分离的角色的最终身份识别结果区分所述角色,从而实现了所述待分离的角色的分离。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一些可选实施例中,在获得所述待分离的角色的最终身份识别结果之后,所述方法还包括:获取图像采集装置采集的所述待分离的角色的人脸图像数据;对所述人脸图像数据进行人脸识别,以获得所述待分离的角色的第三身份识别结果;若所述第三身份识别结果与所述第二身份识别结果不相同,则使用所述第三身份识别结果更正所述第二身份识别结果,以获得所述待分离的角色的最终身份识别结果。籍此,在对待分离的角色在预设时间段内的语音数据帧进行声纹识别的结果与对人脸图像数据进行人脸识别的结果不相同的情况下,使用对人脸图像数据进行人脸识别的结果更正对待分离的角色在预设时间段内的语音数据帧进行声纹识别的结果,能够准确地获得角色的身份识别结果,进而能够根据角色的身份识别结果准确地分离角色。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一个具体的例子中,所述图像采集装置可为摄像头。语音采集设备可从摄像头中获取摄像头采集的待分离的角色的人脸图像数据。在对所述人脸图像数据进行人脸识别时,可通过人脸识别模型,对所述人脸图像数据进行人脸识别,以获得所述待分离的角色的第三身份识别结果。其中,所述人脸识别模型可为用于人脸识别的神经网络模型。若所述第三身份识别结果与所述第二身份识别结果相同,则无需使用所述第三身份识别结果更正所述第二身份识别结果,并将所述第二身份识别结果确定为所述待分离的角色的最终身份识别结果。在获得所述待分离的角色的最终身份识别结果之后,便可使用所述待分离的角色的最终身份识别结果区分所述角色,从而实现了所述待分离的角色的分离。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
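上述"第一、第二、第三身份识别结果逐级更正"的逻辑可以用如下Python片段示意,仅为对更正顺序的示例性说明:
    def resolve_identity(first_id, second_id=None, third_id=None):
        # first_id: 基于声源角度的识别结果; second_id: 声纹识别结果; third_id: 人脸识别结果
        final_id = first_id
        if second_id is not None and second_id != final_id:
            final_id = second_id                 # 声纹结果与声源角度结果不同时, 以声纹结果更正
        if third_id is not None and second_id is not None and third_id != second_id:
            final_id = third_id                  # 人脸结果与声纹结果不同时, 以人脸结果更正
        return final_id
可以理解的是,以上代码仅为示例性的,本申请实施例对此不做任何限定。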
在一个具体的例子中,如图2B所示,待分离的角色的语音数据帧由语音采集设备采集。当语音采集设备采集到待分离的角色的语音数据帧之后,对待分离的角色的语音数据帧进行语音端点检测,以获得具有语音端点的语音数据帧,并基于所述待分离的角色的语音数据帧的能量频谱,对所述具有语音端点的语音数据帧进行过滤平滑,以获得过滤平滑后的语音数据帧;再对所述过滤平滑后的语音数据帧进行声源定位,以确定声源角度数据;再对所述声源角度数据进行顺序聚类,以获得所述声源角度数据的顺序聚类结果;再确定所述声源角度数据的顺序聚类结果对应的角色身份标识为所述待分离的角色的第一身份识别结果,再对所述待分离的角色在预设时间段内的语音数据帧进行声纹识别,以获得所述待分离的角色的第二身份识别结果,若所述第一身份识别结果与所述第二身份识别结果不相同,则使用所述第二身份识别结果更正所述第一身份识别结果,以获得所述待分离的角色的最终身份识别结果,最后基于所述待分离的角色的最终身份识别结果分离所述角色。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在上述实施例一的基础上,对待分离的角色在预设时间段内的语音数据帧进行声纹识别,以获得待分离的角色的第二身份识别结果,若第一身份识别结果与第二身份识别结果不相同,则使用第二身份识别结果更正第一身份识别结果,以获得待分离的角色的最终身份识别结果,并基于待分离的角色的最终身份识别结果分离待分离的角色,与现有的其它方式相比,在基于声源角度数据进行身份识别的结果与对角色在预设时间段内的语音数据帧进行声纹识别的结果不相同的情况下,使用对角色在预设时间段内的语音数据帧进行声纹识别的结果更正基于声源角度数据进行身份识别的结果,能够准确地获得角色的身份识别结果,进而能够根据角色的身份识别结果准确地分离角色。
本实施例提供的角色分离方法可以由任意适当的具有数据处理能力的设备执行,包括但不限于:摄像头、终端、移动终端、PC机、服务器、车载设备、娱乐设备、广告设备、个人数码助理(PDA)、平板电脑、笔记本电脑、掌上游戏机、智能眼镜、智能手表、可穿戴设备、虚拟显示设备或显示增强设备等。
参照图3A,示出了本申请实施例三的角色分离方法的步骤流程图。
具体地,本实施例提供的角色分离方法包括以下步骤:
在步骤S301中,向云端发送携带有待分离的角色的语音数据帧的角色分离请求,使得所述云端基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据,并基于所述声源角度数据,对所述待分离的角色进行身份识别,再基于所述待分离的角色的身份识别结果分离所述角色。
在本申请实施例中,语音采集设备向云端发送携带有待分离的角色的语音数据帧的角色分离请求,所述云端基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据,并基于所述声源角度数据,对所述待分离的角色进行身份识别,再基于所述待分离的角色的身份识别结果分离所述角色。其中,所述云端基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据的具体实施方式与上述实施例一中获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据的具体实施方式类似,在此不再赘述。所述云端基于所述声源角度数据,对所述待分离的角色进行身份识别的具体实施方式与上述实施例一中基于所述声源角度数据,对所述待分离的角色进行身份识别的具体实施方式类似,在此不再赘述。所述云端基于所述待分离的角色的身份识别结果分离所述角色的具体实施方式与上述实施例一中基于所述待分离的角色的身份识别结果分离所述角色的具体实施方式类似,在此不再赘述。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
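语音采集设备与云端之间的交互可以用如下假设性的Python客户端草图示意;其中的URL路径、字段名与返回格式均为假设,并非本申请实施例限定的接口:
    import base64
    import requests  # 常见的HTTP客户端库

    def request_role_separation(pcm_bytes, server_url, sample_rate=16000):
        payload = {
            "audio": base64.b64encode(pcm_bytes).decode("ascii"),   # 待分离的角色的语音数据帧
            "sample_rate": sample_rate,
        }
        resp = requests.post(server_url + "/role/separate", json=payload, timeout=10)
        resp.raise_for_status()
        # 假设云端返回形如 {"segments": [{"role_id": "spk_1", "start": 0.0, "end": 1.2}, ...]} 的分离结果
        return resp.json()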
在步骤S302中,接收所述云端基于所述角色分离请求发送的所述角色的分离结果。
在本申请实施例中,语音采集设备接收所述云端基于所述角色分离请求发送的所述角色的分离结果。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一个具体的例子中,如图3B所示,待分离的角色的语音数据帧由语音采集设备采集。当语音采集设备采集到待分离的角色的语音数据帧之后,将待分离的角色的语音数据帧发送至云端进行角色分离。具体地,云端对待分离的角色的语音数据帧进行语音端点检测,以获得具有语音端点的语音数据帧,并基于所述待分离的角色的语音数据帧的能量频谱,对所述具有语音端点的语音数据帧进行过滤平滑,以获得过滤平滑后的语音数据帧;再对所述过滤平滑后的语音数据帧进行声源定位,以确定声源角度数据;再对所述声源角度数据进行顺序聚类,以获得所述声源角度数据的顺序聚类结果;再确定所述声源角度数据的顺序聚类结果对应的角色身份标识为所述待分离的角色的身份识别结果,最后基于所述待分离的角色的身份识别结果分离所述角色,并将所述角色的分离结果发送至语音采集设备。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
通过本申请实施例提供的角色分离方法,语音采集设备向云端发送携带有待分离的角色的语音数据帧的角色分离请求,云端基于角色分离请求,获取语音数据帧所对应的声源角度数据,并基于声源角度数据,对待分离的角色进行身份识别,再基于待分离的 角色的身份识别结果分离角色,语音采集设备接收云端基于角色分离请求发送的角色的分离结果,与现有的其它方式相比,基于角色分离请求携带的待分离的角色的语音数据帧所对应的声源角度数据,对待分离的角色进行身份识别,再基于待分离的角色的身份识别结果分离角色,能够实时地分离角色,进而使得用户体验更流畅。
本实施例提供的角色分离方法可以由任意适当的具有数据处理能力的设备执行,包括但不限于:摄像头、终端、移动终端、PC机、服务器、车载设备、娱乐设备、广告设备、个人数码助理(PDA)、平板电脑、笔记本电脑、掌上游戏机、智能眼镜、智能手表、可穿戴设备、虚拟显示设备或显示增强设备等。
参照图4A,示出了本申请实施例四的角色分离方法的步骤流程图。
具体地,本实施例提供的角色分离方法包括以下步骤:
在步骤S401中,接收语音采集设备发送的携带有待分离的角色的语音数据帧的角色分离请求。
在本申请实施例中,云端接收语音采集设备发送的携带有待分离的角色的语音数据帧的角色分离请求。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在步骤S402中,基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据。
在本申请实施例中,所述云端基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据的具体实施方式与上述实施例一中获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据的具体实施方式类似,在此不再赘述。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在步骤S403中,基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的身份识别结果。
在本申请实施例中,所述云端基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的身份识别结果的具体实施方式与上述实施例一中基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果的具体实施方式类似,在此不再赘述。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在步骤S404中,基于所述待分离的角色的身份识别结果分离所述角色,并向所述语音采集设备发送针对所述角色分离请求的角色分离结果。
在本申请实施例中,所述云端基于所述待分离的角色的身份识别结果分离所述角色,并向所述语音采集设备发送针对所述角色分离请求的角色分离结果。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一个具体的例子中,如图4B所示,待分离的角色的语音数据帧由语音采集设备 采集。当语音采集设备采集到待分离的角色的语音数据帧之后,将待分离的角色的语音数据帧发送至云端进行角色分离。具体地,云端对待分离的角色的语音数据帧进行语音端点检测,以获得具有语音端点的语音数据帧,并基于所述待分离的角色的语音数据帧的能量频谱,对所述具有语音端点的语音数据帧进行过滤平滑,以获得过滤平滑后的语音数据帧;再对所述过滤平滑后的语音数据帧进行声源定位,以确定声源角度数据;再对所述声源角度数据进行顺序聚类,以获得所述声源角度数据的顺序聚类结果;再确定所述声源角度数据的顺序聚类结果对应的角色身份标识为所述待分离的角色的第一身份识别结果,再对所述待分离的角色在预设时间段内的语音数据帧进行声纹识别,以获得所述待分离的角色的第二身份识别结果,若所述第一身份识别结果与所述第二身份识别结果不相同,则使用所述第二身份识别结果更正所述第一身份识别结果,以获得所述待分离的角色的最终身份识别结果,最后基于所述待分离的角色的最终身份识别结果分离所述角色,并将所述待分离的角色的分离结果发送至语音采集设备。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
通过本申请实施例提供的角色分离方法,云端接收语音采集设备发送的携带有待分离的角色的语音数据帧的角色分离请求,并基于角色分离请求,获取语音数据帧所对应的声源角度数据,再基于声源角度数据,对待分离的角色进行身份识别,以获得待分离的角色的身份识别结果,再基于待分离的角色的身份识别结果分离角色,并向语音采集设备发送针对角色分离请求的角色分离结果,与现有的其它方式相比,基于角色分离请求携带的待分离的角色的语音数据帧所对应的声源角度数据,对待分离的角色进行身份识别,再基于待分离的角色的身份识别结果分离角色,能够实时地分离角色,进而使得用户体验更流畅。
本实施例提供的角色分离方法可以由任意适当的具有数据处理能力的设备执行,包括但不限于:摄像头、终端、移动终端、PC机、服务器、车载设备、娱乐设备、广告设备、个人数码助理(PDA)、平板电脑、笔记本电脑、掌上游戏机、智能眼镜、智能手表、可穿戴设备、虚拟显示设备或显示增强设备等。
参照图5,示出了本申请实施例五的会议纪要的记录方法的步骤流程图。
具体地,本实施例提供的会议纪要的记录方法包括以下步骤:
在步骤S501中,获取位于会议室的语音采集设备采集的会议角色的语音数据帧所对应的声源角度数据。
在本申请实施例中,所述位于会议室的语音采集设备可为位于会议室的拾音器。所述会议角色可理解为参加会议的人员。其中,所述获取位于会议室的语音采集设备采集的会议角色的语音数据帧所对应的声源角度数据的具体实施方式与上述实施例一中获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据的具体实施方式类似,在此不再赘述。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做 任何限定。
在步骤S502中,基于所述声源角度数据,对所述会议角色进行身份识别,以获得所述会议角色的身份识别结果。
在本申请实施例中,所述基于所述声源角度数据,对所述会议角色进行身份识别,以获得所述会议角色的身份识别结果的具体实施方式与上述实施例一中基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果的具体实施方式类似,在此不再赘述。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在步骤S503中,基于所述会议角色的身份识别结果记录所述会议角色的会议纪要。
在本申请实施例中,在获得所述会议角色的身份识别结果之后,便可使用所述会议角色的身份识别结果区分所述会议角色,进而可以实时记录会议角色的会议纪要。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一个具体的例子中,在基于所述会议角色的身份识别结果记录所述会议角色的会议纪要时,基于所述会议角色的身份识别结果,对所述会议角色的会议纪要语音数据进行语音识别,以获得所述会议角色的会议纪要文本数据,并记录所述会议角色的会议纪要文本数据。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
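基于身份识别结果记录会议纪要的过程可以用如下Python片段示意;其中asr_fn代表一个假设的语音识别函数,实际可替换为任意语音识别能力:
    def record_minutes(labeled_segments, asr_fn):
        # labeled_segments: [(会议角色身份标识, 该角色的会议纪要语音数据), ...]
        minutes = []
        for role_id, audio in labeled_segments:
            text = asr_fn(audio)                        # 对语音数据进行语音识别得到会议纪要文本数据
            minutes.append({"role": role_id, "text": text})
        return minutes
可以理解的是,以上代码仅为示例性的,本申请实施例对此不做任何限定。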
通过本申请实施例提供的会议纪要的记录方法,获取位于会议室的语音采集设备采集的会议角色的语音数据帧所对应的声源角度数据,并基于声源角度数据,对会议角色进行身份识别,以获得会议角色的身份识别结果,再基于会议角色的身份识别结果记录会议角色的会议纪要,与现有的其它方式相比,基于位于会议室的语音采集设备采集的会议角色的语音数据帧所对应的声源角度数据,对会议角色进行身份识别,再基于会议角色的身份识别结果记录会议角色的会议纪要,能够实时地记录会议角色的会议纪要,从而有效提高会议角色的会议纪要的记录效率。
本实施例提供的会议纪要的记录方法可以由任意适当的具有数据处理能力的设备执行,包括但不限于:摄像头、终端、移动终端、PC机、服务器、车载设备、娱乐设备、广告设备、个人数码助理(PDA)、平板电脑、笔记本电脑、掌上游戏机、智能眼镜、智能手表、可穿戴设备、虚拟显示设备或显示增强设备等。
参照图6,示出了本申请实施例六的角色展示方法的步骤流程图。
具体地,本实施例提供的角色展示方法包括以下步骤:
在步骤S601中,获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据。
在本申请实施例中,所述获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据的具体实施方式与上述实施例一中获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据的具体实施方式类似,在此不再赘述。可以理解的是, 以上描述仅为示例性的,本申请实施例对此不做任何限定。
在步骤S602中,基于所述声源角度数据,对所述角色进行身份识别,以获得所述角色的身份识别结果。
在本申请实施例中,所述基于所述声源角度数据,对所述角色进行身份识别,以获得所述角色的身份识别结果的具体实施方式与上述实施例一中基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果的具体实施方式类似,在此不再赘述。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一些可选实施例中,在获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据之后,所述方法还包括:在所述声源角度数据指示的声源方向上开启所述语音采集设备的灯具。籍此,通过在声源角度数据指示的声源方向上开启语音采集设备的灯具,能够有效指示声源方向。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一个具体的例子中,所述语音采集设备的灯具以阵列的方式排布在所述语音采集设备的各个方位,从而能够有效指示声源方向。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
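假设灯具以均匀环形阵列排布,则可以按如下Python片段将声源角度数据映射到应点亮的灯具编号;灯具数量与安装方位均为示例性假设:
    def led_index_for_angle(angle_deg, num_leds=12):
        # angle_deg: 声源角度数据指示的方向(度); num_leds: 环形排布的灯具数量
        step = 360.0 / num_leds
        return int(round((angle_deg % 360) / step)) % num_leds

    # 例如: 声源角度为 95 度、共 12 个灯时, 点亮编号为 3 的灯具
    print(led_index_for_angle(95))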
在步骤S603中,基于所述角色的身份识别结果,在所述语音采集设备的交互界面上展示所述角色的身份数据。
在本申请实施例中,所述语音采集设备的交互界面可为语音采集设备的触控屏。所述角色的身份数据可为所述角色的人脸图像数据、身份标识数据等。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一些可选实施例中,所述方法还包括:在所述语音采集设备的交互界面上展示所述角色的说话动作图像或者语音波形图像。籍此,能够更加生动地展示角色说话时的形象。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
在一个具体的例子中,可以在所述语音采集设备的交互界面上动态展示所述角色的说话动作图像序列或者语音波形图像序列。可以理解的是,以上描述仅为示例性的,本申请实施例对此不做任何限定。
通过本申请实施例提供的角色展示方法,获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据,并基于声源角度数据,对角色进行身份识别,以获得角色的身份识别结果,再基于角色的身份识别结果,在语音采集设备的交互界面上展示角色的身份数据,与现有的其它方式相比,基于语音采集设备采集的角色的语音数据帧所对应的声源角度数据,对角色进行身份识别,再基于角色的身份识别结果,在语音采集设备的交互界面上展示角色的身份数据,能够实时地展示角色的身份数据,从而使得用户体验更加流畅。
本实施例提供的角色展示方法可以由任意适当的具有数据处理能力的设备执行,包括但不限于:摄像头、终端、移动终端、PC机、服务器、车载设备、娱乐设备、广告设备、个人数码助理(PDA)、平板电脑、笔记本电脑、掌上游戏机、智能眼镜、智能手表、可穿戴设备、虚拟显示设备或显示增强设备等。
参照图7,示出了本申请实施例七中角色分离装置的结构示意图。
本实施例提供的角色分离装置包括:第一获取模块701,用于获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据;第一身份识别模块702,用于基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果;分离模块703,用于基于所述待分离的角色的第一身份识别结果分离所述角色。
本实施例提供的角色分离装置用于实现前述多个方法实施例中相应的角色分离方法,并具有相应的方法实施例的有益效果,在此不再赘述。
参照图8,示出了本申请实施例八中角色分离装置的结构示意图。
本实施例提供的角色分离装置包括:第一获取模块801,用于获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据;第一身份识别模块805,用于基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果;分离模块808,用于基于所述待分离的角色的第一身份识别结果分离所述角色。
可选地,所述第一获取模块801之后,所述装置还包括:检测模块802,用于对所述待分离的角色的语音数据帧进行语音端点检测,以获得具有语音端点的语音数据帧;过滤平滑模块803,用于基于所述待分离的角色的语音数据帧的能量频谱,对所述具有语音端点的语音数据帧进行过滤平滑,以获得过滤平滑后的语音数据帧;更新模块804,用于基于所述过滤平滑后的语音数据帧,对所述声源角度数据进行更新,以获得更新后的声源角度数据。
可选地,所述过滤平滑模块803,具体用于:通过中值滤波器,基于所述待分离的角色的语音数据帧的能量频谱的谱平度,对所述具有语音端点的语音数据帧进行过滤平滑,以获得所述过滤平滑后的语音数据帧。
可选地,所述第一身份识别模块805,包括:聚类子模块8051,用于对所述声源角度数据进行顺序聚类,以获得所述声源角度数据的顺序聚类结果;确定子模块8052,用于确定所述声源角度数据的顺序聚类结果对应的角色身份标识为所述待分离的角色的第一身份识别结果。
可选地,所述聚类子模块8051,具体用于:确定所述声源角度数据与声源角度顺序聚类中心的距离;基于所述声源角度数据与所述声源角度顺序聚类中心的距离,确定所述声源角度数据的顺序聚类结果。
可选地,所述第一身份识别模块805之后,所述装置还包括:声纹识别模块806,用于对所述待分离的角色在预设时间段内的语音数据帧进行声纹识别,以获得所述待分离的角色的第二身份识别结果;第一更正模块807,用于若所述第一身份识别结果与所述第二身份识别结果不相同,则使用所述第二身份识别结果更正所述第一身份识别结果,以获得所述待分离的角色的最终身份识别结果。
可选地,所述语音采集设备包括麦克风阵列,所述第一获取模块801,具体用于:获取所述麦克风阵列中至少部分麦克风接收到的所述语音数据帧的协方差矩阵;对所述协方差矩阵进行特征值分解,以得到多个特征值;从所述多个特征值中选取第一数量个最大的特征值,并基于选取的特征值对应的特征向量构成语音信号子空间,其中,所述第一数量与声源估计数量相当;基于所述语音信号子空间,确定所述声源角度数据。
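上述"协方差矩阵—特征值分解—语音信号子空间"的处理步骤可以用如下Python草图示意(子空间类声源定位方法的常见做法);麦克风数量、声源估计数量均为示例性假设,实际的角度确定还需结合阵列的导向矢量进行搜索:
    import numpy as np

    def speech_signal_subspace(mic_frames, num_sources):
        # mic_frames: (麦克风数, 采样点数) 的多通道语音数据帧
        X = mic_frames - mic_frames.mean(axis=1, keepdims=True)
        R = X @ X.conj().T / X.shape[1]               # 协方差矩阵
        eigvals, eigvecs = np.linalg.eigh(R)          # 特征值分解(特征值升序排列)
        order = np.argsort(eigvals)[::-1]
        return eigvecs[:, order[:num_sources]]        # 取最大的 num_sources 个特征值对应的特征向量, 构成语音信号子空间

    # 随后可在候选角度上扫描导向矢量, 并与该子空间(或其正交的噪声子空间)做投影, 以确定声源角度数据
可以理解的是,以上代码仅为示例性的,本申请实施例对此不做任何限定。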
可选地,所述第一更正模块807之后,所述装置还包括:第二获取模块809,用于获取图像采集装置采集的所述待分离的角色的人脸图像数据;人脸识别模块810,用于对所述人脸图像数据进行人脸识别,以获得所述待分离的角色的第三身份识别结果;第二更正模块811,用于若所述第三身份识别结果与所述第二身份识别结果不相同,则使用所述第三身份识别结果更正所述第二身份识别结果,以获得所述待分离的角色的最终身份识别结果。
本实施例提供的角色分离装置用于实现前述多个方法实施例中相应的角色分离方法,并具有相应的方法实施例的有益效果,在此不再赘述。
参照图9,示出了本申请实施例九中角色分离装置的结构示意图。
本实施例提供的角色分离装置包括:第一发送模块901,用于向云端发送携带有待分离的角色的语音数据帧的角色分离请求,使得所述云端基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据,并基于所述声源角度数据,对所述待分离的角色进行身份识别,再基于所述待分离的角色的身份识别结果分离所述角色;第一接收模块902,用于接收所述云端基于所述角色分离请求发送的所述角色的分离结果。
本实施例提供的角色分离装置用于实现前述多个方法实施例中相应的角色分离方法,并具有相应的方法实施例的有益效果,在此不再赘述。
参照图10,示出了本申请实施例十中角色分离装置的结构示意图。
本实施例提供的角色分离装置包括:第二接收模块1001,用于接收语音采集设备发送的携带有待分离的角色的语音数据帧的角色分离请求;第三获取模块1002,用于基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据;第二身份识别模块1003,用于基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的身份识别结果;第二发送模块1004,用于基于所述待分离的角色的身份识别结果分离所述角色,并向所述语音采集设备发送针对所述角色分离请求的角色分离结果。
本实施例提供的角色分离装置用于实现前述多个方法实施例中相应的角色分离方法,并具有相应的方法实施例的有益效果,在此不再赘述。
参照图11,示出了本申请实施例十一中会议纪要的记录装置的结构示意图。
本实施例提供的会议纪要的记录装置包括:第四获取模块1101,用于获取位于会议室的语音采集设备采集的会议角色的语音数据帧所对应的声源角度数据;第三身份识别模块1102,用于基于所述声源角度数据,对所述会议角色进行身份识别,以获得所述会议角色的身份识别结果;记录模块1103,用于基于所述会议角色的身份识别结果记录所述会议角色的会议纪要。
本实施例提供的会议纪要的记录装置用于实现前述多个方法实施例中相应的会议纪要的记录方法,并具有相应的方法实施例的有益效果,在此不再赘述。
参照图12,示出了本申请实施例十二中角色展示装置的结构示意图。
本实施例提供的角色展示装置包括:第五获取模块1201,用于获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据;第四身份识别模块1203,用于基于所述声源角度数据,对所述角色进行身份识别,以获得所述角色的身份识别结果;第一展示模块1204,用于基于所述角色的身份识别结果,在所述语音采集设备的交互界面上展示所述角色的身份数据。
可选地,所述第五获取模块1201之后,所述装置还包括:开启模块1202,用于在所述声源角度数据指示的声源方向上开启所述语音采集设备的灯具。
可选地,所述装置还包括:第二展示模块1205,用于在所述语音采集设备的交互界面上展示所述角色的说话动作图像或者语音波形图像。
本实施例提供的角色展示装置用于实现前述多个方法实施例中相应的角色展示方法,并具有相应的方法实施例的有益效果,在此不再赘述。
参照图13,示出了根据本发明实施例十三的一种电子设备的结构示意图,本发明具体实施例并不对电子设备的具体实现做限定。
如图13所示,该电子设备可以包括:处理器(processor)1302、通信接口(Communications Interface)1304、存储器(memory)1306、以及通信总线1308。
其中:
处理器1302、通信接口1304、以及存储器1306通过通信总线1308完成相互间的通信。
通信接口1304,用于与其它电子设备或服务器进行通信。
处理器1302,用于执行程序1310,具体可以执行上述角色分离方法实施例中的相关步骤。
具体地,程序1310可以包括程序代码,该程序代码包括计算机操作指令。
处理器1302可能是中央处理器CPU,或者是特定集成电路ASIC(Application Specific  Integrated Circuit),或者是被配置成实施本发明实施例的一个或多个集成电路。智能设备包括的一个或多个处理器,可以是同一类型的处理器,如一个或多个CPU;也可以是不同类型的处理器,如一个或多个CPU以及一个或多个ASIC。
存储器1306,用于存放程序1310。存储器1306可能包含高速RAM存储器,也可能还包括非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。
程序1310具体可以用于使得处理器1302执行以下操作:获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据;基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果;基于所述待分离的角色的第一身份识别结果分离所述角色。
在一种可选的实施方式中,程序1310还用于使得处理器1302在获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据之后,对所述待分离的角色的语音数据帧进行语音端点检测,以获得具有语音端点的语音数据帧;基于所述待分离的角色的语音数据帧的能量频谱,对所述具有语音端点的语音数据帧进行过滤平滑,以获得过滤平滑后的语音数据帧;基于所述过滤平滑后的语音数据帧,对所述声源角度数据进行更新,以获得更新后的声源角度数据。
在一种可选的实施方式中,程序1310还用于使得处理器1302在基于所述待分离的角色的语音数据帧的能量频谱,对所述具有语音端点的语音数据帧进行过滤平滑,以获得过滤平滑后的语音数据帧时,通过中值滤波器,基于所述待分离的角色的语音数据帧的能量频谱的谱平度,对所述具有语音端点的语音数据帧进行过滤平滑,以获得所述过滤平滑后的语音数据帧。
在一种可选的实施方式中,程序1310还用于使得处理器1302在基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果时,对所述声源角度数据进行顺序聚类,以获得所述声源角度数据的顺序聚类结果;确定所述声源角度数据的顺序聚类结果对应的角色身份标识为所述待分离的角色的第一身份识别结果。
在一种可选的实施方式中,程序1310还用于使得处理器1302在对所述声源角度数据进行顺序聚类,以获得所述声源角度数据的顺序聚类结果时,确定所述声源角度数据与声源角度顺序聚类中心的距离;基于所述声源角度数据与所述声源角度顺序聚类中心的距离,确定所述声源角度数据的顺序聚类结果。
在一种可选的实施方式中,程序1310还用于使得处理器1302在获得所述待分离的角色的第一身份识别结果之后,对所述待分离的角色在预设时间段内的语音数据帧进行声纹识别,以获得所述待分离的角色的第二身份识别结果;若所述第一身份识别结果与所述第二身份识别结果不相同,则使用所述第二身份识别结果更正所述第一身份识别结果,以获得所述待分离的角色的最终身份识别结果。
在一种可选的实施方式中,所述语音采集设备包括麦克风阵列,程序1310还用于使得处理器1302在获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据时,获取所述麦克风阵列中至少部分麦克风接收到的所述语音数据帧的协方差矩阵;对所述协方差矩阵进行特征值分解,以得到多个特征值;从所述多个特征值中选取第一数量个最大的特征值,并基于选取的特征值对应的特征向量构成语音信号子空间,其中,所述第一数量与声源估计数量相当;基于所述语音信号子空间,确定所述声源角度数据。
在一种可选的实施方式中,程序1310还用于使得处理器1302在获得所述待分离的角色的最终身份识别结果之后,获取图像采集装置采集的所述待分离的角色的人脸图像数据;对所述人脸图像数据进行人脸识别,以获得所述待分离的角色的第三身份识别结果;若所述第三身份识别结果与所述第二身份识别结果不相同,则使用所述第三身份识别结果更正所述第二身份识别结果,以获得所述待分离的角色的最终身份识别结果。
程序1310中各步骤的具体实现可以参见上述角色分离方法实施例中的相应步骤和单元中对应的描述,在此不赘述。所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的设备和模块的具体工作过程,可以参考前述方法实施例中的对应过程描述,在此不再赘述。
通过本实施例的电子设备,获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据,并基于声源角度数据,对待分离的角色进行身份识别,以获得待分离的角色的第一身份识别结果;再基于待分离的角色的第一身份识别结果分离待分离的角色,与现有的其它方式相比,基于语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据,对待分离的角色进行身份识别,再基于待分离的角色的身份识别结果分离角色,能够实时地分离角色,进而使得用户体验更流畅。
程序1310具体可以用于使得处理器1302执行以下操作:向云端发送携带有待分离的角色的语音数据帧的角色分离请求,使得所述云端基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据,并基于所述声源角度数据,对所述待分离的角色进行身份识别,再基于所述待分离的角色的身份识别结果分离所述角色;接收所述云端基于所述角色分离请求发送的所述角色的分离结果。
程序1310中各步骤的具体实现可以参见上述角色分离方法实施例中的相应步骤和单元中对应的描述,在此不赘述。所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的设备和模块的具体工作过程,可以参考前述方法实施例中的对应过程描述,在此不再赘述。
通过本实施例的电子设备,语音采集设备向云端发送携带有待分离的角色的语音数据帧的角色分离请求,云端基于角色分离请求,获取语音数据帧所对应的声源角度数据,并基于声源角度数据,对待分离的角色进行身份识别,再基于待分离的角色的身份识别 结果分离角色,语音采集设备接收云端基于角色分离请求发送的角色的分离结果,与现有的其它方式相比,基于角色分离请求携带的待分离的角色的语音数据帧所对应的声源角度数据,对待分离的角色进行身份识别,再基于待分离的角色的身份识别结果分离角色,能够实时地分离角色,进而使得用户体验更流畅。
程序1310具体可以用于使得处理器1302执行以下操作:接收语音采集设备发送的携带有待分离的角色的语音数据帧的角色分离请求;基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据;基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的身份识别结果;基于所述待分离的角色的身份识别结果分离所述角色,并向所述语音采集设备发送针对所述角色分离请求的角色分离结果。
程序1310中各步骤的具体实现可以参见上述角色分离方法实施例中的相应步骤和单元中对应的描述,在此不赘述。所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的设备和模块的具体工作过程,可以参考前述方法实施例中的对应过程描述,在此不再赘述。
通过本实施例的电子设备,云端接收语音采集设备发送的携带有待分离的角色的语音数据帧的角色分离请求,并基于角色分离请求,获取语音数据帧所对应的声源角度数据,再基于声源角度数据,对待分离的角色进行身份识别,以获得待分离的角色的身份识别结果,再基于待分离的角色的身份识别结果分离角色,并向语音采集设备发送针对角色分离请求的角色分离结果,与现有的其它方式相比,基于角色分离请求携带的待分离的角色的语音数据帧所对应的声源角度数据,对待分离的角色进行身份识别,再基于待分离的角色的身份识别结果分离角色,能够实时地分离角色,进而使得用户体验更流畅。
程序1310具体可以用于使得处理器1302执行以下操作:获取位于会议室的语音采集设备采集的会议角色的语音数据帧所对应的声源角度数据;基于所述声源角度数据,对所述会议角色进行身份识别,以获得所述会议角色的身份识别结果;基于所述会议角色的身份识别结果记录所述会议角色的会议纪要。
程序1310中各步骤的具体实现可以参见上述会议纪要的记录方法实施例中的相应步骤和单元中对应的描述,在此不赘述。所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的设备和模块的具体工作过程,可以参考前述方法实施例中的对应过程描述,在此不再赘述。
通过本实施例的电子设备,获取位于会议室的语音采集设备采集的会议角色的语音数据帧所对应的声源角度数据,并基于声源角度数据,对会议角色进行身份识别,以获得会议角色的身份识别结果,再基于会议角色的身份识别结果记录会议角色的会议纪要,与现有的其它方式相比,基于位于会议室的语音采集设备采集的会议角色的语音数据帧 所对应的声源角度数据,对会议角色进行身份识别,再基于会议角色的身份识别结果记录会议角色的会议纪要,能够实时地记录会议角色的会议纪要,从而有效提高会议角色的会议纪要的记录效率。
程序1310具体可以用于使得处理器1302执行以下操作:获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据;基于所述声源角度数据,对所述角色进行身份识别,以获得所述角色的身份识别结果;基于所述角色的身份识别结果,在所述语音采集设备的交互界面上展示所述角色的身份数据。
在一种可选的实施方式中,程序1310还用于使得处理器1302在获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据之后,在所述声源角度数据指示的声源方向上开启所述语音采集设备的灯具。
在一种可选的实施方式中,程序1310还用于使得处理器1302在所述语音采集设备的交互界面上展示所述角色的说话动作图像或者语音波形图像。
程序1310中各步骤的具体实现可以参见上述角色展示方法实施例中的相应步骤和单元中对应的描述,在此不赘述。所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的设备和模块的具体工作过程,可以参考前述方法实施例中的对应过程描述,在此不再赘述。
通过本实施例的电子设备,获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据,并基于声源角度数据,对角色进行身份识别,以获得角色的身份识别结果,再基于角色的身份识别结果,在语音采集设备的交互界面上展示角色的身份数据,与现有的其它方式相比,基于语音采集设备采集的角色的语音数据帧所对应的声源角度数据,对角色进行身份识别,再基于角色的身份识别结果,在语音采集设备的交互界面上展示角色的身份数据,能够实时地展示角色的身份数据,从而使得用户体验更加流畅。
需要指出,根据实施的需要,可将本发明实施例中描述的各个部件/步骤拆分为更多部件/步骤,也可将两个或多个部件/步骤或者部件/步骤的部分操作组合成新的部件/步骤,以实现本发明实施例的目的。
上述根据本发明实施例的方法可在硬件、固件中实现,或者被实现为可存储在记录介质(诸如CD ROM、RAM、软盘、硬盘或磁光盘)中的软件或计算机代码,或者被实现为通过网络下载的、原始存储在远程记录介质或非暂时机器可读介质中并将被存储在本地记录介质中的计算机代码,从而在此描述的方法可通过存储在记录介质上的这样的软件,由通用计算机、专用处理器或者可编程或专用硬件(诸如ASIC或FPGA)来处理。可以理解,计算机、处理器、微处理器控制器或可编程硬件包括可存储或接收软件或计算机代码的存储组件(例如,RAM、ROM、闪存等),当所述软件或计算机代码被计算机、处理器或硬件访问且执行时,实现在此描述的角色分离方法、会议纪要的记录方法,或者角色展示方法。此外,当通用计算机访问用于实现在此示出的角色分离方法、会议纪要的记录方法,或者角色展示方法的代码时,代码的执行将通用计算机转换为用于执行在此示出的角色分离方法、会议纪要的记录方法,或者角色展示方法的专用计算机。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及方法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明实施例的范围。
以上实施方式仅用于说明本发明实施例,而并非对本发明实施例的限制,有关技术领域的普通技术人员,在不脱离本发明实施例的精神和范围的情况下,还可以做出各种变化和变型,因此所有等同的技术方案也属于本发明实施例的范畴,本发明实施例的专利保护范围应由权利要求限定。

Claims (21)

  1. 一种角色分离方法,所述方法包括:
    获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据;
    基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果;
    基于所述待分离的角色的第一身份识别结果分离所述角色。
  2. 根据权利要求1所述的方法,其中,所述获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据之后,所述方法还包括:
    对所述待分离的角色的语音数据帧进行语音端点检测,以获得具有语音端点的语音数据帧;
    基于所述待分离的角色的语音数据帧的能量频谱,对所述具有语音端点的语音数据帧进行过滤平滑,以获得过滤平滑后的语音数据帧;
    基于所述过滤平滑后的语音数据帧,对所述声源角度数据进行更新,以获得更新后的声源角度数据。
  3. 根据权利要求2所述的方法,其中,所述基于所述待分离的角色的语音数据帧的能量频谱,对所述具有语音端点的语音数据帧进行过滤平滑,以获得过滤平滑后的语音数据帧,包括:
    通过中值滤波器,基于所述待分离的角色的语音数据帧的能量频谱的谱平度,对所述具有语音端点的语音数据帧进行过滤平滑,以获得所述过滤平滑后的语音数据帧。
  4. 根据权利要求1所述的方法,其中,所述基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果,包括:
    对所述声源角度数据进行顺序聚类,以获得所述声源角度数据的顺序聚类结果;
    确定所述声源角度数据的顺序聚类结果对应的角色身份标识为所述待分离的角色的第一身份识别结果。
  5. 根据权利要求4所述的方法,其中,所述对所述声源角度数据进行顺序聚类,以获得所述声源角度数据的顺序聚类结果,包括:
    确定所述声源角度数据与声源角度顺序聚类中心的距离;
    基于所述声源角度数据与所述声源角度顺序聚类中心的距离,确定所述声源角度数据的顺序聚类结果。
  6. 根据权利要求1所述的方法,其中,所述获得所述待分离的角色的第一身份识别结果之后,所述方法还包括:
    对所述待分离的角色在预设时间段内的语音数据帧进行声纹识别,以获得所述待分离的角色的第二身份识别结果;
    若所述第一身份识别结果与所述第二身份识别结果不相同,则使用所述第二身份识别结果更正所述第一身份识别结果,以获得所述待分离的角色的最终身份识别结果。
  7. 根据权利要求1所述的方法,其中,所述语音采集设备包括麦克风阵列,所述获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据,包括:
    获取所述麦克风阵列中至少部分麦克风接收到的所述语音数据帧的协方差矩阵;
    对所述协方差矩阵进行特征值分解,以得到多个特征值;
    从所述多个特征值中选取第一数量个最大的特征值,并基于选取的特征值对应的特征向量构成语音信号子空间,其中,所述第一数量与声源估计数量相当;
    基于所述语音信号子空间,确定所述声源角度数据。
  8. 根据权利要求6所述的方法,其中,所述获得所述待分离的角色的最终身份识别结果之后,所述方法还包括:
    获取图像采集装置采集的所述待分离的角色的人脸图像数据;
    对所述人脸图像数据进行人脸识别,以获得所述待分离的角色的第三身份识别结果;
    若所述第三身份识别结果与所述第二身份识别结果不相同,则使用所述第三身份识别结果更正所述第二身份识别结果,以获得所述待分离的角色的最终身份识别结果。
  9. 一种角色分离方法,所述方法包括:
    向云端发送携带有待分离的角色的语音数据帧的角色分离请求,使得所述云端基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据,并基于所述声源角度数据,对所述待分离的角色进行身份识别,再基于所述待分离的角色的身份识别结果分离所述角色;
    接收所述云端基于所述角色分离请求发送的所述角色的分离结果。
  10. 一种角色分离方法,所述方法包括:
    接收语音采集设备发送的携带有待分离的角色的语音数据帧的角色分离请求;
    基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据;
    基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的身份识别结果;
    基于所述待分离的角色的身份识别结果分离所述角色,并向所述语音采集设备发送针对所述角色分离请求的角色分离结果。
  11. 一种会议纪要的记录方法,所述方法包括:
    获取位于会议室的语音采集设备采集的会议角色的语音数据帧所对应的声源角度数据;
    基于所述声源角度数据,对所述会议角色进行身份识别,以获得所述会议角色的身份识别结果;
    基于所述会议角色的身份识别结果记录所述会议角色的会议纪要。
  12. 一种角色展示方法,所述方法包括:
    获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据;
    基于所述声源角度数据,对所述角色进行身份识别,以获得所述角色的身份识别结果;
    基于所述角色的身份识别结果,在所述语音采集设备的交互界面上展示所述角色的身份数据。
  13. 根据权利要求12所述的方法,其中,所述获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据之后,所述方法还包括:
    在所述声源角度数据指示的声源方向上开启所述语音采集设备的灯具。
  14. 根据权利要求12所述的方法,其中,所述方法还包括:
    在所述语音采集设备的交互界面上展示所述角色的说话动作图像或者语音波形图像。
  15. 一种角色分离装置,所述装置包括:
    第一获取模块,用于获取语音采集设备采集的待分离的角色的语音数据帧所对应的声源角度数据;
    第一身份识别模块,用于基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的第一身份识别结果;
    分离模块,用于基于所述待分离的角色的第一身份识别结果分离所述待分离的角色。
  16. 一种角色分离装置,所述装置包括:
    第一发送模块,用于向云端发送携带有待分离的角色的语音数据帧的角色分离请求,使得所述云端基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据,并基于所述声源角度数据,对所述待分离的角色进行身份识别,再基于所述待分离的角色的身份识别结果分离所述角色;
    第一接收模块,用于接收所述云端基于所述角色分离请求发送的所述角色的分离结果。
  17. 一种角色分离装置,所述装置包括:
    第二接收模块,用于接收语音采集设备发送的携带有待分离的角色的语音数据帧的角色分离请求;
    第三获取模块,用于基于所述角色分离请求,获取所述语音数据帧所对应的声源角度数据;
    第二身份识别模块,用于基于所述声源角度数据,对所述待分离的角色进行身份识别,以获得所述待分离的角色的身份识别结果;
    第二发送模块,用于基于所述待分离的角色的身份识别结果分离所述角色,并向所述语音采集设备发送针对所述角色分离请求的角色分离结果。
  18. 一种会议纪要的记录装置,所述装置包括:
    第四获取模块,用于获取位于会议室的语音采集设备采集的会议角色的语音数据帧所对应的声源角度数据;
    第三身份识别模块,用于基于所述声源角度数据,对所述会议角色进行身份识别,以获得所述会议角色的身份识别结果;
    记录模块,用于基于所述会议角色的身份识别结果记录所述会议角色的会议纪要。
  19. 一种角色展示装置,所述装置包括:
    第五获取模块,用于获取语音采集设备采集的角色的语音数据帧所对应的声源角度数据;
    第四身份识别模块,用于基于所述声源角度数据,对所述角色进行身份识别,以获得所述角色的身份识别结果;
    第一展示模块,用于基于所述角色的身份识别结果,在所述语音采集设备的交互界面上展示所述角色的身份数据。
  20. 一种电子设备,包括:处理器、存储器、通信接口和通信总线,所述处理器、所述存储器和所述通信接口通过所述通信总线完成相互间的通信;
    所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行如权利要求1-8中任意一项权利要求所述的角色分离方法对应的操作,或者执行如权利要求9所述的角色分离方法对应的操作,或者执行如权利要求10所述的角色分离方法对应的操作,或者执行如权利要求11所述的会议纪要的记录方法对应的操作,或者执行如权利要求12-14中任意一项权利要求所述的角色展示方法对应的操作。
  21. 一种计算机存储介质,其上存储有计算机程序,该程序被处理器执行时实现如权利要求1-8中任意一项权利要求所述的角色分离方法,或者实现如权利要求9所述的角色分离方法,或者实现如权利要求10所述的角色分离方法,或者实现如权利要求11所述的会议纪要的记录方法,或者实现如权利要求12-14中任意一项权利要求所述的角色展示方法。
PCT/CN2021/101956 2020-06-28 2021-06-24 角色分离方法、会议纪要的记录方法、角色展示方法、装置、电子设备及计算机存储介质 WO2022001801A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/090,296 US20230162757A1 (en) 2020-06-28 2022-12-28 Role separation method, meeting summary recording method, role display method and apparatus, electronic device, and computer storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010596049.3 2020-06-28
CN202010596049.3A CN113849793A (zh) 2020-06-28 2020-06-28 角色分离方法、会议纪要的记录方法、角色展示方法、装置、电子设备及计算机存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/090,296 Continuation US20230162757A1 (en) 2020-06-28 2022-12-28 Role separation method, meeting summary recording method, role display method and apparatus, electronic device, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2022001801A1 true WO2022001801A1 (zh) 2022-01-06

Family

ID=78972419

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101956 WO2022001801A1 (zh) 2020-06-28 2021-06-24 角色分离方法、会议纪要的记录方法、角色展示方法、装置、电子设备及计算机存储介质

Country Status (3)

Country Link
US (1) US20230162757A1 (zh)
CN (1) CN113849793A (zh)
WO (1) WO2022001801A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550728B (zh) * 2022-02-15 2024-03-01 北京有竹居网络技术有限公司 用于标记说话人的方法、装置和电子设备
CN114822511A (zh) * 2022-06-29 2022-07-29 阿里巴巴达摩院(杭州)科技有限公司 语音检测方法、电子设备及计算机存储介质
CN115356682B (zh) * 2022-08-21 2024-09-13 嘉晨云控新能源(上海)有限公司 一种基于精确定位的声源位置感知装置及方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150082404A1 (en) * 2013-08-31 2015-03-19 Steven Goldstein Methods and systems for voice authentication service leveraging networking
CN110062200A (zh) * 2018-01-19 2019-07-26 浙江宇视科技有限公司 视频监控方法、装置、网络摄像机及存储介质
CN111260313A (zh) * 2020-01-09 2020-06-09 苏州科达科技股份有限公司 发言者的识别方法、会议纪要生成方法、装置及电子设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111199741A (zh) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 声纹识别方法、声纹验证方法、装置、计算设备及介质
CN110298252A (zh) * 2019-05-30 2019-10-01 平安科技(深圳)有限公司 会议纪要生成方法、装置、计算机设备及存储介质
CN111048095A (zh) * 2019-12-24 2020-04-21 苏州思必驰信息科技有限公司 一种语音转写方法、设备及计算机可读存储介质

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150082404A1 (en) * 2013-08-31 2015-03-19 Steven Goldstein Methods and systems for voice authentication service leveraging networking
CN110062200A (zh) * 2018-01-19 2019-07-26 浙江宇视科技有限公司 视频监控方法、装置、网络摄像机及存储介质
CN111260313A (zh) * 2020-01-09 2020-06-09 苏州科达科技股份有限公司 发言者的识别方法、会议纪要生成方法、装置及电子设备

Also Published As

Publication number Publication date
CN113849793A (zh) 2021-12-28
US20230162757A1 (en) 2023-05-25

Similar Documents

Publication Publication Date Title
WO2022001801A1 (zh) 角色分离方法、会议纪要的记录方法、角色展示方法、装置、电子设备及计算机存储介质
US11031002B2 (en) Recognizing speech in the presence of additional audio
US10847171B2 (en) Method for microphone selection and multi-talker segmentation with ambient automated speech recognition (ASR)
CN110289003B (zh) 一种声纹识别的方法、模型训练的方法以及服务器
CN110310623B (zh) 样本生成方法、模型训练方法、装置、介质及电子设备
Sizov et al. Unifying probabilistic linear discriminant analysis variants in biometric authentication
US9626970B2 (en) Speaker identification using spatial information
JP2021500616A (ja) オブジェクト識別の方法及びその、コンピュータ装置並びにコンピュータ装置可読記憶媒体
WO2018149077A1 (zh) 声纹识别方法、装置、存储介质和后台服务器
WO2020155584A1 (zh) 声纹特征的融合方法及装置,语音识别方法,系统及存储介质
WO2016150257A1 (en) Speech summarization program
WO2020024708A1 (zh) 一种支付处理方法和装置
CN109346088A (zh) 身份识别方法、装置、介质及电子设备
CN110634472B (zh) 一种语音识别方法、服务器及计算机可读存储介质
WO2020098523A1 (zh) 一种语音识别方法、装置及计算设备
Chakroun et al. Robust features for text-independent speaker recognition with short utterances
Chin et al. Speaker identification using discriminative features and sparse representation
CN114120984A (zh) 语音交互方法、电子设备和存储介质
CN111785302B (zh) 说话人分离方法、装置及电子设备
Paleček et al. Audio-visual speech recognition in noisy audio environments
Isyanto et al. Voice biometrics for Indonesian language users using algorithm of deep learning CNN residual and hybrid of DWT-MFCC extraction features
Chakroun et al. Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments
Zhang et al. Depthwise separable convolutions for short utterance speaker identification
Nirjon et al. sMFCC: exploiting sparseness in speech for fast acoustic feature extraction on mobile devices--a feasibility study
JP5091202B2 (ja) サンプルを用いずあらゆる言語を識別可能な識別方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21831555

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21831555

Country of ref document: EP

Kind code of ref document: A1