WO2022166220A1 - Voice analysis method and voice recording device - Google Patents

Voice analysis method and voice recording device

Info

Publication number
WO2022166220A1
WO2022166220A1 PCT/CN2021/120416 CN2021120416W WO2022166220A1 WO 2022166220 A1 WO2022166220 A1 WO 2022166220A1 CN 2021120416 W CN2021120416 W CN 2021120416W WO 2022166220 A1 WO2022166220 A1 WO 2022166220A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
voice
verification model
verification
model
Prior art date
Application number
PCT/CN2021/120416
Other languages
English (en)
French (fr)
Inventor
陈文明
陈新磊
张洁
张世明
Original Assignee
深圳壹秘科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹秘科技有限公司 filed Critical 深圳壹秘科技有限公司
Publication of WO2022166220A1 publication Critical patent/WO2022166220A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present invention relates to the technical field of audio, and in particular, to the technical field of voice discrimination and verification.
  • A good intelligent conference recording system should be able to record and recognize the voices of all the speakers who appear in a conference, that is, to record and identify who spoke and what was said. It involves speech separation, speech recognition, speaker recognition and many other cutting-edge technologies. However, because these technologies still face many academic and engineering problems, existing conference recording systems, although equipped with some intelligent algorithms and models, often need additional restrictions attached, so the so-called "intelligence" does not feel very "smart" in terms of user experience. Moreover, these traditional intelligent algorithms tend to follow a "once and for all" idea: train a model with strong generalization performance on massive data, so that this single "universal" model fits all the speakers encountered in practical applications.
  • the present application provides a voice analysis method and a voice recording device that can improve the recognition and differentiation of sound sources.
  • A voice analysis method, comprising: acquiring first voice data, wherein the first voice data includes first voice information and a marked sound source corresponding to the first voice information; if no verification model corresponding to the marked sound source is stored, adapting the first voice information with a pre-stored basic verification model, and saving the adapted model parameter set as the verification model corresponding to the marked sound source; if a verification model corresponding to the marked sound source is stored, using the verification model to judge whether the first voice information corresponds to the marked sound source, and optimizing the verification model; and, when it is determined that the verification accuracy of the verification model exceeds a preset threshold, using the verification model to determine the sound source corresponding to second voice information contained in second voice data; wherein the verification accuracy refers to the accuracy with which the verification model judges whether the first voice information corresponds to the marked sound source.
  • A voice recording device, comprising: an acquiring unit for acquiring first voice data, wherein the first voice data includes first voice information and a marked sound source corresponding to the first voice information; a learning unit for, if no verification model corresponding to the marked sound source is stored, adapting the first voice information with a pre-stored basic verification model and saving the adapted model parameter set as the verification model corresponding to the marked sound source, and, if a verification model corresponding to the marked sound source is stored, using the verification model to judge whether the first voice information corresponds to the marked sound source and optimizing the verification model; and a use unit for, when it is determined that the verification accuracy of the verification model exceeds a preset threshold, using the verification model to determine that second voice information contained in second voice data corresponds to the marked sound source; wherein the verification accuracy refers to the accuracy with which the verification model judges whether the first voice information corresponds to the marked sound source.
  • The beneficial effect of the present application is that a separate verification model is generated for each sound source during use, and each verification model keeps learning by itself as it is used, so that it is further optimized. Once the verification accuracy reaches the preset value, the verification model can be used directly to verify the sound source. Performing sound source verification with a model specific to each sound source gives higher accuracy, is almost unrestricted by other conditions, and is more flexible in use.
  • FIG. 1 is a flowchart of a speech analysis method according to Embodiment 1 of the present application.
  • FIG. 2 is a schematic diagram of a multi-channel time-domain speech separation model in Embodiment 1 of the present application.
  • FIG. 3 is a schematic diagram of a comparison between a traditional generalization model and a meta-trained verification model in Embodiment 1 of the present application.
  • FIG. 4 is a schematic diagram of an adaptation process of voice information to a basic verification model in Embodiment 1 of the present application.
  • FIG. 5 is a schematic diagram of the learning phase during use in Embodiment 1 of the present application.
  • FIG. 6( a ) is a schematic flowchart of the voice analysis method provided in Embodiment 1 of the present application when the verification accuracy rate is lower than a preset threshold.
  • FIG. 6(b) is a schematic flowchart of the voice analysis method provided in Embodiment 1 of the present application when the verification accuracy is greater than or equal to a preset threshold.
  • FIG. 7 is a schematic block diagram of a voice recording apparatus according to Embodiment 2 of the present application.
  • FIG. 8 is a schematic structural diagram of a voice recording apparatus according to Embodiment 3 of the present application.
  • the embodiments of the present application can be applied to various speech recording apparatuses with speech analysis functions.
  • For example: a voice recorder, an audio conference terminal, an intelligent conference recording device, or an intelligent electronic device with a recording function, etc.
  • the technical solutions of the present application will be described below through specific embodiments.
  • FIG. 1 illustrates a speech analysis method provided in Embodiment 1 of the present application.
  • the analysis includes but is not limited to sound source verification and sound source differentiation.
  • the sound source may refer to a speaker, different sound sources refer to different speakers, and the same sound source refers to the same speaker.
  • Sound source distinction refers to the result of judging sound sources, that is, distinguishing audio information from different sound sources. The sound source distinction does not need to obtain the complete speech emitted by the sound source, but only needs to obtain a part of it, such as a sentence, or even a word or fragment in a sentence.
  • Optionally, the sound source distinction in this application gives, with low latency, the judgment result for the immediately preceding moment while the sound source is still producing sound (e.g., while the speaker is speaking).
  • Sound source verification refers to judging whether the voice information actually corresponds to the marked sound source, or in other words, judging whether the voice information belongs to the marked sound source.
  • the speech analysis method 100 includes:
  • S110: acquire first voice data, where the first voice data includes first voice information and a marked sound source corresponding to the first voice information. The marked sound source is the sound source of the first voice information determined in a first manner, the first manner being a sound source analysis manner (including sound source distinction and sound source verification) other than the second manner disclosed in this application. For example, the marked sound source may be determined from the angle information of the sound source, or entered directly by the user through a software interface or on the terminal device. The angle information may be the angle of each sound source obtained with DOA (Direction of Arrival) technology of the microphone array on the voice recording device, or obtained with a directional microphone on the voice recording device. Alternatively, an angle discrimination method is used to obtain the marked sound source: the multi-channel voice data is first separated by a neural network, the separated voice data includes the separated voice information and the corresponding angle information, and then the sound sources (i.e., the speakers) are distinguished by an angle division algorithm together with voiceprint features, and the sound source indicated by the region within a certain range of each angle is marked. A minimal angle-marking sketch follows.
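  • As an illustration only (the patent does not prescribe an API), the following Python sketch marks an utterance with a sound source from its estimated DOA angle: the utterance is labelled with the registered source whose angle lies within a fixed tolerance, and otherwise treated as a new source. All names and the tolerance value are hypothetical.

```python
ANGLE_TOLERANCE_DEG = 15.0  # assumed tolerance; not specified in the application

def mark_sound_source(doa_deg, registered_sources):
    """Return the label of the registered source closest to doa_deg,
    or None if no registered source lies within ANGLE_TOLERANCE_DEG."""
    best_label, best_diff = None, ANGLE_TOLERANCE_DEG
    for label, angle in registered_sources.items():
        # angular difference with wrap-around on [0, 360)
        diff = abs((doa_deg - angle + 180.0) % 360.0 - 180.0)
        if diff <= best_diff:
            best_label, best_diff = label, diff
    return best_label

# Example: two speakers registered at 30 deg and 200 deg
sources = {"speaker_1": 30.0, "speaker_2": 200.0}
print(mark_sound_source(38.0, sources))   # -> "speaker_1"
print(mark_sound_source(120.0, sources))  # -> None (treated as a new source)
```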
  • S140: when it is determined that the verification accuracy of the verification model exceeds a preset threshold, use the verification model to determine the sound source corresponding to second voice information contained in second voice data; the verification accuracy refers to the accuracy with which the verification model judges whether the first voice information corresponds to the marked sound source.
  • both the first voice information and the second voice information belong to the voice information of the sound source.
  • Optionally, the second voice data also contains a marked sound source. Thus, while the method is in use, the second voice information in the second voice data and its marked sound source can still be used to adapt the verification model corresponding to that sound source, further optimizing the verification model and raising its verification accuracy.
  • the acquiring first voice data includes:
  • S111: collect multi-channel voice data. Optionally, the multi-channel voice data is collected with a microphone array; besides the multi-channel voice data, the microphone array also captures the angle information of each sound source, and this spatial information gives a preliminary assignment of the different pieces of voice information in the multi-channel data to their sound sources (i.e., speakers).
  • S112: perform voice separation on the multi-channel voice data to obtain the separated first voice data. The method can then use the first voice data as initial data for "self-learning".
  • the multi-channel voice data may contain overlapping voices, so the multi-channel voice data needs to be sent to a voice separation module for voice separation.
  • the speech separation module is composed of a neural network, and its structure includes an encoder and a decoder composed of several convolutional neural networks (CNN, Convolutional Neural Networks).
  • performing voice separation on the multi-channel voice data includes: using a time-domain signal separation manner to separate the multi-channel voice data.
  • FIG. 2 is a schematic diagram of separating multi-channel voice data by using a time-domain signal separation method.
  • the multi-channel voice data includes angle information of each sound source. Taking two speakers as an example, speaker 1 and speaker 2, that is, different sound sources, speak at different spatial positions, and the time series signals are collected by the microphone array to obtain multi-channel mixed speech data.
  • The mixed speech data is fed into an encoder group (Encoders) composed of convolutional neural networks and transformed into multi-dimensional representations (Representations); the decoder group (Decoders) then outputs, for each sound source, an estimated speech signal together with the corresponding estimated angle information.
  • The clean speech signal of each sound source in the multi-channel mixed speech data and its accurate angle information are then introduced as labels, a loss function is computed and optimized, and training is completed.
  • The speech signal separation approach disclosed in this application does not require a feature extraction step such as the Fourier transform; it trains directly on the time-domain speech signal. Compared with the traditional deep-learning speech separation approach, which converts the time-domain waveform to the frequency domain by Fourier transform and then learns spectral features of the speech information, this reduces latency. A minimal model sketch is given below.
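  • As a rough illustration of such an encoder/decoder structure (a sketch under assumed shapes and layer sizes, not the architecture claimed in this application), the following PyTorch snippet maps a raw multi-channel waveform to a latent representation with 1-D convolutions and reconstructs one estimated waveform per source, with no Fourier transform involved:

```python
import torch
import torch.nn as nn

class TimeDomainSeparator(nn.Module):
    """Toy multi-channel, time-domain separator: conv encoders + per-source decoders."""
    def __init__(self, n_mics=4, n_sources=2, channels=128, kernel=16, stride=8):
        super().__init__()
        # Encoder group: raw multi-channel waveform -> multi-dimensional representation
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mics, channels, kernel, stride=stride, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1),
            nn.ReLU(),
        )
        # Decoder group: one transposed-convolution branch per sound source
        self.decoders = nn.ModuleList([
            nn.ConvTranspose1d(channels, 1, kernel, stride=stride, padding=kernel // 2)
            for _ in range(n_sources)
        ])
        # Simple head that also regresses one coarse angle estimate per source
        self.angle_head = nn.Linear(channels, n_sources)

    def forward(self, mixture):                      # mixture: (batch, n_mics, samples)
        rep = self.encoder(mixture)                  # (batch, channels, frames)
        waves = [dec(rep) for dec in self.decoders]  # per-source estimated waveforms
        angles = self.angle_head(rep.mean(dim=-1))   # coarse per-source angle estimates
        return torch.stack(waves, dim=1), angles

# Training would compare the estimated waveforms and angles against the clean
# per-source signals and their true angles (e.g. with an MSE loss).
model = TimeDomainSeparator()
mix = torch.randn(1, 4, 16000)                       # one second of 4-channel audio at 16 kHz
est_waves, est_angles = model(mix)
print(est_waves.shape, est_angles.shape)             # (1, 2, 1, 16000), (1, 2)
```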
  • the traditional deep learning speech separation method usually performs time-frequency analysis on the speech signal, that is, the time-domain waveform of the speech signal is converted to the frequency domain by Fourier transform, and then the spectral features are learned.
  • Taking monaural, supervised two-speaker separation as an example: the mixed speech (Mixture Speech) passes through a feature extraction (Feature Extraction) module to obtain the spectrum of the single-channel mixed signal, which is fed into a neural network. During training, the neural network estimates a mask associated with each of the two speakers; the element-wise product of the mixed signal's spectral features with each of the two estimated masks then yields the speech signal estimated for each speaker, and the loss against the two original clean signals (usually a mean square error) is optimized. Because feature extraction and waveform reconstruction require a Fourier transform and an inverse Fourier transform, this traditional approach incurs extra computation latency, which may not meet the low-latency requirement of devices and scenarios with high real-time demands.
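  • The mask-and-multiply step of that traditional pipeline can be sketched numerically as follows (illustrative only; in practice the masks come from the trained network and the spectra from an STFT):

```python
import numpy as np

def apply_masks(mixture_spec, mask_1, mask_2):
    """mixture_spec and masks: (frames, freq_bins) arrays; masks lie in [0, 1]."""
    est_1 = mixture_spec * mask_1          # element-wise product -> speaker 1 estimate
    est_2 = mixture_spec * mask_2          # element-wise product -> speaker 2 estimate
    return est_1, est_2

def mse_loss(estimate, clean):
    """Mean square error, the usual training objective for this setup."""
    return float(np.mean((estimate - clean) ** 2))

# Toy numbers standing in for real spectrograms and network-estimated masks
rng = np.random.default_rng(0)
mix = rng.random((100, 257))                       # 100 frames x 257 frequency bins
m1 = rng.random((100, 257)); m2 = 1.0 - m1
s1_hat, s2_hat = apply_masks(mix, m1, m2)
print(mse_loss(s1_hat, mix), mse_loss(s2_hat, mix))
```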
  • After the first voice data is acquired in S110, the method judges whether a verification model corresponding to the marked sound source is stored; if not, S120 is executed, and if so, S130 is executed. That is, the set of verification models is traversed: if the set contains no verification model corresponding to this sound source, the system treats it as a new sound source and executes S120; if the set does contain a verification model corresponding to this sound source, S130 is executed.
  • S130 is a self-learning process: the sound source's verification model is optimized in one adaptation after another until its verification accuracy exceeds the preset threshold, at which point the verification accuracy of that sound source's model can be considered high enough. S140 can then be executed, and the verification model corresponding to the sound source is used directly to determine the sound source of the voice information, without continuing to determine it with the first manner described above. A sketch of this branching logic is given below.
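  • A compressed sketch of the routing between S120 and S130 (all helper names are hypothetical; adapt and verify stand for whatever adaptation and scoring the verification model implements):

```python
def handle_first_voice_data(voice_info, marked_source, model_store, base_model,
                            adapt, verify):
    """adapt(model, speech) -> adapted model; verify(model, speech) -> bool."""
    if marked_source not in model_store:
        # S120: adapt the pre-stored basic verification model and save the adapted
        # parameter set as this marked sound source's verification model.
        model_store[marked_source] = adapt(base_model, voice_info)
        return None
    # S130: judge whether the utterance matches the marked source, then keep
    # optimizing that source's model with the same utterance.
    model = model_store[marked_source]
    is_match = verify(model, voice_info)
    model_store[marked_source] = adapt(model, voice_info)
    return is_match

# Toy usage with trivial stand-ins for adapt/verify:
store = {}
handle_first_voice_data("speech_A1", "speaker_A", store, "base_model",
                        adapt=lambda m, s: f"{m}+{s}", verify=lambda m, s: True)
print(store)  # {'speaker_A': 'base_model+speech_A1'}
```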
  • Optionally, the verification model set containing the verification model corresponding to the marked sound source can be stored in the local storage unit of the voice recording device; the advantage of local storage is better data security and confidentiality. The verification model set can also be stored on a cloud device, whose advantage is that more verification models for different sound sources can be obtained.
  • the basic verification model is a generalization verification model obtained by training a meta-training model in advance.
  • the so-called generalization verification model refers to a speaker verification model with strong generalization ability trained using massive data.
  • The pre-training strategy adopted for the generalization verification model in this application is meta-training (Meta-train). A generalization verification model trained with meta-training does not learn how to fit sound sources absent from the training set directly; it learns how to adapt quickly to sound sources that do not appear in the training set.
  • FIG. 3 is a schematic comparison of the traditional generalization model and the meta-trained generalization verification model obtained with the meta-training strategy.
  • Figure 3(a) is the traditional generalization verification model
  • Figure 3(b) is the meta-trained generalization verification model.
  • Assume the training set consists of sound sources AB, AC and BC, while sound sources D, E and F do not appear in the training set.
  • sound source AB represents a mixed sound source of sound source A and sound source B
  • sound source AC represents a mixed sound source of sound source A and sound source C
  • sound source BC represents a mixed sound source of sound source B and sound source C.
  • As shown in FIG. 3(a), for the speaker verification task the traditional generalization verification model generally follows three steps: train (Train) - adapt (Enroll) - test (Test). If the training set is the mixed sound sources AB, AC and BC, then when training is complete (Train), the parameter set θ of the traditional generalization verification model settles at the middle position of AB, AC and BC. The "position" of the parameter set θ means treating the parameter set as a point in a high-dimensional vector space in which each parameter is one dimension; the value of each parameter affects where θ lies in this space. In the traditional generalization model, after training on AB, AC and BC, the parameter values place θ in the middle of the AB, AC and BC parameter sets in this high-dimensional space.
  • When sound sources D, E and F that never appeared in the training set are encountered, a certain amount of their speech (e.g., speech fragments) must be provided to adapt (Enroll) the original parameter set θ, moving the traditional generalization verification model closer to the new speakers D, E and F. Once adaptation is finished, the optimized generalization model can be used to test (Test) the speech information produced by D, E and F.
  • As shown in FIG. 3(b), meta-training also follows the Train-Enroll-Test steps, but compared with the traditional generalization verification model, the meta-trained generalization verification model needs far less adaptation speech (Enroll Speech), so a speaker verification system for any single speaker or any combination of speakers in {A, B, C, D, E, F} can be trained faster.
  • Specifically, the training set is again the mixed sound sources AB, AC and BC, but as shown in FIG. 3(b) the convergence position of the parameter set θ of the meta-trained generalization verification model is not in the middle of the training set. Therefore, when faced with speech information from the new sound sources D, E and F, only a very small amount of adaptation speech is needed for the model to match the new sound sources quickly.
  • In other words, the meta-trained generalization verification model is better at "learning to learn": no matter how many speakers absent from the training set it faces, it can adapt to them quickly, and its generalization and transfer ability is stronger. A compressed meta-training sketch is given below.
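  • The following is one possible realisation of such a meta-training loop, sketched in the Reptile style (the application does not fix a particular meta-learning algorithm, so the task construction, model and hyper-parameters below are illustrative assumptions): each task is a small verification problem, the model is cloned and adapted on it for a few steps, and the base parameters are nudged toward the adapted ones, so that later only a few enrolment utterances are needed for a speaker never seen in training.

```python
import copy
import torch
import torch.nn as nn

def inner_adapt(model, task_batch, steps=5, lr=1e-2):
    """Clone the model and take a few gradient steps on one verification task."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    feats, labels = task_batch               # labels: 1 = same speaker, 0 = different
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(adapted(feats).squeeze(-1), labels)
        loss.backward()
        opt.step()
    return adapted

def meta_train(model, task_sampler, meta_steps=1000, meta_lr=1e-3):
    """Move the base parameters toward each task's adapted parameters (Reptile)."""
    for _ in range(meta_steps):
        task_batch = task_sampler()           # e.g. utterances drawn from AB, AC or BC
        adapted = inner_adapt(model, task_batch)
        with torch.no_grad():
            for p, q in zip(model.parameters(), adapted.parameters()):
                p += meta_lr * (q - p)        # outer update toward the adapted weights
    return model

# Toy verifier over fixed-length voice feature vectors, with a random task sampler.
base_model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 1))
sampler = lambda: (torch.randn(16, 40), torch.randint(0, 2, (16,)).float())
meta_train(base_model, sampler, meta_steps=10)
```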
  • the "self-learning" process of the speech analysis method disclosed in the first embodiment of the present application and the speech recording device running the speech analysis method can be realized.
  • the obtained first voice data has obtained a small amount of voice information marked with a sound source, and the voice information can be used as the adapted voice for sound source verification.
  • Optionally, S120 (if no verification model corresponding to the marked sound source is stored, adapting the first voice information with the pre-stored basic verification model and saving the adapted model parameter set as the verification model corresponding to the marked sound source) includes: S121, feeding the first voice information into the meta-trained basic verification model for adaptation; and S122, storing the adapted parameter set as the verification model corresponding to that sound source.
  • For example, referring to FIG. 4, suppose there are two speakers, sound source A and sound source B, each with one marked adaptation utterance, belonging to spkA (SpeakerA) and spkB (SpeakerB) respectively. First, the speech SpeechA belonging to spkA is fed into the meta-trained basic verification model for adaptation, and the parameter set of the adapted verification model is saved in the local storage space of the voice recording device and stored under its tag name (or marked and stored under identity information entered by the user), i.e., as the A model.
  • Likewise, the speech SpeechB belonging to spkB is fed into the original meta-trained basic verification model for adaptation, the parameter set of the adapted verification model is also saved in the local storage space of the voice recording device, and it is marked as the B model.
  • The meta-trained generalization verification model mentioned in Embodiment 1 of this application is the basic verification model (Base model); it is a separate parameter set and is backed up separately. The voice information of every sound source is adapted on this basic verification model, and after adaptation the parameter set for the specific sound source is saved as the verification model for that sound source. From then on, the voice information of that sound source is always verified with the verification model corresponding to that sound source, not on the basic verification model and not on the verification model of any other sound source.
  • For example: spk A's voice information → Base model → A model, and spk B's voice information → Base model → B model; not spk A → Base model → A model followed by spk B → A model → B model (i.e., B is never adapted from A's model, always from the backed-up base model).
  • the A model will continue to be adapted with the voice information of the sound source A to achieve the effect of further optimization.
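  • A minimal sketch of this enrolment-and-storage flow (hypothetical class and helper names; real adaptation would update neural-network parameters rather than a toy dictionary):

```python
import copy

class VerificationModelStore:
    def __init__(self, base_params):
        self.base_params = base_params      # the Base model, backed up separately
        self.models = {}                    # tag -> adapted parameter set

    def enroll(self, tag, enrol_speech, adapt_fn):
        """S120: adapt the base model to one source and save it as '<tag> model'."""
        params = adapt_fn(copy.deepcopy(self.base_params), enrol_speech)
        self.models[tag] = params
        return params

    def verify(self, tag, speech, verify_fn):
        """Later utterances of this source are checked against its own model,
        never against the base model or another source's model."""
        return verify_fn(self.models[tag], speech)

# Toy usage: spk A and spk B are each adapted from the same backed-up base.
store = VerificationModelStore(base_params={"w": 0.0})
toy_adapt = lambda p, s: {"w": p["w"] + len(s)}      # stand-in for real adaptation
store.enroll("A", "SpeechA", toy_adapt)              # spk A -> Base model -> A model
store.enroll("B", "SpeechB", toy_adapt)              # spk B -> Base model -> B model
print(store.models)
```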
  • Optionally, optimizing the verification model in S130 includes:
  • the first voice information is adapted using the verification model corresponding to the marked sound source, and the optimized model parameter set is saved as the verification model corresponding to the marked sound source.
  • In Embodiment 1 of the present application, after the verification model corresponding to a sound source is established through S120, the verification model corresponding to that sound source is then trained continuously through S130 while that sound source (e.g., speaker A) keeps using the voice recording device running the voice analysis method, which raises the verification accuracy of the model for that sound source. That is, this is a self-optimizing process.
  • For example, referring to FIG. 5, take Speaker1 as the sound source: the multi-channel time-domain separation model yields Speaker1's marked sequence of time-domain speech segments S = {s1, s2, ..., st}; through S120 the corresponding verification model spk1model is obtained, and during Speaker1's subsequent use, S130 continuously trains spk1model and improves the model's verification accuracy for Speaker1.
  • When the accuracy reaches the preset condition, the sound source corresponding to a piece of voice information can be verified directly with the verification model corresponding to that sound source, and there is no longer any need to use the first manner (such as angle information or user input) to mark and confirm the sound source corresponding to the voice information.
  • the verification accuracy rate refers to the accuracy rate at which the verification model is used to determine whether the first voice information corresponds to the marked sound source, which can be expressed as:
  • The verification accuracy refers to the percentage, out of the total number of verifications, of the verifications in which the verification model of a specific sound source correctly decides whether a piece of voice information belongs to that specific sound source.
  • Specifically, the verification model judges the sound source of the voice information: for voice information of a specific sound source, the verification is correct when the sound source determined by the verification model is that specific sound source. For example, if what is to be verified is voice information of sound source A, but the verification model corresponding to sound source A judges that the voice information does not belong to sound source A, the judgment is wrong; if it judges that the voice information belongs to sound source A, the judgment is correct. If the voice information to be verified is not from sound source A, but the verification model corresponding to sound source A judges that it belongs to sound source A, the judgment is wrong; if it judges that the voice information does not belong to sound source A, the judgment is correct.
  • the verification accuracy rate refers to the accuracy rate at which the verification model is used to determine whether the first voice information corresponds to the marked sound source, and can also be expressed as:
  • the validation accuracy is determined according to the False Accept Rate (FAR) and the False Reject Rate (FRR).
  • Specifically, the verification accuracy is the correct rate at the point where the false acceptance rate (False Accept Rate, FAR) and the false rejection rate (False Reject Rate, FRR) are equal; it can be expressed as (1 - equal error rate), where the equal error rate (EER, Equal Error Rate) is the error rate at which the false acceptance rate equals the false rejection rate.
  • In general, for the verification accuracy of the verification model to be highest, the verification error rate must be lowest, and the verification error rate comprises the false acceptance rate and the false rejection rate. The false acceptance rate refers to accepting an incorrect result, that is, the voice information to be verified is not actually the voice of the specific sound source (e.g., the marked sound source) but the verification model mistakenly treats it as voice information of that specific sound source. The false rejection rate refers to rejecting a correct result, that is, the voice information to be tested is actually voice information of the specific sound source (e.g., the marked sound source) but the verification model mistakenly treats it as not being from that specific sound source.
  • On the plot, the false acceptance rate and the false rejection rate are two related (non-linearly correlated) curves: the larger the false acceptance rate, the smaller the false rejection rate, and vice versa. Their intersection is the equal error rate EER; when FAR = FRR = EER the verification error rate is lowest, and the verification correct rate, i.e. (1 - EER), is highest. A small numerical sketch follows.
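  • As an illustration, the EER point can be located numerically by sweeping a decision threshold over verification scores (toy scores below; a real system would use the scores produced by the verification model):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """scores: similarity scores; labels: 1 = genuine (marked source), 0 = impostor."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best = (1.0, None)
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[labels == 0])          # impostors wrongly accepted
        frr = np.mean(~accept[labels == 1])         # genuine trials wrongly rejected
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2.0)         # FAR ~ FRR at this threshold
    return best[1]

scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2]
labels = [1,   1,   0,    1,   0,   0,   1  ]
eer = equal_error_rate(scores, labels)
print("EER:", eer, "verification accuracy:", 1.0 - eer)
```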
  • Based on this behaviour, a trigger mechanism can be set: for example, when the verification accuracy is higher than the preset threshold (assumed to be 0.95), the verification model corresponding to the sound source starts to be used in place of the original sound source marking manner, such as determining the sound source by angle or having the user enter identity information.
  • Optionally, after S140, although the verification model corresponding to the sound source already meets the preset minimum requirement, i.e., its verification accuracy is greater than the preset threshold so that reliability has a minimum guarantee, the self-learning process of Embodiment 1 of the method provided by the present application does not stop. On the contrary, as long as the user keeps using the voice recording system running the method, voice information with marked sound sources keeps being produced, and this sound-source-marked voice information can still be used to adapt the verification model corresponding to the sound source so that its accuracy becomes higher and higher. This process can be called learning while in use.
  • During use, assume the scenario is a conference system; see the example in FIG. 6(a): the real-time multi-channel conference audio data stream is separated by the multi-channel time-domain voice separation model into a set of separated voice information and angles, in which each record contains a piece of voice information and its corresponding angle. From the angle information, the speaker (i.e., the sound source) corresponding to each piece of voice information is determined and the set of that speaker's voice information is collected; the voice information in each speaker's set is then fed into the meta-training model for training and adaptation, yielding verification models for the different sound sources: speaker 1 model, ..., speaker n model.
  • Referring to FIG. 6(b), once the verification accuracy of a verification model exceeds the preset value, the flow differs from that of FIG. 6(a): the verification model can now be used directly to determine the sound source of the voice information, that is, the received voice information is matched against the speaker models in the speaker verification model set, and the matched speaker model is used to verify the voice information. At the same time, that voice information is used to further adapt the verification model, further improving its verification accuracy.
  • When the accuracy reaches a level acceptable to the user, the verification model can replace the initial sound source determination manner to distinguish and verify the sound source of the voice information (a sketch of this use phase follows).
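  • A condensed sketch of that use phase under an assumed threshold of 0.95 (the helper callables are hypothetical stand-ins for the verification-model operations):

```python
ACCURACY_THRESHOLD = 0.95   # assumed value; the application only requires "a preset threshold"

def analyse_utterance(speech, models, accuracies, verify_fn, adapt_fn, fallback_mark):
    """Return the tag of the matched source, adapting its model on the way,
    or fall back to the first marking manner (angle / user input)."""
    for tag, model in models.items():
        if accuracies.get(tag, 0.0) >= ACCURACY_THRESHOLD and verify_fn(model, speech):
            models[tag] = adapt_fn(model, speech)     # learning while in use
            return tag
    return fallback_mark(speech)

# Toy usage with trivial stand-ins:
models = {"speaker_1": "model_1"}
accuracies = {"speaker_1": 0.97}
tag = analyse_utterance("utt", models, accuracies,
                        verify_fn=lambda m, s: True,
                        adapt_fn=lambda m, s: m,
                        fallback_mark=lambda s: "unknown")
print(tag)  # speaker_1
```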
  • the solution provided by the present application has better real-time performance, accuracy and flexibility.
  • Because the embodiments of the present application have this "self-learning" capability, a voice recording device running the method keeps learning and improving during use and becomes more and more intelligent; its accuracy in distinguishing the voices of different speakers becomes higher and higher, with almost no additional restrictions required. The more a user uses it, the more sensitively it distinguishes users, which can greatly improve user stickiness and user experience.
  • Moreover, the embodiments of the present application can train more distinctive verification models for different speakers and different conference scenarios, and verify and distinguish sound sources with higher reliability. All speech separation, speaker adaptation and speaker distinction are performed locally, and the speaker models are also stored in the local storage space. Compared with traditional intelligent conference recording systems, which additionally need a network connection to upload voice files to the cloud, the technical solution of the present invention not only reduces transmission latency and simplifies operation but also better protects the user's private information, making it suitable for scenarios with higher security and confidentiality requirements.
  • FIG. 7 shows a voice recording apparatus 200 according to Embodiment 2 of the present application.
  • The voice recording device 200 includes, but is not limited to, any one of a voice recorder, an audio conference terminal, an intelligent electronic device with a recording function, and the like. It may also be a device without a voice pickup function that only provides the analysis function of voice distinction or verification, such as a computer or another intelligent electronic device implementing this function; Embodiment 2 places no limitation on this.
  • the voice recording device 200 includes:
  • an obtaining unit 210 configured to obtain first voice data; wherein, the first voice data includes first voice information and a marked sound source corresponding to the first voice information;
  • a learning unit 220, configured to, if no verification model corresponding to the marked sound source is stored, adapt the first voice information with a pre-stored basic verification model and save the adapted model parameter set as the verification model corresponding to the marked sound source; and, if a verification model corresponding to the marked sound source is stored, use the verification model to judge whether the first voice information corresponds to the marked sound source and optimize the verification model;
  • the use unit 230 is configured to use the verification model to determine that the second voice information contained in the second voice data corresponds to the marked sound source when it is determined that the verification accuracy rate of the verification model exceeds a preset threshold; wherein, the verification accuracy rate is Refers to the accuracy of using the verification model to determine whether the first voice information corresponds to the marked sound source.
  • the acquiring unit 210 is specifically configured to collect multi-channel voice data; perform voice separation on the multi-channel voice data to obtain the separated first voice data.
  • the acquiring unit 210 is specifically configured to collect multi-channel voice data; separate the multi-channel voice data by using a time-domain signal separation method to obtain the separated first voice data.
  • the basic verification model is a generalization verification model obtained by training a meta-training model in advance.
  • Optionally, if a verification model corresponding to the marked sound source is stored, the learning unit 220 is specifically configured to use the verification model to judge whether the first voice information corresponds to the marked sound source, adapt the first voice information with the verification model corresponding to the marked sound source, and save the optimized model parameter set as the verification model corresponding to the marked sound source.
  • the preset threshold is determined according to the false acceptance rate and the false rejection rate.
  • the verification accuracy rate can be expressed as (1 - equal error rate), and its threshold can be set to 0.95, that is, the equal error rate should be less than 0.05.
  • FIG. 8 is a schematic structural diagram of a voice recording apparatus 300 according to Embodiment 3 of the present application.
  • The voice recording apparatus 300 includes: a processor 310 and a memory 320.
  • the communication connection between the processor 310 and the memory 320 is realized through a bus system.
  • the processor 310 invokes the program in the memory 320 to execute any one of the speech analysis methods provided in the first embodiment above.
  • The processor 310 may be an independent component or a collective term for multiple processing elements. For example, it may be a CPU, an ASIC, or one or more integrated circuits configured to implement the above method, such as at least one microprocessor DSP or at least one field-programmable gate array (FPGA).
  • the memory 320 is a computer-readable storage medium on which programs executable on the processor 310 are stored.
  • the voice processing device 300 further includes: a sound pickup device 330 for acquiring voice information.
  • the processor 310, the memory 320, and the sound pickup device 330 are connected to each other through a bus system for communication.
  • the processor 310 invokes the program in the memory 320, executes any one of the speech analysis methods provided in the first embodiment, and processes the multi-channel speech information acquired by the sound pickup device 330.
  • the functions described in the specific embodiments of the present application may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • When implemented in software, the functions may be implemented by a processor executing software instructions.
  • the software instructions may consist of corresponding software modules.
  • the software modules may be stored in a computer-readable storage medium, which may be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that includes an integration of one or more available media.
  • The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., Digital Video Disc (DVD)), semiconductor media (e.g., Solid State Disk (SSD)), and the like.
  • the computer-readable storage medium includes but is not limited to random access memory (Random Access Memory, RAM), flash memory, read only memory (Read Only Memory, ROM), Erasable Programmable Read Only Memory (Erasable Programmable ROM, EPROM) ), Electrically Erasable Programmable Read-Only Memory (Electrically EPROM, EEPROM), registers, hard disks, removable hard disks, compact disks (CD-ROMs), or any other form of storage medium known in the art.
  • An exemplary computer-readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the computer-readable storage medium.
  • the computer-readable storage medium can also be an integral part of the processor.
  • the processor and computer-readable storage medium may reside in an ASIC. Additionally, the ASIC may reside in access network equipment, target network equipment or core network equipment.
  • the processor and the computer-readable storage medium may also exist as discrete components in the access network device, the target network device or the core network device. When implemented in software, it can also be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer program instructions may be stored in the computer-readable storage medium described above, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (such as coaxial cable, optical fiber or Digital Subscriber Line (DSL)) or wirelessly (such as by infrared, radio or microwave).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice analysis method and a voice recording device. The method includes: acquiring first voice data, wherein the first voice data includes first voice information and a marked sound source corresponding to the first voice information; if no verification model corresponding to the marked sound source is stored, adapting the first voice information with a pre-stored basic verification model, and saving the adapted model parameter set as the verification model corresponding to the marked sound source; if a verification model corresponding to the marked sound source is stored, using the verification model to judge whether the first voice information corresponds to the marked sound source, and optimizing the verification model; and, when it is determined that the verification accuracy of the verification model exceeds a preset threshold, using the verification model to determine the sound source corresponding to second voice information contained in second voice data. The verification model in the method can be optimized continuously, making it more flexible to use and more accurate.

Description

一种语音分析方法及其语音记录装置 技术领域
本发明涉及音频技术领域,尤其涉及一种语音区分和验证的技术领域。
背景技术
随着深度学习领域研究的不断深入,越来越多的硬件设备向智能化发展。同时,得益于嵌入式芯片在计算和存储能力方面取得质的飞跃,曾经只能在GPU(Graphics Processing Unit,图形处理器)上运行的深度学习模型有了小型化的硬件基础,而便携式移动设备的智能化和小型化也成为了当前技术发展的潮流。智能会议记录系统就是其中之一。
一个良好的智能会议记录系统应当可以实现对会议中所有出现过的说话人的语音进行记录和识别,即记录和识别谁说了话、说了哪些话。它包含语音分离技术、语音识别技术、说话人识别技术等众多前沿技术。但因目前这些技术还面临许多学术和工程难题,导致现有的会议记录系统虽然也带有一些智能化的算法和模型,往往需附加了一些限制条件,导致的所谓“智能”在用户体验面前就显得不那么“智能”了。并且,这些传统的智能算法往往都抱着“一劳永逸”的思路,通过海量的数据训练一个具有较强泛化性能的模型,来让这个“万能”模型适配实际应用中出现的所有的说话人。
但是,实际情况却是,由于语音是千变万化的,即使是同一个说话人,在不同的心情、语气、声调下所产生的语音频谱都有很大的差异,而传统的基于声纹识别的算法很难捕获到这些变化,从而影响说话人识别和区分的准确率。
发明内容
本申请提供一种可提升声源识别和区分的语音分析的方法及其语音记录装置。
本申请提供以下技术方案:
一方面,提供一种语音分析方法,其包括:获取第一语音数据,其中,所述第一语音数据包括第一语音信息以及所述第一语音信息对应的标记声源;若未存储与所述标记声源对应的验证模型,采用预先存储的基础验证模型对所述第一语音信息进行适配, 并将适配后的模型参数集作为与所述标记声源对应的验证模型进行保存;若存储有与所述标记声源对应的验证模型,采用所述验证模型判断所述第一语音信息是否与所述标记声源对应,并对所述验证模型进行优化;当确定所述验证模型的验证准确率超过预设阈值时,采用所述验证模型确定第二语音数据中包含的第二语音信息对应的声源;其中,所述验证准确率是指采用所述验证模型判断所述第一语音信息是否与所述标记声源对应的准确率。
另一方面,提供一种语音记录设备,其包括:获取单元,用于获取第一语音数据;其中,所述第一语音数据包括第一语音信息,以及所述第一语音信息对应的标记声源;学习单元,用于未存储与所述标记声源对应的验证模型,采用预先存储的基础验证模型对所述第一语音信息进行适配,并将适配后的模型参数集作为与所述标记声源对应的验证模型进行保存;若存储有与所述标记声源对应的验证模型,采用所述验证模型判断所述第一语音信息是否与所述标记声源对应,并对所述验证模型进行优化;使用单元,用于当确定所述验证模型的验证准确率超过预设阈值时,采用所述验证模型确定第二语音数据中包含的第二语音信息对应所述标记声源;其中,所述验证准确率是指采用所述验证模型判断所述第一语音信息是否与所述标记声源对应的准确率。
本申请的有益效果在于,由于在使用的过程中会针对不同声源单独生成一个验证模型,并且该验证模型会在使用过程中,不断地自我学习达到进一步优化的效果,当验证准确率达到预设值之后,即可直接采用验证模型验证声源。用针对不同声源的特定的验证模型进行声源验证,准确率会更高,并且使用时几乎不受其他条件的限制,使用更为灵活。
附图说明
图1为本申请实施方式一提供的一种语音分析方法的流程图。
图2为本申请实施方式一中多通道时域语音分离模型的示意图。
图3为本申请实施方式一中传统的泛化模型和元训练的验证模型的比对示意图。
图4为本申请实施方式一中语音信息对基础验证模型的适配过程示意图。
图5为本申请实施方式一中使用学习阶段的示意图。
图6(a)为本申请实施方式一提供的语音分析方法验证准确率低于预设阈值时的流程示意图。
图6(b)为本申请实施方式一提供的语音分析方法验证准确率大于或等于预设阈值时的流程示意图。
图7为本申请实施方式二提供的一种语音记录装置的方框示意图。
图8本申请实施方式三提供的一种语音记录装置的结构示意图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施方式,对本申请进行进一步详细说明。应当理解,此处所描述的实施方式仅用以解释本申请,并不用于限定本申请。但是,本申请可以以多种不同的形式来实现,并不限于本文所描述的实施方式。相反地,提供这些实施方式的目的是使对本实用新型的公开内容的理解更加透彻全面。
除非另有定义,本文所实用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中在本申请的说明书中所使用的术语只是为了描述具体的实施方式的目的,不是旨在限制本申请。
应理解,本文中术语“系统”或“网络”在本文中常被可互换使用。本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
本申请实施例可以应用于各种带有语音分析功能的各种语音记录装置。例如:录音笔、音频会议终端、智能会议记录装置或者有录音功能的智能电子设备等。以下将通过具体的实施方式对本申请的技术方案进行阐述。
实施方式一
请参看图1,为本申请实施方式一提供的一种语音分析方法。其中,分析包括但不限于声源验证和声源区分。声源可以是指说话人,不同的声源即指不同的说话人,相同的声源即指相同的说话人。声源区分是指对声源进行判断的结果,即,区分不同的声源发出的音频信息。声源区分可以不需要获取到声源发出的完整语音,只需获取其中一部分即可,如一句话,甚至是一句话中的一个单词或片段。可选的,本申请中的声源区分是以低延迟的形式在声源发出声音(如说话人说话)的同时给出上一时刻对该声源的判断结果。声源验证是指判断语音信息是否确实与标记声源对应,或者说,判断语音信息是否属于该标记声源。
该语音分析方法100包括:
S110,获取第一语音数据,其中,该第一语音数据包括第一语音信息以及该第一语音信息对应的标记声源;其中,该标记声源是采用第一方式确定的该第一语音信息对应的声源,该第一方式是除本申请所揭示的第二方式以外的声源分析(包括声源区分和声源验证)方式;如:该标记声源可以是通过声源的角度信息确定的,或者是,用户通过软件界面或者在终端设备上直接输入的;该角度信息可以是采用语音记录装置上的麦克风阵列的DOA(Direction of Arrival,波达方向)技术获取的各声源的角度信息,或者,是采用语音记录装置上的指向型麦克风获取的各声源的角度信息;又或者是,采用角度区分法获得标记声源,即,先通过神经网络把多通道语音数据进行分离,分离后的语音数据包括分离的语音信息以及对应的角度信息,然后再通过角度划分算法和声纹特征来区分声源(即说话人),并对每个角度一定范围内的区域所指示的声源进行标记。
S120,若未存储与该标记声源对应的验证模型,采用预先存储的基础验证模型对该第一语音信息进行适配,并将适配后的模型参数集作为与该标记声源对应的验证模型进行保存;
S130,若存储有与该标记声源对应的验证模型,采用该验证模型判断该第一语音信息是否与该标记声源对应,并对该验证模型进行优化;
S140,当确定该验证模型的验证准确率超过预设阈值时,采用该验证模型确定第二语音数据中包含的第二语音信息对应的声源;其中,该验证准确率是指采用该验证模型判断该第一语音信息是否与该标记声源对应的准确率。
其中,该第一语音信息与该第二语音信息均属于该声源的语音信息。
可选的,该第二语音数据中也包含标记声源,由此,该方法在使用过程中,仍然可以继续用第二语音数据包中的第二语音信息及标记声源,对该声源对应的验证模型进行适配,进一步优化该验证模型,提升其验证准确率。
可选的,S110,该获取第一语音数据,包括:
S111,采集多通道语音数据;可选的,通过麦克风阵列采集多通道语音数据;采用麦克风阵列获取多通道语音数据,还可捕获到声源的角度信息,根据这些空间信息初步界定该多通道语音数据中,不同的语音信息对应的声源(即说话人);
S112,对该多通道语音数据进行语音分离,获得分离后的该第一语音数据。此后,该方法即可将该第一语音数据作为初始数据,进行“自我学习”。
其中,该多通道语音数据中可能包含重叠语音,因此需要将该多通道语音数据送入一个语音分离模块进行语音分离。可选的,语音分离模块由一个神经网络构成,其结构包括若干个卷积神经网络(CNN,Convolutional Neural Networks)组成的编码器和解码器。
可选的,S112,对该多通道语音数据进行语音分离,包括:对该多通道语音数据采用时域信号分离方式进行分离。
请参见图2,采用时域信号分离方式对多通道语音数据进行分离的示意图。其中,多通道语音数据包含各个声源的角度信息。以两个说话人为例,说话人1和说话人2,即不同的声源,在不同的空间位置说话,时序信号被麦克风阵列采集到,获得多通道的混合语音数据,将该多通道的混合语音数据送入由卷积神经网络组成的编码器组(Encoders),并且变换成多维度的表示(Representations);然后,该多通道的混合语音数据再经解码器组(Decoders)输出各个声源估计的语音信号以及对应的估计的角度信息。此时,再引入多通道的混合语音数据中各个声源干净的语音信号及其各自准确的角度信息作为标签,计算损失函数并优化该损失函数,完成训练。本申请所揭示的语音信号分离方式中无需经过傅里叶变换等特征提取过程,而是直接使用时域的语音信号进行训练,相对于需将语音信息的时域波形做傅里叶变换转换到频域,然后在对该语音信息的频谱特征进行学习的传统的深度学习语音分离方式,可降低时延。
而传统的深度学习语音分离方式通常是对语音信号进行时频分析,即将语音信号的时域波形做傅里叶变换转换到频域,然后对频谱特征进行学习。以单声道监督式两说话人语音分离为例,混合语音(Mixture Speech)经特征提取(Feature Extraction)模块获得了单声道混合语音信号的频谱,送入神经网络(Neural Network)。在训练过程中,神经网络会估计两个说话人相关的掩蔽
此时利用混合信号的频谱特征和两个估计掩蔽分别计算元素积(Element-wise product)得到两个说话人估计的语音信号
这时引入混合语音中两个说话人原始的干净语音信号,和估计语音信号
计算损失函数(Loss Function),通常是均方误差(Mean Square Error,MSE),然后优化该损失函数使模型收敛,训练完毕。这种传统的时频分析方式虽然只需要单通道语音数据即可完成训练,但是由于在特征提取以及波形重建时必须要经过傅里叶变换和逆傅里叶变换,这会带来一定的计算时延,对于实时性要求较高的设备和场景来说未必能满足其低延迟的要求。除此之外,这种传统的算法都希望通过一个泛化验证模型就可以解决所有的语音分离问题,在实际应用中这显然是不现实的。
在S110获取第一语音数据之后,该方法会判断是否存储有与该标记声源对应的验证模型,无则执行S120,有则执行S130。即,遍历验证模型集合,若验证模型集合中没有该声源对应的验证模型,那么系统会将其判定为新的声源,执行S120;若验证模型集合中有该声源对应的验证模型,则执行S130。S130为自我学习的过程,该声源的验证模型在一次又一次的适配过程中得到优化,直至验证准确率超过预设阈值,则可认为该声源的验证模型的验证准确率已经足够高,则可以执行S140,直接采用该声源对应的验证模型确定语音信息对应的声源,而无需继续采用上述第一方式确定语音信息对应的声源。
可选的,包含该标记声源对应的验证模型的验证模型集合可存储于语音记录设备的本地存储单元中,存储于本地的优点在于提升数据的安全性和保密性;包含该标记声源对应的验证模型的验证模型集合也可以存储与云端设备,其优点在于可以获得更多的针对不同声源的验证模型。
可选的,该基础验证模型为预先通过元训练模型训练获得的泛化验证模型。所谓泛化验证模型是指使用海量数据训练的具有较强的泛化能力的说话人验证模型。本申请中的该泛化验证模型采用的预训练模型为元训练策略(Meta-train)。采用元训练模型训练的泛化验证模型,学习的不是如何直接适配训练集中未出现的声源的方法,而是学习如何快速地去适配训练集中未出现的声源的方法。
请参见图3,传统的泛化模型与采用元训练策略训练获得的元训练泛化验证模型的比较示意图。其中,图3(a)为传统的泛化验证模型,图3(b)为元训练的泛化验证模型。
假设训练集为声源AB、AC和BC,训练集中未出现的声源D、E和F。其中,声源AB表示声源A与声源B的混合声源,声源AC表示声源A与声源C的混合声源,声源BC表示声源AB与声源C的混合声源。
如图3(a)所示,针对说话人验证任务,若采用传统的泛化验证模型的方式,其过程一般为训练(Train)-适配(Enroll)-测试(Test)三个步骤。若训练集为混合声源AB、AC和BC,那么在完成训练时(Train),传统的泛化验证模型的参数集θ会到达AB、AC、BC的中间位置。所谓参数集θ的位置,是指将参数集看做一个高维向量空间,其中每个参数是一个维度,每个参数的取值,都可影响参数集θ在该空间中的位置。在传统的泛化模型中,根据训练集AB、AC和BC训练完之后,参数集θ中的各个参数的取值决定了其在该高维向量空间中位于该AB、AC、BC的参数集的中间位置。当遇到训练集 中未出现的声源D、E和F时,需要提供一定数量的D、E、F的语音信息(如:语音片段),对原本的参数集θ进行适配(Enroll),使得该传统的泛化验证模型更靠近D、E、F这些新的说话人,当适配过程完成以后,就可以用优化后的泛化模型对D、E、F发出的语音信息进行测试(Test)。
如图3(b)所示,元训练虽然同样也遵循这种Train-Enroll-Test的步骤,但元训练的泛化验证模型和传统泛化验证模型相比,其需要的适配语音(Enroll Speech)少很多,因而可以更快地训练出{A,B,C,D,E,F}中任意一个说话人或者任意说话人组合的说话人验证系统。具体而言,训练集同样为混合声源AB、AC和BC,元训练的泛化验证模型的参数集θ收敛位置如图3(b)所示并不在训练集的中间,因此,在面对新的声源D、E和F发出的语音信息时,只需要极少量的适配语音就能让模型快速地匹配新的声源。也就是说,元训练的泛化验证模型更懂得“学会学习(Learn to learn)”,不管面对多少训练集中未出现的说话人,元训练的泛化验证模型都能很快地对这些说话人进行适配,泛化和迁移能力更强。
有了元训练的泛化验证模型作为验证声源的基础验证模型,那本申请实施例一揭示的语音分析方法以及运行该语音分析方法的语音记录装置的“自我学习”过程即可实现了。在S110中,获得的第一语音数据,已经得到了少量的带有声源标记的语音信息,这些语音信息即可作为声源验证的适配语音。
可选的,S120,若未存储与该标记声源对应的验证模型,采用预先存储的基础验证模型对该第一语音信息进行适配,并将适配后的模型参数集作为与该标记声源对应的验证模型进行保存,包括:
S121,将该第一语音信息送入元训练的基础验证模型进行适配;
S122,存储适配后的参数集作为该声源对应的验证模型。
举例说明,请参见图4,假设有两个说话人,即声源A和声源B,其各自有一条标记好的适配语音,分别为spkA(SpeakerA)和spkB(SpeakerB)。首先,将属于spkA的语音SpeechA送入元训练的基础验证模型进行适配,然后把适配后的验证模型的参数集保存在语音记录设备的本地存储空间中,并按照标记名进行存储(或者用户通过输入的身份信息进行标记并存储),即,A model。同样的,再把属于spkB的语音SpeechB送入原始的元训练的基础验证模型进行适配,同样把适配完的验证模型的参数集保存在语音记录设备的本地存储空间中,并标记为B model。本申请实施方式一中提及的元训练的泛化验证模型为基础验证模型(Base model),是一个单独的参数集,进行专门备份。不 同的声源的语音信息都是在该基础验证模型上进行适配,适配完毕后将特定声源的参数集保存下来,作为针对该声源的验证模型,此后该声源的语音信息均通过该声源对应的验证模型进行声源验证,而不是在基础验证模型上进行验证,或者,在针对其他声源的验证模型上进行验证。例如:A的语音信息(spk A)—基础模型(Base model)—A的验证模型(A model),spk B--Base model--B model,而不是spk A--Base model--A model,spk B--A model--B model。并且,进一步的,还会继续用声源A的语音信息对A model进行适配,达到进一步优化的效果。
可选的,S130中对该验证模型进行优化,包括:
采用该标记声源对应的验证模型对该第一语音信息进行适配,并将优化后的模型参数集作为该标记声源对应的验证模型进行保存。
在本申请实施例方式一种,通过S120建立与声源对应的验证模型后,此后,在该声源(如说话人A)不断使用运行该语音分析方法的语音记录装置的过程中,通过S130不断地对该声源对应的验证模型进行训练,提升该验证模型针对该声源进行验证的准确率。即,此为一个自我优化的过程。
举例说明,请参见图5,以声源为说话人1(Speaker1)为例,通过多通道时域语音分离模型得到Speaker1带有标记的时域语音片段序列S={s 1,s 2,...,s t},经过S120的步骤,得到了Speaker1对应的验证模型spk1mode,在后续Speaker1持续使用过程中,经过S130可对spk1model不断地进行训练并提升该模型对Speaker1验证的准确率。
当准确率达到预设条件时,则可直接根据该声源对应的验证模型来验证语音信息对应的声源,而无需再采用第一方式(如标记角度信息或用户输入)来标确认语音信息对应的声源。
可选的,S140中,该验证准确率是指采用该验证模型判断该第一语音信息是否与该标记声源对应的准确率,可以表达为:
该验证准确率是指特定声源的验证模型判断语音信息是否属于该特定声源的正确次数占总验证次的数百分比。
具体而言,是验证模型对语音信息的声源进行判断,即,对于特定声源的语音信息,验证模型判断出的声源即为该特征声源,此次验证即为正确的。举例说明,如果待验证的是声源A的语音信息,但是声源A对应的验证模型模型判断该语音信息不是属于声源A的,则判断错误;判断该语音信息是属于声源A的,则判断正确。如果待验 证的不是声源A的语音信息,但是声源A对应的验证模型模型判断该语音信息是属于声源A的,则判断错误;判断该语音信息是不属于声源A的,则判断正确。
可选的,S140中,该验证准确率是指采用该验证模型判断该第一语音信息是否与该标记声源对应的准确率,也可以表达为:
验证准确率根据假接受率(False Accept Rate,FAR)和假拒绝率(False Reject Rate,FRR)确定。
具体而言,该验证准确率是假接受率(False Accept Rate,FAR)和假拒绝率(False Reject Rate,FRR)相等时的正确率;可表达为(1-等错误率),等错误率(EER,Equal Error Rate)为假接受率(和假拒绝率相等时的错误率。
一般情况下,若要使得验证模型的验证准确率最高,则要使得验证错误率最低,而验证错误率包括了假接收率和假拒绝率,假接受率是指接受了一个错误的结果,即待验证语音信息原本不是特定声源(如标记的声源)的语音但是验证模型错误地把它当成是特定声源的语音信息;假拒绝率是指拒绝了一个正确的结果,即指待测语音信息原本是特定声源(如标记的声源)的语音信息但是验证模型错误地把它当成不是特定声源的语音信息。假接收率和假拒绝率在图像上是两条相关(非线性相关)的曲线,假接收率越大,假拒绝率越小,反之,假拒绝率越大,假接收率越小,两曲线的交点即为等错误率EER,当FAR=FRR=EER时,验证错误率最低。此时,验证正确率,即(1-ERR),也就是最高。根据这种规律,我们可设置一个触发机制,如当验证准确率高于预设阈值(假设为0.95)时,开始启用声源对应的验证模块来代替最初的声源标记方式,如通过角度来判断声源,或者,用户自行输入身份信息的方式。
可选的,在S140之后,虽然当声源对应验证模型可满足预设的最低要求,即验证准确率大于预设阈值,可靠性有了最低保证。但是,本申请提供的方法的实施例一的自我学习过程并不会终止,相反,只要用户依然在使用运行该方法的语音记录系统,就会不断的有标记声源的语音信息产生,该方法依然可以用这些有声源标记的语音信息继续去适配该声源对应验证模型,让其准确率越来越高,该过程可称之为使用时学习。
使用过程中,假设使用场景为会议系统,请参见请参见图6(a)中之举例,实时多通道会议音频数据流,经过多通道时域语音分离模型分离后,得到分离后的语音信息和角度的集合,其中每条记录中都包含语音信息及其对应的角度,通过角度信息,确定语音信息对应的说话人(即声源),获得说话人语音信息集合,再将说话人语音信息集合中的语音信息分别送入元训练模型进行训练和适配,获得针对不同声源的验证模型, 说话人1模型……说话人n模型。请继续参见图6(b),当验证模型的验证准确率大于预设值之后,与图6(a)中的验证过程不同的是,此时,可直接采用验证模型来确定语音信息的声源,即,将收到的语音信息与说话人验证模型集合中的说话人模型进行匹配,采用匹配上的说话人模型对该语音信息进行验证。同时,还会用该语音信息对该验证模型进行进一步的适配,以进一步提升该验证模型的验证准确率。
本申请的实施方式,由于在使用过程中,使用语音记录装置的人位置会发生变换,并且也会有不同的人来使用该语音记录装置,如果采用通过角度信息确定声源,或者,用户自行输入的方式确定声源,其不确定性会增大,可靠性降低。因此,当本申请实施方式一提供的方法,经历了S120、S130“自我学习”阶段和“使用时学习”阶段之后,针对不同的声源生成特定的声源验证模型,并且在使用过程中,针对该声源的验证准确率会不断的得到提升,当达到用户可以接受的程度的时,即可用验证模块来替换初始的声源确定方式,对语音信息的声源进行区分和验证。并且,在该方法中,随着特定声源的使用时间越来越长,该特定声源对应的验证模型的准确率也会越来越高,形成一个良性循环,实现运行该方法的语音记录设备“自我学习”的目的。因此,相较于传统的声源验证方式,本申请提供的方案具备更好的实时性、准确性和灵活性。
由于,本申请的实施方式具备这种“自我学习”的功能,能够让运行该方法的语音记录设备在使用的过程中不断学习、不断完善,越来越智能,对不同的说话人的语音区分准确率越来越高,并且几乎不需要附加任何限制条件。用户使用越多,对用户的区分灵敏度越高,可极大地提高用户粘性和用户体验。
并且,本申请的实施方式,,可以针对不同的说话人和不同的会议场景,训练出更具有独特性的验证模型,对声源进行验证和区分,可靠性更高;所有的语音分离、说话人适配以及说话人区分过程都在本地进行,说话人模型也保存在本地存储空间中,相比于传统智能会议记录系统还需联网将语音文件上传到云端的操作,本发明的技术方案不仅可以降低传输时延、简化操作,更能保证用户隐私信息,适用于安全、保密性要求更高的场景。
实施方式二
请参看图7,为本申请实施方式二提供的一种语音记录装置200。该语音记录装置200包括但不限于录音笔、音频会议终端、或者有录音功能的智能电子设备等中任意一种,也可以是不包含语音拾取功能,仅包含语音区分或验证的分析功能,可实现该功能的电脑或其他智能电子设备,对此在本实施方式二中不做限定。
该语音记录装置200包括:
获取单元210,用于获取第一语音数据;其中,该第一语音数据包括第一语音信息,以及该第一语音信息对应的标记声源;
学习单元220,用于若未存储与该标记声源对应的验证模型,采用预先存储的基础验证模型对该第一语音信息进行适配,并将适配后的模型参数集作为与该标记声源对应的验证模型进行保存;若存储有与该标记声源对应的验证模型,采用该验证模型判断该第一语音信息是否与该标记声源对应,并对该验证模型进行优化;
使用单元230,用于当确定该验证模型的验证准确率超过预设阈值时,采用该验证模型确定第二语音数据中包含的第二语音信息对应该标记声源;其中,该验证准确率是指采用该验证模型判断该第一语音信息是否与该标记声源对应的准确率。
该可选的,该该获取单元210,具体用于采集多通道语音数据;对该多通道语音数据进行语音分离,获得分离后的该第一语音数据。
可选的,该获取单元210,具体用于采集多通道语音数据;对该多通道语音数据采用时域信号分离方式进行分离,获得分离后的该第一语音数据。
可选的,该基础验证模型为预先通过元训练模型训练获得的泛化验证模型。
可选的,若存储有与该标记声源对应的验证模型,该学习单元220,具体用于采用该验证模型判断该第一语音信息是否与该标记声源对应,并采用该标记声源对应的验证模型对该第一语音信息进行适配,并将优化后的模型参数集作为该标记声源对应的验证模型进行保存。
可选的,该预设阈值根据假接受率和假拒绝率确定。具体而言,该验证准确利率可表达为(1-等错误率),其阈值可以设置为0.95,即等错率要小于0.05。
本实施方式二中有不详尽之处、或优化方案、或者具体的实例,请参见上述实施方式一中相同或对应的部分,在此不做重复赘述。
实施方式三
请参看图8,本申请实施方式三提供的一种语音记录装置300的结构示意图。该视频处理装置300包括:处理器310以及存储器320。处理器310、存储器320之间通过总线系统实现相互的通信连接。处理器310调用存储器320中的程序,执行上述实施方式一提供的任意一种语音分析方法。
该处理器310可以是一个独立的元器件,也可以是多个处理元件的统称。例如,可以是CPU,也可以是ASIC,或者被配置成实施以上方法的一个或多个集成电路,如 至少一个微处理器DSP,或至少一个可编程门这列FPGA等。存储器320为一计算机可读存储介质,其上存储可在处理器310上运行的程序。
可选的,该语音处理装置300还包括:声音拾取装置330用于获取语音信息。处理器310、存储器320、声音拾取装置330之间通过总线系统实现相互的通信连接。处理器310调用存储器320中的程序,执行上述实施方式一提供的任意一种语音分析方法,处理该声音拾取装置330获取的多通道语音信息。
本实施方式三中有不详尽之处,请参见上述实施方式一中相同或对应的部分,在此不做重复赘述。
本领域技术人员应该可以意识到,在上述一个或多个示例中,本申请具体实施方式所描述的功能可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以是由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成。软件模块可以被存放于计算机可读存储介质中,所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,数字视频光盘(Digital Video Disc,DVD))、或者半导体介质(例如,固态硬盘(Solid State Disk,SSD))等。所述计算机可读存储介质包括但不限于随机存取存储器(Random Access Memory,RAM)、闪存、只读存储器(Read Only Memory,ROM)、可擦除可编程只读存储器(Erasable Programmable ROM,EPROM)、电可擦可编程只读存储器(Electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、只读光盘(CD-ROM)或者本领域熟知的任何其它形式的存储介质。一种示例性的计算机可读存储介质耦合至处理器,从而使处理器能够从该计算机可读存储介质读取信息,且可向该计算机可读存储介质写入信息。当然,计算机可读存储介质也可以是处理器的组成部分。处理器和计算机可读存储介质可以位于ASIC中。另外,该ASIC可以位于接入网设备、目标网络设备或核心网设备中。当然,处理器和计算机可读存储介质也可以作为分立组件存在于接入网设备、目标网络设备或核心网设备中。当使用软件实现时,也可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机或芯片上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请具体实施方式所述的流程或功能,该芯片可包含有处理器。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机程序指令可以存储在上述计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计 算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(Digital Subscriber Line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。
上述实施方式说明但并不限制本发明,本领域的技术人员能在权利要求的范围内设计出多个可代替实例。所属领域的技术人员应该意识到,本申请并不局限于上面已经描述并在附图中示出的精确结构,对在没有违反如所附权利要求书所定义的本发明的范围之内,可对具体实现方案做出适当的调整、修改、、等同替换、改进等。因此,凡依据本发明的构思和原则,所做的任意修改和变化,均在所附权利要求书所定义的本发明的范围之内。

Claims (14)

  1. A voice analysis method, characterized in that the method comprises:
    acquiring first voice data, wherein the first voice data includes first voice information and a marked sound source corresponding to the first voice information;
    if no verification model corresponding to the marked sound source is stored, adapting the first voice information with a pre-stored basic verification model, and saving the adapted model parameter set as the verification model corresponding to the marked sound source;
    if a verification model corresponding to the marked sound source is stored, using the verification model to judge whether the first voice information corresponds to the marked sound source, and optimizing the verification model;
    when it is determined that the verification accuracy of the verification model exceeds a preset threshold, using the verification model to determine the sound source corresponding to second voice information contained in second voice data; wherein the verification accuracy refers to the accuracy with which the verification model judges whether the first voice information corresponds to the marked sound source.
  2. The method according to claim 1, characterized in that acquiring the first voice data comprises:
    collecting multi-channel voice data;
    performing voice separation on the multi-channel voice data to obtain the separated first voice data.
  3. The method according to claim 2, characterized in that performing voice separation on the multi-channel voice data comprises:
    separating the multi-channel voice data in a time-domain signal separation manner.
  4. The method according to claim 1, characterized in that the basic verification model is a generalization verification model obtained in advance by training with a meta-training model.
  5. The method according to claim 1, characterized in that optimizing the verification model comprises:
    adapting the first voice information with the verification model corresponding to the marked sound source, and saving the optimized model parameter set as the verification model corresponding to the marked sound source.
  6. The method according to claim 1, characterized in that the preset threshold is determined according to a false acceptance rate and a false rejection rate.
  7. A voice recording device, characterized in that the voice recording device comprises:
    an acquiring unit, configured to acquire first voice data, wherein the first voice data includes first voice information and a marked sound source corresponding to the first voice information;
    a learning unit, configured to, if no verification model corresponding to the marked sound source is stored, adapt the first voice information with a pre-stored basic verification model and save the adapted model parameter set as the verification model corresponding to the marked sound source; and, if a verification model corresponding to the marked sound source is stored, use the verification model to judge whether the first voice information corresponds to the marked sound source and optimize the verification model;
    a use unit, configured to, when it is determined that the verification accuracy of the verification model exceeds a preset threshold, use the verification model to determine that second voice information contained in second voice data corresponds to the marked sound source; wherein the verification accuracy refers to the accuracy with which the verification model judges whether the first voice information corresponds to the marked sound source.
  8. The voice recording device according to claim 7, characterized in that the acquiring unit is specifically configured to collect multi-channel voice data and perform voice separation on the multi-channel voice data to obtain the separated first voice data.
  9. The voice recording device according to claim 8, characterized in that the acquiring unit is specifically configured to collect multi-channel voice data and separate the multi-channel voice data in a time-domain signal separation manner to obtain the separated first voice data.
  10. The voice recording device according to claim 7, characterized in that the basic verification model is a generalization verification model obtained in advance by training with a meta-training model.
  11. The voice recording device according to claim 7, characterized in that, if a verification model corresponding to the marked sound source is stored, the learning unit is specifically configured to use the verification model to judge whether the first voice information corresponds to the marked sound source, adapt the first voice information with the verification model corresponding to the marked sound source, and save the optimized model parameter set as the verification model corresponding to the marked sound source.
  12. The voice recording device according to claim 7, characterized in that the preset threshold is determined according to a false acceptance rate and a false rejection rate.
  13. A voice recording device, characterized in that the voice recording device comprises a processor and a memory; the processor invokes a program in the memory to execute the voice analysis method according to any one of claims 1 to 6.
  14. A computer-readable storage medium, characterized in that a program of a voice analysis method is stored on the computer-readable storage medium, and when the program of the voice analysis method is executed by a processor, the voice analysis method according to any one of claims 1 to 6 is implemented.
PCT/CN2021/120416 2021-02-03 2021-09-24 一种语音分析方法及其语音记录装置 WO2022166220A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110149946.4A CN112992174A (zh) 2021-02-03 2021-02-03 一种语音分析方法及其语音记录装置
CN202110149946.4 2021-02-03

Publications (1)

Publication Number Publication Date
WO2022166220A1 true WO2022166220A1 (zh) 2022-08-11

Family

ID=76346460

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120416 WO2022166220A1 (zh) 2021-02-03 2021-09-24 一种语音分析方法及其语音记录装置

Country Status (2)

Country Link
CN (1) CN112992174A (zh)
WO (1) WO2022166220A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992174A (zh) * 2021-02-03 2021-06-18 深圳壹秘科技有限公司 一种语音分析方法及其语音记录装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100268537A1 (en) * 2009-04-17 2010-10-21 Saudi Arabian Oil Company Speaker verification system
CN108288470A (zh) * 2017-01-10 2018-07-17 富士通株式会社 基于声纹的身份验证方法和装置
CN108922538A (zh) * 2018-05-29 2018-11-30 平安科技(深圳)有限公司 会议信息记录方法、装置、计算机设备及存储介质
CN111341326A (zh) * 2020-02-18 2020-06-26 RealMe重庆移动通信有限公司 语音处理方法及相关产品
CN112992174A (zh) * 2021-02-03 2021-06-18 深圳壹秘科技有限公司 一种语音分析方法及其语音记录装置

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968990B (zh) * 2012-11-15 2015-04-15 朱东来 说话人识别方法和系统
CN103226951B (zh) * 2013-04-19 2015-05-06 清华大学 基于模型顺序自适应技术的说话人确认系统创建方法
CN104143326B (zh) * 2013-12-03 2016-11-02 腾讯科技(深圳)有限公司 一种语音命令识别方法和装置
CN105489221B (zh) * 2015-12-02 2019-06-14 北京云知声信息技术有限公司 一种语音识别方法及装置
CN107545889B (zh) * 2016-06-23 2020-10-23 华为终端有限公司 适用于模式识别的模型的优化方法、装置及终端设备
CN108305633B (zh) * 2018-01-16 2019-03-29 平安科技(深圳)有限公司 语音验证方法、装置、计算机设备和计算机可读存储介质
CN110544488B (zh) * 2018-08-09 2022-01-28 腾讯科技(深圳)有限公司 一种多人语音的分离方法和装置
CN111179940A (zh) * 2018-11-12 2020-05-19 阿里巴巴集团控股有限公司 一种语音识别方法、装置及计算设备
US11580325B2 (en) * 2019-01-25 2023-02-14 Yahoo Assets Llc Systems and methods for hyper parameter optimization for improved machine learning ensembles
CN110689523A (zh) * 2019-09-02 2020-01-14 西安电子科技大学 基于元学习个性化图像信息评价方法、信息数据处理终端
CN110991661A (zh) * 2019-12-20 2020-04-10 北京百度网讯科技有限公司 用于生成模型的方法和装置
CN111353610A (zh) * 2020-02-28 2020-06-30 创新奇智(青岛)科技有限公司 一种模型参数确定方法、装置、存储介质及电子设备
CN111326168B (zh) * 2020-03-25 2023-08-22 合肥讯飞数码科技有限公司 语音分离方法、装置、电子设备和存储介质
CN111931991A (zh) * 2020-07-14 2020-11-13 上海眼控科技股份有限公司 气象临近预报方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
CN112992174A (zh) 2021-06-18

Similar Documents

Publication Publication Date Title
US11727918B2 (en) Multi-user authentication on a device
WO2019080639A1 (zh) 一种对象识别方法、计算机设备及计算机可读存储介质
CN112074901A (zh) 语音识别登入
EP1704668B1 (en) System and method for providing claimant authentication
US9589560B1 (en) Estimating false rejection rate in a detection system
CN110178178A (zh) 具有环境自动语音识别(asr)的麦克风选择和多个讲话者分割
US20220130395A1 (en) Voice-Controlled Management of User Profiles
WO2021051608A1 (zh) 一种基于深度学习的声纹识别方法、装置及设备
Korshunov et al. Impact of score fusion on voice biometrics and presentation attack detection in cross-database evaluations
US20230401338A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
KR102655791B1 (ko) 화자 인증 방법, 화자 인증을 위한 학습 방법 및 그 장치들
EP3501024B1 (en) Systems, apparatuses, and methods for speaker verification using artificial neural networks
US20240013784A1 (en) Speaker recognition adaptation
WO2022166220A1 (zh) 一种语音分析方法及其语音记录装置
CN116508097A (zh) 说话者识别准确度
CN113889091A (zh) 语音识别方法、装置、计算机可读存储介质及电子设备
WO2021027555A1 (zh) 一种人脸检索方法及装置
WO2018001125A1 (zh) 一种音频识别方法和装置
JP7453733B2 (ja) マルチデバイスによる話者ダイアライゼーション性能の向上のための方法およびシステム
CN115713939B (zh) 语音识别方法、装置及电子设备
CN106373576B (zh) 一种基于vq和svm算法的说话人确认方法及其系统
CN115547345A (zh) 声纹识别模型训练及相关识别方法、电子设备和存储介质
US11676608B2 (en) Speaker verification using co-location information
WO2022166219A1 (zh) 一种语音区分方法及其语音记录装置
CN112382296A (zh) 一种声纹遥控无线音频设备的方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21924217

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21924217

Country of ref document: EP

Kind code of ref document: A1