WO2020211354A1 - Speaker identity recognition method and device based on speech content, and storage medium - Google Patents

Speaker identity recognition method and device based on speech content, and storage medium

Info

Publication number
WO2020211354A1
Authority
WO
WIPO (PCT)
Prior art keywords
text information
identity
speaker
target
speech
Prior art date
Application number
PCT/CN2019/117903
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
孙奥兰
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020211354A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches

Definitions

  • This application relates to the technical field of speech signal processing, and in particular to a method, device and computer-readable storage medium for speaker identification based on the content of speech.
  • Research shows that although voiceprints do not exhibit individual differences as obvious as those of fingerprints or faces, each person's vocal tract, oral cavity, and nasal cavity (the organs used in pronunciation) differ from person to person, and these differences are reflected in the voice. For example, when answering the phone, we can accurately tell who is calling from a single "Hello": the human ear, as a receiver of the body, is naturally able to distinguish voices. By technical means, voiceprints can likewise serve, like faces and fingerprints, as important information for personal identity authentication.
  • Voiceprint Recognition (VPR), also known as Speaker Recognition, includes two types: speaker identification and speaker verification (Speaker Verification). The former determines which of several people uttered a given speech segment, a "one-of-many" choice; the latter confirms whether a given speech segment was spoken by a designated person, a "one-to-one" decision. Speaker recognition is the process of accepting or rejecting a claimed speaker identity given the speaker's voice, and is widely used in banking systems, financial commerce, and voice security control.
  • For this reason, speaker recognition technology has gradually developed and become widespread, especially in security verification and telephone banking. The technology is intended for single-channel, single-speaker scenarios: the voice of a single customer is input to obtain a good verification result. In customer-oriented companies, speaker recognition can help customers resolve urgent needs and receive personalized service, and can also support precision marketing.
  • However, the applicant has realized that most existing products in the industry are based on the speaker's voiceprint; this approach works well when the two parties in a conversation are of different genders, but relatively poorly when they are of the same gender.
  • For example, on a telephone customer service platform, the conversation between the customer and the agent is recorded on a single channel, so the customer's identity cannot be verified directly from the recording with speaker verification technology. This makes telephone customer service inefficient and wastes considerable manpower and material resources.
  • This application provides a speaker identification method, device, and computer-readable storage medium based on speech content. Its main purpose is to convert recorded conversation audio into text information with automatic speech recognition, use a deep-learning classification method to label each segment as customer or customer service, and finally splice the customer audio segments together and verify the identity of the spliced audio. In application scenarios such as telesales, where the customer and the agent say systematically different things, this enables speaker identification and verification based on speech content, improves the accuracy of the identity verification process, makes the method applicable to telephone customer service, and saves manpower and material resources.
  • To achieve the above objective, this application provides a speaker identification method based on speech content, applied to an electronic device, the method including: collecting an initial voice signal, where the initial voice signal contains the speech content of the target to be confirmed; converting the initial voice signal into text information corresponding to the speech content through speech recognition technology; identifying the speaker's identity according to the text information and obtaining the text information fragment corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed; obtaining the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splicing them to obtain the target voice signal; and confirming the identity of the target to be confirmed according to the target voice signal.
  • To achieve the above objective, the present application also provides an electronic device, which includes a memory, a processor, and a camera device. The memory stores a speaker identification program based on speech content, and when this program is executed by the processor, the steps of the aforementioned speaker identification method based on speech content are implemented.
  • To achieve the above objective, this application also provides a speaker identification system based on speech content, including:
  • a voice signal collection unit, configured to collect an initial voice signal, where the initial voice signal includes the speech content of at least two targets to be confirmed;
  • a text information conversion unit, configured to convert the initial voice signal into text information corresponding to the speech content through speech recognition technology;
  • a text information fragment acquisition unit, configured to identify the speaker's identity according to the text information and obtain the text information fragment corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed;
  • a target voice signal acquisition unit, configured to obtain the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splice them to obtain the target voice signal;
  • an identity confirmation unit, configured to confirm the identity of the target to be confirmed according to the target voice signal.
  • To achieve the above objective, the present application also provides a computer-readable storage medium, which includes a speaker identification program based on speech content; when this program is executed by a processor, the steps of the speaker identification method based on speech content described above are realized.
  • The speaker identification method, device, and computer-readable storage medium based on speech content proposed in this application convert recorded dialogue audio into text information using automatic speech recognition, use a deep-learning classification method to label the text as target or non-target, and finally splice the target audio segments and verify the identity of the spliced audio. In application scenarios such as telesales, where the speech content of the customer and the agent differs, speaker identification and verification based on speech content can improve the accuracy of the identity verification process.
  • FIG. 1 is a flowchart of a specific embodiment of the speaker identification method based on speech content of this application;
  • FIG. 2 is a schematic diagram of identifying targets from the converted text information in this application;
  • FIG. 3 is a flowchart of identifying targets from the converted text information in FIG. 2;
  • FIG. 4 is a schematic diagram of DNN-based speaker identity confirmation according to an embodiment of this application;
  • FIG. 5 is a schematic diagram of GMM-based speaker identity confirmation according to an embodiment of this application;
  • FIG. 6 shows the logical structure of a speaker identification system based on speech content according to an embodiment of this application;
  • FIG. 7 is a schematic diagram of an application environment of a specific embodiment of the speaker identification method based on speech content of this application.
  • FIG. 1 is a flowchart of a specific embodiment of speaker identification based on speech content in this application. The electronic device that executes the method may be implemented by software and/or hardware.
  • In the embodiment shown in FIG. 1, the speaker identification method based on speech content includes the following steps.
  • Step S110: Collect an initial voice signal, where the initial voice signal includes the speech content of at least two targets to be confirmed.
  • Here, the initial voice signal is the dialogue voice signal of at least two speakers. The collection of the initial voice signal mainly targets the speakers' voice signals during telephone communication: if the call is a voice call between only two people, there are two targets to be confirmed. The speaker identification program based on speech content provided in this application also applies to multi-party calls; in that case the initial voice signal contains the speech content of multiple targets to be confirmed. The specific implementations are similar and are not repeated here.
  • In addition, the trigger point for collecting voice signal data differs with the application scenario. For example, when the speaker identification program based on speech content is installed on a mobile terminal, collection can be triggered by a key or a start button set on the terminal. The collected voice signal data constitutes the initial voice signal and serves as the initial voice signal required for subsequent identity recognition.
  • Step S120: Convert the collected initial voice signal into text information corresponding to the speech content through ASR (Automatic Speech Recognition).
  • As an example, when the speakers are a customer and a customer service agent, converting the collected initial voice signal into the corresponding text information through ASR proceeds as follows: first, the initial voice signal is divided into multiple short voice segments using a Subspace Gaussian Mixture Model (SGMM) and Voice Activity Detection (VAD). Short segments make the text conversion by ASR easier, and the segmentation parameters can be set according to the ASR system. Then, each voice segment is converted into text information through ASR.
  • Specifically, the SGMM-VAD algorithm can be composed of two Gaussian Mixture Models (GMMs), which describe the log-normal distributions of speech and non-speech respectively and detect speech segments from audio mixed with a high proportion of noise.
  • Voice Activity Detection (VAD) is also called voice endpoint detection or voice boundary detection. Its purpose is to identify and eliminate long silent periods from the voice signal stream, saving voice channel resources without reducing service quality; it is an important part of IP telephony applications. Silence suppression saves valuable bandwidth and helps reduce the end-to-end delay perceived by users. A minimal sketch of GMM-based activity detection in this spirit follows.
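The following Python sketch illustrates GMM-based voice activity detection in the spirit of the SGMM-VAD step above; it is an assumption-laden stand-in, not the patent's algorithm. Real systems use richer features than per-frame log-energy and smooth the frame decisions.

```python
# Illustrative GMM-based VAD: two Gaussians model the speech / non-speech
# energy distributions; consecutive speech frames are merged into segments.
# Assumes 16 kHz mono audio in a float NumPy array (a simplification).
import numpy as np
from sklearn.mixture import GaussianMixture

def vad_segments(signal, sr=16000, frame_ms=25, hop_ms=10):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + (len(signal) - frame) // hop
    # Per-frame log-energy as a crude 1-D feature.
    feats = np.array([np.log(np.sum(signal[i*hop:i*hop+frame]**2) + 1e-10)
                      for i in range(n)]).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(feats)
    speech = int(np.argmax(gmm.means_))        # louder component = speech
    labels = gmm.predict(feats) == speech
    segments, start = [], None                 # collapse frames into spans
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i * hop
        elif not is_speech and start is not None:
            segments.append((start, i * hop + frame))
            start = None
    if start is not None:
        segments.append((start, len(signal)))
    return segments                            # [(start_sample, end_sample)]
```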
  • The steps of converting each voice segment through ASR may include:
  • First, construct the ASR model. The ASR model uses 83-dimensional features in total, of which 80 dimensions are log-FBANK front-end features with a frame length of 25 ms, and the other 3 dimensions are pitch features (including the probability-of-voicing (POV) feature).
  • At the same time, create the LC-BHLSTM (Latency-controlled Bidirectional Highway Long Short-Term Memory) model. The LC-BHLSTM model has 5 layers with 1024 memory cells each, and each layer's output is a projection of 512 nodes.
  • Second, input the segmented voice segments into the ASR model, which represents each voice segment as a multi-dimensional (here, 83-dimensional) feature output. Then input the output of the ASR model into the LC-BHLSTM model, whose output target values are 10k-dimensional context-dependent triphone states (senones), finally completing the conversion of the voice segments into dialogue text information.
  • LSTM (Long Short-Term Memory) is a recurrent neural network suited to processing and predicting events with relatively long intervals and delays in a time series. A rough sketch of the acoustic model shape described above follows.
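The following PyTorch sketch mirrors only the stated shape of the model: 83-dimensional input frames, five bidirectional LSTM layers with 1024 cells and 512-node projections, and a 10k-way output over context-dependent triphone states. The latency control and highway connections of LC-BHLSTM are omitted, so this is an approximation, not the patent's network.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, feat_dim=83, cells=1024, proj=512, n_states=10000):
        super().__init__()
        # 5 bidirectional LSTM layers, 1024 cells, 512-node projections.
        self.blstm = nn.LSTM(input_size=feat_dim, hidden_size=cells,
                             num_layers=5, proj_size=proj,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * proj, n_states)  # forward + backward halves

    def forward(self, frames):          # frames: (batch, time, 83)
        h, _ = self.blstm(frames)       # (batch, time, 2 * 512)
        return self.out(h)              # per-frame triphone-state logits

model = AcousticModel()
logits = model(torch.randn(2, 100, 83))  # two utterances of 100 frames each
```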
  • Step S130: Identify the identity of the target to be confirmed, i.e. the speaker, according to the above text information, and obtain the text information fragment corresponding to each target to be confirmed, where the speaker is one of the targets to be confirmed.
  • The step of identifying the speaker's identity according to the text information may include: first, obtaining a deep-learning classification model trained on a training set built from a corpus; second, inputting the text information into the model so that corresponding labels are assigned to it.
  • Further, the step of identifying the speaker's identity according to the text information may include: 1. building a training set from a corpus, with "target"/"non-target" labels tagged manually in the training phase; 2. training a deep-learning classification model on the training set; 3. inputting the text information into the trained model, which assigns a "target" or "non-target" label to it.
  • Specifically, the training set is built from a corpus, with "customer"/"customer service" (i.e. "target"/"non-target") labels marked manually in the training phase, and a deep-learning classification model is then trained on it. The dialogue text information is input into the deep-learning classification model, which assigns "customer" or "customer service" labels to each text fragment. Finally, the customer voice information corresponding to the recognized customer text data is located and spliced into the customer voice.
  • In the process of identifying the speaker's identity, the quality of the customer voice is very important. The customer voice must therefore be extracted completely from the customer-agent dialogue and input into the subsequent stage for speaker verification.
  • Currently, telephone customer service platform data has the following characteristics:
  • First, the recorded speech contains only two speakers, the agent and the customer, and it is the customer's voice whose identity awaits verification. This application therefore uses a two-class method to distinguish agent from customer.
  • Second, the two speakers' voices may be similar, but the content of their speech differs. Telephone customer service speech is mostly scripted content introducing products in the relevant field and therefore contains many professional terms, while customers answer or call mainly to ask about related issues, in relatively plain, everyday language with fewer professional terms. These professional-term keywords can therefore be used as features of the classification model to train the two-class model; this method is called "keyword matching".
  • Finally, the audio corresponding to the recognized customer text fragments is spliced into the customer voice for later speaker verification; a sketch of the classification step appears below.
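A minimal sketch of the two-class customer/agent text classification. The patent describes a deep-learning classifier over professional-term keyword features; here a TF-IDF bag of terms with logistic regression stands in to show the shape of the pipeline, and the training texts, labels, and segments are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Manually tagged transcripts: scripted, term-heavy agent speech vs.
# plain customer questions (placeholder data).
train_texts = ["this fund's annualized rate of return and redemption terms",
               "I want to ask about my account balance"]
train_labels = ["agent", "customer"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Tag each transcribed segment, then keep only the customer segments
# whose audio will be spliced for verification.
segments = ["could you explain the early-redemption penalty"]
customer_segments = [s for s in segments
                     if clf.predict([s])[0] == "customer"]
```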
  • FIG. 2 shows the principle of identifying a target from the converted text information according to an embodiment of this application, and FIG. 3 shows the corresponding process. As shown in FIG. 3, the steps of identifying the target from the converted text information are as follows:
  • S310: The system builds a search engine over the training set, extracts Chinese word segmentations of the text information through the search engine, and builds an index over these texts.
  • S320: Feed the dialogue text information into the deep-learning classification model for training, and obtain the K texts most relevant to the dialogue text information.
  • S330: Vote on the category of the dialogue text information according to the K-NN algorithm.
  • The neighbor algorithm, also called the K-Nearest Neighbor (K-NN) classification algorithm, is one of the simplest methods in data-mining classification. "K nearest neighbors" means that each sample can be represented by its k closest neighbors.
  • The core idea of the K-NN algorithm is that if the majority of a sample's k nearest neighbors in feature space belong to a certain category, the sample also belongs to that category and shares the characteristics of samples in that category. The classification decision depends only on the categories of the nearest one or few samples. Because K-NN relies mainly on a limited number of surrounding neighbors rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains cross or overlap heavily. A sketch of this retrieve-and-vote step follows.
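An illustrative retrieve-and-vote implementation of steps S310 to S330, under the assumption that TF-IDF cosine similarity stands in for the search engine and index described above; the texts and labels are placeholders.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_texts = ["our wealth-management products include the following terms",
               "how do I reset the password on my account"]
train_labels = ["agent", "customer"]

vec = TfidfVectorizer()
index = vec.fit_transform(train_texts)   # plays the role of the text index

def knn_label(segment, k=1):
    # S320: retrieve the K training texts most relevant to the segment.
    sims = cosine_similarity(vec.transform([segment]), index)[0]
    top_k = sims.argsort()[::-1][:k]
    # S330: majority vote over the labels of the K retrieved texts.
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]

print(knn_label("how do I reset the password"))   # -> "customer"
```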
  • After the text information fragments corresponding to each target to be confirmed have been obtained, the method proceeds to step S140: the voice signal segments corresponding to the target to be confirmed are obtained according to the text information fragments and spliced to obtain the target voice signal.
  • A voice signal segment here can also be understood as a voice segment: in step S120 the initial voice signal was divided into multiple voice segments, and the text information now identifies which of those segments belong to the speaker whose identity needs to be confirmed; these are spliced as sketched below.
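A small sketch of the splicing step, assuming each segment attributed to the target carries its (start, end) sample span in the original recording (an assumed representation).

```python
import numpy as np

def splice_target(signal, target_spans):
    # target_spans: [(start_sample, end_sample), ...] for segments whose
    # transcript was labeled as the target speaker (e.g. "customer").
    return np.concatenate([signal[s:e] for s, e in target_spans])

# Hypothetical usage: spans kept by the classifier are joined into one
# waveform for speaker verification.
# target_voice = splice_target(signal, [(0, 16000), (48000, 80000)])
```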
  • Finally, step S150: confirm the identity of the target to be confirmed according to the acquired target voice signal.
  • The step of confirming the target identity from the acquired target voice signal may proceed in either of two ways. The first way uses an i-vector system based on the deep neural network model DNN to confirm the identity of the target speaker, i.e. the target to be confirmed. The second way uses an i-vector system based on the Gaussian mixture model GMM for the same purpose.
  • FIG. 4 shows the principle of the first way, the DNN-based i-vector system for confirming the target speaker's identity, where DNN is a deep neural network, UBM is a universal background model, DFNN is a Dynamic Fuzzy Neural Network, LSTM is Long Short-Term Memory, and TDNN is a Time Delay Neural Network.
  • Step 1: Feature extraction, which prepares for collecting sufficient statistics, extracting i-vectors, and scoring. This process converts speech waveforms into feature vectors (common parameterizations: MFCC (Mel-frequency cepstral coefficients), LPCC (linear prediction cepstral coefficients), and PLP (perceptual linear prediction)), filters noise from the given speech signal, and retains the useful speaker information.
  • Step 2: Collect sufficient statistics. Based on VAD, zeroth-, first-, and second-order Baum-Welch statistics are computed from the sequence of feature vectors. These statistics are high-dimensional information generated from a large-scale DNN, also known as the UBM.
  • Step 3: i-vector extraction converts the above high-dimensional statistics into a single low-dimensional feature vector that contains only the discriminative information distinguishing this speaker from others.
  • Step 4: After the i-vector is extracted, scoring criteria (commonly cosine-distance similarity, LDA (Linear Discriminant Analysis), and PLDA (Probabilistic Linear Discriminant Analysis)) determine whether to accept or reject the claimed customer identity; the simplest of these is sketched below.
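A sketch of the final accept/reject decision using cosine similarity between i-vectors, the simplest of the scoring criteria listed above (an LDA or PLDA back-end would replace it in practice). The i-vector dimension and threshold are placeholders.

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    # Cosine similarity between enrollment and test i-vectors.
    return float(np.dot(enroll_ivec, test_ivec) /
                 (np.linalg.norm(enroll_ivec) * np.linalg.norm(test_ivec)))

enroll = np.random.randn(400)   # claimed speaker's enrollment i-vector
test = np.random.randn(400)     # i-vector from the spliced customer audio
THRESHOLD = 0.5                 # tuned on held-out trials in practice

decision = "accept" if cosine_score(enroll, test) >= THRESHOLD else "reject"
```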
  • FIG. 5 shows the principle of the GMM-based i-vector system for confirming the target speaker's identity, where GMM is a Gaussian mixture model and the meanings of MFCC and PLP are as explained for FIG. 4. The second way is similar to the feature extraction process of the first way and is not repeated here.
  • In summary, the speaker identification method based on speech content converts the dialogue audio into text information using automatic speech recognition, uses a deep-learning classification method to label the text as target or non-target, and finally splices the target audio segments and verifies the identity of the spliced audio. In application scenarios such as telesales, where the speech content of the customer and the agent differs, speakers can thus be identified and verified from what they say, effectively improving the accuracy of the identity verification process.
  • FIG. 6 shows the logical structure of a speaker identification system based on speech content according to an embodiment of the present application.
  • The speaker identification system 600 based on speech content provided by the present application includes a voice signal collection unit 610, a text information conversion unit 620, a text information fragment acquisition unit 630, a target voice signal acquisition unit 640, and an identity confirmation unit 650.
  • The functions implemented by the voice signal collection unit 610, text information conversion unit 620, text information fragment acquisition unit 630, target voice signal acquisition unit 640, and identity confirmation unit 650 correspond one-to-one to the steps of the speaker identification method based on speech content in the above embodiment.
  • The voice signal collection unit 610 is used to collect the initial voice signal;
  • the text information conversion unit 620 is used to convert the initial voice signal collected by the voice signal collection unit 610 into text information corresponding to the speech content through speech recognition technology;
  • the text information fragment acquisition unit 630 is configured to identify the speaker's identity according to the text information converted by the text information conversion unit 620 and obtain the text information fragment corresponding to each target to be confirmed, the speaker being one of the multiple targets to be confirmed;
  • the target voice signal acquisition unit 640 is configured to obtain the voice signal segments corresponding to the target to be confirmed according to the text information fragments acquired by the text information fragment acquisition unit 630 and splice them to obtain the target voice signal;
  • the identity confirmation unit 650 is used to confirm the identity of the target to be confirmed according to the target voice signal obtained by the target voice signal acquisition unit 640.
  • The text information conversion unit 620 further includes: a voice segment segmentation unit 621, configured to segment the initial voice signal into voice segments using the subspace Gaussian mixture model and voice activity detection technology; and a voice segment conversion unit 622, configured to convert each voice segment into text information using speech recognition technology.
  • The voice segment conversion unit 622 includes a model construction unit and a model processing unit (not shown in the figure). The model construction unit is used to construct the speech recognition model and the latency-controlled bidirectional highway long short-term memory network model LC-BHLSTM. The model processing unit is used to input each voice segment into the speech recognition model, which represents it as a multi-dimensional feature output, and to input the output of the speech recognition model into the LC-BHLSTM model to obtain the text information corresponding to each voice segment.
  • The speech recognition model constructed by the model construction unit uses 83-dimensional features, 80 of which are log-FBANK front-end features with a frame length of 25 ms and the other 3 of which are pitch features; the latency-controlled bidirectional highway long short-term memory network model it constructs has 5 layers with 1024 memory cells each, and each layer outputs a projection of 512 nodes.
  • The text information fragment acquisition unit 630 further includes: a deep-learning classification model acquisition unit 631, configured to acquire a deep-learning classification model trained on a training set, where the training set is built from a corpus; and a label assignment unit 632, configured to input the text information into the deep-learning classification model and assign corresponding labels to the text information.
  • The identity confirmation unit 650 further includes a first identity confirmation unit 651, configured to use an i-vector system based on a deep neural network model to confirm the identity of the target to be confirmed.
  • The first identity confirmation unit 651 includes a feature vector conversion unit, a high-dimensional information generation unit, a low-dimensional feature vector conversion unit, and an identity evaluation unit (not shown in the figure). The feature vector conversion unit is used to convert the speech waveform into feature vectors, from which sufficient statistics, i-vectors, and scores are subsequently derived; the high-dimensional information generation unit is used to compute zeroth-, first-, and second-order Baum-Welch statistics from the sequence of feature vectors to generate high-dimensional information; the low-dimensional feature vector conversion unit is used to convert the high-dimensional statistics into a single low-dimensional feature vector that contains only the discriminative information distinguishing this speaker from others; and the identity evaluation unit is used to decide, using preset scoring criteria, whether to accept or reject the speaker's identity information.
  • The identity confirmation unit 650 further includes a second identity confirmation unit (not shown in the figure), configured to use an i-vector system based on a Gaussian mixture model to confirm the identity of the target to be confirmed.
  • Through the voice signal collection unit, text information conversion unit, text information fragment acquisition unit, target voice signal acquisition unit, and identity confirmation unit, the speaker identification system based on speech content converts dialogue audio into text information with automatic speech recognition, labels the text as target or non-target, and finally splices the target audio segments and verifies the identity of the spliced audio. In application scenarios where speech content differs between speakers, it identifies and verifies speakers from what they say, effectively improving the accuracy of identity verification.
  • FIG. 7 is a schematic diagram of an application environment of a specific embodiment of the speaker identification method based on speech content of this application.
  • The electronic device 1 that implements the aforementioned speaker identification method based on speech content may be a terminal device with computing capability, such as a server, smartphone, tablet computer, portable computer, or desktop computer.
  • the electronic device 1 includes a processor 12, a memory 11, a network interface 14 and a communication bus 15.
  • the memory 11 includes at least one type of readable storage medium.
  • The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, or card-type memory.
  • The readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk; it may also be an external storage device of the electronic device 1, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or flash card equipped on the electronic device 1.
  • The readable storage medium of the memory 11 is generally used to store the speech-content-based speaker identification program 10 installed in the electronic device 1, and the like.
  • the memory 11 can also be used to temporarily store data that has been output or will be output.
  • The processor 12 may be a central processing unit (CPU), microprocessor, or other data processing chip, used to run the program code or process the data stored in the memory 11, for example the speech-content-based speaker identification program 10.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the communication bus 15 is used to realize the connection and communication between these components.
  • Fig. 7 only shows the electronic device 1 with components 11-15, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the electronic device 1 may also include a user interface.
  • The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or other device with a voice recognition function, and a voice output device such as a speaker or earphones.
  • the user interface may also include a standard wired interface and a wireless interface.
  • the electronic device 1 may also include a display, and the display may also be called a display screen or a display unit.
  • The display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like.
  • the display is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the electronic device 1 further includes a touch sensor.
  • the area provided by the touch sensor for the user to perform a touch operation is called a touch area.
  • the touch sensor described here may be a resistive touch sensor, a capacitive touch sensor, or the like.
  • the touch sensor includes not only a contact type touch sensor, but also a proximity type touch sensor and the like.
  • the touch sensor may be a single sensor, or may be, for example, a plurality of sensors arranged in an array.
  • the area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor.
  • The display and the touch sensor are stacked to form a touch display screen, on which the device detects touch operations triggered by the user.
  • Optionally, the electronic device 1 may also include a radio frequency (RF) circuit, sensors, an audio circuit, and the like, which are not described here again.
  • The memory 11, as a computer storage medium, may include an operating system and the speech-content-based speaker identification program 10; when the processor 12 executes the speaker identification program 10 stored in the memory 11, the steps of the aforementioned speaker identification method based on speech content are implemented.
  • The electronic device 1 proposed in the above embodiments reduces the acoustic-model modeling required and uses a two-class algorithm to improve the model's recognition performance across speaker-gender scenarios. A complete identity verification and recognition framework is proposed, which solves the problem of customer verification in single-channel, two- or multi-speaker scenarios with high speaker recognition accuracy at high speed.
  • In other embodiments, the speech-content-based speaker identification program 10 may also be divided into one or more modules, which are stored in the memory 11 and executed by the processor 12 to complete this application. A module referred to in this application is a series of computer program instruction segments capable of completing a specific function; specifically, a module here implements the steps of the aforementioned speaker identification method based on speech content, or corresponds to a unit of the aforementioned speaker identification system based on speech content.
  • An embodiment of the present application also proposes a computer-readable storage medium that includes a speech-content-based speaker identification program. When this program is executed by a processor, the operations of each step of the aforementioned speaker identification method based on speech content, or the functions of each unit of the aforementioned speaker identification system based on speech content, are realized. To avoid repetition, they are not described here again.

Abstract

A speaker identity recognition method and device based on speech content, and a storage medium. The method comprises: acquiring an initial speech signal, wherein the initial speech signal comprises speech content of at least two targets to be determined (S110); converting the initial speech signal into text information corresponding to the speech content by means of speech recognition technology (S120); recognizing the identity of a speaker according to the text information, and obtaining text information fragments corresponding to the targets to be determined, wherein the speaker is one of the targets to be determined (S130); obtaining, according to the text information fragments, speech signal segments corresponding to the targets to be determined and splicing them to obtain a target speech signal (S140); and determining, according to the target speech signal, the identity of the targets to be determined (S150). The identity of a speaker is recognized and verified on the basis of the speech content; thus, the accuracy of the identity verification process can be improved, application of the invention in telephone customer services can be achieved, and labor and material resources are saved.

Description

Speaker identification method, device, and storage medium based on speech content
This application claims priority to the Chinese patent application with application number 201910305438.3, filed on April 16, 2019, and entitled "Speaker identification method, device and storage medium based on speech content".
技术领域Technical field
本申请涉及语音信号处理技术领域,尤其涉及一种基于说话内容的说话者身份识别方法、装置及计算机可读存储介质。This application relates to the technical field of speech signal processing, and in particular to a method, device and computer-readable storage medium for speaker identification based on the content of speech.
背景技术Background technique
根据研究表明,声纹虽然不如指纹、人脸这样,个体差异明显,但是由于每个人的声道、口腔和鼻腔(发音要用到的器官)也具有个体差异性。因为反映到声音上,也是具有差异性的。就比如说,当我们在接电话的时候,通过一声"喂",我们就能准确的分辨出接电话的是谁,我们人耳作为身体的接收器生来就具有分辨声音的能力,那么我们也可以通过技术的手段,使声纹也可以向人脸、指纹那样作为“个人身份认证”的重要信息。According to research, although the voiceprint is not as good as fingerprints and faces, the individual differences are obvious, but because each person's vocal tract, oral cavity and nasal cavity (the organs used for pronunciation) also have individual differences. Because it is reflected in the sound, it is also different. For example, when we are answering the phone, we can accurately tell who is answering the phone by saying "Hello". Our human ears, as receivers of the body, are born with the ability to distinguish sounds, so we too Through technical means, voiceprints can also be used as important information for "personal identity authentication" like human faces and fingerprints.
声纹识别(Voiceprint Recognition,VPR),也称为说话人识别(Speaker Recognition),包括两类,即说话人辨认(Speaker Identification)和说话人确认(Speaker Verification)。前者用以判断某段语音是若干人中的哪一个所说的,是“多选一”问题;而后者用以确认某段语音是否是指定的某个人所说的,是“一对一判别”问题。说话人识别是给定说话者语音信息,以接受或拒绝说话者身份的过程,被广泛应用在银行系统,金融商业和语音安全控制中。Voiceprint recognition (Voiceprint Recognition, VPR), also known as speaker recognition (Speaker Recognition), includes two types, namely speaker identification and speaker verification (Speaker Verification). The former is used to determine which one of several people said a certain speech, which is a "multiple choice" question; while the latter is used to confirm whether a certain speech is spoken by a designated person, which is a "one-to-one discrimination" "problem. Speaker recognition is the process of accepting or rejecting the speaker's identity given the speaker's voice information. It is widely used in banking systems, financial commerce and voice security control.
为此,说话人识别技术逐渐发展并得到普及,尤其在安全验证、电话银行中得到广泛应用。该技术要求在单信道-单一说话者情景下应用,即输入单一客户的语音信息,能够获得较好的验证效果。在客户导向的企业中,说话人识别能够帮助客户解决紧急需要,并获得个性化服务,也可以帮助实现精准营销。但是,申请人意识到,现有业内产品多为基于说话者声纹的识别,但这种方法在对话双方性别不同时效果较好,性别相同时,效果相对差。For this reason, speaker recognition technology has gradually developed and been popularized, especially in security verification and telephone banking. This technology is required to be applied in the single-channel-single-speaker scenario, that is, to input the voice information of a single customer to obtain a better verification effect. In customer-oriented companies, speaker recognition can help customers solve urgent needs and obtain personalized services, and can also help achieve precision marketing. However, the applicant realizes that the existing products in the industry are mostly based on the speaker's voiceprint recognition, but this method works better when the two sides of the conversation are of different genders, and when the gender is the same, the effect is relatively poor.
例如,在电话客户服务平台上,在电话录音的单一信道上记录的是客户与客服的对话音频,不能够直接通过说话人验证技术对电话录音信息进行客户身份验证,导致电话客户服务效率低,浪费大量的人力物力。For example, on the telephone customer service platform, the audio of the conversation between the customer and the customer service is recorded on a single channel of the telephone recording. It is not possible to directly verify the identity of the customer on the telephone recording information through the speaker verification technology, resulting in low telephone customer service efficiency. Waste a lot of manpower and material resources.
因此,为了解决上述问题,亟需一种能够通过音频交互对说话人身份进行验证的技术。Therefore, in order to solve the above-mentioned problems, there is an urgent need for a technology that can verify the identity of the speaker through audio interaction.
发明内容Summary of the invention
本申请提供一种基于说话内容的说话者身份识别方法、装置及计算机可读存储介质,其主要目的在于通过将录制的对话音频用自动语音识别技术转换为文字信息,然后使用深度学习分类方法进行客户或客服的身份识别,最后,对客户音频片段进行拼接及对拼接后的音频片段进行身份验证,能够根据电话销售中客户与客服说话内容存在差异的应用场景,基于说话内容进行说话人识别及验证,提高身份验证过程中的准确率,实现其在电话客户服务中的应用,节省人力物力。This application provides a speaker identification method, device, and computer-readable storage medium based on the content of speech. The main purpose of the method is to convert the recorded conversation audio into text information using automatic speech recognition technology, and then use a deep learning classification method to perform Identification of the customer or customer service, and finally, splicing the customer audio clips and verifying the identity of the spliced audio clips. According to the application scenarios where there is a difference between the customer and the customer service in the telesales, the speaker identification and the speaker can be based on the speech content. Verification, improve the accuracy of the identity verification process, realize its application in telephone customer service, and save manpower and material resources.
为实现上述目的,本申请提供一种基于说话内容的说话者身份识别方法,应用于电子装置,所述方法包括:In order to achieve the above objective, this application provides a speaker identification method based on speaking content, which is applied to an electronic device, and the method includes:
采集初始语音信号,其中,所述初始语音信号包含待确认目标的说话内容;Collecting an initial voice signal, where the initial voice signal contains the speech content of the target to be confirmed;
通过语音识别技术将所述初始语音信号转换为与所述说话内容对应的文本信息;Converting the initial voice signal into text information corresponding to the speaking content through a voice recognition technology;
根据所述文本信息对说话者身份进行识别,获取与各个待确认目标对应的文本信息片段,所述说话者为所述待确认目标其中之一;Recognizing the identity of the speaker according to the text information, and obtaining text information fragments corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed;
根据文本信息片段获取与所述待确认目标对应的语音信号段并进行拼接,获取目标语音信号;Acquire and splice the voice signal segment corresponding to the target to be confirmed according to the text information segment to obtain the target voice signal;
根据所述目标语音信号对所述待确认目标的身份进行确认。The identity of the target to be confirmed is confirmed according to the target voice signal.
为实现上述目的,本申请还提供一种电子装置,该电子装置包括:存储器、处理器及摄像装置,所述存储器中包括基于说话内容的说话者身份识别程序,所述基于说话内容的说话者身份识别程序被所述处理器执行前述基于说话内容的说话者身份识别方法的步骤。In order to achieve the above object, the present application also provides an electronic device, which includes a memory, a processor, and a camera device, the memory includes a speaker identification program based on the content of the speech, and the speaker based on the content of the speech The identity recognition program is executed by the processor in the steps of the aforementioned speaker identity recognition method based on speaking content.
为实现上述目的,本申请还提供一种基于说话内容的说话者身份识别系 统,包括:In order to achieve the above objectives, this application also provides a speaker identification system based on the content of the speech, including:
语音信号采集单元,用于采集初始语音信号,其中,所述初始语音信号包含至少两个待确认目标的说话内容;A voice signal collection unit, configured to collect an initial voice signal, wherein the initial voice signal includes the speech content of at least two targets to be confirmed;
文本信息转换单元,用于通过语音识别技术将所述初始语音信号转换为与所述说话内容对应的文本信息;A text information conversion unit, configured to convert the initial voice signal into text information corresponding to the speaking content through a voice recognition technology;
文本信息片段获取单元,用于根据所述文本信息对说话者身份进行识别,获取与各个待确认目标对应的文本信息片段,所述说话者为待确认目标其中之一;The text information fragment obtaining unit is configured to identify the speaker's identity according to the text information, and obtain the text information fragment corresponding to each target to be confirmed, and the speaker is one of the targets to be confirmed;
目标语音信号获取单元,用于根据文本信息片段获取与所述待确认目标对应的语音信号段并进行拼接,获取目标语音信号;The target voice signal acquiring unit is configured to acquire and splice the voice signal segment corresponding to the target to be confirmed according to the text information segment to acquire the target voice signal;
身份确认单元,用于根据所述目标语音信号对所述待确认目标的身份进行确认。The identity confirmation unit is used to confirm the identity of the target to be confirmed according to the target voice signal.
为实现上述目的,本申请还提供一种计算机可读存储介质,所述计算机可读存储介质中包括基于说话内容的说话者身份识别程序,所述基于说话内容的说话者身份识别程序被处理器执行时,实现如上所述的基于说话内容的说话者身份识别方法的步骤。In order to achieve the above objective, the present application also provides a computer-readable storage medium, which includes a speaker identification program based on speaking content, and the speaker identification program based on speaking content is processed by a processor When executed, the steps of the speaker identification method based on the speaking content as described above are realized.
本申请提出的基于说话内容的说话者身份识别方法、装置及计算机可读存储介质,将录制的对话音频用自动语音识别技术转换为文字信息,然后使用深度学习分类方法进行目标或非目标的身份识别,最后,对目标音频片段进行拼接及对拼接后的音频片段进行身份验证,能够根据电话销售中客户与客服说话内容存在差异的应用场景,基于说话内容进行说话人识别及验证,提高身份验证过程中的准确率。The speaker identification method, device and computer readable storage medium based on the content of speech proposed in this application convert the recorded dialogue audio into text information using automatic speech recognition technology, and then use the deep learning classification method to identify the target or non-target Recognition, and finally, splicing the target audio clips and verifying the identity of the spliced audio clips. According to the application scenarios where there is a difference between the content of the customer and the customer service in telesales, the speaker identification and verification based on the content of the speech can improve the identity verification The accuracy of the process.
附图说明Description of the drawings
图1为本申请基于说话内容的说话者身份识别方法具体实施例的流程图;FIG. 1 is a flowchart of a specific embodiment of a speaker identification method based on speaking content in this application;
图2为本申请根据转换后的文本信息对目标进行身份识别的原理图;Figure 2 is a schematic diagram of the application for identifying targets based on the converted text information;
图3为2中根据转换后的文本信息对目标进行身份识别的流程图;Figure 3 is a flowchart of identifying the target according to the converted text information in 2;
图4为根据本申请实施例的基于DNN的说话人身份确认原理图;Fig. 4 is a schematic diagram of speaker identity confirmation based on DNN according to an embodiment of the present application;
图5为根据本申请实施例的基于GMM的说话人身份确认原理图;Fig. 5 is a schematic diagram of speaker identity confirmation based on GMM according to an embodiment of the present application;
图6为根据本申请实施例的基于说话内容的说话者身份识别系统的逻辑 结构;Fig. 6 is a logical structure of a speaker identification system based on speaking content according to an embodiment of the present application;
图7为本申请基于说话内容的说话者身份识别方法具体实施例的应用环境示意图。FIG. 7 is a schematic diagram of an application environment of a specific embodiment of a speaker identification method based on speaking content according to the present application.
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
具体实施方式detailed description
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。It should be understood that the specific embodiments described here are only used to explain the application, and are not used to limit the application.
实施例一Example one
本申请提供一种基于说话内容的说话者身份识别方法,应用于一种电子装置。图1为本申请基于说话内容的说话者身份识别具体实施例的流程图。执行该方法的电子装置可以由软件和/或硬件实现。This application provides a speaker identification method based on speaking content, which is applied to an electronic device. Fig. 1 is a flowchart of a specific embodiment of speaker identification based on speaking content in this application. The electronic device that executes the method can be implemented by software and/or hardware.
在图1所示的实施例中,基于说话内容的说话者身份识别方法包括如下步骤:In the embodiment shown in FIG. 1, the speaker identification method based on speaking content includes the following steps:
步骤S110,采集初始语音信号其中,其中的初始语音信号包含至少两个待确认目标的说话内容。Step S110: Collect an initial voice signal, where the initial voice signal includes the speech content of at least two targets to be confirmed.
其中,该初始语音信号为至少两个说话者的对话语音信号。此处提到的采集初始语音信号,主要是针对电话沟通过程中说话人的语音信号,如果电话沟通为只有两个人进行语音通话的情况,待确认目标为两个,当能实现多人通话时,本申请提供的基于说话内容的说话者身份识别程序也可以适用于多人通话的情形,此时初始语音信号就会包含多个待确认目标的说话内容,具体实施方案是相似的,此处不再赘述。Wherein, the initial voice signal is a dialogue voice signal of at least two speakers. The initial voice signal collection mentioned here is mainly for the voice signal of the speaker during the telephone communication. If the telephone communication is a situation where there are only two people making a voice call, the target to be confirmed is two, when a multi-person call can be realized The speaker identification program based on the speaking content provided in this application can also be applied to the situation of multi-person calls. At this time, the initial voice signal will contain the speaking content of multiple targets to be confirmed. The specific implementation schemes are similar. No longer.
另外,针对应用场景的不同,对语音信号数据的采集触发点也存在不同,例如,当基于说话内容的说话者身份识别程序安装在移动终端上时,触发语音信号数据采集的可以为设定在移动终端上的按键,或者启动按钮等。而初始语音信号就是采集到的语音信号数据,该语音信号数据即可作为后续身份识别中所需要的初始语音信号。In addition, for different application scenarios, the trigger point for the collection of voice signal data is also different. For example, when a speaker identification program based on the content of the speech is installed on a mobile terminal, the trigger point for the collection of voice signal data can be set at Buttons on the mobile terminal, or start buttons, etc. The initial voice signal is the collected voice signal data, and the voice signal data can be used as the initial voice signal required in subsequent identity recognition.
步骤S120,通过ASR(Automatic Speech Recognition,语音识别技术)将所采集的初始语音信号转换为与说话内容对应的文本信息。In step S120, the collected initial voice signal is converted into text information corresponding to the content of the speech through ASR (Automatic Speech Recognition, voice recognition technology).
作为示例,当说话者分别为客户和客服时,通过ASR语音识别技术将所采集的初始语音信号转换为对应的文本信息的步骤包括:先通过子空间高斯混合模型SGMM(Subspace Gaussian Mixture Model,SGMM)和语音活动检测VAD(Voice Activity Detection,VAD),将初始语音信号分割为多个短小的语音片段,短小的语音片段能够便于ASR对其进行文本信息转换,此处的分割参数可以根据ASR进行设定;然后,通过ASR对各语音片段分别进行文本信息转换。As an example, when the speakers are the customer and the customer service, the steps of converting the collected initial voice signal into the corresponding text information through the ASR voice recognition technology include: first passing the Subspace Gaussian Mixture Model (SGMM) ) And voice activity detection VAD (Voice Activity Detection, VAD), which divides the initial voice signal into multiple short voice fragments. The short voice fragments can facilitate ASR to convert text information to it. The segmentation parameters here can be performed according to ASR Set; then, each voice segment is converted into text information through ASR.
具体地,SGMM-VAD算法可由两个GMM(Gaussian Mixed Model,GMM)组成,分别用来描述语音/非语音对数正态分布,从混有高比例噪声信号的语音中检测语音片段。Specifically, the SGMM-VAD algorithm can be composed of two Gaussian Mixed Models (GMM), which are used to describe the speech/non-speech log-normal distribution, and detect speech fragments from speech mixed with a high proportion of noise signals.
而语音活动检测(Voice Activity Detection,VAD)又称语音端点检测或语音边界检测。目的是从声音信号流里识别和消除长时间的静音期,以达到在不降低业务质量的情况下节省话路资源的作用,它是IP电话应用的重要组成部分。静音抑制可以节省宝贵的带宽资源,可以有利于减少用户感觉到的端到端的时延。The voice activity detection (Voice Activity Detection, VAD) is also called voice endpoint detection or voice boundary detection. The purpose is to identify and eliminate the long silent period from the voice signal stream, so as to save the voice channel resources without reducing the service quality. It is an important part of the IP phone application. Silence suppression can save valuable bandwidth resources and can help reduce the end-to-end delay felt by users.
通过ASR对各语音片段进行转换处理的步骤可以包括:The steps of converting each voice segment through ASR may include:
第一:构建ASR模型,ASR模型包含共83维特征,其中80维为log FBANK的前端特征,帧长25ms,另外3维为音高特征(包含POV主元特征的概率)。同时,创建LC-BHLSTM(Latency-controlled Bidirectional Highway Long Short-Term Memory,延迟控制的双向高速长短期记忆网络)模型,该LC-BHLSTM模型共有5层,1024个存储单元,每层输出有512个节点的投影。First: Construct an ASR model. The ASR model contains a total of 83-dimensional features, of which 80-dimensional is the front-end feature of log FBANK, the frame length is 25ms, and the other 3 is the pitch feature (including the probability of the POV principal element feature). At the same time, the LC-BHLSTM (Latency-controlled Bidirectional Highway Long Short-Term Memory) model was created. The LC-BHLSTM model has 5 layers, 1024 storage units, and each layer has 512 outputs. The projection of the node.
第二,将上述分割后的各语音片段输入ASR模型中,通过ASR模型将各语音片段表示为多维特征输出,具体可以为83维特征输出。然后,将ASR模型的输出信号输入LC-BHLSTM模型中,LC-BHLSTM模型的输出目标值是10k维上下文相关的三音素状态(又名:句音),最终完成语音片段至对话文本信息的转换。Second, input the segmented speech segments into the ASR model, and express each speech segment as a multi-dimensional feature output through the ASR model, which can specifically be an 83-dimensional feature output. Then, input the output signal of the ASR model into the LC-BHLSTM model. The output target value of the LC-BHLSTM model is a 10k-dimensional context-dependent triphone state (also known as sentence sound), and finally complete the conversion of speech fragments to dialogue text information .
其中,LSTM(Long Short-Term Memory长短期记忆网络)是一种时间递归神经网络,适合于处理和预测时间序列中间隔和延迟相对较长的事件。Among them, LSTM (Long Short-Term Memory) is a time recurrent neural network, suitable for processing and predicting events with relatively long intervals and delays in time series.
S130:根据上述文本信息对待确认目标或者说话者身份进行识别,获取 与各个待确认目标对应的文本信息片段,其中的说话者为多个待确认目标其中之一。S130: Identify the target to be confirmed or the identity of the speaker according to the above text information, and obtain the text information fragment corresponding to each target to be confirmed, where the speaker is one of the targets to be confirmed.
其中,根据文本信息对说话者身份进行识别的步骤可以包括:Among them, the step of identifying the speaker's identity according to the text information may include:
第一:获取基于训练集训练形成的深度学习分类模型,其中的训练集基于语料库组建而成;First: Obtain a deep learning classification model based on training set training, where the training set is formed based on a corpus;
第二:将文本信息输入到深度学习分类模型中,以对文本信息分配对应的标签。Second: input text information into the deep learning classification model to assign corresponding labels to the text information.
进一步地,根据文本信息对说话者身份进行识别的步骤还可以包括:Further, the step of identifying the speaker's identity according to the text information may further include:
1.基于语料库组建训练集;其中,在训练阶段手动标记“目标”、“非目标”标签来组建训练集。1. Build a training set based on a corpus; among them, manually mark the "target" and "non-target" labels in the training phase to build the training set.
2.基于训练集训练形成深度学习分类模型;2. Form a deep learning classification model based on training set training;
3.将文本信息输入训练好的深度学习分类模型中,对文本信息分配“目标”或“非目标”的标签。3. Input the text information into the trained deep learning classification model, and assign the label of "target" or "non-target" to the text information.
具体地,基于语料库组建训练集,在训练阶段手动标记“客户”/“客服”(即“目标”/“非目标”)标签来组建训练集,进而训练形成深度学习分类模型,将对话文本信息输入所述深度学习分类模型,对文本片段分配“客户”和“客服”的标签。最后,将各段被识别的客户文字数据找到对应的客户语音信息,并拼接成客户语音。Specifically, the training set is constructed based on the corpus, and the "customer"/"customer service" (ie "target"/"non-target") tags are manually marked during the training phase to form the training set, and then the deep learning classification model is trained to form the dialogue text information Input the deep learning classification model, and assign the tags of "customer" and "customer service" to the text segment. Finally, find the corresponding customer voice information from the recognized customer text data, and splice them into customer voice.
In identifying the speaker identity, the quality of the customer voice is critical, so the customer voice must be extracted in full from the customer/customer-service dialogue speech before being fed into the subsequent deep learning classification model for speaker verification.
At present, telephone customer service platform data has the following characteristics:
First, the recorded speech contains only two speakers, the customer service agent and the customer, and it is the customer voice whose identity awaits verification. This application therefore adopts a binary classification method to distinguish agent speech from customer speech.
Second, the two speakers may sound similar, but what they say differs. Telephone customer service mostly follows set content, introducing products in the relevant field, and thus contains many professional terms, whereas customers answering or calling in mainly ask questions about related issues in relatively plain, everyday language with few professional terms. These professional-term keywords can therefore serve as features for training the binary classification model, a method known as "keyword matching" (sketched below). Finally, the recognized customer text of each fragment is spliced into the customer voice for later speaker verification.
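A rough illustration of the keyword-matching idea follows; the domain-term list is hypothetical, standing in for professional terms that would in practice be curated from the agent-side scripts of the service platform:

```python
# Hypothetical domain-term list; in practice curated from agent scripts.
DOMAIN_TERMS = {"premium", "underwriting", "deductible", "rider"}

def keyword_score(segment: str) -> float:
    """Fraction of words in a transcript segment that are domain terms."""
    words = segment.lower().split()
    hits = sum(w in DOMAIN_TERMS for w in words)
    return hits / max(len(words), 1)

# Low score -> plain language, likely customer; high score -> likely agent
print(keyword_score("what does the deductible mean"))
```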
The main working principle of the above process is shown in FIG. 2 and FIG. 3: FIG. 2 illustrates the principle of identifying a target from the converted text information according to an embodiment of this application, and FIG. 3 illustrates the corresponding flow. As shown in FIG. 3, the steps of identifying a target from the converted text information are as follows:
S310: The system builds a search engine from the training set, extracts Chinese word segmentations of the text information through the search engine, and builds an index over these texts.
S320: The dialogue text information is fed into the deep learning classification model, and the K texts most relevant to the dialogue text information are retrieved.
S330: The category of the dialogue text information is decided by a vote according to the K-NN algorithm.
Here, the K-nearest-neighbor (K-NN) classification algorithm is one of the simplest methods in data-mining classification. "K nearest neighbors" means exactly that: each sample can be represented by its k closest neighbors.
The core idea of K-NN is that if most of a sample's k nearest neighbors in feature space belong to a certain category, the sample also belongs to that category and takes on the characteristics of the samples in it. The classification decision depends only on the categories of the one or few nearest samples, so a category decision involves only a very small number of neighbors. Because K-NN determines category membership from a limited set of nearby samples rather than by discriminating class regions, it is better suited than other methods to sample sets whose class regions cross or overlap heavily.
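A minimal sketch of the retrieval-and-vote decision in S320-S330, assuming the index has already returned the labels of the K training texts most relevant to a dialogue segment:

```python
from collections import Counter

def knn_vote(neighbor_labels):
    """Majority vote over the labels of the K retrieved texts."""
    return Counter(neighbor_labels).most_common(1)[0][0]

# Suppose the index returned K=5 nearest texts with these labels
print(knn_vote(["customer", "customer", "agent", "customer", "agent"]))
# -> "customer"
```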
After the text information fragments corresponding to each target to be confirmed have been obtained, the flow proceeds to step S140: obtain the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splice them to obtain the target voice signal.
The voice signal segment here can also be understood as a speech segment: the initial voice signal was divided into multiple speech segments before the pieces of text information corresponding to the target to be confirmed were obtained, so once each piece of text information is available, the corresponding voice signal segment can be identified from it. That segment is the voice signal of the speaker whose identity is to be confirmed.
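A minimal sketch of the splicing in S140, assuming each speech segment is a 1-D waveform array aligned with the label assigned to its transcript:

```python
import numpy as np

def splice_target_audio(segments, labels, target="target"):
    """Concatenate, in order, the waveforms of segments labeled target."""
    kept = [seg for seg, lab in zip(segments, labels) if lab == target]
    return np.concatenate(kept) if kept else np.array([])

# segments: list of 1-D waveform arrays aligned with per-segment labels
# target_signal = splice_target_audio(segments, labels)
```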
After the target voice signal is obtained, the flow proceeds to step S150: confirm the identity of the target to be confirmed according to the obtained target voice signal.
The step of confirming the target identity according to the obtained target voice signal may proceed in either of two ways:
Way 1 uses an i-vector system based on a deep neural network (DNN) model to confirm the identity of the target speaker, i.e. the target to be confirmed; way 2 uses an i-vector system based on a Gaussian mixture model (GMM) for the same purpose.
FIG. 4 shows the principle by which the DNN-based i-vector system of way 1 confirms the identity of the target speaker, where DNN denotes a deep neural network, UBM a universal background model, DFNN a dynamic fuzzy neural network, LSTM a long short-term memory network and TDNN a time-delay neural network. Based on the principle shown in FIG. 4, the process by which the DNN-based i-vector system of way 1 confirms the identity of the target speaker mainly includes the following steps:
Step 1: Feature extraction, which collects sufficient statistics, extracts i-vectors and applies a scoring criterion. This step converts the speech waveform into feature vectors (common parameterizations: MFCC (Mel-frequency cepstral coefficients), LPCC (linear prediction cepstral coefficients) and PLP (perceptual linear prediction)), filtering noise out of the given speech signal while retaining the useful speaker information.
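By way of illustration, MFCC extraction with a common audio toolkit might look as follows; this is a sketch only, since the text does not prescribe a toolkit, and LPCC or PLP features could be substituted:

```python
import librosa

# Load an utterance and compute 20 MFCCs per frame
y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape: (20, n_frames)
```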
Step 2: Collecting sufficient statistics based on VAD technology means computing 0th-, 1st- and 2nd-order Baum-Welch statistics from the series of feature vectors. These statistics are high-dimensional information generated from a large-scale DNN, which here plays the role of the UBM.
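Assuming per-frame posteriors over the UBM components (or DNN senones) are available, the zeroth- and first-order Baum-Welch statistics reduce to simple weighted sums, sketched here:

```python
import numpy as np

def baum_welch_stats(feats, post):
    """Zeroth/first-order Baum-Welch statistics.

    feats: (T, D) frame features; post: (T, C) per-frame posteriors
    over the C UBM components (or DNN senones). Second-order stats
    would additionally accumulate post.T @ (feats ** 2).
    """
    N = post.sum(axis=0)   # zeroth order, shape (C,)
    F = post.T @ feats     # first order, shape (C, D)
    return N, F
```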
Step 3: i-vector extraction converts the above high-dimensional statistics into a single low-dimensional feature vector containing only the discriminative feature information that distinguishes this speaker from others.
Step 4: After the i-vector is extracted, a scoring criterion (common choices: cosine distance similarity, LDA (linear discriminant analysis) and PLDA (probabilistic linear discriminant analysis)) decides whether to accept or reject the customer identity information.
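A sketch of the simplest scoring criterion named above, cosine similarity between an enrollment i-vector and a test i-vector; the threshold is an assumed placeholder that would in practice be tuned on held-out data:

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    """Cosine similarity between two i-vectors."""
    num = float(enroll_ivec @ test_ivec)
    den = np.linalg.norm(enroll_ivec) * np.linalg.norm(test_ivec)
    return num / den

THRESHOLD = 0.5  # assumed value, not from the disclosure
accept = cosine_score(np.random.randn(400), np.random.randn(400)) > THRESHOLD
```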
The principle by which the GMM-based i-vector system of way 2 confirms the identity of the target speaker resembles the feature extraction process of way 1 and is not repeated here. FIG. 5 shows this principle, where GMM denotes a Gaussian mixture model; the meanings of MFCC and PLP are as explained for FIG. 4.
The speaker identification method based on speech content provided by the embodiments of this application converts the dialogue audio into text information with automatic speech recognition, uses a deep learning classification method to label that text as target or non-target, and finally splices the target audio fragments and verifies the identity of the spliced audio. In application scenarios such as telemarketing, where the customer's and the agent's speech content differs, performing speaker identification and verification on the basis of speech content effectively improves the accuracy of the identity verification process.
It should be understood that the magnitude of the step numbers in the above embodiments does not imply an execution order; the execution order of each process is determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of this application.
Embodiment 2
Corresponding to the above method, this application further provides a speaker identification system based on speech content. FIG. 6 shows the logical structure of such a system according to an embodiment of this application.
As shown in FIG. 6, the speaker identification system 600 based on speech content provided by this application includes a voice signal collection unit 610, a text information conversion unit 620, a text information fragment acquisition unit 630, a target voice signal acquisition unit 640 and an identity confirmation unit 650. The functions implemented by these units correspond one to one with the steps of the speaker identification method based on speech content in the above embodiment.
Specifically, the voice signal collection unit 610 collects the initial voice signal; the text information conversion unit 620 converts the initial voice signal collected by unit 610 into text information corresponding to the speech content through voice recognition technology; the text information fragment acquisition unit 630 identifies the speaker identity from the text information produced by unit 620 and obtains the text information fragments corresponding to each target to be confirmed, where the speaker is one of the multiple targets to be confirmed; the target voice signal acquisition unit 640 obtains the voice signal segments corresponding to the target to be confirmed from the text information fragments obtained by unit 630 and splices them into the target voice signal; and the identity confirmation unit 650 confirms the identity of the target to be confirmed according to the target voice signal obtained by unit 640.
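For orientation only, the unit structure of system 600 can be pictured as a five-stage pipeline; the class and method names below merely mirror units 610-650 and are illustrative, not part of the disclosure:

```python
class SpeakerIDSystemSketch:
    """Illustrative skeleton mirroring units 610-650 of system 600."""

    def collect(self, audio):          # voice signal collection unit 610
        return audio

    def to_text(self, signal):         # text information conversion unit 620
        raise NotImplementedError      # ASR + LC-BHLSTM (see Embodiment 1)

    def classify(self, segments):      # text fragment acquisition unit 630
        raise NotImplementedError      # deep learning classifier

    def splice(self, signal, labels):  # target voice acquisition unit 640
        raise NotImplementedError      # concatenate the target segments

    def verify(self, voice):           # identity confirmation unit 650
        raise NotImplementedError      # i-vector scoring
```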
In a preferred embodiment of this application, the text information conversion unit 620 further includes:
a voice segment division unit 621, configured to divide the initial voice signal into voice segments by means of a subspace Gaussian mixture model and voice activity detection technology; and
a voice segment conversion unit 622, configured to convert each voice segment into text information through voice recognition technology.
Further, the voice segment conversion unit 622 includes a model construction unit and a model processing unit (not shown in the figure). The model construction unit constructs the voice recognition model and the latency-controlled bidirectional highway long short-term memory network model LC-BHLSTM; the model processing unit feeds each voice segment into the voice recognition model, which represents each segment as a multi-dimensional feature output, and then feeds the output signal of the voice recognition model into the LC-BHLSTM model to obtain the text information corresponding to each voice segment.
The voice recognition model constructed by the model construction unit contains 83-dimensional features, of which 80 dimensions are log-FBANK front-end features with a 25 ms frame length and the remaining 3 dimensions are pitch features; the latency-controlled bidirectional highway long short-term memory network model it constructs has 5 layers of 1024 memory cells each, with every layer outputting a 512-node projection.
In a preferred embodiment of this application, the text information fragment acquisition unit 630 further includes:
a deep learning classification model acquisition unit 631, configured to obtain a deep learning classification model trained on a training set, where the training set is built from a corpus; and
a label assignment unit 632, configured to input the text information into the deep learning classification model and assign corresponding labels to the text information.
In another preferred embodiment of this application, the identity confirmation unit 650 further includes:
a first identity confirmation unit 651, configured to confirm the identity of the target to be confirmed using an i-vector system based on a deep neural network model. The first identity confirmation unit 651 includes a feature vector conversion unit, a high-dimensional information generation unit, a low-dimensional feature vector conversion unit and an identity evaluation unit (not shown in the figure).
The feature vector conversion unit extracts i-vectors and a scoring criterion based on statistics, to convert the speech waveform into feature vectors; the high-dimensional information generation unit computes 0th-, 1st- and 2nd-order Baum-Welch statistics from the series of feature vectors to generate high-dimensional information; the low-dimensional feature vector conversion unit converts the high-dimensional statistics into a single low-dimensional feature vector containing only the discriminative feature information distinguishing this speaker from others; and the identity evaluation unit applies a preset scoring criterion to decide whether to accept or reject the speaker's identity information.
In yet another preferred embodiment of this application, the identity confirmation unit 650 further includes a second identity confirmation unit (not shown in the figure), configured to confirm the identity of the target to be confirmed using an i-vector system based on a Gaussian mixture model.
Through the voice signal collection unit, text information conversion unit, text information fragment acquisition unit, target voice signal acquisition unit and identity confirmation unit, the speaker identification system based on speech content provided by this embodiment converts the dialogue audio into text information with automatic speech recognition, labels that text as target or non-target, and finally splices the target audio fragments and verifies the identity of the spliced audio. In application scenarios where the speech content differs between speakers, performing speaker identification and verification on the basis of the speech content effectively improves the accuracy of identity verification.
Embodiment 3
FIG. 7 is a schematic diagram of the application environment of a specific embodiment of the speaker identification method based on speech content of this application. As shown in FIG. 7, the electronic device 1 implementing the above method may be a terminal device with computing capability, such as a server, smartphone, tablet computer, portable computer or desktop computer.
The electronic device 1 includes a processor 12, a memory 11, a network interface 14 and a communication bus 15.
The memory 11 includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card or card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, for example its hard disk. In other embodiments, it may be an external memory of the electronic device 1, for example a plug-in hard disk, smart media card (SMC), secure digital (SD) card or flash card with which the electronic device 1 is equipped.
In this embodiment, the readable storage medium of the memory 11 is generally used to store the speech-content-based speaker identification program 10 installed on the electronic device 1, and the like. The memory 11 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), microprocessor or other data processing chip, used to run the program code or process the data stored in the memory 11, for example the speech-content-based speaker identification program 10.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish communication connections between the electronic device 1 and other electronic devices.
The communication bus 15 realizes the connections and communication among these components.
FIG. 7 shows only the electronic device 1 with components 11-15, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further include a user interface, which may include an input unit such as a keyboard, a voice input device with voice recognition capability such as a microphone, and a voice output device such as a speaker or headphones; optionally, the user interface may also include a standard wired interface and a wireless interface.
Optionally, the electronic device 1 may further include a display, also called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch liquid crystal display or an organic light-emitting diode (OLED) touch display. The display shows the information processed in the electronic device 1 and presents a visual user interface.
Optionally, the electronic device 1 further includes a touch sensor. The area that the touch sensor provides for the user's touch operations is called the touch area. The touch sensor described here may be a resistive touch sensor, a capacitive touch sensor or the like, and includes not only contact touch sensors but also proximity touch sensors. Moreover, the touch sensor may be a single sensor or multiple sensors arranged, for example, in an array.
In addition, the area of the display of the electronic device 1 may be the same as or different from that of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen, on the basis of which the device detects the touch operations triggered by the user.
Optionally, the electronic device 1 may also include a radio frequency (RF) circuit, sensors, an audio circuit and so on, which are not detailed here.
In the device embodiment shown in FIG. 7, the memory 11, a computer storage medium, may contain an operating system and the speech-content-based speaker identification program 10; when executing the program 10 stored in the memory 11, the processor 12 implements the steps of the speaker identification method based on speech content described above.
Compared with previous voiceprint recognition algorithms, the electronic device 1 proposed in the above embodiment reduces the need for acoustic model building and uses a binary classification algorithm to improve recognition in scenarios where the speakers differ in gender. Moreover, the identity verification framework proposed as a whole solves the customer verification problem in single-channel, two- or multi-speaker scenarios, with high speaker recognition accuracy and fast speed.
In other embodiments, the speech-content-based speaker identification program 10 may also be divided into one or more modules stored in the memory 11 and executed by the processor 12 to complete this application. A module in this application refers to a series of computer program instruction segments capable of completing a specific function; here, the modules are those implementing the steps of the aforementioned speaker identification method based on speech content, or equivalently those corresponding to the units of the aforementioned speaker identification system based on speech content.
Embodiment 4
In addition, an embodiment of this application further proposes a computer-readable storage medium containing a speech-content-based speaker identification program which, when executed by a processor, implements the operations of each step of the aforementioned speaker identification method based on speech content, or the functions of each unit of the aforementioned speaker identification system based on speech content. To avoid repetition, they are not detailed again here.
It should be noted that, in this document, the terms "comprise" and "include" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, device, article or method that comprises it.
The serial numbers of the above embodiments of this application are for description only and do not represent the relative merits of the embodiments. From the description of the above implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general hardware platform, and of course also by hardware, though in many cases the former is preferable. On this understanding, the technical solution of this application, in essence or in the part contributing over the prior art, can be embodied as a software product stored on a storage medium as described above (such as ROM/RAM, magnetic disk or optical disc) and including several instructions that cause a terminal device (which may be a mobile phone, computer, server, network device or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. A speaker identification method based on speech content, applied to an electronic device, wherein the method comprises:
    collecting an initial voice signal, wherein the initial voice signal contains the speech content of at least two targets to be confirmed;
    converting the initial voice signal into text information corresponding to the speech content through voice recognition technology;
    identifying the speaker identity according to the text information and obtaining text information fragments corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed;
    obtaining the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splicing them to obtain a target voice signal; and
    confirming the identity of the target to be confirmed according to the target voice signal.
  2. The speaker identification method based on speech content according to claim 1, wherein converting the initial voice signal into text information corresponding to the speech content through voice recognition technology comprises:
    dividing the initial voice signal into voice segments by means of a subspace Gaussian mixture model and voice activity detection technology; and
    converting each voice segment into text information through voice recognition technology.
  3. The speaker identification method based on speech content according to claim 2, wherein the step of converting each voice segment into text information through voice recognition technology comprises:
    constructing a voice recognition model and a latency-controlled bidirectional highway long short-term memory network model LC-BHLSTM;
    inputting each voice segment into the voice recognition model for processing, the voice recognition model representing each voice segment as a multi-dimensional feature output; and
    inputting the output signal of the voice recognition model into the LC-BHLSTM model for processing to obtain the text information corresponding to each voice segment.
  4. The speaker identification method based on speech content according to claim 2, wherein
    the constructed voice recognition model contains 83-dimensional features, of which 80 dimensions are log-FBANK front-end features with a 25 ms frame length and the remaining 3 dimensions are pitch features; and
    the constructed latency-controlled bidirectional highway long short-term memory network model has 5 layers of 1024 memory cells each, with every layer outputting a 512-node projection.
  5. The speaker identification method based on speech content according to claim 1, wherein the step of identifying the speaker identity according to the text information comprises:
    obtaining a deep learning classification model trained on a training set, wherein the training set is built from a corpus; and
    inputting the text information into the deep learning classification model and assigning corresponding labels to the text information.
  6. The speaker identification method based on speech content according to claim 5, wherein
    the training set is built with manually marked "target" and "non-target" labels; and
    the text information is input into the deep learning classification model, which assigns a "target" or "non-target" label to the text information.
  7. The speaker identification method based on speech content according to claim 1, wherein the step of identifying the speaker identity according to the text information comprises:
    building a search engine from the training set, extracting Chinese word segmentations of the text information through the search engine, and building an index over the text information;
    feeding the text information into the deep learning classification model for training and obtaining the K texts most relevant to the text information; and
    voting on the category of the text information according to the K-NN algorithm.
  8. The speaker identification method based on speech content according to claim 1, wherein the step of confirming the identity of the target to be confirmed according to the target voice signal comprises:
    confirming the identity of the target to be confirmed using an i-vector system based on a deep neural network model; or
    confirming the identity of the target to be confirmed using an i-vector system based on a Gaussian mixture model.
  9. The speaker identification method based on speech content according to claim 8, wherein the step of confirming the identity of the target to be confirmed using an i-vector system based on a deep neural network model comprises:
    extracting i-vectors and a scoring criterion based on statistics, to convert the speech waveform into feature vectors;
    computing 0th-, 1st- and 2nd-order Baum-Welch statistics from the series of feature vectors to generate high-dimensional information;
    converting the high-dimensional statistics into a single low-dimensional feature vector containing only the discriminative feature information distinguishing this speaker from others; and
    applying a preset scoring criterion to decide whether to accept or reject the speaker's identity information.
  10. The speaker identification method based on speech content according to claim 9, wherein the preset scoring criterion includes: cosine distance similarity, linear discriminant analysis, and probabilistic linear discriminant analysis.
  11. An electronic device, comprising a memory, a processor and a camera, the memory containing a speech-content-based speaker identification program which, when executed by the processor, implements the steps of the speaker identification method based on speech content according to any one of claims 1 to 10.
  12. A speaker identification system based on speech content, comprising:
    a voice signal collection unit, configured to collect an initial voice signal, wherein the initial voice signal contains the speech content of at least two targets to be confirmed;
    a text information conversion unit, configured to convert the initial voice signal into text information corresponding to the speech content through voice recognition technology;
    a text information fragment acquisition unit, configured to identify the speaker identity according to the text information and obtain text information fragments corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed;
    a target voice signal acquisition unit, configured to obtain the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splice them to obtain a target voice signal; and
    an identity confirmation unit, configured to confirm the identity of the target to be confirmed according to the target voice signal.
  13. The speaker identification system based on speech content according to claim 12, wherein the text information conversion unit comprises:
    a voice segment division unit, configured to divide the initial voice signal into voice segments by means of a subspace Gaussian mixture model and voice activity detection technology; and
    a voice segment conversion unit, configured to convert each voice segment into text information through voice recognition technology.
  14. The speaker identification system based on speech content according to claim 13, wherein the voice segment conversion unit comprises:
    a model construction unit, configured to construct a voice recognition model and a latency-controlled bidirectional highway long short-term memory network model LC-BHLSTM; and
    a model processing unit, configured to input each voice segment into the voice recognition model for processing, the voice recognition model representing each voice segment as a multi-dimensional feature output, and to input the output signal of the voice recognition model into the LC-BHLSTM model for processing to obtain the text information corresponding to each voice segment.
  15. The speaker identification system based on speech content according to claim 14, wherein
    the voice recognition model constructed by the model construction unit contains 83-dimensional features, of which 80 dimensions are log-FBANK front-end features with a 25 ms frame length and the remaining 3 dimensions are pitch features; and
    the latency-controlled bidirectional highway long short-term memory network model constructed by the model construction unit has 5 layers of 1024 memory cells each, with every layer outputting a 512-node projection.
  16. The speaker identification system based on speech content according to claim 12, wherein the text information fragment acquisition unit comprises:
    a deep learning classification model acquisition unit, configured to obtain a deep learning classification model trained on a training set, wherein the training set is built from a corpus; and
    a label assignment unit, configured to input the text information into the deep learning classification model and assign corresponding labels to the text information.
  17. The speaker identification system based on speech content according to claim 12, wherein the identity confirmation unit comprises:
    a first identity confirmation unit, configured to confirm the identity of the target to be confirmed using an i-vector system based on a deep neural network model.
  18. The speaker identification system based on speech content according to claim 17, wherein the first identity confirmation unit comprises:
    a feature vector conversion unit, configured to extract i-vectors and a scoring criterion based on statistics, to convert the speech waveform into feature vectors;
    a high-dimensional information generation unit, configured to compute 0th-, 1st- and 2nd-order Baum-Welch statistics from the series of feature vectors to generate high-dimensional information;
    a low-dimensional feature vector conversion unit, configured to convert the high-dimensional statistics into a single low-dimensional feature vector containing only the discriminative feature information distinguishing this speaker from others; and
    an identity evaluation unit, configured to apply a preset scoring criterion to decide whether to accept or reject the speaker's identity information.
  19. The speaker identification system based on speech content according to claim 12, wherein the identity confirmation unit comprises:
    a second identity confirmation unit, configured to confirm the identity of the target to be confirmed using an i-vector system based on a Gaussian mixture model.
  20. A computer-readable storage medium, wherein the computer-readable storage medium contains a speech-content-based speaker identification program which, when executed by a processor, implements the steps of the speaker identification method based on speech content according to any one of claims 1 to 10.
PCT/CN2019/117903 2019-04-16 2019-11-13 Speaker identity recognition method and device based on speech content, and storage medium WO2020211354A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910305438.3 2019-04-16
CN201910305438.3A CN110136727B (en) 2019-04-16 2019-04-16 Speaker identification method, device and storage medium based on speaking content

Publications (1)

Publication Number Publication Date
WO2020211354A1 true WO2020211354A1 (en) 2020-10-22

Family

ID=67570149

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117903 WO2020211354A1 (en) 2019-04-16 2019-11-13 Speaker identity recognition method and device based on speech content, and storage medium

Country Status (2)

Country Link
CN (1) CN110136727B (en)
WO (1) WO2020211354A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136727B (en) * 2019-04-16 2024-04-16 平安科技(深圳)有限公司 Speaker identification method, device and storage medium based on speaking content
CN110517667A (en) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 A kind of method of speech processing, device, electronic equipment and storage medium
CN112837672B (en) * 2019-11-01 2023-05-09 北京字节跳动网络技术有限公司 Method and device for determining conversation attribution, electronic equipment and storage medium
CN110931023B (en) * 2019-11-29 2022-08-19 厦门快商通科技股份有限公司 Gender identification method, system, mobile terminal and storage medium
CN111144091B (en) * 2019-12-02 2024-04-05 支付宝(杭州)信息技术有限公司 Customer service member determination method and device and group member identification determination method
CN111089245A (en) * 2019-12-23 2020-05-01 宁波飞拓电器有限公司 Multipurpose energy-saving fire-fighting emergency lamp
CN111128223B (en) * 2019-12-30 2022-08-05 科大讯飞股份有限公司 Text information-based auxiliary speaker separation method and related device
CN111243595B (en) * 2019-12-31 2022-12-27 京东科技控股股份有限公司 Information processing method and device
CN111405122B (en) * 2020-03-18 2021-09-24 苏州科达科技股份有限公司 Audio call testing method, device and storage medium
CN111508505B (en) * 2020-04-28 2023-11-03 讯飞智元信息科技有限公司 Speaker recognition method, device, equipment and storage medium
CN111539221B (en) * 2020-05-13 2023-09-12 北京焦点新干线信息技术有限公司 Data processing method and system
CN112182197A (en) * 2020-11-09 2021-01-05 北京明略软件系统有限公司 Method, device and equipment for recommending dialect and computer readable medium
CN112397057A (en) * 2020-12-01 2021-02-23 平安科技(深圳)有限公司 Voice processing method, device, equipment and medium based on generation countermeasure network
CN113051426A (en) * 2021-03-18 2021-06-29 深圳市声扬科技有限公司 Audio information classification method and device, electronic equipment and storage medium
CN113051902A (en) * 2021-03-30 2021-06-29 上海思必驰信息科技有限公司 Voice data desensitization method, electronic device and computer-readable storage medium
CN113792140A (en) * 2021-08-12 2021-12-14 南京星云数字技术有限公司 Text processing method and device and computer readable storage medium
CN114299957A (en) * 2021-11-29 2022-04-08 北京百度网讯科技有限公司 Voiceprint separation method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100138040A1 (en) * 2007-01-18 2010-06-03 Korea Institute Of Science And Technology Apparatus for detecting user and method for detecting user by the same
CN102456345A (en) * 2010-10-19 2012-05-16 盛乐信息技术(上海)有限公司 Concatenated speech detection system and method
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system
CN107657947A (en) * 2017-09-20 2018-02-02 百度在线网络技术(北京)有限公司 Method of speech processing and its device based on artificial intelligence
CN110136727A (en) * 2019-04-16 2019-08-16 平安科技(深圳)有限公司 Speaker's personal identification method, device and storage medium based on speech content

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680602A (en) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 Voice fraud recognition methods, device, terminal device and storage medium
CN108831485B (en) * 2018-06-11 2021-04-23 东北师范大学 Speaker identification method based on spectrogram statistical characteristics
CN108877809B (en) * 2018-06-29 2020-09-22 北京中科智加科技有限公司 Speaker voice recognition method and device
CN109273012B (en) * 2018-09-06 2023-01-31 河海大学 Identity authentication method based on speaker recognition and digital voice recognition

Also Published As

Publication number Publication date
CN110136727A (en) 2019-08-16
CN110136727B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
WO2020211354A1 (en) Speaker identity recognition method and device based on speech content, and storage medium
US11636860B2 (en) Word-level blind diarization of recorded calls with arbitrary number of speakers
US11367450B2 (en) System and method of diarization and labeling of audio data
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
CN108074576A (en) Inquest the speaker role's separation method and system under scene
CN108735200A (en) A kind of speaker's automatic marking method
Pao et al. A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition
KR102113879B1 (en) The method and apparatus for recognizing speaker's voice by using reference database
Sawakare et al. Speech recognition techniques: a review
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN113744742B (en) Role identification method, device and system under dialogue scene
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
Madhusudhana Rao et al. Machine hearing system for teleconference authentication with effective speech analysis
CN117133273A (en) Voice classification method, device, electronic equipment and storage medium
CN114121023A (en) Speaker separation method, speaker separation device, electronic equipment and computer readable storage medium
CN115567642A (en) Monitoring method and device for crowdsourcing customer service, computer equipment and storage medium
Sze LYU 0202 Advanced Audio Information Retrieval System
Ambika et al. Vector Quantization in Language Independent Speaker Identification Using Mel-Frequency Cepstrum Co-efficient

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19925310

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19925310

Country of ref document: EP

Kind code of ref document: A1