WO2020211354A1 - Speaker identity recognition method and device based on speech content, and storage medium - Google Patents

Speaker identity recognition method and device based on speech content, and storage medium

Info

Publication number
WO2020211354A1
Authority
WO
WIPO (PCT)
Prior art keywords
text information
identity
speaker
target
speech
Prior art date
Application number
PCT/CN2019/117903
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
孙奥兰
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020211354A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches

Definitions

  • This application relates to the technical field of speech signal processing, and in particular to a method, device and computer-readable storage medium for speaker identification based on the content of speech.
  • Research shows that although voiceprints do not exhibit individual differences as obvious as those of fingerprints or faces, each person's vocal tract, oral cavity, and nasal cavity (the organs used in pronunciation) differ from person to person, and these differences are reflected in the voice. For example, when answering the phone, we can accurately tell who is calling from a single "Hello": the human ear, as a receiver of the body, is naturally able to distinguish voices. By technical means, voiceprints can likewise serve, like faces and fingerprints, as important information for personal identity authentication.
  • Voiceprint Recognition (VPR), also known as Speaker Recognition, includes two types: speaker identification and speaker verification (Speaker Verification). The former determines which of several people uttered a given speech segment, a "one-of-many" choice; the latter confirms whether a given speech segment was spoken by a designated person, a "one-to-one" decision. Speaker recognition is the process of accepting or rejecting a claimed speaker identity given the speaker's voice, and is widely used in banking systems, financial commerce, and voice security control.
  • For this reason, speaker recognition technology has gradually developed and become widespread, especially in security verification and telephone banking. The technology is intended for single-channel, single-speaker scenarios: the voice of a single customer is input to obtain a good verification result. In customer-oriented companies, speaker recognition can help customers resolve urgent needs and receive personalized service, and can also support precision marketing.
  • However, the applicant has realized that most existing products in the industry are based on the speaker's voiceprint; this approach works well when the two parties in a conversation are of different genders, but relatively poorly when they are of the same gender.
  • For example, on a telephone customer service platform, the conversation between the customer and the agent is recorded on a single channel, so the customer's identity cannot be verified directly from the recording with speaker verification technology. This makes telephone customer service inefficient and wastes considerable manpower and material resources.
  • This application provides a speaker identification method, device, and computer-readable storage medium based on speech content. Its main purpose is to convert recorded conversation audio into text information with automatic speech recognition, use a deep-learning classification method to label each segment as customer or customer service, and finally splice the customer audio segments together and verify the identity of the spliced audio. In application scenarios such as telesales, where the customer and the agent say systematically different things, this enables speaker identification and verification based on speech content, improves the accuracy of the identity verification process, makes the method applicable to telephone customer service, and saves manpower and material resources.
  • To achieve the above objective, this application provides a speaker identification method based on speech content, applied to an electronic device, the method including: collecting an initial voice signal, where the initial voice signal contains the speech content of the target to be confirmed; converting the initial voice signal into text information corresponding to the speech content through speech recognition technology; identifying the speaker's identity according to the text information and obtaining the text information fragment corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed; obtaining the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splicing them to obtain the target voice signal; and confirming the identity of the target to be confirmed according to the target voice signal.
  • To achieve the above objective, the present application also provides an electronic device, which includes a memory, a processor, and a camera device. The memory stores a speaker identification program based on speech content, and when this program is executed by the processor, the steps of the aforementioned speaker identification method based on speech content are implemented.
  • To achieve the above objective, this application also provides a speaker identification system based on speech content, including:
  • a voice signal collection unit, configured to collect an initial voice signal, where the initial voice signal includes the speech content of at least two targets to be confirmed;
  • a text information conversion unit, configured to convert the initial voice signal into text information corresponding to the speech content through speech recognition technology;
  • a text information fragment acquisition unit, configured to identify the speaker's identity according to the text information and obtain the text information fragment corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed;
  • a target voice signal acquisition unit, configured to obtain the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splice them to obtain the target voice signal;
  • an identity confirmation unit, configured to confirm the identity of the target to be confirmed according to the target voice signal.
  • To achieve the above objective, the present application also provides a computer-readable storage medium, which includes a speaker identification program based on speech content; when this program is executed by a processor, the steps of the speaker identification method based on speech content described above are realized.
  • The speaker identification method, device, and computer-readable storage medium based on speech content proposed in this application convert recorded dialogue audio into text information using automatic speech recognition, use a deep-learning classification method to label the text as target or non-target, and finally splice the target audio segments and verify the identity of the spliced audio. In application scenarios such as telesales, where the speech content of the customer and the agent differs, speaker identification and verification based on speech content can improve the accuracy of the identity verification process.
  • FIG. 1 is a flowchart of a specific embodiment of the speaker identification method based on speech content of this application;
  • FIG. 2 is a schematic diagram of identifying targets from the converted text information in this application;
  • FIG. 3 is a flowchart of identifying targets from the converted text information in FIG. 2;
  • FIG. 4 is a schematic diagram of DNN-based speaker identity confirmation according to an embodiment of this application;
  • FIG. 5 is a schematic diagram of GMM-based speaker identity confirmation according to an embodiment of this application;
  • FIG. 6 shows the logical structure of a speaker identification system based on speech content according to an embodiment of this application;
  • FIG. 7 is a schematic diagram of an application environment of a specific embodiment of the speaker identification method based on speech content of this application.
  • FIG. 1 is a flowchart of a specific embodiment of speaker identification based on speech content in this application. The electronic device that executes the method may be implemented by software and/or hardware.
  • In the embodiment shown in FIG. 1, the speaker identification method based on speech content includes the following steps.
  • Step S110: Collect an initial voice signal, where the initial voice signal includes the speech content of at least two targets to be confirmed.
  • Here, the initial voice signal is the dialogue voice signal of at least two speakers. The collection of the initial voice signal mainly targets the speakers' voice signals during telephone communication: if the call is a voice call between only two people, there are two targets to be confirmed. The speaker identification program based on speech content provided in this application also applies to multi-party calls; in that case the initial voice signal contains the speech content of multiple targets to be confirmed. The specific implementations are similar and are not repeated here.
  • In addition, the trigger point for collecting voice signal data differs with the application scenario. For example, when the speaker identification program based on speech content is installed on a mobile terminal, collection can be triggered by a key or a start button set on the terminal. The collected voice signal data constitutes the initial voice signal and serves as the initial voice signal required for subsequent identity recognition.
  • Step S120: Convert the collected initial voice signal into text information corresponding to the speech content through ASR (Automatic Speech Recognition).
  • As an example, when the speakers are a customer and a customer service agent, converting the collected initial voice signal into the corresponding text information through ASR proceeds as follows: first, the initial voice signal is divided into multiple short voice segments using a Subspace Gaussian Mixture Model (SGMM) and Voice Activity Detection (VAD). Short segments make the text conversion by ASR easier, and the segmentation parameters can be set according to the ASR system. Then, each voice segment is converted into text information through ASR.
  • Specifically, the SGMM-VAD algorithm can be composed of two Gaussian Mixture Models (GMMs), which describe the log-normal distributions of speech and non-speech respectively and detect speech segments from audio mixed with a high proportion of noise.
  • Voice Activity Detection (VAD) is also called voice endpoint detection or voice boundary detection. Its purpose is to identify and eliminate long silent periods from the voice signal stream, saving voice channel resources without reducing service quality; it is an important part of IP telephony applications. Silence suppression saves valuable bandwidth and helps reduce the end-to-end delay perceived by users. A minimal sketch of GMM-based activity detection in this spirit follows.
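The following Python sketch illustrates GMM-based voice activity detection in the spirit of the SGMM-VAD step above; it is an assumption-laden stand-in, not the patent's algorithm. Real systems use richer features than per-frame log-energy and smooth the frame decisions.

```python
# Illustrative GMM-based VAD: two Gaussians model the speech / non-speech
# energy distributions; consecutive speech frames are merged into segments.
# Assumes 16 kHz mono audio in a float NumPy array (a simplification).
import numpy as np
from sklearn.mixture import GaussianMixture

def vad_segments(signal, sr=16000, frame_ms=25, hop_ms=10):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + (len(signal) - frame) // hop
    # Per-frame log-energy as a crude 1-D feature.
    feats = np.array([np.log(np.sum(signal[i*hop:i*hop+frame]**2) + 1e-10)
                      for i in range(n)]).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(feats)
    speech = int(np.argmax(gmm.means_))        # louder component = speech
    labels = gmm.predict(feats) == speech
    segments, start = [], None                 # collapse frames into spans
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i * hop
        elif not is_speech and start is not None:
            segments.append((start, i * hop + frame))
            start = None
    if start is not None:
        segments.append((start, len(signal)))
    return segments                            # [(start_sample, end_sample)]
```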
  • The steps of converting each voice segment through ASR may include:
  • First, construct the ASR model. The ASR model uses 83-dimensional features in total, of which 80 dimensions are log-FBANK front-end features with a frame length of 25 ms, and the other 3 dimensions are pitch features (including the probability-of-voicing (POV) feature).
  • At the same time, create the LC-BHLSTM (Latency-controlled Bidirectional Highway Long Short-Term Memory) model. The LC-BHLSTM model has 5 layers with 1024 memory cells each, and each layer's output is a projection of 512 nodes.
  • Second, input the segmented voice segments into the ASR model, which represents each voice segment as a multi-dimensional (here, 83-dimensional) feature output. Then input the output of the ASR model into the LC-BHLSTM model, whose output target values are 10k-dimensional context-dependent triphone states (senones), finally completing the conversion of the voice segments into dialogue text information.
  • LSTM (Long Short-Term Memory) is a recurrent neural network suited to processing and predicting events with relatively long intervals and delays in a time series. A rough sketch of the acoustic model shape described above follows.
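The following PyTorch sketch mirrors only the stated shape of the model: 83-dimensional input frames, five bidirectional LSTM layers with 1024 cells and 512-node projections, and a 10k-way output over context-dependent triphone states. The latency control and highway connections of LC-BHLSTM are omitted, so this is an approximation, not the patent's network.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, feat_dim=83, cells=1024, proj=512, n_states=10000):
        super().__init__()
        # 5 bidirectional LSTM layers, 1024 cells, 512-node projections.
        self.blstm = nn.LSTM(input_size=feat_dim, hidden_size=cells,
                             num_layers=5, proj_size=proj,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * proj, n_states)  # forward + backward halves

    def forward(self, frames):          # frames: (batch, time, 83)
        h, _ = self.blstm(frames)       # (batch, time, 2 * 512)
        return self.out(h)              # per-frame triphone-state logits

model = AcousticModel()
logits = model(torch.randn(2, 100, 83))  # two utterances of 100 frames each
```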
  • Step S130: Identify the identity of the target to be confirmed, i.e. the speaker, according to the above text information, and obtain the text information fragment corresponding to each target to be confirmed, where the speaker is one of the targets to be confirmed.
  • The step of identifying the speaker's identity according to the text information may include: first, obtaining a deep-learning classification model trained on a training set built from a corpus; second, inputting the text information into the model so that corresponding labels are assigned to it.
  • Further, the step of identifying the speaker's identity according to the text information may include: 1. building a training set from a corpus, with "target"/"non-target" labels tagged manually in the training phase; 2. training a deep-learning classification model on the training set; 3. inputting the text information into the trained model, which assigns a "target" or "non-target" label to it.
  • Specifically, the training set is built from a corpus, with "customer"/"customer service" (i.e. "target"/"non-target") labels marked manually in the training phase, and a deep-learning classification model is then trained on it. The dialogue text information is input into the deep-learning classification model, which assigns "customer" or "customer service" labels to each text fragment. Finally, the customer voice information corresponding to the recognized customer text data is located and spliced into the customer voice.
  • In the process of identifying the speaker's identity, the quality of the customer voice is very important. The customer voice must therefore be extracted completely from the customer-agent dialogue and input into the subsequent stage for speaker verification.
  • Currently, telephone customer service platform data has the following characteristics:
  • First, the recorded speech contains only two speakers, the agent and the customer, and it is the customer's voice whose identity awaits verification. This application therefore uses a two-class method to distinguish agent from customer.
  • Second, the two speakers' voices may be similar, but the content of their speech differs. Telephone customer service speech is mostly scripted content introducing products in the relevant field and therefore contains many professional terms, while customers answer or call mainly to ask about related issues, in relatively plain, everyday language with fewer professional terms. These professional-term keywords can therefore be used as features of the classification model to train the two-class model; this method is called "keyword matching".
  • Finally, the audio corresponding to the recognized customer text fragments is spliced into the customer voice for later speaker verification; a sketch of the classification step appears below.
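A minimal sketch of the two-class customer/agent text classification. The patent describes a deep-learning classifier over professional-term keyword features; here a TF-IDF bag of terms with logistic regression stands in to show the shape of the pipeline, and the training texts, labels, and segments are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Manually tagged transcripts: scripted, term-heavy agent speech vs.
# plain customer questions (placeholder data).
train_texts = ["this fund's annualized rate of return and redemption terms",
               "I want to ask about my account balance"]
train_labels = ["agent", "customer"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Tag each transcribed segment, then keep only the customer segments
# whose audio will be spliced for verification.
segments = ["could you explain the early-redemption penalty"]
customer_segments = [s for s in segments
                     if clf.predict([s])[0] == "customer"]
```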
  • FIG. 2 shows the principle of identifying a target from the converted text information according to an embodiment of this application, and FIG. 3 shows the corresponding process. As shown in FIG. 3, the steps of identifying the target from the converted text information are as follows:
  • S310: The system builds a search engine over the training set, extracts Chinese word segmentations of the text information through the search engine, and builds an index over these texts.
  • S320: Feed the dialogue text information into the deep-learning classification model for training, and obtain the K texts most relevant to the dialogue text information.
  • S330: Vote on the category of the dialogue text information according to the K-NN algorithm.
  • The neighbor algorithm, also called the K-Nearest Neighbor (K-NN) classification algorithm, is one of the simplest methods in data-mining classification. "K nearest neighbors" means that each sample can be represented by its k closest neighbors.
  • The core idea of the K-NN algorithm is that if the majority of a sample's k nearest neighbors in feature space belong to a certain category, the sample also belongs to that category and shares the characteristics of samples in that category. The classification decision depends only on the categories of the nearest one or few samples. Because K-NN relies mainly on a limited number of surrounding neighbors rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains cross or overlap heavily. A sketch of this retrieve-and-vote step follows.
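An illustrative retrieve-and-vote implementation of steps S310 to S330, under the assumption that TF-IDF cosine similarity stands in for the search engine and index described above; the texts and labels are placeholders.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_texts = ["our wealth-management products include the following terms",
               "how do I reset the password on my account"]
train_labels = ["agent", "customer"]

vec = TfidfVectorizer()
index = vec.fit_transform(train_texts)   # plays the role of the text index

def knn_label(segment, k=1):
    # S320: retrieve the K training texts most relevant to the segment.
    sims = cosine_similarity(vec.transform([segment]), index)[0]
    top_k = sims.argsort()[::-1][:k]
    # S330: majority vote over the labels of the K retrieved texts.
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]

print(knn_label("how do I reset the password"))   # -> "customer"
```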
  • After the text information fragments corresponding to each target to be confirmed have been obtained, the method proceeds to step S140: the voice signal segments corresponding to the target to be confirmed are obtained according to the text information fragments and spliced to obtain the target voice signal.
  • A voice signal segment here can also be understood as a voice segment: in step S120 the initial voice signal was divided into multiple voice segments, and the text information now identifies which of those segments belong to the speaker whose identity needs to be confirmed; these are spliced as sketched below.
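A small sketch of the splicing step, assuming each segment attributed to the target carries its (start, end) sample span in the original recording (an assumed representation).

```python
import numpy as np

def splice_target(signal, target_spans):
    # target_spans: [(start_sample, end_sample), ...] for segments whose
    # transcript was labeled as the target speaker (e.g. "customer").
    return np.concatenate([signal[s:e] for s, e in target_spans])

# Hypothetical usage: spans kept by the classifier are joined into one
# waveform for speaker verification.
# target_voice = splice_target(signal, [(0, 16000), (48000, 80000)])
```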
  • Finally, step S150: confirm the identity of the target to be confirmed according to the acquired target voice signal.
  • The step of confirming the target identity from the acquired target voice signal may proceed in either of two ways. The first way uses an i-vector system based on the deep neural network model DNN to confirm the identity of the target speaker, i.e. the target to be confirmed. The second way uses an i-vector system based on the Gaussian mixture model GMM for the same purpose.
  • FIG. 4 shows the principle of the first way, the DNN-based i-vector system for confirming the target speaker's identity, where DNN is a deep neural network, UBM is a universal background model, DFNN is a Dynamic Fuzzy Neural Network, LSTM is Long Short-Term Memory, and TDNN is a Time Delay Neural Network.
  • Step 1: Feature extraction, which prepares for collecting sufficient statistics, extracting i-vectors, and scoring. This process converts speech waveforms into feature vectors (common parameterizations: MFCC (Mel-frequency cepstral coefficients), LPCC (linear prediction cepstral coefficients), and PLP (perceptual linear prediction)), filters noise from the given speech signal, and retains the useful speaker information.
  • Step 2: Collect sufficient statistics. Based on VAD, zeroth-, first-, and second-order Baum-Welch statistics are computed from the sequence of feature vectors. These statistics are high-dimensional information generated from a large-scale DNN, also known as the UBM.
  • Step 3: i-vector extraction converts the above high-dimensional statistics into a single low-dimensional feature vector that contains only the discriminative information distinguishing this speaker from others.
  • Step 4: After the i-vector is extracted, scoring criteria (commonly cosine-distance similarity, LDA (Linear Discriminant Analysis), and PLDA (Probabilistic Linear Discriminant Analysis)) determine whether to accept or reject the claimed customer identity; the simplest of these is sketched below.
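A sketch of the final accept/reject decision using cosine similarity between i-vectors, the simplest of the scoring criteria listed above (an LDA or PLDA back-end would replace it in practice). The i-vector dimension and threshold are placeholders.

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    # Cosine similarity between enrollment and test i-vectors.
    return float(np.dot(enroll_ivec, test_ivec) /
                 (np.linalg.norm(enroll_ivec) * np.linalg.norm(test_ivec)))

enroll = np.random.randn(400)   # claimed speaker's enrollment i-vector
test = np.random.randn(400)     # i-vector from the spliced customer audio
THRESHOLD = 0.5                 # tuned on held-out trials in practice

decision = "accept" if cosine_score(enroll, test) >= THRESHOLD else "reject"
```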
  • FIG. 5 shows the principle of the GMM-based i-vector system for confirming the target speaker's identity, where GMM is a Gaussian mixture model and the meanings of MFCC and PLP are as explained for FIG. 4. The second way is similar to the feature extraction process of the first way and is not repeated here.
  • In summary, the speaker identification method based on speech content converts the dialogue audio into text information using automatic speech recognition, uses a deep-learning classification method to label the text as target or non-target, and finally splices the target audio segments and verifies the identity of the spliced audio. In application scenarios such as telesales, where the speech content of the customer and the agent differs, speakers can thus be identified and verified from what they say, effectively improving the accuracy of the identity verification process.
  • FIG. 6 shows the logical structure of a speaker identification system based on speech content according to an embodiment of the present application.
  • The speaker identification system 600 based on speech content provided by the present application includes a voice signal collection unit 610, a text information conversion unit 620, a text information fragment acquisition unit 630, a target voice signal acquisition unit 640, and an identity confirmation unit 650.
  • The functions implemented by the voice signal collection unit 610, text information conversion unit 620, text information fragment acquisition unit 630, target voice signal acquisition unit 640, and identity confirmation unit 650 correspond one-to-one to the steps of the speaker identification method based on speech content in the above embodiment.
  • The voice signal collection unit 610 is used to collect the initial voice signal;
  • the text information conversion unit 620 is used to convert the initial voice signal collected by the voice signal collection unit 610 into text information corresponding to the speech content through speech recognition technology;
  • the text information fragment acquisition unit 630 is configured to identify the speaker's identity according to the text information converted by the text information conversion unit 620 and obtain the text information fragment corresponding to each target to be confirmed, the speaker being one of the multiple targets to be confirmed;
  • the target voice signal acquisition unit 640 is configured to obtain the voice signal segments corresponding to the target to be confirmed according to the text information fragments acquired by the text information fragment acquisition unit 630 and splice them to obtain the target voice signal;
  • the identity confirmation unit 650 is used to confirm the identity of the target to be confirmed according to the target voice signal obtained by the target voice signal acquisition unit 640.
  • The text information conversion unit 620 further includes: a voice segment segmentation unit 621, configured to segment the initial voice signal into voice segments using the subspace Gaussian mixture model and voice activity detection technology; and a voice segment conversion unit 622, configured to convert each voice segment into text information using speech recognition technology.
  • The voice segment conversion unit 622 includes a model construction unit and a model processing unit (not shown in the figure). The model construction unit is used to construct the speech recognition model and the latency-controlled bidirectional highway long short-term memory network model LC-BHLSTM. The model processing unit is used to input each voice segment into the speech recognition model, which represents it as a multi-dimensional feature output, and to input the output of the speech recognition model into the LC-BHLSTM model to obtain the text information corresponding to each voice segment.
  • The speech recognition model constructed by the model construction unit uses 83-dimensional features, 80 of which are log-FBANK front-end features with a frame length of 25 ms and the other 3 of which are pitch features; the latency-controlled bidirectional highway long short-term memory network model it constructs has 5 layers with 1024 memory cells each, and each layer outputs a projection of 512 nodes.
  • The text information fragment acquisition unit 630 further includes: a deep-learning classification model acquisition unit 631, configured to acquire a deep-learning classification model trained on a training set, where the training set is built from a corpus; and a label assignment unit 632, configured to input the text information into the deep-learning classification model and assign corresponding labels to the text information.
  • The identity confirmation unit 650 further includes a first identity confirmation unit 651, configured to use an i-vector system based on a deep neural network model to confirm the identity of the target to be confirmed.
  • The first identity confirmation unit 651 includes a feature vector conversion unit, a high-dimensional information generation unit, a low-dimensional feature vector conversion unit, and an identity evaluation unit (not shown in the figure). The feature vector conversion unit is used to convert the speech waveform into feature vectors, from which sufficient statistics, i-vectors, and scores are subsequently derived; the high-dimensional information generation unit is used to compute zeroth-, first-, and second-order Baum-Welch statistics from the sequence of feature vectors to generate high-dimensional information; the low-dimensional feature vector conversion unit is used to convert the high-dimensional statistics into a single low-dimensional feature vector that contains only the discriminative information distinguishing this speaker from others; and the identity evaluation unit is used to decide, using preset scoring criteria, whether to accept or reject the speaker's identity information.
  • The identity confirmation unit 650 further includes a second identity confirmation unit (not shown in the figure), configured to use an i-vector system based on a Gaussian mixture model to confirm the identity of the target to be confirmed.
  • Through the voice signal collection unit, text information conversion unit, text information fragment acquisition unit, target voice signal acquisition unit, and identity confirmation unit, the speaker identification system based on speech content converts dialogue audio into text information with automatic speech recognition, labels the text as target or non-target, and finally splices the target audio segments and verifies the identity of the spliced audio. In application scenarios where speech content differs between speakers, it identifies and verifies speakers from what they say, effectively improving the accuracy of identity verification.
  • FIG. 7 is a schematic diagram of an application environment of a specific embodiment of the speaker identification method based on speech content of this application.
  • The electronic device 1 that implements the aforementioned speaker identification method based on speech content may be a terminal device with computing capability, such as a server, smartphone, tablet computer, portable computer, or desktop computer.
  • the electronic device 1 includes a processor 12, a memory 11, a network interface 14 and a communication bus 15.
  • the memory 11 includes at least one type of readable storage medium.
  • The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, or card-type memory.
  • The readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk; it may also be an external storage device of the electronic device 1, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or flash card equipped on the electronic device 1.
  • The readable storage medium of the memory 11 is generally used to store the speech-content-based speaker identification program 10 installed in the electronic device 1, and the like.
  • the memory 11 can also be used to temporarily store data that has been output or will be output.
  • The processor 12 may be a central processing unit (CPU), microprocessor, or other data processing chip, used to run the program code or process the data stored in the memory 11, for example the speech-content-based speaker identification program 10.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the communication bus 15 is used to realize the connection and communication between these components.
  • Fig. 7 only shows the electronic device 1 with components 11-15, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the electronic device 1 may also include a user interface.
  • The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or other device with a voice recognition function, and a voice output device such as a speaker or earphones.
  • the user interface may also include a standard wired interface and a wireless interface.
  • the electronic device 1 may also include a display, and the display may also be called a display screen or a display unit.
  • The display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like.
  • the display is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the electronic device 1 further includes a touch sensor.
  • the area provided by the touch sensor for the user to perform a touch operation is called a touch area.
  • the touch sensor described here may be a resistive touch sensor, a capacitive touch sensor, or the like.
  • the touch sensor includes not only a contact type touch sensor, but also a proximity type touch sensor and the like.
  • the touch sensor may be a single sensor, or may be, for example, a plurality of sensors arranged in an array.
  • the area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor.
  • The display and the touch sensor are stacked to form a touch display screen, on which the device detects touch operations triggered by the user.
  • Optionally, the electronic device 1 may also include a radio frequency (RF) circuit, sensors, an audio circuit, and the like, which are not described here again.
  • The memory 11, as a computer storage medium, may include an operating system and the speech-content-based speaker identification program 10; when the processor 12 executes the speaker identification program 10 stored in the memory 11, the steps of the aforementioned speaker identification method based on speech content are implemented.
  • The electronic device 1 proposed in the above embodiments reduces the acoustic-model modeling required and uses a two-class algorithm to improve the model's recognition performance across speaker-gender scenarios. A complete identity verification and recognition framework is proposed, which solves the problem of customer verification in single-channel, two- or multi-speaker scenarios with high speaker recognition accuracy at high speed.
  • In other embodiments, the speech-content-based speaker identification program 10 may also be divided into one or more modules, which are stored in the memory 11 and executed by the processor 12 to complete this application. A module referred to in this application is a series of computer program instruction segments capable of completing a specific function; specifically, a module here implements the steps of the aforementioned speaker identification method based on speech content, or corresponds to a unit of the aforementioned speaker identification system based on speech content.
  • An embodiment of the present application also proposes a computer-readable storage medium that includes a speech-content-based speaker identification program. When this program is executed by a processor, the operations of each step of the aforementioned speaker identification method based on speech content, or the functions of each unit of the aforementioned speaker identification system based on speech content, are realized. To avoid repetition, they are not described here again.

Abstract

A speaker identity recognition method and device based on speech content, and a storage medium. The method comprises: acquiring an initial speech signal, wherein the initial speech signal comprises speech content of at least two targets to be determined (S110); converting the initial speech signal into text information corresponding to the speech content by means of speech recognition technology (S120); recognizing the identity of a speaker according to the text information, and obtaining text information fragments corresponding to the targets to be determined, wherein the speaker is one of the targets to be determined (S130); obtaining, according to the text information fragments, speech signal segments corresponding to the targets to be determined and splicing them to obtain a target speech signal (S140); and determining, according to the target speech signal, the identity of the targets to be determined (S150). The identity of a speaker is recognized and verified on the basis of the speech content; thus, the accuracy of the identity verification process can be improved, application of the invention in telephone customer services can be achieved, and labor and material resources are saved.

Description

Speaker identification method, device, and storage medium based on speech content
This application claims priority to the Chinese patent application with application number 201910305438.3, filed on April 16, 2019, and entitled "Speaker identification method, device and storage medium based on speech content".
技术领域Technical field
本申请涉及语音信号处理技术领域,尤其涉及一种基于说话内容的说话者身份识别方法、装置及计算机可读存储介质。This application relates to the technical field of speech signal processing, and in particular to a method, device and computer-readable storage medium for speaker identification based on the content of speech.
背景技术Background technique
根据研究表明,声纹虽然不如指纹、人脸这样,个体差异明显,但是由于每个人的声道、口腔和鼻腔(发音要用到的器官)也具有个体差异性。因为反映到声音上,也是具有差异性的。就比如说,当我们在接电话的时候,通过一声"喂",我们就能准确的分辨出接电话的是谁,我们人耳作为身体的接收器生来就具有分辨声音的能力,那么我们也可以通过技术的手段,使声纹也可以向人脸、指纹那样作为“个人身份认证”的重要信息。According to research, although the voiceprint is not as good as fingerprints and faces, the individual differences are obvious, but because each person's vocal tract, oral cavity and nasal cavity (the organs used for pronunciation) also have individual differences. Because it is reflected in the sound, it is also different. For example, when we are answering the phone, we can accurately tell who is answering the phone by saying "Hello". Our human ears, as receivers of the body, are born with the ability to distinguish sounds, so we too Through technical means, voiceprints can also be used as important information for "personal identity authentication" like human faces and fingerprints.
声纹识别(Voiceprint Recognition,VPR),也称为说话人识别(Speaker Recognition),包括两类,即说话人辨认(Speaker Identification)和说话人确认(Speaker Verification)。前者用以判断某段语音是若干人中的哪一个所说的,是“多选一”问题;而后者用以确认某段语音是否是指定的某个人所说的,是“一对一判别”问题。说话人识别是给定说话者语音信息,以接受或拒绝说话者身份的过程,被广泛应用在银行系统,金融商业和语音安全控制中。Voiceprint recognition (Voiceprint Recognition, VPR), also known as speaker recognition (Speaker Recognition), includes two types, namely speaker identification and speaker verification (Speaker Verification). The former is used to determine which one of several people said a certain speech, which is a "multiple choice" question; while the latter is used to confirm whether a certain speech is spoken by a designated person, which is a "one-to-one discrimination" "problem. Speaker recognition is the process of accepting or rejecting the speaker's identity given the speaker's voice information. It is widely used in banking systems, financial commerce and voice security control.
为此,说话人识别技术逐渐发展并得到普及,尤其在安全验证、电话银行中得到广泛应用。该技术要求在单信道-单一说话者情景下应用,即输入单一客户的语音信息,能够获得较好的验证效果。在客户导向的企业中,说话人识别能够帮助客户解决紧急需要,并获得个性化服务,也可以帮助实现精准营销。但是,申请人意识到,现有业内产品多为基于说话者声纹的识别,但这种方法在对话双方性别不同时效果较好,性别相同时,效果相对差。For this reason, speaker recognition technology has gradually developed and been popularized, especially in security verification and telephone banking. This technology is required to be applied in the single-channel-single-speaker scenario, that is, to input the voice information of a single customer to obtain a better verification effect. In customer-oriented companies, speaker recognition can help customers solve urgent needs and obtain personalized services, and can also help achieve precision marketing. However, the applicant realizes that the existing products in the industry are mostly based on the speaker's voiceprint recognition, but this method works better when the two sides of the conversation are of different genders, and when the gender is the same, the effect is relatively poor.
例如,在电话客户服务平台上,在电话录音的单一信道上记录的是客户与客服的对话音频,不能够直接通过说话人验证技术对电话录音信息进行客户身份验证,导致电话客户服务效率低,浪费大量的人力物力。For example, on the telephone customer service platform, the audio of the conversation between the customer and the customer service is recorded on a single channel of the telephone recording. It is not possible to directly verify the identity of the customer on the telephone recording information through the speaker verification technology, resulting in low telephone customer service efficiency. Waste a lot of manpower and material resources.
因此,为了解决上述问题,亟需一种能够通过音频交互对说话人身份进行验证的技术。Therefore, in order to solve the above-mentioned problems, there is an urgent need for a technology that can verify the identity of the speaker through audio interaction.
发明内容Summary of the invention
本申请提供一种基于说话内容的说话者身份识别方法、装置及计算机可读存储介质,其主要目的在于通过将录制的对话音频用自动语音识别技术转换为文字信息,然后使用深度学习分类方法进行客户或客服的身份识别,最后,对客户音频片段进行拼接及对拼接后的音频片段进行身份验证,能够根据电话销售中客户与客服说话内容存在差异的应用场景,基于说话内容进行说话人识别及验证,提高身份验证过程中的准确率,实现其在电话客户服务中的应用,节省人力物力。This application provides a speaker identification method, device, and computer-readable storage medium based on the content of speech. The main purpose of the method is to convert the recorded conversation audio into text information using automatic speech recognition technology, and then use a deep learning classification method to perform Identification of the customer or customer service, and finally, splicing the customer audio clips and verifying the identity of the spliced audio clips. According to the application scenarios where there is a difference between the customer and the customer service in the telesales, the speaker identification and the speaker can be based on the speech content. Verification, improve the accuracy of the identity verification process, realize its application in telephone customer service, and save manpower and material resources.
为实现上述目的,本申请提供一种基于说话内容的说话者身份识别方法,应用于电子装置,所述方法包括:In order to achieve the above objective, this application provides a speaker identification method based on speaking content, which is applied to an electronic device, and the method includes:
采集初始语音信号,其中,所述初始语音信号包含待确认目标的说话内容;Collecting an initial voice signal, where the initial voice signal contains the speech content of the target to be confirmed;
通过语音识别技术将所述初始语音信号转换为与所述说话内容对应的文本信息;Converting the initial voice signal into text information corresponding to the speaking content through a voice recognition technology;
根据所述文本信息对说话者身份进行识别,获取与各个待确认目标对应的文本信息片段,所述说话者为所述待确认目标其中之一;Recognizing the identity of the speaker according to the text information, and obtaining text information fragments corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed;
根据文本信息片段获取与所述待确认目标对应的语音信号段并进行拼接,获取目标语音信号;Acquire and splice the voice signal segment corresponding to the target to be confirmed according to the text information segment to obtain the target voice signal;
根据所述目标语音信号对所述待确认目标的身份进行确认。The identity of the target to be confirmed is confirmed according to the target voice signal.
为实现上述目的,本申请还提供一种电子装置,该电子装置包括:存储器、处理器及摄像装置,所述存储器中包括基于说话内容的说话者身份识别程序,所述基于说话内容的说话者身份识别程序被所述处理器执行前述基于说话内容的说话者身份识别方法的步骤。In order to achieve the above object, the present application also provides an electronic device, which includes a memory, a processor, and a camera device, the memory includes a speaker identification program based on the content of the speech, and the speaker based on the content of the speech The identity recognition program is executed by the processor in the steps of the aforementioned speaker identity recognition method based on speaking content.
为实现上述目的,本申请还提供一种基于说话内容的说话者身份识别系 统,包括:In order to achieve the above objectives, this application also provides a speaker identification system based on the content of the speech, including:
语音信号采集单元,用于采集初始语音信号,其中,所述初始语音信号包含至少两个待确认目标的说话内容;A voice signal collection unit, configured to collect an initial voice signal, wherein the initial voice signal includes the speech content of at least two targets to be confirmed;
文本信息转换单元,用于通过语音识别技术将所述初始语音信号转换为与所述说话内容对应的文本信息;A text information conversion unit, configured to convert the initial voice signal into text information corresponding to the speaking content through a voice recognition technology;
文本信息片段获取单元,用于根据所述文本信息对说话者身份进行识别,获取与各个待确认目标对应的文本信息片段,所述说话者为待确认目标其中之一;The text information fragment obtaining unit is configured to identify the speaker's identity according to the text information, and obtain the text information fragment corresponding to each target to be confirmed, and the speaker is one of the targets to be confirmed;
目标语音信号获取单元,用于根据文本信息片段获取与所述待确认目标对应的语音信号段并进行拼接,获取目标语音信号;The target voice signal acquiring unit is configured to acquire and splice the voice signal segment corresponding to the target to be confirmed according to the text information segment to acquire the target voice signal;
身份确认单元,用于根据所述目标语音信号对所述待确认目标的身份进行确认。The identity confirmation unit is used to confirm the identity of the target to be confirmed according to the target voice signal.
为实现上述目的,本申请还提供一种计算机可读存储介质,所述计算机可读存储介质中包括基于说话内容的说话者身份识别程序,所述基于说话内容的说话者身份识别程序被处理器执行时,实现如上所述的基于说话内容的说话者身份识别方法的步骤。In order to achieve the above objective, the present application also provides a computer-readable storage medium, which includes a speaker identification program based on speaking content, and the speaker identification program based on speaking content is processed by a processor When executed, the steps of the speaker identification method based on the speaking content as described above are realized.
本申请提出的基于说话内容的说话者身份识别方法、装置及计算机可读存储介质,将录制的对话音频用自动语音识别技术转换为文字信息,然后使用深度学习分类方法进行目标或非目标的身份识别,最后,对目标音频片段进行拼接及对拼接后的音频片段进行身份验证,能够根据电话销售中客户与客服说话内容存在差异的应用场景,基于说话内容进行说话人识别及验证,提高身份验证过程中的准确率。The speaker identification method, device and computer readable storage medium based on the content of speech proposed in this application convert the recorded dialogue audio into text information using automatic speech recognition technology, and then use the deep learning classification method to identify the target or non-target Recognition, and finally, splicing the target audio clips and verifying the identity of the spliced audio clips. According to the application scenarios where there is a difference between the content of the customer and the customer service in telesales, the speaker identification and verification based on the content of the speech can improve the identity verification The accuracy of the process.
附图说明Description of the drawings
图1为本申请基于说话内容的说话者身份识别方法具体实施例的流程图;FIG. 1 is a flowchart of a specific embodiment of a speaker identification method based on speaking content in this application;
图2为本申请根据转换后的文本信息对目标进行身份识别的原理图;Figure 2 is a schematic diagram of the application for identifying targets based on the converted text information;
图3为2中根据转换后的文本信息对目标进行身份识别的流程图;Figure 3 is a flowchart of identifying the target according to the converted text information in 2;
图4为根据本申请实施例的基于DNN的说话人身份确认原理图;Fig. 4 is a schematic diagram of speaker identity confirmation based on DNN according to an embodiment of the present application;
图5为根据本申请实施例的基于GMM的说话人身份确认原理图;Fig. 5 is a schematic diagram of speaker identity confirmation based on GMM according to an embodiment of the present application;
图6为根据本申请实施例的基于说话内容的说话者身份识别系统的逻辑 结构;Fig. 6 is a logical structure of a speaker identification system based on speaking content according to an embodiment of the present application;
图7为本申请基于说话内容的说话者身份识别方法具体实施例的应用环境示意图。FIG. 7 is a schematic diagram of an application environment of a specific embodiment of a speaker identification method based on speaking content according to the present application.
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
具体实施方式detailed description
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。It should be understood that the specific embodiments described here are only used to explain the application, and are not used to limit the application.
实施例一Example one
本申请提供一种基于说话内容的说话者身份识别方法,应用于一种电子装置。图1为本申请基于说话内容的说话者身份识别具体实施例的流程图。执行该方法的电子装置可以由软件和/或硬件实现。This application provides a speaker identification method based on speaking content, which is applied to an electronic device. Fig. 1 is a flowchart of a specific embodiment of speaker identification based on speaking content in this application. The electronic device that executes the method can be implemented by software and/or hardware.
在图1所示的实施例中,基于说话内容的说话者身份识别方法包括如下步骤:In the embodiment shown in FIG. 1, the speaker identification method based on speaking content includes the following steps:
步骤S110,采集初始语音信号其中,其中的初始语音信号包含至少两个待确认目标的说话内容。Step S110: Collect an initial voice signal, where the initial voice signal includes the speech content of at least two targets to be confirmed.
其中,该初始语音信号为至少两个说话者的对话语音信号。此处提到的采集初始语音信号,主要是针对电话沟通过程中说话人的语音信号,如果电话沟通为只有两个人进行语音通话的情况,待确认目标为两个,当能实现多人通话时,本申请提供的基于说话内容的说话者身份识别程序也可以适用于多人通话的情形,此时初始语音信号就会包含多个待确认目标的说话内容,具体实施方案是相似的,此处不再赘述。Wherein, the initial voice signal is a dialogue voice signal of at least two speakers. The initial voice signal collection mentioned here is mainly for the voice signal of the speaker during the telephone communication. If the telephone communication is a situation where there are only two people making a voice call, the target to be confirmed is two, when a multi-person call can be realized The speaker identification program based on the speaking content provided in this application can also be applied to the situation of multi-person calls. At this time, the initial voice signal will contain the speaking content of multiple targets to be confirmed. The specific implementation schemes are similar. No longer.
另外,针对应用场景的不同,对语音信号数据的采集触发点也存在不同,例如,当基于说话内容的说话者身份识别程序安装在移动终端上时,触发语音信号数据采集的可以为设定在移动终端上的按键,或者启动按钮等。而初始语音信号就是采集到的语音信号数据,该语音信号数据即可作为后续身份识别中所需要的初始语音信号。In addition, for different application scenarios, the trigger point for the collection of voice signal data is also different. For example, when a speaker identification program based on the content of the speech is installed on a mobile terminal, the trigger point for the collection of voice signal data can be set at Buttons on the mobile terminal, or start buttons, etc. The initial voice signal is the collected voice signal data, and the voice signal data can be used as the initial voice signal required in subsequent identity recognition.
步骤S120,通过ASR(Automatic Speech Recognition,语音识别技术)将所采集的初始语音信号转换为与说话内容对应的文本信息。In step S120, the collected initial voice signal is converted into text information corresponding to the content of the speech through ASR (Automatic Speech Recognition, voice recognition technology).
作为示例,当说话者分别为客户和客服时,通过ASR语音识别技术将所采集的初始语音信号转换为对应的文本信息的步骤包括:先通过子空间高斯混合模型SGMM(Subspace Gaussian Mixture Model,SGMM)和语音活动检测VAD(Voice Activity Detection,VAD),将初始语音信号分割为多个短小的语音片段,短小的语音片段能够便于ASR对其进行文本信息转换,此处的分割参数可以根据ASR进行设定;然后,通过ASR对各语音片段分别进行文本信息转换。As an example, when the speakers are the customer and the customer service, the steps of converting the collected initial voice signal into the corresponding text information through the ASR voice recognition technology include: first passing the Subspace Gaussian Mixture Model (SGMM) ) And voice activity detection VAD (Voice Activity Detection, VAD), which divides the initial voice signal into multiple short voice fragments. The short voice fragments can facilitate ASR to convert text information to it. The segmentation parameters here can be performed according to ASR Set; then, each voice segment is converted into text information through ASR.
具体地,SGMM-VAD算法可由两个GMM(Gaussian Mixed Model,GMM)组成,分别用来描述语音/非语音对数正态分布,从混有高比例噪声信号的语音中检测语音片段。Specifically, the SGMM-VAD algorithm can be composed of two Gaussian Mixed Models (GMM), which are used to describe the speech/non-speech log-normal distribution, and detect speech fragments from speech mixed with a high proportion of noise signals.
而语音活动检测(Voice Activity Detection,VAD)又称语音端点检测或语音边界检测。目的是从声音信号流里识别和消除长时间的静音期,以达到在不降低业务质量的情况下节省话路资源的作用,它是IP电话应用的重要组成部分。静音抑制可以节省宝贵的带宽资源,可以有利于减少用户感觉到的端到端的时延。The voice activity detection (Voice Activity Detection, VAD) is also called voice endpoint detection or voice boundary detection. The purpose is to identify and eliminate the long silent period from the voice signal stream, so as to save the voice channel resources without reducing the service quality. It is an important part of the IP phone application. Silence suppression can save valuable bandwidth resources and can help reduce the end-to-end delay felt by users.
通过ASR对各语音片段进行转换处理的步骤可以包括:The steps of converting each voice segment through ASR may include:
第一:构建ASR模型,ASR模型包含共83维特征,其中80维为log FBANK的前端特征,帧长25ms,另外3维为音高特征(包含POV主元特征的概率)。同时,创建LC-BHLSTM(Latency-controlled Bidirectional Highway Long Short-Term Memory,延迟控制的双向高速长短期记忆网络)模型,该LC-BHLSTM模型共有5层,1024个存储单元,每层输出有512个节点的投影。First: Construct an ASR model. The ASR model contains a total of 83-dimensional features, of which 80-dimensional is the front-end feature of log FBANK, the frame length is 25ms, and the other 3 is the pitch feature (including the probability of the POV principal element feature). At the same time, the LC-BHLSTM (Latency-controlled Bidirectional Highway Long Short-Term Memory) model was created. The LC-BHLSTM model has 5 layers, 1024 storage units, and each layer has 512 outputs. The projection of the node.
第二,将上述分割后的各语音片段输入ASR模型中,通过ASR模型将各语音片段表示为多维特征输出,具体可以为83维特征输出。然后,将ASR模型的输出信号输入LC-BHLSTM模型中,LC-BHLSTM模型的输出目标值是10k维上下文相关的三音素状态(又名:句音),最终完成语音片段至对话文本信息的转换。Second, input the segmented speech segments into the ASR model, and express each speech segment as a multi-dimensional feature output through the ASR model, which can specifically be an 83-dimensional feature output. Then, input the output signal of the ASR model into the LC-BHLSTM model. The output target value of the LC-BHLSTM model is a 10k-dimensional context-dependent triphone state (also known as sentence sound), and finally complete the conversion of speech fragments to dialogue text information .
其中,LSTM(Long Short-Term Memory长短期记忆网络)是一种时间递归神经网络,适合于处理和预测时间序列中间隔和延迟相对较长的事件。Among them, LSTM (Long Short-Term Memory) is a time recurrent neural network, suitable for processing and predicting events with relatively long intervals and delays in time series.
S130:根据上述文本信息对待确认目标或者说话者身份进行识别,获取 与各个待确认目标对应的文本信息片段,其中的说话者为多个待确认目标其中之一。S130: Identify the target to be confirmed or the identity of the speaker according to the above text information, and obtain the text information fragment corresponding to each target to be confirmed, where the speaker is one of the targets to be confirmed.
其中,根据文本信息对说话者身份进行识别的步骤可以包括:Among them, the step of identifying the speaker's identity according to the text information may include:
第一:获取基于训练集训练形成的深度学习分类模型,其中的训练集基于语料库组建而成;First: Obtain a deep learning classification model based on training set training, where the training set is formed based on a corpus;
第二:将文本信息输入到深度学习分类模型中,以对文本信息分配对应的标签。Second: input text information into the deep learning classification model to assign corresponding labels to the text information.
进一步地,根据文本信息对说话者身份进行识别的步骤还可以包括:Further, the step of identifying the speaker's identity according to the text information may further include:
1.基于语料库组建训练集;其中,在训练阶段手动标记“目标”、“非目标”标签来组建训练集。1. Build a training set based on a corpus; among them, manually mark the "target" and "non-target" labels in the training phase to build the training set.
2.基于训练集训练形成深度学习分类模型;2. Form a deep learning classification model based on training set training;
3.将文本信息输入训练好的深度学习分类模型中,对文本信息分配“目标”或“非目标”的标签。3. Input the text information into the trained deep learning classification model, and assign the label of "target" or "non-target" to the text information.
具体地,基于语料库组建训练集,在训练阶段手动标记“客户”/“客服”(即“目标”/“非目标”)标签来组建训练集,进而训练形成深度学习分类模型,将对话文本信息输入所述深度学习分类模型,对文本片段分配“客户”和“客服”的标签。最后,将各段被识别的客户文字数据找到对应的客户语音信息,并拼接成客户语音。Specifically, the training set is constructed based on the corpus, and the "customer"/"customer service" (ie "target"/"non-target") tags are manually marked during the training phase to form the training set, and then the deep learning classification model is trained to form the dialogue text information Input the deep learning classification model, and assign the tags of "customer" and "customer service" to the text segment. Finally, find the corresponding customer voice information from the recognized customer text data, and splice them into customer voice.
In identifying the speaker identity, the quality of the customer voice is critical, so the customer voice must be extracted in full from the customer/customer-service dialogue speech before being fed into the subsequent deep learning classification model for speaker verification.
At present, telephone customer service platform data has the following characteristics:
First, the recorded speech contains only two speakers, the customer service agent and the customer, and it is the customer voice whose identity awaits verification. This application therefore adopts a binary classification method to distinguish agent speech from customer speech.
Second, the two speakers may sound similar, but what they say differs. Telephone customer service mostly follows set content, introducing products in the relevant field, and thus contains many professional terms, whereas customers answering or calling in mainly ask questions about related issues in relatively plain, everyday language with few professional terms. These professional-term keywords can therefore serve as features for training the binary classification model, a method known as "keyword matching" (sketched below). Finally, the recognized customer text of each fragment is spliced into the customer voice for later speaker verification.
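A rough illustration of the keyword-matching idea follows; the domain-term list is hypothetical, standing in for professional terms that would in practice be curated from the agent-side scripts of the service platform:

```python
# Hypothetical domain-term list; in practice curated from agent scripts.
DOMAIN_TERMS = {"premium", "underwriting", "deductible", "rider"}

def keyword_score(segment: str) -> float:
    """Fraction of words in a transcript segment that are domain terms."""
    words = segment.lower().split()
    hits = sum(w in DOMAIN_TERMS for w in words)
    return hits / max(len(words), 1)

# Low score -> plain language, likely customer; high score -> likely agent
print(keyword_score("what does the deductible mean"))
```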
The main working principle of the above process is shown in FIG. 2 and FIG. 3: FIG. 2 illustrates the principle of identifying a target from the converted text information according to an embodiment of this application, and FIG. 3 illustrates the corresponding flow. As shown in FIG. 3, the steps of identifying a target from the converted text information are as follows:
S310: The system builds a search engine from the training set, extracts Chinese word segmentations of the text information through the search engine, and builds an index over these texts.
S320: The dialogue text information is fed into the deep learning classification model, and the K texts most relevant to the dialogue text information are retrieved.
S330: The category of the dialogue text information is decided by a vote according to the K-NN algorithm.
Here, the K-nearest-neighbor (K-NN) classification algorithm is one of the simplest methods in data-mining classification. "K nearest neighbors" means exactly that: each sample can be represented by its k closest neighbors.
The core idea of K-NN is that if most of a sample's k nearest neighbors in feature space belong to a certain category, the sample also belongs to that category and takes on the characteristics of the samples in it. The classification decision depends only on the categories of the one or few nearest samples, so a category decision involves only a very small number of neighbors. Because K-NN determines category membership from a limited set of nearby samples rather than by discriminating class regions, it is better suited than other methods to sample sets whose class regions cross or overlap heavily.
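A minimal sketch of the retrieval-and-vote decision in S320-S330, assuming the index has already returned the labels of the K training texts most relevant to a dialogue segment:

```python
from collections import Counter

def knn_vote(neighbor_labels):
    """Majority vote over the labels of the K retrieved texts."""
    return Counter(neighbor_labels).most_common(1)[0][0]

# Suppose the index returned K=5 nearest texts with these labels
print(knn_vote(["customer", "customer", "agent", "customer", "agent"]))
# -> "customer"
```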
After the text information fragments corresponding to each target to be confirmed have been obtained, the flow proceeds to step S140: obtain the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splice them to obtain the target voice signal.
The voice signal segment here can also be understood as a speech segment: the initial voice signal was divided into multiple speech segments before the pieces of text information corresponding to the target to be confirmed were obtained, so once each piece of text information is available, the corresponding voice signal segment can be identified from it. That segment is the voice signal of the speaker whose identity is to be confirmed.
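A minimal sketch of the splicing in S140, assuming each speech segment is a 1-D waveform array aligned with the label assigned to its transcript:

```python
import numpy as np

def splice_target_audio(segments, labels, target="target"):
    """Concatenate, in order, the waveforms of segments labeled target."""
    kept = [seg for seg, lab in zip(segments, labels) if lab == target]
    return np.concatenate(kept) if kept else np.array([])

# segments: list of 1-D waveform arrays aligned with per-segment labels
# target_signal = splice_target_audio(segments, labels)
```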
After the target voice signal is obtained, the flow proceeds to step S150: confirm the identity of the target to be confirmed according to the obtained target voice signal.
The step of confirming the target identity according to the obtained target voice signal may proceed in either of two ways:
Way 1 uses an i-vector system based on a deep neural network (DNN) model to confirm the identity of the target speaker, i.e. the target to be confirmed; way 2 uses an i-vector system based on a Gaussian mixture model (GMM) for the same purpose.
FIG. 4 shows the principle by which the DNN-based i-vector system of way 1 confirms the identity of the target speaker, where DNN denotes a deep neural network, UBM a universal background model, DFNN a dynamic fuzzy neural network, LSTM a long short-term memory network and TDNN a time-delay neural network. Based on the principle shown in FIG. 4, the process by which the DNN-based i-vector system of way 1 confirms the identity of the target speaker mainly includes the following steps:
Step 1: Feature extraction, which collects sufficient statistics, extracts i-vectors and applies a scoring criterion. This step converts the speech waveform into feature vectors (common parameterizations: MFCC (Mel-frequency cepstral coefficients), LPCC (linear prediction cepstral coefficients) and PLP (perceptual linear prediction)), filtering noise out of the given speech signal while retaining the useful speaker information.
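By way of illustration, MFCC extraction with a common audio toolkit might look as follows; this is a sketch only, since the text does not prescribe a toolkit, and LPCC or PLP features could be substituted:

```python
import librosa

# Load an utterance and compute 20 MFCCs per frame
y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape: (20, n_frames)
```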
Step 2: Collecting sufficient statistics based on VAD technology means computing 0th-, 1st- and 2nd-order Baum-Welch statistics from the series of feature vectors. These statistics are high-dimensional information generated from a large-scale DNN, which here plays the role of the UBM.
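Assuming per-frame posteriors over the UBM components (or DNN senones) are available, the zeroth- and first-order Baum-Welch statistics reduce to simple weighted sums, sketched here:

```python
import numpy as np

def baum_welch_stats(feats, post):
    """Zeroth/first-order Baum-Welch statistics.

    feats: (T, D) frame features; post: (T, C) per-frame posteriors
    over the C UBM components (or DNN senones). Second-order stats
    would additionally accumulate post.T @ (feats ** 2).
    """
    N = post.sum(axis=0)   # zeroth order, shape (C,)
    F = post.T @ feats     # first order, shape (C, D)
    return N, F
```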
Step 3: i-vector extraction converts the above high-dimensional statistics into a single low-dimensional feature vector containing only the discriminative feature information that distinguishes this speaker from others.
Step 4: After the i-vector is extracted, a scoring criterion (common choices: cosine distance similarity, LDA (linear discriminant analysis) and PLDA (probabilistic linear discriminant analysis)) decides whether to accept or reject the customer identity information.
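A sketch of the simplest scoring criterion named above, cosine similarity between an enrollment i-vector and a test i-vector; the threshold is an assumed placeholder that would in practice be tuned on held-out data:

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    """Cosine similarity between two i-vectors."""
    num = float(enroll_ivec @ test_ivec)
    den = np.linalg.norm(enroll_ivec) * np.linalg.norm(test_ivec)
    return num / den

THRESHOLD = 0.5  # assumed value, not from the disclosure
accept = cosine_score(np.random.randn(400), np.random.randn(400)) > THRESHOLD
```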
The principle by which the GMM-based i-vector system of way 2 confirms the identity of the target speaker resembles the feature extraction process of way 1 and is not repeated here. FIG. 5 shows this principle, where GMM denotes a Gaussian mixture model; the meanings of MFCC and PLP are as explained for FIG. 4.
The speaker identification method based on speech content provided by the embodiments of this application converts the dialogue audio into text information with automatic speech recognition, uses a deep learning classification method to label that text as target or non-target, and finally splices the target audio fragments and verifies the identity of the spliced audio. In application scenarios such as telemarketing, where the customer's and the agent's speech content differs, performing speaker identification and verification on the basis of speech content effectively improves the accuracy of the identity verification process.
It should be understood that the magnitude of the step numbers in the above embodiments does not imply an execution order; the execution order of each process is determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of this application.
Embodiment 2
Corresponding to the above method, this application further provides a speaker identification system based on speech content. FIG. 6 shows the logical structure of such a system according to an embodiment of this application.
As shown in FIG. 6, the speaker identification system 600 based on speech content provided by this application includes a voice signal collection unit 610, a text information conversion unit 620, a text information fragment acquisition unit 630, a target voice signal acquisition unit 640 and an identity confirmation unit 650. The functions implemented by these units correspond one to one with the steps of the speaker identification method based on speech content in the above embodiment.
Specifically, the voice signal collection unit 610 collects the initial voice signal; the text information conversion unit 620 converts the initial voice signal collected by unit 610 into text information corresponding to the speech content through voice recognition technology; the text information fragment acquisition unit 630 identifies the speaker identity from the text information produced by unit 620 and obtains the text information fragments corresponding to each target to be confirmed, where the speaker is one of the multiple targets to be confirmed; the target voice signal acquisition unit 640 obtains the voice signal segments corresponding to the target to be confirmed from the text information fragments obtained by unit 630 and splices them into the target voice signal; and the identity confirmation unit 650 confirms the identity of the target to be confirmed according to the target voice signal obtained by unit 640.
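For orientation only, the unit structure of system 600 can be pictured as a five-stage pipeline; the class and method names below merely mirror units 610-650 and are illustrative, not part of the disclosure:

```python
class SpeakerIDSystemSketch:
    """Illustrative skeleton mirroring units 610-650 of system 600."""

    def collect(self, audio):          # voice signal collection unit 610
        return audio

    def to_text(self, signal):         # text information conversion unit 620
        raise NotImplementedError      # ASR + LC-BHLSTM (see Embodiment 1)

    def classify(self, segments):      # text fragment acquisition unit 630
        raise NotImplementedError      # deep learning classifier

    def splice(self, signal, labels):  # target voice acquisition unit 640
        raise NotImplementedError      # concatenate the target segments

    def verify(self, voice):           # identity confirmation unit 650
        raise NotImplementedError      # i-vector scoring
```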
In a preferred embodiment of this application, the text information conversion unit 620 further includes:
a voice segment division unit 621, configured to divide the initial voice signal into voice segments by means of a subspace Gaussian mixture model and voice activity detection technology; and
a voice segment conversion unit 622, configured to convert each voice segment into text information through voice recognition technology.
Further, the voice segment conversion unit 622 includes a model construction unit and a model processing unit (not shown in the figure). The model construction unit constructs the voice recognition model and the latency-controlled bidirectional highway long short-term memory network model LC-BHLSTM; the model processing unit feeds each voice segment into the voice recognition model, which represents each segment as a multi-dimensional feature output, and then feeds the output signal of the voice recognition model into the LC-BHLSTM model to obtain the text information corresponding to each voice segment.
The voice recognition model constructed by the model construction unit contains 83-dimensional features, of which 80 dimensions are log-FBANK front-end features with a 25 ms frame length and the remaining 3 dimensions are pitch features; the latency-controlled bidirectional highway long short-term memory network model it constructs has 5 layers of 1024 memory cells each, with every layer outputting a 512-node projection.
In a preferred embodiment of this application, the text information fragment acquisition unit 630 further includes:
a deep learning classification model acquisition unit 631, configured to obtain a deep learning classification model trained on a training set, where the training set is built from a corpus; and
a label assignment unit 632, configured to input the text information into the deep learning classification model and assign corresponding labels to the text information.
In another preferred embodiment of this application, the identity confirmation unit 650 further includes:
a first identity confirmation unit 651, configured to confirm the identity of the target to be confirmed using an i-vector system based on a deep neural network model. The first identity confirmation unit 651 includes a feature vector conversion unit, a high-dimensional information generation unit, a low-dimensional feature vector conversion unit and an identity evaluation unit (not shown in the figure).
The feature vector conversion unit extracts i-vectors and a scoring criterion based on statistics, to convert the speech waveform into feature vectors; the high-dimensional information generation unit computes 0th-, 1st- and 2nd-order Baum-Welch statistics from the series of feature vectors to generate high-dimensional information; the low-dimensional feature vector conversion unit converts the high-dimensional statistics into a single low-dimensional feature vector containing only the discriminative feature information distinguishing this speaker from others; and the identity evaluation unit applies a preset scoring criterion to decide whether to accept or reject the speaker's identity information.
In yet another preferred embodiment of this application, the identity confirmation unit 650 further includes a second identity confirmation unit (not shown in the figure), configured to confirm the identity of the target to be confirmed using an i-vector system based on a Gaussian mixture model.
Through the voice signal collection unit, text information conversion unit, text information fragment acquisition unit, target voice signal acquisition unit and identity confirmation unit, the speaker identification system based on speech content provided by this embodiment converts the dialogue audio into text information with automatic speech recognition, labels that text as target or non-target, and finally splices the target audio fragments and verifies the identity of the spliced audio. In application scenarios where the speech content differs between speakers, performing speaker identification and verification on the basis of the speech content effectively improves the accuracy of identity verification.
Embodiment 3
FIG. 7 is a schematic diagram of the application environment of a specific embodiment of the speaker identification method based on speech content of this application. As shown in FIG. 7, the electronic device 1 implementing the above method may be a terminal device with computing capability, such as a server, smartphone, tablet computer, portable computer or desktop computer.
The electronic device 1 includes a processor 12, a memory 11, a network interface 14 and a communication bus 15.
The memory 11 includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card or card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, for example its hard disk. In other embodiments, it may be an external memory of the electronic device 1, for example a plug-in hard disk, smart media card (SMC), secure digital (SD) card or flash card with which the electronic device 1 is equipped.
In this embodiment, the readable storage medium of the memory 11 is generally used to store the speech-content-based speaker identification program 10 installed on the electronic device 1, and the like. The memory 11 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), microprocessor or other data processing chip, used to run the program code or process the data stored in the memory 11, for example the speech-content-based speaker identification program 10.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish communication connections between the electronic device 1 and other electronic devices.
The communication bus 15 realizes the connections and communication among these components.
FIG. 7 shows only the electronic device 1 with components 11-15, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further include a user interface, which may include an input unit such as a keyboard, a voice input device with voice recognition capability such as a microphone, and a voice output device such as a speaker or headphones; optionally, the user interface may also include a standard wired interface and a wireless interface.
Optionally, the electronic device 1 may further include a display, also called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch liquid crystal display or an organic light-emitting diode (OLED) touch display. The display shows the information processed in the electronic device 1 and presents a visual user interface.
Optionally, the electronic device 1 further includes a touch sensor. The area that the touch sensor provides for the user's touch operations is called the touch area. The touch sensor described here may be a resistive touch sensor, a capacitive touch sensor or the like, and includes not only contact touch sensors but also proximity touch sensors. Moreover, the touch sensor may be a single sensor or multiple sensors arranged, for example, in an array.
In addition, the area of the display of the electronic device 1 may be the same as or different from that of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen, on the basis of which the device detects the touch operations triggered by the user.
Optionally, the electronic device 1 may also include a radio frequency (RF) circuit, sensors, an audio circuit and so on, which are not detailed here.
In the device embodiment shown in FIG. 7, the memory 11, a computer storage medium, may contain an operating system and the speech-content-based speaker identification program 10; when executing the program 10 stored in the memory 11, the processor 12 implements the steps of the speaker identification method based on speech content described above.
Compared with previous voiceprint recognition algorithms, the electronic device 1 proposed in the above embodiment reduces the need for acoustic model building and uses a binary classification algorithm to improve recognition in scenarios where the speakers differ in gender. Moreover, the identity verification framework proposed as a whole solves the customer verification problem in single-channel, two- or multi-speaker scenarios, with high speaker recognition accuracy and fast speed.
In other embodiments, the speech-content-based speaker identification program 10 may also be divided into one or more modules stored in the memory 11 and executed by the processor 12 to complete this application. A module in this application refers to a series of computer program instruction segments capable of completing a specific function; here, the modules are those implementing the steps of the aforementioned speaker identification method based on speech content, or equivalently those corresponding to the units of the aforementioned speaker identification system based on speech content.
Embodiment 4
In addition, an embodiment of this application further proposes a computer-readable storage medium containing a speech-content-based speaker identification program which, when executed by a processor, implements the operations of each step of the aforementioned speaker identification method based on speech content, or the functions of each unit of the aforementioned speaker identification system based on speech content. To avoid repetition, they are not detailed again here.
It should be noted that, in this document, the terms "comprise" and "include" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, device, article or method that comprises it.
The serial numbers of the above embodiments of this application are for description only and do not represent the relative merits of the embodiments. From the description of the above implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general hardware platform, and of course also by hardware, though in many cases the former is preferable. On this understanding, the technical solution of this application, in essence or in the part contributing over the prior art, can be embodied as a software product stored on a storage medium as described above (such as ROM/RAM, magnetic disk or optical disc) and including several instructions that cause a terminal device (which may be a mobile phone, computer, server, network device or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. A speaker identification method based on speech content, applied to an electronic device, wherein the method comprises:
    collecting an initial voice signal, wherein the initial voice signal contains the speech content of at least two targets to be confirmed;
    converting the initial voice signal into text information corresponding to the speech content through voice recognition technology;
    identifying the speaker identity according to the text information and obtaining text information fragments corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed;
    obtaining the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splicing them to obtain a target voice signal; and
    confirming the identity of the target to be confirmed according to the target voice signal.
  2. The speaker identification method based on speech content according to claim 1, wherein converting the initial voice signal into text information corresponding to the speech content through voice recognition technology comprises:
    dividing the initial voice signal into voice segments by means of a subspace Gaussian mixture model and voice activity detection technology; and
    converting each voice segment into text information through voice recognition technology.
  3. The speaker identification method based on speech content according to claim 2, wherein the step of converting each voice segment into text information through voice recognition technology comprises:
    constructing a voice recognition model and a latency-controlled bidirectional highway long short-term memory network model LC-BHLSTM;
    inputting each voice segment into the voice recognition model for processing, the voice recognition model representing each voice segment as a multi-dimensional feature output; and
    inputting the output signal of the voice recognition model into the LC-BHLSTM model for processing to obtain the text information corresponding to each voice segment.
  4. The speaker identification method based on speech content according to claim 2, wherein
    the constructed voice recognition model contains 83-dimensional features, of which 80 dimensions are log-FBANK front-end features with a 25 ms frame length and the remaining 3 dimensions are pitch features; and
    the constructed latency-controlled bidirectional highway long short-term memory network model has 5 layers of 1024 memory cells each, with every layer outputting a 512-node projection.
  5. The speaker identification method based on speech content according to claim 1, wherein the step of identifying the speaker identity according to the text information comprises:
    obtaining a deep learning classification model trained on a training set, wherein the training set is built from a corpus; and
    inputting the text information into the deep learning classification model and assigning corresponding labels to the text information.
  6. The speaker identification method based on speech content according to claim 5, wherein
    the training set is built with manually marked "target" and "non-target" labels; and
    the text information is input into the deep learning classification model, which assigns a "target" or "non-target" label to the text information.
  7. The speaker identification method based on speech content according to claim 1, wherein the step of identifying the speaker identity according to the text information comprises:
    building a search engine from the training set, extracting Chinese word segmentations of the text information through the search engine, and building an index over the text information;
    feeding the text information into the deep learning classification model for training and obtaining the K texts most relevant to the text information; and
    voting on the category of the text information according to the K-NN algorithm.
  8. The speaker identification method based on speech content according to claim 1, wherein the step of confirming the identity of the target to be confirmed according to the target voice signal comprises:
    confirming the identity of the target to be confirmed using an i-vector system based on a deep neural network model; or
    confirming the identity of the target to be confirmed using an i-vector system based on a Gaussian mixture model.
  9. The speaker identification method based on speech content according to claim 8, wherein the step of confirming the identity of the target to be confirmed using an i-vector system based on a deep neural network model comprises:
    extracting i-vectors and a scoring criterion based on statistics, to convert the speech waveform into feature vectors;
    computing 0th-, 1st- and 2nd-order Baum-Welch statistics from the series of feature vectors to generate high-dimensional information;
    converting the high-dimensional statistics into a single low-dimensional feature vector containing only the discriminative feature information distinguishing this speaker from others; and
    applying a preset scoring criterion to decide whether to accept or reject the speaker's identity information.
  10. The speaker identification method based on speech content according to claim 9, wherein the preset scoring criterion includes: cosine distance similarity, linear discriminant analysis, and probabilistic linear discriminant analysis.
  11. An electronic device, comprising a memory, a processor and a camera, the memory containing a speech-content-based speaker identification program which, when executed by the processor, implements the steps of the speaker identification method based on speech content according to any one of claims 1 to 10.
  12. A speaker identification system based on speech content, comprising:
    a voice signal collection unit, configured to collect an initial voice signal, wherein the initial voice signal contains the speech content of at least two targets to be confirmed;
    a text information conversion unit, configured to convert the initial voice signal into text information corresponding to the speech content through voice recognition technology;
    a text information fragment acquisition unit, configured to identify the speaker identity according to the text information and obtain text information fragments corresponding to each target to be confirmed, the speaker being one of the targets to be confirmed;
    a target voice signal acquisition unit, configured to obtain the voice signal segments corresponding to the target to be confirmed according to the text information fragments and splice them to obtain a target voice signal; and
    an identity confirmation unit, configured to confirm the identity of the target to be confirmed according to the target voice signal.
  13. The speaker identification system based on speech content according to claim 12, wherein the text information conversion unit comprises:
    a voice segment division unit, configured to divide the initial voice signal into voice segments by means of a subspace Gaussian mixture model and voice activity detection technology; and
    a voice segment conversion unit, configured to convert each voice segment into text information through voice recognition technology.
  14. The speaker identification system based on speech content according to claim 13, wherein the voice segment conversion unit comprises:
    a model construction unit, configured to construct a voice recognition model and a latency-controlled bidirectional highway long short-term memory network model LC-BHLSTM; and
    a model processing unit, configured to input each voice segment into the voice recognition model for processing, the voice recognition model representing each voice segment as a multi-dimensional feature output, and to input the output signal of the voice recognition model into the LC-BHLSTM model for processing to obtain the text information corresponding to each voice segment.
  15. The speaker identification system based on speech content according to claim 14, wherein
    the voice recognition model constructed by the model construction unit contains 83-dimensional features, of which 80 dimensions are log-FBANK front-end features with a 25 ms frame length and the remaining 3 dimensions are pitch features; and
    the latency-controlled bidirectional highway long short-term memory network model constructed by the model construction unit has 5 layers of 1024 memory cells each, with every layer outputting a 512-node projection.
  16. The speaker identification system based on speech content according to claim 12, wherein the text information fragment acquisition unit comprises:
    a deep learning classification model acquisition unit, configured to obtain a deep learning classification model trained on a training set, wherein the training set is built from a corpus; and
    a label assignment unit, configured to input the text information into the deep learning classification model and assign corresponding labels to the text information.
  17. The speaker identification system based on speech content according to claim 12, wherein the identity confirmation unit comprises:
    a first identity confirmation unit, configured to confirm the identity of the target to be confirmed using an i-vector system based on a deep neural network model.
  18. The speaker identification system based on speech content according to claim 17, wherein the first identity confirmation unit comprises:
    a feature vector conversion unit, configured to extract i-vectors and a scoring criterion based on statistics, to convert the speech waveform into feature vectors;
    a high-dimensional information generation unit, configured to compute 0th-, 1st- and 2nd-order Baum-Welch statistics from the series of feature vectors to generate high-dimensional information;
    a low-dimensional feature vector conversion unit, configured to convert the high-dimensional statistics into a single low-dimensional feature vector containing only the discriminative feature information distinguishing this speaker from others; and
    an identity evaluation unit, configured to apply a preset scoring criterion to decide whether to accept or reject the speaker's identity information.
  19. The speaker identification system based on speech content according to claim 12, wherein the identity confirmation unit comprises:
    a second identity confirmation unit, configured to confirm the identity of the target to be confirmed using an i-vector system based on a Gaussian mixture model.
  20. A computer-readable storage medium, wherein the computer-readable storage medium contains a speech-content-based speaker identification program which, when executed by a processor, implements the steps of the speaker identification method based on speech content according to any one of claims 1 to 10.
PCT/CN2019/117903 2019-04-16 2019-11-13 Speaker identity recognition method and device based on speech content, and storage medium WO2020211354A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910305438.3 2019-04-16
CN201910305438.3A CN110136727B (en) 2019-04-16 2019-04-16 Speaker identification method, device and storage medium based on speaking content

Publications (1)

Publication Number Publication Date
WO2020211354A1 true WO2020211354A1 (en) 2020-10-22

Family

ID=67570149

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117903 WO2020211354A1 (en) 2019-04-16 2019-11-13 Speaker identity recognition method and device based on speech content, and storage medium

Country Status (2)

Country Link
CN (1) CN110136727B (en)
WO (1) WO2020211354A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136727B (en) * 2019-04-16 2024-04-16 平安科技(深圳)有限公司 Speaker identification method, device and storage medium based on speaking content
CN110517667A (en) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 A kind of method of speech processing, device, electronic equipment and storage medium
CN112837672B (en) * 2019-11-01 2023-05-09 北京字节跳动网络技术有限公司 Method and device for determining conversation attribution, electronic equipment and storage medium
CN110931023B (en) * 2019-11-29 2022-08-19 厦门快商通科技股份有限公司 Gender identification method, system, mobile terminal and storage medium
CN111144091B (en) * 2019-12-02 2024-04-05 支付宝(杭州)信息技术有限公司 Customer service member determination method and device and group member identification determination method
CN111089245A (en) * 2019-12-23 2020-05-01 宁波飞拓电器有限公司 Multipurpose energy-saving fire-fighting emergency lamp
CN111128223B (en) * 2019-12-30 2022-08-05 科大讯飞股份有限公司 Text information-based auxiliary speaker separation method and related device
CN111243595B (en) * 2019-12-31 2022-12-27 京东科技控股股份有限公司 Information processing method and device
CN111405122B (en) * 2020-03-18 2021-09-24 苏州科达科技股份有限公司 Audio call testing method, device and storage medium
CN111508505B (en) * 2020-04-28 2023-11-03 讯飞智元信息科技有限公司 Speaker recognition method, device, equipment and storage medium
CN111539221B (en) * 2020-05-13 2023-09-12 北京焦点新干线信息技术有限公司 Data processing method and system
CN112182197A (en) * 2020-11-09 2021-01-05 北京明略软件系统有限公司 Method, device and equipment for recommending dialect and computer readable medium
CN112397057A (en) * 2020-12-01 2021-02-23 平安科技(深圳)有限公司 Voice processing method, device, equipment and medium based on generation countermeasure network
CN113051426A (en) * 2021-03-18 2021-06-29 深圳市声扬科技有限公司 Audio information classification method and device, electronic equipment and storage medium
CN113051902A (en) * 2021-03-30 2021-06-29 上海思必驰信息科技有限公司 Voice data desensitization method, electronic device and computer-readable storage medium
CN113792140A (en) * 2021-08-12 2021-12-14 南京星云数字技术有限公司 Text processing method and device and computer readable storage medium
CN114299957A (en) * 2021-11-29 2022-04-08 北京百度网讯科技有限公司 Voiceprint separation method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100138040A1 (en) * 2007-01-18 2010-06-03 Korea Institute Of Science And Technology Apparatus for detecting user and method for detecting user by the same
CN102456345A (en) * 2010-10-19 2012-05-16 盛乐信息技术(上海)有限公司 Concatenated speech detection system and method
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system
CN107657947A (en) * 2017-09-20 2018-02-02 百度在线网络技术(北京)有限公司 Method of speech processing and its device based on artificial intelligence
CN110136727A (en) * 2019-04-16 2019-08-16 平安科技(深圳)有限公司 Speaker's personal identification method, device and storage medium based on speech content

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680602A (en) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 Voice fraud recognition methods, device, terminal device and storage medium
CN108831485B (en) * 2018-06-11 2021-04-23 东北师范大学 Speaker identification method based on spectrogram statistical characteristics
CN108877809B (en) * 2018-06-29 2020-09-22 北京中科智加科技有限公司 Speaker voice recognition method and device
CN109273012B (en) * 2018-09-06 2023-01-31 河海大学 Identity authentication method based on speaker recognition and digital voice recognition

Also Published As

Publication number Publication date
CN110136727A (en) 2019-08-16
CN110136727B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
WO2020211354A1 (en) Speaker identity recognition method and device based on speech content, and storage medium
US11636860B2 (en) Word-level blind diarization of recorded calls with arbitrary number of speakers
US11367450B2 (en) System and method of diarization and labeling of audio data
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
CN108074576A (en) Inquest the speaker role's separation method and system under scene
CN108735200A (en) A kind of speaker's automatic marking method
Pao et al. A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition
KR102113879B1 (en) The method and apparatus for recognizing speaker's voice by using reference database
Sawakare et al. Speech recognition techniques: a review
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN113744742B (en) Role identification method, device and system under dialogue scene
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
Madhusudhana Rao et al. Machine hearing system for teleconference authentication with effective speech analysis
CN117133273A (en) Voice classification method, device, electronic equipment and storage medium
CN114121023A (en) Speaker separation method, speaker separation device, electronic equipment and computer readable storage medium
CN115567642A (en) Monitoring method and device for crowdsourcing customer service, computer equipment and storage medium
Sze LYU 0202 Advanced Audio Information Retrieval System
Ambika et al. Vector Quantization in Language Independent Speaker Identification Using Mel-Frequency Cepstrum Co-efficient

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19925310

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19925310

Country of ref document: EP

Kind code of ref document: A1