CN110880327B - Audio signal processing method and device

Info

Publication number: CN110880327B
Application number: CN201911038804.XA
Authority: CN (China)
Prior art keywords: audio signal; phoneme; recognition model; voiceprint recognition; coverage
Legal status: Active (assumed; not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN110880327A
Inventors: 王健宗; 吴冀平; 彭俊清
Assignee (original and current): Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
PCT filing: PCT/CN2019/118445 (WO2021082084A1)
Publication of application: CN110880327A
Publication of grant: CN110880327B

Classifications

    • G10L17/00 Speaker identification or verification techniques (G — Physics; G10 — Musical instruments; acoustics; G10L — Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding)
    • G10L17/02 Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an audio signal processing method and device. The method comprises the following steps: acquiring a first audio signal to be processed; determining at least one phoneme contained in the first audio signal; calculating the phoneme coverage of the at least one phoneme; and, if the phoneme coverage meets a target condition, updating a first voiceprint recognition model to a second voiceprint recognition model. With this technical scheme, the final voiceprint recognition model is guaranteed to be trained on an audio signal that more completely reflects the pronunciation characteristics of the user, improving the accuracy of speaker recognition by the voiceprint recognition model.

Description

Audio signal processing method and device
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and apparatus for processing an audio signal.
Background
During voiceprint registration, a voiceprint recognition model must be trained on the user's registration voice; the trained voiceprint recognition model is then used in subsequent speaker recognition to determine whether a speaker is the user corresponding to the model.
At present, in existing text-independent voiceprint recognition systems, a user performing voiceprint registration may speak arbitrarily, provided the speech duration exceeds a preset threshold. However, the system cannot guarantee that this registration voice fully reflects the user's pronunciation characteristics. When the registration voice does not completely reflect the user's pronunciation characteristics, yet the voiceprint recognition model trained on it is still used for speaker recognition, the ability to distinguish individual differences between users is greatly reduced, degrading the overall recognition performance of the system.
Disclosure of Invention
The embodiments of the invention provide an audio signal processing method and device, which ensure that the final voiceprint recognition model is trained on an audio signal that more completely reflects the user's pronunciation characteristics, improving the accuracy of speaker recognition by the voiceprint recognition model.
In a first aspect, an embodiment of the present invention provides an audio signal processing method, including:
acquiring a first audio signal to be processed;
determining at least one phoneme contained in the first audio signal;
calculating a phoneme coverage of the at least one phoneme, wherein the phoneme coverage represents a ratio between the number of distinct phonemes and the total number of phonemes in the at least one phoneme; and
if the phoneme coverage meets a target condition, updating a first voiceprint recognition model to a second voiceprint recognition model, wherein the first voiceprint recognition model is obtained by training on voiceprint feature information of the first audio signal, and the second voiceprint recognition model is obtained by training on voiceprint feature information of the second audio signal.
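The four steps of the first aspect can be sketched as a short control flow. The helper callables (`extract_phonemes`, `train_model`, `acquire_second_audio`) and the single-threshold target condition are illustrative assumptions, not part of the claims:

```python
def enrol(first_audio, extract_phonemes, train_model,
          acquire_second_audio, threshold=0.5):
    """Sketch: acquire audio -> determine phonemes -> compute phoneme
    coverage -> conditionally replace the first voiceprint model."""
    phonemes = extract_phonemes(first_audio)          # at least one phoneme
    coverage = len(set(phonemes)) / len(phonemes)     # distinct kinds / total
    model = train_model(first_audio)                  # first voiceprint model
    if coverage < threshold:                          # target condition met
        model = train_model(acquire_second_audio())   # second voiceprint model
    return model
```

In a real system the callables would be the ASR front end, the voiceprint training pipeline, and the re-registration prompt described in the implementations below.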
In another possible implementation manner, before updating the first voiceprint recognition model to the second voiceprint recognition model, the method further includes:
outputting prompt information for prompting the user to input the second audio signal;
acquiring the second audio signal;
extracting voiceprint feature information of the second audio signal;
and training a second voiceprint recognition model by using voiceprint characteristic information of the second audio signal.
In a possible implementation, the phoneme coverage includes an initial coverage and/or a final coverage, where the initial coverage represents a ratio between the number of distinct initials and the total number of initials among the at least one phoneme, and the final coverage represents a ratio between the number of distinct finals and the total number of finals among the at least one phoneme.
In one possible implementation, the phoneme coverage includes an initial coverage and a final coverage;
the updating the first voiceprint recognition model to a second voiceprint recognition model if the phoneme coverage meets the target condition includes:
if the initial coverage is smaller than a first threshold and the final coverage is smaller than a second threshold, updating the first voiceprint recognition model to the second voiceprint recognition model.
In one possible implementation, the method further includes:
If the phoneme coverage rate does not meet the target condition, determining the first voiceprint recognition model as a voiceprint recognition model corresponding to a first user identifier;
acquiring a third audio signal, inputting the third audio signal into the first voiceprint recognition model for voiceprint recognition processing, and obtaining a processing result;
and determining whether the third audio signal is associated with the first user identification according to the processing result.
In yet another possible implementation manner, before the determining at least one phoneme contained in the first audio signal, the method further includes:
preprocessing the first audio signal, wherein the preprocessing includes retaining the audio signal in the first audio signal that conforms to preset voice characteristics and/or deleting a silent voice signal in the first audio signal;
and the determining at least one phoneme contained in the first audio signal includes:
determining at least one phoneme contained in the preprocessed first audio signal.
In yet another possible implementation manner, before the calculating the phoneme coverage of the at least one phoneme, the method further includes:
determining, according to the at least one phoneme, a target text corresponding to the first audio signal;
and if the target text does not match a preset text, executing the step of calculating the phoneme coverage of the at least one phoneme.
In a second aspect, the present embodiment further provides an audio signal processing apparatus, including:
a first acquisition unit configured to acquire a first audio signal to be processed;
a first determining unit configured to determine at least one phoneme contained in the first audio signal;
a calculation unit configured to calculate a phoneme coverage of the at least one phoneme, the phoneme coverage representing a ratio between the number of distinct phonemes and the total number of phonemes in the at least one phoneme;
and an updating unit configured to update the first voiceprint recognition model to a second voiceprint recognition model if the phoneme coverage meets a target condition, wherein the first voiceprint recognition model is obtained by training on voiceprint feature information of the first audio signal, and the second voiceprint recognition model is obtained by training on voiceprint feature information of the second audio signal.
In a possible implementation manner, the device further comprises:
an output unit configured to output prompt information for prompting the user to input the second audio signal;
a second acquisition unit configured to acquire the input second audio signal;
an extracting unit configured to extract voiceprint feature information of the second audio signal;
and a training unit configured to train the second voiceprint recognition model using the voiceprint feature information of the second audio signal.
In a possible implementation, the phoneme coverage includes an initial coverage and/or a final coverage, where the initial coverage represents a ratio between the number of distinct initials and the total number of initials among the at least one phoneme, and the final coverage represents a ratio between the number of distinct finals and the total number of finals among the at least one phoneme.
In one possible implementation, the phoneme coverage includes an initial coverage and a final coverage;
The updating unit is specifically configured to update the first voiceprint recognition model to the second voiceprint recognition model if the initial coverage is smaller than a first threshold and the final coverage is smaller than a second threshold.
In a possible implementation manner, the device further comprises:
The second determining unit is used for determining the first voiceprint recognition model as a voiceprint recognition model corresponding to the first user identifier if the phoneme coverage rate does not meet the target condition;
The voiceprint recognition processing unit is used for acquiring a third audio signal, inputting the third audio signal into the first voiceprint recognition model for voiceprint recognition processing, and obtaining a processing result;
And a third determining unit, configured to determine whether the third audio signal is associated with the first user identifier according to the processing result.
In a possible implementation manner, the device further comprises:
A preprocessing unit, configured to preprocess the first audio signal, where the preprocessing includes retaining an audio signal that accords with a preset voice feature in the first audio signal, and/or deleting a silent voice signal in the first audio signal;
The first determining unit is specifically configured to determine at least one phoneme contained in the preprocessed first audio signal.
In a possible implementation manner, the device further comprises:
A fourth determining unit, configured to determine, according to the at least one phoneme, a target text corresponding to the first audio signal;
the calculating unit is specifically configured to execute the step of calculating the phoneme coverage of the at least one phoneme if the target text does not match a preset text.
In a possible implementation manner, an audio signal processing apparatus comprises a processor, a memory and a communication interface that are connected to one another, wherein the communication interface is configured to receive and transmit data, the memory is configured to store program code, and the processor is configured to call the program code to perform the method of the first aspect.
In a possible implementation manner, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of the first aspect.
In the embodiment of the application, after a first audio signal to be processed is acquired, at least one phoneme contained in the first audio signal is determined, the phoneme coverage of the at least one phoneme is calculated, and, if the phoneme coverage meets a target condition, the first voiceprint recognition model is updated to a second voiceprint recognition model. The application thus ensures that the final voiceprint recognition model is trained on an audio signal that more completely reflects the user's pronunciation characteristics, better helping the model distinguish different individuals and improving the accuracy of speaker recognition by the voiceprint recognition model.
Drawings
In order to more clearly illustrate the embodiments of the invention or the solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below.
FIG. 1 is a flowchart of an audio signal processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of Chinese phonemes provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of Chinese pinyin initials provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of Chinese pinyin finals provided by an embodiment of the present invention;
FIG. 5 is a flowchart of another audio signal processing method according to an embodiment of the present invention;
FIG. 6 is a flowchart of another audio signal processing method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an audio signal processing apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of another audio signal processing apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of another audio signal processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
The following describes in detail the audio signal processing method according to the embodiment of the present invention with reference to fig. 1 to fig. 6.
Referring to fig. 1, a flowchart of an audio signal processing method is provided in an embodiment of the present invention. As shown in fig. 1, the audio signal processing method according to an embodiment of the present invention may include the following steps S101 to S104.
S101, acquiring a first audio signal to be processed;
S102, determining at least one phoneme contained in the first audio signal;
In one embodiment, the first audio signal may be the audio signal corresponding to a passage of registration voice spoken arbitrarily by a first user during voiceprint registration. The method of the embodiment of the present application then determines whether this passage of registration voice is qualified, and if not, the user is prompted to re-register, that is, to speak a new passage of registration voice. Alternatively, the first audio signal may correspond to several passages of registration voice spoken by the first user, and the method of the embodiment determines which of those passages are qualified.
The first audio signal is used to train a first voiceprint recognition model corresponding to a first user identifier. The first voiceprint recognition model is then used in subsequent speaker recognition to determine whether a speaker is the first user. For speaker recognition to be performed accurately, the first audio signal used to train the first voiceprint recognition model must be able to reflect the pronunciation characteristics of the first user.
The phonemes may be determined using automatic speech recognition (Automatic Speech Recognition, ASR), a technique that converts a person's audio signal into text, to obtain each phoneme in the first audio signal. Optionally, the first audio signal to be processed may also be preprocessed before the phonemes are acquired, that is, after the first audio signal is acquired in step S101.
Optionally, the preprocessing may include: retaining the audio signals in the first audio signal that conform to preset voice characteristics, for example retaining the portions of the first audio signal from which phonemes can be acquired; and/or deleting silent segments of the first audio signal, for example removing the silence produced by sentence breaks or pauses between sentences while the user inputs the audio signal, or removing other non-speech signals, such as car horns or music, captured while the user arbitrarily speaks the registration voice.
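As one hedged illustration of the "delete silent segments" preprocessing, a minimal energy-threshold approach can be sketched as below; a real system would normally use a trained voice activity detector, and the frame length and energy ratio here are assumed values:

```python
import numpy as np

def remove_silence(signal, sample_rate, frame_ms=25, energy_ratio=0.1):
    """Keep only frames whose mean energy exceeds a fraction of the
    signal's overall mean frame energy (a crude stand-in for VAD)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)           # per-frame energy
    keep = energy > energy_ratio * energy.mean()  # drop near-silent frames
    return frames[keep].reshape(-1)
```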
Specifically, the first audio signal is framed to obtain N frames of audio signal, that is, the first audio signal is divided into N short segments, each of which is called a frame. The frame length, that is, the duration of each short segment, may for example be 25 ms. The framing operation may be implemented using a moving window function.
Acoustic features are then extracted from the N frames of audio signal. Acoustic features include, but are not limited to, MFCC features: according to the physiological characteristics of the human ear, each frame's waveform is converted into a multidimensional vector that contains the content information of that frame. The first audio signal thus forms a matrix of 12 rows (assuming 12-dimensional acoustic features) and N columns, where N is the total number of frames.
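The framing step with a moving window can be sketched as follows. The 25 ms frame length comes from the description; the 10 ms hop is an assumed typical value, and the subsequent MFCC computation is left to a feature-extraction library:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping frames via a moving window.
    Each frame would then be mapped to an acoustic feature vector
    (e.g. 12-dimensional MFCCs), giving the 12 x N matrix described."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    idx = (np.arange(frame_len)[None, :]
           + hop_len * np.arange(n_frames)[:, None])
    return signal[idx]  # shape: (N, frame_len)
```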
The multidimensional vector corresponding to each frame of audio signal is then processed to determine which state that frame corresponds to with the largest probability, and that state is taken as the state corresponding to the frame. The states corresponding to several adjacent frames are generally the same, that is, several frames of audio signal correspond to one state, and every three states combine into one phoneme; in general, all initials and finals constitute the set of all phonemes. Optionally, several phonemes form a word, thereby obtaining the target text corresponding to the first audio signal. If the target text does not match the preset text, step S103 is executed; if the target text matches the preset text, speaker recognition is performed directly with the first voiceprint recognition model trained on the voiceprint feature information of the first audio signal, completing the registration. The preset text may be text preset by the system; an audio signal containing the preset text can be regarded as a signal that reflects the user's pronunciation characteristics, so the first voiceprint recognition model may be used directly for speaker recognition.
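The collapse from per-frame states to phonemes described above can be sketched as below; the `state_to_phoneme` mapping is a hypothetical stand-in for an acoustic model's state inventory:

```python
from itertools import groupby

def states_to_phonemes(frame_states, state_to_phoneme):
    """Merge adjacent frames sharing a state, then map every consecutive
    triple of states to one phoneme, as in the description above."""
    states = [s for s, _ in groupby(frame_states)]  # adjacent frames -> one state
    triples = [tuple(states[i:i + 3]) for i in range(0, len(states) - 2, 3)]
    return [state_to_phoneme[t] for t in triples]
```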
As shown in FIG. 2, the phonemes include b, p, m, f, z, c, s, d, t, n, l, zh, ch, sh, among others.
S103, calculating the phoneme coverage of the at least one phoneme;
S104, if the phoneme coverage meets the target condition, updating the first voiceprint recognition model to a second voiceprint recognition model;
In one embodiment, the phoneme coverage of the first audio signal indicates whether the first audio signal can represent the pronunciation characteristics of the first user. The phoneme coverage represents the ratio between the number of distinct phonemes and the total number of phonemes in the at least one phoneme; for the first audio signal to represent the user's pronunciation characteristics, its phoneme coverage should therefore be as large as possible.
The at least one phoneme in the first audio signal is counted to obtain the number of distinct phonemes it contains and the total number of phonemes; the ratio between the two is then calculated as the phoneme coverage of the at least one phoneme.
For example, suppose the at least one phoneme comprises: w, o, ai, zh, o, n, g, g, u, o, w, o. The distinct phonemes are w, zh, n, g, o, ai, u, that is, 7 kinds; the total number of phonemes is 12, so the phoneme coverage is 7/12 ≈ 58.33%.
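The worked example above can be reproduced with a two-line helper (a minimal sketch of step S103):

```python
def phoneme_coverage(phonemes):
    """Ratio between the number of distinct phonemes and the total count."""
    return len(set(phonemes)) / len(phonemes)

phonemes = ["w", "o", "ai", "zh", "o", "n", "g", "g", "u", "o", "w", "o"]
coverage = phoneme_coverage(phonemes)  # 7 kinds / 12 phonemes, about 58.33%
```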
If the phoneme coverage is small, the proportion of distinct phonemes in the at least one phoneme is small, which suggests the user may have repeatedly spoken the same words, so the first audio signal cannot represent the user's pronunciation characteristics. If subsequent speaker recognition were performed with the first voiceprint recognition model trained on the voiceprint information of this first audio signal, the ability to distinguish individual users would be greatly reduced and the voiceprint recognition effect would be poor.
To avoid this situation, when the phoneme coverage is detected to be smaller than the corresponding threshold, the voiceprint must be replaced: a second audio signal input by the user is acquired, a second voiceprint recognition model is trained on the voiceprint information of the second audio signal, and the first voiceprint recognition model is updated to the second voiceprint recognition model. The threshold value can be obtained by training on a large amount of data.
Optionally, the phoneme coverage includes an initial coverage and/or a final coverage: the initial coverage represents the ratio between the number of distinct initials and the total number of initials among the at least one phoneme, and the final coverage represents the ratio between the number of distinct finals and the total number of finals among the at least one phoneme.
In a first alternative embodiment, the phoneme coverage may include an initial coverage, which reflects the coverage of initials among the at least one phoneme. Referring to FIG. 3, the table shows the 23 initials of Chinese pinyin. Although w and y are not called initials in the Chinese pinyin scheme, by customary spelling, for example yan being spelled as an initial plus a final (y-an-yan), y and w are counted as initials here. At least one initial is obtained from the at least one phoneme, the number of distinct initials is counted, and the ratio between the number of distinct initials and the total number of initials is calculated to obtain the initial coverage. For example, suppose the at least one phoneme comprises: w, o, ai, zh, o, n, g, g, u, o, w, o. The initials are: w, zh, n, g, g, w. That is, initials occur 6 times in total, and the distinct initials are w, zh, n, g, that is, 4 kinds, so the initial coverage is 4/6 = 2/3, approximately 66.6%.
In a second alternative embodiment, the phoneme coverage may include a final coverage, which reflects the coverage of finals among the at least one phoneme. Referring to FIG. 4, the table shows the 35 finals of Chinese pinyin. Some finals in the table of FIG. 4 are abbreviated when syllables are composed (for example, the final iou appears as iu after an initial and as the whole syllable you on its own); therefore, when counting finals, only the finals appearing in the table of FIG. 4 are considered, that is, abbreviated finals are restored to their complete form. According to the finals shown in FIG. 4, at least one final is obtained from the at least one phoneme, the number of distinct finals is counted, and the ratio between the number of distinct finals and the total number of finals is calculated to obtain the final coverage. Taking the at least one phoneme w, o, ai, zh, o, n, g, g, u, o, w, o as an example, the finals are: o, ai, o, u, o, o. That is, finals occur 6 times in total, and the distinct finals are o, ai, u, that is, 3 kinds, so the final coverage is 3/6 = 1/2, that is, 50%.
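The two worked examples can be combined into one sketch that computes both coverages. Classifying a phoneme as an initial by set membership, and the small `RESTORE` table for abbreviated finals, are illustrative assumptions:

```python
# The 23 pinyin initials, with y and w counted as initials per the description.
INITIALS = {"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"}

# Abbreviated finals restored to complete form before counting (assumed
# examples of the restoration rule; the full table would follow FIG. 4).
RESTORE = {"iu": "iou", "ui": "uei", "un": "uen"}

def split_coverage(phonemes):
    """Return (initial coverage, final coverage) for a phoneme sequence."""
    initials = [p for p in phonemes if p in INITIALS]
    finals = [RESTORE.get(p, p) for p in phonemes if p not in INITIALS]
    init_cov = len(set(initials)) / len(initials) if initials else 0.0
    final_cov = len(set(finals)) / len(finals) if finals else 0.0
    return init_cov, final_cov
```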
In an actual usage scenario, the first and second alternative embodiments may be used alone or in combination. If the first alternative embodiment is used alone, the second audio signal input by the user is acquired when the initial coverage is smaller than the first threshold, the second voiceprint recognition model is trained on the voiceprint information of the second audio signal, and the first voiceprint recognition model is updated to the second voiceprint recognition model. If the initial coverage is small, the number of distinct initials among the at least one phoneme is small relative to the total number of initials, which suggests the user may have repeated words with the same initial, so the first audio signal cannot show the pronunciation characteristics of the user's initials. If the initial coverage is greater than or equal to the first threshold, the first audio signal shows the pronunciation characteristics of the user's initials, and voiceprint registration is completed.
If the second alternative embodiment is used alone, the second audio signal input by the user is acquired when the final coverage is smaller than the second threshold, the second voiceprint recognition model is trained on the voiceprint information of the second audio signal, and the first voiceprint recognition model is updated to the second voiceprint recognition model. If the final coverage is small, the number of distinct finals among the at least one phoneme is small, which suggests the user may have repeated words with the same final, so the first audio signal cannot show the pronunciation characteristics of the user's finals; if the final coverage is greater than or equal to the second threshold, the first audio signal reflects the pronunciation characteristics of the user's finals, and voiceprint registration is completed.
If the first and second alternative embodiments are combined, that is, both the initial coverage and the final coverage are used to decide whether the first voiceprint recognition model needs to be updated to the second voiceprint recognition model, the target condition may be set as: the initial coverage is smaller than the first threshold and the final coverage is smaller than the second threshold. All other cases are regarded as not meeting the target condition, for example: the initial coverage is greater than or equal to the first threshold while the final coverage is smaller than the second threshold; or the initial coverage is smaller than the first threshold while the final coverage is greater than or equal to the second threshold; or the initial coverage is greater than or equal to the first threshold and the final coverage is greater than or equal to the second threshold.
Specifically, it is optionally judged whether the initial coverage is smaller than the first threshold and whether the final coverage is smaller than the second threshold. If the initial coverage is smaller than the first threshold and the final coverage is smaller than the second threshold, the first audio signal cannot embody the user's pronunciation characteristics, the first voiceprint recognition model trained on the voiceprint information of the first audio signal cannot be used directly for speaker recognition, and the voiceprint must be replaced, that is, the first voiceprint recognition model is updated to the second voiceprint recognition model.
If the initial coverage is greater than or equal to the first threshold and the final coverage is greater than or equal to the second threshold, or if the initial coverage is greater than or equal to the first threshold while the final coverage is smaller than the second threshold, or if the initial coverage is smaller than the first threshold while the final coverage is greater than or equal to the second threshold, speaker recognition can be performed directly with the first voiceprint recognition model trained on the voiceprint information of the first registered voice signal, and voiceprint registration is completed.
Preferably, in a scenario where both the initial coverage and the final coverage are used to determine whether the first voiceprint recognition model needs to be updated to the second voiceprint recognition model, if either coverage is detected to be greater than or equal to its corresponding threshold, the other coverage need not be detected: it is already determined that the target condition is not met, that is, the first voiceprint recognition model does not need to be updated to the second voiceprint recognition model.
For example, the initial coverage rate of the at least one phoneme in the first audio signal may be calculated first. If the obtained initial coverage rate is greater than or equal to the first threshold, voiceprint registration is completed, that is, subsequent speaker recognition can be performed with the first voiceprint recognition model trained on the voiceprint feature information of the first audio signal; refer specifically to steps S201-S206 in the embodiment of fig. 5. If the initial coverage rate is smaller than the first threshold, the final coverage rate of the at least one phoneme in the first audio signal needs to be calculated. If the final coverage rate is smaller than the second threshold, the target condition is met and the first voiceprint recognition model needs to be updated to the second voiceprint recognition model; if the final coverage rate is greater than or equal to the second threshold, registration is completed. Of course, the final coverage rate may instead be calculated first and compared against the second threshold, with the judgment result determining whether the initial coverage rate still needs to be calculated; alternatively, the initial coverage rate and the final coverage rate may be calculated simultaneously, and whether the first voiceprint recognition model needs to be updated to the second voiceprint recognition model is determined from both results.
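The short-circuit decision flow described above can be sketched as follows. The threshold values are illustrative assumptions (the text says thresholds may be trained or customized), and the names `needs_update`, `initial_cov`, and `final_cov` are hypothetical.

```python
def needs_update(initial_cov, final_cov, first_threshold=0.3, second_threshold=0.3):
    """Return True when the target condition is met, i.e. the first
    voiceprint recognition model should be updated to the second one.

    The target condition holds only when BOTH coverage rates fall below
    their thresholds; checking the initial coverage first lets us skip
    the final coverage whenever the initial coverage already passes.
    """
    if initial_cov >= first_threshold:
        return False  # registration completed, no update needed
    return final_cov < second_threshold
```

For instance, `needs_update(0.2, 0.2)` meets the target condition, while `needs_update(0.7, 0.1)` does not, because the initial coverage already passes its threshold.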
Optionally, the first voiceprint recognition model is trained on the voiceprint feature information of the first audio signal, and the second voiceprint recognition model is trained on the voiceprint feature information of the second audio signal. That is, when the first audio signal is unqualified, the second voiceprint recognition model needs to be trained on the voiceprint feature information of the second audio signal.
It can be understood that if the phoneme coverage rate of the second audio signal also meets the target condition, for example the phoneme coverage rate is smaller than the corresponding threshold, the user may further be prompted to perform voiceprint registration again: the audio signal re-entered by the user is acquired, a new voiceprint recognition model is trained on its voiceprint feature information, and the second voiceprint recognition model is updated to the new voiceprint recognition model. This is not described further.
That the phoneme coverage rate of the second audio signal meets the target condition means that the second audio signal cannot fully embody the pronunciation characteristics of the user, that is, the ratio between the number of phoneme kinds and the total number of phonemes in the at least one phoneme of the second audio signal is smaller than the corresponding threshold. The threshold may be obtained through training on a large amount of data, or may be customized as needed in the voiceprint registration system.
In this embodiment, the phoneme coverage rate of the at least one phoneme contained in the first audio signal is calculated, and if the phoneme coverage rate meets the target condition, the first voiceprint recognition model is updated to the second voiceprint recognition model. The application thereby ensures that the finally obtained voiceprint recognition model is trained on an audio signal that more completely reflects the pronunciation characteristics of the user, improving the accuracy of speaker recognition by the voiceprint recognition model.
Referring to fig. 5, a flowchart of another audio signal processing method provided by the present application includes, but is not limited to, steps S201-S206;
S201, acquiring a first audio signal to be processed;
S202, determining at least one phoneme contained in the first audio signal;
S203, calculating the phoneme coverage rate of the at least one phoneme;
In this embodiment, for the content of steps S201-S203, refer to steps S101-S103; details are not repeated here.
S204, if the phoneme coverage rate does not meet the target condition, determining the first voiceprint recognition model as a voiceprint recognition model corresponding to a first user identifier;
Optionally, the phoneme coverage rate includes an initial coverage rate and/or a final coverage rate. If the phoneme coverage rate does not meet the target condition, the phoneme coverage rate may be greater than or equal to the corresponding threshold; refer specifically to the description of the foregoing embodiment, which is not repeated here.
If the phoneme coverage rate does not meet the target condition, determining the first voiceprint recognition model as a voiceprint recognition model corresponding to a first user identifier, wherein the first user identifier is used for identifying a first user corresponding to a first audio signal, and subsequently, identifying a speaker through the first voiceprint recognition model to confirm whether the speaker is the first user.
S205, acquiring a third audio signal, inputting the third audio signal into the first voiceprint recognition model for voiceprint recognition processing, and obtaining a processing result;
In one embodiment, the third audio signal may be an audio signal collected in a subsequent speaker recognition process, and the third audio signal is input into the first voiceprint recognition model to perform voiceprint recognition processing, so as to obtain a processing result, where optionally, the processing result may be a matching degree, that is, a matching degree between voiceprint feature information of the third audio signal and voiceprint feature information of the first audio signal used for training the first voiceprint recognition model.
S206, determining whether the third audio signal is associated with the first user identification according to the processing result.
According to the information of the processing result, it is further confirmed whether the third audio signal is associated with the first user identifier, where association of the third audio signal with the first user identifier means that the third audio signal was spoken by the user identified by the first user identifier.
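A minimal sketch of how the processing result (a matching degree) might be turned into an association decision, assuming voiceprints are compared as fixed-length feature vectors via cosine similarity. The vectors, function names, and the 0.8 threshold are illustrative assumptions, not part of the original method.

```python
import math

def cosine_similarity(a, b):
    """Matching degree between two voiceprint feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_same_speaker(enrolled_vec, test_vec, threshold=0.8):
    """Associate the third audio signal with the first user identifier
    when the matching degree passes a (hypothetical) threshold."""
    return cosine_similarity(enrolled_vec, test_vec) >= threshold
```

In practice the feature vectors would come from the trained voiceprint recognition model rather than being handed in directly; only the thresholding step is shown here.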
In this embodiment, the phoneme coverage rate of the at least one phoneme contained in the first audio signal is calculated; if the phoneme coverage rate does not meet the target condition, the first voiceprint recognition model is determined as the voiceprint recognition model corresponding to the first user identifier, a third audio signal is acquired and input into the first voiceprint recognition model for voiceprint recognition processing to obtain a processing result, and whether the third audio signal is associated with the first user identifier is determined according to the processing result. The application thereby ensures that the finally obtained voiceprint recognition model is trained on an audio signal that more completely reflects the pronunciation characteristics of the user, improving the accuracy of speaker recognition by the voiceprint recognition model.
In another embodiment, before the first voiceprint recognition model is updated to the second voiceprint recognition model in step S104, the second voiceprint recognition model may be obtained by other means. Optionally, referring to fig. 6, the second voiceprint recognition model is obtained by the method shown in the figure, including but not limited to steps S301-S309, which are specifically described below:
S301, acquiring a first audio signal to be processed;
S302, determining at least one phoneme contained in the first audio signal;
S303, calculating the phoneme coverage rate of the at least one phoneme;
In this embodiment, for the content of steps S301-S303, refer to steps S101-S103; details are not repeated here.
S304, if the phoneme coverage rate meets a target condition;
S305, outputting prompt information for prompting a user to input a second audio signal;
S306, acquiring the input second audio signal;
In this embodiment, the phoneme coverage rate of the phonemes is judged. If the phoneme coverage rate meets the target condition, the first audio signal cannot embody the pronunciation characteristics of the user, and the first voiceprint recognition model trained on the voiceprint information of the first audio signal cannot be directly used for speaker recognition, so a voiceprint update is required, that is, the first voiceprint recognition model is updated to the second voiceprint recognition model. When the system judges that the first audio signal meets the target condition and the second audio signal needs to be acquired, it outputs prompt information for prompting the user to input the second audio signal. The prompt information may be a voice prompt such as playing "voiceprint registration failed, please re-enter", a similar text prompt displayed on a display screen, or color cues such as red and green indicating whether registration succeeded.
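The prompt-and-retry loop of steps S304-S306 can be sketched as follows. `get_audio`, `extract_phonemes`, and `meets_target` are hypothetical hooks standing in for audio capture, ASR phoneme extraction, and the threshold check described in the text; the attempt limit is an added assumption.

```python
def register_voiceprint(get_audio, extract_phonemes, meets_target, max_attempts=3):
    """Re-registration loop: whenever the phoneme coverage of the entered
    audio meets the target condition (coverage too low), prompt the user
    and acquire a new audio signal."""
    for _ in range(max_attempts):
        audio = get_audio()
        phonemes = extract_phonemes(audio)
        coverage = len(set(phonemes)) / len(phonemes)
        if not meets_target(coverage):
            return audio  # qualified signal: train the voiceprint model on it
        print("Voiceprint registration failed, please re-enter")
    return None  # give up after max_attempts
```

A real system would replace `print` with the voice, text, or color prompt described above, and would pass the returned signal on to voiceprint feature extraction and model training.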
S307, extracting voiceprint feature information of the second audio signal;
S308, training a second voiceprint recognition model by using voiceprint feature information of the second audio signal;
S309, the first voiceprint recognition model is updated to the second voiceprint recognition model.
After the second audio signal is obtained, the voiceprint feature information of the second audio signal is extracted, the second voiceprint recognition model is obtained through training on the voiceprint information of the second audio signal, and the first voiceprint recognition model is updated to the second voiceprint recognition model. Refer to steps S103-S104; detailed description is omitted here.
In this embodiment, the phoneme coverage rate of the at least one phoneme contained in the first audio signal is calculated; if the phoneme coverage rate meets the target condition, prompt information for prompting the user to input a second audio signal is output, the input second audio signal is obtained, a second voiceprint recognition model is trained with the voiceprint feature information of the second audio signal, and the first voiceprint recognition model is updated to the second voiceprint recognition model. The application thereby ensures that the finally obtained voiceprint recognition model is trained on an audio signal that more completely reflects the pronunciation characteristics of the user, improving the accuracy of speaker recognition by the voiceprint recognition model.
Referring to fig. 7, a schematic structural diagram of an audio signal processing apparatus is provided in an embodiment of the present invention. As shown in fig. 7, the audio signal processing apparatus according to an embodiment of the present invention may include:
A first acquisition unit 11, configured to acquire a first audio signal to be processed;
a first determining unit 12 for determining at least one phoneme contained in the first audio signal;
In one embodiment, the first audio signal may be an audio signal corresponding to a section of registration voice spoken arbitrarily by the first user during voiceprint registration; the method of the embodiment of the present application determines whether that section of registration voice is qualified and, if not, prompts the user to re-register, that is, to speak a new section of registration voice. Alternatively, the first audio signal may be an audio signal corresponding to multiple sections of registration voice spoken by the first user, and the method of the embodiment of the present application determines which of those sections are qualified.
The first audio signal is used for training a first voiceprint recognition model corresponding to the first user identifier. The first voiceprint recognition model is subsequently used for speaker recognition to determine whether a speaker is the first user. In order for speaker recognition to be performed more accurately, the first audio signal used for training the first voiceprint recognition model needs to be able to reflect the pronunciation characteristics of the first user.
The determining method may use automatic speech recognition (Automatic Speech Recognition, ASR), a technique that converts a person's audio signal into text, to obtain each phoneme in the first audio signal. Optionally, the first audio signal to be processed may also be preprocessed before the phonemes are acquired, that is, after the first audio signal is acquired in step S101.
Optionally, the preprocessing may include: preserving the audio signals in the first audio signal that accord with preset voice features, for example preserving the audio signals from which phonemes can be acquired; and/or deleting the silent voice signals in the first audio signal, for example removing silence generated by sentence breaks or pauses between sentences when the user enters the audio signal, or removing other non-voice signals such as car horns or music picked up while the first audio signal was being collected.
Specifically, the first audio signal is framed to obtain N frames of audio signals, i.e., the first audio signal is divided into N small segments, one of which is called a frame. The frame length of each frame is the length of the small audio signal, which may be 25ms, for example. Wherein the framing operation may be implemented using a moving window function.
The N frames of audio signals are subjected to acoustic feature extraction, and acoustic features include, but are not limited to, MFCC features, i.e., each frame of waveforms is changed into a multi-dimensional vector according to physiological characteristics of human ears, and the vector contains content information of the frame of audio signals. The first audio signal forms a matrix of 12 rows (assuming 12 dimensions for acoustic features), N columns, where N is the total number of frames.
Each frame of audio signal is processed according to its corresponding multidimensional vector to determine which state that frame most probably corresponds to, and that state is determined as the state of the frame. The states of several adjacent frames are generally the same, that is, multiple frames of audio signal correspond to one state; every three states are combined into one phoneme, and in general all initials and finals constitute the full set of phonemes. Optionally, multiple phonemes form words, from which the text corresponding to the first audio signal is obtained.
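The framing step described above can be sketched in pure Python. The 16 kHz sample rate and 10 ms hop are illustrative assumptions; the text fixes only the 25 ms frame length and the use of a moving window.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split an audio signal into overlapping frames (moving window).

    With a 25 ms frame at 16 kHz, each frame holds 400 samples; acoustic
    features such as MFCCs would then be extracted per frame.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)      # 160 samples
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames
```

Each returned frame would be mapped to a multidimensional acoustic feature vector, giving the 12-row by N-column matrix described above (assuming 12 feature dimensions).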
As shown in fig. 2, the phonemes include b, p, m, f, z, c, s, d, t, n, l, zh, ch, sh, among others.
A calculation unit 13 for calculating a phoneme coverage of the at least one phoneme, the phoneme coverage being indicative of a ratio between a kind of a phoneme and a total number of phonemes in the at least one phoneme;
And an updating unit 14, configured to update a first voiceprint recognition model to a second voiceprint recognition model if the phoneme coverage rate meets a target condition, where the first voiceprint recognition model is obtained by training voiceprint feature information of the first audio signal, and the second voiceprint recognition model is obtained by training voiceprint feature information of the second audio signal.
In one embodiment, the phone coverage of the first audio signal may indicate whether the first audio signal can represent the pronunciation characteristics of the first user, where the phone coverage is used to represent a ratio between the type of phones and the total number of phones in the at least one phone, so that the phone coverage of the first audio signal needs to be as large as possible to represent the pronunciation characteristics of the user.
The at least one phoneme in the first audio signal is counted to obtain the kinds of phonemes it contains and the total number of phonemes, and the ratio between the number of phoneme kinds and the total number of phonemes is then calculated as the phoneme coverage rate of the at least one phoneme.
For example, the at least one phoneme comprises: w, o, ai, zh, o, n, g, g, u, o, w, o. The at least one phoneme then contains the following phoneme kinds: w, zh, n, g, o, ai, u, i.e. 7 kinds; the total number of the at least one phoneme is 12, so the phoneme coverage rate is 7/12 = 58.33%.
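The calculation above can be written directly; `phoneme_coverage` is a hypothetical helper name.

```python
def phoneme_coverage(phonemes):
    """Ratio between the number of phoneme kinds and the total number
    of phonemes, as defined in the text."""
    if not phonemes:
        return 0.0
    return len(set(phonemes)) / len(phonemes)

phones = ["w", "o", "ai", "zh", "o", "n", "g", "g", "u", "o", "w", "o"]
# 7 kinds (w, zh, n, g, o, ai, u) out of 12 phonemes -> 7/12, about 58.33%
```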
If the phoneme coverage rate is small, the proportion of phoneme kinds in the at least one phoneme is small, and the user may have repeated the same words; the first audio signal then cannot embody the pronunciation characteristics of the user. If subsequent speaker recognition were performed with a first voiceprint recognition model trained on the voiceprint information of such a first audio signal, the ability to distinguish individual users would be greatly reduced and the voiceprint recognition effect would be poor.
In order to avoid the above situation, when the phoneme coverage rate is detected to be smaller than the corresponding threshold, a voiceprint update is required: the second audio signal input by the user is acquired, the second voiceprint recognition model is trained with the voiceprint information of the second audio signal, and the first voiceprint recognition model is updated to the second voiceprint recognition model. The threshold value can be obtained through training on a large amount of data.
Optionally, the phoneme coverage includes an initial coverage and/or a final coverage, the initial coverage is used for representing a ratio between a category of the initial consonants and a total number of the initial consonants in the at least one phoneme, and the final coverage is used for representing a ratio between a category of the final in the at least one phoneme and a total number of the final.
In a first alternative embodiment, the phoneme coverage rate may include an initial coverage rate, which reflects the coverage of the initials in the at least one phoneme. Referring to fig. 3, which shows the initials of Chinese pinyin, there are 23 initials in the table. In the Chinese pinyin scheme, although w and y are not formally called initials, by customary spelling (for example, yan is spelled as initial plus final, y-an-yan) y and w are counted as initials here. At least one initial is obtained from the at least one phoneme, the kinds of the at least one initial are counted, and the ratio between the number of kinds and the total number of initials is calculated to obtain the initial coverage rate. For example, the at least one phoneme comprises: w, o, ai, zh, o, n, g, g, u, o, w, o. The initials among them are: w, zh, n, g, g, w, that is, initials occur 6 times in total, and the initial kinds are w, zh, n, g, i.e. 4 kinds; the initial coverage rate is therefore 4/6 = 2/3, approximately 66.7%.
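Following the text's convention of counting y and w as initials, the initial coverage rate can be sketched as below. Treating every initial-shaped phoneme as an initial (including the g's in this sequence) mirrors the worked example; the set contents and helper names are assumptions.

```python
# The 23 initials of fig. 3, with y and w counted per the customary spelling.
INITIALS = {"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"}

def initial_coverage(phonemes):
    """Ratio between the kinds of initials and the total number of
    initials appearing in the phoneme sequence."""
    initials = [p for p in phonemes if p in INITIALS]
    if not initials:
        return 0.0
    return len(set(initials)) / len(initials)

phones = ["w", "o", "ai", "zh", "o", "n", "g", "g", "u", "o", "w", "o"]
# initials: w, zh, n, g, g, w -> 4 kinds out of 6 occurrences, about 66.7%
```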
In a second alternative embodiment, the phoneme coverage rate may include a final coverage rate, which reflects the coverage of the finals in the at least one phoneme. Referring to fig. 4, which shows the finals of Chinese pinyin, there are 35 finals in the table. Some finals in the table of fig. 4 are abbreviated when syllables are composed; for example, "iou" is written as the pinyin "you" when spelled with an initial. Therefore, in the final statistics, only the finals appearing in the final table of fig. 4 are considered, that is, abbreviated finals are restored to their complete form. According to the finals shown in fig. 4, at least one final is obtained from the at least one phoneme, the kinds of the at least one final are counted, and the ratio between the number of kinds and the total number of finals is calculated to obtain the final coverage rate. Taking the at least one phoneme w, o, ai, zh, o, n, g, g, u, o, w, o as an example, the finals are: o, ai, o, u, o, o, that is, finals occur 6 times in total, and the final kinds are o, ai, u, i.e. 3 kinds; the final coverage rate is therefore 3/6 = 1/2, i.e. 50%.
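The final coverage rate can be sketched the same way. As a simplification (an assumption, not stated by the text), anything not in the initial set is treated as a final, which matches the worked example once abbreviated finals have been restored to full form.

```python
# Same 23-initial set as fig. 3; here used only to separate out the finals.
INITIALS = {"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"}

def final_coverage(phonemes):
    """Ratio between the kinds of finals and the total number of finals
    appearing in the phoneme sequence."""
    finals = [p for p in phonemes if p not in INITIALS]
    if not finals:
        return 0.0
    return len(set(finals)) / len(finals)

phones = ["w", "o", "ai", "zh", "o", "n", "g", "g", "u", "o", "w", "o"]
# finals: o, ai, o, u, o, o -> 3 kinds out of 6 occurrences = 50%
```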
In an actual usage scenario, the first alternative embodiment and the second alternative embodiment may be used alone or in combination. If the first alternative embodiment is used alone, the second audio signal input by the user may be acquired when the initial coverage rate is smaller than the first threshold; the second voiceprint recognition model is then trained with the voiceprint information of the second audio signal, and the first voiceprint recognition model is updated to the second voiceprint recognition model. A small initial coverage rate indicates that the kinds of initials in the at least one phoneme are few relative to the total number of initials, the user possibly having repeated words with the same initial, so the first audio signal cannot embody the pronunciation characteristics of the user's initials. If the initial coverage rate is greater than or equal to the first threshold, the first audio signal embodies the pronunciation characteristics of the user's initials, and voiceprint registration is completed.
If the second alternative embodiment is used alone, the second audio signal input by the user may be acquired when the final coverage rate is smaller than the second threshold; the second voiceprint recognition model is then trained with the voiceprint information of the second audio signal, and the first voiceprint recognition model is updated to the second voiceprint recognition model. A small final coverage rate indicates that the kinds of finals in the at least one phoneme are few relative to the total number of finals, the user possibly having repeated words with the same final, so the first audio signal cannot embody the pronunciation characteristics of the user's finals. If the final coverage rate is greater than or equal to the second threshold, the first audio signal embodies the pronunciation characteristics of the user's finals, and voiceprint registration is completed.
If the first alternative embodiment and the second alternative embodiment are combined, that is, both the initial coverage rate and the final coverage rate are used to determine whether the first voiceprint recognition model needs to be updated to the second voiceprint recognition model, the target condition may be set as the initial coverage rate being smaller than the first threshold and the final coverage rate being smaller than the second threshold. All other cases are regarded as not meeting the target condition, namely: the initial coverage rate is greater than or equal to the first threshold and the final coverage rate is smaller than the second threshold; or the initial coverage rate is smaller than the first threshold and the final coverage rate is greater than or equal to the second threshold; or the initial coverage rate is greater than or equal to the first threshold and the final coverage rate is greater than or equal to the second threshold.
Specifically and optionally, it is judged whether the initial coverage rate is smaller than a first threshold and whether the final coverage rate is smaller than a second threshold. If the initial coverage rate is smaller than the first threshold and the final coverage rate is smaller than the second threshold, the first audio signal cannot embody the pronunciation characteristics of the user, the first voiceprint recognition model trained on the voiceprint information of the first audio signal cannot be directly used for speaker recognition, and a voiceprint update is required, that is, the first voiceprint recognition model is updated to the second voiceprint recognition model.
If the initial coverage rate is greater than or equal to the first threshold and the final coverage rate is greater than or equal to the second threshold, or the initial coverage rate is greater than or equal to the first threshold and the final coverage rate is smaller than the second threshold, or the initial coverage rate is smaller than the first threshold and the final coverage rate is greater than or equal to the second threshold, speaker recognition can be performed directly with the first voiceprint recognition model trained on the voiceprint information of the first audio signal, and voiceprint registration is completed.
Preferably, in a scenario where both the initial coverage rate and the final coverage rate are used to determine whether the first voiceprint recognition model needs to be updated to the second voiceprint recognition model, if either coverage rate is detected to be greater than or equal to its corresponding threshold, the other coverage rate need not be checked: the target condition is not satisfied, and the first voiceprint recognition model does not need to be updated to the second voiceprint recognition model.
For example, the initial coverage rate of the at least one phoneme in the first audio signal may be calculated first. If the obtained initial coverage rate is greater than or equal to the first threshold, voiceprint registration is completed, that is, subsequent speaker recognition can be performed with the first voiceprint recognition model trained on the voiceprint feature information of the first audio signal; refer specifically to steps S201-S206 in the embodiment of fig. 5. If the initial coverage rate is smaller than the first threshold, the final coverage rate of the at least one phoneme in the first audio signal needs to be calculated. If the final coverage rate is smaller than the second threshold, the target condition is met and the first voiceprint recognition model needs to be updated to the second voiceprint recognition model; if the final coverage rate is greater than or equal to the second threshold, registration is completed. Of course, the final coverage rate may instead be calculated first and compared against the second threshold, with the judgment result determining whether the initial coverage rate still needs to be calculated; alternatively, the initial coverage rate and the final coverage rate may be calculated simultaneously, and whether the first voiceprint recognition model needs to be updated to the second voiceprint recognition model is determined from both results.
Optionally, the first voiceprint recognition model is trained on the voiceprint feature information of the first audio signal, and the second voiceprint recognition model is trained on the voiceprint feature information of the second audio signal. That is, when the first audio signal is unqualified, the second voiceprint recognition model needs to be trained on the voiceprint feature information of the second audio signal.
It can be understood that if the phoneme coverage rate of the second audio signal also meets the target condition, for example the phoneme coverage rate is smaller than the corresponding threshold, the user may further be prompted to perform voiceprint registration again: the audio signal re-entered by the user is acquired, a new voiceprint recognition model is trained on its voiceprint feature information, and the second voiceprint recognition model is updated to the new voiceprint recognition model. This is not described further.
That the phoneme coverage rate of the second audio signal meets the target condition means that the second audio signal cannot fully embody the pronunciation characteristics of the user, that is, the ratio between the number of phoneme kinds and the total number of phonemes in the at least one phoneme of the second audio signal is smaller than the corresponding threshold. The threshold may be obtained through training on a large amount of data, or may be customized as needed in the voiceprint registration system.
In one embodiment, as shown in fig. 8, the apparatus further comprises:
An output unit, configured to output prompt information for prompting the user to input the second audio signal;
A second acquisition unit, configured to acquire the input second audio signal;
An extracting unit, configured to extract voiceprint feature information of the second audio signal;
And a training unit, configured to train a second voiceprint recognition model by using the voiceprint feature information of the second audio signal.
In one embodiment, the phoneme coverage comprises an initial coverage and/or a final coverage, the initial coverage being used to represent a ratio between a category of initials and a total number of initials in the at least one phoneme, and the final coverage being used to represent a ratio between a category of finals and a total number of finals in the at least one phoneme.
In one embodiment, the phoneme coverage includes an initial coverage and a final coverage;
The updating unit is specifically configured to update the first voiceprint recognition model to a second voiceprint recognition model if the coverage rate of the initials is smaller than a first threshold and the coverage rate of the finals is smaller than a second threshold.
In one embodiment, as shown in fig. 8, the apparatus further comprises:
The second determining unit is used for determining the first voiceprint recognition model as a voiceprint recognition model corresponding to the first user identifier if the phoneme coverage rate does not meet the target condition;
The voiceprint recognition processing unit is used for acquiring a third audio signal, inputting the third audio signal into the first voiceprint recognition model for voiceprint recognition processing, and obtaining a processing result;
And a third determining unit, configured to determine whether the third audio signal is associated with the first user identifier according to the processing result.
In one embodiment, as shown in fig. 8, the apparatus further comprises:
A preprocessing unit, configured to preprocess the first audio signal, where the preprocessing includes retaining an audio signal that accords with a preset voice feature in the first audio signal, and/or deleting a silent voice signal in the first audio signal;
The first determining unit is specifically configured to determine at least one phoneme contained in the preprocessed first audio signal.
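A simple way to realize the preprocessing unit's silence removal is frame-level energy gating, sketched below. The frame length and energy floor are illustrative values, not parameters from the patent, and a production system would use a proper voice activity detector.

```python
def preprocess(samples, frame_len=160, energy_floor=0.01):
    """Keep frames whose mean energy exceeds a floor (retaining
    speech-like audio) and drop near-silent frames."""
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > energy_floor:
            kept.extend(frame)
    return kept

speech = [0.5, -0.5] * 80   # one energetic frame (mean energy 0.25)
silence = [0.0] * 160       # one silent frame
cleaned = preprocess(speech + silence)  # only the speech frame survives
```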
In one embodiment, the apparatus further comprises a fourth determination unit,
A fourth determining unit, configured to determine, according to the at least one phoneme, a target text corresponding to the first audio signal;
the calculating unit is specifically configured to execute a step of calculating a phoneme coverage rate of the at least one phoneme if the target text does not match with a preset text.
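The fourth determining unit and the guarded coverage computation can be combined in one helper, sketched under the assumption that text matching is a plain string comparison; the patent does not specify the matching criterion.

```python
def maybe_compute_coverage(decoded_phonemes, target_text, preset_text):
    """Per this embodiment, coverage is computed only when the target text
    recognized from the phonemes does NOT match the preset text; otherwise
    the coverage step is skipped and None is returned."""
    if target_text == preset_text:
        return None
    return len(set(decoded_phonemes)) / len(decoded_phonemes)

maybe_compute_coverage(["b", "a", "a"], "ba", "ba")  # matches: skipped
maybe_compute_coverage(["b", "a", "a"], "pa", "ba")  # mismatch: computed
```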
In this embodiment, the phoneme coverage rate of at least one phoneme contained in the first audio signal is calculated, and if the phoneme coverage rate meets the target condition, the first voiceprint recognition model is updated to a second voiceprint recognition model. This ensures that the final voiceprint recognition model is trained on an audio signal that reflects the pronunciation characteristics of the user more completely, which improves the accuracy of the voiceprint recognition model in speaker recognition.
Referring to fig. 9, which is a schematic structural diagram of another audio signal processing apparatus according to an embodiment of the present invention, the audio signal processing apparatus 1000 may include: at least one processor 1001 (such as a CPU), at least one communication interface 1003, a memory 1004, and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The communication interface 1003 may optionally include a standard wired interface or a wireless interface (e.g., a WI-FI interface). The memory 1004 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1004 may also be at least one storage device located remotely from the processor 1001. As shown in fig. 9, the memory 1004, which is a type of computer storage medium, may include an operating system, network communication modules, and program instructions.
In the audio signal processing apparatus 1000 shown in fig. 9, the processor 1001 may be configured to load program instructions stored in the memory 1004 and specifically perform the following operations:
Acquiring a first audio signal to be processed;
determining at least one phoneme contained in the first audio signal;
Calculating a phoneme coverage rate of the at least one phoneme, wherein the phoneme coverage rate is used for representing the ratio between the number of phoneme types and the total number of phonemes in the at least one phoneme;
If the phoneme coverage rate meets the target condition, updating a first voiceprint recognition model into a second voiceprint recognition model, wherein the first voiceprint recognition model is obtained by training voiceprint feature information of the first audio signal, and the second voiceprint recognition model is obtained by training voiceprint feature information of the second audio signal.
Optionally, before updating the first voiceprint recognition model to the second voiceprint recognition model, the method further includes:
outputting prompt information for prompting the user to input the second audio signal;
acquiring the second audio signal;
Extracting voiceprint feature information of the second audio signal;
and training a second voiceprint recognition model by using voiceprint characteristic information of the second audio signal.
Optionally, the phoneme coverage includes an initial coverage and/or a final coverage, the initial coverage being used to represent the ratio between the number of initial categories and the total number of initials in the at least one phoneme, and the final coverage being used to represent the ratio between the number of final categories and the total number of finals in the at least one phoneme.
Optionally, the phoneme coverage includes an initial coverage and a final coverage;
If the phoneme coverage rate meets the target condition, updating the first voiceprint recognition model to a second voiceprint recognition model, including:
And if the coverage rate of the initials is smaller than a first threshold value and the coverage rate of the finals is smaller than a second threshold value, updating the first voiceprint recognition model into a second voiceprint recognition model.
Optionally, the method further comprises:
If the phoneme coverage rate does not meet the target condition, determining the first voiceprint recognition model as a voiceprint recognition model corresponding to a first user identifier;
acquiring a third audio signal, inputting the third audio signal into the first voiceprint recognition model for voiceprint recognition processing, and obtaining a processing result;
and determining whether the third audio signal is associated with the first user identification according to the processing result.
Optionally, before determining the at least one phoneme contained in the first audio signal, the method further includes:
Preprocessing the first audio signal, wherein the preprocessing comprises the steps of retaining the audio signal which accords with preset voice characteristics in the first audio signal and/or deleting a silent voice signal in the first audio signal;
The determining at least one phoneme contained in the first audio signal includes:
At least one phoneme contained in the preprocessed first audio signal is determined.
Optionally, before calculating the phoneme coverage of the at least one phoneme, the method further includes:
Determining a target text corresponding to the first audio signal according to the at least one phoneme;
and if the target text is not matched with the preset text, executing the step of calculating the phoneme coverage rate of the at least one phoneme.
It should be noted that for the specific implementation process, reference may be made to the specific descriptions of the method embodiments shown in fig. 1 to fig. 6, which are not repeated here.
An embodiment of the present invention further provides a computer storage medium. The computer storage medium may store a plurality of instructions suitable for being loaded and executed by a processor to perform the steps of the methods in the embodiments shown in fig. 1 to fig. 6; for the specific implementation process, reference may be made to the specific descriptions of those embodiments, which are not repeated here.
Those skilled in the art will appreciate that all or part of the procedures of the above method embodiments may be implemented by a computer program, which may be stored on a computer-readable storage medium and which, when executed, performs the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

Claims (10)

1. An audio signal processing method, comprising:
Acquiring a first audio signal to be processed;
Framing the first audio signal to obtain N frames of audio signals;
Extracting acoustic features of each frame of audio signals in the N frames of audio signals to obtain multidimensional vectors which correspond to each frame of audio signals and are used for representing the acoustic features;
Determining the state corresponding to each frame of audio signal according to the multidimensional vector corresponding to each frame of audio signal, wherein the states corresponding to adjacent frames of audio signals may be the same;
Merging adjacent multi-frame audio signals corresponding to the same state into one state corresponding to the multi-frame audio signals;
combining every three adjacent states into one phoneme to obtain the at least one phoneme contained in the first audio signal;
Calculating a phoneme coverage rate of the at least one phoneme, wherein the phoneme coverage rate is used for representing the ratio between the number of phoneme types and the total number of phonemes in the at least one phoneme;
If the phoneme coverage rate meets the target condition, updating a first voiceprint recognition model into a second voiceprint recognition model, wherein the first voiceprint recognition model is obtained by training voiceprint feature information of the first audio signal, and the second voiceprint recognition model is obtained by training voiceprint feature information of the second audio signal.
2. The method of claim 1, wherein before updating the first voiceprint recognition model to the second voiceprint recognition model, further comprising:
outputting prompt information for prompting the user to input the second audio signal;
acquiring the second audio signal;
Extracting voiceprint feature information of the second audio signal;
and training a second voiceprint recognition model by using voiceprint characteristic information of the second audio signal.
3. The method of claim 1, wherein the phoneme coverage includes an initial coverage and/or a final coverage, the initial coverage being used to represent the ratio between the number of initial categories and the total number of initials in the at least one phoneme, and the final coverage being used to represent the ratio between the number of final categories and the total number of finals in the at least one phoneme.
4. The method of claim 3, wherein the phoneme coverage comprises an initial coverage and a final coverage;
If the phoneme coverage rate meets the target condition, updating the first voiceprint recognition model to a second voiceprint recognition model, including:
And if the coverage rate of the initials is smaller than a first threshold value and the coverage rate of the finals is smaller than a second threshold value, updating the first voiceprint recognition model into a second voiceprint recognition model.
5. The method of any one of claims 1-4, wherein the method further comprises:
If the phoneme coverage rate does not meet the target condition, determining the first voiceprint recognition model as a voiceprint recognition model corresponding to a first user identifier;
acquiring a third audio signal, inputting the third audio signal into the first voiceprint recognition model for voiceprint recognition processing, and obtaining a processing result;
and determining whether the third audio signal is associated with the first user identification according to the processing result.
6. The method of claim 1, wherein prior to said determining at least one phoneme contained in the first audio signal, further comprising:
Preprocessing the first audio signal, wherein the preprocessing comprises the steps of retaining the audio signal which accords with preset voice characteristics in the first audio signal and/or deleting a silent voice signal in the first audio signal;
The determining at least one phoneme contained in the first audio signal includes:
At least one phoneme contained in the preprocessed first audio signal is determined.
7. The method of claim 1, wherein prior to said calculating the phoneme coverage for the at least one phoneme, further comprising:
Determining a target text corresponding to the first audio signal according to the at least one phoneme;
and if the target text is not matched with the preset text, executing the step of calculating the phoneme coverage rate of the at least one phoneme.
8. An audio signal processing device, characterized in that the audio signal processing device comprises means for performing the method according to any of claims 1-7, the device comprising:
a first acquisition unit configured to acquire a first audio signal to be processed;
a first determining unit configured to determine at least one phoneme contained in the first audio signal;
A calculation unit configured to calculate a phoneme coverage of the at least one phoneme, the phoneme coverage being used to represent the ratio between the number of phoneme types and the total number of phonemes in the at least one phoneme;
And the updating unit is used for updating the first voiceprint recognition model into a second voiceprint recognition model if the phoneme coverage rate meets the target condition, wherein the first voiceprint recognition model is obtained by training voiceprint feature information of the first audio signal, and the second voiceprint recognition model is obtained by training voiceprint feature information of the second audio signal.
9. An audio signal processing device comprising a processor, a memory and a communication interface, the processor, memory and communication interface being interconnected, wherein the communication interface is adapted to receive and transmit data, the memory is adapted to store program code, and the processor is adapted to invoke the program code to perform the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which is executed by a processor to implement the method of any one of claims 1 to 7.
CN201911038804.XA 2019-10-29 2019-10-29 Audio signal processing method and device Active CN110880327B (en)

Priority Applications (2)
- CN201911038804.XA (filed 2019-10-29): Audio signal processing method and device
- PCT/CN2019/118445 (filed 2019-11-14): Audio signal processing method and device (published as WO2021082084A1)

Publications (2)
- CN110880327A, published 2020-03-13
- CN110880327B, granted 2024-07-09

Family ID: 69728283


Also Published As
- WO2021082084A1 (published 2021-05-06)
- CN110880327A (published 2020-03-13)


Legal Events
- PB01: Publication
- REG: Reference to a national code (country code: HK; legal event code: DE; document number: 40019544)
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant