Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a speech recognition signal preprocessing method, a speech recognition signal preprocessing apparatus and device, and a computer readable storage medium, which are used for solving the technical problem that speech recognition in the prior art cannot eliminate the voices of other people introduced by the surrounding environment.
According to an aspect of an embodiment of the present invention, there is provided a voice recognition signal preprocessing method, including:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of each sentence to be recognized according to a voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and the recognition result is the correspondence between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
And adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, receiving the voice signal to be recognized and extracting the voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized, further comprises:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristics and text characteristics of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain voiceprint characteristics.
In an optional manner, recognizing the voiceprint features of each sentence to be recognized according to the voiceprint model library to obtain the initial recognition result, where the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, further includes:
storing the voiceprint features and the corresponding user identifications in an associated manner, and constructing a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint features stored in the voiceprint model library, judging the similarity, matching the current voice signal to be recognized with the corresponding user identification, and storing it in the voiceprint model library in an associated manner.
In an optional manner, performing distortion analysis on each sentence to be recognized of the speech signal to be recognized to obtain a distortion degree result of each sentence to be recognized further includes:
performing distortion analysis on each sentence to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion degree result corresponding to each sentence.
In an alternative manner, the THD total harmonic distortion analysis method further includes:
and adopting a THD total harmonic distortion analysis formula to analyze the distortion degree:

$$V_{THD\_R}=\frac{\sqrt{\sum_{h=2}^{N}V_{h,rms}^{2}}}{V_{rms}}$$

wherein $V_{THD\_R}$ represents the ratio of the root mean square value of all harmonic components up to the designated order $N$ to the total root mean square value $V_{rms}$, $V_{h,rms}$ represents the root mean square of the volume of the $h$-th harmonic, $rms$ denotes root mean square, and $h$ denotes the designated harmonic order.
In an optional manner, the initial recognition result is adjusted according to the distortion degree result to obtain a target recognition result, which further includes:
determining whether the distortion degree result of each sentence to be recognized is within a distortion degree threshold interval;
and removing, from the voiceprint model library, the sentences to be recognized whose distortion degree results are not within the distortion degree threshold interval, together with the corresponding user identifications, to obtain a target voiceprint model library and a target recognition result.
According to another aspect of the embodiment of the present invention, there is also provided a voice recognition signal preprocessing apparatus, including:
the voiceprint extraction module is used for receiving the voice signal to be recognized and extracting the voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
the voiceprint registration module is used for recognizing the voiceprint characteristics of each sentence to be recognized according to the voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and the recognition result is the correspondence between the sentence to be recognized and the user;
the distortion analysis module is used for performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and the adjusting module is used for adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, the voiceprint registration module recognizing the voiceprint features of each sentence to be recognized according to the voiceprint model library to obtain the initial recognition result, where the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, further includes:
storing the voiceprint features and the corresponding user identifications in an associated manner, and constructing a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint features stored in the voiceprint model library, judging the similarity, matching the current voice signal to be recognized with the corresponding user identification, and storing it in the voiceprint model library in an associated manner.
According to another aspect of an embodiment of the present invention, there is provided a voice recognition signal preprocessing apparatus including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation of the voice recognition signal preprocessing method.
According to yet another aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored therein at least one executable instruction that, when run on a speech recognition signal preprocessing apparatus/device, causes the speech recognition signal preprocessing apparatus/device to perform the operations of the above-described speech recognition signal preprocessing method.
According to the voice recognition signal preprocessing method provided by the embodiment of the present invention, signal distortion degree analysis is performed by means of the THD total harmonic distortion analysis method, the voiceprint model library and the different-speaker results analyzed from it are finely adjusted according to the distortion degree analysis result, and the finely adjusted recognition result is used as the target voice to be recognized, so that the accuracy of the voice recognition result is improved.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention. In order that the technical means of the embodiments of the present invention may be more clearly understood and implemented according to the content of the specification, specific embodiments of the present invention are described below.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein.
Fig. 1 is a schematic flow chart of a voice recognition signal preprocessing method according to an embodiment of the present invention, where the method is performed by a voice recognition signal preprocessing device. As shown in fig. 1, the method comprises the steps of:
s110: receiving a voice signal to be recognized, and extracting voiceprint characteristics of each statement to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one statement to be recognized.
The voice signal to be recognized is the voice portion that remains after processing by a microphone array. The invention aims to effectively extract the voice signal of the user using the voice recognition device, remove the voice signals of other people around the user introduced by the use environment, and thereby improve the recognition accuracy of the input voice signal. Abnormal signals such as background noise, reverberation, echo, interference and car horns are removed when the signal passes through the microphone array.
Specifically, after the speech signal to be recognized is received, it is divided into a plurality of sentences, each sentence being treated as one sentence to be recognized. The identity features and text features of each sentence to be recognized before the current sentence to be recognized in the voice signal are extracted and fused to obtain the voiceprint features. A DNN algorithm is adopted to respectively extract the identity features of the speaker contained in the voice and the corresponding text features related to the content of the voice information. The identity features include time-domain and frequency-domain features describing the sound quality, loudness and pitch of the voice, such as short-time energy, short-time average amplitude, short-time zero-crossing rate, MFCC parameters, PLP parameters and pitch. The text features are the text content related to the content of the speech signal to be recognized. The identity features and the text features are fused to obtain the voiceprint features.
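As a rough illustration of this step (and not the DNN of the embodiment), the following Python sketch computes a few of the named time-domain identity features and fuses them with a text feature vector by plain concatenation; the frame sizes, the stub text embedding and the concatenation-based fusion are assumptions made purely for illustration.

```python
import numpy as np

def identity_features(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Average short-time energy, average amplitude and zero-crossing rate
    over all frames of one sentence (a tiny subset of the identity features
    named above; a full system would add MFCC, PLP and pitch)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    feats = []
    for f in frames:
        energy = np.sum(f ** 2)                         # short-time energy
        avg_amp = np.mean(np.abs(f))                    # short-time average amplitude
        zcr = np.mean(np.abs(np.diff(np.sign(f))) > 0)  # short-time zero-crossing rate
        feats.append([energy, avg_amp, zcr])
    return np.mean(np.array(feats), axis=0)

def fuse(identity_vec: np.ndarray, text_vec: np.ndarray) -> np.ndarray:
    """Fuse identity and text features into one voiceprint vector.
    Plain concatenation is assumed; the embodiment's DNN could instead
    learn a joint embedding."""
    return np.concatenate([identity_vec, text_vec])

# Example: one sentence of 16 kHz audio and a stub text embedding.
sentence = np.random.randn(16000)   # placeholder audio samples
text_embedding = np.zeros(8)        # placeholder text feature vector
voiceprint = fuse(identity_features(sentence), text_embedding)
```

In a real system the text features would come from a speech-to-text front end and the fusion would typically be learned rather than fixed.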
S120: recognizing the voiceprint characteristics of each sentence to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and the recognition result is the correspondence between the sentence to be recognized and the user.
Specifically, in this embodiment, the steps specifically include:
The voiceprint features and the user identifiers are stored in an associated manner to construct the voiceprint model library. Specifically, the voiceprint features of each sentence to be recognized in the received voice signal are stored in association with a user identifier. Voiceprint self-registration is performed with each sentence to be recognized as a unit, and after the user identifiers are associated, a free-speech voiceprint model library is established to store the voiceprint features of the users. The user identifiers are a plurality of preset identifiers used for matching voiceprint features. The recognition result, i.e. the correspondence between a sentence to be recognized and a user, is represented by the correspondence between the voiceprint features of the sentence and its user identifier.
The current voice signal to be recognized is compared with the voiceprint features of each stored sentence to be recognized in the voiceprint model library and the similarity is judged, so as to match a user identifier to the current voice signal to be recognized.
For example, in a conference scenario, a speech signal is received. When the first sentence is received, its voiceprint features are extracted, matched with a random user identifier, such as user 1, and stored in the voiceprint model library after being associated. When the second sentence is received, its voiceprint features are extracted and compared with those of the first sentence in the voiceprint model library. If the similarity reaches a preset similarity threshold, the second sentence is matched to user 1, that is, the second sentence was also spoken by user 1. If the similarity does not reach the threshold, a user identifier, such as identifier 2, is randomly matched to the second sentence, and its voiceprint features are associated with that identifier and stored in the voiceprint model library. When the third sentence is received, it is matched against the voiceprint features of the first and second sentences in the same way, so that the user identifier of the third sentence is obtained and stored in the voiceprint model library in an associated manner. The user identifier can be stored in association with the sentence to be recognized in the form of an added field.
In this manner, the current sentence to be recognized is compared with the voiceprint features of the previous sentences stored in the voiceprint model library and matched with the corresponding user identifier, so that the user identifier corresponding to each sentence to be recognized is obtained; the voiceprint features of the sentences to be recognized together with their corresponding user identifiers constitute the initial recognition result. That is, through the above processing, the speaker corresponding to each sentence to be recognized can be obtained, so that the speech signals generated in the conference can be accurately judged as to which sentences are spoken by one user and which by another.
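The sentence-by-sentence self-registration of the conference example can be sketched as follows. This is a minimal illustration: cosine similarity and a threshold of 0.8 stand in for whatever similarity measure and preset threshold an actual embodiment would use, and user identifiers are drawn from a simple counter rather than a preset pool.

```python
import numpy as np

class VoiceprintLibrary:
    """Minimal short-time self-registration library (illustrative only)."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold   # assumed preset similarity threshold
        self.entries = []            # (user_id, voiceprint) pairs, one per sentence
        self.next_id = 1

    @staticmethod
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two voiceprint vectors."""
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def register(self, voiceprint: np.ndarray) -> int:
        """Compare the current sentence against all previously stored
        sentences; reuse the most similar user id if the similarity reaches
        the threshold, otherwise assign a fresh id. The (id, voiceprint)
        pair is stored either way."""
        best_id, best_sim = None, -1.0
        for user_id, stored in self.entries:
            sim = self.cosine(voiceprint, stored)
            if sim > best_sim:
                best_id, best_sim = user_id, sim
        if best_sim >= self.threshold:
            user_id = best_id
        else:
            user_id = self.next_id
            self.next_id += 1
        self.entries.append((user_id, voiceprint))
        return user_id
```

Calling register() once per incoming sentence reproduces the behaviour described above: the first sentence founds user 1, and each later sentence either reuses an existing identifier or founds a new one.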
S130: performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized.
Distortion refers to the change in waveform of the output signal relative to the input signal caused by interference and noise acting on the sound during conversion, amplification and transmission; the degree to which the output signal deviates from the input signal is the distortion degree. Such interference and noise include the voices of other people around the speaker, so distortion analysis can assist in distinguishing the speech signal of the speaker from those of others.
In this embodiment, the distortion degree of the original speech signal is analyzed by using the THD total harmonic distortion analysis formulas. Signal distortion degree analysis is performed on each sentence to be recognized as a unit, so as to obtain the distortion degree of the output signal of the speech recognition signal preprocessing device relative to its input signal.
The distortion analysis is performed by adopting the following THD total harmonic distortion analysis formulas.

First, the volume root mean square $V_{rms}$ of the selected voice signal to be recognized is computed from the values $V_{samp}$ of the waveform sample points by inserting them into the standard root mean square equation:

$$V_{rms}=\sqrt{\frac{1}{M}\sum_{i=1}^{M}V_{samp,i}^{2}}$$

where $M$ is the number of sample points.

Then, the THD parameter is calculated as the total harmonic content in the signal:

$$V_{THD\_R}=\frac{\sqrt{\sum_{h=2}^{N}V_{h,rms}^{2}}}{V_{rms}}$$

wherein $V_{THD\_R}$ represents the ratio of the root mean square value of all harmonic components up to the designated order $N$ to the total root mean square value, and $V_{h,rms}$ represents the root mean square of the volume of the $h$-th harmonic.
After analysis and calculation according to the above distortion degree formulas, the distortion degree result of the volume corresponding to each sentence to be recognized in the voice signal to be recognized is obtained. In this embodiment, the distortion degree result may be expressed as a percentage; for example, a distortion of 5% indicates that the sentence to be recognized is distorted by 5% relative to its original input.
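For concreteness, the sketch below is one possible numerical realization of the formulas above: it estimates $V_{THD\_R}$ from an FFT magnitude spectrum, taking the fundamental to be the strongest spectral bin and using N = 10 designated orders; both choices are assumptions for illustration rather than requirements of the embodiment.

```python
import numpy as np

def thd_r(signal: np.ndarray, n_harmonics: int = 10) -> float:
    """V_THD_R: RMS of the designated harmonics 2..N over the total RMS.
    The fundamental is assumed to be the strongest FFT bin."""
    v_rms = np.sqrt(np.mean(signal ** 2))      # standard root mean square of the samples
    mag = np.abs(np.fft.rfft(signal)) / len(signal)
    mag[1:-1] *= 2.0                           # one-sided spectrum -> per-bin amplitudes
    fund = int(np.argmax(mag[1:])) + 1         # fundamental bin (skip DC)
    harm_sq = 0.0
    for h in range(2, n_harmonics + 1):        # designated orders h = 2..N
        k = h * fund
        if k >= len(mag):
            break
        harm_sq += (mag[k] / np.sqrt(2.0)) ** 2   # V_{h,rms}^2 of each harmonic
    return float(np.sqrt(harm_sq) / max(v_rms, 1e-12))  # V_THD_R

# Example: a 200 Hz tone with a weak third harmonic, sampled at 16 kHz.
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 200 * t) + 0.05 * np.sin(2 * np.pi * 600 * t)
print(f"V_THD_R = {thd_r(x):.2%}")             # roughly 5%
```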
S140: and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In this embodiment, after the initial recognition result is obtained, an initial correspondence between the voiceprint features of each sentence to be recognized and its user identifier is available, together with a voiceprint model library in which the voiceprint features of all the sentences to be recognized are associated with their corresponding user identifiers. The initial recognition result does not take the distortion degree into account, so whether each sentence to be recognized is distorted must be judged according to its calculated distortion degree result. Among sentences whose other voiceprint characteristics, such as pitch, are the same, the voiceprint features of sentences with a large change in signal distortion degree are removed from the self-registration, and only sentences within a reasonable distortion degree range are kept as valid self-registered voiceprint features. Therefore, in this embodiment, a distortion degree threshold interval is set. A distortion degree result outside this interval indicates that the sentence to be recognized is highly distorted, so the sentences to be recognized whose distortion degree results are not within the threshold interval, together with their corresponding recognition results, are removed from the voiceprint model library; the voiceprint model library adjusted in this way is the target voiceprint model library, and the target recognition result is obtained once the highly distorted sentences are removed. In this embodiment, the distortion degree threshold interval is 0-5%: if the distortion degree result of a sentence to be recognized exceeds 5%, the sentence is rejected.
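A minimal sketch of this threshold-interval pruning, assuming (purely for illustration) that the voiceprint model library and the recognition result are lists of records carrying a 'distortion' field:

```python
def prune_by_distortion(model_library, recognition_result, low=0.0, high=0.05):
    """Keep only sentences whose distortion degree result lies inside the
    threshold interval (0-5% in this embodiment); all other entries are
    removed from both the model library and the recognition result."""
    target_library = [e for e in model_library
                      if low <= e["distortion"] <= high]
    target_result = [e for e in recognition_result
                     if low <= e["distortion"] <= high]
    return target_library, target_result

# Example: the 7%-distortion sentence is rejected, the 3% one is kept.
library = [{"sentence_id": 1, "user_id": "user 1", "distortion": 0.03},
           {"sentence_id": 2, "user_id": "user 2", "distortion": 0.07}]
target_library, target_result = prune_by_distortion(library, list(library))
```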
In this embodiment, after the voiceprint model library and the initial recognition result are adjusted according to the distortion degree result to obtain the target recognition result, the sentences to be recognized are further ordered according to features such as the volume of the speech recognition signal, the completeness of its harmonics and its distortion degree, and the sentence to be recognized corresponding to a user identifier is selected for output as the speech recognition result. For example, a sentence to be recognized with large volume, complete harmonics and small distortion, together with its corresponding user identifier, is selected and output. The target recognition result is then judged with the assistance of user behavior: if the target recognition result is output in error, the user is prompted to repeat the voice recognition step so as to optimize the voiceprint model library and correct the output user identifier and recognition result, further improving the accuracy of the voiceprint model library with respect to the user's voice recognition result.
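The ordering-and-selection step might look like the following sketch; the record fields and the concrete sort order (volume descending, harmonic completeness descending, distortion ascending) are illustrative assumptions:

```python
def select_best(candidates):
    """Rank candidate sentences by volume (descending), harmonic
    completeness (descending) and distortion degree (ascending), and
    return the best one for output."""
    ranked = sorted(candidates,
                    key=lambda c: (-c["volume"],
                                   -c["harmonic_completeness"],
                                   c["distortion"]))
    return ranked[0] if ranked else None
```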
In this embodiment, the scenario is a conference record: according to the registered voiceprint features and the distortion degree results of the voices to be recognized, the sentences to be recognized are classified and summarized, the voices to be recognized bearing the same user identifier are gathered together, and the corresponding user identifier is used as the name of each group in the output result.
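As a sketch of this conference-record summarization, the recognized sentences can be grouped under their user identifiers; the (user_id, text) entry layout is an assumption for illustration:

```python
from collections import defaultdict

def summarize_by_user(target_result):
    """Group recognized sentence texts under their user identifier,
    as in a conference record."""
    record = defaultdict(list)
    for user_id, text in target_result:
        record[user_id].append(text)
    return dict(record)

# {'user 1': ['first sentence', 'third sentence'], 'user 2': ['second sentence']}
print(summarize_by_user([("user 1", "first sentence"),
                         ("user 2", "second sentence"),
                         ("user 1", "third sentence")]))
```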
According to the voice recognition signal preprocessing method provided by the embodiment of the present invention, signal distortion degree analysis is performed by means of the THD total harmonic distortion analysis method, the voiceprint model library and the different-speaker results analyzed from it are finely adjusted according to the distortion degree analysis result, and the finely adjusted recognition result is used as the target voice to be recognized, so that the accuracy of the voice recognition result is improved.
Fig. 2 is a schematic diagram showing the structure of an embodiment of the speech recognition signal preprocessing apparatus of the present invention. As shown in fig. 2, the apparatus 200 includes: voiceprint extraction module 210, voiceprint registration module 220, distortion analysis module 230, and adjustment module 240.
The voiceprint extraction module 210 is configured to receive a voice signal to be recognized and extract voiceprint features of each sentence to be recognized in the voice signal to be recognized, where the voice signal to be recognized includes at least one sentence to be recognized.
The voiceprint registration module 220 is configured to recognize the voiceprint features of each sentence to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and the recognition result is the correspondence between the sentence to be recognized and the user.
The distortion analysis module 230 is configured to perform distortion analysis on each sentence to be recognized of the voice signal to be recognized, so as to obtain a distortion degree result of each sentence to be recognized.
And the adjustment module 240 is configured to adjust the voiceprint model library and the initial recognition result according to the distortion degree result, so as to obtain a target voiceprint model library and a target recognition result.
The specific working process of each module is as follows:
the voiceprint extraction module 210 receives a voice signal to be identified, and extracts voiceprint features of each sentence to be identified in the voice signal to be identified, where the voice signal to be identified includes at least one sentence to be identified.
The voice signal to be recognized is the voice portion that remains after processing by a microphone array. The invention aims to effectively extract the voice signal of the user using the voice recognition device, remove the voice signals of other people around the user introduced by the use environment, and thereby improve the recognition accuracy of the input voice signal. Abnormal signals such as background noise, reverberation, echo, interference and car horns are removed when the signal passes through the microphone array.
In this embodiment, the voiceprint extraction module 210 includes a sentence dividing submodule and a voiceprint fusion submodule.
The sentence dividing submodule is used for dividing the voice signal to be recognized into a plurality of sentences, each sentence in the voice signal to be recognized being treated as one sentence to be recognized.
And the voiceprint fusion submodule is used for extracting the identity characteristics and the text characteristics of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing the identity characteristics and the text characteristics to obtain voiceprint characteristics.
Specifically, the voiceprint fusion submodule adopts a DNN algorithm to respectively extract the identity features of the speaker and the corresponding text features related to the content of the voice information. The identity features include time-domain and frequency-domain features describing the sound quality, loudness and pitch of the voice, such as short-time energy, short-time average amplitude, short-time zero-crossing rate, MFCC parameters, PLP parameters and pitch. The text features are the text content related to the content of the voice signal to be recognized. The identity features and the text features are fused to obtain the voiceprint features.
The voiceprint registration module 220 recognizes the voiceprint features of each sentence to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and the recognition result is the correspondence between the sentence to be recognized and the user.
Wherein, the voiceprint registration module 220 stores the voiceprint features in association with user identifiers to construct the voiceprint model library. Specifically, the voiceprint features of each sentence to be recognized in the received voice signal are stored in association with a user identifier. Voiceprint self-registration is performed with each sentence to be recognized as a unit, and after the user identifiers are associated, a free-speech voiceprint model library is established to store the voiceprint features of the users. The user identifiers are a plurality of preset identifiers used for matching voiceprint features. The current voice signal to be recognized is compared with the voiceprint features of each stored sentence to be recognized in the voiceprint model library and the similarity is judged, so as to match a user identifier to the current voice signal to be recognized. The recognition result, i.e. the correspondence between a sentence to be recognized and a user, is represented by the correspondence between the voiceprint features of the sentence and its user identifier.
For example, in a conference scenario, a speech signal is received. When the first sentence is received, its voiceprint features are extracted, matched with a random user identifier, such as user 1, and stored in the voiceprint model library after being associated. When the second sentence is received, its voiceprint features are extracted and compared with those of the first sentence in the voiceprint model library. If the similarity reaches a preset similarity threshold, the second sentence is matched to user 1, that is, the second sentence was also spoken by user 1. If the similarity does not reach the threshold, a user identifier, such as identifier 2, is randomly matched to the second sentence, and its voiceprint features are associated with that identifier and stored in the voiceprint model library. When the third sentence is received, it is matched against the voiceprint features of the first and second sentences in the same way, so that the user identifier of the third sentence is obtained and stored in the voiceprint model library in an associated manner. The user identifier can be stored in association with the sentence to be recognized in the form of an added field.
The voiceprint registration module 220 thus compares the voiceprint features of the current sentence to be recognized with those of the previously stored sentences in the voiceprint model library and matches the corresponding user identifier, so that the user identifier corresponding to each sentence to be recognized is obtained; the voiceprint features of the sentences to be recognized together with their corresponding user identifiers constitute the initial recognition result. That is, through the above processing, the speaker corresponding to each sentence to be recognized can be obtained, so that the speech signals generated in the conference can be accurately judged as to which sentences are spoken by one user and which by another.
The distortion analysis module 230 performs distortion analysis on each sentence to be recognized of the voice signal to be recognized, so as to obtain a distortion degree result of each sentence to be recognized.
Distortion refers to the change in waveform of the output signal relative to the input signal caused by interference and noise acting on the sound during conversion, amplification and transmission; the degree to which the output signal deviates from the input signal is the distortion degree. Such interference and noise include the voices of other people around the speaker, so distortion analysis can assist in distinguishing the speech signal of the speaker from those of others.
In this embodiment, the distortion degree of the original speech signal is analyzed by using the THD total harmonic distortion analysis formulas. Signal distortion degree analysis is performed on each sentence to be recognized as a unit, so as to obtain the distortion degree of the output signal of the speech recognition signal preprocessing device relative to its input signal.
The distortion degree is analyzed by adopting the THD total harmonic distortion analysis formulas.

First, the volume root mean square $V_{rms}$ of the selected voice signal to be recognized is computed from the values $V_{samp}$ of the waveform sample points by inserting them into the standard root mean square equation:

$$V_{rms}=\sqrt{\frac{1}{M}\sum_{i=1}^{M}V_{samp,i}^{2}}$$

where $M$ is the number of sample points.

Then, the THD parameter is calculated as the total harmonic content in the signal:

$$V_{THD\_R}=\frac{\sqrt{\sum_{h=2}^{N}V_{h,rms}^{2}}}{V_{rms}}$$

wherein $V_{THD\_R}$ represents the ratio of the root mean square value of all harmonic components up to the designated order $N$ to the total root mean square value, $V_{h,rms}$ represents the root mean square of the volume of the $h$-th harmonic, and $h$ denotes the designated harmonic order.
After analysis and calculation according to the above distortion degree formulas, the distortion degree result corresponding to each sentence to be recognized in the voice signal to be recognized is obtained. In this embodiment, the distortion degree result may be expressed as a percentage; for example, a distortion of 5% indicates that the sentence to be recognized is distorted by 5% relative to its original input.
The adjustment module 240 adjusts the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In this embodiment, after the initial recognition result is obtained, an initial correspondence between the voiceprint features of each sentence to be recognized and its user identifier is available, together with a voiceprint model library in which the voiceprint features of all the sentences to be recognized are associated with their corresponding user identifiers. The initial recognition result does not take the distortion degree into account, so whether each sentence to be recognized is distorted must be judged according to its calculated distortion degree result. Among sentences whose other voiceprint characteristics, such as pitch, are the same, the voiceprint features of sentences with a large change in signal distortion degree are removed from the self-registration, and only sentences within a reasonable distortion degree range are kept as valid self-registered voiceprint features. Therefore, in this embodiment, a distortion degree threshold interval is set. A distortion degree result outside this interval indicates that the sentence to be recognized is highly distorted, so the sentences to be recognized whose distortion degree results are not within the threshold interval, together with their corresponding recognition results, are removed from the voiceprint model library; the voiceprint model library adjusted in this way is the target voiceprint model library, and the target recognition result is obtained once the highly distorted sentences are removed. In this embodiment, the distortion degree threshold interval is 0-5%: if the distortion degree result of a sentence to be recognized exceeds 5%, the sentence is rejected.
In this embodiment, after the voiceprint model library and the initial recognition result are adjusted according to the distortion degree result to obtain the target recognition result, the sentences to be recognized are further ordered according to features such as the volume of the speech recognition signal, the completeness of its harmonics and its distortion degree, and the sentence to be recognized corresponding to a user identifier is selected for output as the speech recognition result. For example, a sentence to be recognized with large volume, complete harmonics and small distortion, together with its corresponding user identifier, is selected and output. The target recognition result is then judged with the assistance of user behavior: if the target recognition result is output in error, the user is prompted to repeat the voice recognition step so as to optimize the voiceprint model library and correct the output user identifier and recognition result, further improving the accuracy of the voiceprint model library with respect to the user's voice recognition result.
In this embodiment, the scenario is a conference record: according to the registered voiceprint features and the distortion degree results of the voices to be recognized, the sentences to be recognized are classified and summarized, the voices to be recognized bearing the same user identifier are gathered together, and the corresponding user identifier is used as the name of each group in the output result.
According to the voice recognition signal preprocessing apparatus provided by the embodiment of the present invention, signal distortion degree analysis is performed by means of the THD total harmonic distortion analysis method, the voiceprint model library and the different-speaker results analyzed from it are finely adjusted according to the distortion degree analysis result, and the finely adjusted recognition result is used as the target voice to be recognized, so that the accuracy of the voice recognition result is improved.
Fig. 3 is a schematic structural diagram of an embodiment of a speech recognition signal preprocessing device according to the present invention, and the embodiment of the present invention is not limited to the specific implementation of the speech recognition signal preprocessing device.
As shown in fig. 3, the voice recognition signal preprocessing device may include: a processor (processor) 302, a communication interface (Communications Interface) 304, a memory (memory) 306, and a communication bus 308.
Wherein: the processor 302, the communication interface 304, and the memory 306 communicate with each other via the communication bus 308. The communication interface 304 is used for communicating with network elements of other devices, such as clients or other application servers. The processor 302 is configured to execute the program 310, and may specifically perform the relevant steps in the foregoing embodiments of the voice recognition signal preprocessing method.
In particular, program 310 may include program code comprising computer-executable instructions.
The processor 302 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the voice recognition signal preprocessing device may be processors of the same type, such as one or more CPUs, or may be processors of different types, such as one or more CPUs and one or more ASICs.
Memory 306 for storing programs 310. Memory 306 may comprise high-speed RAM memory or may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
Program 310 may be specifically invoked by processor 302 to cause the electronic device to:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of each sentence to be recognized according to a voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and the recognition result is the correspondence between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, a voice signal to be recognized is received, and voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized are extracted, wherein the voice signal to be recognized comprises at least one sentence to be recognized, and further comprises:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristics and text characteristics of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain voiceprint characteristics.
In an optional manner, recognizing the voiceprint features of each sentence to be recognized according to the voiceprint model library to obtain the initial recognition result, where the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, further includes:
storing the voiceprint features and the corresponding user identifications in an associated manner, and constructing a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint features stored in the voiceprint model library, judging the similarity, matching the current voice signal to be recognized with the corresponding user identification, and storing it in the voiceprint model library in an associated manner.
In an optional manner, performing distortion analysis on each sentence to be recognized of the speech signal to be recognized to obtain a distortion degree result of each sentence to be recognized further includes:
performing distortion analysis on each sentence to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion degree result corresponding to each sentence.
In an alternative manner, the THD total harmonic distortion analysis method further includes:
and adopting a THD total harmonic distortion analysis formula to analyze the distortion degree:

$$V_{THD\_R}=\frac{\sqrt{\sum_{h=2}^{N}V_{h,rms}^{2}}}{V_{rms}}$$

wherein $V_{THD\_R}$ represents the ratio of the root mean square value of all harmonic components up to the designated order $N$ to the total root mean square value $V_{rms}$, $V_{h,rms}$ represents the root mean square of the volume of the $h$-th harmonic, and $h$ denotes the designated harmonic order.
In an optional manner, the initial recognition result is adjusted according to the distortion degree result to obtain a target recognition result, which further includes:
determining whether the distortion degree result of each sentence to be recognized is within a distortion degree threshold interval;
and removing, from the voiceprint model library, the sentences to be recognized whose distortion degree results are not within the distortion degree threshold interval, together with the corresponding user identifications, to obtain a target voiceprint model library and a target recognition result.
According to the voice recognition signal preprocessing device provided by the embodiment of the present invention, signal distortion degree analysis is performed by means of the THD total harmonic distortion analysis method, the voiceprint model library and the different-speaker results analyzed from it are finely adjusted according to the distortion degree analysis result, and the finely adjusted recognition result is used as the target voice to be recognized, so that the accuracy of the voice recognition result is improved.
An embodiment of the present invention provides a computer readable storage medium storing at least one executable instruction that, when executed on a speech recognition signal preprocessing apparatus/device, causes the speech recognition signal preprocessing apparatus/device to execute the speech recognition signal preprocessing method in any of the above method embodiments.
The executable instructions may be specifically for causing a speech recognition signal preprocessing device/arrangement to:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of each sentence to be recognized according to a voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and the recognition result is the correspondence between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, a voice signal to be recognized is received, and voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized are extracted, wherein the voice signal to be recognized comprises at least one sentence to be recognized, and further comprises:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristics and text characteristics of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain voiceprint characteristics.
In an optional manner, recognizing the voiceprint features of each sentence to be recognized according to the voiceprint model library to obtain the initial recognition result, where the voiceprint model library is constructed by short-time registration according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, further includes:
storing the voiceprint features and the corresponding user identifications in an associated manner, and constructing a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint features stored in the voiceprint model library, judging the similarity, matching the current voice signal to be recognized with the corresponding user identification, and storing it in the voiceprint model library in an associated manner.
In an optional manner, performing distortion analysis on each sentence to be recognized of the speech signal to be recognized to obtain a distortion degree result of each sentence to be recognized further includes:
performing distortion analysis on each sentence to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion degree result corresponding to each sentence.
In an alternative manner, the THD total harmonic distortion analysis method further includes:
and adopting a THD total harmonic distortion analysis formula to analyze the distortion degree:

$$V_{THD\_R}=\frac{\sqrt{\sum_{h=2}^{N}V_{h,rms}^{2}}}{V_{rms}}$$

wherein $V_{THD\_R}$ represents the ratio of the root mean square value of all harmonic components up to the designated order $N$ to the total root mean square value $V_{rms}$, $V_{h,rms}$ represents the root mean square of the volume of the $h$-th harmonic, and $h$ denotes the designated harmonic order.
In an optional manner, the initial recognition result is adjusted according to the distortion degree result to obtain a target recognition result, which further includes:
determining whether the distortion degree result of each sentence to be recognized is within a distortion degree threshold interval;
and removing, from the voiceprint model library, the sentences to be recognized whose distortion degree results are not within the distortion degree threshold interval, together with the corresponding user identifications, to obtain a target voiceprint model library and a target recognition result.
In this embodiment, signal distortion degree analysis is performed by means of the THD total harmonic distortion analysis method, the voiceprint model library and the different-speaker results analyzed from it are finely adjusted according to the distortion degree analysis result, and the finely adjusted recognition result is used as the target voice to be recognized, so that the accuracy of the voice recognition result is improved.
The embodiment of the invention provides a voice recognition signal preprocessing device for executing the voice recognition signal preprocessing method.
An embodiment of the present invention provides a computer program, where the computer program may be invoked by a processor to cause the electronic device to execute the method for preprocessing a speech recognition signal in any of the above method embodiments.
An embodiment of the present invention provides a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when run on a computer, cause the computer to perform the method for preprocessing a speech recognition signal in any of the method embodiments described above.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required for such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided to disclose the enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, inventive aspects may lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.