CN113571054A - Speech recognition signal preprocessing method, device, equipment and computer storage medium


Info

Publication number
CN113571054A
CN113571054A
Authority
CN
China
Prior art keywords
recognized
sentence
voiceprint
model library
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010349173.XA
Other languages
Chinese (zh)
Other versions
CN113571054B (en)
Inventor
陈润泽
陈航
任永华
胡瑛
王振志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Zhejiang Innovation Research Institute Co ltd
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Zhejiang Co Ltd
Priority to CN202010349173.XA
Publication of CN113571054A
Application granted
Publication of CN113571054B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention relates to the technical field of speech signal processing and discloses a speech recognition signal preprocessing method comprising the following steps: receiving a speech signal to be recognized and extracting the voiceprint features of each sentence to be recognized in the speech signal, wherein the speech signal to be recognized comprises at least one sentence to be recognized; recognizing the voiceprint features of each sentence to be recognized against a voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is built by short-time registration of every sentence to be recognized that precedes the current sentence in the speech signal; performing distortion analysis on each sentence to be recognized to obtain a distortion degree result for each sentence; and adjusting the voiceprint model library and the initial recognition result according to the distortion degree results to obtain a target voiceprint model library and a target recognition result. In this way, the embodiment of the invention improves the accuracy of speech recognition.

Description

Speech recognition signal preprocessing method, device, equipment and computer storage medium
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a method, a device and equipment for preprocessing a voice recognition signal and a computer readable storage medium.
Background
At present, in order to improve the accuracy of speech recognition, the input signal is generally screened and filtered by a microphone array. The main purpose is to remove interference sources other than the effective sound source, and the processing mainly includes the following parts:
1. sound source localization: sound sources are localized by angle and distance measurements.
2. Echo suppression and elimination: abnormal signals such as background noise, interference, reverberation and echo are suppressed.
3. Signal separation and extraction: signals are separated and extracted according to predefined rules.
However, existing microphone array technology mainly targets interference sources other than human voice in the microphone input, and is highly effective against abnormal signals such as background noise, reverberation, echo, interference and car horn sounds. It cannot process the voices of other people around the user that are introduced by the usage environment.
Therefore, a speech signal preprocessing method is needed that can eliminate the voices of other people around the user introduced by the usage environment, so as to improve the accuracy of speech recognition.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present invention provide a speech recognition signal preprocessing method, apparatus, device and computer-readable storage medium, which solve the technical problem in the prior art that speech recognition cannot eliminate the voices of other people introduced by the surrounding environment.
According to an aspect of an embodiment of the present invention, there is provided a speech recognition signal preprocessing method, including:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of the statements to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, receiving a speech signal to be recognized and extracting the voiceprint features of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized comprises at least one sentence to be recognized, further comprises:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristic and the text characteristic of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain the voiceprint characteristic.
In an optional manner, the method includes identifying a voiceprint feature of each to-be-identified sentence according to a voiceprint model library to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each to-be-identified sentence before a current to-be-identified sentence in the to-be-identified speech signal, and further includes:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the corresponding user identification for the current voice signal to be recognized, and storing the user identification in the voiceprint model library in a correlation manner.
In an optional manner, performing distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence, further includes:
and carrying out distortion degree analysis on each sentence to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion degree result corresponding to each sentence.
In an optional manner, the THD total harmonic distortion analysis method further includes:
and (3) carrying out distortion degree analysis by adopting a THD total harmonic distortion analysis formula:
Figure BDA0002471321870000031
wherein, VTHD_RRepresenting the ratio of the root mean square value of all harmonic components of a given Nth order to the total root mean square value, Vh,rmsRepresents the volume root mean square, rms represents the root mean square, and h represents the specified order.
In an optional manner, adjusting the initial recognition result according to the distortion result to obtain a target recognition result, further includes:
determining whether the distortion degree result of each sentence to be identified is within a distortion degree threshold interval;
and eliminating the sentences to be recognized and the corresponding user identifications of which the distortion degree results are not in the distortion degree threshold interval in the voiceprint model library to obtain a target voiceprint model library and a target recognition result.
According to another aspect of the embodiments of the present invention, there is also provided a speech recognition signal preprocessing apparatus including:
the voice recognition system comprises a voiceprint extraction module, a voice recognition module and a voice recognition module, wherein the voiceprint extraction module is used for receiving a voice signal to be recognized and extracting voiceprint characteristics of sentences to be recognized in the voice signal to be recognized, and the voice signal to be recognized comprises at least one sentence to be recognized;
the voiceprint registration module is used for identifying the voiceprint characteristics of the statements to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user;
the distortion degree analysis module is used for carrying out distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and the adjusting module is used for adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, the voiceprint registration module identifies the voiceprint features of the statements to be recognized according to a voiceprint model library to obtain an initial recognition result, where the voiceprint model library is obtained by performing short-time registration construction according to the statements to be recognized before the current statement to be recognized in the speech signal to be recognized, and further includes:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the corresponding user identification for the current voice signal to be recognized, and storing the user identification in the voiceprint model library in a correlation manner.
According to another aspect of embodiments of the present invention, there is provided a speech recognition signal preprocessing apparatus including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation of the voice recognition signal preprocessing method.
According to a further aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having at least one executable instruction stored therein, which when run on a speech recognition signal preprocessing apparatus/device, causes the speech recognition signal preprocessing apparatus/device to perform the operations of the above-mentioned speech recognition signal preprocessing method.
The voice recognition signal preprocessing method of the embodiment analyzes the signal distortion degree by combining the THD total harmonic distortion analysis method, finely adjusts the voiceprint model library according to the distortion degree analysis result, finely adjusts different speaker results analyzed by the voiceprint model library, and takes the finely adjusted voice recognition result as the target voice to be recognized, thereby improving the accuracy of the voice recognition result.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and the embodiments of the present invention can be implemented according to the content of the description in order to make the technical means of the embodiments of the present invention more clearly understood, and the detailed description of the present invention is provided below in order to make the foregoing and other objects, features, and advantages of the embodiments of the present invention more clearly understandable.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for preprocessing a speech recognition signal according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a speech recognition signal preprocessing apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram illustrating a speech recognition signal preprocessing apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein.
Fig. 1 is a flow chart illustrating a method for preprocessing a speech recognition signal according to an embodiment of the present invention, where the method is performed by a device for preprocessing a speech recognition signal. As shown in fig. 1, the method comprises the steps of:
s110: receiving a voice signal to be recognized, and extracting the voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized.
The speech signal to be recognized is the residual human-voice portion of the signal after processing by the microphone array. The invention aims to effectively extract the speech signal of the user of the speech recognition device, remove the voices of other people around the user introduced by the usage environment, and improve the accuracy of input speech signal recognition. Abnormal signals such as background noise, reverberation, echo, interference and car horn sounds have already been processed when the signal passed through the microphone array.
Specifically, after the speech signal to be recognized is received, it is divided into a plurality of sentences, each sentence in the signal being treated as one sentence to be recognized. The identity features and text features of each sentence to be recognized before the current sentence are extracted and fused to obtain voiceprint features. A DNN algorithm is used to extract the speaker identity features contained in the speech and the corresponding text features related to the content of the speech information. The identity features include timbre, loudness, pitch and frequency-domain features, such as short-time energy, short-time average amplitude, short-time zero-crossing rate, MFCC parameters, PLP parameters and pitch. The text feature is the text content related to the speech signal to be recognized. The identity features and text features are fused to obtain the voiceprint features.
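As a minimal illustration of this fusion step (a sketch under stated assumptions, not the patented implementation), the code below replaces the DNN front end with hand-crafted short-time energy and zero-crossing statistics and assumes `text_embedding` comes from an external text model:

```python
# Minimal voiceprint-fusion sketch (assumption: hand-crafted statistics stand
# in for the DNN identity features described in the patent).
import numpy as np

def identity_features(samples: np.ndarray, frame: int = 400) -> np.ndarray:
    """Crude identity features: frame-wise short-time energy and short-time
    zero-crossing rate, summarized by mean and standard deviation.
    Assumes len(samples) >= frame."""
    frames = samples[: len(samples) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)                         # short-time energy
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)  # zero-crossing rate
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

def fuse_voiceprint(samples: np.ndarray, text_embedding: np.ndarray) -> np.ndarray:
    """Concatenate identity features and text features into one voiceprint."""
    return np.concatenate([identity_features(samples), text_embedding])
```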
S120: recognizing the voiceprint characteristics of the statements to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user.
Specifically, in this embodiment, the steps specifically include:
and (4) storing the voiceprint characteristics and the user identification in an associated manner to construct a voiceprint model library. Specifically, the voiceprint characteristics of each statement to be recognized in the received speech signal to be recognized are respectively associated and stored with the user identifier. And (4) carrying out voiceprint self-registration by taking each sentence (every sentence) to be recognized as a unit, establishing a free-speaking voiceprint model library after associating the user identification, and storing the voiceprint characteristics of the user. The user identification is a plurality of preset identifications used for matching the voiceprint characteristics. The recognition result is that the corresponding relation between the sentence to be recognized and the user is expressed by the corresponding relation between the voiceprint feature of the sentence to be recognized and the user identification.
And comparing the current voice signal to be recognized with the voice print characteristics of the previously stored sentences to be recognized in the voice print model library, and judging the similarity so as to match the user identification for the current voice signal to be recognized.
For example, in a conference scenario, a speech signal is received. When the first sentence is received, its voiceprint features are extracted, matched with a random user identifier, such as user 1, and stored in the voiceprint model library in association with that identifier. When the second sentence is received, its voiceprint features are extracted and compared with those of the first sentence in the library; if the similarity reaches a preset similarity threshold, indicating that the second sentence was also spoken by user 1, the second sentence is matched as user 1. If the similarity does not reach the threshold, a new user identifier, such as identifier 2, is randomly matched to the second sentence. The voiceprint features of the second sentence are associated with the corresponding user identifier and stored in the voiceprint model library. When the third sentence is received, it is matched against the voiceprint features of the first and second sentences in the same way to obtain its user identifier, which is stored in the library in an associated manner. The user identifier can be stored in association with the sentence to be recognized in the form of an added field.
In this way, the current sentence to be recognized is compared with the voiceprint features of each previous sentence stored in the voiceprint model library and matched with the corresponding user identifier, so that the user identifier corresponding to each sentence to be recognized is obtained. The voiceprint features of each sentence and the corresponding user identifiers constitute the initial recognition result. That is, through the above processing, the speaker corresponding to each sentence to be recognized is obtained, making it possible to judge accurately which sentences in a speech signal produced during a conference were spoken by one user and which by another. A compact sketch of this self-registration loop is shown below.
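In the sketch, the cosine similarity measure and the 0.8 threshold are illustrative assumptions; this embodiment only requires a preset similarity threshold and does not fix a metric or value.

```python
# Sentence-by-sentence voiceprint self-registration sketch (similarity metric
# and threshold are assumed; the embodiment only requires a preset threshold).
import numpy as np

SIM_THRESHOLD = 0.8  # assumed preset similarity threshold

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def self_register(voiceprints):
    """Match each sentence's voiceprint against the growing model library;
    reuse a user id on a sufficiently similar hit, otherwise create one."""
    library, results, next_id = [], [], 1   # library: (user_id, voiceprint)
    for i, vp in enumerate(voiceprints):
        best = max(library, key=lambda e: cosine(vp, e[1]), default=None)
        if best is not None and cosine(vp, best[1]) >= SIM_THRESHOLD:
            user = best[0]                  # same speaker as a stored sentence
        else:
            user = f"user {next_id}"        # new speaker: new random identifier
            next_id += 1
        library.append((user, vp))
        results.append((i, user))           # initial recognition result
    return library, results
```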
S130: and performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized.
Distortion refers to the change in the output waveform relative to the input signal caused by interference and noise acting on the sound during conversion, amplification and transmission; the degree to which the output signal deviates from the input signal is the degree of distortion. This interference and noise includes the voices of other people around the speaker, so distortion analysis can help distinguish the speaker's speech signal from those of others.
In this embodiment, the THD total harmonic distortion formula is used to analyze the distortion of the original speech signal. The signal distortion is analyzed with each sentence to be recognized as a unit, yielding the distortion of the input signal of the speech recognition signal preprocessing device relative to the output signal.
Distortion degree analysis uses the following THD total harmonic distortion formulas.

First, the volume waveform sample values $V_{samp}$ of the selected speech signal to be recognized are used to compute the corresponding volume root mean square $V_{rms}$ with the standard root mean square equation:

$$V_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} V_{samp,i}^{2}}$$

Then, the THD parameter is calculated from the total harmonic content in the signal:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$

where $V_{THD\_R}$ represents the ratio of the root mean square value of all harmonic components up to a given order N to the total root mean square value $V_{rms}$, $V_{h,rms}$ represents the volume root mean square of the h-th harmonic, rms denotes root mean square, and h denotes the harmonic order.
After analysis and calculation according to the distortion formulas, the distortion result of the volume corresponding to each sentence to be recognized in the speech signal is obtained. In this embodiment, the distortion result can be expressed as a percentage; for example, a distortion of 5% means the sentence to be recognized is distorted by 5% relative to its original input.
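For illustration, a per-sentence THD_R computation following the formulas above might look like the sketch below; estimating the fundamental as the strongest FFT bin and summing ten harmonics are assumptions for the sketch, not part of the patent:

```python
# THD_R sketch per the formulas above (fundamental estimation via the
# strongest FFT bin is an assumption for illustration).
import numpy as np

def thd_r(samples: np.ndarray, n_harmonics: int = 10) -> float:
    """Ratio of the RMS of harmonics 2..N to the total RMS (0.05 == 5%)."""
    v_rms = np.sqrt(np.mean(samples ** 2))          # total volume RMS
    spectrum = np.abs(np.fft.rfft(samples)) / len(samples)
    f0_bin = int(np.argmax(spectrum[1:])) + 1       # assumed fundamental bin
    harm_sq = 0.0
    for h in range(2, n_harmonics + 1):
        b = f0_bin * h
        if b < len(spectrum):
            amp = 2.0 * spectrum[b]                 # one-sided amplitude
            harm_sq += (amp / np.sqrt(2)) ** 2      # RMS^2 of h-th harmonic
    return float(np.sqrt(harm_sq) / v_rms)
```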
S140: and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In this embodiment, after the initial recognition result is obtained, there is an initial correspondence between the voiceprint features of each sentence to be recognized and its user identifier, and a voiceprint model library in which the voiceprint features of all sentences are stored in association with their user identifiers. The initial recognition result does not take distortion into account, so whether each sentence is distorted must be judged from its calculated distortion result. For the same timbre, pitch and other voiceprint characteristics, sentences whose signal distortion changes greatly must be removed from the self-registration, and only sentences within a reasonable distortion range are kept as valid self-registered voiceprint features. Therefore, a distortion threshold interval is set in this embodiment. A result beyond the interval indicates that the sentence's distortion is high, so sentences whose distortion results are not within the threshold interval, together with their recognition results, are removed from the voiceprint model library, adjusting it into the target voiceprint model library; the target recognition result, with highly distorted sentences removed, is obtained at the same time. In this embodiment, the distortion threshold interval is 0-5%: if the distortion result of a sentence exceeds 5%, the sentence is rejected.
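A sketch of this adjustment step follows, using the 0-5% interval from this embodiment; the per-sentence indexing of the library and result structures is an assumption carried over from the registration sketch above:

```python
# Pruning sketch: drop sentences whose distortion falls outside the threshold
# interval, yielding the target model library and target recognition result.
THD_MIN, THD_MAX = 0.0, 0.05  # 0-5% interval from this embodiment

def prune_by_distortion(library, results, distortions):
    keep = [i for i, d in enumerate(distortions) if THD_MIN <= d <= THD_MAX]
    target_library = [library[i] for i in keep]
    target_results = [results[i] for i in keep]
    return target_library, target_results
```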
In this embodiment, after the voiceprint model library and the initial recognition result are adjusted according to the distortion results to obtain the target recognition result, the sentences to be recognized are also sorted according to characteristics of the speech recognition signal such as volume, harmonic integrity and distortion degree, and the sentences corresponding to each user identifier are selected for outputting the speech recognition result. For example, sentences with large volume, complete harmonics and small distortion are selected for output together with the corresponding user identifier. User behavior assists in judging the target recognition result: if the target recognition result is output incorrectly, the user is prompted to repeat the speech recognition step, so as to optimize the voiceprint model library, correct the output user identifier and recognition result, and further improve the accuracy of the library's recognition of the user's speech.
In this embodiment, the scenario is a conference record: the sentences to be recognized are classified and grouped according to the registered voiceprint features and distortion results, the speech of the same user identifier is grouped together, and the result is output with the corresponding user identifier as the name.
The voice recognition signal preprocessing method of the embodiment analyzes the signal distortion degree by combining the THD total harmonic distortion analysis method, finely adjusts the voiceprint model library according to the distortion degree analysis result, finely adjusts different speaker results analyzed by the voiceprint model library, and takes the finely adjusted voice recognition result as the target voice to be recognized, thereby improving the accuracy of the voice recognition result.
Fig. 2 is a schematic structural diagram of an embodiment of the speech recognition signal preprocessing apparatus according to the present invention. As shown in fig. 2, the apparatus 200 includes: a voiceprint extraction module 210, a voiceprint registration module 220, a distortion analysis module 230, and an adjustment module 240.
The voiceprint extraction module 210 is configured to receive a speech signal to be recognized, and extract a voiceprint feature of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized includes at least one sentence to be recognized.
The voiceprint registration module 220 is configured to identify the voiceprint features of the statements to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user.
The distortion analyzing module 230 is configured to perform distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence.
And the adjusting module 240 is configured to adjust the voiceprint model library and the initial recognition result according to the distortion result to obtain a target voiceprint model library and a target recognition result.
The specific working process of each module is as follows:
the voiceprint extraction module 210 receives a speech signal to be recognized, and extracts voiceprint features of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized includes at least one sentence to be recognized.
The speech signal to be recognized is the residual human-voice portion of the signal after processing by the microphone array. The invention aims to effectively extract the speech signal of the user of the speech recognition device, remove the voices of other people around the user introduced by the usage environment, and improve the accuracy of input speech signal recognition. Abnormal signals such as background noise, reverberation, echo, interference and car horn sounds have already been processed when the signal passed through the microphone array.
In this embodiment, the voiceprint extraction module 210 includes a sentence division submodule and a voiceprint fusion submodule.
The sentence division submodule is used for dividing the speech signal to be recognized into a plurality of sentences, each sentence in the signal being treated as one sentence to be recognized.
The voiceprint fusion submodule is used for extracting the identity features and text features of each sentence to be recognized before the current sentence in the speech signal and fusing them to obtain the voiceprint features.
Specifically, the voiceprint fusion submodule uses a DNN algorithm to extract the speaker identity features contained in the speech and the corresponding text features related to the content of the speech information. The identity features include timbre, loudness, pitch and frequency-domain features, such as short-time energy, short-time average amplitude, short-time zero-crossing rate, MFCC parameters, PLP parameters and pitch. The text feature is the text content related to the speech signal to be recognized. The identity features and text features are fused to obtain the voiceprint features.
The voiceprint registration module 220 identifies the voiceprint characteristics of the statements to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user.
The voiceprint registration module 220 stores the voiceprint features and the user identifier in an associated manner and constructs the voiceprint model library. Specifically, the voiceprint features of each sentence to be recognized in the received speech signal are each stored in association with a user identifier. Voiceprint self-registration is carried out with each sentence to be recognized as a unit; after the user identifier is associated, a free-speech voiceprint model library is established and the user's voiceprint features are stored. The user identifiers are a plurality of preset identifiers used to match voiceprint features. The current speech signal to be recognized is compared with the voiceprint features of previously stored sentences in the voiceprint model library and the similarity is judged, so as to match a user identifier to the current speech signal. The recognition result expresses the correspondence between sentences to be recognized and users through the correspondence between the voiceprint features of the sentences and the user identifiers.
For example, in a conference scenario, a speech signal is received. When the first sentence is received, its voiceprint features are extracted, matched with a random user identifier, such as user 1, and stored in the voiceprint model library in association with that identifier. When the second sentence is received, its voiceprint features are extracted and compared with those of the first sentence in the library; if the similarity reaches a preset similarity threshold, indicating that the second sentence was also spoken by user 1, the second sentence is matched as user 1. If the similarity does not reach the threshold, a new user identifier, such as identifier 2, is randomly matched to the second sentence. The voiceprint features of the second sentence are associated with the corresponding user identifier and stored in the voiceprint model library. When the third sentence is received, it is matched against the voiceprint features of the first and second sentences in the same way to obtain its user identifier, which is stored in the library in an associated manner. The user identifier can be stored in association with the sentence to be recognized in the form of an added field.
The voiceprint registration module 220 compares the current to-be-recognized statement with the voiceprint features of each previous to-be-recognized statement stored in the voiceprint model library through the above operations, and matches the corresponding user identifier, so as to obtain the user identifier corresponding to each to-be-recognized statement, where the voiceprint features of each to-be-recognized statement and the corresponding user identifiers are the initial recognition results. That is, through the above processing, the speaker corresponding to each sentence to be recognized can be obtained. Thus, it is possible to accurately judge which sentences are spoken by one user and which sentences are spoken by another user in the speech signal generated in the conference.
The distortion analysis module 230 performs distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence.
Distortion refers to the change in the output waveform relative to the input signal caused by interference and noise acting on the sound during conversion, amplification and transmission; the degree to which the output signal deviates from the input signal is the degree of distortion. This interference and noise includes the voices of other people around the speaker, so distortion analysis can help distinguish the speaker's speech signal from those of others.
In this embodiment, the THD total harmonic distortion formula is used to analyze the distortion of the original speech signal. The signal distortion is analyzed with each sentence to be recognized as a unit, yielding the distortion of the input signal of the speech recognition signal preprocessing device relative to the output signal.
Distortion degree analysis uses the THD total harmonic distortion formulas.

First, the volume waveform sample values $V_{samp}$ of the selected speech signal to be recognized are used to compute the corresponding volume root mean square $V_{rms}$ with the standard root mean square equation:

$$V_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} V_{samp,i}^{2}}$$

Then, the THD parameter is calculated from the total harmonic content in the signal:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$

where $V_{THD\_R}$ represents the ratio of the root mean square value of all harmonic components up to a given order N to the total root mean square value $V_{rms}$, $V_{h,rms}$ represents the volume root mean square of the h-th harmonic, rms denotes root mean square, and h denotes the harmonic order.
After analysis and calculation according to the distortion formulas, the distortion result corresponding to each sentence to be recognized in the speech signal is obtained. In this embodiment, the distortion result can be expressed as a percentage; for example, a distortion of 5% means the sentence to be recognized is distorted by 5% relative to its original input.
The adjusting module 240 adjusts the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In this embodiment, after the initial recognition result is obtained, there is an initial correspondence between the voiceprint features of each sentence to be recognized and its user identifier, and a voiceprint model library in which the voiceprint features of all sentences are stored in association with their user identifiers. The initial recognition result does not take distortion into account, so whether each sentence is distorted must be judged from its calculated distortion result. For the same timbre, pitch and other voiceprint characteristics, sentences whose signal distortion changes greatly must be removed from the self-registration, and only sentences within a reasonable distortion range are kept as valid self-registered voiceprint features. Therefore, a distortion threshold interval is set in this embodiment. A result beyond the interval indicates that the sentence's distortion is high, so sentences whose distortion results are not within the threshold interval, together with their recognition results, are removed from the voiceprint model library, adjusting it into the target voiceprint model library; the target recognition result, with highly distorted sentences removed, is obtained at the same time. In this embodiment, the distortion threshold interval is 0-5%: if the distortion result of a sentence exceeds 5%, the sentence is rejected.
In this embodiment, after the voiceprint model library and the initial recognition result are adjusted according to the distortion results to obtain the target recognition result, the sentences to be recognized are also sorted according to characteristics of the speech recognition signal such as volume, harmonic integrity and distortion degree, and the sentences corresponding to each user identifier are selected for outputting the speech recognition result. For example, sentences with large volume, complete harmonics and small distortion are selected for output together with the corresponding user identifier. User behavior assists in judging the target recognition result: if the target recognition result is output incorrectly, the user is prompted to repeat the speech recognition step, so as to optimize the voiceprint model library, correct the output user identifier and recognition result, and further improve the accuracy of the library's recognition of the user's speech.
In this embodiment, the scenario is a conference record: the sentences to be recognized are classified and grouped according to the registered voiceprint features and distortion results, the speech of the same user identifier is grouped together, and the result is output with the corresponding user identifier as the name.
The voice recognition signal preprocessing device of the embodiment analyzes the signal distortion degree by combining the THD total harmonic distortion analysis method, finely adjusts the voiceprint model library according to the distortion degree analysis result, finely adjusts different speaker results analyzed by the voiceprint model library, and takes the finely adjusted voice recognition result as the target voice to be recognized, thereby improving the accuracy of the voice recognition result.
Fig. 3 is a schematic structural diagram illustrating an embodiment of a speech recognition signal preprocessing device according to the present invention, and the embodiment of the present invention does not limit the specific implementation of the speech recognition signal preprocessing device.
As shown in fig. 3, the speech recognition signal preprocessing device may include: a processor (processor) 302, a communications interface (Communications Interface) 304, a memory (memory) 306, and a communication bus 308.
Wherein: the processor 302, the communication interface 304 and the memory 306 communicate with each other via the communication bus 308. The communication interface 304 is used for communicating with network elements of other devices, such as clients or other application servers. The processor 302 is configured to execute the program 310, and may specifically execute the relevant steps in the embodiments of the speech recognition signal preprocessing method.
In particular, program 310 may include program code comprising computer-executable instructions.
The processor 302 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The speech recognition signal preprocessing device comprises one or more processors, which can be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And a memory 306 for storing a program 310. Memory 306 may comprise high-speed RAM memory and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
Specifically, the program 310 may be invoked by the processor 302 to cause the electronic device to perform the following operations:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of the statements to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, receiving a speech signal to be recognized, and extracting a voiceprint feature of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized includes at least one sentence to be recognized, further including:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristic and the text characteristic of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain the voiceprint characteristic.
In an optional manner, the method includes identifying a voiceprint feature of each to-be-identified sentence according to a voiceprint model library to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each to-be-identified sentence before a current to-be-identified sentence in the to-be-identified speech signal, and further includes:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the corresponding user identification for the current voice signal to be recognized, and storing the user identification in the voiceprint model library in a correlation manner.
In an optional manner, performing distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence, further includes:
and carrying out distortion degree analysis on each sentence to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion degree result corresponding to each sentence.
In an optional manner, the THD total harmonic distortion analysis method further includes:
and (3) carrying out distortion degree analysis by adopting a THD total harmonic distortion analysis formula:
Figure BDA0002471321870000151
wherein, VTHD_RRepresenting the ratio of the root mean square value of all harmonic components of a given Nth order to the total root mean square value, Vh,rmsRepresents the volume root mean square, rms represents the root mean square, and h represents the specified order.
In an optional manner, adjusting the initial recognition result according to the distortion result to obtain a target recognition result, further includes:
determining whether the distortion degree result of each sentence to be identified is within a distortion degree threshold interval;
and eliminating the sentences to be recognized and the corresponding user identifications of which the distortion degree results are not in the distortion degree threshold interval in the voiceprint model library to obtain a target voiceprint model library and a target recognition result.
The voice recognition signal preprocessing device of the embodiment analyzes the signal distortion degree by combining the THD total harmonic distortion analysis method, finely adjusts the voiceprint model library according to the distortion degree analysis result, finely adjusts different speaker results analyzed by the voiceprint model library, and takes the finely adjusted voice recognition result as the target voice to be recognized, thereby improving the accuracy of the voice recognition result.
An embodiment of the present invention provides a computer-readable storage medium, where the storage medium stores at least one executable instruction, and when the executable instruction is executed on a speech recognition signal preprocessing device/apparatus, the speech recognition signal preprocessing device/apparatus executes a speech recognition signal preprocessing method in any method embodiment described above.
The executable instructions may be specifically configured to cause the speech recognition signal pre-processing device/arrangement to perform the following operations:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of the statements to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, receiving a speech signal to be recognized, and extracting a voiceprint feature of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized includes at least one sentence to be recognized, further including:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristic and the text characteristic of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain the voiceprint characteristic.
In an optional manner, the method includes identifying a voiceprint feature of each to-be-identified sentence according to a voiceprint model library to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each to-be-identified sentence before a current to-be-identified sentence in the to-be-identified speech signal, and further includes:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the corresponding user identification for the current voice signal to be recognized, and storing the user identification in the voiceprint model library in a correlation manner.
In an optional manner, performing distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence, further includes:
and carrying out distortion degree analysis on each sentence to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion degree result corresponding to each sentence.
In an optional manner, the THD total harmonic distortion analysis method further includes:
and (3) carrying out distortion degree analysis by adopting a THD total harmonic distortion analysis formula:
Figure BDA0002471321870000161
wherein, VTHD_RRepresenting the ratio of the root mean square value of all harmonic components of a given Nth order to the total root mean square value, Vh,rmsRepresents the volume root mean square, rms represents the root mean square, and h represents the specified order.
In an optional manner, adjusting the initial recognition result according to the distortion result to obtain a target recognition result, further includes:
determining whether the distortion degree result of each sentence to be identified is within a distortion degree threshold interval;
and eliminating the sentences to be recognized and the corresponding user identifications of which the distortion degree results are not in the distortion degree threshold interval in the voiceprint model library to obtain a target voiceprint model library and a target recognition result.
In the embodiment, the signal distortion degree is analyzed by combining a THD total harmonic distortion analysis method, the voiceprint model library is finely adjusted according to the distortion degree analysis result, different speaker results analyzed by the voiceprint model library are finely adjusted, and the finely adjusted voice recognition result is used as the target voice to be recognized, so that the accuracy of the voice recognition result is improved.
The embodiment of the invention provides a preprocessing device based on a voice recognition signal, which is used for executing the voice recognition signal preprocessing method.
Embodiments of the present invention provide a computer program, where the computer program can be called by a processor to enable the electronic device to execute the speech recognition signal preprocessing method in any of the above method embodiments.
An embodiment of the present invention provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when run on a computer, cause the computer to perform the method for pre-processing a speech recognition signal in any of the above-mentioned method embodiments.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one or more of the various inventive aspects. The disclosed method, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, inventive aspects may lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following this detailed description are hereby expressly incorporated into it, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second, third, and so forth does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless otherwise specified.

Claims (10)

1. A method of pre-processing a speech recognition signal, the method comprising:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of each sentence to be recognized according to a voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the correspondence between each sentence to be recognized and a user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
2. The method according to claim 1, wherein a speech signal to be recognized is received, and a voiceprint feature of each sentence to be recognized in the speech signal to be recognized is extracted, wherein the speech signal to be recognized comprises at least one sentence to be recognized, and further comprising:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity feature and the text feature of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing them to obtain the voiceprint feature.
3. The method according to claim 1, wherein the voiceprint characteristics of each sentence to be recognized are recognized according to the voiceprint model library to obtain the initial recognition result, wherein the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, further comprising:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging their similarity, matching the current sentence to be recognized to the corresponding user identification, and storing the user identification in the voiceprint model library in an associated manner.
4. The method of claim 1, wherein performing distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence, further comprising:
and carrying out distortion degree analysis on each sentence to be recognized in the voice signal to be recognized by a total harmonic distortion (THD) analysis method to obtain the distortion degree result corresponding to each sentence.
5. The method of claim 4, wherein the THD analysis method further comprises carrying out the distortion degree analysis with the total harmonic distortion formula:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{\sqrt{\sum_{h=1}^{N} V_{h,rms}^{2}}}$$

wherein $V_{THD\_R}$ denotes the ratio of the root mean square value of all harmonic components from order 2 up to a given order N to the total root mean square value, $V_{h,rms}$ denotes the root mean square volume of the h-th order harmonic component, and h denotes the harmonic order.
6. The method of claim 1, wherein the voiceprint model library and the initial recognition result are adjusted according to the distortion degree result to obtain the target voiceprint model library and the target recognition result, further comprising:
determining whether the distortion degree result of each sentence to be recognized is within a distortion degree threshold interval;
and removing from the voiceprint model library the sentences to be recognized, together with their corresponding user identifications, whose distortion degree results are not within the distortion degree threshold interval, to obtain the target voiceprint model library and the target recognition result.
7. A speech recognition signal preprocessing apparatus, characterized in that the apparatus comprises:
a voiceprint extraction module, used for receiving a voice signal to be recognized and extracting the voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
the voiceprint registration module is used for recognizing the voiceprint characteristics of each sentence to be recognized according to the voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the correspondence between each sentence to be recognized and a user;
the distortion degree analysis module is used for carrying out distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and the adjusting module is used for adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
8. The apparatus according to claim 7, wherein the voiceprint registration module recognizes the voiceprint characteristics of each sentence to be recognized according to the voiceprint model library to obtain the initial recognition result, wherein the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the module is further used for:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging their similarity, matching the current sentence to be recognized to the corresponding user identification, and storing the user identification in the voiceprint model library in an associated manner.
9. A speech recognition signal preprocessing apparatus characterized by comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the method of any of claims 1-6.
10. A computer-readable storage medium having stored therein at least one executable instruction which, when run on a speech recognition signal pre-processing device, causes the speech recognition signal pre-processing device to perform the operations of the speech recognition signal pre-processing method according to any one of claims 1-6.
CN202010349173.XA 2020-04-28 2020-04-28 Speech recognition signal preprocessing method, device, equipment and computer storage medium Active CN113571054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010349173.XA CN113571054B (en) 2020-04-28 2020-04-28 Speech recognition signal preprocessing method, device, equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN113571054A true CN113571054A (en) 2021-10-29
CN113571054B CN113571054B (en) 2023-08-15

Family

ID=78157992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010349173.XA Active CN113571054B (en) 2020-04-28 2020-04-28 Speech recognition signal preprocessing method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113571054B (en)

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN102044247A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluation method for VoIP speech
CN102044248A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluating method for audio quality of streaming media
CN102270451A (en) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN102723081A (en) * 2012-05-30 2012-10-10 林其灿 Voice signal processing method, voice and voiceprint recognition method and device
CN103984315A (en) * 2014-05-15 2014-08-13 成都百威讯科技有限责任公司 Domestic multifunctional intelligent robot
CN104143326A (en) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and device
CN104269177A (en) * 2014-09-22 2015-01-07 联想(北京)有限公司 Voice processing method and electronic device
CN105139858A (en) * 2015-07-27 2015-12-09 联想(北京)有限公司 Information processing method and electronic equipment
CN105405439A (en) * 2015-11-04 2016-03-16 科大讯飞股份有限公司 Voice playing method and device
CN105632515A (en) * 2014-10-31 2016-06-01 科大讯飞股份有限公司 Pronunciation error detection method and device
CN105632489A (en) * 2016-01-20 2016-06-01 曾戟 Voice playing method and voice playing device
CN105679324A (en) * 2015-12-29 2016-06-15 福建星网视易信息系统有限公司 Voiceprint identification similarity scoring method and apparatus
CN106057206A (en) * 2016-06-01 2016-10-26 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method and device
CN106297772A (en) * 2016-08-24 2017-01-04 武汉大学 Detection method is attacked in the playback of voice signal distorted characteristic based on speaker introducing
CN106887229A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of method and system for lifting the Application on Voiceprint Recognition degree of accuracy
CN108320732A (en) * 2017-01-13 2018-07-24 阿里巴巴集团控股有限公司 The method and apparatus for generating target speaker's speech recognition computation model
CN108564940A (en) * 2018-03-20 2018-09-21 平安科技(深圳)有限公司 Audio recognition method, server and computer readable storage medium
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN110047474A (en) * 2019-05-06 2019-07-23 齐鲁工业大学 A kind of English phonetic pronunciation intelligent training system and training method

Also Published As

Publication number Publication date
CN113571054B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN108305615B (en) Object identification method and device, storage medium and terminal thereof
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
US11017781B2 (en) Reverberation compensation for far-field speaker recognition
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
AU2013223662B2 (en) Modified mel filter bank structure using spectral characteristics for sound analysis
CN111429935B (en) Voice caller separation method and device
Zhang et al. X-tasnet: Robust and accurate time-domain speaker extraction network
CN111312259B (en) Voiceprint recognition method, system, mobile terminal and storage medium
CN111081223B (en) Voice recognition method, device, equipment and storage medium
Ting Yuan et al. Frog sound identification system for frog species recognition
WO2022134798A1 (en) Segmentation method, apparatus and device based on natural language, and storage medium
US11611581B2 (en) Methods and devices for detecting a spoofing attack
CN110648669B (en) Multi-frequency shunt voiceprint recognition method, device and system and computer readable storage medium
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
CN113571054B (en) Speech recognition signal preprocessing method, device, equipment and computer storage medium
CN113012684B (en) Synthesized voice detection method based on voice segmentation
CN110931020B (en) Voice detection method and device
Runqiang et al. CASA based speech separation for robust speech recognition
Tahliramani et al. Performance analysis of speaker identification system with and without spoofing attack of voice conversion
CN111681671A (en) Abnormal sound identification method and device and computer storage medium
Neelima et al. Spoofing detection and countermeasure in automatic speaker verification system using dynamic features
CN110875044A (en) Speaker identification method based on word correlation score calculation
CN111816218B (en) Voice endpoint detection method, device, equipment and storage medium
CN113409763B (en) Voice correction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231219

Address after: No.19, Jiefang East Road, Hangzhou, Zhejiang Province, 310000

Patentee after: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.

Patentee after: China Mobile (Zhejiang) Innovation Research Institute Co.,Ltd.

Patentee after: CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd.

Address before: No. 19, Jiefang East Road, Hangzhou, Zhejiang Province, 310016

Patentee before: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.

Patentee before: CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd.