CN113571054A - Speech recognition signal preprocessing method, device, equipment and computer storage medium


Info

Publication number
CN113571054A
CN113571054A
Authority
CN
China
Prior art keywords
recognized
sentence
voiceprint
model library
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010349173.XA
Other languages
Chinese (zh)
Other versions
CN113571054B (en)
Inventor
陈润泽
陈航
任永华
胡瑛
王振志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Zhejiang Innovation Research Institute Co ltd
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Zhejiang Co Ltd
Priority to CN202010349173.XA
Publication of CN113571054A
Application granted
Publication of CN113571054B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention relates to the technical field of speech signal processing and discloses a speech recognition signal preprocessing method comprising the following steps: receiving a speech signal to be recognized and extracting the voiceprint features of each sentence to be recognized in the speech signal, wherein the speech signal to be recognized comprises at least one sentence to be recognized; recognizing the voiceprint features of each sentence to be recognized against a voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is built by short-time registration of every sentence to be recognized that precedes the current sentence in the speech signal; performing distortion analysis on each sentence to be recognized to obtain a distortion degree result for each sentence; and adjusting the voiceprint model library and the initial recognition result according to the distortion degree results to obtain a target voiceprint model library and a target recognition result. In this way, the embodiment of the invention improves the accuracy of speech recognition.

Description

Speech recognition signal preprocessing method, device, equipment and computer storage medium
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a method, a device and equipment for preprocessing a voice recognition signal and a computer readable storage medium.
Background
At present, in order to improve the accuracy of speech recognition, the input signal is generally screened and filtered by a microphone array. The main purpose is to remove interference sources other than the effective sound source, and the processing mainly includes the following parts:
1. sound source localization: sound sources are localized by angle and distance measurements.
2. Echo suppression and elimination: abnormal signals such as background noise, interference, reverberation and echo are suppressed.
3. Signal separation and extraction: signals are separated and extracted according to predefined rules.
However, existing microphone array technology mainly targets interference sources other than human voice in the microphone input, and is highly effective against abnormal signals such as background noise, reverberation, echo, interference and car horn sounds. It cannot process the voices of other people around the user that are introduced by the usage environment.
Therefore, a speech signal preprocessing method is needed that can eliminate the voices of other people around the user introduced by the usage environment, so as to improve the accuracy of speech recognition.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present invention provide a speech recognition signal preprocessing method, apparatus, device and computer-readable storage medium, which solve the technical problem in the prior art that speech recognition cannot eliminate the voices of other people introduced by the surrounding environment.
According to an aspect of an embodiment of the present invention, there is provided a speech recognition signal preprocessing method, including:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of the statements to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, receiving a speech signal to be recognized and extracting the voiceprint features of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized comprises at least one sentence to be recognized, further comprises:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristic and the text characteristic of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain the voiceprint characteristic.
In an optional manner, the method includes identifying a voiceprint feature of each to-be-identified sentence according to a voiceprint model library to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each to-be-identified sentence before a current to-be-identified sentence in the to-be-identified speech signal, and further includes:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the corresponding user identification for the current voice signal to be recognized, and storing the user identification in the voiceprint model library in a correlation manner.
In an optional manner, performing distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence, further includes:
and carrying out distortion degree analysis on each sentence to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion degree result corresponding to each sentence.
In an optional manner, the THD total harmonic distortion analysis method further includes:
and (3) carrying out distortion degree analysis by adopting a THD total harmonic distortion analysis formula:
Figure BDA0002471321870000031
wherein, VTHD_RRepresenting the ratio of the root mean square value of all harmonic components of a given Nth order to the total root mean square value, Vh,rmsRepresents the volume root mean square, rms represents the root mean square, and h represents the specified order.
In an optional manner, adjusting the initial recognition result according to the distortion result to obtain a target recognition result, further includes:
determining whether the distortion degree result of each sentence to be identified is within a distortion degree threshold interval;
and eliminating the sentences to be recognized and the corresponding user identifications of which the distortion degree results are not in the distortion degree threshold interval in the voiceprint model library to obtain a target voiceprint model library and a target recognition result.
According to another aspect of the embodiments of the present invention, there is also provided a speech recognition signal preprocessing apparatus including:
the voice recognition system comprises a voiceprint extraction module, a voice recognition module and a voice recognition module, wherein the voiceprint extraction module is used for receiving a voice signal to be recognized and extracting voiceprint characteristics of sentences to be recognized in the voice signal to be recognized, and the voice signal to be recognized comprises at least one sentence to be recognized;
the voiceprint registration module is used for identifying the voiceprint characteristics of the statements to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user;
the distortion degree analysis module is used for carrying out distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and the adjusting module is used for adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, the voiceprint registration module identifies the voiceprint features of the statements to be recognized according to a voiceprint model library to obtain an initial recognition result, where the voiceprint model library is obtained by performing short-time registration construction according to the statements to be recognized before the current statement to be recognized in the speech signal to be recognized, and further includes:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the corresponding user identification for the current voice signal to be recognized, and storing the user identification in the voiceprint model library in a correlation manner.
According to another aspect of embodiments of the present invention, there is provided a speech recognition signal preprocessing apparatus including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation of the voice recognition signal preprocessing method.
According to a further aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having at least one executable instruction stored therein, which when run on a speech recognition signal preprocessing apparatus/device, causes the speech recognition signal preprocessing apparatus/device to perform the operations of the above-mentioned speech recognition signal preprocessing method.
The voice recognition signal preprocessing method of the embodiment analyzes the signal distortion degree by combining the THD total harmonic distortion analysis method, finely adjusts the voiceprint model library according to the distortion degree analysis result, finely adjusts different speaker results analyzed by the voiceprint model library, and takes the finely adjusted voice recognition result as the target voice to be recognized, thereby improving the accuracy of the voice recognition result.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and the embodiments of the present invention can be implemented according to the content of the description in order to make the technical means of the embodiments of the present invention more clearly understood, and the detailed description of the present invention is provided below in order to make the foregoing and other objects, features, and advantages of the embodiments of the present invention more clearly understandable.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for preprocessing a speech recognition signal according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a speech recognition signal preprocessing apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram illustrating a speech recognition signal preprocessing apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein.
Fig. 1 is a flow chart illustrating a method for preprocessing a speech recognition signal according to an embodiment of the present invention, where the method is performed by a device for preprocessing a speech recognition signal. As shown in fig. 1, the method comprises the steps of:
s110: receiving a voice signal to be recognized, and extracting the voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized.
The speech signal to be recognized is the residual human-voice portion of the signal after processing by the microphone array. The invention aims to effectively extract the speech signal of the user of the speech recognition device, remove the voices of other people around the user introduced by the usage environment, and improve the accuracy of input speech signal recognition. Abnormal signals such as background noise, reverberation, echo, interference and car horn sounds have already been processed when the signal passed through the microphone array.
Specifically, after the speech signal to be recognized is received, it is divided into a plurality of sentences, each sentence in the signal being treated as one sentence to be recognized. The identity features and text features of each sentence to be recognized before the current sentence are extracted and fused to obtain voiceprint features. A DNN algorithm is used to extract the speaker identity features contained in the speech and the corresponding text features related to the content of the speech information. The identity features include timbre, loudness, pitch and frequency-domain features, such as short-time energy, short-time average amplitude, short-time zero-crossing rate, MFCC parameters, PLP parameters and pitch. The text feature is the text content related to the speech signal to be recognized. The identity features and text features are fused to obtain the voiceprint features.
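As a minimal illustration of this fusion step (a sketch under stated assumptions, not the patented implementation), the code below replaces the DNN front end with hand-crafted short-time energy and zero-crossing statistics and assumes `text_embedding` comes from an external text model:

```python
# Minimal voiceprint-fusion sketch (assumption: hand-crafted statistics stand
# in for the DNN identity features described in the patent).
import numpy as np

def identity_features(samples: np.ndarray, frame: int = 400) -> np.ndarray:
    """Crude identity features: frame-wise short-time energy and short-time
    zero-crossing rate, summarized by mean and standard deviation.
    Assumes len(samples) >= frame."""
    frames = samples[: len(samples) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)                         # short-time energy
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)  # zero-crossing rate
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

def fuse_voiceprint(samples: np.ndarray, text_embedding: np.ndarray) -> np.ndarray:
    """Concatenate identity features and text features into one voiceprint."""
    return np.concatenate([identity_features(samples), text_embedding])
```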
S120: recognizing the voiceprint characteristics of the statements to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user.
Specifically, in this embodiment, the steps specifically include:
and (4) storing the voiceprint characteristics and the user identification in an associated manner to construct a voiceprint model library. Specifically, the voiceprint characteristics of each statement to be recognized in the received speech signal to be recognized are respectively associated and stored with the user identifier. And (4) carrying out voiceprint self-registration by taking each sentence (every sentence) to be recognized as a unit, establishing a free-speaking voiceprint model library after associating the user identification, and storing the voiceprint characteristics of the user. The user identification is a plurality of preset identifications used for matching the voiceprint characteristics. The recognition result is that the corresponding relation between the sentence to be recognized and the user is expressed by the corresponding relation between the voiceprint feature of the sentence to be recognized and the user identification.
And comparing the current voice signal to be recognized with the voice print characteristics of the previously stored sentences to be recognized in the voice print model library, and judging the similarity so as to match the user identification for the current voice signal to be recognized.
For example, in a conference scenario, a speech signal is received. When the first sentence is received, its voiceprint features are extracted, matched with a random user identifier, such as user 1, and stored in the voiceprint model library in association with that identifier. When the second sentence is received, its voiceprint features are extracted and compared with those of the first sentence in the library; if the similarity reaches a preset similarity threshold, indicating that the second sentence was also spoken by user 1, the second sentence is matched as user 1. If the similarity does not reach the threshold, a new user identifier, such as identifier 2, is randomly matched to the second sentence. The voiceprint features of the second sentence are associated with the corresponding user identifier and stored in the voiceprint model library. When the third sentence is received, it is matched against the voiceprint features of the first and second sentences in the same way to obtain its user identifier, which is stored in the library in an associated manner. The user identifier can be stored in association with the sentence to be recognized in the form of an added field.
In this way, the current sentence to be recognized is compared with the voiceprint features of each previous sentence stored in the voiceprint model library and matched with the corresponding user identifier, so that the user identifier corresponding to each sentence to be recognized is obtained. The voiceprint features of each sentence and the corresponding user identifiers constitute the initial recognition result. That is, through the above processing, the speaker corresponding to each sentence to be recognized is obtained, making it possible to judge accurately which sentences in a speech signal produced during a conference were spoken by one user and which by another. A compact sketch of this self-registration loop is shown below.
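In the sketch, the cosine similarity measure and the 0.8 threshold are illustrative assumptions; this embodiment only requires a preset similarity threshold and does not fix a metric or value.

```python
# Sentence-by-sentence voiceprint self-registration sketch (similarity metric
# and threshold are assumed; the embodiment only requires a preset threshold).
import numpy as np

SIM_THRESHOLD = 0.8  # assumed preset similarity threshold

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def self_register(voiceprints):
    """Match each sentence's voiceprint against the growing model library;
    reuse a user id on a sufficiently similar hit, otherwise create one."""
    library, results, next_id = [], [], 1   # library: (user_id, voiceprint)
    for i, vp in enumerate(voiceprints):
        best = max(library, key=lambda e: cosine(vp, e[1]), default=None)
        if best is not None and cosine(vp, best[1]) >= SIM_THRESHOLD:
            user = best[0]                  # same speaker as a stored sentence
        else:
            user = f"user {next_id}"        # new speaker: new random identifier
            next_id += 1
        library.append((user, vp))
        results.append((i, user))           # initial recognition result
    return library, results
```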
S130: and performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized.
Distortion refers to the change in the output waveform relative to the input signal caused by interference and noise acting on the sound during conversion, amplification and transmission; the degree to which the output signal deviates from the input signal is the degree of distortion. This interference and noise includes the voices of other people around the speaker, so distortion analysis can help distinguish the speaker's speech signal from those of others.
In this embodiment, the THD total harmonic distortion formula is used to analyze the distortion of the original speech signal. The signal distortion is analyzed with each sentence to be recognized as a unit, yielding the distortion of the input signal of the speech recognition signal preprocessing device relative to the output signal.
Distortion degree analysis uses the following THD total harmonic distortion formulas.

First, the volume waveform sample values $V_{samp}$ of the selected speech signal to be recognized are used to compute the corresponding volume root mean square $V_{rms}$ with the standard root mean square equation:

$$V_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} V_{samp,i}^{2}}$$

Then, the THD parameter is calculated from the total harmonic content in the signal:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$

where $V_{THD\_R}$ represents the ratio of the root mean square value of all harmonic components up to a given order N to the total root mean square value $V_{rms}$, $V_{h,rms}$ represents the volume root mean square of the h-th harmonic, rms denotes root mean square, and h denotes the harmonic order.
After analysis and calculation according to the distortion formulas, the distortion result of the volume corresponding to each sentence to be recognized in the speech signal is obtained. In this embodiment, the distortion result can be expressed as a percentage; for example, a distortion of 5% means the sentence to be recognized is distorted by 5% relative to its original input.
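For illustration, a per-sentence THD_R computation following the formulas above might look like the sketch below; estimating the fundamental as the strongest FFT bin and summing ten harmonics are assumptions for the sketch, not part of the patent:

```python
# THD_R sketch per the formulas above (fundamental estimation via the
# strongest FFT bin is an assumption for illustration).
import numpy as np

def thd_r(samples: np.ndarray, n_harmonics: int = 10) -> float:
    """Ratio of the RMS of harmonics 2..N to the total RMS (0.05 == 5%)."""
    v_rms = np.sqrt(np.mean(samples ** 2))          # total volume RMS
    spectrum = np.abs(np.fft.rfft(samples)) / len(samples)
    f0_bin = int(np.argmax(spectrum[1:])) + 1       # assumed fundamental bin
    harm_sq = 0.0
    for h in range(2, n_harmonics + 1):
        b = f0_bin * h
        if b < len(spectrum):
            amp = 2.0 * spectrum[b]                 # one-sided amplitude
            harm_sq += (amp / np.sqrt(2)) ** 2      # RMS^2 of h-th harmonic
    return float(np.sqrt(harm_sq) / v_rms)
```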
S140: and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In this embodiment, after the initial recognition result is obtained, there is an initial correspondence between the voiceprint features of each sentence to be recognized and its user identifier, and a voiceprint model library in which the voiceprint features of all sentences are stored in association with their user identifiers. The initial recognition result does not take distortion into account, so whether each sentence is distorted must be judged from its calculated distortion result. For the same timbre, pitch and other voiceprint characteristics, sentences whose signal distortion changes greatly must be removed from the self-registration, and only sentences within a reasonable distortion range are kept as valid self-registered voiceprint features. Therefore, a distortion threshold interval is set in this embodiment. A result beyond the interval indicates that the sentence's distortion is high, so sentences whose distortion results are not within the threshold interval, together with their recognition results, are removed from the voiceprint model library, adjusting it into the target voiceprint model library; the target recognition result, with highly distorted sentences removed, is obtained at the same time. In this embodiment, the distortion threshold interval is 0-5%: if the distortion result of a sentence exceeds 5%, the sentence is rejected.
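A sketch of this adjustment step follows, using the 0-5% interval from this embodiment; the per-sentence indexing of the library and result structures is an assumption carried over from the registration sketch above:

```python
# Pruning sketch: drop sentences whose distortion falls outside the threshold
# interval, yielding the target model library and target recognition result.
THD_MIN, THD_MAX = 0.0, 0.05  # 0-5% interval from this embodiment

def prune_by_distortion(library, results, distortions):
    keep = [i for i, d in enumerate(distortions) if THD_MIN <= d <= THD_MAX]
    target_library = [library[i] for i in keep]
    target_results = [results[i] for i in keep]
    return target_library, target_results
```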
In this embodiment, after the voiceprint model library and the initial recognition result are adjusted according to the distortion results to obtain the target recognition result, the sentences to be recognized are also sorted according to characteristics of the speech recognition signal such as volume, harmonic integrity and distortion degree, and the sentences corresponding to each user identifier are selected for outputting the speech recognition result. For example, sentences with large volume, complete harmonics and small distortion are selected for output together with the corresponding user identifier. User behavior assists in judging the target recognition result: if the target recognition result is output incorrectly, the user is prompted to repeat the speech recognition step, so as to optimize the voiceprint model library, correct the output user identifier and recognition result, and further improve the accuracy of the library's recognition of the user's speech.
In this embodiment, the scenario is a conference record: the sentences to be recognized are classified and grouped according to the registered voiceprint features and distortion results, the speech of the same user identifier is grouped together, and the result is output with the corresponding user identifier as the name.
The voice recognition signal preprocessing method of the embodiment analyzes the signal distortion degree by combining the THD total harmonic distortion analysis method, finely adjusts the voiceprint model library according to the distortion degree analysis result, finely adjusts different speaker results analyzed by the voiceprint model library, and takes the finely adjusted voice recognition result as the target voice to be recognized, thereby improving the accuracy of the voice recognition result.
Fig. 2 is a schematic structural diagram of an embodiment of the speech recognition signal preprocessing apparatus according to the present invention. As shown in fig. 2, the apparatus 200 includes: a voiceprint extraction module 210, a voiceprint registration module 220, a distortion analysis module 230, and an adjustment module 240.
The voiceprint extraction module 210 is configured to receive a speech signal to be recognized, and extract a voiceprint feature of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized includes at least one sentence to be recognized.
The voiceprint registration module 220 is configured to identify the voiceprint features of the statements to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user.
The distortion analyzing module 230 is configured to perform distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence.
And the adjusting module 240 is configured to adjust the voiceprint model library and the initial recognition result according to the distortion result to obtain a target voiceprint model library and a target recognition result.
The specific working process of each module is as follows:
the voiceprint extraction module 210 receives a speech signal to be recognized, and extracts voiceprint features of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized includes at least one sentence to be recognized.
The speech signal to be recognized is the residual human-voice portion of the signal after processing by the microphone array. The invention aims to effectively extract the speech signal of the user of the speech recognition device, remove the voices of other people around the user introduced by the usage environment, and improve the accuracy of input speech signal recognition. Abnormal signals such as background noise, reverberation, echo, interference and car horn sounds have already been processed when the signal passed through the microphone array.
In this embodiment, the voiceprint extraction module 210 includes a sentence division submodule and a voiceprint fusion submodule.
The sentence division submodule is used for dividing the speech signal to be recognized into a plurality of sentences, each sentence in the signal being treated as one sentence to be recognized.
The voiceprint fusion submodule is used for extracting the identity features and text features of each sentence to be recognized before the current sentence in the speech signal and fusing them to obtain the voiceprint features.
Specifically, the voiceprint fusion submodule uses a DNN algorithm to extract the speaker identity features contained in the speech and the corresponding text features related to the content of the speech information. The identity features include timbre, loudness, pitch and frequency-domain features, such as short-time energy, short-time average amplitude, short-time zero-crossing rate, MFCC parameters, PLP parameters and pitch. The text feature is the text content related to the speech signal to be recognized. The identity features and text features are fused to obtain the voiceprint features.
The voiceprint registration module 220 identifies the voiceprint characteristics of the statements to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user.
The voiceprint registration module 220 stores the voiceprint features and the user identifier in an associated manner and constructs the voiceprint model library. Specifically, the voiceprint features of each sentence to be recognized in the received speech signal are each stored in association with a user identifier. Voiceprint self-registration is carried out with each sentence to be recognized as a unit; after the user identifier is associated, a free-speech voiceprint model library is established and the user's voiceprint features are stored. The user identifiers are a plurality of preset identifiers used to match voiceprint features. The current speech signal to be recognized is compared with the voiceprint features of previously stored sentences in the voiceprint model library and the similarity is judged, so as to match a user identifier to the current speech signal. The recognition result expresses the correspondence between sentences to be recognized and users through the correspondence between the voiceprint features of the sentences and the user identifiers.
For example, in a conference scenario, a speech signal is received. When the first sentence is received, its voiceprint features are extracted, matched with a random user identifier, such as user 1, and stored in the voiceprint model library in association with that identifier. When the second sentence is received, its voiceprint features are extracted and compared with those of the first sentence in the library; if the similarity reaches a preset similarity threshold, indicating that the second sentence was also spoken by user 1, the second sentence is matched as user 1. If the similarity does not reach the threshold, a new user identifier, such as identifier 2, is randomly matched to the second sentence. The voiceprint features of the second sentence are associated with the corresponding user identifier and stored in the voiceprint model library. When the third sentence is received, it is matched against the voiceprint features of the first and second sentences in the same way to obtain its user identifier, which is stored in the library in an associated manner. The user identifier can be stored in association with the sentence to be recognized in the form of an added field.
The voiceprint registration module 220 compares the current to-be-recognized statement with the voiceprint features of each previous to-be-recognized statement stored in the voiceprint model library through the above operations, and matches the corresponding user identifier, so as to obtain the user identifier corresponding to each to-be-recognized statement, where the voiceprint features of each to-be-recognized statement and the corresponding user identifiers are the initial recognition results. That is, through the above processing, the speaker corresponding to each sentence to be recognized can be obtained. Thus, it is possible to accurately judge which sentences are spoken by one user and which sentences are spoken by another user in the speech signal generated in the conference.
The distortion analysis module 230 performs distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence.
Distortion refers to the change in the output waveform relative to the input signal caused by interference and noise acting on the sound during conversion, amplification and transmission; the degree to which the output signal deviates from the input signal is the degree of distortion. This interference and noise includes the voices of other people around the speaker, so distortion analysis can help distinguish the speaker's speech signal from those of others.
In this embodiment, the THD total harmonic distortion formula is used to analyze the distortion of the original speech signal. The signal distortion is analyzed with each sentence to be recognized as a unit, yielding the distortion of the input signal of the speech recognition signal preprocessing device relative to the output signal.
Distortion degree analysis uses the THD total harmonic distortion formulas.

First, the volume waveform sample values $V_{samp}$ of the selected speech signal to be recognized are used to compute the corresponding volume root mean square $V_{rms}$ with the standard root mean square equation:

$$V_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} V_{samp,i}^{2}}$$

Then, the THD parameter is calculated from the total harmonic content in the signal:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$

where $V_{THD\_R}$ represents the ratio of the root mean square value of all harmonic components up to a given order N to the total root mean square value $V_{rms}$, $V_{h,rms}$ represents the volume root mean square of the h-th harmonic, rms denotes root mean square, and h denotes the harmonic order.
After analysis and calculation according to the distortion formulas, the distortion result corresponding to each sentence to be recognized in the speech signal is obtained. In this embodiment, the distortion result can be expressed as a percentage; for example, a distortion of 5% means the sentence to be recognized is distorted by 5% relative to its original input.
The adjusting module 240 adjusts the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In this embodiment, after the initial recognition result is obtained, there is an initial correspondence between the voiceprint features of each sentence to be recognized and its user identifier, and a voiceprint model library in which the voiceprint features of all sentences are stored in association with their user identifiers. The initial recognition result does not take distortion into account, so whether each sentence is distorted must be judged from its calculated distortion result. For the same timbre, pitch and other voiceprint characteristics, sentences whose signal distortion changes greatly must be removed from the self-registration, and only sentences within a reasonable distortion range are kept as valid self-registered voiceprint features. Therefore, a distortion threshold interval is set in this embodiment. A result beyond the interval indicates that the sentence's distortion is high, so sentences whose distortion results are not within the threshold interval, together with their recognition results, are removed from the voiceprint model library, adjusting it into the target voiceprint model library; the target recognition result, with highly distorted sentences removed, is obtained at the same time. In this embodiment, the distortion threshold interval is 0-5%: if the distortion result of a sentence exceeds 5%, the sentence is rejected.
In this embodiment, after the voiceprint model library and the initial recognition result are adjusted according to the distortion results to obtain the target recognition result, the sentences to be recognized are also sorted according to characteristics of the speech recognition signal such as volume, harmonic integrity and distortion degree, and the sentences corresponding to each user identifier are selected for outputting the speech recognition result. For example, sentences with large volume, complete harmonics and small distortion are selected for output together with the corresponding user identifier. User behavior assists in judging the target recognition result: if the target recognition result is output incorrectly, the user is prompted to repeat the speech recognition step, so as to optimize the voiceprint model library, correct the output user identifier and recognition result, and further improve the accuracy of the library's recognition of the user's speech.
In this embodiment, the scenario is a conference record: the sentences to be recognized are classified and grouped according to the registered voiceprint features and distortion results, the speech of the same user identifier is grouped together, and the result is output with the corresponding user identifier as the name.
The voice recognition signal preprocessing device of the embodiment analyzes the signal distortion degree by combining the THD total harmonic distortion analysis method, finely adjusts the voiceprint model library according to the distortion degree analysis result, finely adjusts different speaker results analyzed by the voiceprint model library, and takes the finely adjusted voice recognition result as the target voice to be recognized, thereby improving the accuracy of the voice recognition result.
Fig. 3 is a schematic structural diagram illustrating an embodiment of a speech recognition signal preprocessing device according to the present invention, and the embodiment of the present invention does not limit the specific implementation of the speech recognition signal preprocessing device.
As shown in fig. 3, the speech recognition signal preprocessing device may include: a processor (processor) 302, a communications interface (Communications Interface) 304, a memory (memory) 306, and a communication bus 308.
Wherein: the processor 302, the communication interface 304 and the memory 306 communicate with each other via the communication bus 308. The communication interface 304 is used for communicating with network elements of other devices, such as clients or other application servers. The processor 302 is configured to execute the program 310, and may specifically execute the relevant steps in the embodiments of the speech recognition signal preprocessing method.
In particular, program 310 may include program code comprising computer-executable instructions.
The processor 302 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The speech recognition signal preprocessing device comprises one or more processors, which can be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And a memory 306 for storing a program 310. Memory 306 may comprise high-speed RAM memory and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
Specifically, the program 310 may be invoked by the processor 302 to cause the electronic device to perform the following operations:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of the statements to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, receiving a speech signal to be recognized, and extracting a voiceprint feature of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized includes at least one sentence to be recognized, further including:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristic and the text characteristic of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain the voiceprint characteristic.
In an optional manner, the method includes identifying a voiceprint feature of each to-be-identified sentence according to a voiceprint model library to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each to-be-identified sentence before a current to-be-identified sentence in the to-be-identified speech signal, and further includes:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the corresponding user identification for the current voice signal to be recognized, and storing the user identification in the voiceprint model library in a correlation manner.
In an optional manner, performing distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence, further includes:
and carrying out distortion degree analysis on each sentence to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion degree result corresponding to each sentence.
In an optional manner, the THD total harmonic distortion analysis method further includes:
and (3) carrying out distortion degree analysis by adopting a THD total harmonic distortion analysis formula:
Figure BDA0002471321870000151
wherein, VTHD_RRepresenting the ratio of the root mean square value of all harmonic components of a given Nth order to the total root mean square value, Vh,rmsRepresents the volume root mean square, rms represents the root mean square, and h represents the specified order.
In an optional manner, adjusting the initial recognition result according to the distortion result to obtain a target recognition result, further includes:
determining whether the distortion degree result of each sentence to be identified is within a distortion degree threshold interval;
and eliminating the sentences to be recognized and the corresponding user identifications of which the distortion degree results are not in the distortion degree threshold interval in the voiceprint model library to obtain a target voiceprint model library and a target recognition result.
The voice recognition signal preprocessing device of the embodiment analyzes the signal distortion degree by combining the THD total harmonic distortion analysis method, finely adjusts the voiceprint model library according to the distortion degree analysis result, finely adjusts different speaker results analyzed by the voiceprint model library, and takes the finely adjusted voice recognition result as the target voice to be recognized, thereby improving the accuracy of the voice recognition result.
An embodiment of the present invention provides a computer-readable storage medium, where the storage medium stores at least one executable instruction, and when the executable instruction is executed on a speech recognition signal preprocessing device/apparatus, the speech recognition signal preprocessing device/apparatus executes a speech recognition signal preprocessing method in any method embodiment described above.
The executable instructions may be specifically configured to cause the speech recognition signal pre-processing device/arrangement to perform the following operations:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of the statements to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the corresponding relation between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, receiving a speech signal to be recognized, and extracting a voiceprint feature of each sentence to be recognized in the speech signal to be recognized, where the speech signal to be recognized includes at least one sentence to be recognized, further including:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristic and the text characteristic of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain the voiceprint characteristic.
In an optional manner, the method includes identifying a voiceprint feature of each to-be-identified sentence according to a voiceprint model library to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each to-be-identified sentence before a current to-be-identified sentence in the to-be-identified speech signal, and further includes:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the corresponding user identification for the current voice signal to be recognized, and storing the user identification in the voiceprint model library in a correlation manner.
In an optional manner, performing distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence, further includes:
and carrying out distortion degree analysis on each sentence to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion degree result corresponding to each sentence.
In an optional manner, the THD total harmonic distortion analysis method further includes:
and (3) carrying out distortion degree analysis by adopting a THD total harmonic distortion analysis formula:
Figure BDA0002471321870000161
wherein, VTHD_RRepresenting the ratio of the root mean square value of all harmonic components of a given Nth order to the total root mean square value, Vh,rmsRepresents the volume root mean square, rms represents the root mean square, and h represents the specified order.
In an optional manner, adjusting the initial recognition result according to the distortion result to obtain a target recognition result, further includes:
determining whether the distortion degree result of each sentence to be identified is within a distortion degree threshold interval;
and eliminating the sentences to be recognized and the corresponding user identifications of which the distortion degree results are not in the distortion degree threshold interval in the voiceprint model library to obtain a target voiceprint model library and a target recognition result.
In the embodiment, the signal distortion degree is analyzed by combining a THD total harmonic distortion analysis method, the voiceprint model library is finely adjusted according to the distortion degree analysis result, different speaker results analyzed by the voiceprint model library are finely adjusted, and the finely adjusted voice recognition result is used as the target voice to be recognized, so that the accuracy of the voice recognition result is improved.
The embodiment of the invention provides a preprocessing device based on a voice recognition signal, which is used for executing the voice recognition signal preprocessing method.
Embodiments of the present invention provide a computer program, where the computer program can be called by a processor to enable the electronic device to execute the speech recognition signal preprocessing method in any of the above method embodiments.
An embodiment of the present invention provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when run on a computer, cause the computer to perform the method for pre-processing a speech recognition signal in any of the above-mentioned method embodiments.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one or more of the various inventive aspects. The disclosed method, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, inventive aspects may lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following this detailed description are hereby expressly incorporated into it, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second, third, and so forth does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless otherwise specified.

Claims (10)

1. A method of pre-processing a speech recognition signal, the method comprising:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint characteristics of each sentence to be recognized according to a voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the correspondence between each sentence to be recognized and a user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
2. The method according to claim 1, wherein a speech signal to be recognized is received, and a voiceprint feature of each sentence to be recognized in the speech signal to be recognized is extracted, wherein the speech signal to be recognized comprises at least one sentence to be recognized, and further comprising:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity feature and the text feature of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing them to obtain the voiceprint feature.
3. The method according to claim 1, wherein the voiceprint characteristics of each sentence to be recognized are recognized according to the voiceprint model library to obtain the initial recognition result, wherein the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, further comprising:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging their similarity, matching the current sentence to be recognized to the corresponding user identification, and storing the user identification in the voiceprint model library in an associated manner.
4. The method of claim 1, wherein performing distortion analysis on each to-be-recognized sentence of the to-be-recognized speech signal to obtain a distortion result of each to-be-recognized sentence, further comprising:
and carrying out distortion degree analysis on each sentence to be recognized in the voice signal to be recognized by a total harmonic distortion (THD) analysis method to obtain the distortion degree result corresponding to each sentence.
5. The method of claim 4, wherein the THD analysis method further comprises carrying out the distortion degree analysis with the total harmonic distortion formula:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{\sqrt{\sum_{h=1}^{N} V_{h,rms}^{2}}}$$

wherein $V_{THD\_R}$ denotes the ratio of the root mean square value of all harmonic components from order 2 up to a given order N to the total root mean square value, $V_{h,rms}$ denotes the root mean square volume of the h-th order harmonic component, and h denotes the harmonic order.
6. The method of claim 1, wherein the voiceprint model library and the initial recognition result are adjusted according to the distortion degree result to obtain the target voiceprint model library and the target recognition result, further comprising:
determining whether the distortion degree result of each sentence to be recognized is within a distortion degree threshold interval;
and removing from the voiceprint model library the sentences to be recognized, together with their corresponding user identifications, whose distortion degree results are not within the distortion degree threshold interval, to obtain the target voiceprint model library and the target recognition result.
7. A speech recognition signal preprocessing apparatus, characterized in that the apparatus comprises:
a voiceprint extraction module, used for receiving a voice signal to be recognized and extracting the voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
the voiceprint registration module is used for recognizing the voiceprint characteristics of each sentence to be recognized according to the voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the recognition result is the correspondence between each sentence to be recognized and a user;
the distortion degree analysis module is used for carrying out distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and the adjusting module is used for adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
8. The apparatus according to claim 7, wherein the voiceprint registration module recognizes the voiceprint characteristics of each sentence to be recognized according to the voiceprint model library to obtain the initial recognition result, wherein the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be recognized before the current sentence to be recognized in the speech signal to be recognized, and the module is further used for:
storing the voiceprint characteristics and the corresponding user identification in an associated manner to construct a voiceprint model library;
and comparing the current sentence to be recognized with the voiceprint characteristics stored in the voiceprint model library, judging their similarity, matching the current sentence to be recognized to the corresponding user identification, and storing the user identification in the voiceprint model library in an associated manner.
9. A speech recognition signal preprocessing apparatus characterized by comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the method of any of claims 1-6.
10. A computer-readable storage medium having stored therein at least one executable instruction which, when run on a speech recognition signal pre-processing device, causes the speech recognition signal pre-processing device to perform the operations of the speech recognition signal pre-processing method according to any one of claims 1-6.
CN202010349173.XA 2020-04-28 2020-04-28 Speech recognition signal preprocessing method, device, equipment and computer storage medium Active CN113571054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010349173.XA CN113571054B (en) 2020-04-28 2020-04-28 Speech recognition signal preprocessing method, device, equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN113571054A true CN113571054A (en) 2021-10-29
CN113571054B CN113571054B (en) 2023-08-15

Family

ID=78157992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010349173.XA Active CN113571054B (en) 2020-04-28 2020-04-28 Speech recognition signal preprocessing method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113571054B (en)

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN102044247A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluation method for VoIP speech
CN102044248A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluating method for audio quality of streaming media
CN102270451A (en) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN102723081A (en) * 2012-05-30 2012-10-10 林其灿 Voice signal processing method, voice and voiceprint recognition method and device
CN103984315A (en) * 2014-05-15 2014-08-13 成都百威讯科技有限责任公司 Domestic multifunctional intelligent robot
CN104143326A (en) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and device
CN104269177A (en) * 2014-09-22 2015-01-07 联想(北京)有限公司 Voice processing method and electronic device
CN105139858A (en) * 2015-07-27 2015-12-09 联想(北京)有限公司 Information processing method and electronic equipment
CN105405439A (en) * 2015-11-04 2016-03-16 科大讯飞股份有限公司 Voice playing method and device
CN105632515A (en) * 2014-10-31 2016-06-01 科大讯飞股份有限公司 Pronunciation error detection method and device
CN105632489A (en) * 2016-01-20 2016-06-01 曾戟 Voice playing method and voice playing device
CN105679324A (en) * 2015-12-29 2016-06-15 福建星网视易信息系统有限公司 Voiceprint identification similarity scoring method and apparatus
CN106057206A (en) * 2016-06-01 2016-10-26 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method and device
CN106297772A (en) * 2016-08-24 2017-01-04 武汉大学 Detection method is attacked in the playback of voice signal distorted characteristic based on speaker introducing
CN106887229A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of method and system for lifting the Application on Voiceprint Recognition degree of accuracy
CN108320732A (en) * 2017-01-13 2018-07-24 阿里巴巴集团控股有限公司 The method and apparatus for generating target speaker's speech recognition computation model
CN108564940A (en) * 2018-03-20 2018-09-21 平安科技(深圳)有限公司 Audio recognition method, server and computer readable storage medium
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN110047474A (en) * 2019-05-06 2019-07-23 齐鲁工业大学 A kind of English phonetic pronunciation intelligent training system and training method

Also Published As

Publication number Publication date
CN113571054B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN108305615B (en) Object identification method and device, storage medium and terminal thereof
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
US11017781B2 (en) Reverberation compensation for far-field speaker recognition
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
AU2013223662B2 (en) Modified mel filter bank structure using spectral characteristics for sound analysis
CN111429935B (en) Voice caller separation method and device
Zhang et al. X-tasnet: Robust and accurate time-domain speaker extraction network
CN111312259B (en) Voiceprint recognition method, system, mobile terminal and storage medium
CN111081223B (en) Voice recognition method, device, equipment and storage medium
Ting Yuan et al. Frog sound identification system for frog species recognition
WO2022134798A1 (en) Segmentation method, apparatus and device based on natural language, and storage medium
US11611581B2 (en) Methods and devices for detecting a spoofing attack
CN110648669B (en) Multi-frequency shunt voiceprint recognition method, device and system and computer readable storage medium
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
CN113571054B (en) Speech recognition signal preprocessing method, device, equipment and computer storage medium
CN113012684B (en) Synthesized voice detection method based on voice segmentation
CN110931020B (en) Voice detection method and device
Runqiang et al. CASA based speech separation for robust speech recognition
Tahliramani et al. Performance analysis of speaker identification system with and without spoofing attack of voice conversion
CN111681671A (en) Abnormal sound identification method and device and computer storage medium
Neelima et al. Spoofing detection and countermeasure in automatic speaker verification system using dynamic features
CN110875044A (en) Speaker identification method based on word correlation score calculation
CN111816218B (en) Voice endpoint detection method, device, equipment and storage medium
CN113409763B (en) Voice correction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231219

Address after: No.19, Jiefang East Road, Hangzhou, Zhejiang Province, 310000

Patentee after: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.

Patentee after: China Mobile (Zhejiang) Innovation Research Institute Co.,Ltd.

Patentee after: CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd.

Address before: No. 19, Jiefang East Road, Hangzhou, Zhejiang Province, 310016

Patentee before: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.

Patentee before: CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd.