CN113571054B - Speech recognition signal preprocessing method, device, equipment and computer storage medium

Publication number
CN113571054B
Authority
CN
China
Prior art keywords
identified
sentence
voiceprint
recognized
model library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010349173.XA
Other languages
Chinese (zh)
Other versions
CN113571054A (en)
Inventor
陈润泽
陈航
任永华
胡瑛
王振志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Zhejiang Innovation Research Institute Co ltd
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Zhejiang Co Ltd
Priority to CN202010349173.XA
Publication of CN113571054A
Application granted
Publication of CN113571054B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the invention relate to the technical field of voice signal processing and disclose a voice recognition signal preprocessing method, comprising: receiving a voice signal to be recognized and extracting voiceprint features of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized; recognizing the voiceprint features of each sentence to be recognized according to a voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is constructed by short-time registration of each sentence to be recognized that precedes the current sentence to be recognized in the voice signal to be recognized; performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized; and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result. In this way, the embodiments of the invention improve the accuracy of speech recognition.

Description

Speech recognition signal preprocessing method, device, equipment and computer storage medium
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a voice recognition signal preprocessing method, a device, equipment and a computer readable storage medium.
Background
At present, to improve the accuracy of speech recognition, the input signal is generally screened and filtered by a microphone array, whose main purpose is to remove interference sources other than the effective sound source. This processing mainly comprises the following parts:
1. Sound source localization: the sound source is located by angle and distance measurements.
2. Echo suppression and cancellation: abnormal signals such as background noise, interference, reverberation, and echo are suppressed.
3. Signal separation and extraction: signals are separated and extracted according to rules.
However, existing microphone array technology targets the sound signals input through the microphones and is effective mainly at removing interference sources other than human voice, such as background noise, reverberation, echo, interference, and automobile horns. It cannot process the voice signals of other people around the user that are introduced by the use environment.
Therefore, there is a need for a voice signal preprocessing method that can eliminate the voice signals of other people around the user introduced by the use environment, so as to improve the accuracy of speech recognition.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a voice recognition signal preprocessing method, apparatus, and equipment, and a computer-readable storage medium, which solve the technical problem that prior-art speech recognition signal processing cannot eliminate the voices of other people introduced by the surrounding environment.
According to an aspect of an embodiment of the present invention, there is provided a voice recognition signal preprocessing method, including:
receiving a voice signal to be recognized, and extracting voiceprint features of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
recognizing the voiceprint features of each sentence to be recognized according to a voiceprint model library to obtain an initial recognition result, wherein the voiceprint model library is constructed by short-time registration of each sentence to be recognized that precedes the current sentence to be recognized in the voice signal to be recognized, and the recognition result is the correspondence between the sentence to be recognized and the user;
performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, receiving a voice signal to be recognized and extracting voiceprint features of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized, further comprises:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity features and text features of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing them to obtain the voiceprint features.
In an optional manner, identifying the voiceprint features of each sentence to be identified according to a voiceprint model library to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be identified before a current sentence to be identified in the speech signal to be identified, and further includes:
storing the voiceprint features and the corresponding user identifications in an associated mode, and constructing a voiceprint model library;
and comparing the current sentence to be identified with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the current voice signal to be identified with the corresponding user identification, and storing the voice signal to be identified in the voiceprint model library in an associated manner.
In an optional manner, performing distortion analysis on each sentence to be identified of the speech signal to be identified to obtain a distortion result of each sentence to be identified, and further including:
performing distortion analysis on each sentence to be recognized in the voice signal to be recognized by using the THD (total harmonic distortion) analysis method to obtain a distortion degree result corresponding to each sentence.
In an optional manner, the THD total harmonic distortion analysis method further includes:
analyzing the distortion degree using the THD total harmonic distortion analysis formula:
$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$
where $V_{THD\_R}$ is the ratio of the root mean square value of all harmonic components up to the designated order N to the total root mean square value $V_{rms}$, $V_{h,rms}$ is the root mean square of the volume of the h-th harmonic component, rms denotes root mean square, and h is the harmonic order.
In an optional manner, the initial recognition result is adjusted according to the distortion degree result to obtain a target recognition result, which further includes:
determining whether the distortion degree result of each sentence to be identified is within a distortion degree threshold value interval;
and eliminating sentences to be identified and corresponding user identifications, the distortion degree results of which are not in the distortion degree threshold value interval, from the voiceprint model library to obtain a target voiceprint model library and target identification results.
According to another aspect of the embodiment of the present invention, there is also provided a voice recognition signal preprocessing apparatus, including:
the voiceprint extraction module is used for receiving a voice signal to be recognized and extracting voiceprint features of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
the voiceprint registration module is used for identifying voiceprint characteristics of each statement to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is constructed by short-time registration according to each sentence to be identified before a current sentence to be identified in the voice signal to be identified, and the identification result is the corresponding relation between the sentence to be identified and the user;
the distortion analysis module is used for carrying out distortion analysis on each sentence to be identified of the voice signal to be identified to obtain a distortion result of each sentence to be identified;
and the adjusting module is used for adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, the voiceprint registration module identifies voiceprint features of each sentence to be identified according to a voiceprint model library, so as to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be identified before a current sentence to be identified in the speech signal to be identified, and further includes:
storing the voiceprint features and the corresponding user identifications in an associated mode, and constructing a voiceprint model library;
and comparing the current sentence to be identified with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the current voice signal to be identified with the corresponding user identification, and storing the voice signal to be identified in the voiceprint model library in an associated manner.
According to another aspect of an embodiment of the present invention, there is provided voice recognition signal preprocessing equipment, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation of the voice recognition signal preprocessing method.
According to yet another aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored therein at least one executable instruction that, when run on a speech recognition signal preprocessing apparatus/device, causes the speech recognition signal preprocessing apparatus/device to perform the operations of the above-described speech recognition signal preprocessing method.
According to the voice recognition signal preprocessing method of the embodiments of the present invention, signal distortion analysis is performed using the THD total harmonic distortion analysis method, the voiceprint model library is fine-tuned according to the distortion analysis result, the per-speaker results derived from the voiceprint model library are fine-tuned accordingly, and the fine-tuned result is taken as the target voice to be recognized, thereby improving the accuracy of the speech recognition result.
The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the specification, specific embodiments of the present invention are set forth below.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
Fig. 1 is a schematic flow chart of a voice recognition signal preprocessing method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a voice recognition signal preprocessing apparatus according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of voice recognition signal preprocessing equipment according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein.
Fig. 1 is a schematic flow chart of a voice recognition signal preprocessing method according to an embodiment of the present invention, where the method is performed by a voice recognition signal preprocessing device. As shown in fig. 1, the method comprises the steps of:
S110: receiving a voice signal to be recognized, and extracting voiceprint features of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized.
The voice signal to be recognized is the human-voice portion that remains after processing by the microphone array. The invention aims to effectively extract the voice signal of the user who is using the speech recognition device, remove the voice signals of other people around the user introduced by the use environment, and improve the recognition accuracy of the input voice signal. Abnormal signals such as background noise, reverberation, echo, interference, and car horns have already been handled by the microphone array.
Specifically, after the voice signal to be recognized is received, it is divided into a plurality of sentences, and each spoken sentence in the voice signal is taken as one sentence to be recognized. The identity features and text features of each sentence to be recognized before the current sentence to be recognized in the voice signal are extracted and fused to obtain the voiceprint features. A DNN algorithm is used to separately extract the speaker identity features contained in the speech and the corresponding text features related to the content of the speech. The identity features include time-domain and frequency-domain features reflecting the tone quality, loudness, and tone of the sound, such as short-time energy, short-time average amplitude, short-time zero-crossing rate, MFCC parameters, PLP parameters, and pitch. The text features are the text content related to the content of the voice signal to be recognized. The identity features and the text features are fused to obtain the voiceprint features.
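As a rough illustration of the time-domain identity features named above, the following Python sketch computes short-time energy, short-time average amplitude, and short-time zero-crossing rate per frame, and then concatenates an identity vector with a text vector. The embodiment extracts features with a DNN and does not specify the fusion operation, so the hand-computed features, the function names, the frame sizes, and fusion by concatenation are all assumptions made for illustration.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(x * x for x in frame)


def short_time_avg_amplitude(frame):
    """Mean absolute sample value in one frame."""
    return sum(abs(x) for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def identity_features(samples, frame_len=400, hop=160):
    """Average the three per-frame features over all frames of one sentence."""
    frames = frame_signal(samples, frame_len, hop)
    if not frames:
        raise ValueError("sentence is shorter than one frame")
    n = len(frames)
    return [
        sum(short_time_energy(f) for f in frames) / n,
        sum(short_time_avg_amplitude(f) for f in frames) / n,
        sum(zero_crossing_rate(f) for f in frames) / n,
    ]


def fuse(identity_vec, text_vec):
    """Assumed fusion: plain concatenation of the two feature vectors."""
    return list(identity_vec) + list(text_vec)
```

The default frame length of 400 samples and hop of 160 samples correspond to 25 ms and 10 ms at a 16 kHz sampling rate, a common choice that the embodiment itself does not state.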
S120: recognizing the voiceprint features of each sentence to be recognized according to the voiceprint model library to obtain an initial recognition result; the voiceprint model library is constructed by short-time registration of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and the recognition result is the correspondence between the sentence to be recognized and the user.
In this embodiment, the step specifically includes:
Storing the voiceprint features in association with user identifiers to construct the voiceprint model library. Specifically, the voiceprint features of each sentence to be recognized in the received voice signal are stored in association with a user identifier. Voiceprint self-registration is performed with each sentence to be recognized as a unit; after the user identifier is associated, a free-speech voiceprint model library is established and the user's voiceprint features are stored. The user identifiers are a plurality of preset identifiers used for matching voiceprint features. In the recognition result, the correspondence between a sentence to be recognized and a user is represented by the correspondence between the voiceprint features of that sentence and the user identifier.
Comparing the current sentence to be recognized with the voiceprint features of each sentence to be recognized stored in the voiceprint model library and judging the similarity, so as to match a user identifier to the current sentence to be recognized.
For example, in a conference scenario, a voice signal is received. When the first sentence is received, its voiceprint features are extracted and matched with a random user identifier, such as user 1, and the association is stored in the voiceprint model library. When the second sentence is received, its voiceprint features are extracted and compared with the voiceprint features of the first sentence in the voiceprint model library. If the similarity reaches a preset similarity threshold, the second sentence is matched to user 1, meaning that the second sentence was also spoken by user 1. If the similarity does not reach the similarity threshold, a user identifier, such as user 2, is randomly matched to the second sentence. The voiceprint features of the second sentence are then associated with the corresponding user identifier and stored in the voiceprint model library. When the third sentence is received, its voiceprint features are matched against those of the first and second sentences in the same way to obtain the user identifier of the third sentence, which is stored in the voiceprint model library in association with it. The user identifier can be stored in association with the sentence to be recognized by adding a field.
In this way, the voiceprint features of the current sentence to be recognized are compared with those of the previous sentences stored in the voiceprint model library and matched to a user identifier, so that the user identifier corresponding to each sentence to be recognized is obtained. The voiceprint features of the sentences to be recognized together with their corresponding user identifiers form the initial recognition result. That is, through the above processing, the speaker corresponding to each sentence to be recognized can be obtained, so that it can be accurately determined which sentences in the conference speech were spoken by one user and which by another.
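The self-registration loop in the conference example can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity stands in for the unspecified similarity measure, the threshold value of 0.8 is arbitrary, identifiers are issued sequentially rather than randomly, and the class name `VoiceprintLibrary` and its `register` method are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class VoiceprintLibrary:
    """Short-time voiceprint model library built sentence by sentence."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold   # assumed similarity threshold
        self.entries = []            # (sentence_index, feature, user_id)
        self._next_user = 1

    def register(self, feature):
        """Match a sentence to the most similar stored speaker, or open a new one."""
        best_user, best_sim = None, -1.0
        for _, stored, user_id in self.entries:
            sim = cosine_similarity(feature, stored)
            if sim > best_sim:
                best_user, best_sim = user_id, sim
        if best_user is None or best_sim < self.threshold:
            # No stored sentence is similar enough: assign a fresh identifier.
            best_user = f"user{self._next_user}"
            self._next_user += 1
        self.entries.append((len(self.entries), list(feature), best_user))
        return best_user
```

Calling `register` once per sentence, in spoken order, yields the initial recognition result as the sequence of returned user identifiers, while `entries` plays the role of the voiceprint model library.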
S130: performing distortion analysis on each sentence to be recognized of the voice signal to be recognized to obtain a distortion degree result of each sentence to be recognized.
Distortion is the change in the waveform of the output signal relative to the input signal caused by interference and noise during conversion, amplification, and transmission of the sound. The degree to which the output signal deviates from the input signal is the distortion degree. This interference and noise includes the voices of other people around the speaker, so distortion analysis can help distinguish the speaker's voice signal from those of other people.
In this embodiment, the distortion degree of the original voice signal is analyzed using the THD total harmonic distortion analysis formula. Signal distortion analysis is performed with each sentence to be recognized as a unit, to obtain the distortion degree of the input signal relative to the output signal of the voice recognition signal preprocessing equipment.
The distortion analysis uses the following THD total harmonic distortion formulas.
First, the values $V_{samp,i}$ of the waveform sampling points of the volume of the selected voice signal to be recognized are inserted into the standard root mean square equation to obtain the corresponding volume root mean square $V_{rms}$:
$$V_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} V_{samp,i}^{2}}$$
where n is the number of sampling points.
Then, the THD parameter is calculated as the total harmonic content in the signal:
$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$
where $V_{THD\_R}$ is the ratio of the root mean square value of all harmonic components up to the designated order N to the total root mean square value $V_{rms}$, $V_{h,rms}$ is the root mean square of the volume of the h-th harmonic component, and rms denotes root mean square.
After analysis and calculation according to the distortion formula, the distortion degree result of the volume corresponding to each sentence to be recognized in the voice signal is obtained. In this embodiment, the distortion degree result may be expressed as a percentage; for example, a distortion degree of 5% indicates that the sentence to be recognized is distorted by 5% relative to its original input.
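The two-step calculation (total root mean square, then harmonic content relative to it) can be sketched in Python as below. The sketch assumes that the fundamental frequency `f0` of the sentence is already known and that the signal spans a whole number of fundamental periods; the embodiment does not say how the harmonic components are measured, so the single-frequency DFT projection in `harmonic_rms` and all function names are illustrative assumptions.

```python
import math


def rms(samples):
    """Total root mean square of the sampled values."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def harmonic_rms(samples, sample_rate, f0, h):
    """RMS of the h-th harmonic of f0, via a single-frequency DFT projection."""
    n = len(samples)
    w = 2.0 * math.pi * h * f0 / sample_rate
    re = sum(x * math.cos(w * i) for i, x in enumerate(samples))
    im = sum(x * math.sin(w * i) for i, x in enumerate(samples))
    peak = 2.0 * math.hypot(re, im) / n   # peak amplitude of that sinusoid
    return peak / math.sqrt(2.0)


def thd_r(samples, sample_rate, f0, max_order):
    """Harmonic RMS (orders 2..max_order) divided by the total RMS."""
    total = rms(samples)
    if total == 0:
        return 0.0
    harmonics = math.sqrt(sum(harmonic_rms(samples, sample_rate, f0, h) ** 2
                              for h in range(2, max_order + 1)))
    return harmonics / total


# Synthetic check: a 100 Hz tone plus a second harmonic at one tenth the amplitude.
rate, f0 = 8000, 100.0
signal = [math.sin(2 * math.pi * f0 * i / rate)
          + 0.1 * math.sin(2 * math.pi * 2 * f0 * i / rate)
          for i in range(800)]
distortion = thd_r(signal, rate, f0, max_order=5)
print(f"{distortion:.2%}")
```

Multiplying the returned ratio by 100 gives the percentage form used in this embodiment.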
S140: adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In this embodiment, once the initial recognition result is obtained, the initial correspondence between the voiceprint features of each sentence to be recognized and its user identifier is available, together with a voiceprint model library in which the voiceprint features of all sentences to be recognized are associated with their user identifiers. The initial recognition result does not take distortion into account, so whether each sentence to be recognized is distorted must be judged from its calculated distortion degree result. For sentences whose tone and other voiceprint features are the same, the voiceprint features of sentences with a large change in signal distortion are removed from the self-registered sentences, and only sentences within a reasonable distortion range are retained as valid self-registered voiceprint features. A distortion degree threshold interval is therefore set in this embodiment. A distortion degree result outside the threshold interval indicates that the sentence to be recognized is heavily distorted, so the sentences whose distortion degree results fall outside the interval, together with their recognition results, are removed from the voiceprint model library, and the adjusted library is the target voiceprint model library. The target recognition result is obtained at the same time, after the heavily distorted sentences are removed. In this embodiment, the distortion degree threshold interval is 0-5%: if the distortion degree result of a sentence to be recognized exceeds 5%, that sentence is rejected.
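The threshold-based adjustment can be sketched as a simple filter. The 0-5% interval comes from this embodiment; the dictionary layout of a library entry and the function name `adjust_library` are assumptions made for illustration.

```python
def adjust_library(entries, low=0.0, high=0.05):
    """Split library entries by whether their distortion lies in [low, high].

    Each entry is assumed to be a dict with the keys
    'sentence', 'user_id', and 'distortion' (a ratio, so 0.05 means 5%).
    """
    kept = [e for e in entries if low <= e["distortion"] <= high]
    rejected = [e for e in entries if not (low <= e["distortion"] <= high)]
    return kept, rejected
```

Here `kept` corresponds to the target voiceprint model library and target recognition result, and `rejected` holds the heavily distorted sentences that were removed. A sentence at exactly 5% is kept, since only results that exceed 5% are rejected.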
In this embodiment, after the voiceprint model library and the initial recognition result have been adjusted according to the distortion degree result to obtain the target recognition result, the sentences to be recognized are further ranked by features such as the volume of the speech recognition signal, the completeness of its harmonics, and its distortion degree, and the sentences to be recognized corresponding to the selected user identifier are output as the speech recognition result. For example, the sentences to be recognized with high volume, complete harmonics, and low distortion are selected and output together with the corresponding user identifier. The target recognition result is then checked with the help of user behavior: if the output is wrong, the user is prompted to repeat the speech recognition step, so that the voiceprint model library is optimized and the output user identifier and recognition result are corrected, further improving the accuracy of the voiceprint model library for recognizing the user's voice.
In this embodiment, the scenario is conference recording: the sentences to be recognized are classified and summarized according to the voiceprint features registered for them and their distortion degree results, sentences with the same user identifier are grouped together, and the corresponding user identifier is used as the name under which the result is output.
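The per-speaker summary for the conference record can be sketched as a grouping step. The input layout (pairs of user identifier and recognized sentence text, in spoken order) and the function name `summarize_by_user` are assumptions made for illustration.

```python
from collections import defaultdict


def summarize_by_user(results):
    """Group recognized sentences under their user identifier, keeping spoken order."""
    grouped = defaultdict(list)
    for user_id, sentence in results:
        grouped[user_id].append(sentence)
    return dict(grouped)
```

Each key of the returned dictionary serves as the name under which that speaker's sentences are output.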
According to the voice recognition signal preprocessing method of the embodiments of the present invention, signal distortion analysis is performed using the THD total harmonic distortion analysis method, the voiceprint model library is fine-tuned according to the distortion analysis result, the per-speaker results derived from the voiceprint model library are fine-tuned accordingly, and the fine-tuned result is taken as the target voice to be recognized, thereby improving the accuracy of the speech recognition result.
Fig. 2 is a schematic diagram showing the structure of an embodiment of the speech recognition signal preprocessing apparatus of the present invention. As shown in fig. 2, the apparatus 200 includes: voiceprint extraction module 210, voiceprint registration module 220, distortion analysis module 230, and adjustment module 240.
The voiceprint extraction module 210 is configured to receive a voice signal to be identified, and extract voiceprint features of each sentence to be identified in the voice signal to be identified, where the voice signal to be identified includes at least one sentence to be identified.
The voiceprint registration module 220 is configured to identify voiceprint features of each sentence to be identified according to a voiceprint model library, so as to obtain an initial identification result; the voiceprint model library is constructed by short-time registration according to each sentence to be identified before the current sentence to be identified in the voice signal to be identified, and the identification result is the corresponding relation between the sentence to be identified and the user.
The distortion analysis module 230 is configured to perform distortion analysis on each sentence to be identified of the speech signal to obtain a distortion result of each sentence to be identified.
The adjusting module 240 is configured to adjust the voiceprint model library and the initial recognition result according to the distortion degree result, so as to obtain a target voiceprint model library and a target recognition result.
The specific working process of each module is as follows:
the voiceprint extraction module 210 receives a voice signal to be identified, and extracts voiceprint features of each sentence to be identified in the voice signal to be identified, where the voice signal to be identified includes at least one sentence to be identified.
The voice signal to be recognized is the speech portion that remains after processing by the microphone array. The invention aims to effectively extract the voice signal of the user who is using the voice recognition device, remove the voice signals of other people around the user that are introduced by the use environment, and improve the accuracy of the input voice signal. Abnormal signals such as background noise, reverberation, echo, interference and car horns are handled as the signal passes through the microphone array.
In this embodiment, the voiceprint extraction module 210 includes a sentence dividing sub-module and a voiceprint fusion sub-module.
The sentence dividing sub-module is used for dividing the voice signal to be recognized into a plurality of sentences. Each sentence in the voice signal to be recognized is taken as one sentence to be recognized.
And the voiceprint fusion submodule is used for extracting the identity characteristics and the text characteristics of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing the identity characteristics and the text characteristics to obtain voiceprint characteristics.
Specifically, the voiceprint fusion submodule uses a DNN algorithm to extract, separately, the identity features of the speaker and the corresponding text features related to the content of the voice information. The identity features include time-domain and frequency-domain features of the sound such as timbre, loudness and pitch, for example short-time energy, short-time average amplitude, short-time zero-crossing rate, MFCC parameters, PLP parameters and pitch. The text feature is the text content related to the content of the voice signal to be recognized. The identity features and the text features are fused to obtain the voiceprint features.
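The simplest time-domain identity features named above can be illustrated with a minimal sketch. It assumes a sentence is available as a list of floating-point samples; the function names `frame_signal` and `short_time_features` are hypothetical and are not taken from the patent, which extracts its features with a DNN rather than with these hand-written formulas.

```python
from typing import Dict, List


def frame_signal(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a sample sequence into frames of frame_len samples, hop samples apart.

    A trailing partial frame is dropped.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_features(frame: List[float]) -> Dict[str, float]:
    """Compute three classic short-time features for one frame (illustrative only)."""
    # Short-time energy: sum of squared samples in the frame.
    energy = sum(x * x for x in frame)
    # Short-time average amplitude: mean absolute sample value.
    avg_amplitude = sum(abs(x) for x in frame) / len(frame)
    # Short-time zero-crossing rate: fraction of adjacent pairs that change sign.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return {"energy": energy, "avg_amplitude": avg_amplitude, "zcr": zcr}
```

In practice such per-frame values would be only part of the identity feature vector; MFCC, PLP and pitch extraction are omitted here.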
The voiceprint registration module 220 identifies voiceprint features of each statement to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is constructed by short-time registration according to each sentence to be identified before the current sentence to be identified in the voice signal to be identified, and the identification result is the corresponding relation between the sentence to be identified and the user.
The voiceprint registration module 220 stores voiceprint features in association with user identifiers to construct the voiceprint model library. Specifically, the voiceprint features of each sentence to be recognized in the received voice signal are stored in association with a user identifier. Voiceprint self-registration is performed with each sentence to be recognized as a unit; after the user identifier is associated, a free-speech voiceprint model library is established and the user's voiceprint features are stored. The user identifiers are a plurality of preset identifiers used for matching voiceprint features. The voiceprint features of the current sentence to be recognized are compared with the voiceprint features of each sentence stored in the voiceprint model library and their similarity is judged, so that a user identifier is matched to the current sentence. In the recognition result, the correspondence between a sentence to be recognized and a user is represented by the correspondence between the sentence's voiceprint features and the user identifier.
For example, in a conference scenario, a voice signal is received. When the first sentence is received, its voiceprint features are extracted, a random user identifier such as user 1 is matched to it, and the two are stored in association in the voiceprint model library. When the second sentence is received, its voiceprint features are extracted and compared with the voiceprint features of the first sentence in the voiceprint model library. If the similarity reaches a preset similarity threshold, the second sentence is matched to user 1, meaning that it was also spoken by user 1. If the similarity does not reach the threshold, a new user identifier, such as user 2, is randomly matched to the second sentence. The voiceprint features of the second sentence are then stored in the voiceprint model library in association with the matched user identifier. When the third sentence is received, its voiceprint features are compared in the same way with those of the first and second sentences to obtain its user identifier, which is again stored in association in the voiceprint model library. The user identifier can be stored in association with the sentence to be recognized by adding a field.
The voiceprint registration module 220 compares the voiceprint features of each earlier sentence stored in the voiceprint model library with the voiceprint features of the current sentence to be recognized and matches the corresponding user identifier, thereby obtaining a user identifier for every sentence to be recognized. The voiceprint features of each sentence together with the corresponding user identifier constitute the initial recognition result. In other words, this processing yields the speaker of each sentence to be recognized, so that for the voice signal produced in a conference it can be determined which sentences were spoken by one user and which by another.
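The conference example can be sketched as an incremental registration loop. This is a minimal illustration under stated assumptions: voiceprint features are plain numeric vectors, similarity is cosine similarity, and new identifiers are issued sequentially ("user 1", "user 2", ...) instead of randomly. The class name `VoiceprintLibrary`, the default threshold of 0.8 and the choice of similarity measure are all hypothetical; the patent does not specify them.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class VoiceprintLibrary:
    """Short-time self-registration: each sentence is matched, then stored."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold          # assumed similarity threshold
        self.entries: List[Dict] = []       # one entry per registered sentence
        self._next_user = 1

    def register(self, features: Sequence[float]) -> str:
        """Match the current sentence against earlier ones and store it."""
        best_user, best_sim = None, -1.0
        for entry in self.entries:
            sim = cosine_similarity(features, entry["features"])
            if sim > best_sim:
                best_user, best_sim = entry["user"], sim
        # No earlier sentence is similar enough: issue a new user identifier.
        if best_user is None or best_sim < self.threshold:
            best_user = f"user {self._next_user}"
            self._next_user += 1
        self.entries.append({"sentence": len(self.entries),
                             "features": list(features),
                             "user": best_user})
        return best_user
```

Calling `register` once per sentence, in order, reproduces the flow of the example: the first sentence always receives a new identifier, and each later sentence either inherits the identifier of its most similar predecessor or receives a new one.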
The distortion analysis module 230 performs distortion analysis on each sentence to be identified of the speech signal to be identified, so as to obtain a distortion result of each sentence to be identified.
Distortion is defined as the change in the waveform of the output signal relative to the input signal that is caused by interference and noise acting on the sound during conversion, amplification and transmission. The amount by which the output signal deviates from the input signal is the distortion degree. The interference and noise include the voices of other people around the speaker, so distortion analysis can help distinguish the speaker's voice signal from the voice signals of others.
In this embodiment, the distortion degree of the original voice signal is analyzed with the total harmonic distortion (THD) analysis formula. Signal distortion analysis is performed on each sentence to be recognized, sentence by sentence, to obtain the distortion degree of the output signal relative to the input signal of the voice recognition signal preprocessing device.
The distortion degree is analyzed with the THD total harmonic distortion analysis formula as follows.
First, the root mean square of the volume, $V_{rms}$, is obtained by inserting the values $V_{samp}$ of the sampled waveform points of the selected voice signal to be recognized into the standard root mean square equation:

$$V_{rms} = \sqrt{\frac{1}{M}\sum_{i=1}^{M} V_{samp,i}^{2}}$$

where $M$ is the number of sampling points.
Then, the THD parameter is calculated as the total harmonic content in the signal:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$

where $V_{THD\_R}$ is the ratio of the root mean square value of all harmonic components up to the designated order $N$ to the total root mean square value, $V_{h,rms}$ is the root mean square of the volume of the $h$-th harmonic, rms denotes root mean square, and $h$ is the harmonic order.
And after analysis and calculation are carried out according to the distortion degree formula, obtaining a distortion degree result corresponding to each statement to be recognized in the voice signal to be recognized. In this embodiment, the distortion result may be expressed as a percentage, for example, the distortion is 5%, which indicates that the sentence to be recognized is distorted by 5% with respect to its original input.
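A distortion result of this kind can be computed as in the following minimal sketch. It assumes that the fundamental frequency of the analyzed segment is known and that the segment spans a whole number of periods, so that each harmonic can be measured with a single-frequency correlation; the function names and these assumptions are illustrative and are not specified by the patent.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Standard root mean square of the sample values."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def harmonic_rms(samples: Sequence[float], sample_rate: float, freq: float) -> float:
    """RMS value of the sinusoidal component at freq (single-bin DFT correlation)."""
    n = len(samples)
    re = sum(x * math.cos(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(samples))
    im = sum(x * math.sin(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(samples))
    amplitude = 2 * math.hypot(re, im) / n   # peak amplitude of the component
    return amplitude / math.sqrt(2)          # peak -> RMS for a sinusoid


def thd_r(samples: Sequence[float], sample_rate: float,
          fundamental: float, max_order: int) -> float:
    """Ratio of the RMS of harmonics 2..max_order to the total RMS of the signal."""
    total = rms(samples)
    if total == 0.0:
        return 0.0
    harmonic_sq = sum(harmonic_rms(samples, sample_rate, h * fundamental) ** 2
                      for h in range(2, max_order + 1))
    return math.sqrt(harmonic_sq) / total
```

Multiplying the returned ratio by 100 gives the percentage form used in this embodiment. For real speech the fundamental would have to be estimated first, which this sketch does not attempt.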
The adjusting module 240 adjusts the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In this embodiment, the initial recognition result gives the initial correspondence between the voiceprint features of each sentence to be recognized and a user identifier, together with a voiceprint model library in which the voiceprint features of all sentences to be recognized are stored in association with their user identifiers. The initial recognition result does not take distortion into account, so whether each sentence to be recognized is distorted must be judged from its calculated distortion degree result. Among sentences with the same tone and other voiceprint characteristics, those whose signal distortion degree deviates strongly are removed from self-registration, and only sentences whose distortion degree lies within a reasonable range are kept as valid self-registered voiceprint features. A distortion degree threshold interval is therefore set in this embodiment. A distortion degree result outside the interval indicates that the sentence is highly distorted, so every such sentence and its corresponding recognition result are removed from the voiceprint model library, which yields the target voiceprint model library. The recognition results that remain after the highly distorted sentences are removed form the target recognition result. In this embodiment, the distortion degree threshold interval is 0-5%: if the distortion degree result of a sentence to be recognized exceeds 5%, that sentence is rejected.
In this embodiment, after the voiceprint model library and the initial recognition result have been adjusted according to the distortion degree result to obtain the target recognition result, the sentences to be recognized are further sorted by features such as the volume of the voice recognition signal, the integrity of its harmonics and its distortion degree, and the sentences corresponding to a user identifier are selected for voice recognition output. For example, sentences whose voice recognition signal has high volume, complete harmonics and low distortion are selected and output together with the corresponding user identifier. The target recognition result is then checked with the help of user behavior: if the output is wrong, the user is prompted to repeat the voice recognition step so that the voiceprint model library is optimized and the output user identifier and recognition result are corrected, which further improves the accuracy of the voiceprint model library for the user's voice recognition results.
In this embodiment, the scenario is conference recording. The sentences to be recognized are classified and summarized according to the registered voiceprint features and the distortion degree results: sentences with the same user identifier are gathered together, and the corresponding user identifier is used as the name under which they are output.
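The rejection step can be sketched as a simple filter. The sketch assumes that the library is a list of entries carrying a sentence index and a user identifier, and that distortion results are fractions (0.05 for 5%); the function name `adjust_library` and this data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def adjust_library(library: List[Dict], distortion: Dict[int, float],
                   low: float = 0.0, high: float = 0.05
                   ) -> Tuple[List[Dict], Dict[int, str]]:
    """Drop sentences whose distortion lies outside [low, high].

    Returns the target voiceprint model library and the target recognition
    result (sentence index -> user identifier).
    """
    target_library = [entry for entry in library
                      if low <= distortion[entry["sentence"]] <= high]
    target_result = {entry["sentence"]: entry["user"] for entry in target_library}
    return target_library, target_result
```

With the 0-5% interval of this embodiment, a sentence at exactly 5% is kept and a sentence at 8% is removed from both outputs.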
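One possible ordering of the remaining sentences is sketched below. The patent does not state how volume, harmonic integrity and distortion are combined, so the lexicographic priority used here (volume first, then harmonic integrity, then distortion) and the `harmonic_integrity` score itself are assumptions made purely for illustration.

```python
from typing import List, NamedTuple


class SentenceQuality(NamedTuple):
    sentence: int
    user: str
    volume_rms: float
    harmonic_integrity: float  # hypothetical score in [0, 1], higher is better
    distortion: float          # THD ratio, lower is better


def rank_sentences(items: List[SentenceQuality]) -> List[SentenceQuality]:
    """Sort by louder volume, then more complete harmonics, then lower distortion."""
    return sorted(items,
                  key=lambda s: (-s.volume_rms, -s.harmonic_integrity, s.distortion))
```

The first element of the returned list is the candidate that would be selected for output under this assumed priority.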
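The summarization for conference records can be sketched as a grouping by user identifier. The sketch assumes a target recognition result that maps sentence indices to user identifiers and a separate mapping from sentence indices to recognized text; both mappings and the function name `summarize_by_user` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_by_user(target_result: Dict[int, str],
                      transcripts: Dict[int, str]) -> Dict[str, List[str]]:
    """Collect the sentences of each user, in spoken order, under the user identifier."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for sentence in sorted(target_result):
        grouped[target_result[sentence]].append(transcripts[sentence])
    return dict(grouped)
```

Sentences that were rejected for high distortion never appear in the target recognition result, so they are absent from the summary as well.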
The voice recognition signal preprocessing device performs signal distortion analysis with the THD total harmonic distortion analysis method, fine-tunes the voiceprint model library according to the distortion analysis result, and fine-tunes the per-speaker results derived from the voiceprint model library. The fine-tuned recognition result is used as the target voice to be recognized, which improves the accuracy of the voice recognition result.
Fig. 3 is a schematic structural diagram of an embodiment of a speech recognition signal preprocessing device according to the present invention, and the embodiment of the present invention is not limited to the specific implementation of the speech recognition signal preprocessing device.
As shown in fig. 3, the voice recognition signal preprocessing apparatus may include: a processor (processor) 302, a communication interface (Communications Interface) 304, a memory (memory) 306, and a communication bus 308.
Wherein: the processor 302, the communication interface 304, and the memory 306 communicate with each other via the communication bus 308. The communication interface 304 is used for communicating with network elements of other devices, such as clients or other application servers. The processor 302 is configured to execute the program 310, and may specifically perform the relevant steps in the foregoing embodiments of the method for preprocessing a speech recognition signal.
In particular, program 310 may include program code comprising computer-executable instructions.
The processor 302 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the speech recognition signal preprocessing device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 306 is used for storing the program 310. The memory 306 may comprise high-speed RAM memory and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 310 may be specifically invoked by the processor 302 to cause the speech recognition signal preprocessing device to:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
identifying the voiceprint characteristics of each statement to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is constructed by short-time registration according to each sentence to be identified before a current sentence to be identified in the voice signal to be identified, and the identification result is the corresponding relation between the sentence to be identified and the user;
Performing distortion analysis on each sentence to be identified of the voice signal to be identified to obtain a distortion degree result of each sentence to be identified;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, a voice signal to be recognized is received, and voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized are extracted, wherein the voice signal to be recognized comprises at least one sentence to be recognized, and further comprises:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristics and text characteristics of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain voiceprint characteristics.
In an optional manner, identifying the voiceprint features of each sentence to be identified according to a voiceprint model library to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be identified before a current sentence to be identified in the speech signal to be identified, and further includes:
storing the voiceprint features and the corresponding user identifications in an associated mode, and constructing a voiceprint model library;
And comparing the current sentence to be identified with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the current voice signal to be identified with the corresponding user identification, and storing the voice signal to be identified in the voiceprint model library in an associated manner.
In an optional manner, performing distortion analysis on each sentence to be identified of the speech signal to be identified to obtain a distortion result of each sentence to be identified, and further including:
and carrying out distortion analysis on each statement to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion result corresponding to each statement.
In an alternative manner, the THD total harmonic distortion analysis method further includes:
and adopting a THD total harmonic distortion analysis formula to analyze the distortion degree:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$

where $V_{THD\_R}$ is the ratio of the root mean square value of all harmonic components up to the designated order $N$ to the total root mean square value $V_{rms}$, $V_{h,rms}$ is the root mean square of the volume of the $h$-th harmonic, rms denotes root mean square, and $h$ is the harmonic order.
In an optional manner, the initial recognition result is adjusted according to the distortion degree result to obtain a target recognition result, which further includes:
determining whether the distortion degree result of each sentence to be identified is within a distortion degree threshold value interval;
And eliminating sentences to be identified and corresponding user identifications, the distortion degree results of which are not in the distortion degree threshold value interval, from the voiceprint model library to obtain a target voiceprint model library and target identification results.
The voice recognition signal preprocessing equipment performs signal distortion analysis with the THD total harmonic distortion analysis method, fine-tunes the voiceprint model library according to the distortion analysis result, and fine-tunes the per-speaker results derived from the voiceprint model library. The fine-tuned recognition result is used as the target voice to be recognized, which improves the accuracy of the voice recognition result.
An embodiment of the present invention provides a computer readable storage medium storing at least one executable instruction that, when executed on a speech recognition signal preprocessing apparatus/device, causes the speech recognition signal preprocessing apparatus/device to execute the speech recognition signal preprocessing method in any of the above method embodiments.
The executable instructions may be specifically for causing a speech recognition signal preprocessing device/arrangement to:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
Identifying the voiceprint characteristics of each statement to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is constructed by short-time registration according to each sentence to be identified before a current sentence to be identified in the voice signal to be identified, and the identification result is the corresponding relation between the sentence to be identified and the user;
performing distortion analysis on each sentence to be identified of the voice signal to be identified to obtain a distortion degree result of each sentence to be identified;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
In an optional manner, a voice signal to be recognized is received, and voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized are extracted, wherein the voice signal to be recognized comprises at least one sentence to be recognized, and further comprises:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristics and text characteristics of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain voiceprint characteristics.
In an optional manner, identifying the voiceprint features of each sentence to be identified according to a voiceprint model library to obtain an initial identification result, where the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be identified before a current sentence to be identified in the speech signal to be identified, and further includes:
Storing the voiceprint features and the corresponding user identifications in an associated mode, and constructing a voiceprint model library;
and comparing the current sentence to be identified with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the current voice signal to be identified with the corresponding user identification, and storing the voice signal to be identified in the voiceprint model library in an associated manner.
In an optional manner, performing distortion analysis on each sentence to be identified of the speech signal to be identified to obtain a distortion result of each sentence to be identified, and further including:
and carrying out distortion analysis on each statement to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion result corresponding to each statement.
In an alternative manner, the THD total harmonic distortion analysis method further includes:
and adopting a THD total harmonic distortion analysis formula to analyze the distortion degree:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$

where $V_{THD\_R}$ is the ratio of the root mean square value of all harmonic components up to the designated order $N$ to the total root mean square value $V_{rms}$, $V_{h,rms}$ is the root mean square of the volume of the $h$-th harmonic, rms denotes root mean square, and $h$ is the harmonic order.
In an optional manner, the initial recognition result is adjusted according to the distortion degree result to obtain a target recognition result, which further includes:
Determining whether the distortion degree result of each sentence to be identified is within a distortion degree threshold value interval;
and eliminating sentences to be identified and corresponding user identifications, the distortion degree results of which are not in the distortion degree threshold value interval, from the voiceprint model library to obtain a target voiceprint model library and target identification results.
In this embodiment, signal distortion analysis is performed with the THD total harmonic distortion analysis method, the voiceprint model library is fine-tuned according to the distortion analysis result, and the per-speaker results derived from the voiceprint model library are fine-tuned. The fine-tuned recognition result is used as the target voice to be recognized, which improves the accuracy of the voice recognition result.
The embodiment of the invention provides a voice recognition signal preprocessing device for executing the voice recognition signal preprocessing method.
An embodiment of the present invention provides a computer program, where the computer program may be invoked by a processor to cause the electronic device to execute the method for preprocessing a speech recognition signal in any of the above method embodiments.
An embodiment of the present invention provides a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when run on a computer, cause the computer to perform the method for preprocessing a speech recognition signal in any of the method embodiments described above.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (10)

1. A method of preprocessing a speech recognition signal, the method comprising:
receiving a voice signal to be recognized, and extracting voiceprint characteristics of each sentence to be recognized in the voice signal to be recognized, wherein the voice signal to be recognized comprises at least one sentence to be recognized;
Identifying the voiceprint characteristics of each statement to be identified according to the voiceprint model library to obtain an initial identification result; the voiceprint model library is constructed by short-time registration according to each sentence to be identified before a current sentence to be identified in the voice signal to be identified, and the identification result is the corresponding relation between the sentence to be identified and the user;
performing distortion analysis on each sentence to be identified of the voice signal to be identified to obtain a distortion degree result of each sentence to be identified;
and adjusting the voiceprint model library and the initial recognition result according to the distortion degree result to obtain a target voiceprint model library and a target recognition result.
2. The method of claim 1, wherein receiving a speech signal to be recognized and extracting voiceprint features of each sentence to be recognized in the speech signal to be recognized, wherein the speech signal to be recognized includes at least one sentence to be recognized, further comprises:
dividing the voice signal to be recognized into a plurality of sentences;
and extracting the identity characteristics and text characteristics of each sentence to be recognized before the current sentence to be recognized in the voice signal to be recognized, and fusing to obtain voiceprint characteristics.
3. The method of claim 1, wherein the identifying the voiceprint feature of each sentence to be identified according to a voiceprint model library to obtain an initial identifying result, wherein the voiceprint model library is obtained by performing short-time registration construction according to each sentence to be identified before a current sentence to be identified in the speech signal to be identified, and further comprises:
storing the voiceprint features and the corresponding user identifications in an associated mode, and constructing a voiceprint model library;
and comparing the current sentence to be identified with the voiceprint characteristics stored in the voiceprint model library, judging the similarity, matching the current voice signal to be identified with the corresponding user identification, and storing the voice signal to be identified in the voiceprint model library in an associated manner.
4. The method of claim 1, wherein performing distortion analysis on each sentence to be recognized of the speech signal to be recognized, to obtain a distortion result of each sentence to be recognized, further comprises:
and carrying out distortion analysis on each statement to be recognized in the voice signal to be recognized by adopting a THD total harmonic distortion analysis method to obtain a distortion result corresponding to each statement.
5. The method of claim 4, wherein the THD total harmonic distortion analysis method further comprises:
and adopting a THD total harmonic distortion analysis formula to analyze the distortion degree:

$$V_{THD\_R} = \frac{\sqrt{\sum_{h=2}^{N} V_{h,rms}^{2}}}{V_{rms}}$$

where $V_{THD\_R}$ is the ratio of the root mean square value of all harmonic components up to the designated order $N$ to the total root mean square value $V_{rms}$, $V_{h,rms}$ is the root mean square of the volume of the $h$-th harmonic, rms denotes root mean square, and $h$ is the harmonic order.
6. The method of claim 1, wherein adjusting the initial identification result according to the distortion degree result to obtain a target identification result further comprises:
determining whether the distortion degree result of each sentence to be identified falls within a distortion degree threshold interval;
and removing from the voiceprint model library the sentences to be identified, together with their corresponding user identifications, whose distortion degree results fall outside the distortion degree threshold interval, to obtain a target voiceprint model library and a target identification result.
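The pruning step of claim 6 can be sketched as a simple interval filter. The record layout (`LibraryEntry`), the function name `prune_library`, the closed interval, and the example bounds are illustrative assumptions; the claim does not give threshold values.

```python
from dataclasses import dataclass


@dataclass
class LibraryEntry:
    sentence_id: int
    user_id: str
    distortion: float  # distortion degree result for this sentence


def prune_library(entries, low, high):
    """Split entries into those inside the closed interval [low, high]
    and those outside it; the kept list is the target library."""
    kept, removed = [], []
    for entry in entries:
        if low <= entry.distortion <= high:
            kept.append(entry)
        else:
            removed.append(entry)
    return kept, removed


if __name__ == "__main__":
    library = [
        LibraryEntry(1, "user_0", 0.02),
        LibraryEntry(2, "user_1", 0.35),  # heavily distorted sentence
        LibraryEntry(3, "user_0", 0.04),
    ]
    target, dropped = prune_library(library, low=0.0, high=0.1)
    # The target identification result maps each kept sentence to its user.
    print({e.sentence_id: e.user_id for e in target})
    print([e.sentence_id for e in dropped])
```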
7. A speech recognition signal preprocessing apparatus, the apparatus comprising:
a voiceprint extraction module, configured to receive a speech signal to be identified and extract voiceprint features of each sentence to be identified in the speech signal to be identified, wherein the speech signal to be identified comprises at least one sentence to be identified;
a voiceprint registration module, configured to identify the voiceprint features of each sentence to be identified according to a voiceprint model library to obtain an initial identification result, wherein the voiceprint model library is constructed by short-time registration from each sentence to be identified that precedes the current sentence to be identified in the speech signal to be identified, and the identification result is the correspondence between the sentences to be identified and users;
a distortion analysis module, configured to perform distortion analysis on each sentence to be identified of the speech signal to be identified to obtain a distortion degree result for each sentence to be identified;
and an adjusting module, configured to adjust the voiceprint model library and the initial identification result according to the distortion degree result to obtain a target voiceprint model library and a target identification result.
8. The apparatus of claim 7, wherein the voiceprint registration module identifying the voiceprint features of each sentence to be identified according to the voiceprint model library to obtain an initial identification result, the voiceprint model library being constructed by short-time registration from each sentence to be identified that precedes the current sentence to be identified in the speech signal to be identified, further comprises:
storing the voiceprint features in association with the corresponding user identifications to construct the voiceprint model library;
and comparing the current sentence to be identified with the voiceprint features stored in the voiceprint model library to determine their similarity, matching the current speech signal to be identified with the corresponding user identification, and storing the speech signal to be identified in association with that user identification in the voiceprint model library.
9. A speech recognition signal preprocessing apparatus, characterized by comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the method of any one of claims 1-6.
10. A computer readable storage medium, characterized in that at least one executable instruction is stored in the storage medium, which executable instruction, when run on a speech recognition signal preprocessing device, causes the speech recognition signal preprocessing device to perform the operations of the speech recognition signal preprocessing method according to any one of claims 1-6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010349173.XA CN113571054B (en) 2020-04-28 2020-04-28 Speech recognition signal preprocessing method, device, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010349173.XA CN113571054B (en) 2020-04-28 2020-04-28 Speech recognition signal preprocessing method, device, equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN113571054A CN113571054A (en) 2021-10-29
CN113571054B CN113571054B (en)

Family

ID=78157992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010349173.XA Active CN113571054B (en) 2020-04-28 2020-04-28 Speech recognition signal preprocessing method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113571054B (en)

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN102044248A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluating method for audio quality of streaming media
CN102044247A (en) * 2009-10-10 2011-05-04 北京理工大学 Objective evaluation method for VoIP speech
CN102270451A (en) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 Method and system for identifying speaker
CN102723081A (en) * 2012-05-30 2012-10-10 林其灿 Voice signal processing method, voice and voiceprint recognition method and device
CN103984315A (en) * 2014-05-15 2014-08-13 成都百威讯科技有限责任公司 Domestic multifunctional intelligent robot
CN104143326A (en) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and device
CN104269177A (en) * 2014-09-22 2015-01-07 联想(北京)有限公司 Voice processing method and electronic device
CN105139858A (en) * 2015-07-27 2015-12-09 联想(北京)有限公司 Information processing method and electronic equipment
CN105405439A (en) * 2015-11-04 2016-03-16 科大讯飞股份有限公司 Voice playing method and device
CN105632515A (en) * 2014-10-31 2016-06-01 科大讯飞股份有限公司 Pronunciation error detection method and device
CN105632489A (en) * 2016-01-20 2016-06-01 曾戟 Voice playing method and voice playing device
CN105679324A (en) * 2015-12-29 2016-06-15 福建星网视易信息系统有限公司 Voiceprint identification similarity scoring method and apparatus
CN106057206A (en) * 2016-06-01 2016-10-26 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method and device
CN106297772A (en) * 2016-08-24 2017-01-04 武汉大学 Detection method is attacked in the playback of voice signal distorted characteristic based on speaker introducing
CN106887229A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of method and system for lifting the Application on Voiceprint Recognition degree of accuracy
CN108320732A (en) * 2017-01-13 2018-07-24 阿里巴巴集团控股有限公司 The method and apparatus for generating target speaker's speech recognition computation model
CN108564940A (en) * 2018-03-20 2018-09-21 平安科技(深圳)有限公司 Audio recognition method, server and computer readable storage medium
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN110047474A (en) * 2019-05-06 2019-07-23 齐鲁工业大学 A kind of English phonetic pronunciation intelligent training system and training method

Also Published As

Publication number Publication date
CN113571054A (en) 2021-10-29

Similar Documents

Publication Publication Date Title
CN108305615B (en) Object identification method and device, storage medium and terminal thereof
Cummins et al. An image-based deep spectrum feature representation for the recognition of emotional speech
CN111429935B (en) Voice caller separation method and device
Zhang et al. X-tasnet: Robust and accurate time-domain speaker extraction network
CN111312259B (en) Voiceprint recognition method, system, mobile terminal and storage medium
CN111081223B (en) Voice recognition method, device, equipment and storage medium
Liu et al. Replay attack detection using magnitude and phase information with attention-based adaptive filters
CN112382300A (en) Voiceprint identification method, model training method, device, equipment and storage medium
US11611581B2 (en) Methods and devices for detecting a spoofing attack
Hou et al. Learning disentangled feature representations for speech enhancement via adversarial training
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
CN116052689A (en) Voiceprint recognition method
CN112002307B (en) Voice recognition method and device
Chakroun et al. Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments
CN113571054B (en) Speech recognition signal preprocessing method, device, equipment and computer storage medium
CN116229987B (en) Campus voice recognition method, device and storage medium
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
WO2021051533A1 (en) Address information-based blacklist identification method, apparatus, device, and storage medium
CN116469396A (en) Cross-domain voice fake identifying method and system based on time-frequency domain masking effect
Mon et al. Spoof Detection using Voice Contribution on LFCC features and ResNet-34
CN113012684B (en) Synthesized voice detection method based on voice segmentation
CN110931020B (en) Voice detection method and device
Runqiang et al. CASA based speech separation for robust speech recognition
Tahliramani et al. Performance Analysis of Speaker Identification System With and Without Spoofing Attack of Voice Conversion
Neelima et al. Spoofing detection and countermeasure is automatic speaker verification system using dynamic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231219

Address after: No. 19, Jiefang East Road, Hangzhou, Zhejiang Province, 310000

Patentee after: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.

Patentee after: China Mobile (Zhejiang) Innovation Research Institute Co.,Ltd.

Patentee after: CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd.

Address before: No. 19, Jiefang East Road, Hangzhou, Zhejiang Province, 310016

Patentee before: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.

Patentee before: CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd.
