CN114724544B - Voice chip, voice recognition method, device and equipment and intelligent automobile - Google Patents

Voice chip, voice recognition method, device and equipment and intelligent automobile

Info

Publication number
CN114724544B
CN114724544B (granted from application CN202210389461.7A)
Authority
CN
China
Prior art keywords
voice
recognition result
speech
input
chip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210389461.7A
Other languages
Chinese (zh)
Other versions
CN114724544A
Inventor
严小平
田超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210389461.7A priority Critical patent/CN114724544B/en
Publication of CN114724544A publication Critical patent/CN114724544A/en
Application granted granted Critical
Publication of CN114724544B publication Critical patent/CN114724544B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a voice chip, a speech recognition method, apparatus and device, and an intelligent automobile, relating to the technical field of speech processing and in particular to vehicle-mounted speech processing. The voice chip includes: an audio feature extractor configured to extract speech audio features from an input speech; a phoneme feature extractor configured to extract speech phoneme features from the input speech; and a speech recognizer configured to acquire a first recognition result matching the speech audio features and a second recognition result matching the speech phoneme features, and to use the first recognition result as an auxiliary judgment on the second recognition result to obtain the speech recognition result of the input speech. The disclosure can improve the offline wake-up and word recognition rates of the voice chip, raise the speech recognition accuracy and reduce the misrecognition rate, and can also reduce chip power consumption and cost at an equal recognition rate, accuracy and misrecognition rate.

Description

Voice chip, voice recognition method, device and equipment and intelligent automobile
Technical Field
The disclosure relates to the technical field of voice processing, in particular to the field of vehicle-mounted voice processing, and specifically relates to a voice chip, a voice recognition method, a voice recognition device, voice recognition equipment and an intelligent automobile.
Background
At present, traditional automobiles are gradually being replaced by new-energy vehicles, and vehicle-mounted intelligence, autonomous driving and related technologies are developing rapidly; intelligent cockpits and the like cannot do without intelligent voice, offline voice interaction and voice control technologies.
Existing offline voice interaction schemes are mainly implemented on voice chips running neural network model algorithms (such as deep neural networks, recurrent neural networks, long short-term memory networks and the like).
However, owing to the limitations of chip resources, power consumption and computing power, a neural network speech algorithm faces many constraints when deployed on a voice chip; in practical application scenarios its difficulty, accuracy and computation load have to be reduced, so the advantages of the neural network model speech algorithm cannot be fully exploited.
Disclosure of Invention
The disclosure provides a voice chip, a voice recognition method, a voice recognition device, voice recognition equipment and an intelligent automobile.
According to an aspect of the present disclosure, there is provided a voice chip including:
an audio feature extractor configured to extract a voice audio feature in an input voice;
a phoneme feature extractor configured to extract speech phoneme features from the input speech; and
a speech recognizer configured to: acquiring a first recognition result matched with the voice audio features and a second recognition result matched with the voice phoneme features; and performing auxiliary judgment on the second recognition result by using the first recognition result to obtain a voice recognition result of the input voice.
According to another aspect of the present disclosure, there is provided a speech recognition method, performed by a speech recognizer in a speech chip, including:
acquiring voice audio features extracted from input voice by an audio feature extractor in a voice chip, and acquiring voice phoneme features extracted from the input voice by a phoneme feature extractor in the voice chip;
acquiring a first recognition result matched with the voice audio features and a second recognition result matched with the voice phoneme features;
and performing auxiliary judgment on the second recognition result by using the first recognition result to obtain a voice recognition result of the input voice.
According to another aspect of the present disclosure, there is provided a speech recognition apparatus, implemented by a speech recognizer in a voice chip, including:
the voice feature acquisition module is configured to acquire voice audio features extracted from the input voice by the audio feature extractor in the voice chip and acquire voice phoneme features extracted from the input voice by the phoneme feature extractor in the voice chip;
a dual recognition result acquisition module configured to acquire a first recognition result matching the speech audio feature and a second recognition result matching the speech phoneme feature;
and the auxiliary judgment module is configured to perform auxiliary judgment on the second recognition result by using the first recognition result to obtain a voice recognition result of the input voice.
According to another aspect of the present disclosure, there is provided an electronic device including: an audio pickup configured to acquire an input voice; and
the voice chip according to any embodiment of the present disclosure, configured to perform offline voice recognition on the input voice.
According to another aspect of the present disclosure, there is provided a smart car including:
an audio pickup configured to acquire an input voice;
the voice chip according to any embodiment of the present disclosure, configured to perform offline voice recognition on the input voice; and
an in-vehicle head-unit system configured to perform the corresponding in-vehicle control according to the voice recognition result of the voice chip.
The technical solution of the embodiments of the present disclosure can improve the offline wake-up and word recognition rates of the voice chip, raise the speech recognition accuracy and reduce the misrecognition rate, and can also reduce the power consumption and cost of the voice chip at an equal recognition rate, accuracy and misrecognition rate.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic diagram of a speech chip provided in an embodiment of the present disclosure;
fig. 2a is a flowchart of a speech recognition method provided by an embodiment of the present disclosure;
FIG. 2b is a flow chart of another speech recognition method provided by the embodiments of the present disclosure;
FIG. 3 is a schematic diagram of a processing flow of an offline speech chip according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of signal processing for a dedicated speech accelerator provided by an embodiment of the present disclosure;
FIG. 5 is an architecture diagram of an offline voice chip provided by an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In an example, fig. 1 is a schematic diagram of a voice chip provided by an embodiment of the present disclosure, and as shown in fig. 1, the voice chip may include: an audio feature extractor 110, which may be configured to extract a speech audio feature in an input speech; a phoneme feature extractor 120, which may be configured to extract a speech phoneme feature in the input speech; and a speech recognizer 130, which may be configured to: acquiring a first recognition result matched with the voice audio features and a second recognition result matched with the voice phoneme features; and performing auxiliary judgment on the second recognition result by using the first recognition result to obtain a voice recognition result of the input voice.
The input voice can be the voice which is acquired by the voice chip and needs to be subjected to content recognition.
The audio feature extractor 110 may be a device or hardware unit capable of extracting audio features in speech. The speech audio features may be audio features extracted from the input speech by the audio feature extractor 110. Alternatively, the speech audio features may include, but are not limited to, frequency, loudness, and the like, and it is understood that the speech audio features are features in the speech dimension.
The phoneme feature extractor 120 may be a device or hardware unit capable of extracting phoneme features from speech. Phonemes are the minimal speech units obtained by dividing speech according to its natural attributes, such as vowels and consonants. The speech phoneme features may be the phoneme features extracted from the input speech by the phoneme feature extractor 120, and may be understood as features in the text dimension.
The speech recognizer 130 may be a device or a hardware unit capable of recognizing input speech based on audio features extracted from the input speech and phoneme features. The first recognition result may be a recognition result of the input speech based on the speech audio feature. The second recognition result may be a recognition result of the input speech based on the phonetic phoneme characteristics. The voice recognition result may be a recognition result of the input voice based on the first recognition result and the second recognition result.
In the embodiment of the present disclosure, the audio feature extractor 110, the phoneme feature extractor 120, and the speech recognizer 130 constitute a speech chip. The audio feature extractor 110 and the phoneme feature extractor 120 are respectively communicatively coupled to the speech recognizer 130.
Optionally, after the voice chip acquires a voice signal (for example, a voice signal picked up by a connected microphone), the acquired voice may be input to the audio feature extractor 110 and the phoneme feature extractor 120 as an input voice, and then the audio feature extractor 110 performs audio feature extraction on the input voice to obtain a voice audio feature, and the phoneme feature extractor 120 performs phoneme feature extraction on the input voice to obtain a voice phoneme feature.
Further, the audio feature extractor 110 and the phoneme feature extractor 120 send the speech audio features and the speech phoneme features, respectively, to the speech recognizer 130. The speech recognizer 130 may determine a first recognition result matching the speech audio features according to the speech audio features and preset standard speeches or texts, determine a second recognition result matching the speech phoneme features according to the speech phoneme features and the preset standard speeches or texts, and use the first recognition result as an auxiliary judgment on the second recognition result to obtain the speech recognition result of the input speech, so as to trigger the corresponding equipment to act according to the speech recognition result.
When the voice recognition result can be accurately judged only according to the second recognition result, the voice recognition result of the input voice can be obtained only according to the second recognition result.
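As an aid to understanding, the overall dataflow just described can be summarized in the following minimal Python sketch. All names here (recognize, match_audio, match_phonemes, fuse) are illustrative assumptions for exposition, not names taken from the patent.

```python
def recognize(input_speech, audio_extractor, phoneme_extractor, recognizer):
    """Top-level dataflow: both extractors process the same input speech,
    and the speech recognizer fuses their two recognition results."""
    audio_feat = audio_extractor.extract(input_speech)      # audio-dimension features
    phoneme_feat = phoneme_extractor.extract(input_speech)  # text-dimension features
    first = recognizer.match_audio(audio_feat)        # "exact_match" | "mismatch"
    second = recognizer.match_phonemes(phoneme_feat)  # "exact_match" | "fuzzy_match" | "mismatch"
    return recognizer.fuse(first, second)  # auxiliary judgment, detailed further below
```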
Optionally, the speech recognizer 130 may select a small amount of speech or text with a higher similarity to the speech audio feature of the input speech (for example, speech or text with a similarity to the speech audio feature of the input speech being greater than a certain threshold, or speech or text with a highest similarity to the speech audio feature of the input speech) from preset standard speech or text, and then compare the speech audio feature with the selected speech or text to determine whether the speech audio feature can be matched with the standard speech or text, and determine the first recognition result according to a matching condition. The speech recognizer 130 may further compare the speech audio characteristics of the input speech with preset standard speech or text one by one, so as to determine a first recognition result according to a matching condition of the speech audio characteristics and each standard speech or text.
Compared with comparing the speech audio features of the input speech with the preset standard speeches or texts one by one, comparing them only with the small screened set reduces the number of data comparisons and improves data processing efficiency, as the sketch below illustrates.
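A minimal sketch of this candidate screening step, assuming a generic similarity function and an in-memory dictionary standing in for the chip's stored library of standard features (both hypothetical):

```python
import heapq

def screen_candidates(query_feature, standard_library, similarity, k=5, threshold=None):
    """Pre-select a small candidate set before the detailed comparison.

    standard_library maps text -> stored feature; similarity returns a score.
    Supports both variants mentioned in the text: a similarity threshold,
    or the top-k most similar entries."""
    scored = [(similarity(query_feature, feat), text)
              for text, feat in standard_library.items()]
    if threshold is not None:
        return [text for score, text in scored if score > threshold]
    return [text for score, text in heapq.nlargest(k, scored)]
```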
The preset standard voice or text can be all voice samples which can be recognized by the voice chip and the text matched with the voice samples. The standard voice can be a voice sample stored in the voice chip in advance, and the text corresponding to the standard voice can be used for triggering a device connected with the voice chip in communication to perform actions.
Similarly, the speech recognizer 130 may also screen out a small amount of speech or text with a higher similarity to the speech-phoneme feature of the input speech (for example, speech or text with a similarity to the speech-phoneme feature of the input speech greater than a certain threshold, or speech or text with a highest similarity to the speech-phoneme feature of the input speech), compare the speech-phoneme feature with the screened speech or text to determine a matching degree of the speech-phoneme feature with the speech or text, and determine the second recognition result according to the matching degree. The speech recognizer 130 may further compare the speech phoneme characteristics of the input speech with all the preset standard speech or text one by one to determine a second recognition result according to the matching degree of the speech phoneme characteristics and each speech or text.
Likewise, compared with comparing the speech phoneme features of the input speech with all the preset standard speeches or texts one by one, comparing them only with the small screened set reduces the number of data comparisons and improves data processing efficiency.
It should be noted that using the first recognition result as an auxiliary judgment on the second recognition result avoids the low accuracy and high misjudgment rate that result from determining the speech recognition result of the input speech from the first recognition result alone or from the second recognition result alone; the offline wake-up and word recognition rates are thereby improved, the accuracy is raised, and the misrecognition rate is reduced.
In this embodiment it is considered that, in the prior art, when only a neural network model algorithm is used for speech recognition, the recognition result is generally a matching score between the input speech and a recognized text. If the neural network model algorithm is implemented with high precision or complexity, the matching scores tend to be polarized (approaching 0 or approaching 100), so the voice chip can directly recognize the input speech as an exact match with the recognized text, or directly determine a mismatch. However, given the limitations of the memory, power consumption and computing power of the voice chip, the neural network speech algorithm must give up a certain amount of recognition accuracy; a large share of non-polarized scores then appear in the matching results, so the voice chip obtains only fuzzy-match recognition results, and the user may be required to re-input clearer speech, or the input speech is likely to be misjudged.
Based on the above, the present disclosure creatively proposes that, while a score-valued recognition result is obtained from the speech phoneme features, a polarized recognition result (exact match or mismatch) is obtained from the speech audio features. When the voice chip can unambiguously determine from the score value that the result is an exact match or a mismatch, that recognition result is directly used as the final speech recognition result; when the score value yields a fuzzy match, the polarized result obtained from the speech audio features is used to resolve the fuzzy match into a match or a mismatch.
According to the technical solution of this embodiment of the present disclosure, the voice chip is formed by the audio feature extractor, the phoneme feature extractor and the speech recognizer. The audio feature extractor extracts the speech audio features from the input speech and the phoneme feature extractor extracts the speech phoneme features, so that the speech recognizer obtains a first recognition result matching the speech audio features and a second recognition result matching the speech phoneme features, and uses the first recognition result as an auxiliary judgment on the second recognition result to obtain the speech recognition result of the input speech. This reduces the speech misrecognition rate of the voice chip and improves the speech recognition accuracy, solves the prior-art problems that a neural network model speech algorithm is heavily constrained on a chip and its difficulty, accuracy and computation load have to be reduced, and also reduces chip power consumption and cost at an equal recognition rate, accuracy and misrecognition rate.
In an optional embodiment of the present disclosure, the audio feature extractor 110 may include: a signal preprocessing unit configured to perform signal preprocessing on an input voice; the signal transformation processing unit is configured to perform time-frequency domain transformation processing on the preprocessed input signal to obtain a time-frequency domain transformation processing signal; a parameter extraction unit configured to perform parameter extraction on the time-frequency domain transform processing signal to obtain at least one speech parameter; and the characteristic quantization unit is configured to perform characteristic quantization on each voice parameter to obtain voice audio characteristics.
The signal preprocessing unit may be a device, module or apparatus in the audio feature extractor 110 capable of preprocessing a voice signal. Optionally, signal preprocessing may include, but is not limited to, noise reduction, mixing, and clutter filtering. The signal transformation processing unit may be an apparatus, module or device that performs signal processing based on the Fourier transform. The time-frequency domain transform processed signal may be the input signal in the frequency domain. The parameter extraction unit may be a device, module or means that extracts the desired parameters from the signal. The speech parameters may be parameters of the input speech in the frequency domain, such as MFCCs (Mel-Frequency Cepstral Coefficients). The feature quantization unit may be a device, module or apparatus that quantizes the speech parameters.
In the embodiment of the present disclosure, the audio feature extractor 110 may perform signal preprocessing such as noise reduction, frequency mixing, and clutter filtering on the input voice through the signal preprocessing unit to obtain a preprocessed input signal, and then input the preprocessed input signal to the signal conversion processing unit. The signal transformation processing unit may perform time-frequency domain transformation processing on the preprocessed input signal by using a fourier transform function, that is, perform time-domain to frequency-domain transformation on the preprocessed input signal to obtain a time-frequency domain transformation processing signal, and further send the time-frequency domain transformation processing signal to the parameter extraction unit. The parameter extraction unit can extract parameters of the time-frequency domain conversion processing signals according to the parameter extraction requirements to obtain at least one voice parameter, and sends the voice parameter to the feature quantization unit, and the feature quantization unit can quantize the features of the voice parameters after receiving the voice parameters to obtain voice audio features matched with the input voice.
Through signal preprocessing, time-frequency domain transformation, parameter extraction and feature quantization, the speech audio features of the input speech are obtained in quantized form, which makes the subsequent comparison and processing of these features convenient; a sketch of this pipeline follows.
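A minimal NumPy sketch of the four-stage pipeline, assuming a one-dimensional float PCM array as input; the pre-emphasis filter and per-band log energies are illustrative stand-ins, since the patent does not specify the exact preprocessing or parameters:

```python
import numpy as np

def extract_audio_features(pcm, frame_len=400, hop=160, n_bits=8):
    """Illustrative four-stage pipeline: preprocessing -> time-frequency
    transform -> parameter extraction -> feature quantization."""
    # 1) Signal preprocessing: a pre-emphasis filter stands in here for the
    #    noise reduction / mixing / clutter filtering done in hardware.
    x = np.append(pcm[0], pcm[1:] - 0.97 * pcm[:-1])

    # 2) Time-frequency domain transform: framed short-time Fourier transform.
    frames = np.array([x[i:i + frame_len] * np.hamming(frame_len)
                       for i in range(0, len(x) - frame_len, hop)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))

    # 3) Parameter extraction: per-band log energies as a stand-in for the
    #    MFCC-style speech parameters mentioned in the text.
    params = np.log(spectrum + 1e-8)

    # 4) Feature quantization: scale to fixed-point integers, as a low-cost
    #    chip would store and compare them.
    lo, hi = params.min(), params.max()
    return np.round((params - lo) / (hi - lo + 1e-8) * (2 ** n_bits - 1)).astype(np.uint8)
```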
In an optional embodiment of the present disclosure, the phoneme feature extractor 120 is embedded with a pre-trained neural network model; the neural network model may be configured to extract phonetic phoneme features in the input speech.
Wherein the neural network model may be a neural network capable of extracting phonetic phoneme features in the speech.
In the embodiment of the present disclosure, the voice sample may be input to the neural network model, the neural network model may be trained through the voice sample, so that the neural network model can recognize the voice phoneme of the voice sample, and the pre-trained neural network model may be further configured in the phoneme feature extractor 120, so as to extract the voice phoneme feature in the input voice through the neural network model of the phoneme feature extractor 120. The speech phoneme characteristics of the input speech are extracted based on the neural network model, and the processing of the off-line speech can be realized.
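For illustration only, a toy stand-in for such an embedded model might map spectral frames to per-frame phoneme posteriors; the patent does not specify the network architecture, so every dimension and layer below is an assumption:

```python
import torch
import torch.nn as nn

class PhonemeNet(nn.Module):
    """Toy stand-in for the pre-trained model embedded in the phoneme
    feature extractor: spectral frames in, per-frame phoneme posteriors out."""

    def __init__(self, n_freq_bins=201, n_phonemes=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_freq_bins, 128),
            nn.ReLU(),
            nn.Linear(128, n_phonemes),
        )

    def forward(self, frames):  # frames: (time, n_freq_bins)
        return self.body(frames).softmax(dim=-1)  # phoneme probabilities per frame

# Usage: phoneme posteriors for 100 frames of 201-bin spectra.
model = PhonemeNet()
phoneme_features = model(torch.randn(100, 201))
```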
In an optional embodiment of the present disclosure, the voice chip may further include a voice input module; a speech input module may be configured to perform signal preprocessing on the input speech and send the preprocessed input speech to the phoneme feature extractor 120.
The voice input module may be a module in a voice chip, which preprocesses the input voice and inputs the preprocessing result to the phoneme feature extractor 120.
In the embodiment of the present disclosure, the voice chip may send the acquired input voice to the voice input module, and further perform signal preprocessing on the input voice through the voice input module, and further send the preprocessed input voice to the phoneme feature extractor 120, that is, the voice input module can share the data processing pressure of the phoneme feature extractor 120, reduce the hardware complexity of the phoneme feature extractor 120, and reduce the resource consumption of signal preprocessing.
In an optional embodiment of the present disclosure, the voice chip may further include at least one of: an external data storage that may be configured to store operating programs and data in the speech recognizer 130; an internal data store that may be configured to store a start-up program in the speech recognizer 130; the peripheral module comprises at least one form of communication interface and can be configured to establish communication connection between the voice chip and other equipment; and a voice output module configured to perform voice output on the response voice generated by the voice recognizer 130 according to the voice recognition result.
Both the external data storage and the internal data storage may be components with storage capability; they store different contents. The voice output module may be a module having a voice output function. The peripheral module may be an interface having a communication function. Optionally, the peripheral module may include SPI (Serial Peripheral Interface), UART (Universal Asynchronous Receiver/Transmitter), SDIO (Secure Digital Input and Output) interfaces and the like. The response voice may be the feedback on the input voice generated by the speech recognizer 130 according to the speech recognition result. Optionally, the response voice may be a repetition of the input voice, or an answer to the text content corresponding to the input voice. For example, assuming that the input voice is "turn on the air conditioner", the response voice may be "turning on the air conditioner" or "do you want to turn on the air conditioner?".
In the embodiment of the present disclosure, the speech recognizer 130 may wake up each component of the speech chip according to a start program stored in the internal data storage, and may further perform corresponding function operations on each component of the speech chip according to an operating program and data stored in the external data storage. The voice chip can be connected with other devices through the peripheral module in a communication mode. After the speech recognizer 130 generates the response speech according to the speech recognition result, the speech recognizer 130 may perform speech output on the response speech generated by the speech recognition result through the speech output module. The interaction between the voice chip and a user inputting voice can be realized through the external data memory, the internal data memory, the peripheral module and the voice output module, and the user experience is improved.
Alternatively, the speech output module and the speech input module may be integrated in the same audio processing hardware.
In an optional embodiment of the present disclosure, the audio feature extractor 110 may be a dedicated speech accelerator; the phoneme feature extractor 120 may be an embedded neural network processor; and the speech recognizer 130 may be at least one central processing unit (CPU) or at least one digital signal processor (DSP).
The special voice accelerator can be any existing audio feature extractor.
In the disclosed embodiment, any known dedicated speech accelerator may be used as the audio feature extractor 110, an embedded neural network processor running a neural network model capable of phoneme feature recognition may be used as the phoneme feature extractor 120, and at least one CPU or at least one DSP may be used as the speech recognizer 130. Choosing a DSP as the speech recognizer 130 additionally provides fast processing of speech signal algorithms, so the data can be processed more quickly and the speech recognition result obtained efficiently.
In an example, fig. 2a is a flowchart of a speech recognition method provided in an embodiment of the present disclosure. The embodiment is applicable to performing offline speech recognition on an input speech; the method may be performed by a speech recognizer in a voice chip, which may be implemented by at least one of software and hardware. As shown in fig. 2a, the method comprises the following operations:
step 210, obtaining a voice audio feature extracted from the input voice by the audio feature extractor in the voice chip, and obtaining a voice phoneme feature extracted from the input voice by the phoneme feature extractor in the voice chip.
In the embodiment of the present disclosure, the input speech may be acquired through a speech chip, and then the speech audio features may be extracted from the input speech through an audio feature extractor in the speech chip, and the speech phoneme features may be extracted through a phoneme feature extractor in the speech chip.
Step 220, a first recognition result matching the voice audio features and a second recognition result matching the voice phoneme features are obtained.
In the embodiment of the present disclosure, the speech recognizer may determine a first recognition result matching the speech audio feature according to the speech audio feature and a preset standard speech or text, and determine a second recognition result matching the speech phoneme feature according to the speech phoneme feature and the preset standard speech or text.
And step 230, performing auxiliary judgment on the second recognition result by using the first recognition result to obtain a voice recognition result of the input voice.
In the embodiment of the disclosure, the speech recognizer performs auxiliary judgment on the second recognition result by using the first recognition result to obtain a speech recognition result of the input speech, so as to trigger a corresponding device to act according to the speech recognition result. When the speech recognition result can be accurately discriminated only from the second recognition result, the speech recognizer can obtain the speech recognition result of the input speech only from the second recognition result.
According to the technical solution of this embodiment, the speech audio features extracted from the input speech by the audio feature extractor in the voice chip and the speech phoneme features extracted by the phoneme feature extractor are acquired; the speech recognizer obtains a first recognition result matching the speech audio features and a second recognition result matching the speech phoneme features, and uses the first recognition result as an auxiliary judgment on the second recognition result to obtain the speech recognition result of the input speech. This reduces the speech misrecognition rate, improves the speech recognition accuracy, solves the prior-art problems that a neural network model speech algorithm is heavily constrained on a chip and its difficulty, accuracy and computation load have to be reduced, and also reduces chip power consumption and cost at an equal recognition rate, accuracy and misrecognition rate.
In an optional embodiment of the present disclosure, obtaining the first recognition result matching the speech audio feature may include: inputting the voice audio features into a pre-trained classifier, and acquiring at least one target voice feature label matched with the voice audio features; acquiring a standard voice feature label matched with a target text in a voice feature quantization library; matching each target voice characteristic label with each standard voice characteristic label to obtain a first recognition result between the input voice and the target text; wherein the first recognition result may comprise an exact match or a mismatch.
The target voice feature tag may be a tag matched with a voice audio feature, and the voice audio feature and the target voice feature tag have a one-to-one correspondence relationship. The target text may be at least one text stored in the speech recognizer that has a high similarity to the speech audio features. The speech feature quantization library may be a database storing speech audio features corresponding to standard speech. The standard speech feature tags may be tags of speech audio features stored in a speech feature quantization library.
In the embodiment of the present disclosure, the speech recognizer may input the speech audio features extracted from the input speech into a pre-trained classifier, which analyzes them and determines at least one target speech feature tag matching the speech audio features, thereby determining the target text corresponding to the input speech. The standard speech feature tags matching the target text are then obtained from the speech feature quantization library, and each target speech feature tag is matched against the standard speech feature tags. If the data similarity between the target speech feature tags and the standard speech feature tags matching the target text is greater than or equal to a preset percentage, the input speech is determined to match the target text exactly, and the first recognition result between the input speech and the target text is an exact match. If the target speech feature tags do not match the standard speech feature tags matching the target text, the first recognition result is determined to be a mismatch. A deterministic matching result is thus obtained through the first recognition result; then, when the speech recognition result cannot be determined from the second recognition result alone, the first recognition result provides the auxiliary judgment, avoiding the situation where no decision could be made. A sketch of this binary decision follows.
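A minimal sketch of the first recognition result, where the "preset percentage" is assumed to be an overlap ratio over tag sets; the patent does not fix the exact similarity measure, so the 0.9 default is a hypothetical value:

```python
def first_recognition(target_tags, standard_tags, min_overlap=0.9):
    """Binary (polarized) decision from the audio-feature path.

    target_tags come from the classifier; standard_tags come from the
    speech feature quantization library for the candidate target text."""
    standard = set(standard_tags)
    if not standard:
        return "mismatch"
    overlap = len(set(target_tags) & standard) / len(standard)
    return "exact_match" if overlap >= min_overlap else "mismatch"
```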
In an optional embodiment of the present disclosure, obtaining the second recognition result matching the phonetic phoneme feature may include: inputting the voice phoneme characteristics into a pre-trained decoder, and acquiring input text matching result characteristics matched with the voice phoneme characteristics; acquiring standard text characteristics matched with a target text from a multi-word model library; calculating a similarity score between the input text matching result characteristic and the standard text characteristic; determining a second recognition result between the input voice and the target text according to the similarity score; wherein the second recognition result may include an exact match, a fuzzy match, and a mismatch.
Wherein the input text matching result feature may be a vocabulary text matching with a speech phoneme feature of the input speech. The multi-word model library may be a database that stores lexical text corresponding to standard speech. The standard text feature may be lexical text corresponding to the target text. The similarity score may be data characterizing how similar the input text match result features are to the standard text features.
In the embodiment of the present disclosure, the speech recognizer may input the speech phoneme features of the input speech into a pre-trained decoder, which decodes them to obtain the input text matching result features matching the speech phoneme features. The standard text features matching the target text are then obtained from the multi-word model library, and a similarity score between the input text matching result features and the standard text features is calculated with an existing similarity algorithm, so as to determine from the score whether the input speech and the target text are an exact match, a fuzzy match or a mismatch. The similarity score between the input text matching result features and the standard text features gives a quantitative value of their similarity, which facilitates subsequent data processing.
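The patent does not fix the similarity algorithm; as one plausible example, a normalized edit-distance score over the two text feature sequences, mapped to the 0..100 range used in the score discussion above:

```python
def similarity_score(decoded, standard):
    """Levenshtein-based similarity in [0, 100] between the decoder's input
    text matching result features and the standard text features (one
    assumed metric among many possible choices)."""
    m, n = len(decoded), len(standard)
    # d[i][j] = edit distance between decoded[:i] and standard[:j]
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (decoded[i - 1] != standard[j - 1]))
    return 100.0 * (1 - d[m][n] / max(m, n, 1))
```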
In an optional embodiment of the present disclosure, determining the second recognition result between the input speech and the target text according to the similarity score may include: if the similarity score is not smaller than the first score threshold value, determining that the second recognition result is an accurate match; determining that the second recognition result is not matched if the similarity score is not greater than the second score threshold; and if the similarity score is smaller than the first score threshold and larger than the second score threshold, determining that the second recognition result is fuzzy matching.
Wherein the first score threshold and the second score threshold may be two different values. The second score threshold is less than the first score threshold. The first score threshold may be used to determine that the second recognition result is an exact match and the second score threshold may be used to determine that the second recognition result is a no match. The fuzzy matching can be a second recognition result between the input voice and the target text, and cannot be directly used for judging the voice recognition result.
In an embodiment of the present disclosure, the speech recognizer may compare the similarity score with the first score threshold, and determine that the second recognition result is an exact match if the similarity score is greater than or equal to (i.e., not less than) the first score threshold. If the similarity score is less than or equal to (i.e., not greater than) the second score threshold, the second recognition result is determined to be a mismatch. If the similarity score lies between the two thresholds, that is, smaller than the first score threshold and larger than the second score threshold, the second recognition result is determined to be a fuzzy match, and the first recognition result is then required for the auxiliary judgment of the speech recognition result. Partitioning the similarity score range in this way maps the second recognition result into the same dimension as the first recognition result, as the sketch below shows.
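In code this three-way partition is direct; the two threshold values below are illustrative assumptions, constrained only by the patent's requirement that the second threshold be smaller than the first:

```python
def second_recognition(score, first_threshold=80.0, second_threshold=40.0):
    """Map a similarity score to the three-way second recognition result."""
    if score >= first_threshold:   # not less than the first threshold
        return "exact_match"
    if score <= second_threshold:  # not greater than the second threshold
        return "mismatch"
    return "fuzzy_match"           # between the two thresholds
```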
In an optional embodiment of the present disclosure, performing the auxiliary judgment on the second recognition result using the first recognition result to obtain the speech recognition result of the input speech includes: if the second recognition result is a fuzzy match, acquiring the first recognition result for the auxiliary judgment; if the first recognition result is an exact match, determining that the speech recognition result between the input speech and the target text is a match; and if the first recognition result is a mismatch, determining that the speech recognition result between the input speech and the target text is a mismatch.
In the embodiment of the present disclosure, if the speech recognizer determines that the second recognition result is a fuzzy match, the first recognition result is further obtained, and the first recognition result is used as the speech recognition result, so as to perform the auxiliary judgment through the first recognition result. And if the speech recognizer determines that the second recognition result is fuzzy matching and the first recognition result is precise matching, determining that the speech recognition result between the input speech and the target text is matching. And if the speech recognizer determines that the second recognition result is fuzzy matching and the first recognition result is not matching, determining that the speech recognition result between the input speech and the target text is not matching. That is, when the second recognition result is an exact match or a mismatch, the speech recognition result can be determined to be an exact match or a mismatch only according to the second recognition result, and only when the second recognition result is a fuzzy match, the auxiliary determination needs to be performed by combining the first recognition result, so that the speech recognition result can still be determined by the first recognition result when the speech recognition result cannot be determined only according to the second recognition result.
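Putting the two results together, a minimal sketch of the auxiliary judgment, using the illustrative result labels from the sketches above:

```python
def final_decision(second_result, first_result):
    """Auxiliary judgment: exact-match and mismatch second results stand
    alone; a fuzzy second result is resolved by the binary first result."""
    if second_result == "exact_match":
        return "match"
    if second_result == "mismatch":
        return "mismatch"
    # second_result == "fuzzy_match": break the tie with the first result
    return "match" if first_result == "exact_match" else "mismatch"
```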
In an example, fig. 2b is a flowchart of another speech recognition method provided by the embodiment of the present disclosure, and as shown in fig. 2b, the method includes:
step 2100, acquiring the voice audio features extracted from the input voice by the audio feature extractor in the voice chip, and acquiring the voice phoneme features extracted from the input voice by the phoneme feature extractor in the voice chip.
Step 2110, inputting the voice audio features into a pre-trained classifier, and acquiring at least one target voice feature label matched with the voice audio features.
And step 2120, acquiring a standard voice feature label matched with the target text from the voice feature quantization library.
And 2130, matching each target voice feature tag with each standard voice feature tag to obtain a first recognition result between the input voice and the target text.
Step 2140, inputting the phonetic phoneme characteristics into a pre-trained decoder, and obtaining input text matching result characteristics matched with the phonetic phoneme characteristics.
And step 2150, acquiring standard text features matched with the target text from the multi-word model library.
Step 2160, calculating a similarity score between the input text matching result feature and the standard text feature.
Step 2170, according to the similarity score, determining a second recognition result between the input speech and the target text.
Step 2180, if the second recognition result is a fuzzy match, acquiring the first recognition result for auxiliary judgment; if the first recognition result is an exact match, determining that the speech recognition result between the input speech and the target text is a match; and if the first recognition result is a mismatch, determining that the speech recognition result between the input speech and the target text is a mismatch.
According to the technical solution above, the speech audio features extracted from the input speech by the audio feature extractor in the voice chip and the speech phoneme features extracted by the phoneme feature extractor are acquired. The speech audio features are input into a pre-trained classifier to obtain at least one target speech feature tag matching them; the standard speech feature tags matching the target text are obtained from the speech feature quantization library, and each target speech feature tag is matched against each standard speech feature tag to obtain a first recognition result between the input speech and the target text. The speech phoneme features are input into a pre-trained decoder to obtain the input text matching result features matching them; the standard text features matching the target text are obtained from the multi-word model library, the similarity score between the input text matching result features and the standard text features is calculated, and a second recognition result between the input speech and the target text is determined from the similarity score. If the second recognition result is a fuzzy match, the first recognition result is acquired for auxiliary judgment: if the first recognition result is an exact match, the speech recognition result between the input speech and the target text is determined to be a match; if the first recognition result is a mismatch, the speech recognition result is determined to be a mismatch. This reduces the speech misrecognition rate, improves the speech recognition accuracy, solves the prior-art problems that a neural network model speech algorithm is heavily constrained on a chip and its difficulty, accuracy and computation load have to be reduced, and also reduces chip power consumption and cost at an equal recognition rate, accuracy and misrecognition rate.
In an example, fig. 3 is a schematic diagram of the processing flow of an offline voice chip provided in an embodiment of the present disclosure. As shown in fig. 3, the conventional speech processing flow (flow 1 for short) is implemented by the dedicated speech accelerator: speech audio features are extracted from the input speech. The speech recognizer then inputs the speech audio features into a pre-trained classifier, obtains at least one target speech feature tag matching them, obtains the standard speech feature tags matching the target text from the feature quantization library (the speech feature quantization library), and compares each target speech feature tag with each standard speech feature tag. The comparison result is binary, i.e., exact match or mismatch (for example, 1 may denote that a target speech feature tag exactly matches a standard speech feature tag, and 2 that it does not). The signal processing flow of the dedicated speech accelerator is shown in fig. 4.
The neural network processing flow (flow 2 for short) is implemented by the embedded neural network processor: speech phoneme features are extracted from the input speech. The speech recognizer then inputs the speech phoneme features into a pre-trained decoder, obtains the input text matching result features matching them, and obtains the standard text features matching the target text from the multi-word model library, so as to calculate the similarity score between the input text matching result features and the standard text features; that is, a similarity score with a relevant word in the multi-word model library is obtained and converted into a probability score for score discrimination. Unlike the binary result of flow 1, the score discrimination result between the input speech and the target text may be an exact match, a fuzzy match or a mismatch (for example, a may denote an exact match, b a fuzzy match and c a mismatch). The data in the feature quantization library and in the multi-word model library may both be obtained from the standard word library, which may be a database storing data related to the standard speeches. Flow 2 is mainly implemented on an SoC (System on Chip) architecture, with the computation completed by the NPU (neural-network processing unit).
Table 1 Match identification table for flow 1 and flow 2

Flow 2 result      Flow 1 result   Speech recognition result
a (exact match)    1 (match)       match
a (exact match)    2 (mismatch)    match
b (fuzzy match)    1 (match)       match
b (fuzzy match)    2 (mismatch)    mismatch
c (mismatch)       1 (match)       mismatch
c (mismatch)       2 (mismatch)    mismatch
In table 1, a, b and c under flow 2 respectively indicate that the input speech is an exact match, a fuzzy match or a mismatch with the target text, and 1 and 2 under flow 1 respectively indicate that the target speech feature tags match or do not match the standard speech feature tags. Offline speech recognition is based on this two-flow identification matching table: through the combined judgment, the comparison result of flow 1 (the first recognition result) assists in judging the comparison result of flow 2 (the second recognition result), which improves the offline wake-up and word recognition rates, raises the accuracy and reduces the misrecognition rate. As shown in table 1, six matching and mismatching cases arise, and each case is decided by the combined judgment, which is more accurate. The algorithm of flow 1 is simple and easy to implement in a hardware chip, reducing overall power consumption and chip cost; and because the flow 1 algorithm assists the flow 2 algorithm, the difficulty, resource occupation and overall computing requirements of the flow 2 algorithm are also reduced.
Fig. 5 is an architecture diagram of an offline voice chip according to an embodiment of the present disclosure. As shown in fig. 5, the CPU (Central Processing Unit)/DSP (Digital Signal Processor) module (the speech recognizer) is the system control center of the chip; if a DSP is selected, the offline voice chip additionally has processing capability for speech signal algorithms. The CPU/DSP module may support a software operating system such as FreeRTOS or Linux. The NPU module (the phoneme feature extractor) is used for the algorithmic implementation of the neural network. DDR (Double Data Rate) memory/PSRAM (Pseudo Static Random Access Memory) serves as the external data storage and may be used for voice input and processing data storage, as well as for program and data storage of the CPU/DSP. When the peripheral module is a high-speed UART it can be used to connect a Bluetooth module, and when it is an SDIO module it can be used to connect a WIFI (Wireless Fidelity) module, providing wireless communication expansion. ROM (Read-Only Memory)/Flash memory corresponds to the internal data storage and can store the boot program of the CPU/DSP module. The voice input/output module supports audio formats such as I2S/PDM (Pulse Density Modulation)/TDM (Time Division Multiplexing). The dedicated speech accelerator is used to extract the speech audio features. The input speech is fed to the voice input/output module and the dedicated speech accelerator, and the response speech is output through the voice input/output module. A bus/NoC (Network on Chip) unit may be used to implement the data interaction of the above modules.
The embodiment of the disclosure also provides a voice recognition device, which is executed by the voice recognizer in the voice chip.
Fig. 6 is a schematic diagram of a speech recognition apparatus provided in an embodiment of the present disclosure, as shown in fig. 6, the apparatus includes: a voice feature obtaining module 310, a dual recognition result obtaining module 320, and an auxiliary judging module 330, wherein:
a voice feature obtaining module 310, configured to obtain a voice audio feature extracted from the input voice by an audio feature extractor in the voice chip, and obtain a voice phoneme feature extracted from the input voice by a phoneme feature extractor in the voice chip;
a dual recognition result obtaining module 320 configured to obtain a first recognition result matching the speech audio feature and a second recognition result matching the speech phoneme feature;
and the auxiliary judgment module 330 is configured to perform auxiliary judgment on the second recognition result by using the first recognition result to obtain a voice recognition result of the input voice.
According to the technical solution of this embodiment of the present disclosure, the speech audio features extracted from the input speech by the audio feature extractor in the voice chip and the speech phoneme features extracted by the phoneme feature extractor are acquired; the speech recognizer then obtains a first recognition result matching the speech audio features and a second recognition result matching the speech phoneme features, and uses the first recognition result as an auxiliary judgment on the second recognition result to obtain the speech recognition result of the input speech. This reduces the speech misrecognition rate, improves the speech recognition accuracy, solves the prior-art problems that a neural network model speech algorithm is heavily constrained on a chip and its difficulty, accuracy and computation load have to be reduced, and also reduces chip power consumption and cost at an equal recognition rate, accuracy and misrecognition rate.
Optionally, the dual recognition result obtaining module 320 is specifically configured to input the voice audio features into a pre-trained classifier, and obtain at least one target voice feature tag matched with the voice audio features; acquiring a standard voice feature label matched with a target text in a voice feature quantization library; matching each target voice feature tag with each standard voice feature tag to obtain a first recognition result between the input voice and a target text; wherein the first recognition result comprises an exact match or no match.
Optionally, the dual recognition result obtaining module 320 is specifically configured to input the speech phoneme features into a pre-trained decoder, and obtain input text matching result features matching the speech phoneme features; acquiring standard text characteristics matched with a target text from a multi-word model library; calculating a similarity score between the input text matching result features and the standard text features; determining a second recognition result between the input voice and a target text according to the similarity score; wherein the second recognition result comprises an exact match, a fuzzy match and a mismatch.
Optionally, the dual recognition result obtaining module 320 is specifically configured to: determine that the second recognition result is an exact match if the similarity score is not less than a first score threshold; determine that the second recognition result is a mismatch if the similarity score is not greater than a second score threshold, the second score threshold being less than the first score threshold; and determine that the second recognition result is a fuzzy match if the similarity score is less than the first score threshold and greater than the second score threshold.
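In code form, the two-threshold decision can be sketched as below; the Jaccard-style similarity and the concrete threshold values 0.9 and 0.6 are assumptions of the sketch, the scheme only requiring that the second score threshold be less than the first.

```python
def similarity_score(decoded: set, standard: set) -> float:
    """Illustrative similarity between the input text matching result features
    and the standard text features; any score in [0, 1] would serve."""
    if not decoded or not standard:
        return 0.0
    return len(decoded & standard) / len(decoded | standard)

def second_recognition_result(score, first_threshold=0.9, second_threshold=0.6):
    assert second_threshold < first_threshold  # required by the scheme
    if score >= first_threshold:   # "not less than the first score threshold"
        return "exact_match"
    if score <= second_threshold:  # "not greater than the second score threshold"
        return "mismatch"
    return "fuzzy_match"           # strictly between the two thresholds

# 3 shared of 4 total features -> 0.75, between 0.6 and 0.9 -> fuzzy match
print(second_recognition_result(similarity_score({"a", "b", "c"},
                                                 {"a", "b", "c", "d"})))
```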
Optionally, the auxiliary judgment module 330 is specifically configured to: if the second recognition result is a fuzzy match, acquire the first recognition result for auxiliary judgment; if the first recognition result is an exact match, determine that the voice recognition result between the input voice and the target text is a match; and if the first recognition result is no match, determine that the voice recognition result between the input voice and the target text is no match.
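The auxiliary judgment then reduces to a small decision function, sketched below; only the fuzzy-match branch is prescribed above, so treating an unambiguous second recognition result as decisive on its own is this sketch's assumption.

```python
def speech_recognition_result(second_result: str, first_result: str) -> str:
    """Combine the fine phoneme-path result with the coarse audio-path result."""
    if second_result == "fuzzy_match":
        # Only the fuzzy case consults the first recognition result.
        return "match" if first_result == "exact_match" else "no_match"
    # Assumption: an unambiguous second result stands on its own.
    return "match" if second_result == "exact_match" else "no_match"

print(speech_recognition_result("fuzzy_match", "exact_match"))  # match
print(speech_recognition_result("fuzzy_match", "no_match"))     # no_match
```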
According to embodiments of the present disclosure, the present disclosure further provides an electronic device and an intelligent automobile.
The electronic device may include an audio pickup and the voice chip described in any of the above embodiments. The audio pickup may be configured to acquire the input voice, and the voice chip may be configured to perform offline voice recognition on the input voice; the voice recognition method may be the method according to any embodiment of the present disclosure. Optionally, the audio pickup may be a microphone or another device capable of collecting audio signals.
The smart car may include an audio pickup, the voice chip described in any of the above embodiments, and a vehicle-mounted machine (head-unit) system. The audio pickup may be configured to acquire the input voice, and the voice chip may be configured to perform offline voice recognition on the input voice. The vehicle-mounted machine system is communicatively coupled to the voice chip and may be configured to perform matching vehicle-mounted machine control according to the voice recognition result of the voice chip.
In an optional embodiment of the present disclosure, the smart car may further include an audio player, and the audio player may be configured to play the response voice output by the voice chip according to the voice recognition result. The audio player may be a speaker or other sound-producing device.
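An illustrative end-to-end dispatch for the smart car might look as follows; the command table, action names, and response strings are invented for this sketch and are not taken from this disclosure.

```python
# Hypothetical head-unit dispatch driven by the chip's recognition result.
COMMANDS = {
    "open the window": "window.open",
    "turn on the radio": "radio.on",
}

def on_recognition(target_text: str, result: str) -> None:
    if result != "match":
        return                                  # ignore non-matching input voice
    action = COMMANDS.get(target_text)
    if action is not None:
        print(f"head unit executes: {action}")          # matched in-vehicle control
        print(f"audio player says: OK, {target_text}")  # response voice playback

on_recognition("turn on the radio", "match")
```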
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (14)

1. A speech chip comprising:
an audio feature extractor configured to extract a voice audio feature in an input voice;
a phoneme feature extractor configured to extract a speech phoneme feature in an input speech; and
a speech recognizer configured to: acquiring a first recognition result matched with the voice audio features and a second recognition result matched with the voice phoneme features; using the first recognition result to perform auxiliary judgment on the second recognition result to obtain a voice recognition result of the input voice;
the function of the speech recognizer for performing the auxiliary discrimination on the second recognition result by using the first recognition result to obtain the speech recognition result of the input speech is specifically configured to: if the second recognition result is fuzzy matching, acquiring a first recognition result for auxiliary judgment; if the first recognition result is an accurate match, determining that the voice recognition result between the input voice and the target text is a match; and if the first recognition result is not matched, determining that the voice recognition result between the input voice and the target text is not matched.
2. The speech chip of claim 1, wherein the audio feature extractor comprises:
a signal preprocessing unit configured to perform signal preprocessing on an input voice;
a signal transformation processing unit configured to perform time-frequency domain transformation processing on the preprocessed input signal to obtain a time-frequency domain transformation processing signal;
a parameter extraction unit configured to perform parameter extraction on the time-frequency domain transformation processing signal to obtain at least one speech parameter; and
a feature quantization unit configured to perform feature quantization on each speech parameter to obtain the speech audio features.
3. The speech chip according to claim 1, wherein the phoneme feature extractor is embedded with a pre-trained neural network model;
the neural network model is configured to extract phonetic phoneme features in the input speech.
4. The voice chip according to claim 1, wherein the voice chip further comprises a voice input module;
the voice input module is configured to perform signal preprocessing on the input voice and send the preprocessed input voice to the phoneme feature extractor.
5. The voice chip according to claim 1, wherein the voice chip further comprises at least one of:
an external data storage configured to store an operating program and data of the speech recognizer;
an internal data storage configured to store a start-up program of the speech recognizer;
the peripheral module comprises at least one form of communication interface and is configured to establish communication connection between the voice chip and other equipment;
a voice output module configured to perform voice output on the response voice generated by the voice recognizer according to the voice recognition result.
6. The speech chip according to any one of claims 1-5, wherein the audio feature extractor is a dedicated speech accelerator;
the phoneme feature extractor is an embedded neural network processor;
the speech recognizer is at least one central processing unit or at least one digital signal processor.
7. A speech recognition method, performed by a speech recognizer in a speech chip, comprising:
acquiring voice audio features extracted from input voice by an audio feature extractor in a voice chip, and acquiring voice phoneme features extracted from the input voice by a phoneme feature extractor in the voice chip;
acquiring a first recognition result matched with the voice audio features and a second recognition result matched with the voice phoneme features;
using the first recognition result to perform auxiliary judgment on the second recognition result to obtain a voice recognition result of the input voice;
the method for performing auxiliary judgment on the second recognition result by using the first recognition result to obtain the voice recognition result of the input voice includes:
if the second recognition result is a fuzzy match, acquiring the first recognition result for auxiliary judgment;
if the first recognition result is an exact match, determining that the voice recognition result between the input voice and a target text is a match;
and if the first recognition result is no match, determining that the voice recognition result between the input voice and the target text is no match.
8. The method of claim 7, wherein obtaining a first recognition result matching the speech audio feature comprises:
inputting the voice audio features into a pre-trained classifier, and acquiring at least one target voice feature label matched with the voice audio features;
acquiring a standard voice feature label matched with a target text in a voice feature quantization library;
matching each target voice feature tag with each standard voice feature tag to obtain a first recognition result between the input voice and a target text;
wherein the first recognition result comprises an exact match or no match.
9. The method of claim 8, wherein obtaining the second recognition result matching the phonetic phoneme characteristics comprises:
inputting the voice phoneme features into a pre-trained decoder, and acquiring input text matching result features matched with the voice phoneme features;
acquiring standard text characteristics matched with a target text from a multi-word model library;
calculating a similarity score between the input text matching result features and the standard text features;
determining a second recognition result between the input voice and a target text according to the similarity score;
wherein the second recognition result comprises an exact match, a fuzzy match and a mismatch.
10. The method of claim 9, wherein determining a second recognition result between the input speech and a target text according to the similarity score comprises:
if the similarity score is not smaller than a first score threshold value, determining that a second recognition result is an accurate match;
determining that the second recognition result is a mismatch if the similarity score is not greater than a second score threshold, wherein the second score threshold is less than the first score threshold;
and if the similarity score is smaller than a first score threshold value and larger than a second score threshold value, determining that the second recognition result is fuzzy matching.
11. A speech recognition apparatus, executed by a speech recognizer in a speech chip, comprising:
the voice feature acquisition module is configured to acquire voice audio features extracted from the input voice by the audio feature extractor in the voice chip and acquire voice phoneme features extracted from the input voice by the phoneme feature extractor in the voice chip;
a dual recognition result acquisition module configured to acquire a first recognition result matching the speech audio feature and a second recognition result matching the speech phoneme feature;
the auxiliary judgment module is configured to perform auxiliary judgment on a second recognition result by using a first recognition result to obtain a voice recognition result of the input voice;
the auxiliary judgment module is specifically configured to: acquire the first recognition result for auxiliary judgment if the second recognition result is a fuzzy match; determine that the voice recognition result between the input voice and the target text is a match if the first recognition result is an exact match; and determine that the voice recognition result between the input voice and the target text is no match if the first recognition result is no match.
12. An electronic device, comprising:
an audio pickup configured to acquire an input voice; and
the speech chip of any of claims 1-6, configured to perform offline speech recognition on the input speech.
13. An intelligent vehicle comprising:
an audio pickup configured to acquire an input voice;
the voice chip of any of claims 1-6, configured to perform offline voice recognition on the input voice; and
a vehicle-mounted machine system configured to perform matching vehicle-mounted machine control according to the voice recognition result of the voice chip.
14. The smart car of claim 13, further comprising:
an audio player configured to play, in voice form, the response voice output by the voice chip for the voice recognition result.
CN202210389461.7A 2022-04-13 2022-04-13 Voice chip, voice recognition method, device and equipment and intelligent automobile Active CN114724544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210389461.7A CN114724544B (en) 2022-04-13 2022-04-13 Voice chip, voice recognition method, device and equipment and intelligent automobile

Publications (2)

Publication Number Publication Date
CN114724544A CN114724544A (en) 2022-07-08
CN114724544B (en) 2022-12-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant