CN117524198B - Voice recognition method and device and vehicle - Google Patents

Voice recognition method and device and vehicle

Info

Publication number
CN117524198B
CN117524198B (application CN202311844966.9A)
Authority
CN
China
Prior art keywords
word
ipa
voice recognition
syllables
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311844966.9A
Other languages
Chinese (zh)
Other versions
CN117524198A (en)
Inventor
张辽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd filed Critical Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202311844966.9A priority Critical patent/CN117524198B/en
Publication of CN117524198A publication Critical patent/CN117524198A/en
Application granted granted Critical
Publication of CN117524198B publication Critical patent/CN117524198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub-unit, e.g. by using mathematical models
    • B60W40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub-unit, e.g. by using mathematical models related to drivers or passengers
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08 Interaction between the driver and the control system
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub-unit, e.g. by using mathematical models
    • B60W40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub-unit, e.g. by using mathematical models related to drivers or passengers
    • B60W2040/089 Driver voice
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00 Input parameters relating to occupants
    • B60W2540/21 Voice

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a voice recognition method and device, and a vehicle. The method comprises the following steps: receiving a voice request sent by a user in the vehicle cabin; extracting features of the voice request to be recognized to generate feature vectors; and outputting corresponding voice recognition text through an end-to-end preset voice recognition model according to the input feature vectors, and displaying it subword by subword on a graphical user interface of the vehicle-mounted system. The modeling units of the voice recognition model comprise subword units, wherein the number of subwords into which a single word of the voice recognition text is split is the same as the number of its corresponding IPA syllables, and the subwords are forcibly aligned with the syllables, so that the corresponding subwords are output one by one according to the IPA syllables. According to the scheme provided by the application, the voice recognition text output end to end is strongly associated with pronunciation, the recognition efficiency is high, and the data consumption is small.

Description

Voice recognition method and device and vehicle
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method, device, and vehicle.
Background
In the related art, an end-to-end speech recognition model can output text directly from input audio data; it offers high recognition efficiency and is widely used. For models that recognize Latin-alphabet languages such as English, French, and German, the modeling unit is typically the wordpiece. A wordpiece can be regarded as a fragment of a word: a word may be split into at least one subword, each subword consisting of one letter or of several consecutively spelled letters. However, the resulting subwords relate only to spelling and bear no direct relation to the pronounced syllables of the word, while a purely syllable-based modeling unit cannot directly deliver the end-to-end speech recognition effect.
At present, in order for Latin-language speech recognition to produce results associated with pronunciation, a large amount of audio data in each language is required to train the speech recognition model so that it learns the mapping between pronunciation and wordpieces, which incurs high training cost and long training time.
Disclosure of Invention
In order to solve, or at least partially solve, the above problems in the related art, the present application provides a voice recognition method and device, and a vehicle, which achieve a strong association between end-to-end output voice recognition text and pronunciation, with high recognition efficiency and low data consumption.
A first aspect of the present application provides a speech recognition method, including:
receiving a voice request sent by a user in the vehicle cabin; extracting features of the voice request to be recognized to generate feature vectors; and outputting corresponding voice recognition text through an end-to-end preset voice recognition model according to the input feature vectors, and displaying it subword by subword on a graphical user interface of the vehicle-mounted system. The modeling units of the voice recognition model comprise subword units, wherein the number of subwords into which a single word of the voice recognition text is split is the same as the number of its corresponding IPA syllables, and the subwords are forcibly aligned with the syllables, so that the corresponding subwords are output one by one according to the IPA syllables. In the voice recognition method, the number of subwords after a word is split is constrained to match the number of corresponding IPA syllables, and the subword units are forcibly aligned with the IPA syllables through a preset alignment algorithm to form a one-to-one mapping. The text output by the voice recognition model therefore consists of subwords strongly associated with pronunciation; prior knowledge of language pronunciation rules can be exploited to reduce the required data volume, while the advantages of the wordpiece are retained, achieving the end-to-end voice recognition effect.
In the voice recognition method of the present application, the number of subwords into which a single word of the voice recognition text is split is the same as the number of corresponding IPA syllables and is forcibly aligned with them, so that the corresponding subwords are output one by one for display according to the IPA syllables, which includes:
if the number of letters of a word is greater than or equal to the number of its corresponding IPA syllables, obtaining a mapping relation between each letter of the word and one of the IPA syllables through a preset alignment algorithm;
forming a subword from consecutively spelled letters that share the same mapping relation, and respectively obtaining the alignment result of each subword with a single IPA syllable; wherein the number of subwords is the same as the number of IPA syllables;
and outputting the subwords one by one in time sequence according to the alignment results.
In the voice recognition method of the present application, the method further includes:
and if the number of letters of the word is smaller than the number of its corresponding IPA syllables, repeating the corresponding letters until their count matches the number of IPA syllables, and setting a preset merge symbol after each repeated letter, so that the repeated letters are merged back into one letter according to the preset merge symbols before output.
In the voice recognition method of the present application, obtaining, through a preset alignment algorithm, a mapping relation between each letter of the word and one of the IPA syllables, forming a subword from consecutively spelled letters that share the same mapping relation, and obtaining the alignment result of each subword with a single IPA syllable includes:
obtaining a first mapping relation between each letter in the word and one of the IPA syllables through a first preset alignment algorithm; forming a subword from consecutively spelled letters that share the same mapping relation, and obtaining a first alignment result of each subword with a single IPA syllable;
obtaining a second mapping relation between each letter in the word and one of the IPA syllables through a second preset alignment algorithm; forming a subword from consecutively spelled letters that share the same mapping relation, and obtaining a second alignment result of each subword with a single IPA syllable; wherein the second preset alignment algorithm is different from the first preset alignment algorithm.
In the voice recognition method of the present application, outputting the subwords one by one in time sequence according to the alignment results includes:
outputting the subwords of either the first alignment result or the second alignment result in time sequence when the first alignment result is the same as the second alignment result.
In the voice recognition method of the present application, outputting the subwords one by one in time sequence according to the alignment results includes:
when the first alignment result is different from the second alignment result, respectively obtaining preset scores corresponding to the differing subwords in the first alignment result and the second alignment result; wherein the preset score is the occurrence frequency of a subword in the corpus;
and outputting, one by one in time sequence, the subwords of whichever of the first alignment result and the second alignment result has the higher total score.
In the voice recognition method of the present application, the method further includes:
and presetting a separator before each subword unit located at the beginning of a word, so that when a subword of the voice recognition text is word-initial, a space character is generated and displayed according to the separator.
A second aspect of the present application provides a speech recognition apparatus, comprising:
a voice receiving module, configured to receive a voice request sent by a user in the vehicle cabin;
a feature extraction module, configured to extract features of the voice request to be recognized and generate feature vectors;
a voice recognition module, configured to output corresponding voice recognition text through an end-to-end preset voice recognition model according to the input feature vectors and to display it subword by subword on a graphical user interface of the vehicle-mounted system;
wherein the modeling units of the voice recognition model comprise subword units, the number of subwords into which a single word of the voice recognition text is split is the same as the number of corresponding IPA syllables, and the subwords are forcibly aligned with the syllables, so that the corresponding subwords are output one by one according to the IPA syllables.
A third aspect of the present application provides a vehicle comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method as described above.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon executable code which, when executed by a processor of a vehicle, causes the processor to perform a method as described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 is a flow chart of a speech recognition method shown in the present application;
FIG. 2 is another flow chart of a speech recognition method shown in the present application;
FIG. 3 is a flow chart of aligning subwords and IPA syllables according to a preset algorithm;
FIG. 4 is a schematic diagram of letter processing when the number of letters of a word in the present application is less than the corresponding number of IPA syllables;
FIG. 5 is a schematic diagram of the structure of the speech recognition device shown in the present application;
FIG. 6 is another schematic structural view of the speech recognition device shown in the present application;
FIG. 7 is a schematic structural diagram of the vehicle shown in the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, such information should not be limited by these terms, which are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more, unless explicitly defined otherwise.
In the related art, the wordpiece modeling unit that an end-to-end speech recognition model uses for Latin-alphabet languages relates only to the spelled letters of a word and not to its pronounced syllables. When end-to-end speech recognition is performed on mixed languages, such as Chinese-English, Chinese-French, or English-French, the pronunciation characteristics of Chinese, where each pronounced syllable represents one character, allow speech recognition to produce a one-sound-one-character text result; for Latin-alphabet languages, however, the output text recognition result cannot be produced in association with the pronunciation of the corresponding language.
To address these problems, the present application provides a voice recognition method that achieves a strong association between the end-to-end output voice recognition text and pronunciation, with high recognition efficiency and low data consumption.
The technical scheme of the present application is described in detail below with reference to the accompanying drawings.
FIG. 1 is a flow chart of a speech recognition method shown in the present application.
Referring to FIG. 1, the speech recognition method shown in the present application includes:
s110, receiving a voice request sent by a user in a vehicle seat cabin.
The main body of the speech recognition process may be a speech recognition system mounted on a server, a vehicle, or other intelligent device. Taking a vehicle as an example, when a user performs man-machine interaction in the vehicle through voice, a voice request sent by the user in the cabin can be collected in real time through a microphone in the vehicle. The language included in the voice request of the present application may be a single language or a mixed language.
S120, extracting features of the voice request to be recognized, and generating feature vectors.
In this step, the speech signal of the voice request to be recognized may be framed according to the related art, and acoustic feature extraction may be performed on each frame of the speech signal to obtain a feature vector corresponding to each frame. For example, in the related art, the speech signal is commonly converted into feature vectors such as Mel-frequency cepstral coefficients (MFCC) or other cepstral features, which is merely illustrative and not limiting.
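For illustration only, the framing and feature extraction described above may be sketched in Python as follows; librosa, the 16 kHz sampling rate, and the 25 ms window with 10 ms hop are assumptions of this sketch, not values mandated by the present application:

    # Sketch of per-frame acoustic feature extraction (assumed parameters).
    import librosa
    import numpy as np

    def extract_feature_vectors(wav_path: str) -> np.ndarray:
        """Return one MFCC feature vector per frame of the voice request."""
        signal, sr = librosa.load(wav_path, sr=16000)  # resample to 16 kHz
        mfcc = librosa.feature.mfcc(
            y=signal,
            sr=sr,
            n_mfcc=13,                    # 13 cepstral coefficients per frame
            n_fft=int(0.025 * sr),        # 25 ms analysis window
            hop_length=int(0.010 * sr),   # 10 ms frame shift
        )
        return mfcc.T                     # shape: (num_frames, 13)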
S130, outputting corresponding voice recognition text through an end-to-end preset voice recognition model according to the input feature vectors, and displaying it subword by subword on a graphical user interface of the vehicle-mounted system; the modeling units of the voice recognition model comprise subword units, wherein the number of subwords into which a single word of the voice recognition text is split is the same as the number of corresponding IPA syllables and is forcibly aligned with them, so that the corresponding subwords are output one by one according to the IPA syllables.
In this step, the feature vectors are input into a pre-trained end-to-end speech recognition model to directly output the corresponding speech recognition text. When the voice request contains Chinese and/or Latin-alphabet languages, the end-to-end speech recognition technology of the related art already allows the text corresponding to the Chinese portion to be output and displayed character by character, each character corresponding one-to-one to a pronounced syllable in the voice request. The Latin-alphabet portion of the present application is decomposed word by word into subwords and output and displayed sequentially, one subword at a time: a word may be split into at least one subword, the number of split subwords is kept exactly consistent with the number of IPA syllables of the word, and the two are matched one-to-one and forcibly aligned. The effect is that each IPA syllable triggers the output and display of one subword, so that pronunciation and the displayed text are strongly associated.
It should be noted that when the speech recognition model of the present application is also used to recognize non-Latin languages such as Chinese, the modeling units further include single-character units, i.e., each Chinese character serves as its own modeling unit.
Specifically, taking English and French as examples: although each language has its own phoneme inventory, some phonemes are shared between languages. On this basis, using IPA (International Phonetic Alphabet) syllables as training labels for the model merges syllables across languages, thereby greatly reducing the required data volume and, in turn, system resource consumption. In the training stage of the end-to-end speech recognition model, the number of subwords after a word is split is kept consistent with the number of corresponding IPA syllables, and the subword units are forcibly aligned with the IPA syllables through a preset alignment algorithm to form a one-to-one mapping. The text output by the speech recognition model therefore consists of subwords strongly associated with pronunciation; prior knowledge of language pronunciation rules can be exploited to reduce the data volume requirement, while the advantages of the wordpiece are retained, achieving the end-to-end speech recognition effect.
The speech recognition method of the present application is further described below. Referring to FIG. 2 and FIG. 3, the speech recognition method shown in the present application includes:
s210, receiving a voice request sent by a user in a vehicle seat cabin.
The description of this step is the same as that of S110, and will not be repeated here.
S220, extracting features of the voice request to be recognized, and generating feature vectors.
The description of this step is the same as that of S120, and will not be repeated here.
S230, outputting corresponding voice recognition text through an end-to-end preset voice recognition model according to the input feature vectors, and displaying it subword by subword on a graphical user interface of the vehicle-mounted system; the modeling units of the voice recognition model comprise subword units; a mapping relation between each letter of a word and one of the IPA syllables is obtained through a preset alignment algorithm; a subword is formed from consecutively spelled letters that share the same mapping relation, and the alignment result of each subword with a single IPA syllable is respectively obtained, wherein the number of subwords is the same as the number of IPA syllables; and the subwords are output one by one in time sequence according to the alignment results.
For Latin-alphabet speech recognition, in order to obtain the forced alignment between each subword of a word and an IPA syllable, forced alignment can be performed in this step according to a preset alignment algorithm. The preset alignment algorithm may be one alignment algorithm of the related art with a forced-alignment capability, or two different such algorithms; when two alignment algorithms are used, each has different strengths and weaknesses, and a better alignment can be obtained through their complementarity. For example, the preset alignment algorithm may be an HMM-GMM forced alignment algorithm of the related art and/or an end-to-end forced alignment algorithm based on monotonic attention (a neural network). The HMM-GMM algorithm uses the EM algorithm for forced alignment; it has a rigorous mathematical derivation and can guarantee that the segmented wordpiece subwords are the logically optimal segmentation, without strictly fitting the training data. The end-to-end alignment algorithm based on monotonic attention lacks such a derivational guarantee, but fits the training data to the greatest extent and is strongly correlated with the training data.
Specifically, after the feature vector corresponding to each frame of the speech signal is obtained in step S220, the preset alignment algorithm in the speech recognition model may in this step first force each letter recognized from the feature vectors of the non-silence frames to align with an IPA syllable, and then form, from consecutively spelled letters corresponding to the same IPA syllable, the subword corresponding to that syllable, thereby obtaining the alignment result of each IPA syllable with a subword.
In a specific embodiment, a first mapping relation between each letter in the word and one of the IPA syllables is obtained through a first preset alignment algorithm; a subword is formed from consecutively spelled letters that share the same mapping relation, and a first alignment result of each subword with a single IPA syllable is obtained. A second mapping relation between each letter in the word and one of the IPA syllables is obtained through a second preset alignment algorithm; a subword is formed from consecutively spelled letters that share the same mapping relation, and a second alignment result of each subword with a single IPA syllable is obtained; the second preset alignment algorithm is different from the first preset alignment algorithm. Here, the first preset alignment algorithm is the HMM-GMM forced alignment algorithm, and the second preset alignment algorithm is the end-to-end forced alignment algorithm based on a monotonic-attention neural network. It will be appreciated that although the two algorithms differ, the first and second alignment results obtained may be entirely identical, partially identical, or entirely different.
For ease of understanding, the word intelligence is used as an example to illustrate how alignment results are obtained during training of the speech recognition model. intelligence is a sequence of 12 consecutively spelled letters and corresponds to a sequence of 5 IPA syllables, which can be written as ɪ_n, t_ɛ, l_ɪ, d_ʒ_ə_n, and s. The alignment results of the two different alignment algorithms are shown in FIG. 3.
As shown in FIG. 3, in the alignment process of the HMM-GMM forced alignment algorithm, the first letter i and the letter n are both mapped to the IPA syllable ɪ_n, and these consecutively spelled letters with the same mapping form the subword in. Similarly, the letters t, e, and l are mapped to the IPA syllable t_ɛ and form the subword tel; the letters l and i are mapped to the IPA syllable l_ɪ and form the subword li; and so on. Each letter thus forms a first mapping relation with an IPA syllable, and from these first mapping relations the first alignment result of the word's subwords with all of the IPA syllables can be found. Specifically, the first alignment result comprises the 5 subwords obtained from the first mapping relations, namely in, tel, li, gen, and ce; these 5 subwords and the 5 IPA syllables constitute the forcibly aligned first alignment result.
As shown in FIG. 3, in the same way, the end-to-end forced alignment algorithm based on monotonic attention generates a second mapping relation that partially differs from the preceding one: the letters t and e are mapped to the IPA syllable t_ɛ, giving the consecutively spelled subword te, and the letters l, l, and i are mapped to the IPA syllable l_ɪ, forming the subword lli. According to the second mapping relation, the word is split into 5 subwords, namely in, te, lli, gen, and ce; these 5 subwords and the 5 IPA syllables constitute the forcibly aligned second alignment result.
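For illustration only, the grouping step shared by both aligners, i.e., merging consecutively spelled letters that map to the same IPA syllable into one subword, may be sketched as follows; the function name is an assumption of the sketch, and the letter-level mapping reproduces the HMM-GMM example above:

    # Sketch: group consecutive letters that share one IPA-syllable label.
    from itertools import groupby

    def letters_to_subwords(letters, syllable_per_letter):
        """Return (subword, IPA syllable) pairs with a 1:1 correspondence."""
        pairs = zip(letters, syllable_per_letter)
        alignment = []
        for syllable, group in groupby(pairs, key=lambda p: p[1]):
            subword = "".join(letter for letter, _ in group)
            alignment.append((subword, syllable))
        return alignment

    letters = list("intelligence")
    mapping = (["ɪ_n"] * 2 + ["t_ɛ"] * 3 + ["l_ɪ"] * 2
               + ["d_ʒ_ə_n"] * 3 + ["s"] * 2)
    print(letters_to_subwords(letters, mapping))
    # [('in', 'ɪ_n'), ('tel', 't_ɛ'), ('li', 'l_ɪ'),
    #  ('gen', 'd_ʒ_ə_n'), ('ce', 's')]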
It can be understood that whether one alignment algorithm or two different alignment algorithms are used, the word is split into a number of subwords exactly consistent with the number of IPA syllables, so that IPA syllables and subwords correspond one to one and the speech recognition result can output a subword sequence associated with pronunciation.
Further, in order to output subwords that better conform to pronunciation rules, the two different alignment algorithms above may be used to generate the corresponding first and second alignment results. In some embodiments, when the first alignment result is the same as the second alignment result, the subwords of either alignment result are output and displayed in time sequence. Obviously, if the two alignment results are completely identical, that is, the subwords split from the word are identical and their mapping to each IPA syllable is identical, the subwords of either result are output as the final text.
In some embodiments, when the first alignment result is different from the second alignment result, the preset scores corresponding to the differing subwords in the first and second alignment results are respectively obtained, the preset score being the occurrence frequency of the subword in the corpus; the subwords of whichever alignment result has the higher preset score are then output and displayed one by one in time sequence. As shown in FIG. 3, if the two alignment results contain partially different subwords, filtering may be performed according to a preset rule so that the subwords of the alignment result that better conforms to pronunciation rules are taken as the final output text.
Specifically, the present application uses the occurrence frequency of subwords in the corpus (i.e., their number of occurrences) as the criterion for screening subwords. Each language has a public corpus containing a large number of words of that language, and each word can be split into at least one subword. According to the related art, the occurrence frequency of each subword over the whole corpus can be recognized and counted in advance; for example, the occurrence frequency of each of the subwords te, tel, li, and lli can be counted. These frequency values reflect the usage habits of the language relatively objectively. When the first and second alignment results differ, the occurrence frequencies of the differing subwords in each result are obtained, the frequencies within the first alignment result are summed, and the frequencies within the second alignment result are summed; the alignment result with the higher total score can then be determined and used as the final output text. For example, in this embodiment the sum of the occurrence frequencies of te and lli in the second alignment result is greater than the sum of the occurrence frequencies of tel and li in the first alignment result, so the second alignment result is taken as the final output text for display.
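For illustration only, the frequency-based selector may be sketched as follows; the frequency table is a made-up placeholder, and in practice the counts would be gathered from the public corpus described above:

    # Sketch: choose between two alignment results by total subword frequency.
    subword_freq = {"in": 910_000, "te": 400_000, "tel": 120_000,
                    "li": 250_000, "lli": 380_000, "gen": 600_000,
                    "ce": 730_000}   # placeholder corpus counts

    def pick_alignment(first, second):
        """Each argument is a list of (subword, IPA syllable) pairs."""
        if first == second:
            return first              # identical splits: either may be output
        def score(alignment):
            return sum(subword_freq.get(sw, 0) for sw, _ in alignment)
        return first if score(first) >= score(second) else second

With the placeholder counts above, te plus lli scores higher than tel plus li, so the second alignment result would be chosen, matching the example in this embodiment.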
From the above, the present application splits the speech recognition text of a word into subwords whose number is exactly consistent with its IPA syllables using two alignment algorithms with complementary strengths, producing subword sequences that may be identical or different, and then uses a selector based on occurrence frequency to screen the better alignment result for output and display, achieving a recognition effect that best conforms to pronunciation rules.
In some embodiments, when the number of letters of a word is greater than or equal to the number of IPA syllables it contains, the final alignment result is obtained and displayed as in the embodiments above. For example, the word intelligence has 12 consecutively spelled letters but only 5 IPA syllables; the number of letters significantly exceeding the number of IPA syllables is the objective situation for most words.
As shown in FIG. 4, in some embodiments, if the number of letters of the word is smaller than the number of its corresponding IPA syllables, the corresponding letters are repeated until their count matches the number of IPA syllables, and a preset merge symbol is set after each repeated letter, so that the repeated letters are merged back into one letter according to the preset merge symbols before display. For ease of understanding, take a word that consists of only the single subword w: the letter w corresponds to 3 syllables, which can be written as d_ʌ, b_ə, and lju, so the number of letters is clearly smaller than the number of syllables. On this basis, the letter is repeated until it matches the number of syllables, i.e., 1 w is repeated into 3 w, and a preset merge symbol is then set after each copy, so that the 3 repeated letters are merged back into the original single letter for output when displayed on screen.
The preset merge symbols include 3 different markers indicating whether the preceding letter occupies the first, middle, or last position. For example, the preset merge symbols may include "_s", "_m", and "_e", placed respectively after the first, middle, and last of the repeated letters. When the first and last copies are recognized, the repeated middle copies are omitted and the copies are merged so that only the first letter is displayed. For example, when w is repeated into 3 copies, the mapping to the IPA syllables is looked up one-to-one against "w_s w_m w_e": specifically, w_s is mapped to d_ʌ, w_m is mapped to b_ə, and w_e is mapped to lju, ensuring that the number of subword units is consistent with, and aligned to, the number of IPA syllables. Meanwhile, when the preset merge symbols are recognized, the recognition text displayed on screen is merged into w for output. As another example, take a word that consists of only the single subword f, which has only 2 syllables, ɛ and f. Similarly, "f_s f_e" is aligned one-to-one with "ɛ f", and the output text is simply "f".
As a further example, take the common word "KFC": it comprises 3 letter subwords, while the corresponding IPA syllables number 4, namely kei, ɛ, f, and si, so the number of subwords is smaller than the number of syllables. Accordingly, "K" is aligned with "kei", "F_s" is aligned with "ɛ", "F_e" is aligned with "f", and "C" is aligned with "si"; the subword sequence is "K F_s F_e C", and the output text is "KFC".
It can be understood that when the number of letters is smaller than the number of IPA syllables, the letters are repeated in the manner above and the corresponding preset merge symbols are set according to the repeated count, after which the subsequent steps follow the alignment algorithms and the screening described earlier, satisfying the strict consistency between the number of subword units and the number of IPA syllables and ensuring that the displayed output text is correct.
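For illustration only, the letter-repetition scheme may be sketched as follows, assuming the "_s", "_m", and "_e" merge symbols described above; the display side keeps only the "_s" copy and strips its marker:

    # Sketch: expand a single letter to match a syllable count, then collapse.
    def expand_letter(letter, n_syllables):
        """Repeat one letter so it aligns 1:1 with n_syllables IPA syllables."""
        if n_syllables == 1:
            return [letter]
        units = [letter + "_s"]                       # first copy
        units += [letter + "_m"] * (n_syllables - 2)  # middle copies, if any
        units.append(letter + "_e")                   # last copy
        return units

    def collapse(units):
        """Merge marker-tagged repeats back into the written form."""
        out = []
        for u in units:
            if u.endswith("_m") or u.endswith("_e"):
                continue                              # drop repeated copies
            out.append(u[:-2] if u.endswith("_s") else u)
        return "".join(out)

    print(expand_letter("w", 3))               # ['w_s', 'w_m', 'w_e']
    print(collapse(["K", "F_s", "F_e", "C"]))  # KFC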
Further, in order to satisfy the writing rule that a space separates the texts of two adjacent words in a sentence, in some embodiments a separator is preset before each subword unit located at the beginning of a word, so that when a subword of the speech recognition text is word-initial, a space character is generated and displayed according to the separator. That is, for subwords that occur word-initially in the corpus, the corresponding modeling units carry a separator before the initial letter. The separator may be any custom symbol, such as "_", which is merely an example and not a limitation. Taking the subword "in" as an example, the same subword may have two different modeling units: when it is word-initial, the corresponding modeling unit is "_in"; when it is not, the corresponding modeling unit is "in". It will be appreciated that, through training, the separator is automatically omitted and rendered as a space when displayed on screen, so that the on-screen output text conforms to the writing convention.
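For illustration only, the display-side handling of the word-initial separator may be sketched as follows, assuming "_" as the separator symbol used in the example above:

    # Sketch: render a decoded subword sequence as display text with spaces.
    def render(subwords):
        text = ""
        for sw in subwords:
            if sw.startswith("_"):                # word-initial subword unit
                text += (" " if text else "") + sw[1:]
            else:                                 # word-internal subword
                text += sw
        return text

    print(render(["_in", "tel", "li", "gen", "ce"]))  # intelligence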
In summary, the voice recognition method of the present application obtains, through different alignment algorithms, alignment results in which subwords and IPA syllables are equal in number and correspond one to one; when the two alignment results differ, the more objective corpus occurrence frequency is used to select the split that best matches common usage for one-by-one output, improving the association between the output text of the end-to-end speech recognition model and word pronunciation. In addition, when the number of letters is smaller than the number of IPA syllables, the count is increased by repeating letters, and the original letter count is restored through the preset merge symbols; the design is compact, does not burden the system's computing resources, and preserves speech recognition efficiency.
Corresponding to the embodiment of the application function implementation method, the application also provides a voice recognition device, a vehicle and corresponding embodiments.
FIG. 5 is a schematic structural diagram of the speech recognition apparatus shown in the present application.
Referring to FIG. 5, the speech recognition apparatus shown in the present application includes a voice receiving module 510, a feature extraction module 520, and a voice recognition module 530, wherein:
The voice receiving module 510 is configured to receive a voice request sent by a user in the vehicle cabin.
The feature extraction module 520 is configured to perform feature extraction on the voice request to be recognized and generate feature vectors.
The voice recognition module 530 is configured to output corresponding voice recognition text through an end-to-end preset voice recognition model according to the input feature vectors, and to display it subword by subword on a graphical user interface of the vehicle-mounted system. The modeling units of the voice recognition model comprise subword units, wherein the number of subwords into which a single word of the voice recognition text is split is the same as the number of corresponding IPA syllables and is forcibly aligned with them, so that the corresponding subwords are output one by one according to the IPA syllables.
In some embodiments, the voice recognition module 530 includes an alignment module 531: if the number of letters of the word is greater than or equal to the number of corresponding IPA syllables, the mapping relation between each letter of the word and one of the IPA syllables is obtained through a preset alignment algorithm; a subword is formed from consecutively spelled letters that share the same mapping relation, and the alignment result of each subword with a single IPA syllable is respectively obtained, wherein the number of subwords is the same as the number of IPA syllables; and the subwords are output one by one in time sequence according to the alignment results.
Referring to FIG. 6, in a specific embodiment, the voice recognition module 530 includes a first alignment module 531A and a second alignment module 531B. The first alignment module is configured to obtain a first mapping relation between each letter in the word and one of the IPA syllables through a first preset alignment algorithm, form a subword from consecutively spelled letters that share the same mapping relation, and obtain a first alignment result of each subword with a single IPA syllable. The second alignment module is configured to obtain a second mapping relation between each letter in the word and one of the IPA syllables through a second preset alignment algorithm, form a subword from consecutively spelled letters that share the same mapping relation, and obtain a second alignment result of each subword with a single IPA syllable; the second preset alignment algorithm is different from the first preset alignment algorithm.
In a specific embodiment, the voice recognition module further includes a discrimination module 532, configured to output the subwords of either the first or the second alignment result in time sequence when the two alignment results are the same. The discrimination module is further configured to, when the first and second alignment results differ, respectively obtain the preset scores corresponding to the differing subwords in the two results, the preset score being the occurrence frequency of the subword in the corpus, and to output, one by one in time sequence, the subwords of whichever alignment result has the higher total score.
In a specific embodiment, the voice recognition module further includes a spelling processing module 533, configured to, if the number of letters of the word is smaller than the number of corresponding IPA syllables, repeat the corresponding letters until their count matches the number of IPA syllables, and to set a preset merge symbol after each repeated letter, so that the repeated letters are merged back into one letter according to the preset merge symbols before output.
In summary, the speech recognition device of the present application enables the end-to-end speech recognition model to output subwords forcibly aligned with pronunciation, thereby exploiting prior knowledge of language pronunciation rules and matching users' reading habits for on-screen text, while the amount of data required for training is small and system resources are saved.
The specific manner in which the respective modules perform the operations in the apparatus of the above embodiments has been described in detail in the embodiments related to the method, and will not be described in detail herein.
FIG. 7 is a schematic structural diagram of the vehicle shown in the present application.
Referring to FIG. 7, a vehicle 1000 includes a memory 1010 and a processor 1020.
The processor 1020 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 1020 or other modules of the computer. The persistent storage may be a readable and writable storage device, and may be a non-volatile device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, the persistent storage employs a mass storage device (e.g., a magnetic or optical disk, or flash memory). In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or an optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random-access memory, and may store instructions and data needed by some or all of the processors at runtime. Furthermore, the memory 1010 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic and/or optical disks may also be employed. In some implementations, the memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., an SD card, a mini SD card, a micro SD card, etc.), a magnetic floppy disk, and the like. Computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 1010 has stored thereon executable code that, when processed by the processor 1020, can cause the processor 1020 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of a vehicle (or a server, etc.), causes the processor to perform part or all of the steps of the above-described methods according to the present application.
The embodiments of the present application have been described above, the foregoing description is exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (9)

1. A method of speech recognition, comprising:
receiving a voice request sent by a user in a vehicle cabin;
extracting features of the voice request to be recognized to generate feature vectors;
outputting corresponding voice recognition text through an end-to-end preset voice recognition model according to the input feature vectors, and displaying it subword by subword on a graphical user interface of a vehicle-mounted system; wherein the modeling units of the voice recognition model comprise subword units, the number of subwords into which a single word of the voice recognition text is split is the same as the number of corresponding IPA syllables, and the subwords are forcibly aligned with the syllables, so that the corresponding subwords are output one by one for display according to the IPA syllables;
if the number of letters of the word is greater than or equal to the number of corresponding IPA syllables, obtaining a mapping relation between each letter of the word and one of the IPA syllables through a preset alignment algorithm; forming a subword from consecutively spelled letters that share the same mapping relation, and respectively obtaining the alignment result of each subword with a single IPA syllable, wherein the number of subwords is the same as the number of IPA syllables; and outputting the subwords one by one in time sequence according to the alignment results.
2. The method according to claim 1, wherein the method further comprises:
and if the number of letters of the word is smaller than the number of corresponding IPA syllables, repeating the corresponding letters until their count matches the number of IPA syllables, and setting a preset merge symbol after each repeated letter, so that the repeated letters are merged back into one letter according to the preset merge symbols before output.
3. The method according to claim 1 or 2, wherein obtaining, through a preset alignment algorithm, a mapping relation between each letter of the word and one of the IPA syllables, forming a subword from consecutively spelled letters that share the same mapping relation, and obtaining the alignment result of each subword with a single IPA syllable comprises:
obtaining a first mapping relation between each letter in the word and one of the IPA syllables through a first preset alignment algorithm; forming a subword from consecutively spelled letters that share the same mapping relation, and obtaining a first alignment result of each subword with a single IPA syllable;
obtaining a second mapping relation between each letter in the word and one of the IPA syllables through a second preset alignment algorithm; forming a subword from consecutively spelled letters that share the same mapping relation, and obtaining a second alignment result of each subword with a single IPA syllable; wherein the second preset alignment algorithm is different from the first preset alignment algorithm.
4. The method according to claim 3, wherein outputting the subwords one by one in time sequence according to the alignment results comprises:
outputting the subwords of either the first alignment result or the second alignment result in time sequence when the first alignment result is the same as the second alignment result.
5. The method according to claim 3, wherein outputting the subwords one by one in time sequence according to the alignment results comprises:
when the first alignment result is different from the second alignment result, respectively obtaining preset scores corresponding to the differing subwords in the first alignment result and the second alignment result; wherein the preset score is the occurrence frequency of a subword in the corpus;
and outputting, one by one in time sequence, the subwords of whichever of the first alignment result and the second alignment result has the higher total score.
6. The method according to claim 2, wherein the method further comprises:
and presetting a separator before each subword unit located at the beginning of a word, and generating and displaying a space character according to the separator when a subword of the voice recognition text is word-initial.
7. A speech recognition apparatus, comprising:
a voice receiving module, configured to receive a voice request sent by a user in a vehicle cabin;
a feature extraction module, configured to extract features of the voice request to be recognized and generate feature vectors;
a voice recognition module, configured to output corresponding voice recognition text through an end-to-end preset voice recognition model according to the input feature vectors and to display it subword by subword on a graphical user interface of a vehicle-mounted system;
wherein the modeling units of the voice recognition model comprise subword units, the number of subwords into which a single word of the voice recognition text is split is the same as the number of corresponding IPA syllables, and the subwords are forcibly aligned with the syllables, so that the corresponding subwords are output one by one according to the IPA syllables; if the number of letters of the word is greater than or equal to the number of corresponding IPA syllables, a mapping relation between each letter of the word and one of the IPA syllables is obtained through a preset alignment algorithm; a subword is formed from consecutively spelled letters that share the same mapping relation, and the alignment result of each subword with a single IPA syllable is respectively obtained, wherein the number of subwords is the same as the number of IPA syllables; and the subwords are output one by one in time sequence according to the alignment results.
8. A vehicle, characterized by comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-6.
9. A computer readable storage medium having executable code stored thereon, which when executed by a processor of a vehicle causes the processor to perform the method of any of claims 1-6.
CN202311844966.9A 2023-12-29 2023-12-29 Voice recognition method and device and vehicle Active CN117524198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311844966.9A CN117524198B (en) 2023-12-29 2023-12-29 Voice recognition method and device and vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311844966.9A CN117524198B (en) 2023-12-29 2023-12-29 Voice recognition method and device and vehicle

Publications (2)

Publication Number Publication Date
CN117524198A CN117524198A (en) 2024-02-06
CN117524198B (en) 2024-04-16

Family

ID=89751554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311844966.9A Active CN117524198B (en) 2023-12-29 2023-12-29 Voice recognition method and device and vehicle

Country Status (1)

Country Link
CN (1) CN117524198B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299282A (en) * 2021-07-23 2021-08-24 北京世纪好未来教育科技有限公司 Voice recognition method, device, equipment and storage medium
CN115862600A (en) * 2023-01-10 2023-03-28 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN115910043A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3652732B1 (en) * 2017-07-10 2023-08-16 SCTI Holdings, Inc. Syllable based automatic speech recognition
US11410642B2 (en) * 2019-08-16 2022-08-09 Soundhound, Inc. Method and system using phoneme embedding
CN113539242A (en) * 2020-12-23 2021-10-22 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299282A (en) * 2021-07-23 2021-08-24 北京世纪好未来教育科技有限公司 Voice recognition method, device, equipment and storage medium
CN115862600A (en) * 2023-01-10 2023-03-28 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN115910043A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle

Also Published As

Publication number Publication date
CN117524198A (en) 2024-02-06

Similar Documents

Publication Publication Date Title
US11423238B2 (en) Sentence embedding method and apparatus based on subword embedding and skip-thoughts
CN107301860B (en) Voice recognition method and device based on Chinese-English mixed dictionary
CN107195295B (en) Voice recognition method and device based on Chinese-English mixed dictionary
CN113692616B (en) Phoneme-based contextualization for cross-language speech recognition in an end-to-end model
US7805312B2 (en) Conversation control apparatus
WO2018149209A1 (en) Voice recognition method, electronic device, and computer storage medium
CN109858038B (en) Text punctuation determination method and device
US9589563B2 (en) Speech recognition of partial proper names by natural language processing
WO2021179701A1 (en) Multilingual speech recognition method and apparatus, and electronic device
CN115862600B (en) Voice recognition method and device and vehicle
EP4060548A1 (en) Method and device for presenting prompt information and storage medium
JP7544989B2 (en) Lookup Table Recurrent Language Models
CN117859173A (en) Speech recognition with speech synthesis based model adaptation
US20240153484A1 (en) Massive multilingual speech-text joint semi-supervised learning for text-to-speech
KR20180025559A (en) Apparatus and Method for Learning Pronunciation Dictionary
CN116457871A (en) Improving cross-language speech synthesis using speech recognition
CN112639796B (en) Multi-character text input system with audio feedback and word completion
CN117524198B (en) Voice recognition method and device and vehicle
CN109002454B (en) Method and electronic equipment for determining spelling partition of target word
CN116312485B (en) Voice recognition method and device and vehicle
US20240185841A1 (en) Parameter-efficient model reprogramming for cross-lingual speech recognition
US20240013777A1 (en) Unsupervised Data Selection via Discrete Speech Representation for Automatic Speech Recognition
US12008986B1 (en) Universal semi-word model for vocabulary contraction in automatic speech recognition
US20240185844A1 (en) Context-aware end-to-end asr fusion of context, acoustic and text presentations
EP4216209A1 (en) Speech recognition method and apparatus, terminal, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant