CN115132187A - Hot word enhanced speech recognition method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN115132187A
Authority
CN
China
Prior art keywords
decoding
hot word
score
hot
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210658379.XA
Other languages
Chinese (zh)
Inventor
肖艳红
赵茂祥
李全忠
何国涛
蒲瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Original Assignee
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority to CN202210658379.XA
Publication of CN115132187A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a hot word enhanced speech recognition method and device, a storage medium and an electronic device. The method comprises: decoding the acoustic features of each frame of a speech signal according to a preset decoding network, wherein the decoding process comprises a plurality of decoding paths; judging whether a decoding path contains a hot word, wherein hot words are stored in a hot word set according to a preset rule and the hot word set is user-defined; if the decoding path contains a hot word, calculating the current hot word score according to a preset attenuation model, wherein the preset attenuation model characterizes a hot word weight that decays over time; and updating the accumulated score of the corresponding decoding path according to the current hot word score, and outputting the decoded speech recognition result. The application solves the technical problem that hot word enhancement methods perform poorly.

Description

Hot word enhanced speech recognition method and device, storage medium and electronic device
Technical Field
The present application relates to the field of speech recognition, and in particular to a hot word enhanced speech recognition method and device, a storage medium, and an electronic device.
Background
With advances in deep learning and chip performance, recognition accuracy and computing power have improved by leaps and bounds, and speech recognition applications such as online voice input and voice search are receiving more and more attention. In practice, different domains have their own specialized vocabulary ("domain hot words"), different users have different personalized requirements, and new hot topics and hot words keep emerging with the development of the internet and the mobile internet. Because such hot words are timely and specific, they are often not covered by the original language model, so recognition performance on them is poor.
In hot word enhancement methods in the related art, hot words are usually added to the corpus, the language model is retrained, and the decoding resources are rebuilt, which is time-consuming and cannot keep up with new hot words. Alternatively, a hot word network is constructed and combined with the original decoding network, or the hot word units added by the user are boosted during decoding, so that the decoding resources need not be rebuilt and the hot word information can be updated dynamically. However, the magnitude of the hot word weight and the way it is distributed affect not only how often hot words are output but also the accuracy of the whole recognition process: if the weight is too low, the hot words are not enhanced; if it is too high, the correct path may be pruned prematurely, which lowers the recognition rate.
No effective solution has yet been proposed for the poor performance of hot word enhancement methods in the related art.
Disclosure of Invention
The main purpose of the present application is to provide a hot word enhanced speech recognition method and device, a storage medium, and an electronic device, so as to solve the problem that hot word enhancement methods perform poorly.
To achieve the above object, according to one aspect of the present application, there is provided a hotword enhanced speech recognition method.
The hot word enhanced speech recognition method according to the application comprises the following steps: decoding the acoustic features of each frame of a speech signal according to a preset decoding network, wherein the decoding process comprises a plurality of decoding paths; judging whether a decoding path contains a hot word, wherein hot words are stored in a hot word set according to a preset rule and the hot word set is user-defined; if the decoding path contains a hot word, calculating the current hot word score according to a preset attenuation model, wherein the preset attenuation model characterizes a hot word weight that decays over time; and updating the accumulated score of the corresponding decoding path according to the current hot word score, and outputting the decoded speech recognition result.
Further, the method also includes: performing hot word enhancement processing in the preset decoding network in such a way that the hot word weight decays over time.
Further, the decay over time includes: linear decay over time or exponential decay over time.
Further, the hot word enhancement processing includes: setting an initial weight according to a hot word list that the user can upload and/or update; and segmenting the hot words in the hot word list and converting the hot word weight, that is, converting the initial weight into a score of the same order of magnitude as the linguistic score in the preset decoding network, so that the weight can decay over time during decoding.
Further, the preset attenuation model characterizes a hot word weight that decays over time as follows:
the hot word weight $w_t$ decays with time according to
$$w_t = -\lambda t + w_0 \qquad (1)$$
where $\lambda$ is the attenuation coefficient, $w_0$ is the initial weight of the current hot word, and $t$ is the relative time at which the hot word appears in the decoding path, with $t = 0$ being the time at which the hot word first appears. The weights during the decay satisfy the following relation, where $W$ is the original weight of the hot word:
$$W = w_0 + w_1 + \dots + w_T \qquad (2)$$
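Relations (1) and (2) together fix the initial weight once the attenuation coefficient and the decay horizon are chosen. The short derivation below is only a worked illustration; it assumes that $t$ takes the integer values $0, 1, \dots, T$ (for example decoding-step indices), which the text does not state explicitly.

```latex
\begin{aligned}
W &= \sum_{t=0}^{T} w_t
   = \sum_{t=0}^{T} \left( w_0 - \lambda t \right)
   = (T+1)\, w_0 - \lambda\, \frac{T(T+1)}{2}, \\
w_0 &= \frac{W}{T+1} + \frac{\lambda T}{2}.
\end{aligned}
```

For instance, with $W = 3$, $T = 2$ and $\lambda = 0.5$ this gives $w_0 = 1.5$, so the per-step weights are 1.5, 1.0 and 0.5, which sum to 3.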
Further, decoding the acoustic features of each frame of the speech signal according to the preset decoding network, wherein the decoding process comprises a plurality of decoding paths, includes: during decoding of the target audio, calculating the acoustic score and the linguistic score of each decoding path for the current frame, and updating the accumulated score of that decoding path. If the decoding path contains a hot word, calculating the current hot word score according to the preset attenuation model, wherein the preset attenuation model characterizes a hot word weight that decays over time, includes: calculating the current hot word score according to the preset attenuation model, and setting for the path a hot word matching flag and the relative time of the hot word match; and adding the hot word score to the accumulated score of the path until decoding of the target audio is finished, and then backtracking the decoding path to obtain the speech recognition result.
Further, the method also includes: the speech recognition process is carried out on the basis of the decoding network in combination with the Viterbi algorithm and a pruning strategy.
To achieve the above object, according to another aspect of the present application, there is provided a hotword enhanced speech recognition apparatus.
The hot word enhanced speech recognition device according to the present application includes: the feature decoding module is used for decoding the acoustic features of each frame of voice signal according to a preset decoding network, wherein the decoding process comprises a plurality of decoding paths; the judging module is used for judging whether the decoding path contains a hot word or not, wherein the hot word is stored in a hot word set according to a preset rule, and the hot word set is obtained according to user definition; the hot word score calculating module is used for calculating the current hot word score according to a preset attenuation model if the decoding path contains the hot words, wherein the preset attenuation model is used for representing the weight of the hot words attenuated according to time; and the output module is used for updating the accumulated score of the corresponding decoding path according to the current hotword score and outputting a decoded voice recognition result.
In order to achieve the above object, according to another aspect of the present application, there is also provided a storage medium having a computer program stored therein, wherein the computer program is configured to perform the steps in any of the above method embodiments when executed.
In order to achieve the above object, according to yet another aspect of the present application, there is also provided an electronic device comprising a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
In the embodiments of the present application, the acoustic features of each frame of the speech signal are decoded according to a preset decoding network, where the decoding process comprises a plurality of decoding paths. It is judged whether a decoding path contains a hot word, where hot words are stored in a hot word set according to a preset rule and the hot word set is user-defined. If the decoding path contains a hot word, the current hot word score is calculated according to a preset attenuation model that characterizes a hot word weight decaying over time. The accumulated score of the corresponding decoding path is then updated according to the current hot word score and the decoded speech recognition result is output. This achieves the technical effect that adding the hot word score enhances the hot words while effectively avoiding pruning errors caused by the sudden introduction of a hot word score, and thereby solves the technical problem that hot word enhancement methods perform poorly.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a schematic diagram of a hardware structure of a hot word enhanced speech recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a method of hotword enhanced speech recognition according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a hotword enhanced speech recognition device according to an embodiment of the present application;
FIG. 4 is a process trend diagram for hotword enhanced decoding according to an embodiment of the present application;
FIG. 5 is a flow chart illustrating a method for hot word enhanced speech recognition according to a preferred embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the application herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the present application and its embodiments, and are not used to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.
Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as appropriate.
Furthermore, the terms "mounted," "disposed," "provided," "connected," and "coupled" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements or components. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in FIG. 1, the method decodes the acoustic features of each frame of the speech signal according to a preset decoding network; judges whether a decoding path contains a hot word; if so, calculates the current hot word score according to a preset attenuation model; and updates the accumulated score of the corresponding decoding path according to the current hot word score and outputs the decoded speech recognition result.
As shown in FIG. 2, the method includes the following steps S201 to S204:
Step S201, decoding the acoustic features of each frame of the speech signal according to a preset decoding network, wherein the decoding process comprises a plurality of decoding paths;
Step S202, judging whether a decoding path contains a hot word, wherein hot words are stored in a hot word set according to a preset rule and the hot word set is user-defined;
Step S203, if the decoding path contains a hot word, calculating the current hot word score according to a preset attenuation model, wherein the preset attenuation model characterizes a hot word weight that decays over time;
Step S204, updating the accumulated score of the corresponding decoding path according to the current hot word score, and outputting the decoded speech recognition result.
From the above description, it can be seen that the present application achieves the following technical effects:
The acoustic features of each frame of the speech signal are decoded over a plurality of decoding paths according to a preset decoding network. During decoding, it is judged whether a decoding path contains a hot word, where hot words are stored in a hot word set according to a preset rule and the hot word set is user-defined. If the decoding path contains a hot word, the current hot word score is calculated according to a preset attenuation model that characterizes a hot word weight decaying over time. The accumulated score of the corresponding decoding path is updated according to the current hot word score and the decoded speech recognition result is output. This achieves the technical effect that adding the hot word score enhances the hot words while effectively avoiding pruning errors caused by the sudden introduction of a hot word score, and thereby solves the technical problem that hot word enhancement methods perform poorly.
In step S201, the acoustic features of each frame of the speech signal are decoded according to a preset decoding network; the decoding process of speech recognition is carried out on the basis of the decoding network in combination with the Viterbi algorithm and a pruning strategy. After acoustic features are extracted from each frame of the speech signal, they are input into the preset decoding network.
Preferably, in this embodiment, the speech recognition process is carried out on the basis of the decoding network in combination with the Viterbi algorithm and a pruning strategy.
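To make the interaction between pruning and hot word scores concrete, the following is a minimal sketch of beam pruning over accumulated path scores. The function name `beam_prune`, the beam width and all score values are illustrative assumptions, not taken from the application; the application only states that a pruning strategy is used.

```python
from typing import Dict


def beam_prune(path_scores: Dict[str, float], beam: float) -> Dict[str, float]:
    """Keep only the paths whose score lies within `beam` of the best score."""
    if not path_scores:
        return {}
    best = max(path_scores.values())
    return {name: s for name, s in path_scores.items() if s >= best - beam}


# Two competing partial paths with log-domain scores (higher is better).
scores = {"correct": -10.0, "hotword": -12.0}

# A moderate hot word bonus (+3) keeps both hypotheses inside the beam.
survivors_moderate = beam_prune({**scores, "hotword": scores["hotword"] + 3.0}, beam=5.0)

# A large one-shot bonus (+9) pushes the correct path outside the beam.
survivors_large = beam_prune({**scores, "hotword": scores["hotword"] + 9.0}, beam=5.0)

print(sorted(survivors_moderate))  # ['correct', 'hotword']
print(sorted(survivors_large))     # ['hotword']
```

In this toy setting a large score added all at once removes the competing path before later frames can rescue it, which is the kind of premature pruning that spreading the weight over time is meant to reduce.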
As an optional implementation, the decoding process includes a plurality of decoding paths, and a score is kept for each decoding path.
In a preferred embodiment, during decoding, the acoustic score and the linguistic score of the decoding unit of each active path in the current frame are calculated, and the accumulated score of the path is updated.
In step S202, it is judged whether the decoding path contains a hot word. For the decoding unit of each path at each time step, it is judged whether its output contains a hot word in the hot word set; if so, the hot word score is calculated according to the preset hot word weight attenuation model.
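The check in step S202 can be sketched as a suffix match between a path's output sequence and the segmented hot words. This is only one possible reading: the application does not specify the "preset rule" for storing hot words, so the tuple-of-units representation and the function `match_hotword` below are hypothetical.

```python
from typing import Optional, Sequence, Set, Tuple

Hotword = Tuple[str, ...]


def match_hotword(outputs: Sequence[str], hotwords: Set[Hotword]) -> Optional[Hotword]:
    """Return the longest hot word that the path's output sequence ends with."""
    best: Optional[Hotword] = None
    for hw in hotwords:
        n = len(hw)
        # Compare the last n output units of the path with the hot word units.
        if 0 < n <= len(outputs) and tuple(outputs[-n:]) == hw:
            if best is None or n > len(best):
                best = hw
    return best


hotword_set = {("Beijing",), ("hot", "word")}
print(match_hotword(["I", "love", "Beijing"], hotword_set))  # ('Beijing',)
print(match_hotword(["a", "hot", "word"], hotword_set))      # ('hot', 'word')
print(match_hotword(["I", "love"], hotword_set))             # None
```

A linear scan is enough for a short user list; a prefix tree would be the usual choice for a large hot word set.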
As an optional implementation, the hot word set is user-defined.
As a preferred embodiment, the user may upload and update the hot word list in a user-defined manner, and set an initial weight for the uploaded hot word list.
In step S203, if it is judged that the decoding path contains a hot word, the current hot word score is calculated according to the preset attenuation model. If the decoding path does not contain a hot word, the score of each path is updated directly.
Preferably, in this embodiment, the method further includes: performing hot word enhancement processing in the preset decoding network in such a way that the hot word weight decays over time.
As an alternative embodiment, the preset attenuation model characterizes a hot word weight that decays over time, and the decay may be linear or exponential.
In a preferred embodiment, during decoding, the weight of each hot word unit decays continuously over time according to the preset attenuation model; that is, the effect of hot word enhancement on the decoding process changes according to a fixed rule as time increases.
In step S204, the accumulated score of the corresponding decoding path is updated according to the current hot word score, and the decoded speech recognition result is output. The score of a decoding path comprises several components, such as an acoustic score and a language score.
Preferably, in this embodiment, the decay over time includes: linear decay over time or exponential decay over time.
Preferably, in this embodiment, the hot word enhancement processing includes: setting an initial weight according to a hot word list that the user can upload and/or update; and segmenting the hot words in the hot word list and converting the hot word weight, that is, converting the initial weight into a score of the same order of magnitude as the linguistic score in the preset decoding network, so that the weight can decay over time during decoding.
Preferably, in this embodiment, the preset attenuation model characterizes a hot word weight that decays over time as follows. The hot word weight $w_t$ decays with time according to
$$w_t = -\lambda t + w_0 \qquad (1)$$
where $\lambda$ is the attenuation coefficient, $w_0$ is the initial weight of the current hot word, and $t$ is the relative time at which the hot word appears in the decoding path, with $t = 0$ being the time at which the hot word first appears. The weights during the decay satisfy the following relation, where $W$ is the original weight of the hot word:
$$W = w_0 + w_1 + \dots + w_T \qquad (2)$$
The user can upload and update the hot word list and set an initial weight for the uploaded hot word list. The hot word module then segments the hot words uploaded by the user and converts the hot word weight: the weight set by the user (for example 3, within the range [-5, 5]) is converted into a score of the same order of magnitude as the language score in the decoding network, and this converted weight is the original weight $W$ of the hot word unit.
Further, during decoding, the weight of each hot word unit decays continuously over time; that is, the effect of hot word enhancement on the decoding process changes according to a fixed rule as time increases. For example, for the hot word unit "Beijing", the hot word weight $w_t$ decays over time according to equation (1) above.
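The preprocessing of a user hot word (segmentation plus weight conversion) might look like the sketch below. The application gives neither the conversion formula nor the segmentation algorithm, so the linear factor `lm_scale`, the greedy longest-match segmenter and the toy vocabulary are all assumptions made purely for illustration; only the user range [-5, 5] comes from the text.

```python
from typing import List, Set

USER_MIN, USER_MAX = -5.0, 5.0  # user weight range quoted in the text


def to_decoder_score(user_weight: float, lm_scale: float = 2.0) -> float:
    """Map a user weight to a score comparable to language scores.

    `lm_scale` is a hypothetical factor; the source does not give the mapping.
    """
    if not USER_MIN <= user_weight <= USER_MAX:
        raise ValueError(f"user weight {user_weight} outside [{USER_MIN}, {USER_MAX}]")
    return user_weight * lm_scale


def segment(hotword: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation into decoding units (assumed method)."""
    units: List[str] = []
    i = 0
    while i < len(hotword):
        for j in range(len(hotword), i, -1):
            if hotword[i:j] in vocab:
                units.append(hotword[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: fall back to a single character.
            units.append(hotword[i])
            i += 1
    return units


toy_vocab = {"bei", "jing"}
print(segment("beijing", toy_vocab))  # ['bei', 'jing']
print(to_decoder_score(3.0))          # 6.0
```

Rejecting out-of-range weights early keeps a mistyped entry in the uploaded list from distorting every path score during decoding.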
Preferably, in this embodiment, decoding the acoustic features of each frame of the speech signal according to the preset decoding network, wherein the decoding process comprises a plurality of decoding paths, includes: during decoding of the target audio, calculating the acoustic score and the linguistic score of each decoding path for the current frame, and updating the accumulated score of that decoding path. If the decoding path contains a hot word, calculating the current hot word score according to the preset attenuation model, wherein the preset attenuation model characterizes a hot word weight that decays over time, includes: calculating the current hot word score according to the preset attenuation model, and setting for the path a hot word matching flag and the relative time of the hot word match; and adding the hot word score to the accumulated score of the path until decoding of the target audio is finished, and then backtracking the decoding path to obtain the speech recognition result.
In a specific implementation, during decoding of the target audio, the acoustic score and the linguistic score of each decoding path for the current frame are calculated, and the accumulated score of the decoding path is updated. The current hot word score is calculated according to the preset attenuation model, and a hot word matching flag and the relative time of the hot word match are set for the path. The hot word score is added to the accumulated score of the path until decoding of the target audio is finished, and the decoding path is then backtracked to obtain the speech recognition result.
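The per-frame bookkeeping described above can be sketched for a single path as follows. This is a simplified illustration under stated assumptions: the `Path` class, the `step` function, the single-unit hot word lookup and the `horizon` parameter (the last relative time T at which a bonus is still added) are hypothetical names, and a real decoder would run this over many paths with pruning and backtracking.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Path:
    outputs: List[str] = field(default_factory=list)
    score: float = 0.0             # accumulated score of the path
    hotword_matched: bool = False  # hot word matching flag
    hotword_time: int = 0          # relative time t since the match
    w0: float = 0.0                # initial weight of the matched hot word


def step(path: Path, acoustic: float, linguistic: float, emitted: Optional[str],
         hotword_w0: Dict[str, float], lam: float, horizon: int) -> None:
    """Advance one frame: add acoustic and linguistic scores, then any hot word bonus."""
    path.score += acoustic + linguistic
    if emitted is not None:
        path.outputs.append(emitted)
        if emitted in hotword_w0:
            # New match: set the flag and restart the relative time at t = 0.
            path.hotword_matched = True
            path.hotword_time = 0
            path.w0 = hotword_w0[emitted]
    if path.hotword_matched:
        # Linear decay w_t = -lam * t + w0, as in equation (1).
        path.score += path.w0 - lam * path.hotword_time
        path.hotword_time += 1
        if path.hotword_time > horizon:
            path.hotword_matched = False


path = Path()
frames = [None, "Beijing", None, None, None]  # unit emitted at each frame, if any
for unit in frames:
    step(path, acoustic=-1.0, linguistic=-0.5, emitted=unit,
         hotword_w0={"Beijing": 1.5}, lam=0.5, horizon=2)

print(path.score)  # -4.5
```

Here the bonus is added as 1.5, 1.0 and 0.5 over three consecutive frames rather than as a single jump of 3.0, so the total enhancement is the same while each per-frame change stays small.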
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than here.
According to an embodiment of the present application, there is also provided a hotword enhanced speech recognition apparatus for implementing the above method, as shown in fig. 3, the apparatus including:
a feature decoding module 301, configured to decode the acoustic features of each frame of speech signal according to a preset decoding network, where the decoding process includes multiple decoding paths;
a judging module 302, configured to judge whether the decoding path includes a hot word, where the hot word is stored in a hot word set according to a preset rule, where the hot word set is obtained according to user definition;
a hotword score calculating module 303, configured to calculate a current hotword score according to a preset attenuation model if it is determined that the decoding path includes the hotword, where the preset attenuation model is used to represent a weight of the hotword attenuated according to time;
and the output module 304 is configured to update the cumulative score of the corresponding decoding path according to the current hotword score, and output a decoded voice recognition result.
The feature decoding module 301 decodes the acoustic features of each frame of the speech signal according to a preset decoding network; the decoding process of speech recognition is carried out on the basis of the decoding network in combination with the Viterbi algorithm and a pruning strategy. After acoustic features are extracted from each frame of the speech signal, they are input into the preset decoding network.
Preferably, in this embodiment, the speech recognition process is carried out on the basis of the decoding network in combination with the Viterbi algorithm and a pruning strategy.
As an optional implementation, the decoding process includes a plurality of decoding paths, and a score is kept for each decoding path.
In a preferred embodiment, during decoding, the acoustic score and the linguistic score of the decoding unit of each active path in the current frame are calculated, and the accumulated score of the path is updated.
The judging module 302 judges whether the decoding path contains a hot word. For the decoding unit of each path at each time step, it is judged whether its output contains a hot word in the hot word set; if so, the hot word score is calculated according to the preset hot word weight attenuation model.
As an optional implementation, the hot word set is user-defined.
As a preferred embodiment, the user may upload and update the hot word list in a user-defined manner, and set an initial weight for the uploaded hot word list.
If it is judged that the decoding path contains a hot word, the hot word score calculation module 303 calculates the current hot word score according to the preset attenuation model. If the decoding path does not contain a hot word, the score of each path is updated directly.
Preferably, in this embodiment, hot word enhancement processing is performed in the preset decoding network in such a way that the hot word weight decays over time.
As an alternative embodiment, the preset attenuation model characterizes a hot word weight that decays over time, and the decay may be linear or exponential.
As a preferred embodiment, during decoding, the weight of each hot word unit decays continuously over time according to the preset attenuation model; that is, the effect of hot word enhancement on the decoding process changes according to a fixed rule as time increases.
The output module 304 updates the accumulated score of the corresponding decoding path according to the current hot word score, and outputs the decoded speech recognition result. The score of a decoding path comprises several components, such as an acoustic score and a language score.
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present application is not limited to any specific combination of hardware and software.
To make the flow of the above hot word enhanced speech recognition method easier to understand, the technical solution is explained below with reference to preferred embodiments, which are not intended to limit the technical solutions of the embodiments of the present invention.
In the hot word enhanced speech recognition method of the embodiments of the present application, a simple mathematical model is established in which the hot word weight decays gradually over time. The point of this model is that an added hot word influences the decoding process gradually over time: on the one hand, adding the hot word score achieves the enhancement effect; on the other hand, pruning errors caused by the sudden introduction of a hot word score are effectively avoided.
As shown in Fig. 4, during decoding the weight of each hotword unit attenuates continuously over time; that is, the hotword enhancement influences the decoding process according to a fixed rule as time passes. Fig. 4 is an example in which the hotword weight decays linearly. For a hotword unit such as "Beijing", the hotword weight $w_t$ decays over time as follows:
$$w_t = -\lambda t + w_0 \tag{1}$$
where $\lambda$ is the attenuation coefficient, $w_0$ is the initial weight, and $t$ is the relative time at which the hotword unit appears in the decoding path, with $t = 0$ being the time of first appearance. The weights during the attenuation satisfy the following relation, where $W$ is the original weight of the hotword:
$$W = w_0 + w_1 + \dots + w_T \tag{2}$$
Linear decay is only a simple example of how the hotword weight may decay; in practical applications, other decay modes, such as exponential decay, may be used.
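Taken together, formulas (1) and (2) determine the initial weight once the original weight, the attenuation coefficient, and the number of steps are chosen. Substituting (1) into (2) gives W = (T+1)·w_0 − λ·T(T+1)/2, so w_0 = W/(T+1) + λT/2. The sketch below works this out numerically. It assumes that T is the last relative time step at which the weight is applied and that no weight is clamped at zero; the application does not define T explicitly, and the function names are hypothetical.

```python
def initial_weight(total_weight: float, lam: float, T: int) -> float:
    """Solve formula (2) for w_0 under the linear law of formula (1).

    W = sum_{t=0}^{T} (w_0 - lam * t) = (T + 1) * w_0 - lam * T * (T + 1) / 2
    """
    return total_weight / (T + 1) + lam * T / 2


def linear_schedule(total_weight: float, lam: float, T: int) -> list:
    """Per-step weights w_0 .. w_T whose sum equals the original weight W."""
    w0 = initial_weight(total_weight, lam, T)
    return [w0 - lam * t for t in range(T + 1)]


if __name__ == "__main__":
    # Example: W = 6, lam = 1, T = 2.
    weights = linear_schedule(6.0, 1.0, 2)
    print(weights, sum(weights))
```

With W = 6, λ = 1 and T = 2, the schedule is 3, 2, 1, which sums back to the original weight of 6.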
Fig. 5 is a schematic flow chart of the hotword-enhanced speech recognition method, which includes:
Step S501: start.
Step S502: determine whether the audio has ended.
Step S503: extract the acoustic features of the speech.
Step S504: input the features into the decoding network.
Step S505: determine whether the output of the current decoding path contains a hotword.
Step S506: update the accumulated score of the decoding path: history score + acoustic score + linguistic score.
Step S507: update the relative time of the hotword match on the path.
Step S508: calculate the hotword score and update the decoding path score, that is, add the hotword score to the accumulated path score.
Step S509: backtrack the decoding path to obtain the best recognition result.
Step S510: end.
It should be noted that the preset decoding network is a recognition resource, such as a weighted finite-state transducer (WFST), constructed from an acoustic model, a language model, and a pronunciation dictionary.
Specifically, as shown in Fig. 5, the speech recognition decoding process is carried out on the decoding network using the Viterbi algorithm combined with a pruning strategy. After the acoustic features of each frame of the speech signal are extracted, they are input into the decoding network. During decoding, the acoustic score and the linguistic score of the decoding unit on every active path of the current frame are calculated, and the accumulated score of each path is updated. For the decoding unit of each path at each moment, it is determined whether its output contains a hotword from the hotword set. If so, the hotword score is calculated according to the hotword weight attenuation model, for example according to formula (1), a hotword matching flag and the relative time of the hotword match are set for the path, and the hotword score is added to the accumulated score of the path. This process is repeated until the audio has been fully decoded, after which the decoding path is backtracked to obtain the final recognition result.
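The per-frame loop described above can be illustrated with a toy beam search. This is a deliberately simplified sketch, not the decoder of the application: it does not build or traverse a WFST, each frame is given directly as a hypothetical table of candidate output units with log-domain (acoustic, linguistic) scores, and the relative time is assumed to be the number of frames since the path's first hotword match. The names `Path`, `hotword_score`, and `decode` and the default parameter values are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Path:
    units: tuple = ()                  # output units so far (kept instead of backpointers)
    score: float = 0.0                 # accumulated path score, higher is better
    first_match: Optional[int] = None  # frame index of the first hotword match (the match flag)


def hotword_score(w0: float, lam: float, t: int) -> float:
    """Linear law of formula (1); the clamp at zero is an assumption."""
    return max(w0 - lam * t, 0.0)


def decode(frames, hotwords, w0=2.0, lam=0.5, beam=3):
    """Toy beam search: frames is a list of {unit: (acoustic, linguistic)} dicts."""
    paths = [Path()]
    for i, candidates in enumerate(frames):
        extended = []
        for p in paths:
            for unit, (acoustic, linguistic) in candidates.items():
                # History score + acoustic score + linguistic score.
                score = p.score + acoustic + linguistic
                first = p.first_match
                if unit in hotwords:
                    # Set the match flag and relative time, then add the hotword score.
                    if first is None:
                        first = i
                    score += hotword_score(w0, lam, i - first)
                extended.append(Path(p.units + (unit,), score, first))
        # Pruning strategy: keep only the best `beam` paths.
        extended.sort(key=lambda q: q.score, reverse=True)
        paths = extended[:beam]
    # The best surviving path stands in for backtracking.
    return paths[0]


if __name__ == "__main__":
    frames = [
        {"north": (-1.0, -0.5), "Beijing": (-1.2, -0.6)},
        {"station": (-0.5, -0.5)},
    ]
    print(decode(frames, set()).units)
    print(decode(frames, {"Beijing"}).units)
```

In this toy input, "Beijing" scores slightly worse than "north" on acoustic and linguistic evidence alone, so it wins only when it is listed as a hotword. A real decoder would apply the same score update to tokens on a WFST rather than to flat candidate tables.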
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for hotword enhanced speech recognition, comprising:
decoding the acoustic characteristics of each frame of voice signal according to a preset decoding network, wherein the decoding process comprises a plurality of decoding paths;
judging whether the decoding path contains a hot word or not, wherein the hot word is stored in a hot word set according to a preset rule, and the hot word set is obtained according to user definition;
if the hot words are contained in the decoding path, calculating the score of the current hot words according to a preset attenuation model, wherein the preset attenuation model is used for representing the weight of the hot words attenuated according to time;
and updating the corresponding accumulated score of the decoding path according to the current hotword score, and outputting a decoded voice recognition result.
2. The method of claim 1, further comprising:
and performing hot word enhancement processing in the preset decoding network according to the mode that the hot word weight is attenuated along with time.
3. The method of claim 2, wherein the time-attenuating comprises: linear decay over time, exponential decay over time.
4. The method of claim 2, wherein the hotword enhancement process comprises:
setting an initial weight according to a hotword list which can be uploaded and/or updated by a user;
and segmenting the hotwords in the hotword list, and converting the initial weight into a score of the same order of magnitude as the linguistic score in the preset decoding network, so that the weight can be attenuated over time during decoding.
5. The method of claim 1, wherein the preset decay model is used to characterize hotword weights decaying in time, comprising:
the hotword weight $w_t$ decays over time as follows:
$$w_t = -\lambda t + w_0 \tag{1}$$
where $\lambda$ is the attenuation coefficient, $w_0$ is the initial weight of the current hotword, $t$ is the relative time at which the hotword appears in the decoding path, $t = 0$ is the time of the first occurrence, and the weights during the attenuation satisfy the following relation, where $W$ is the original weight of all the hotwords:
$$W = w_0 + w_1 + \dots + w_T. \tag{2}$$
6. The method of claim 1, wherein
decoding the acoustic characteristics of each frame of voice signal according to the preset decoding network, the decoding process comprising a plurality of decoding paths, comprises:
in the decoding process of the target audio, calculating the acoustic score and the linguistic score of the decoding path for the current frame, and updating the accumulated score of the decoding path;
and calculating the current hotword score according to the preset attenuation model if the decoding path contains the hotword, the preset attenuation model being used to characterize the hotword weight attenuating over time, comprises:
calculating the current hotword score according to the preset attenuation model, and setting a hotword matching flag and the relative time of the hotword match for the path;
and adding the hotword score to the accumulated score of the path until decoding of the target audio is finished, and then backtracking the decoding path to obtain the speech recognition result.
7. The method of claim 6, further comprising:
the voice recognition process is based on the decoding network and is carried out by combining a Viterbi algorithm and a pruning strategy.
8. A hotword-enhanced speech recognition device, comprising:
the characteristic decoding module is used for decoding the acoustic characteristics of each frame of voice signal according to a preset decoding network, wherein the decoding process comprises a plurality of decoding paths;
the judging module is used for judging whether the decoding path contains a hot word or not, wherein the hot word is stored in a hot word set according to a preset rule, and the hot word set is obtained according to user definition;
the hot word score calculating module is used for calculating the current hot word score according to a preset attenuation model if the decoding path contains the hot words, wherein the preset attenuation model is used for representing the weight of the hot words attenuated according to time;
and the output module is used for updating the accumulated score of the corresponding decoding path according to the current hotword score and outputting a decoded voice recognition result.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 7 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210658379.XA CN115132187A (en) 2022-06-10 2022-06-10 Hot word enhanced speech recognition method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210658379.XA CN115132187A (en) 2022-06-10 2022-06-10 Hot word enhanced speech recognition method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN115132187A true CN115132187A (en) 2022-09-30

Family

ID=83377435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210658379.XA Pending CN115132187A (en) 2022-06-10 2022-06-10 Hot word enhanced speech recognition method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN115132187A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437909A (en) * 2023-12-20 2024-01-23 慧言科技(天津)有限公司 Speech recognition model construction method based on hotword feature vector self-attention mechanism
CN117437909B (en) * 2023-12-20 2024-03-05 慧言科技(天津)有限公司 Speech recognition model construction method based on hotword feature vector self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination