CN113113024A - Voice recognition method and device, electronic equipment and storage medium - Google Patents
- Publication number: CN113113024A (application number CN202110474762.5A)
- Authority
- CN
- China
- Prior art keywords
- user
- decoding
- voice
- preset state
- recognized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING; G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The invention provides a speech recognition method and apparatus, an electronic device, and a storage medium. The method comprises: determining the speech to be recognized of a user; and performing speech recognition decoding on the speech to be recognized based on a preset state transition path to obtain a speech recognition result, wherein the preset state transition path is obtained by expansion based on the region information and/or historical input information of the user. The provided method, apparatus, device, and medium dynamically expand the preset state transition paths in the decoding network based on the current user's region information and/or historical input information, so that the user's speech is decoded over the expanded paths. Using the user's personalized information in this way improves the accuracy of personalized speech recognition, and dynamically expanding the preset state transition paths enhances practicability.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence technology, speech recognition is widely applied in interactive fields such as smart homes and intelligent robots. As the number of users of speech recognition keeps growing, the differences in pronunciation habits among users become significant, and a universal speech recognition method cannot achieve a good recognition effect for every user.
To realize personalized speech recognition for each user and improve recognition accuracy, existing speech recognition methods usually construct a personalized speech recognition system for a given user from a large amount of that user's historical speech data. However, the optimization effect of this approach is limited, deployment and maintenance are difficult, and practicability is poor.
Disclosure of Invention
The invention provides a speech recognition method and apparatus, an electronic device, and a storage medium, to overcome the defects of poor speech recognition optimization effect and poor practicability in the prior art.
The invention provides a speech recognition method, which comprises the following steps:
determining the speech to be recognized of a user;
performing speech recognition decoding on the speech to be recognized based on a preset state transition path to obtain a speech recognition result; the preset state transition path is obtained by expansion based on the region information and/or historical input information of the user.
According to the speech recognition method provided by the invention, performing speech recognition decoding on the speech to be recognized based on a preset state transition path comprises:
determining a phoneme sequence corresponding to the speech to be recognized;
and decoding the phoneme sequence at the current decoding position based on the preset state transition path corresponding to the previous decoding position to obtain a decoding result at the current decoding position.
According to the speech recognition method provided by the invention, the preset state transition path is expanded based on the following steps:
determining region nouns associated with the region information of the user;
and expanding the preset state transition paths corresponding to place names in the decoding network based on the region nouns.
According to the speech recognition method provided by the invention, the preset state transition path is expanded based on the following steps:
determining similar hotwords corresponding to the phoneme sequence at the current decoding position based on the historical input information of the user;
and expanding the preset state transition path corresponding to the previous decoding position based on the similar hotwords.
According to a speech recognition method provided by the present invention, determining similar hotwords corresponding to the phoneme sequence at the current decoding position based on the historical input information of the user comprises:
determining a similar phoneme sequence corresponding to the phoneme sequence at the current decoding position based on that phoneme sequence and a pre-constructed pronunciation similarity matrix;
determining similar hotwords corresponding to the phoneme sequence and/or the similar phoneme sequence at the current decoding position based on the hotwords of the user, wherein the hotwords are determined based on the historical input information.
According to a speech recognition method provided by the present invention, performing speech recognition decoding on the speech to be recognized based on a preset state transition path comprises:
performing speech recognition decoding on the speech to be recognized based on a language model combined with the preset state transition path;
wherein the language model corresponds to the device type currently used by the user, and the language model corresponding to any device type is trained on application-scenario texts of that device type.
According to a speech recognition method provided by the present invention, performing speech recognition decoding on the speech to be recognized based on a preset state transition path comprises:
determining voiceprint features of the user;
and performing speech recognition decoding on the speech to be recognized based on the preset state transition path, combining the audio features of the speech to be recognized with the voiceprint features of the user.
The present invention also provides a speech recognition apparatus, comprising:
a speech data determining unit, configured to determine the speech to be recognized of a user;
and a speech recognition decoding unit, configured to perform speech recognition decoding on the speech to be recognized based on a preset state transition path to obtain a speech recognition result, wherein the preset state transition path is obtained by expansion based on the region information and/or historical input information of the user.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of any of the speech recognition methods described above when executing the program.
The invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the speech recognition methods described above.
The speech recognition method and apparatus, electronic device, and storage medium provided by the invention dynamically expand the preset state transition paths in the decoding network based on the region information and/or historical input information of the current user, and then decode the user's speech to be recognized over the expanded paths. Exploiting the user's personalized information improves the accuracy of personalized speech recognition, and dynamically expanding the preset state transition paths enhances practicability.
Drawings
To illustrate the technical solutions of the present invention or the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a speech recognition method according to the present invention;
FIG. 2 is a flowchart illustrating a path expansion method according to the present invention;
FIG. 3 is a schematic diagram of a geographical information extended path according to the present invention;
FIG. 4 is a second flowchart of the path expansion method according to the present invention;
FIG. 5 is a schematic diagram of a similar hotword expansion path provided by the present invention;
FIG. 6 is a flowchart illustrating a similar hotword determining method according to the present invention;
FIG. 7 is a schematic diagram of language model selection provided by the present invention;
FIG. 8 is a schematic diagram of a speech recognition system according to the present invention;
FIG. 9 is a schematic structural diagram of a speech recognition apparatus according to the present invention;
FIG. 10 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort fall within the protection scope of the present invention.
With the rapid development of the artificial intelligence industry, speech recognition technology is widely applied in interactive fields such as smart homes and intelligent robots. Speech is one of the most convenient interaction modes, and its recognition is an important link in human-computer interaction, so technologies related to speech recognition are being continuously innovated. As the number of speech users grows, the differences in pronunciation habits among users become more and more obvious; in this situation, the traditional approach of performing speech recognition with a single universal system cannot achieve good recognition accuracy for all users, because the universal system must cover ever more users and scenarios.
Therefore, how to use the personalized information of each user to make the speech recognition system more targeted, and thereby improve recognition accuracy for each user, has become an important research direction in the speech recognition field. Existing personalized speech recognition methods generally construct a personalized system for each user from a large amount of that user's historical speech data. For a new user, however, the lack of historical data makes it difficult to construct a reliable recognition system, which limits the personalization gain; for existing users, the amount of historical speech varies greatly from user to user, and each user needs an individually customized and stored set of recognition models (such as the acoustic models in a traditional hidden-Markov-model-based recognition system, or Encode-Decode models), so deployment and maintenance are difficult and practicability is poor.
Therefore, an embodiment of the invention provides a speech recognition method that effectively performs personalized enhancement of speech recognition and improves recognition accuracy. Fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention; as shown in Fig. 1, the method includes:
Specifically, the speech to be recognized of a user is acquired. The speech to be recognized may be speech data recorded by the user in real time through an electronic device, or stored or received speech data; the embodiment of the present invention does not specifically limit this.
Then, speech recognition decoding is performed on the speech to be recognized using preset state transition paths to obtain a speech recognition result. A preset state transition path may be a path between any two adjacent nodes in the decoding network. Here, the decoding network serves as a search space in which an optimal path from the initial node to the termination node is found, thereby decoding the speech to be recognized. Specifically, each speech frame of the speech to be recognized may be converted into a state sequence or phoneme sequence by an acoustic model, and that sequence then mapped to a word sequence based on the decoding network; alternatively, an end-to-end recognition model, such as an Encode-Decode model, converts the speech to be recognized into a character sequence, which is mapped to a word sequence based on the decoding network. The decoding network may be constructed from knowledge sources such as an acoustic model, a pronunciation dictionary, and a language model, and may be built, for example, as a weighted finite-state transducer (WFST); the embodiment of the present invention does not specifically limit this.
During speech recognition decoding, preset state transition paths in the decoding network are searched step by step from the initial node, and scores are computed according to the state, phoneme, or word sequence of the speech to be recognized, so as to find the optimal path. The construction of the preset state transition paths is therefore an important part of personalized speech recognition: the more closely the paths fit the current user, the higher the accuracy of the decoded recognition result.
Accordingly, when constructing the decoding network, the embodiment of the present invention not only uses knowledge sources such as an acoustic model, a pronunciation dictionary, and a language model, but also expands the preset state transition paths in the network according to the region information and/or historical input information of the user. Path expansion for the current user can be carried out on top of an existing decoding network, so the method only needs to store a small amount of personalized information for the current user and make slight changes to the existing network to achieve personalized speech recognition, which enhances practicability. For example, several new paths may be added between corresponding nodes in the decoding network according to the user's region information and/or historical input information, and the scores of the added paths computed with the language model.
The region information of the user provides the user's current position, from which destinations the user may go to next can be inferred and expanded into the decoding network. When the user navigates by voice, the expanded preset state transition paths can be searched during decoding, so the speech content can be recognized accurately even if the user's pronunciation is non-standard or the navigation destination has a hard-to-recognize name. In addition, the user's historical input information reveals the user's expression habits and helps infer the words the user is likely to say; it can therefore be refined and expanded into the decoding network to further improve recognition accuracy.
The method provided by the embodiment of the invention dynamically expands the preset state transition paths in the decoding network based on the region information and/or historical input information of the current user, and decodes the user's speech to be recognized over the expanded paths; the user's personalized information improves the accuracy of personalized speech recognition, and the dynamic expansion of the paths enhances the practicability of the method.
Based on the above embodiment, step 120 includes:
determining a phoneme sequence corresponding to the speech to be recognized;
and decoding the phoneme sequence at the current decoding position based on the preset state transition path corresponding to the previous decoding position to obtain a decoding result at the current decoding position.
Specifically, each speech frame of the speech to be recognized is recognized to obtain the corresponding phoneme sequence. Acoustic features may be extracted from each speech frame, the features recognized by an acoustic model to determine the state of each frame, and the states then combined into phonemes to form the phoneme sequence of the speech to be recognized.
Subsequently, the phoneme sequence of the speech to be recognized is decoded. During decoding, suitable preset state transition paths are searched from the initial node of the decoding network according to the phoneme sequence so as to reach the next node, and this is repeated until the termination node is reached. Suppose decoding has reached the current decoding position, so the phoneme sequence at that position needs to be decoded; correspondingly, the search in the decoding network has reached node t, and a suitable path leaving node t must be chosen to travel to the next node. At this point, the phoneme sequence at the current decoding position is decoded based on the preset state transition paths corresponding to the previous decoding position, i.e., the paths between the node the decoding has currently reached and the next nodes. Based on the scores of these paths, one preset state transition path is selected to determine the decoding result at the current decoding position.
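The per-position search described above can be sketched as a minimal decoder that, at each decoding position, picks the best-scoring preset state transition path leaving the node reached at the previous position. The network layout, phoneme labels, and scores below are illustrative assumptions, not the patent's actual decoding network:

```python
# Minimal sketch of the per-position decoding step described above.
# The graph, phoneme labels, and scores are illustrative assumptions.

def decode(graph, start, end, phonemes):
    """Follow the best-scoring preset transition path for each phoneme.

    graph maps node -> list of (phoneme, next_node, score) transition
    paths; at each decoding position we choose among the paths leaving
    the node reached at the previous position, as the method describes.
    """
    node, total, path = start, 0.0, []
    for ph in phonemes:
        candidates = [(s, nxt) for (p, nxt, s) in graph[node] if p == ph]
        if not candidates:
            return None  # no matching preset state transition path
        score, node = max(candidates)  # greedy best-path choice
        total += score
        path.append(node)
    return (total, path) if node == end else None

# Toy network recognizing the phoneme sequence ["n", "i"]:
graph = {
    "start": [("n", "a", -0.5)],
    "a": [("i", "end", -0.3), ("a", "x", -2.0)],
    "x": [],
    "end": [],
}
result = decode(graph, "start", "end", ["n", "i"])
```

A real decoder would keep multiple hypotheses (e.g. beam search over the WFST) rather than the single greedy path shown here.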
Based on any of the above embodiments, fig. 2 is a schematic flow chart of a path expansion method provided by an embodiment of the present invention, as shown in fig. 2, the method includes:
Step 220: expanding the preset state transition paths corresponding to place names in the decoding network based on the region nouns.
Specifically, region nouns associated with the user's region information are determined. Taking the position given by the region information as the center, region nouns for other places within the surrounding activity range can be obtained. For example, the names of popular places within that range may be obtained, or the names of places the user has visited may be obtained from the user's historical positioning information and/or historical navigation data.
Based on the region nouns, the preset state transition paths corresponding to place names in the decoding network can be expanded. Fig. 3 is a schematic diagram of region-information-based path expansion according to an embodiment of the present invention. As shown in the upper part of Fig. 3, a basic decoding network with a generic place-name ($location) path may be pre-constructed; besides the generic place-name path, several popular place-name paths may also be expanded. After the region nouns associated with the user's region information are obtained (e.g., Meya Photoelectric, First Research Institute), as shown in the middle part of Fig. 3, a preset state transition path is first constructed for each region noun, with all these paths connecting the same pair of nodes. Then, as shown in the lower part of Fig. 3, the path for each region noun is attached alongside the preset state transition path corresponding to the generic place name in the basic decoding network.
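A minimal sketch of this expansion, under the assumption that the decoding network is stored as an adjacency map and that the generic place-name slot is labelled `$location`; the node names are hypothetical:

```python
# Sketch of the region-noun path expansion described above: one preset
# transition path per region noun is attached between the same pair of
# nodes that carries the generic place-name ($location) path.
# The network representation and node names are illustrative assumptions.

def expand_place_paths(network, region_nouns):
    """Return a copy of the network with region-noun paths added
    alongside every $location path."""
    expanded = {node: list(paths) for node, paths in network.items()}
    for node, paths in network.items():
        for label, nxt in paths:
            if label == "$location":
                for noun in region_nouns:
                    expanded[node].append((noun, nxt))
    return expanded

# Basic network: n1 --"navigate to"--> n2 --$location--> n3
base = {"n1": [("navigate to", "n2")], "n2": [("$location", "n3")], "n3": []}
net = expand_place_paths(base, ["Meya Photoelectric", "First Research Institute"])
```

The base network is left untouched, matching the idea that the per-user expansion is layered on an existing shared decoding network.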
The method provided by the embodiment of the invention expands the preset state transition paths corresponding to place names in the decoding network using the region nouns associated with the user's region information; this personalized expansion of the paths helps improve the accuracy of speech recognition.
Based on any of the above embodiments, fig. 4 is a second schematic flow chart of the path expansion method provided by the embodiment of the present invention, as shown in fig. 4, the method includes:
Specifically, the historical input information of the user reveals the user's language habits, such as words or phrases the user often says. When the phoneme sequence at the current decoding position has the same or a similar pronunciation to some word or phrase in the historical input information, the user may be expressing that word or phrase, so it can be expanded into the decoding network. During decoding, therefore, similar hotwords corresponding to the phoneme sequence at the current decoding position are determined based on the user's historical input information, where a similar hotword is a hotword from the user's historical input whose pronunciation is the same as or similar to the phoneme sequence at the current decoding position.
Based on the obtained similar hotwords, new preset state transition paths can be generated and attached at the position of the preset state transition paths corresponding to the previous decoding position. Fig. 5 is a schematic diagram of similar-hotword path expansion according to an embodiment of the present invention. As shown in Fig. 5, the similar hotword corresponding to the phoneme sequence at the current decoding position is "cross"; a new path for this hotword is generated and inserted alongside the preset state transition paths corresponding to the previous decoding position, realizing dynamic expansion of the paths.
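The dynamic insertion step can be sketched as follows; the node names and the path score are illustrative assumptions (the patent computes the scores of added paths with a language model):

```python
# Sketch of dynamically inserting a similar-hotword path (the figure's
# example word is "cross") among the preset state transition paths of
# the previous decoding position. Node names and scores are assumptions.

def insert_hotword_path(graph, node, hotword, next_node, score):
    """Add a new preset state transition path for a similar hotword
    leaving the node reached at the previous decoding position."""
    graph.setdefault(node, []).append((hotword, next_node, score))
    return graph

# Existing path at node t, plus a dynamically inserted hotword path:
graph = {"t": [("crossing", "t1", -1.2)]}
insert_hotword_path(graph, "t", "cross", "t1", -0.9)
```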
In the method provided by the embodiment of the invention, the preset state transition paths corresponding to the previous decoding position are expanded during decoding using the similar hotwords corresponding to the phoneme sequence at the current decoding position; this personalized, dynamic expansion of the paths improves the accuracy of speech recognition.
Based on any of the above embodiments, fig. 6 is a schematic flowchart of a similar hotword determining method provided by the embodiment of the present invention, as shown in fig. 6, step 410 includes:
Specifically, the pronunciation similarity matrix may be constructed in advance from a pronunciation dictionary; as shown in Fig. 5, it stores phoneme sequences with similar pronunciations. The matrix is searched with the phoneme sequence at the current decoding position to find similar phoneme sequences whose pronunciations are close to it.
Then, from the user's hotwords, those whose pronunciations correspond to the phoneme sequence and/or a similar phoneme sequence at the current decoding position are selected as similar hotwords. Each hotword is determined from the user's historical input information: for example, the text the user has historically typed may be collected, the more frequently entered words screened out as hotwords according to input frequency, and a hotword list built for the user.
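A minimal sketch of this lookup, with a toy similarity table standing in for the pre-constructed pronunciation similarity matrix; all phoneme spellings, the frequency threshold, and the word list are illustrative assumptions:

```python
# Sketch of similar-hotword determination: a pronunciation-similarity
# table maps a phoneme sequence to similar-sounding sequences, and the
# user's hotword list (built from input-history frequency) is filtered
# by pronunciation. All entries below are illustrative assumptions.
from collections import Counter

SIMILARITY = {("zh", "i"): [("z", "i")], ("c", "u"): [("ch", "u")]}

def build_hotword_list(history, min_count=2):
    """Keep frequently entered words from the user's input history."""
    counts = Counter(history)
    return {w for w, c in counts.items() if c >= min_count}

def similar_hotwords(phonemes, hotwords, pronounce):
    """Hotwords whose pronunciation matches the current phoneme
    sequence or one of its similar sequences."""
    candidates = [phonemes] + SIMILARITY.get(phonemes, [])
    return {w for w in hotwords if pronounce[w] in candidates}

history = ["zhi", "zhi", "chu", "zi", "zi", "zi"]
hotwords = build_hotword_list(history)
pronounce = {"zhi": ("zh", "i"), "zi": ("z", "i"), "chu": ("ch", "u")}
matches = similar_hotwords(("zh", "i"), hotwords, pronounce)
```

Here "zi" is matched even though its phonemes differ from the decoded sequence, because the similarity table lists ("z", "i") as close in pronunciation to ("zh", "i").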
Based on any of the above embodiments, step 120 includes:
performing speech recognition decoding on the speech to be recognized based on a language model combined with the preset state transition path;
wherein the language model corresponds to the device type currently used by the user, and the language model corresponding to any device type is trained on application-scenario texts of that device type.
Specifically, with the popularization of various electronic devices, users use different devices in different application scenarios. For large-screen devices such as televisions, users typically use voice interaction for channel control, network video on demand, and the like; on smart speakers, for weather queries, song requests, and the like; and on in-vehicle computers, mostly for address navigation.
Therefore, the application scenarios of different device types can be determined in advance, application-scenario texts collected for each scenario, and a language model trained per device type from those texts for use in speech recognition. When the electronic device uses a hidden-Markov-model (HMM) based recognition system, the language model is a traditional one, such as an n-gram language model, which can directly replace the language model in the original recognition system; when the device uses an Encode-Decode-based recognition system, the language model may be a neural network language model whose output is fused with the recognition result of the original system by any of various fusion methods.
The device that produced the speech to be recognized (a mobile phone, in-vehicle computer, television, smart speaker, etc.) is identified so as to determine the corresponding language model, and speech recognition decoding is then performed based on this language model and the preset state transition paths. Fig. 7 is a schematic diagram of language model selection according to an embodiment of the present invention: as shown in Fig. 7, the language model corresponding to the current device is dynamically selected from the language models of the various device types according to the device ID of the input speech. The selected language model can be combined with an acoustic model, or with an Encode-Decode model, to recognize the input speech and obtain the recognition result.
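The dynamic selection in Fig. 7 can be sketched as a lookup from device ID to device type to language model; the table contents, device IDs, and model names below are illustrative assumptions:

```python
# Sketch of per-device language model selection as in Fig. 7.
# Device IDs, types, and model names are illustrative assumptions.

DEVICE_TYPE = {"dev-001": "tv", "dev-002": "speaker", "dev-003": "car"}
LM_BY_TYPE = {
    "tv": "lm_tv_ondemand",         # trained on TV-control / on-demand texts
    "speaker": "lm_speaker_query",  # trained on weather / song-request texts
    "car": "lm_car_navigation",     # trained on address-navigation texts
}

def select_language_model(device_id, default="lm_general"):
    """Pick the language model matching the device that produced the
    input speech, falling back to a general model for unknown devices."""
    return LM_BY_TYPE.get(DEVICE_TYPE.get(device_id, ""), default)
```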
According to the method provided by the embodiment of the invention, speech recognition decoding is performed on the speech to be recognized by dynamically selecting the language model corresponding to the device type currently used by the user, in combination with the preset state transition path, thereby further improving the accuracy of speech recognition.
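As an illustrative sketch (not part of the claimed invention), the dynamic selection shown in Fig. 7 can be pictured as a registry lookup keyed by device type. The registry contents, function name, and model file names below are assumptions for illustration only.

```python
# Hypothetical sketch of dynamic language-model selection by device type.
# Registry contents and file names are illustrative assumptions.
DEVICE_LM_REGISTRY = {
    "tv": "lm_tv.arpa",                  # channel control, video on demand
    "smart_speaker": "lm_speaker.arpa",  # weather queries, song requests
    "car": "lm_car.arpa",                # address navigation
}

def select_language_model(device_id: str, device_type_of: dict) -> str:
    """Map a device ID to its device type, then pick the matching language model.

    Unknown devices fall back to a general-purpose model.
    """
    device_type = device_type_of.get(device_id, "unknown")
    return DEVICE_LM_REGISTRY.get(device_type, "lm_general.arpa")
```

The selected model would then be combined with the acoustic model (HMM systems) or fused with the encoder-decoder output, as described above.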
Based on any of the above embodiments, step 120 includes:
determining the voiceprint features of the user;
and performing speech recognition decoding on the speech to be recognized based on the preset state transition path, in combination with the audio features of the speech to be recognized and the voiceprint features of the user.
Specifically, because different users have different accents and speaking styles, adaptive speech recognition can be performed according to the pronunciation characteristics of the current user, so that the recognizer adapts to the speech data of different users and recognition accuracy improves. To this end, the voiceprint feature of the current user can be acquired, where the voiceprint feature expresses the user's pronunciation characteristics and pronunciation habits. Here, an identity vector (i-vector) of the current user may be extracted as the voiceprint feature using an existing i-vector extraction model, such as one based on a universal background model (UBM). Voiceprint features extracted in this way contain speaker information, channel information, and the like, and are highly stable. Alternatively, the voiceprint feature of the user may be extracted with an x-vector model under a deep learning framework, which is not specifically limited in the embodiment of the present invention. Then, based on the preset state transition path, speech recognition decoding is performed on the speech to be recognized by combining the audio features of the speech data to be recognized with the voiceprint features of the user. The audio features carry the semantic content of the speech data; combining them with the pronunciation characteristics contained in the user's voiceprint features improves recognition accuracy for that user.
According to the method provided by the embodiment of the invention, speech recognition decoding is performed on the speech to be recognized by determining the voiceprint features of the user and combining the audio features of the speech to be recognized with those voiceprint features, thereby further improving the accuracy of speech recognition.
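A minimal sketch of how the utterance-level voiceprint feature could be combined with the frame-level audio features, assuming a simple per-frame concatenation (the function name and dimensions are illustrative; the patent does not specify the fusion mechanism):

```python
import numpy as np

def combine_features(audio_feats: np.ndarray, voiceprint: np.ndarray) -> np.ndarray:
    """Broadcast the utterance-level voiceprint (e.g. an i-vector or x-vector)
    across all frames and concatenate it with the frame-level audio features.

    audio_feats: shape (num_frames, feat_dim)
    voiceprint:  shape (vp_dim,)
    returns:     shape (num_frames, feat_dim + vp_dim)
    """
    num_frames = audio_feats.shape[0]
    tiled = np.tile(voiceprint, (num_frames, 1))  # one copy per frame
    return np.concatenate([audio_feats, tiled], axis=1)
```

The combined features would then be fed to the decoder in place of the plain audio features.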
Based on any of the above embodiments, fig. 8 is a schematic structural diagram of a speech recognition system provided in an embodiment of the present invention. As shown in fig. 8, the system may be built on an existing speech recognition model and perform recognition enhancement through a multidimensional personalized recognition enhancement module, which comprises four sub-modules: a dynamic path expansion module, a dynamic hotword excitation module, a dynamic language model selection module, and a dynamic voiceprint enhancement module.
The dynamic path expansion module is configured to expand the preset state transition paths corresponding to place names in the decoding network based on the user's regional information; the specific expansion manner is the same as in the foregoing embodiments and is not repeated here.
The dynamic hotword excitation module is configured to construct a hotword library for the user based on the user's historical input information and to perform hotword excitation based on that library. If the system is built on an HMM-based recognition model, the module may, during actual decoding, select from the hotword library a similar hotword corresponding to the phoneme sequence at the current decoding position, and use that similar hotword to expand the preset state transition path corresponding to the previous decoding position in the decoding network. If the system is built on an encoder-decoder recognition model, the module can represent each hotword as a fixed-dimension hotword encoding via a hotword encoder (bias encoder), use the state information output by the decoder at the previous decoding step to select, through an attention mechanism, the hotword encoding that matches the input speech, and feed that encoding together with the audio features of the input speech into the decoder to obtain the recognition result.
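The attention step in the encoder-decoder variant can be sketched as follows; this is a simplified dot-product attention over hotword encodings, an illustrative assumption rather than the patent's exact bias-encoder architecture:

```python
import numpy as np

def attend_hotwords(decoder_state: np.ndarray, hotword_encodings: np.ndarray) -> np.ndarray:
    """Score each fixed-dimension hotword encoding against the decoder state
    from the previous decoding step, softmax the scores, and return the
    attention-weighted hotword context that is fed into the decoder.

    decoder_state:     shape (dim,)
    hotword_encodings: shape (num_hotwords, dim)
    returns:           shape (dim,)
    """
    scores = hotword_encodings @ decoder_state   # (num_hotwords,)
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ hotword_encodings           # weighted sum of encodings
```

With a strongly matching hotword, the attention weight concentrates on its encoding, biasing the decoder toward emitting that hotword.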
The dynamic language model selection module is configured to dynamically select, based on the user's device information, the language model corresponding to the device type currently used by the user, so as to perform speech recognition decoding on the speech to be recognized.
The dynamic voiceprint enhancement module is configured to determine the voiceprint features of the user and then perform speech recognition decoding on the speech to be recognized by combining the audio features of the speech to be recognized with the voiceprint features of the user.
It should be noted that the personalized recognition enhancement modules in the speech recognition system may be used individually or in combination, so as to improve the accuracy of speech recognition.
Based on any of the above embodiments, fig. 9 is a schematic structural diagram of a speech recognition apparatus provided in an embodiment of the present invention, as shown in fig. 9, the apparatus includes: a voice data determination unit 910 and a voice recognition decoding unit 920.
The voice data determining unit 910 is configured to determine a to-be-recognized voice of a user;
the speech recognition decoding unit 920 is configured to perform speech recognition decoding on the speech to be recognized based on the preset state transition path to obtain a speech recognition result; the preset state transition path is obtained by expansion based on the regional information and/or the historical input information of the user.
The device provided by the embodiment of the invention dynamically expands the preset state transition path in the decoding network based on the regional information and/or the historical input information of the current user, so that the user's speech to be recognized is decoded on the expanded path. Exploiting the user's personalized information in this way improves the accuracy of personalized speech recognition, and the dynamic expansion of the preset state transition path enhances the practicality of the device.
Based on any of the above embodiments, the speech recognition decoding unit 920 is configured to:
determining a phoneme sequence corresponding to the speech to be recognized;
and decoding the phoneme sequence at the current decoding position based on the preset state transition path corresponding to the previous decoding position to obtain a decoding result at the current decoding position.
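The position-by-position decoding described above can be sketched as follows. The dictionary-based representation of transition paths and the `expand` callback are simplifying assumptions for illustration, not the patent's actual decoding-network format:

```python
def decode_with_dynamic_paths(phonemes, transitions, expand):
    """Decode a phoneme sequence position by position. Each phoneme is matched
    against the transition paths as they stood after the previous position;
    `expand` then updates the paths (e.g. with place names or similar hotwords)
    before the next position is decoded.

    transitions: dict mapping phoneme -> output token (toy stand-in for a
                 decoding network's state transition paths).
    expand:      callable(transitions, phoneme) -> updated transitions dict.
    """
    result = []
    for ph in phonemes:
        result.append(transitions.get(ph, ph))  # fall back to the phoneme itself
        transitions = expand(transitions, ph)   # expansion used at the next position
    return result
```

The key point mirrored from the text: the paths used at position *i* are those produced by the expansion performed at position *i − 1*.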
Based on any of the above embodiments, the apparatus further includes a first path expansion unit, configured to:
determining place names associated with the regional information of the user;
and expanding the preset state transition paths corresponding to the place names in the decoding network based on those place names.
The device provided by the embodiment of the invention expands the preset state transition paths corresponding to place names in the decoding network using the place names associated with the user's regional information. This personalized expansion of the preset state transition path helps improve the accuracy of speech recognition.
Based on any of the above embodiments, the apparatus further includes a second path expansion unit, configured to:
determining similar hotwords corresponding to the phoneme sequence at the current decoding position based on historical input information of the user;
and expanding the preset state transition path corresponding to the last decoding position based on the similar hot words.
The device provided by the embodiment of the invention expands, during decoding, the preset state transition path corresponding to the previous decoding position using the similar hotwords corresponding to the phoneme sequence at the current decoding position. This personalized dynamic expansion of the preset state transition path helps improve the accuracy of speech recognition.
Based on any of the above embodiments, determining the similar hotwords corresponding to the phoneme sequence at the current decoding position based on the historical input information of the user includes:
determining a similar phoneme sequence corresponding to the phoneme sequence at the current decoding position based on the phoneme sequence at the current decoding position and a pre-constructed pronunciation similarity matrix;
determining similar hotwords corresponding to the phoneme sequence and/or the similar phoneme sequence at the current decoding position based on the hot words of the user; the hotword is determined based on historical input information.
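A sketch of the two steps above, assuming the pronunciation similarity matrix is stored as a nested dictionary of pairwise scores and hotword pronunciations are phoneme tuples (both representations are illustrative assumptions):

```python
def similar_phoneme_sequences(seq, similarity, threshold=0.8):
    """Generate variants of `seq` by substituting single phonemes whose
    pairwise pronunciation similarity meets the threshold."""
    variants = [tuple(seq)]
    for i, ph in enumerate(seq):
        for other, score in similarity.get(ph, {}).items():
            if other != ph and score >= threshold:
                variants.append(tuple(seq[:i]) + (other,) + tuple(seq[i + 1:]))
    return variants

def match_hotwords(seq, hotword_prons, similarity, threshold=0.8):
    """Return the user's hotwords whose pronunciation matches the phoneme
    sequence at the current decoding position or one of its similar variants."""
    variants = set(similar_phoneme_sequences(seq, similarity, threshold))
    return [w for w, pron in hotword_prons.items() if tuple(pron) in variants]
```

For example, if "zh" and "z" are highly similar in the matrix, a hotword pronounced with "z" can be matched even when the decoder hypothesizes "zh".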
Based on any of the above embodiments, the speech recognition decoding unit 920 is configured to:
performing speech recognition decoding on the speech to be recognized based on a language model, in combination with the preset state transition path;
wherein the language model corresponds to the device type currently used by the user, and the language model corresponding to any device type is obtained by training on the application-scenario text of that device type.
The device provided by the embodiment of the invention performs speech recognition decoding on the speech to be recognized by dynamically selecting the language model corresponding to the device type currently used by the user, in combination with the preset state transition path, thereby further improving the accuracy of speech recognition.
Based on any of the above embodiments, the speech recognition decoding unit 920 is configured to:
determining a voiceprint characteristic of a user;
and based on the preset state transition path, carrying out voice recognition decoding on the voice to be recognized by combining the audio features of the voice to be recognized and the voiceprint features of the user.
The device provided by the embodiment of the invention performs speech recognition decoding on the speech to be recognized by determining the voiceprint features of the user and combining the audio features of the speech to be recognized with those voiceprint features, thereby further improving the accuracy of speech recognition.
Fig. 10 illustrates the physical structure of an electronic device. As shown in fig. 10, the electronic device may include: a processor 1010, a communications interface 1020, a memory 1030, and a communication bus 1040, where the processor 1010, the communications interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. The processor 1010 may invoke logic instructions in the memory 1030 to perform a speech recognition method comprising: determining the speech to be recognized of a user; performing speech recognition decoding on the speech to be recognized based on a preset state transition path to obtain a speech recognition result; the preset state transition path is obtained by expansion based on the regional information and/or the historical input information of the user.
Furthermore, the logic instructions in the memory 1030 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the speech recognition method provided above, the method comprising: determining the speech to be recognized of a user; performing speech recognition decoding on the speech to be recognized based on a preset state transition path to obtain a speech recognition result; the preset state transition path is obtained by expansion based on the regional information and/or the historical input information of the user.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method provided above, the method comprising: determining the speech to be recognized of a user; performing speech recognition decoding on the speech to be recognized based on a preset state transition path to obtain a speech recognition result; the preset state transition path is obtained by expansion based on the regional information and/or the historical input information of the user.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A speech recognition method, comprising:
determining the voice to be recognized of a user;
performing speech recognition decoding on the speech to be recognized based on a preset state transition path to obtain a speech recognition result; the preset state transition path is obtained by expansion based on the regional information and/or the historical input information of the user.
2. The speech recognition method according to claim 1, wherein the performing speech recognition decoding on the speech to be recognized based on a preset state transition path comprises:
determining a phoneme sequence corresponding to the speech to be recognized;
and decoding the phoneme sequence at the current decoding position based on the preset state transition path corresponding to the previous decoding position to obtain a decoding result at the current decoding position.
3. The speech recognition method of claim 1, wherein the preset state transition path is expanded based on the following steps:
determining place names associated with the regional information of the user;
and expanding the preset state transition paths corresponding to the place names in the decoding network based on those place names.
4. The speech recognition method of claim 2, wherein the preset state transition path is expanded based on the following steps:
determining similar hotwords corresponding to the phoneme sequence at the current decoding position based on the historical input information of the user;
and expanding a preset state transition path corresponding to the previous decoding position based on the similar hot words.
5. The speech recognition method of claim 4, wherein determining similar hotwords corresponding to the phoneme sequence at the current decoding position based on the historical input information of the user comprises:
determining a similar phoneme sequence corresponding to the phoneme sequence at the current decoding position based on the phoneme sequence at the current decoding position and a pre-constructed pronunciation similarity matrix;
determining similar hotwords corresponding to the phoneme sequence and/or the similar phoneme sequence at the current decoding position based on the hot words of the user; the hotword is determined based on the historical input information.
6. The speech recognition method according to any one of claims 1 to 5, wherein the performing speech recognition decoding on the speech to be recognized based on the preset state transition path comprises:
based on a language model, combining the preset state transition path to perform voice recognition decoding on the voice to be recognized;
wherein the language model corresponds to a device type currently used by the user; the language model corresponding to any equipment type is obtained by training based on the application scene text of any equipment type.
7. The speech recognition method according to any one of claims 1 to 5, wherein the performing speech recognition decoding on the speech to be recognized based on the preset state transition path comprises:
determining voiceprint characteristics of the user;
and based on the preset state transition path, combining the audio features of the voice to be recognized and the voiceprint features of the user to perform voice recognition decoding on the voice to be recognized.
8. A speech recognition apparatus, comprising:
the voice data determining unit is used for determining the voice to be recognized of the user;
the speech recognition decoding unit is configured to perform speech recognition decoding on the speech to be recognized based on a preset state transition path to obtain a speech recognition result; the preset state transition path is obtained by expansion based on the regional information and/or the historical input information of the user.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the speech recognition method according to any of claims 1 to 7 are implemented when the processor executes the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110474762.5A CN113113024B (en) | 2021-04-29 | 2021-04-29 | Speech recognition method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113113024A true CN113113024A (en) | 2021-07-13 |
CN113113024B CN113113024B (en) | 2024-08-23 |
Family
ID=76720452
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110474762.5A Active CN113113024B (en) | 2021-04-29 | 2021-04-29 | Speech recognition method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113113024B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113838456A (en) * | 2021-09-28 | 2021-12-24 | 科大讯飞股份有限公司 | Phoneme extraction method, voice recognition method, device, equipment and storage medium |
CN114220444A (en) * | 2021-10-27 | 2022-03-22 | 安徽讯飞寰语科技有限公司 | Voice decoding method, device, electronic equipment and storage medium |
CN114242046A (en) * | 2021-12-01 | 2022-03-25 | 广州小鹏汽车科技有限公司 | Voice interaction method and device, server and storage medium |
Patent Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007140048A (en) * | 2005-11-17 | 2007-06-07 | Oki Electric Ind Co Ltd | Voice recognition system |
CN102016502A (en) * | 2008-03-07 | 2011-04-13 | 谷歌公司 | Voice recognition grammar selection based on context |
CN107331389A (en) * | 2008-03-07 | 2017-11-07 | 谷歌公司 | Speech recognition grammar system of selection and system based on context |
CN103065630A (en) * | 2012-12-28 | 2013-04-24 | 安徽科大讯飞信息科技股份有限公司 | User personalized information voice recognition method and user personalized information voice recognition system |
CN103903619A (en) * | 2012-12-28 | 2014-07-02 | 安徽科大讯飞信息科技股份有限公司 | Method and system for improving accuracy of speech recognition |
US20150058018A1 (en) * | 2013-08-23 | 2015-02-26 | Nuance Communications, Inc. | Multiple pass automatic speech recognition methods and apparatus |
CN106469554A (en) * | 2015-08-21 | 2017-03-01 | 科大讯飞股份有限公司 | A kind of adaptive recognition methodss and system |
KR20170134115A (en) * | 2016-05-27 | 2017-12-06 | 주식회사 케이티 | Voice recognition apparatus using WFST optimization and method thereof |
US20190152065A1 (en) * | 2017-11-22 | 2019-05-23 | Shenzhen Xiluo Robot Co., Ltd. | Intelligent device system and intelligent device control method |
CN110634472A (en) * | 2018-06-21 | 2019-12-31 | 中兴通讯股份有限公司 | Voice recognition method, server and computer readable storage medium |
US10388272B1 (en) * | 2018-12-04 | 2019-08-20 | Sorenson Ip Holdings, Llc | Training speech recognition systems using word sequences |
CN111354347A (en) * | 2018-12-21 | 2020-06-30 | 中国科学院声学研究所 | Voice recognition method and system based on self-adaptive hot word weight |
CN111508497A (en) * | 2019-01-30 | 2020-08-07 | 北京猎户星空科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
KR20200117826A (en) * | 2019-04-05 | 2020-10-14 | 삼성전자주식회사 | Method and apparatus for speech recognition |
CN112071310A (en) * | 2019-06-11 | 2020-12-11 | 北京地平线机器人技术研发有限公司 | Speech recognition method and apparatus, electronic device, and storage medium |
CN110610700A (en) * | 2019-10-16 | 2019-12-24 | 科大讯飞股份有限公司 | Decoding network construction method, voice recognition method, device, equipment and storage medium |
CN112102815A (en) * | 2020-11-13 | 2020-12-18 | 深圳追一科技有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
CHAO Hao: "A decoding algorithm for the stochastic segment model incorporating phoneme-string edit distance", Computer Engineering and Applications, no. 06, 15 March 2015 (2015-03-15) *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113838456A (en) * | 2021-09-28 | 2021-12-24 | 科大讯飞股份有限公司 | Phoneme extraction method, voice recognition method, device, equipment and storage medium |
WO2023050541A1 (en) * | 2021-09-28 | 2023-04-06 | 科大讯飞股份有限公司 | Phoneme extraction method, speech recognition method and apparatus, device and storage medium |
CN113838456B (en) * | 2021-09-28 | 2024-05-31 | 中国科学技术大学 | Phoneme extraction method, voice recognition method, device, equipment and storage medium |
CN114220444A (en) * | 2021-10-27 | 2022-03-22 | 安徽讯飞寰语科技有限公司 | Voice decoding method, device, electronic equipment and storage medium |
CN114220444B (en) * | 2021-10-27 | 2022-09-06 | 安徽讯飞寰语科技有限公司 | Voice decoding method, device, electronic equipment and storage medium |
CN114242046A (en) * | 2021-12-01 | 2022-03-25 | 广州小鹏汽车科技有限公司 | Voice interaction method and device, server and storage medium |
CN114242046B (en) * | 2021-12-01 | 2022-08-16 | 广州小鹏汽车科技有限公司 | Voice interaction method and device, server and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113113024B (en) | 2024-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11664020B2 (en) | Speech recognition method and apparatus | |
CN110111775B (en) | Streaming voice recognition method, device, equipment and storage medium | |
CN110473531B (en) | Voice recognition method, device, electronic equipment, system and storage medium | |
JP6802005B2 (en) | Speech recognition device, speech recognition method and speech recognition system | |
JP6550068B2 (en) | Pronunciation prediction in speech recognition | |
CN113113024B (en) | Speech recognition method, device, electronic equipment and storage medium | |
CN108899013B (en) | Voice search method and device and voice recognition system | |
CN103065630B (en) | User personalized information voice recognition method and user personalized information voice recognition system | |
US11093110B1 (en) | Messaging feedback mechanism | |
JP7051919B2 (en) | Speech recognition and decoding methods based on streaming attention models, devices, equipment and computer readable storage media | |
CN110634469B (en) | Speech signal processing method and device based on artificial intelligence and storage medium | |
US10872601B1 (en) | Natural language processing | |
CN106710585B (en) | Polyphone broadcasting method and system during interactive voice | |
JP2023545988A (en) | Transformer transducer: One model that combines streaming and non-streaming speech recognition | |
CN105190614A (en) | Search results using intonation nuances | |
CN112397053B (en) | Voice recognition method and device, electronic equipment and readable storage medium | |
CN114141179A (en) | Park guide and scenic spot introduction system based on intelligent voice interaction | |
US20230260502A1 (en) | Voice adaptation using synthetic speech processing | |
CN113314096A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN114283786A (en) | Speech recognition method, device and computer readable storage medium | |
CN112151020A (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN114360514A (en) | Speech recognition method, apparatus, device, medium, and product | |
TWI731921B (en) | Speech recognition method and device | |
US11328713B1 (en) | On-device contextual understanding | |
KR102300303B1 (en) | Voice recognition considering utterance variation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20230508 Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui Applicant after: University of Science and Technology of China Applicant after: IFLYTEK Co.,Ltd. Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui Applicant before: IFLYTEK Co.,Ltd. |
|
GR01 | Patent grant | ||