CN113643706A - Voice recognition method and device, electronic equipment and storage medium - Google Patents

Voice recognition method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113643706A
CN113643706A CN202110796768.4A CN202110796768A CN113643706A CN 113643706 A CN113643706 A CN 113643706A CN 202110796768 A CN202110796768 A CN 202110796768A CN 113643706 A CN113643706 A CN 113643706A
Authority
CN
China
Prior art keywords
recognition result
loss
voice
recognized
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110796768.4A
Other languages
Chinese (zh)
Other versions
CN113643706B (en
Inventor
李亚桐
张伟彬
陈东鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Voiceai Technologies Co ltd
Original Assignee
Voiceai Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Voiceai Technologies Co ltd filed Critical Voiceai Technologies Co ltd
Priority to CN202110796768.4A priority Critical patent/CN113643706B/en
Publication of CN113643706A publication Critical patent/CN113643706A/en
Application granted granted Critical
Publication of CN113643706B publication Critical patent/CN113643706B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/01Assessment or evaluation of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application discloses a voice recognition method, a voice recognition device, electronic equipment and a storage medium. The method comprises the following steps: acquiring voice data to be recognized; recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result; acquiring a keyword from the first voice recognition result; based on the keyword, adjusting the loss corresponding to the first voice recognition result to obtain a first voice recognition result after the loss is adjusted; and acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after loss adjustment. According to the method, the loss of the first voice recognition result is adjusted according to the keyword, and the second voice recognition result corresponding to the voice data to be recognized is obtained from the lost adjusted first voice recognition result, so that the accuracy of the voice data to be recognized is improved.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present application belongs to the field of speech recognition, and in particular, relates to a speech recognition method, apparatus, electronic device, and storage medium.
Background
Speech recognition is a technique that uses a machine to simulate the recognition and understanding of a human being, converting the human speech signal into corresponding text or commands. The fundamental purpose of speech recognition is to develop a machine with an auditory function that can directly receive human speech and understand human intention. With the development of artificial intelligence technology, the speech recognition technology has made great progress and started to enter various fields such as household appliances, communication, automobiles, medical treatment and the like, but the accuracy of speech recognition by related speech recognition methods still needs to be improved.
Disclosure of Invention
In view of the above problems, the present application provides a speech recognition method, apparatus, electronic device and storage medium to improve the above problems.
In a first aspect, an embodiment of the present application provides a speech recognition method, where the method includes: acquiring voice data to be recognized; recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result; acquiring a keyword from the first voice recognition result; based on the keyword, adjusting the loss corresponding to the first voice recognition result to obtain a first voice recognition result after the loss is adjusted; and acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after loss adjustment.
In a second aspect, an embodiment of the present application provides a speech recognition apparatus, including: the data acquisition unit is used for acquiring voice data to be recognized; a first result obtaining unit, configured to identify the voice data to be identified, and obtain a first voice identification result corresponding to the voice data to be identified and a loss corresponding to the first voice identification result; a keyword acquisition unit configured to acquire a keyword from the first speech recognition result; a loss adjusting unit, configured to adjust a loss corresponding to the first speech recognition result based on the keyword, so as to obtain a first speech recognition result after the loss is adjusted; and the second result acquisition unit is used for acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss is adjusted.
In a third aspect, an embodiment of the present application provides an electronic device, including one or more processors and a memory; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods described above.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a program code is stored, wherein the program code performs the above-mentioned method when running.
The embodiment of the application provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium. The method comprises the steps of firstly obtaining voice data to be recognized, recognizing the voice data to be recognized, obtaining a first voice recognition result corresponding to the voice data to be recognized and loss corresponding to the first voice recognition result, then obtaining a keyword from the first voice recognition result, adjusting the loss of the first voice recognition result based on the keyword to obtain the first voice recognition result after loss adjustment, and finally obtaining a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after loss adjustment. By the method, the keywords are automatically obtained from the first voice recognition result, the loss of the first voice recognition result is adjusted according to the keywords, the second voice recognition result corresponding to the voice data to be recognized is obtained from the lost adjusted first voice recognition result, and the accuracy of the voice data to be recognized is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flow chart illustrating a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a speech recognition method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a word graph according to another embodiment of the present application;
FIG. 4 is a flow chart illustrating a speech recognition method according to yet another embodiment of the present application;
FIG. 5 is a flow chart illustrating a speech recognition method according to yet another embodiment of the present application;
fig. 6 is a block diagram illustrating a structure of a speech recognition apparatus according to an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device for performing a speech recognition method according to an embodiment of the present application in real time;
fig. 8 illustrates a storage unit for storing or carrying program code implementing a speech recognition method according to an embodiment of the present application in real time.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Speech recognition is a key technology for human-computer interaction by recognizing a user's voice command with a machine, which can significantly improve the way human-computer interaction so that the user can complete more tasks while speaking the command. Speech recognition is achieved by a speech recognition engine (or system) trained online or offline. The speech recognition process can be generally divided into a training phase and a recognition phase. In the training phase, an Acoustic Model (AM) and a vocabulary (lexicon) are statistically derived from the training data based on the mathematical Model on which the speech recognition engine (or system) is based. In the recognition stage, the speech recognition engine (or system) processes the input speech using the acoustic model and the vocabulary to obtain a speech recognition result. For example, feature extraction is performed from a sound wave diagram of an input sound to obtain a feature vector, then a phoneme (such as [ i ], [ o ], and the like) sequence is obtained according to an acoustic model, and finally a word or even a sentence with a high matching degree with the phoneme sequence is located from a vocabulary.
In the research on the related speech recognition method, the inventor finds that the readability of the overall speech recognition result is greatly influenced by the recognition accuracy of the keywords which often appear in the input speech in the speech recognition result output by the speech recognition engine (or system).
Therefore, the inventor proposes a speech recognition method in the present application, in which speech data to be recognized is first obtained, speech data to be recognized is recognized, a first speech recognition result corresponding to the speech data to be recognized and a loss corresponding to the first speech recognition result are obtained, then a keyword is obtained from the first speech recognition result, the loss of the first speech recognition result is adjusted based on the keyword to obtain a first speech recognition result after the loss is adjusted, finally a second speech recognition result corresponding to the speech data to be recognized is obtained from the first speech recognition result after the loss is adjusted, the keyword is automatically obtained from the first speech recognition result, further the loss of the first speech recognition result is adjusted according to the keyword, then a second speech recognition result corresponding to the speech data to be recognized is obtained from the first speech recognition result after the loss is adjusted, and the accuracy of speech data to be recognized is improved, An apparatus, an electronic device, and a storage medium.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, a speech recognition method provided in an embodiment of the present application is applied to an electronic device, and the method includes:
step S110: and acquiring voice data to be recognized.
In the embodiment of the application, the voice data to be recognized may be voice data of different users in different application scenarios. For example, in the interview process, the voice data to be recognized may be the voice data of the interviewer or the voice data of the interview object; for another example, in a scene of watching a video, the voice data to be recognized may be voice data of a user watching the video, or may also be voice data of a user playing the video, or may also be voice data of a person in the video.
As one mode, the voice data to be recognized may be voice data sent by a voice collecting device, where the voice collecting device is an intelligent device that establishes a wireless communication connection with an electronic device. Specifically, after the voice data is collected by the voice collecting device, the collected voice data can be sent to the electronic device, so that the electronic device can perform voice recognition on the voice data.
Specifically, when voice data is collected through the voice collection device, the voice data of different users in different scenes can be collected through the voice collection device. Optionally, when the voice data of different users in different scenes are collected by the voice collecting device, if the voice data of multiple users are needed, in order to distinguish the voice data of different users, the voice data of different users can be collected by different voice collecting devices respectively. For example, in order to enable the voice data of the interviewer and the interview object as the voice to be recognized to be subsequently subjected to voice recognition respectively in the interview process, so as to avoid that the accuracy of the voice recognition and the speed of the voice recognition of the interviewer and the interview object are reduced in a complex environment, the voice data of the interviewer and the interview object can be respectively collected through a first microphone and a second microphone arranged on the voice collecting device. Specifically, in the interview process, the interviewer and the interview object are not the same person usually, and in order to enable the first microphone arranged on the voice acquisition device to acquire the voice data of the interviewer and the second microphone arranged on the voice acquisition device to acquire the voice data of the interview object, the first microphone and the second microphone arranged on the voice acquisition device can point to different directions respectively. Further, a first microphone on the voice capturing device may be set to point at the interviewer and a second microphone on the voice capturing device may be set to point at the interview subject. Furthermore, the voice data of different users can be collected through the first microphone and the second microphone of the voice collection device, and then the voice collection device can send the collected voice data of the interviewer and the collected voice data of the interview object to the electronic device to serve as the voice data to be recognized.
When the voice acquisition equipment sends the acquired voice data of the interviewer and the voice data of the interview object to the electronic equipment, the electronic equipment can receive the voice data of the interviewer acquired by a first microphone of the voice acquisition equipment through a first channel of a multi-channel signal receiver, and receive the voice data of the interview object acquired by a second microphone of the voice acquisition equipment through a second channel of the multi-channel signal receiver, so that the electronic equipment can respectively identify the voice data of the interviewer and the voice data of the interview object.
As another mode, the voice data to be recognized may also be voice data acquired by the electronic device from a cloud server after receiving the voice recognition instruction, which is not limited specifically herein.
It should be noted that, the electronic device is a speech recognition terminal, and in this embodiment of the application, the speech recognition terminal may be: the terminal such as a mobile phone, a personal computer, a tablet computer, etc. may also be a server, specifically, and the application does not limit what kind of device the speech recognition terminal is, as long as the speech recognition terminal can perform speech recognition on the speech to be recognized received by each channel in the multi-channel signal receiver.
Step S120: and recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
In the embodiment of the present application, the loss corresponding to the first speech recognition result characterizes a difference between the first speech recognition result and a preset speech recognition result, and the loss may include an acoustic loss and a language loss. The preset voice recognition result may be text content corresponding to actually input voice data. In the embodiment of the application, the acoustic loss is the loss corresponding to the voice data to be recognized output by the acoustic model trained in advance, the language loss is the loss corresponding to the voice data to be recognized output by the language model trained in advance, the possibility of connection among the words is represented, and the smaller the language loss is, the greater the possibility of connection among the words is represented; the greater the loss of language, the less likely the connection between the characterizing words will be. For example, after the voice data to be recognized is recognized, the output first voice recognition result includes "hello, tomorrow" and "hello, computer" and other voice recognition results, wherein the language loss corresponding to "hello, tomorrow" is 8.5; the language loss corresponding to "hello, computer" is 19.2. According to the corresponding language loss, the language loss corresponding to the language loss of the 'hello and tomorrow' is smaller than the language loss corresponding to the language loss of the 'hello and tomorrow', so that the possibility of connection between the 'hello' and the 'tomorrow' is greater than the possibility of connection between the 'hello' and the 'computer'.
After the voice data to be recognized is obtained, voice recognition is carried out on the voice data to be recognized, and then a voice recognition result corresponding to the voice data to be recognized and loss corresponding to the voice recognition result can be obtained.
The first speech recognition result may include n optimal speech recognition results and a word graph. The word graph is a weighted finite state machine, and the word graph refers to a graph which can be formed by all words in a sentence. There is a path E (a, B) between a and B if the next word of a word a may be B. There may be multiple successors for a word, and multiple predecessors for a word, and the graph formed by them is called a word graph.
In the embodiment of the application, the word graph is a graph formed by all output words and the sequence of the output words in the voice data to be recognized.
Step S130: and acquiring a keyword from the first voice recognition result.
In the embodiment of the present application, the keyword may be obtained from the first speech recognition result according to a preset rule or the specified vocabulary may be obtained from the first speech recognition result as the keyword according to a vocabulary specified by a user in advance. The preset rules are preset rules which can determine keywords, such as a relative word frequency method, an absolute word frequency method and the like; the keyword is a certain output word that frequently appears in the first speech recognition result.
In one embodiment, the number of keywords obtained from the first speech recognition result may be multiple, and when the number of keywords obtained is multiple, a keyword list may be established to store the keywords obtained from the first speech recognition result. Optionally, when the keywords are stored, the importance corresponding to each keyword may also be stored, and further, when the loss corresponding to the first speech recognition result is adjusted, the loss may be adjusted according to the importance of the keyword. The importance of the keywords can be obtained through a speech recognition model, and can also be obtained through calculation according to a specified calculation rule. Specifically, the importance of the keyword may be determined according to the frequency of the keyword appearing in the first speech recognition result, or the importance corresponding to the keyword may be directly output by a speech recognition model, or may be preset by a user.
Step S140: and adjusting the loss corresponding to the first voice recognition result based on the keyword to obtain the first voice recognition result after the loss is adjusted.
When the keywords are determined by the method and stored in the keyword list, the keyword list can be searched, n optimal voice recognition results included in the first voice recognition result and whether the word graph includes the keywords stored in the related keyword list or not are determined, and if the n optimal voice recognition results included in the first voice recognition result or the word graph includes the keywords stored in the related keyword list is determined, the loss of the first voice recognition result is correspondingly adjusted.
Furthermore, when the keywords are stored in the keyword list, the importance corresponding to each keyword is also stored, and further, when the loss of the first voice recognition result is adjusted, the loss corresponding to the corresponding first voice recognition result can be adjusted according to the importance of the searched keywords. Specifically, if the importance of the searched keyword is higher, the loss corresponding to the first speech recognition result is adjusted to be smaller, that is, if the importance of the keyword is higher, the adjustment range of the loss corresponding to the first speech recognition result is larger.
Step S150: and acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after loss adjustment.
By the above method, the loss corresponding to the first speech recognition result is adjusted to obtain the first speech recognition result after the loss is adjusted, and further, the second speech recognition result with the loss smaller than the loss corresponding to the first speech recognition result can be obtained from the first speech recognition result after the loss is adjusted. Specifically, for the plurality of speech recognition results included in the first speech recognition result after the loss is adjusted, the speech recognition result with the smallest loss in the plurality of speech recognition results may be used as the second speech recognition result, or the speech recognition results with losses arranged at the preset position may be used as the second speech recognition result after the losses corresponding to the plurality of speech recognition results are arranged from small to large. The smaller the loss corresponding to the speech recognition result in the first speech recognition result, the greater the possibility of using it as the second speech recognition result.
The voice recognition method comprises the steps of firstly obtaining voice data to be recognized, recognizing the voice data to be recognized, obtaining a first voice recognition result corresponding to the voice data to be recognized and loss corresponding to the first voice recognition result, then obtaining a keyword from the first voice recognition result, adjusting the loss of the first voice recognition result based on the keyword to obtain a first voice recognition result after loss adjustment, and finally obtaining a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after loss adjustment. By the method, the keywords are automatically obtained from the first voice recognition result, the loss of the first voice recognition result is adjusted according to the keywords, the second voice recognition result corresponding to the voice data to be recognized is obtained from the lost adjusted first voice recognition result, and the accuracy of the voice data to be recognized is improved.
Referring to fig. 2, a speech recognition method provided in the embodiment of the present application is applied to an electronic device, and the method includes:
step S210: and acquiring voice data to be recognized.
The step S210 may specifically refer to the detailed explanation in the above embodiments, and therefore, will not be described in detail in this embodiment.
Step S220: and recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
In an embodiment of the present application, the first speech recognition result includes a word graph, where the word graph includes m output words.
And if the first voice recognition result is a word graph, the loss corresponding to the first voice result is the loss corresponding to each output word in the word graph. The word graph and the concrete representation form of the loss corresponding to the word graph can be as shown in fig. 3, the output word and the loss can be arranged on the edge connecting the adjacent nodes, wherein "hi" is the output word, and "80.76" is the loss corresponding to "hi", wherein the loss can include language loss and acoustic loss. 80.76 is the sum of the language loss and the acoustic loss corresponding to "hi", where the language loss is the loss corresponding to "hi" output by the pre-trained language model, and the acoustic loss is the loss corresponding to "hi" output by the pre-trained acoustic model.
Step S230: and acquiring the occurrence frequency of each output word in the m output words.
In the embodiment of the application, the number of the output words included in the word graph is counted first, and then the occurrence number of each output word is counted. If an output word appears once in the word graph, the number of occurrences of the output word is increased by 1.
Step S240: determining a keyword from the m output words based on the number of occurrences of each output word.
In this embodiment of the present application, the step of determining a keyword from the m output words based on the occurrence number of each output word includes: and taking the output words with the occurrence times larger than the first preset times in the m output words as the keywords.
Or, acquiring the total occurrence times of the m output words; and taking the output words with the occurrence probability larger than or equal to a first preset probability in the m output words as the keywords, wherein the occurrence probability is the ratio of the occurrence frequency of each output word to the total occurrence frequency.
In the embodiment of the application, the first preset times are preset times of occurrence of an output word when the output word is determined as a keyword; the first preset probability is the occurrence probability of the output word when the preset output word is determined as the keyword.
As one way, the keywords may be determined by an absolute word frequency method. Specifically, as can be seen from the above contents, the occurrence frequency of each output word in the m output words is respectively obtained, then it is determined whether the occurrence frequency of each output word is greater than a first preset frequency, and if it is determined that the occurrence frequency of an output word in the m output words is greater than the first preset frequency, the output word whose occurrence frequency is greater than the first preset frequency is determined as a keyword, and then the determined keyword may be stored in the keyword list.
Alternatively, the keywords may be determined by a relative word frequency method. Specifically, according to the above contents, the occurrence frequency of each output word in m output words is respectively obtained, then the total occurrence frequency of the m output words is obtained by adding the occurrence frequencies of each output word, and then the occurrence probability of each output word can be determined according to the ratio of the occurrence frequency of each output word to the total occurrence frequency of the m output words, and then the occurrence probability of each output word can be compared with a first preset probability, and the output words with the occurrence probability greater than or equal to the first preset probability are determined as the keywords, and then the determined keywords can be stored in the keyword list.
Similarly, when the determined keywords are stored, the importance of each keyword may be stored, and further, the loss of the word graph may be adjusted according to the importance of the keywords.
Step S250: and if the word graph comprises the keyword, adjusting the loss of forward and backward jumping of the keyword contained in the word graph to a first loss value to obtain the word graph with the loss adjusted.
In the embodiment of the present application, the first loss value is a preset adjusted loss value, and the loss value may be a specific loss value. The loss of the forward and backward skip of the keyword can be understood as the loss corresponding to two output words adjacent to the front and back of the keyword.
When determining that the word graph includes the relevant keyword through the method, the loss corresponding to the two output words adjacent to the front and back of the keyword is reduced, and the loss corresponding to the two output words adjacent to the front and back of the keyword is adjusted to be the first loss value. The higher the importance of the keyword is, the smaller the loss value corresponding to the two output words adjacent to the front and rear of the keyword is adjusted to be. Further, since the losses corresponding to the two output words adjacent to each other before and after the keyword may be different, when the losses corresponding to the two output words adjacent to each other before and after the keyword are adjusted to the first loss value, it can be understood that the losses corresponding to the two output words adjacent to each other before and after the keyword may be adjusted by the same adjustment width. For example, the loss corresponding to two output words adjacent to the front and back of the keyword is reduced by 20%.
Optionally, in the embodiment of the present application, the adjusted language loss of the output word.
Step S260: and acquiring a second voice recognition result corresponding to the voice data to be recognized from the lost word graph.
After the loss of the forward and backward jumping of the keyword included in the word graph is adjusted to the first loss value by the method, the loss of each path in the word graph can be recalculated, and then one or more paths with the minimum loss can be used as the final voice recognition result, namely the second voice recognition result. In the embodiment of the present application, when the loss of each path is calculated, the loss corresponding to each output word in each path is added to obtain the loss corresponding to the path. Illustratively, as shown in fig. 3, when the loss corresponding to the path "0-1-4-10-15-27" is calculated, the loss 80.76 corresponding to "hi", the loss 16.8 corresponding to "this", the loss 22.36 corresponding to "is", the loss 63.09 corresponding to "my", and the loss 34.56 corresponding to "number" are added to obtain the loss corresponding to the path "0-1-4-10-15-27", and the loss corresponding to the path "0-1-4-10-15-27" is also 195.21, that is, "80.76 +16.8+63.09+34.56 ═ 195.21".
After the loss of each path in the vocabulary is recalculated, if the loss of a plurality of paths is the same and the loss is the minimum value, any one of the paths can be used as the final voice recognition result.
Optionally, after the loss of each path in the word graph is calculated, the losses corresponding to each path may be sorted from small to large, and then one or more paths with corresponding losses arranged in front may be selected as the final speech recognition result.
The method comprises the steps of firstly obtaining voice data to be recognized, recognizing the voice data to be recognized, obtaining a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result, then obtaining the occurrence frequency of each output word in m output words, determining a keyword from the m output words based on the occurrence frequency of each word, adjusting the loss of forward and backward jumping of the keyword included in a word graph to a first loss value to obtain a word graph after loss adjustment, and finally obtaining a second voice recognition result corresponding to the voice data to be recognized from the word graph after loss adjustment. By the method, the keywords are automatically acquired in the word graph, the loss of the word graph is adjusted according to the keywords, and the second voice recognition result corresponding to the voice data to be recognized is acquired from the lost adjusted word graph, so that the accuracy of recognizing the voice data to be recognized is improved.
Referring to fig. 4, a speech recognition method provided in the embodiment of the present application is applied to an electronic device, and the method includes:
step S310: and acquiring voice data to be recognized.
The step S310 may refer to the detailed explanation in the above embodiments, and therefore, will not be described in detail in this embodiment.
Step S320: and recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
In an embodiment of the present application, the first recognition result includes n speech recognition results, and the n speech recognition results include p output words. The p output words are the total number of all the different output words in the n speech recognition results.
Specifically, when the first speech recognition result is n speech recognition results, the n speech recognition results may include more than or equal to 1 optimal speech recognition result, where the optimal speech recognition result is n speech recognition results that are ranked first after the output loss is ranked from small to large, or the optimal speech recognition result may also be n speech recognition results that are ranked next after the output loss is ranked from large to small.
By one approach, the loss corresponding to the first speech recognition result includes a loss corresponding to each of the n speech output results. That is, one speech recognition result corresponds to one loss.
Step S330: and acquiring the occurrence frequency of each output word in the p output words.
In the embodiment of the application, the occurrence frequency of the output word included in each of the n speech recognition results is counted in sequence, and then the occurrence frequency of the output word included in each of the n speech recognition results is added to obtain the occurrence frequency of each of the p output words.
Step S340: determining keywords from the p output words based on the number of occurrences of each output word.
In this embodiment of the present application, the step of determining a keyword from the p output words based on the occurrence number of each output word includes: and taking the output words with the occurrence frequency larger than a second preset frequency in the p output words as the keywords.
Or acquiring the total occurrence times of the p output words; and taking the output words with the occurrence probability larger than or equal to a second preset probability in the p output words as the keywords, wherein the occurrence probability is the ratio of the occurrence frequency of each output word to the total occurrence frequency.
In this embodiment of the application, the second preset number is a preset number of occurrences of the output word when the output word is determined as the keyword, and the second preset number may be the same as or different from the first preset number. The second preset probability is the occurrence probability of the preset output word when the preset output word is determined as the keyword, and the second preset probability can be the same as or different from the first preset probability and can be set according to actual requirements.
As one way, the keywords may be determined by an absolute word frequency method. Specifically, as can be seen from the above contents, the occurrence frequency of each output word in the p output words is respectively obtained, then it is determined whether the occurrence frequency of each output word is greater than the second preset frequency, and if it is determined that the occurrence frequency of an output word in the p output words is greater than the second preset frequency, the output word whose occurrence frequency is greater than the second preset frequency is determined as a keyword, and then the determined keyword may be stored in the keyword list. In the embodiment of the present application, if an output word appears once in n speech recognition results, the number of occurrences of the output word is increased by 1. Illustratively, if the n speech recognition results include 3 speech recognition results, the 3 speech recognition results include 5 output words in total, after statistics, the number of occurrences of the output word 1 is 5, the number of occurrences of the output word 2 is 3, the number of occurrences of the output word 3 is 7, the number of occurrences of the output word 4 is 3, the number of occurrences of the output word 5 is 6, and the second preset number is 5, then comparing the number of occurrences of the output word 1, the output word 2, the output word 3, the output word 4, and the output word 5 with the second preset number 5 in sequence, and further determining that the output word 3 and the output word 5 are keywords.
Alternatively, the keywords may be determined by a relative word frequency method. Specifically, according to the above content, the occurrence frequency of each output word in the p output words is respectively obtained, then the total occurrence frequency of the p output words is obtained by adding the occurrence frequencies of each output word, and then the occurrence probability of each output word can be determined according to the ratio of the occurrence frequency of each output word to the total occurrence frequency of the p output words, and then the occurrence probability of each output word can be compared with the second preset probability, and the output words with the occurrence probability greater than or equal to the second preset probability are determined as the keywords, and then the determined keywords can be stored in the keyword list.
After the keywords are determined by the method and stored in the keyword list, the importance of each keyword can be stored in the keyword list.
Step S350: and adjusting the loss of the voice recognition results including the keywords in the n voice recognition results to a second loss value so as to obtain n voice recognition results after loss adjustment.
In this embodiment, the second loss value is a preset adjusted loss value, and the second loss value may be a specific loss value or a loss interval. After the keywords are determined in the above manner, whether each of the n speech recognition results includes the relevant keyword or not can be determined by searching the keyword list, if it is determined that one of the n speech recognition results includes the relevant keyword, the loss of the speech recognition result is reduced, the probability that the speech recognition result is used as the optimal recognition result is improved, the loss value of the speech recognition result is adjusted to be the second loss value, and the second loss value is smaller than the original loss value of the speech recognition result.
As one mode, when the loss corresponding to the speech recognition result including the keyword in the n speech recognition results is adjusted, the loss value of the speech recognition result may be adjusted according to the importance of the keyword included in the speech recognition result including the keyword, and the higher the importance of the keyword is, the smaller the loss value corresponding to the speech recognition result is adjusted. Illustratively, the keyword 1 is included in one speech recognition result, and the loss of the speech recognition result is 80.4. If the importance of the keyword 1 is 50%, the loss corresponding to the speech recognition result can be adjusted to 60.4; if the importance of 1 in the keyword is 75%, the loss corresponding to the speech recognition result can be adjusted to 30.4.
As another mode, when the loss corresponding to the speech recognition result including the keyword in the n speech recognition results is adjusted, the loss corresponding to the speech recognition result may be adjusted according to the number of the keyword included in the speech recognition result, and the loss value corresponding to the speech recognition result is adjusted to be smaller as the number of the keyword included in the speech recognition result is larger.
Further, the number of keywords and the importance of the keywords may be combined to adjust the loss of the speech recognition result. Specifically, the number of the keywords, the importance of the keywords, and the correspondence between the second loss values may be preset, and the amplitude of adjusting the loss corresponding to the voice recognition result may be determined by the correspondence.
Step S360: and acquiring a second voice recognition result corresponding to the voice data to be recognized from the n voice recognition results after the loss is adjusted.
After the loss corresponding to the speech recognition result including the keyword in the n speech recognition results is adjusted, the arrangement sequence of the n speech recognition results can be adjusted according to the loss corresponding to each speech recognition result in the n speech recognition results, so as to obtain n speech recognition results after the loss is adjusted, and the speech recognition result with the loss value smaller than the preset loss value in the n speech recognition results after the loss is adjusted is used as a final output result, namely a second speech recognition result. The preset loss value is a preset loss value which can be determined as a final voice recognition result of the voice data to be recognized.
The voice recognition method comprises the steps of firstly obtaining voice data to be recognized, recognizing the voice data to be recognized, obtaining a first voice recognition result corresponding to the voice data to be recognized and loss corresponding to the first voice recognition result, then obtaining the occurrence frequency of each output word in p output words, determining a keyword from the p output words based on the occurrence frequency of each word, then adjusting the loss of the voice recognition results including the keyword in n voice recognition results to a second loss value to obtain n voice recognition results after loss adjustment, and finally obtaining a second voice recognition result corresponding to the voice data to be recognized from the n voice recognition results after loss adjustment. By the method, the keywords are automatically acquired from the n voice recognition results, the loss of the n voice recognition results is adjusted according to the keywords, the second voice recognition result corresponding to the voice data to be recognized is acquired from the n voice recognition results after the loss is adjusted, and the accuracy of the voice data to be recognized is improved.
Referring to fig. 5, a speech recognition method provided in the embodiment of the present application is applied to an electronic device, and the method includes:
step S410: and acquiring voice data to be recognized.
In this embodiment of the application, the voice data to be recognized may be the voice data which needs to be recognized and is acquired in real time, or may also be the voice data which needs to be recognized and is acquired from an external device in advance. The external device may be an electronic device that stores voice data, an electronic device that can generate voice data in real time, or the like.
In the embodiment of the application, the voice data to be recognized may be stored in the storage area of the electronic device in advance, and the voice data to be recognized may be stored according to a certain rule, for example, the voice data to be recognized may be stored in a file named according to a specified rule, and further, when the voice data to be recognized needs to be acquired, the voice data to be recognized may be acquired from the storage area of the electronic device according to the file name.
Of course, the voice data to be recognized may also be voice data transmitted by an external device. Specifically, when the electronic device needs to acquire the audio data to be recognized, a data acquisition instruction may be sent to the external device, and after the external device receives the data acquisition instruction, the external device returns the voice data to be recognized to the electronic device. Optionally, the to-be-recognized voice data returned by the external device may be designated voice data or any one of the voice data, which may be determined whether the data acquisition instruction received by the external device includes an identifier of the voice data (the identifier may be a serial number of the to-be-recognized voice data), and if the data acquisition instruction includes the identifier of the voice data, the external device returns the voice data corresponding to the identifier to the electronic device as the to-be-recognized voice data; and if the data acquisition instruction does not comprise the identification of the voice data, the external equipment returns any voice data to the electronic equipment as the voice data to be recognized.
When the external device returns the voice data to be recognized to the electronic device, the external device may transmit the voice data whose generation time is the foremost to the electronic device as the voice data to be recognized according to the time sequence of generating the voice data. In this way, it is possible to avoid the problem that the voice data to be recognized which is generated at the earliest time is not recognized due to too much stored voice data in the external device.
Optionally, the voice data to be recognized is voice data in wave format.
Step S420: inputting the voice data to be recognized into a voice recognition model, acquiring a voice recognition result output by the voice recognition model, and taking the voice recognition result as a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
In the embodiment of the present application, the speech recognition model may be composed of an acoustic model and a language model, which respectively correspond to the calculation of the speech-to-syllable probability and the calculation of the syllable-to-word probability.
Optionally, the speech recognition model may also be composed of an acoustic model, a dictionary, and a language model. Moreover, the voice recognition model can also be an end-to-end model, and voice data can be converted into text data through the end-to-end model, so that the simplified sequence conversion operation is realized, and the training process is simplified. The sequence may include text, voice, image, or video sequence data, among others.
After the voice data to be recognized is obtained, the voice data to be recognized is input into the voice recognition model, and then the voice recognition result corresponding to the voice data to be recognized and the loss corresponding to the voice recognition result can be output through the voice recognition model. In the embodiment of the present application, the loss may be calculated by a loss function in the speech recognition model.
Step S430: and acquiring a keyword from the first voice recognition result.
Step S440: and adjusting the loss corresponding to the first voice recognition result based on the keyword to obtain the first voice recognition result after the loss is adjusted.
Step S450: and acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after loss adjustment.
The steps S430, S440 and S450 can be explained with reference to the detailed explanation of the above embodiments, and therefore are not described in detail in this embodiment.
The voice recognition method comprises the steps of firstly obtaining voice data to be recognized, inputting the voice data to be recognized into a voice recognition model, obtaining a voice recognition result output by the voice recognition model, taking the voice recognition result as a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result, then obtaining a keyword from the first voice recognition result, adjusting the loss corresponding to the first voice recognition result based on the keyword to obtain a first voice recognition result after loss adjustment, and then obtaining a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after loss adjustment. By the method, the keywords are automatically obtained from the first voice recognition result, the loss of the first voice recognition result is adjusted according to the keywords, the second voice recognition result corresponding to the voice data to be recognized is obtained from the lost adjusted first voice recognition result, and the accuracy of the voice data to be recognized is improved.
Referring to fig. 6, a speech recognition apparatus 500 according to an embodiment of the present application includes:
a data obtaining unit 510, configured to obtain voice data to be recognized.
A first result obtaining unit 520, configured to recognize the voice data to be recognized, and obtain a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
In one mode, the first result obtaining unit 520 is configured to input the voice data to be recognized into a voice recognition model, obtain a voice recognition result output by the voice recognition model, and use the voice recognition result as a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
A keyword obtaining unit 530, configured to obtain a keyword from the first speech recognition result.
As one way, the keyword obtaining unit 530 is configured to obtain the occurrence number of each output word in the m output words; determining a keyword from the m output words based on the number of occurrences of each output word.
Specifically, the keyword obtaining unit 530 is configured to use an output word with a frequency greater than a first preset frequency in the m output words as a keyword; or, acquiring the total occurrence times of the m output words; and taking the output words with the occurrence probability larger than or equal to a first preset probability in the m output words as the keywords, wherein the occurrence probability is the ratio of the occurrence frequency of each output word to the total occurrence frequency.
As another way, the keyword obtaining unit 530 is further configured to obtain the occurrence number of each output word in the p output words; determining keywords from the p output words based on the number of occurrences of each output word.
Specifically, the keyword obtaining unit 530 is configured to use an output word with a frequency greater than a second preset frequency in the p output words as a keyword; or acquiring the total occurrence times of the p output words; and taking the output words with the occurrence probability larger than or equal to a second preset probability in the p output words as the keywords, wherein the occurrence probability is the ratio of the occurrence frequency of each output word to the total occurrence frequency.
A loss adjusting unit 540, configured to adjust a loss corresponding to the first speech recognition result based on the keyword, so as to obtain a first speech recognition result after the loss is adjusted.
In one manner, the loss adjusting unit 540 is configured to adjust a loss of forward and backward jumping of a keyword included in the word graph to a first loss value, so as to obtain a loss-adjusted word graph.
As another mode, the loss adjusting unit 540 is configured to adjust the loss of the speech recognition result including the keyword in the n speech recognition results to a second loss value, so as to obtain n speech recognition results after the loss is adjusted.
A second result obtaining unit 550, configured to obtain a second speech recognition result corresponding to the speech data to be recognized from the first speech recognition result after loss adjustment.
As one mode, the second result obtaining unit 550 is configured to obtain a second speech recognition result corresponding to the speech data to be recognized from the lost word graph.
As another mode, the second result obtaining unit 550 is configured to obtain a second speech recognition result corresponding to the speech data to be recognized from the n speech recognition results after the loss is adjusted.
It should be noted that the device embodiment and the method embodiment in the present application correspond to each other, and specific principles in the device embodiment may refer to the contents in the method embodiment, which is not described herein again.
An electronic device provided by the present application will be described with reference to fig. 7.
Referring to fig. 7, based on the foregoing speech recognition method and apparatus, an embodiment of the present application further provides an electronic device 800 capable of performing the speech recognition method. The electronic device 800 includes one or more processors 802 (only one is shown), a memory 804, and a network module 806, which are coupled to each other. The memory 804 stores program code that implements the content of the foregoing embodiments, and the processor 802 can execute the program code stored in the memory 804.
The processor 802 may include one or more processing cores. Using various interfaces and lines, the processor 802 connects the components throughout the electronic device 800, and performs the functions of the electronic device 800 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 804 and by invoking data stored in the memory 804. Optionally, the processor 802 may be implemented in hardware in at least one of the forms of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 802 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 802 and may instead be implemented by a separate communication chip.
The memory 804 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 804 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 804 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device 800 during use (such as phone books, audio and video data, and chat log data), and the like.
The network module 806 is configured to receive and transmit electromagnetic waves and to convert between electromagnetic waves and electrical signals, so as to communicate with a communication network or with other devices, such as an audio playing device. The network module 806 may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, and memory. The network module 806 may communicate with various networks such as the Internet, an intranet, or a wireless network, or may communicate with other devices via a wireless network. The wireless network may be a cellular telephone network, a wireless local area network, or a metropolitan area network. For example, the network module 806 can exchange information with a base station.
Referring to fig. 8, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 900 stores program code that can be called by a processor to perform the methods described in the above method embodiments.
The computer-readable storage medium 900 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 900 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 900 has storage space for program code 910 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 910 may, for example, be compressed in a suitable form.
The present application provides a voice recognition method and apparatus, an electronic device, and a storage medium. Voice data to be recognized is first acquired and recognized, and a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result are obtained. Keywords are then acquired from the first voice recognition result, and the loss corresponding to the first voice recognition result is adjusted based on the keywords to obtain a loss-adjusted first voice recognition result. Finally, a second voice recognition result corresponding to the voice data to be recognized is acquired from the loss-adjusted first voice recognition result. In this way, keywords are acquired automatically from the first voice recognition result, the loss of the first voice recognition result is adjusted according to the keywords, and the second voice recognition result is acquired from the loss-adjusted first voice recognition result, which improves the accuracy of recognizing the voice data to be recognized.
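Read end to end, the steps summarized above could be wired together as in the sketch below, reusing the hypothetical helpers from the earlier examples; recognize() is a placeholder for the speech recognition model, whose interface this application does not specify.

```python
# End-to-end sketch of the summarized method, reusing keywords_by_count() and
# select_second_result() from the earlier sketches. recognize() is assumed to
# return n-best (text, loss) hypotheses; whitespace tokenization is assumed
# when flattening the output words, which real systems would replace with the
# recognizer's own word segmentation.
def recognize_with_keywords(audio, recognize):
    nbest = recognize(audio)                                # first result + losses
    words = [w for text, _ in nbest for w in text.split()]  # flatten output words
    keywords = keywords_by_count(words)                     # obtain keywords
    return select_second_result(nbest, keywords)            # adjust losses, select
```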
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (11)

1. A method of speech recognition, the method comprising:
acquiring voice data to be recognized;
recognizing the voice data to be recognized, and acquiring a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result;
acquiring a keyword from the first voice recognition result;
based on the keyword, adjusting the loss corresponding to the first voice recognition result to obtain a first voice recognition result after the loss is adjusted;
and acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after loss adjustment.
2. The method of claim 1, wherein the first speech recognition result comprises a word graph, wherein the word graph comprises m output words, and wherein obtaining the keyword from the first speech recognition result comprises:
acquiring the occurrence frequency of each output word in the m output words;
determining a keyword from the m output words based on the number of occurrences of each output word.
3. The method of claim 2, wherein determining keywords from the m output words based on the number of occurrences of each output word comprises:
taking output words with the occurrence frequency larger than a first preset frequency in the m output words as keywords; or,
acquiring the total occurrence times of the m output words;
and taking the output words with the occurrence probability larger than or equal to a first preset probability in the m output words as the keywords, wherein the occurrence probability is the ratio of the occurrence frequency of each output word to the total occurrence frequency.
4. The method according to claim 2, wherein the adjusting the loss corresponding to the first speech recognition result based on the keyword to obtain a loss-adjusted first speech recognition result comprises:
adjusting the losses of the jumps into and out of the keywords included in the word graph to a first loss value to obtain a loss-adjusted word graph;
the obtaining of the second speech recognition result corresponding to the speech data to be recognized from the first speech recognition result after the loss adjustment includes:
and acquiring a second voice recognition result corresponding to the voice data to be recognized from the loss-adjusted word graph.
5. The method according to claim 1, wherein the first speech recognition result comprises n speech recognition results, the n speech recognition results comprise p output words, and the obtaining the keyword from the first speech recognition result comprises:
acquiring the occurrence frequency of each output word in the p output words;
determining keywords from the p output words based on the number of occurrences of each output word.
6. The method of claim 5, wherein determining keywords from the p output words based on the number of occurrences of each output word comprises:
taking output words with the occurrence frequency larger than a second preset frequency in the p output words as keywords; or,
acquiring the total occurrence times of the p output words;
and taking the output words with the occurrence probability larger than or equal to a second preset probability in the p output words as the keywords, wherein the occurrence probability is the ratio of the occurrence frequency of each output word to the total occurrence frequency.
7. The method according to claim 5, wherein the adjusting the loss corresponding to the first speech recognition result based on the keyword to obtain a loss-adjusted first speech recognition result comprises:
adjusting the loss of the voice recognition results including the keywords in the n voice recognition results to a second loss value to obtain n voice recognition results after loss adjustment;
the obtaining of the second speech recognition result corresponding to the speech data to be recognized from the first speech recognition result after the loss adjustment includes:
and acquiring a second voice recognition result corresponding to the voice data to be recognized from the n voice recognition results after the loss is adjusted.
8. The method according to claim 1, wherein the recognizing the voice data to be recognized, and obtaining a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result comprises:
inputting the voice data to be recognized into a voice recognition model, and acquiring, from the output of the voice recognition model, a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result.
9. A speech recognition apparatus, characterized in that the apparatus comprises:
the data acquisition unit is used for acquiring voice data to be recognized;
a first result obtaining unit, configured to recognize the voice data to be recognized, and obtain a first voice recognition result corresponding to the voice data to be recognized and a loss corresponding to the first voice recognition result;
a keyword acquisition unit configured to acquire a keyword from the first speech recognition result;
a loss adjusting unit, configured to adjust a loss corresponding to the first speech recognition result based on the keyword, so as to obtain a first speech recognition result after the loss is adjusted;
and the second result acquisition unit is used for acquiring a second voice recognition result corresponding to the voice data to be recognized from the first voice recognition result after the loss is adjusted.
10. An electronic device comprising one or more processors and memory; one or more programs stored in the memory and configured to be executed by the one or more processors to perform the method of any of claims 1-8.
11. A computer-readable storage medium, having program code stored therein, wherein the program code when executed by a processor performs the method of any of claims 1-8.
CN202110796768.4A 2021-07-14 2021-07-14 Speech recognition method, device, electronic equipment and storage medium Active CN113643706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110796768.4A CN113643706B (en) 2021-07-14 2021-07-14 Speech recognition method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113643706A true CN113643706A (en) 2021-11-12
CN113643706B CN113643706B (en) 2023-09-26

Family

ID=78417429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110796768.4A Active CN113643706B (en) 2021-07-14 2021-07-14 Speech recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113643706B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106856092A (en) * 2015-12-09 2017-06-16 中国科学院声学研究所 Chinese speech keyword retrieval method based on feedforward neural network language model
US20200175961A1 (en) * 2018-12-04 2020-06-04 Sorenson Ip Holdings, Llc Training of speech recognition systems
CN111862943A (en) * 2019-04-30 2020-10-30 北京地平线机器人技术研发有限公司 Speech recognition method and apparatus, electronic device, and storage medium
CN110808032A (en) * 2019-09-20 2020-02-18 平安科技(深圳)有限公司 Voice recognition method and device, computer equipment and storage medium
CN112687266A (en) * 2020-12-22 2021-04-20 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN112925945A (en) * 2021-04-12 2021-06-08 平安科技(深圳)有限公司 Conference summary generation method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399005A (en) * 2022-03-10 2022-04-26 深圳市声扬科技有限公司 Training method, device, equipment and storage medium of living body detection model
CN114399005B (en) * 2022-03-10 2022-07-12 深圳市声扬科技有限公司 Training method, device, equipment and storage medium of living body detection model

Also Published As

Publication number Publication date
CN113643706B (en) 2023-09-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant